Embedding Edit Distance Learning¶
Implements embedding edit distance learning as described in the paper
Paaßen, B., Gallicchio, C., Micheli, A., and Hammer, B. (2018). Tree Edit Distance Learning via Adaptive Symbol Embeddings. Proceedings of the 35th International Conference on Machine Learning (ICML 2018). URL: http://proceedings.mlr.press/v80/paassen18a.html
-
class
edist.bedl.
BEDL
(K, T=5, phi=None, phi_grad=None, distance=None, distance_backtrace=None)¶ Implements the embedding edit distance learning (BEDL) scheme for metric learning on edit distances.
In more detail, this learns a median generalized learning vector quantization (MGLVQ) classifier on the data and an embedding of the input symbols which yields an edit distance that makes classification with this classifier easier.
-
K
¶ The number of prototypes for the MGLVQ classifier.
Type: int
-
T
¶ The number of learning epochs we use at most. Defaults to 5.
Type: int
-
phi
¶ A squashing function to post-process each error term. Defaults to the identity.
Type: function (default = identity)
-
phi_grad
¶ The gradient function corresponding to phi.
Type: function (default = one)
-
distance
¶ The edit distance function that shall be learned. Defaults to the sequence edit distance sed.sed.
Type: function (default = sed.sed)
-
distance_backtrace
¶ The matrix backtracing function for the distance. Defaults to sed.sed_backtrace_matrix. Note that this currently does NOT support ADP because ADP returns a different backtracing format.
Type: function (default = sed.sed_backtrace_matrix)
-
_classifier
¶ The learned MGLVQ classifier model.
Type: class proto_dist_ml.MGLVQ
-
_idx
¶ A mapping from alphabet to indices.
Type: dictionary
-
_embedding
¶ A len(alphabet) x len(alphabet) - 1 embedding matrix for all symbols in the alphabet.
Type: array_like
-
_delta_obj
¶ An internal object to make storing of the delta function more efficient.
Type: class bedl.EmbeddingDelta
-
_delta
¶ The learned delta function.
Type: function
-
fit
(X, y)¶ Trains a BEDL model on the given input data.
In more detail, we iterate the following steps in each training epoch:
- We re-compute the distance matrix.
- We adapt the MGLVQ classifier to the current distance matrix.
- We compute the matrix backtraces for the data-to-prototype distances.
- We optimize the embedding for the current data-to-prototype distances.
For more details, please refer to the ICML 2018 paper.
Parameters: - X (list) – a list of data points, each being either a list or a tree, depending on the edit distance that shall be learned.
- y (array_like or list) – an array-like or list-like structure with labels for each data point.
Returns: self
Return type: class bedl.BEDL
-
-
class
edist.bedl.
EmbeddingDelta
(embedding)¶ This class serves as a storage for an embedding to make the embedding delta function pickleable.
Parameters: embedding (array_like) – An embedding matrix. -
_Delta
¶ The current symbol-to-symbol distance matrix
Type: array_like
-
delta
(x, y)¶ Computes the distance between two embedding vectors identified by their index.
Parameters: - x (int) – the left-hand-side index or None.
- y (int) – the right-hand-side index or None.
Returns: d – The Euclidean distance between the embedding for x and for y.
Return type: float
-
delta_with_indexing
(x, y)¶ Computes the distance between two symbols based on their embedding vectors.
Parameters: - x (str) – a symbol.
- y (str) – another symbol.
Returns: d – The Euclidean distance between the embedding for x and for y.
Return type: float
-
-
edist.bedl.
create_index
(lst)¶ Creates a map of list elements to indices.
Parameters: lst (list) – A list. Returns: idx – A map of list elements to indices. Return type: dictionary
-
edist.bedl.
index_data
(Xs, idx)¶ Indexes all data in the input dataset according to the given index.
Parameters: - Xs (list) – A list of data, each being either a list or a tree in node list/adjacency list format.
- idx (dictionary) – A map from symbols to indices.
Returns: Ys – A copy of Xs, where each symbol is replaced by a symbol index.
Return type: list
-
edist.bedl.
initialize_embedding
(size)¶ Sets up a size-dimensional simplex with side length 1 and size+1 vertices (i.e. an equilateral hyper-triangle).
Parameters: size (int) – the dimensionality of the simplex. Returns: embedding – A size x size matrix, where each row contains one vertex of the simplex. The origin is the last vertex. Return type: array_like
-
edist.bedl.
reduce_backtrace
(P, x, y, size)¶ Transforms the input backtrace matrix P of size len(x) + 1 x len(y) + 1 into a reduced matrix Phat of size size + 1 x size + 1 which accumulates the probabilities for same symbols being replaced, i.e. Phat[k, l] is the sum over all entries P[i, j] where x[i] = k and y[j] = l.
Parameters: - P (array_like) – A len(x) + 1 x len(y) + 1 backtracing matrix between sequence x and y, where P[i, j] is the probability of x[i] being replaced with y[j] in a co-optimal alignment, P[i, len(y)] is the deletion probability for x[i] and P[len(x), j] is the insertion probability for y[j]. The last row and column are optional (as in case of DTW).
- X (list) – A list of objects, either sequences or trees.
- Y (list) – A list of objects, either sequences or trees.
- size (int) – The alphabet size.
Returns: Phat – A size+1 x size+1 matrix which accumulates the probabilities for same symbols being replaced, i.e. Phat[k, l] is the sum over all entries P[i, j] where x[i] = k and y[j] = l.
Return type: array_like