Embedding Edit Distance Learning

Implements embedding edit distance learning as described in the paper

Paaßen, B., Gallicchio, C., Micheli, A., and Hammer, B. (2018). Tree Edit Distance Learning via Adaptive Symbol Embeddings. Proceedings of the 35th International Conference on Machine Learning (ICML 2018). URL: http://proceedings.mlr.press/v80/paassen18a.html

class edist.bedl.BEDL(K, T=5, phi=None, phi_grad=None, distance=None, distance_backtrace=None)

Implements the embedding edit distance learning (BEDL) scheme for metric learning on edit distances.

In more detail, this learns a median generalized learning vector quantization (MGLVQ) classifier on the data and an embedding of the input symbols which yields an edit distance that makes classification with this classifier easier.

K

The number of prototypes for the MGLVQ classifier.

Type:int
T

The number of learning epochs we use at most. Defaults to 5.

Type:int
phi

A squashing function to post-process each error term. Defaults to the identity.

Type:function (default = identity)
phi_grad

The gradient function corresponding to phi.

Type:function (default = one)
distance

The edit distance function that shall be learned. Defaults to the sequence edit distance sed.sed.

Type:function (default = sed.sed)
distance_backtrace

The matrix backtracing function for the distance. Defaults to sed.sed_backtrace_matrix. Note that this currently does NOT support ADP because ADP returns a different backtracing format.

Type:function (default = sed.sed_backtrace_matrix)
_classifier

The learned MGLVQ classifier model.

Type:class proto_dist_ml.MGLVQ
_idx

A mapping from alphabet to indices.

Type:dictionary
_embedding

A len(alphabet) x len(alphabet) - 1 embedding matrix for all symbols in the alphabet.

Type:array_like
_delta_obj

An internal object to make storing of the delta function more efficient.

Type:class bedl.EmbeddingDelta
_delta

The learned delta function.

Type:function
fit(X, y)

Trains a BEDL model on the given input data.

In more detail, we iterate the following steps in each training epoch:

  1. We re-compute the distance matrix.
  2. We adapt the MGLVQ classifier to the current distance matrix.
  3. We compute the matrix backtraces for the data-to-prototype distances.
  4. We optimize the embedding for the current data-to-prototype distances.

For more details, please refer to the ICML 2018 paper.

Parameters:
  • X (list) – a list of data points, each being either a list or a tree, depending on the edit distance that shall be learned.
  • y (array_like or list) – an array-like or list-like structure with labels for each data point.
Returns:

self

Return type:

class bedl.BEDL

class edist.bedl.EmbeddingDelta(embedding)

This class serves as a storage for an embedding to make the embedding delta function pickleable.

Parameters:embedding (array_like) – An embedding matrix.
_Delta

The current symbol-to-symbol distance matrix

Type:array_like
delta(x, y)

Computes the distance between two embedding vectors identified by their index.

Parameters:
  • x (int) – the left-hand-side index or None.
  • y (int) – the right-hand-side index or None.
Returns:

d – The Euclidean distance between the embedding for x and for y.

Return type:

float

delta_with_indexing(x, y)

Computes the distance between two symbols based on their embedding vectors.

Parameters:
  • x (str) – a symbol.
  • y (str) – another symbol.
Returns:

d – The Euclidean distance between the embedding for x and for y.

Return type:

float

edist.bedl.create_index(lst)

Creates a map of list elements to indices.

Parameters:lst (list) – A list.
Returns:idx – A map of list elements to indices.
Return type:dictionary
edist.bedl.index_data(Xs, idx)

Indexes all data in the input dataset according to the given index.

Parameters:
  • Xs (list) – A list of data, each being either a list or a tree in node list/adjacency list format.
  • idx (dictionary) – A map from symbols to indices.
Returns:

Ys – A copy of Xs, where each symbol is replaced by a symbol index.

Return type:

list

edist.bedl.initialize_embedding(size)

Sets up a size-dimensional simplex with side length 1 and size+1 vertices (i.e. an equilateral hyper-triangle).

Parameters:size (int) – the dimensionality of the simplex.
Returns:embedding – A size x size matrix, where each row contains one vertex of the simplex. The origin is the last vertex.
Return type:array_like
edist.bedl.reduce_backtrace(P, x, y, size)

Transforms the input backtrace matrix P of size len(x) + 1 x len(y) + 1 into a reduced matrix Phat of size size + 1 x size + 1 which accumulates the probabilities for same symbols being replaced, i.e. Phat[k, l] is the sum over all entries P[i, j] where x[i] = k and y[j] = l.

Parameters:
  • P (array_like) – A len(x) + 1 x len(y) + 1 backtracing matrix between sequence x and y, where P[i, j] is the probability of x[i] being replaced with y[j] in a co-optimal alignment, P[i, len(y)] is the deletion probability for x[i] and P[len(x), j] is the insertion probability for y[j]. The last row and column are optional (as in case of DTW).
  • X (list) – A list of objects, either sequences or trees.
  • Y (list) – A list of objects, either sequences or trees.
  • size (int) – The alphabet size.
Returns:

Phat – A size+1 x size+1 matrix which accumulates the probabilities for same symbols being replaced, i.e. Phat[k, l] is the sum over all entries P[i, j] where x[i] = k and y[j] = l.

Return type:

array_like