Tree Edit Distance

Implements the tree edit distance of Zhang and Shasha (1989) and its backtracing in cython.

edist.ted.extract_from_tuple_input()

Assumes that both x and y are tuples and unpacks those tuples.

x: tuple
a tuple
y: tuple
another tuple
Returns:
  • x_nodes (list) – x[0]
  • x_adj (list) – x[1]
  • y_nodes (list) – y[0]
  • y_adj (list) – y[1]
Raises:ValueError – if the input is not tuple-shaped.
edist.ted.keyroots()

Computes the keyroots of a tree based on its outermost right leaf array. The keyroot for a node i is defined as the lowest k, such that orl[i] = orl[k].

Parameters:orl (array_like) – An outermost right leaf array as computed by the outermost_right_leaves function above.
Returns:keyroots – An array of keyroots in descending order.
Return type:int array
edist.ted.outermost_right_leaves()

Computes the outermost right leaves of a tree based on its adjacency list. The outermost right leaf of a tree is defined as recursively accessing the right-most child of a node until we hit a leaf.

Note that we assume a proper depth-first-search order of adj, i.e. for every node i, the following indices are all part of the subtree rooted at i until we hit the index of i’s right sibling or the end of the tree.

Parameters:adj (list) – An adjacency list representation of the tree, i.e. an array such that for every i, adj[i] is the list of child indices for node i.
Returns:orl – An array containing the outermost right leaf index for every node in the tree.
Return type:int array
edist.ted.standard_ted()

Computes the standard tree edit distance between the trees x and y, each described by a list of nodes and an adjacency list adj, where adj[i] is a list of indices pointing to children of node i.

The ‘standard’ refers to the fact that we use the kronecker distance as delta, i.e. this call computes the same as

ted(x_nodes, x_adj, y_nodes, y_adj, kronecker_distance) where

kronecker_distance(x, y) = 1 if x != y and 0 if x = y.

However, this implementation here is notably faster because we can apply integer arithmetic.

Note that we assume a proper depth-first-search order of adj, i.e. for every node i, the following indices are all part of the subtree rooted at i until we hit the index of i’s right sibling or the end of the tree.

Parameters:
  • x_nodes (list or tuple) – a list of nodes for tree x OR a tuple of the form (x_nodes, x_adj).
  • x_adj (list or tuple) – an adjacency list for tree x OR a tuple of the form (y_nodes, y_adj).
  • y_nodes (list (default = x_adj[0])) – a list of nodes for tree y.
  • y_adj (list (default = x_adj[1])) – an adjacency list for tree y.
Returns:

d – the standard tree edit distance between x and y according.

Return type:

int

edist.ted.standard_ted_backtrace()

Computes the standard tree edit distance between the trees x and y, each described by a list of nodes and an adjacency list adj, where adj[i] is a list of indices pointing to children of node i. This function returns an alignment representation of the distance.

The ‘standard’ refers to the fact that we use the kronecker distance as delta, i.e. this call computes the same as

ted(x_nodes, x_adj, y_nodes, y_adj, kronecker_distance) where

kronecker_distance(x, y) = 1 if x != y and 0 if x = y.

However, this implementation here is notably faster because we can apply integer arithmetic.

Note that we assume a proper depth-first-search order of adj, i.e. for every node i, the following indices are all part of the subtree rooted at i until we hit the index of i’s right sibling or the end of the tree.

Parameters:
  • x_nodes (list or tuple) – a list of nodes for tree x OR a tuple of the form (x_nodes, x_adj).
  • x_adj (list or tuple) – an adjacency list for tree x OR a tuple of the form (y_nodes, y_adj).
  • y_nodes (list (default = x_adj[0])) – a list of nodes for tree y.
  • y_adj (list (default = x_adj[1])) – an adjacency list for tree y.
Returns:

alignment – A co-optimal alignment between x and y according to the tree edit distance.

Return type:

class alignment.Alignment

edist.ted.standard_ted_backtrace_matrix()

Computes a matrix P where entry P[i, j] represents how often node i in tree x was aligned with node j in tree y in co-optimal alignments according to the standard tree edit distance.

Note that we assume a proper depth-first-search order of adj, i.e. for every node i, the following indices are all part of the subtree rooted at i until we hit the index of i’s right sibling or the end of the tree.

Parameters:
  • x_nodes (list or tuple) – a list of nodes for tree x OR a tuple of the form (x_nodes, x_adj).
  • x_adj (list or tuple) – an adjacency list for tree x OR a tuple of the form (y_nodes, y_adj).
  • y_nodes (list (default = x_adj[0])) – a list of nodes for tree y.
  • y_adj (list (default = x_adj[1])) – an adjacency list for tree y.
Returns:

  • P (array_like) – a matrix, where entry P[i, j] specifies the fraction of co-optimal alignments in which node x[i] has been aligned with node y[j]. P[i, n] contains the fraction of deletions of node x[i] and P[m, j] the fraction of insertions of node y[j].
  • K (array_like) – a matrix that contains the counts for all co-optimal alignments in which node x[i] has been aligned with node y[j].
  • k (int) – the number of co-optimal alignments overall, such that P = K / k.

edist.ted.ted()

Computes the tree edit distance between the trees x and y, each described by a list of nodes and an adjacency list adj, where adj[i] is a list of indices pointing to children of node i.

Note that we assume a proper depth-first-search order of adj, i.e. for every node i, the following indices are all part of the subtree rooted at i until we hit the index of i’s right sibling or the end of the tree.

Parameters:
  • x_nodes (list or tuple) – a list of nodes for tree x OR a tuple of the form (x_nodes, x_adj).
  • x_adj (list or tuple) – an adjacency list for tree x OR a tuple of the form (y_nodes, y_adj).
  • y_nodes (list (default = x_adj[0])) – a list of nodes for tree y.
  • y_adj (list (default = x_adj[1])) – an adjacency list for tree y.
  • delta (function (default = None)) – a function that takes two nodes as inputs and returns their pairwise distance, where delta(x, None) should be the cost of deleting x and delta(None, y) should be the cost of inserting y. If undefined, this method calls standard_ted instead.
Returns:

d – the tree edit distance between x and y according to delta.

Return type:

float

edist.ted.ted_backtrace()

Computes the tree edit distance between the trees x and y, each described by a list of nodes and an adjacency list adj, where adj[i] is a list of indices pointing to children of node i. This function returns an alignment representation of the distance.

Note that we assume a proper depth-first-search order of adj, i.e. for every node i, the following indices are all part of the subtree rooted at i until we hit the index of i’s right sibling or the end of the tree.

Parameters:
  • x_nodes (list or tuple) – a list of nodes for tree x OR a tuple of the form (x_nodes, x_adj).
  • x_adj (list or tuple) – an adjacency list for tree x OR a tuple of the form (y_nodes, y_adj).
  • y_nodes (list (default = x_adj[0])) – a list of nodes for tree y.
  • y_adj (list (default = x_adj[1])) – an adjacency list for tree y.
  • delta (function (default = None)) – a function that takes two nodes as inputs and returns their pairwise distance, where delta(x, None) should be the cost of deleting x and delta(None, y) should be the cost of inserting y. If undefined, this method calls standard_ted instead.
Returns:

alignment – A co-optimal alignment between x and y according to the tree edit distance.

Return type:

class alignment.Alignment

edist.ted.ted_backtrace_matrix()

Computes a matrix P where entry P[i, j] represents how often node i in tree x was aligned with node j in tree y in co-optimal alignments according to the tree edit distance.

Note that we assume a proper depth-first-search order of adj, i.e. for every node i, the following indices are all part of the subtree rooted at i until we hit the index of i’s right sibling or the end of the tree.

Parameters:
  • x_nodes (list or tuple) – a list of nodes for tree x OR a tuple of the form (x_nodes, x_adj).
  • x_adj (list or tuple) – an adjacency list for tree x OR a tuple of the form (y_nodes, y_adj).
  • y_nodes (list (default = x_adj[0])) – a list of nodes for tree y.
  • y_adj (list (default = x_adj[1])) – an adjacency list for tree y.
  • delta (function (default = None)) – a function that takes two nodes as inputs and returns their pairwise distance, where delta(x, None) should be the cost of deleting x and delta(None, y) should be the cost of inserting y. If undefined, this method calls standard_ted instead.
Returns:

  • P (array_like) – a matrix, where entry P[i, j] specifies the fraction of co-optimal alignments in which node x[i] has been aligned with node y[j]. P[i, n] contains the fraction of deletions of node x[i] and P[m, j] the fraction of insertions of node y[j].
  • K (array_like) – a matrix that contains the counts for all co-optimal alignments in which node x[i] has been aligned with node y[j].
  • k (int) – the number of co-optimal alignments overall, such that P = K / k.