Parallel Pairwise Edit Distance Computation

Provides general utility functions to compute pairwise edit distances in parallel.

edist.multiprocess.pairwise_backtraces(Xs, Ys, dist_backtrace, delta=None, num_jobs=8)

Computes the pairwise backtraces between the objects in Xs and the objects in Ys. Each object in Xs and Ys needs to be a valid input for the given backtrace function dist_backtrace, i.e. a sequence or a tree.

Optionally, a component-wise distance function delta can be specified, which is then forwarded to dist_backtrace.

Parameters:
  • Xs (list) – a list of sequences or trees.
  • Ys (list) – another list of sequences or trees.
  • dist_backtrace (function) – a function that takes an element of Xs as first and an element of Ys as second input and returns an arbitrary object.
  • delta (function (default = None)) – a function that takes two elements of the input sequences or trees as inputs and returns their pairwise distance, where delta(x, None) should be the cost of deleting x and delta(None, y) should be the cost of inserting y. If this is not None, dist_backtrace needs to accept an optional argument ‘delta’ as well.
  • num_jobs (int (default = 8)) – the number of jobs to be used for parallel processing.
Returns:

B – a len(Xs) x len(Ys) list of lists of pairwise backtraces.

Return type:

list
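The shape of the result and the role of delta can be illustrated with a small sequential stand-in. This is a sketch under stated assumptions, not the library's implementation: toy_backtrace is a hypothetical backtrace function for plain sequences that returns a list of (left, right) pairs, with None marking a gap, and the nested list comprehension only mimics the len(Xs) x len(Ys) layout of B without any parallelism.

```python
def toy_backtrace(x, y, delta=None):
    """Hypothetical backtrace function: aligns two sequences and returns
    a list of (left, right) pairs, where None marks a deletion or insertion."""
    if delta is None:
        # Assumed default: unit cost for mismatches, deletions, and insertions.
        delta = lambda a, b: 0.0 if a == b else 1.0
    m, n = len(x), len(y)
    # Fill the dynamic programming table of edit costs.
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + delta(x[i - 1], None)
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + delta(None, y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),  # replace/match
                D[i - 1][j] + delta(x[i - 1], None),          # delete
                D[i][j - 1] + delta(None, y[j - 1]),          # insert
            )
    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(x[i - 1], y[j - 1]):
            ops.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(x[i - 1], None):
            ops.append((x[i - 1], None))
            i -= 1
        else:
            ops.append((None, y[j - 1]))
            j -= 1
    ops.reverse()
    return ops


Xs = ["ab", "abc"]
Ys = ["b", "ab", "cab"]
# Sequential stand-in for the parallel call: B[k][l] is the backtrace
# between Xs[k] and Ys[l].
B = [[toy_backtrace(x, y) for y in Ys] for x in Xs]
```

Under these assumptions, B[0][1] pairs each symbol of "ab" with itself, and every entry of B can be any object the backtrace function chooses to return.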

edist.multiprocess.pairwise_distances(Xs, Ys, dist, delta=None, num_jobs=8)

Computes the pairwise edit distances between the objects in Xs and the objects in Ys. Each object in Xs and Ys needs to be a valid input for the given distance function, i.e. a sequence or a tree.

Optionally, a component-wise distance function delta can be specified, which is then forwarded to dist.

Parameters:
  • Xs (list) – a list of sequences or trees.
  • Ys (list) – another list of sequences or trees.
  • dist (function) – a function that takes an element of Xs as first and an element of Ys as second input and returns a scalar distance value between them.
  • delta (function (default = None)) – a function that takes two elements of the input sequences or trees as inputs and returns their pairwise distance, where delta(x, None) should be the cost of deleting x and delta(None, y) should be the cost of inserting y. If this is not None, dist needs to accept an optional argument ‘delta’ as well.
  • num_jobs (int (default = 8)) – the number of jobs to be used for parallel processing.
Returns:

D – a len(Xs) x len(Ys) matrix of pairwise edit distance values.

Return type:

array_like
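The contract between dist and delta described above can be sketched as follows. The names unit_delta, toy_edit_distance, and pairwise_distances_reference are hypothetical and introduced only for illustration. The sketch assumes that delta is handed to dist as a keyword argument, and it uses a sequential loop and a nested list in place of the parallel computation and the array-like return value.

```python
def unit_delta(x, y):
    """Hypothetical component-wise cost following the documented convention:
    unit_delta(x, None) is the deletion cost of x,
    unit_delta(None, y) is the insertion cost of y."""
    if x is None or y is None:
        return 1.0
    return 0.0 if x == y else 1.0


def toy_edit_distance(x, y, delta=None):
    """Illustrative sequence edit distance that accepts an optional delta."""
    if delta is None:
        delta = unit_delta
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + delta(x[i - 1], None)
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + delta(None, y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),  # replace/match
                D[i - 1][j] + delta(x[i - 1], None),          # delete
                D[i][j - 1] + delta(None, y[j - 1]),          # insert
            )
    return D[m][n]


def pairwise_distances_reference(Xs, Ys, dist, delta=None):
    """Sequential stand-in: entry [k][l] holds dist(Xs[k], Ys[l])."""
    D = []
    for x in Xs:
        if delta is None:
            D.append([dist(x, y) for y in Ys])
        else:
            # Assumption: delta is forwarded as a keyword argument.
            D.append([dist(x, y, delta=delta) for y in Ys])
    return D


Xs = ["kitten", "flaw"]
Ys = ["sitting", "lawn", "kitten"]
D = pairwise_distances_reference(Xs, Ys, toy_edit_distance, delta=unit_delta)
```

Here D has len(Xs) rows and len(Ys) columns, so D[0][2] compares "kitten" with itself and is zero under the unit costs.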

edist.multiprocess.pairwise_distances_symmetric(Xs, dist, delta=None, num_jobs=8)

Computes the pairwise edit distances between the objects in Xs, assuming that the distance measure is symmetric. Each object in Xs needs to be a valid input for the given distance function, i.e. a sequence or a tree. Because a symmetric distance only needs to be computed once per unordered pair, this method is roughly twice as fast as pairwise_distances.

Optionally, a component-wise distance function delta can be specified, which is then forwarded to dist.

Parameters:
  • Xs (list) – a list of sequences or trees.
  • dist (function) – a function that takes two elements of Xs as inputs and returns a scalar distance value between them.
  • delta (function (default = None)) – a function that takes two elements of the input sequences or trees as inputs and returns their pairwise distance, where delta(x, None) should be the cost of deleting x and delta(None, y) should be the cost of inserting y. If this is not None, dist needs to accept an optional argument ‘delta’ as well.
  • num_jobs (int (default = 8)) – the number of jobs to be used for parallel processing.
Returns:

D – a symmetric len(Xs) x len(Xs) matrix of pairwise edit distance values.

Return type:

array_like
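The speed-up from symmetry can be made concrete with a minimal sequential sketch. It is an illustration under assumptions, not the library's actual code: levenshtein, counted_dist, and pairwise_symmetric_reference are hypothetical names, and the sketch additionally assumes that the distance of an object to itself is zero, so the diagonal is never evaluated.

```python
from itertools import combinations


def levenshtein(x, y):
    """Standard two-row Levenshtein distance with unit costs."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j - 1] + (a != b),  # replace/match
                            prev[j] + 1,             # delete
                            curr[j - 1] + 1))        # insert
        prev = curr
    return prev[-1]


call_count = {"n": 0}


def counted_dist(x, y):
    """Wrapper that counts how often the distance is evaluated."""
    call_count["n"] += 1
    return levenshtein(x, y)


def pairwise_symmetric_reference(Xs, dist):
    """Evaluate dist once per unordered pair and mirror the result."""
    n = len(Xs)
    D = [[0.0] * n for _ in range(n)]  # assumed: dist(x, x) == 0
    for k, l in combinations(range(n), 2):
        D[k][l] = D[l][k] = dist(Xs[k], Xs[l])
    return D


Xs = ["kitten", "sitting", "flaw", "lawn"]
D = pairwise_symmetric_reference(Xs, counted_dist)
```

For the four inputs above, the distance function is evaluated for 6 unordered pairs instead of all 16 ordered ones, which is where the roughly twofold saving for larger inputs comes from.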