Levenshtein Distance: Inferring the edit operations from the matrix

borebardha picture borebardha · May 1, 2011 · Viewed 8.7k times · Source

I wrote Levenshtein algorithm in in C++

If I input:
string s: democrat
string t: republican

I get the matrix D filled-up and the number of operations (the Levenshtein distance) can be read in D[10][8] = 8
Beyond the filled matrix I want to construct the optimal solution. How must look this solution? I don't have an idea.
Please only write me HOW MUST LOOK for this example.

Answer

mjv picture mjv · May 2, 2011

The question is
Given the matrix produced by the Levenshtein algorithm, how can one find "the optimal solution"?
i.e. how can we find the precise sequence of string operations: inserts, deletes and substitution [of a single letter], necessary to convert the 's string' into the 't string'?

First, it should be noted that in many cases there are SEVERAL optimal solutions.
While the Levenshtein algorithm supplies the minimum number of operations (8 in democrat/republican example) there are many sequences (of 8 operations) which can produce this conversion.

By "decoding" the Levenshtein matrix, one can enumerate ALL such optimal sequences.
The general idea is that the optimal solutions all follow a "path", from top left corner to bottom right corner (or in the other direction), whereby the matrix cell values on this path either remain the same or increase by one (or decrease by one in the reverse direction), starting at 0 and ending at the optimal number of operations for the strings in question (0 thru 8 democrat/republican case). The number increases when an operation is necessary, it stays the same when the letter at corresponding positions in the strings are the same.

It is easy to produce an algorithm which produces such a path (slightly more complicated to produce all possible paths), and from such path deduce the sequence of operations.

This path finding algorithm should start at the lower right corner and work its way backward. The reason for this approach is that we know for a fact that to be an optimal solution it must end in this corner, and to end in this corner, it must have come from one of the 3 cells either immediately to its left, immediately above it or immediately diagonally. By selecting a cell among these three cells, one which satisfies our "same value or decreasing by one" requirement, we effectively pick a cell on one of the optimal paths. By repeating the operation till we get on upper left corner (or indeed until we reach a cell with a 0 value), we effectively backtrack our way on an optimal path.

Illustration with the democrat - republican example

It should also be noted that one can build the matrix in one of two ways: with 'democrat' horizontally or vertically. This doesn't change the computation of the Levenshtein distance nor does it change the list of operations needed; it only changes the way we interpret the matrix, for example moving horizontally on the "path" either means inserting a character [from the t string] or deleting a character [off the s string] depending whether 'string s' is "horizontal" or "vertical" in the matrix.

I'll use the following matrix. The conventions are therefore (only going in the left-to-right and/or top-to-bottom directions)

  • an horizontal move is an INSERTION of a letter from the 't string'
  • an vertical move is a DELETION of a letter from the 's string'
  • a diagonal move is either:
    • a no-operation (both letters at respective positions are the same); the number doesn't change
    • a SUBSTITUTION (letters at respective positions are distinct); the number increase by one.

Levenshtein matrix for s = "democrat", t="republican"

      r  e  p  u  b  l  i  c  a  n
   0  1  2  3  4  5  6  7  8  9  10
d  1  1  2  3  4  5  6  7  8  9  10
e  2  2  1  2  3  4  5  6  7  8  9
m  3  3  2  2  3  4  5  6  7  8  9
o  4  4  3  3  3  4  5  6  7  8  9
c  5  5  4  4  4  4  5  6  6  7  8
r  6  5  5  5  5  5  5  6  7  7  8
a  7  6  6  6  6  6  6  6  7  7  8
t  8  7  7  7  7  7  7  7  7  8  8

The arbitrary approach I use to select one path among several possible optimal paths is loosely described below:

Starting at the bottom-rightmost cell, and working our way backward toward 
the top left.

For each "backward" step, consider the 3 cells directly adjacent to the current
cell   (in the left, top or left+top directions)

   if the value in the diagonal cell (going up+left) is smaller or equal to the
      values found in the other two cells
   AND 
      if this is same or 1 minus the value of the current cell 
   then  "take the diagonal cell"
         if the value of the diagonal cell is one less than the current cell:
            Add a SUBSTITUTION operation (from the letters corresponding to
            the _current_ cell)
         otherwise: do not add an operation this was a no-operation.

   elseif the value in the cell to the left is smaller of equal to the value of
       the of the cell above current cell
   AND 
       if this value is same or 1 minus the value of the current cell 
   then "take the cell to left", and
        add an INSERTION of the letter corresponding to the cell
   else
       take the cell above, add
       Add a DELETION operation of the letter in 's string'

Following this informal pseudo-code, we get the following:

Start on the "n", "t" cell at bottom right.
Pick the [diagonal] "a", "a" cell as next destination since it is less than the other two (and satisfies the same or -1 condition).
Note that the new cell is one less than current cell
therefore the step 8 is substitute "t" with "n":     democra N

Continue with "a", "a" cell,
Pick the [diagonal] "c", "r" cell as next destination...
Note that the new cell is same value as current cell ==> no operation needed.

Continue with "c", "r" cell,
Pick the [diagonal] "i", "c" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 7 is substitute "r" with "c":     democ C an

Continue with "i", "c" cell,
Pick the [diagonal] "l", "o" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 6 is substitute "c" with "i":     demo I can

Continue with "l", "o" cell,
Pick the [diagonal] "b", "m" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 5 is substitute "o" with "l":     dem L ican

Continue with "b", "m" cell,
Pick the [diagonal]"u", "e" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 4 is substitute "m" with "b":     de B lican

Continue with "u", "e" cell,
Note the "diagonal" cell doesn't qualify, because the "left" cell is less than it. Pick the [left] "p", "e" cell as next destination...
therefore the step 3 is instert "u" after "e":     de U blican

Continue with "p", "e" cell,
again the "diagonal" cell doesn't qualify Pick the [left] "e", "e" cell as next destination...
therefore the step 2 is instert "p" after "e":     de P ublican

Continue with "e", "e" cell,
Pick the [diagonal] "r", "d" cell as next destination...
Note that the new cell is same value as current cell ==> no operation needed.

Continue with "r", "d" cell,
Pick the [diagonal] "start" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 1 is substitute "d" with "r":     R epublican

You've arrived at a cell which value is 0 : your work is done!