Most efficient way to calculate Levenshtein distance

efficiencyIsBliss · Jul 6, 2010 · Viewed 30.8k times

I just implemented a best-match file search algorithm that finds the closest match to a string in a dictionary. After profiling my code, I found that the overwhelming majority of the time is spent calculating the distance between the query and the candidate results. I currently compute the Levenshtein distance using a 2-D array, which makes each comparison an O(n^2) operation (more precisely, O(n·m) for strings of lengths n and m). I was hoping someone could suggest a faster way of doing the same.

Here's my implementation:

public int calculate(String root, String query)
{
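  // Row 0 and column 0 hold the character codes of the two strings;
  // row 1 and column 1 hold the base cases, so the DP proper starts at [2][2].
  int[][] arr = new int[root.length() + 2][query.length() + 2];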

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = (int) root.charAt(i - 2);
    arr[i][1] = (i - 1);
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = (int) query.charAt(i - 2);
    arr[1][i] = (i - 1);
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return Math.min(n1, Math.min(n2, n3));
}

Answer

Alex Martelli · Jul 6, 2010

The Wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation. The most applicable one in your case: if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!), you can reduce the computation to O(n·k) instead of O(n^2). Basically, you only compute the cells within k of the matrix diagonal and give up as soon as the minimum possible distance becomes greater than k.
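As a rough illustration of the bounded idea, here is a minimal sketch. The class name BoundedLevenshtein, the method name distance, and the convention of returning -1 for "farther than k" are my own assumptions for the example, not anything from your code or from Wikipedia. It keeps two rows instead of the full matrix and only fills the cells within k of the diagonal.

```java
final class BoundedLevenshtein {
    /**
     * Returns the Levenshtein distance between a and b if it is at most k,
     * or -1 if the distance is greater than k (hypothetical convention).
     */
    static int distance(String a, String b, int k) {
        int n = a.length(), m = b.length();
        if (k < 0) return -1;
        // The distance can never exceed the longer length, so clamp k.
        k = Math.min(k, Math.max(n, m));
        // The distance is at least the difference in lengths.
        if (Math.abs(n - m) > k) return -1;

        final int INF = k + 1;            // stands for "more than k"
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        // Row 0: only the first k + 1 cells can ever be read.
        for (int j = 0; j <= Math.min(m, k + 1); j++) prev[j] = j;

        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - k);  // first column inside the band
            int hi = Math.min(m, i + k);  // last column inside the band
            // Cell just left of the band: the real base case or "too far".
            if (lo == 1) curr[0] = Math.min(i, INF);
            else curr[lo - 1] = INF;
            int rowMin = curr[lo - 1];
            for (int j = lo; j <= hi; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(prev[j - 1] + cost,
                           Math.min(prev[j], curr[j - 1]) + 1);
                curr[j] = Math.min(best, INF);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Cell just right of the band is read by the next row.
            if (hi < m) curr[hi + 1] = INF;
            // Every cell in this row is already above k: give up.
            if (rowMin > k) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m] <= k ? prev[m] : -1;
    }
}
```

Each row touches at most 2k + 1 cells, which is where the O(n·k) bound comes from. Calling BoundedLevenshtein.distance("kitten", "sitting", 3) gives 3, while the same call with k = 2 gives -1.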

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far. This won't affect the worst-case behavior (the matches might arrive in decreasing order of distance, meaning you never bail out any sooner), but the average case should improve.
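A sketch of what that shrinking bound might look like for a dictionary search follows. The names ClosestMatch, distanceWithCutoff and closest are made up for illustration, and I assume the dictionary is a plain List of strings. The helper here is the simpler variant that computes whole rows and only bails out when a full row exceeds the limit, so it shows the early exit rather than the O(n·k) band. I also tighten the limit to one less than the best distance so far, since a tie would not replace the current best; that is a small variation on the idea above.

```java
import java.util.List;

final class ClosestMatch {
    /**
     * Two-row Levenshtein that returns -1 as soon as every cell in a row
     * is above limit, i.e. the final distance cannot be within limit.
     */
    static int distanceWithCutoff(String a, String b, int limit) {
        if (Math.abs(a.length() - b.length()) > limit) return -1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost,
                          Math.min(prev[j], curr[j - 1]) + 1);
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > limit) return -1;   // bail out early
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()] <= limit ? prev[b.length()] : -1;
    }

    /** Returns the first dictionary entry with the smallest distance, or null if empty. */
    static String closest(String query, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            // Only a strictly smaller distance can replace the current best.
            int d = distanceWithCutoff(query, candidate, bestDistance - 1);
            if (d >= 0) {
                best = candidate;
                bestDistance = d;
                if (d == 0) break;           // exact match, nothing can beat it
            }
        }
        return best;
    }
}
```

How much this helps depends on the order of the candidates: the earlier a good match turns up, the sooner the remaining comparisons get cut short.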

I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).