According to the python-Levenshtein.ratio source:
https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722
it's computed as (lensum - ldist) / lensum. This works for:
# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance('ab', 'a') # returns 1
Levenshtein.ratio('ab', 'a') # returns 0.666666
However, it seems to break with
Levenshtein.distance('ab', 'ac') # returns 1
Levenshtein.ratio('ab', 'ac') # returns 0.5
I feel I must be missing something very simple... but why is it not 0.75?
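To make my expectation concrete, here is the arithmetic I had in mind (expected_ratio is just my own helper; I'm assuming lensum is the sum of the two string lengths and ldist is the plain edit distance):

import Levenshtein

def expected_ratio(s1, s2):
    # my reading of the C source: (lensum - ldist) / lensum
    lensum = len(s1) + len(s2)
    ldist = Levenshtein.distance(s1, s2)
    return (lensum - ldist) / lensum

expected_ratio('ab', 'a')   # (3 - 1) / 3 = 0.666...  -> matches Levenshtein.ratio('ab', 'a')
expected_ratio('ab', 'ac')  # (4 - 1) / 4 = 0.75      -> but Levenshtein.ratio('ab', 'ac') is 0.5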
The Levenshtein distance for 'ab' and 'ac' is computed as below. The alignment is:
a c
a b
Alignment length = 2
Number of mismatches = 1
The Levenshtein distance is 1, because only one substitution is required to transform 'ac' into 'ab' (or the reverse).
Distance ratio = (Levenshtein distance) / (alignment length) = 1/2 = 0.5
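A minimal sketch of that calculation (distance_ratio is my own helper, not part of the library; the alignment length of 2 is taken from the alignment above):

import Levenshtein

def distance_ratio(s1, s2, alignment_length):
    # distance ratio = Levenshtein distance / alignment length
    return Levenshtein.distance(s1, s2) / alignment_length

distance_ratio('ab', 'ac', alignment_length=2)  # 1 / 2 = 0.5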
EDIT
You are writing (lensum - ldist) / lensum = 1 - ldist/lensum = 1 - 0.5 = 0.5. But this is the matching ratio (similarity), not the distance.
In the REFERENCE, you may notice it is written as:
Matching % p = (1 - l/m) × 100
where l is the Levenshtein distance and m is the length of the longest of the two words
(notice: some authors use the longest of the two; I used the alignment length):
(1 - 3/7) × 100 = 57.14...
Word 1   Word 2   RATIO   Mis-Match   Match%
AB       AB       0       0           (1 - 0/2)*100 = 100%
CD       AB       1       2           (1 - 2/2)*100 = 0%
AB       AC       0.5     1           (1 - 1/2)*100 = 50%
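A small sketch that reproduces the table above (match_percent is my own helper; since these pairs have equal length, I take m to be the common word length, i.e. the alignment length):

import Levenshtein

def match_percent(w1, w2):
    # p = (1 - l/m) * 100, with l the Levenshtein distance and m the alignment length
    l = Levenshtein.distance(w1, w2)
    m = len(w1)  # equal-length words, so alignment length == word length
    return (1 - l / m) * 100

for w1, w2 in [('AB', 'AB'), ('CD', 'AB'), ('AB', 'AC')]:
    print(w1, w2, match_percent(w1, w2))
# AB AB 100.0
# CD AB 0.0
# AB AC 50.0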
Why do some authors divide by the alignment length and others by the maximum length of the two strings? Because Levenshtein does not consider gaps: the distance is simply the number of edits (insertions + deletions + replacements). The Needleman–Wunsch algorithm, the standard global alignment, does consider gaps. This (gaps) is the difference between Needleman–Wunsch and Levenshtein, so many papers use the maximum length of the two sequences (BUT THIS IS MY OWN UNDERSTANDING, AND I AM NOT 100% SURE).
Here is the IEEE TRANSACTIONS ON PATTERN ANALYSIS paper: Computation of Normalized Edit Distance and Applications. In this paper the normalized edit distance is defined as follows:
Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P) / L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (the length of P).
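Here is a brute-force sketch of that definition, written by me (not taken from the paper): it tabulates the minimal weight for every possible path length and then takes the minimum of W(P) / L(P). Unit weights for insert/delete/substitute and weight 0 for a match are my own assumption, and it is only practical for short strings.

from math import inf

def normalized_edit_distance(x, y, w_ins=1, w_del=1, w_sub=1):
    # d(X, Y) = min over editing paths P of W(P) / L(P)
    n, m = len(x), len(y)
    max_len = n + m  # longest possible editing path (all deletions, then all insertions)
    # D[k][i][j] = minimal weight of a path of exactly k operations turning x[:i] into y[:j]
    D = [[[inf] * (m + 1) for _ in range(n + 1)] for _ in range(max_len + 1)]
    D[0][0][0] = 0.0
    for k in range(1, max_len + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                best = inf
                if i > 0:                                   # delete x[i-1]
                    best = min(best, D[k - 1][i - 1][j] + w_del)
                if j > 0:                                   # insert y[j-1]
                    best = min(best, D[k - 1][i][j - 1] + w_ins)
                if i > 0 and j > 0:                         # substitute (weight 0 on a match)
                    cost = 0 if x[i - 1] == y[j - 1] else w_sub
                    best = min(best, D[k - 1][i - 1][j - 1] + cost)
                D[k][i][j] = best
    return min(D[k][n][m] / k for k in range(1, max_len + 1) if D[k][n][m] < inf)

normalized_edit_distance('ab', 'ac')  # best path: match 'a' (0) + substitute b->c (1), so 1/2 = 0.5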