I am a little puzzled by two different answers returned by SequenceMatcher
depending on the order of the arguments. Why is it so?
SequenceMatcher is not commutative:
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo").ratio()
0.6086956521739131
>>> SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm").ratio()
0.5217391304347826
SequenceMatcher.ratio
internally uses SequenceMatcher.get_matching_blocks
to calculate the ratio, I will walk you through the steps to see how that happens:
SequenceMatcher.get_matching_blocks
Return list of triples describing matching subsequences. Each triple is of the form
(i, j, n)
, and means thata[i:i+n] == b[j:j+n]
. The triples are monotonically increasing ini
andj
.The last triple is a dummy, and has the value
(len(a), len(b), 0)
. It is the only triple withn == 0
.If (i, j, n)
and(i', j', n')
are adjacent triples in the list, and the second is not the last triple in the list, theni+n != i'
orj+n != j'
; in other words, adjacent triples always describe non-adjacent equal blocks.
ratio
internally uses SequenceMatcher.get_matching_blocks
's results, and sums the sizes of all matched sequences returned bySequenceMatcher.get_matching_blocks
. This is the exact source code from difflib.py
:
matches = sum(triple[-1] for triple in self.get_matching_blocks())
The above line is critical, because the result of the above expression is used to compute the ratio. We'll see that shortly and how it impacts the calculation of the ratio.
>>> m1 = SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo")
>>> m2 = SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm")
>>> matches1 = sum(triple[-1] for triple in m1.get_matching_blocks())
>>> matches1
7
>>> matches2 = sum(triple[-1] for triple in m2.get_matching_blocks())
>>> matches2
6
As you can see, we have 7 and 6. These are simply the sums of the matched subsequences as returned by get_matching_blocks
. Why does this matter? Here's why, the ratio is computed in the following way, (this is from difflib
source code):
def _calculate_ratio(matches, length):
if length:
return 2.0 * matches / length
return 1.0
length
is len(a) + len(b)
where a
is the first sequence and b
being the second sequence.
Okay, enough talk, we need actions:
>>> length = len("Ebojfm Mzpm") + len("Ebfo ef Mfpo")
>>> m1.ratio()
0.6086956521739131
>>> (2.0 * matches1 / length) == m1.ratio()
True
Similarly for m2
:
>>> 2.0 * matches2 / length
0.5217391304347826
>>> (2.0 * matches2 / length) == m2.ratio()
True
Note: Not all SequenceMatcher(None a,b).ratio() == SequenceMatcher(None b,a).ratio()
are False
, sometimes they can be True
:
>>> s1 = SequenceMatcher(None, "abcd", "bcde").ratio()
>>> s2 = SequenceMatcher(None, "bcde", "abcd").ratio()
>>> s1 == s2
True
In case you're wondering why, this is because
sum(triple[-1] for triple in self.get_matching_blocks())
is the same for both SequenceMatcher(None, "abcd", "bcde")
and SequenceMatcher(None, "bcde", "abcd")
which is 3.