How does Pythons SequenceMatcher work?

user2399453 picture user2399453 · Feb 20, 2016 · Viewed 10.5k times · Source

I am a little puzzled by two different answers returned by SequenceMatcher depending on the order of the arguments. Why is it so?

Example

SequenceMatcher is not commutative:

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo").ratio()
0.6086956521739131

>>> SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm").ratio()
0.5217391304347826

Answer

direprobs picture direprobs · Aug 2, 2017

SequenceMatcher.ratio internally uses SequenceMatcher.get_matching_blocks to calculate the ratio, I will walk you through the steps to see how that happens:

SequenceMatcher.get_matching_blocks

Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.

The last triple is a dummy, and has the value (len(a), len(b), 0). It is the only triple with n == 0. If (i, j, n) and (i', j', n') are adjacent triples in the list, and the second is not the last triple in the list, then i+n != i' or j+n != j'; in other words, adjacent triples always describe non-adjacent equal blocks.

ratio internally uses SequenceMatcher.get_matching_blocks 's results, and sums the sizes of all matched sequences returned bySequenceMatcher.get_matching_blocks. This is the exact source code from difflib.py:

matches = sum(triple[-1] for triple in self.get_matching_blocks())

The above line is critical, because the result of the above expression is used to compute the ratio. We'll see that shortly and how it impacts the calculation of the ratio.


>>> m1 = SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo")
>>> m2 = SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm")

>>> matches1 = sum(triple[-1] for triple in m1.get_matching_blocks())
>>> matches1
7
>>> matches2 = sum(triple[-1] for triple in m2.get_matching_blocks())
>>> matches2
6

As you can see, we have 7 and 6. These are simply the sums of the matched subsequences as returned by get_matching_blocks. Why does this matter? Here's why, the ratio is computed in the following way, (this is from difflib source code):

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0

length is len(a) + len(b) where a is the first sequence and b being the second sequence.

Okay, enough talk, we need actions:

>>> length = len("Ebojfm Mzpm") + len("Ebfo ef Mfpo") 
>>> m1.ratio()
0.6086956521739131
>>> (2.0 * matches1 / length)  == m1.ratio()
True

Similarly for m2:

>>> 2.0 * matches2 / length
0.5217391304347826 
>>> (2.0 * matches2 / length) == m2.ratio()
True

Note: Not all SequenceMatcher(None a,b).ratio() == SequenceMatcher(None b,a).ratio() are False, sometimes they can be True:

>>> s1 = SequenceMatcher(None, "abcd", "bcde").ratio()
>>> s2 = SequenceMatcher(None, "bcde", "abcd").ratio()
>>> s1 == s2
True

In case you're wondering why, this is because

sum(triple[-1] for triple in self.get_matching_blocks())

is the same for both SequenceMatcher(None, "abcd", "bcde") and SequenceMatcher(None, "bcde", "abcd") which is 3.