Using Python efficiently to calculate Hamming distances

schoon · Jul 4, 2014 · Viewed 16.5k times

I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) available. How do I do this efficiently? My pseudocode would be:

For each string
    currentstring = string
    For each string other than currentstring
        Calculate Hamming distance

I'd like to output the results as a matrix and be able to retrieve values. I'd also like to run it via Hadoop Streaming!

Any pointers are gratefully received.

Here is what I have tried, but it is slow:

import glob
import pHash

path = lotsdir + '*.*'  # lotsdir is the directory holding the files
files = glob.glob(path)
files.sort()
setOfFiles = set(files)
print len(setOfFiles)
i = 0
j = 0
for fname in files:
    print 'fname', fname, 'setOfFiles', len(setOfFiles)
    # copy() matters here: plain assignment would alias setOfFiles,
    # so remove() would shrink the original set on every iteration
    oneLessSetOfFiles = setOfFiles.copy()
    oneLessSetOfFiles.remove(fname)
    i += 1

    for compareFile in oneLessSetOfFiles:
        j += 1
        # both hashes are recomputed on every pass through the inner
        # loop, which is what makes this slow
        hash1 = pHash.imagehash(fname)
        hash2 = pHash.imagehash(compareFile)
        print ...

Answer

Matthew Franglen · Jul 4, 2014

The distance package in Python provides a Hamming distance calculator:

import distance

distance.levenshtein("lenvestein", "levenshtein")
distance.hamming("hamming", "hamning")
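Applied to hashes like the one in the question, distance.hamming counts the positions at which two equal-length strings differ (the trailing 7 below is a made-up variation for illustration):

import distance

# two equal-length hex hash strings differing in exactly one position
print(distance.hamming("50358c591cef4d76", "50358c591cef4d77"))  # -> 1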

There is also a Levenshtein package which provides Levenshtein distance calculations. Finally, difflib from the standard library can provide some simple string comparisons, as sketched below.
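A minimal difflib sketch (note this gives a similarity ratio rather than a Hamming distance):

import difflib

# SequenceMatcher yields a similarity ratio in [0, 1]; 1.0 means identical
matcher = difflib.SequenceMatcher(None, "50358c591cef4d76", "50358c591cef4d77")
print(matcher.ratio())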

There is more information and example code for all of these on this old question.

Your existing code is slow because you recalculate the file hash in the innermost loop, which means every file gets hashed many times. If you calculate each hash once up front then the process will be much more efficient:

files = ...
# hash every file exactly once
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
# compare the precomputed hashes for every pair of distinct files;
# hamming here is your Hamming distance function
file_comparisons = [
    (hamming(first[1], second[1]), first, second)
    for second in files_and_hashes
    for first in files_and_hashes
    if first[0] != second[0]
]
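To retrieve values matrix-style, as the question asks, one option is to index those comparisons by file pair. A minimal sketch assuming the file_comparisons list above:

# build a pair-indexed lookup table from the comparisons
distance_matrix = {
    (first[0], second[0]): dist
    for dist, first, second in file_comparisons
}

# look up the distance for any ordered pair of file names
# ('a.jpg' and 'b.jpg' are hypothetical names)
print(distance_matrix[('a.jpg', 'b.jpg')])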

This process fundamentally involves O(N^2) comparisons, so distributing it in a way suitable for a map-reduce problem means taking the complete set of strings and dividing them into B blocks, where B^2 = M (B = number of string blocks, M = number of workers). So if you had 16 strings and 4 workers, you would split the list of strings into two blocks (so a block size of 8). An example of dividing the work follows:

all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
# each worker gets one block-against-block comparison
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)
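compare_all and the machine_N handles above are placeholders; the per-worker job is just an all-pairs comparison between two blocks. A minimal local sketch of that job (hypothetical function name, assuming blocks of (file, hash) tuples as built earlier):

def compare_blocks(block_a, block_b):
    # all-pairs Hamming distances between two blocks of (file, hash) tuples
    return [
        (hamming(hash_a, hash_b), name_a, name_b)
        for name_a, hash_a in block_a
        for name_b, hash_b in block_b
        if name_a != name_b
    ]

Under Hadoop Streaming, each block pair could then become the input to one map task, with records like the tuples above as its output.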