I would like to approximately match Strings using Locality sensitive hashing. I have many Strings>10M that may contain typos. For every String I would like to make a comparison with all the other strings and select those with an edit distance according to some threshold.
That is, the naive solution requires O(n^2) comparisons. In order to avoid that issue I was thinking of using Locality Sensitive Hashing. Then near similar strings would result to the same buckets and I need to do only inside bucket search. So it is O(n*C) where C is the bucket size.
However, I do not understand how to represent the strings. If it was text I would represented in vector space. My main question is if this is tractable using LSH and then an appropriate vector representation of the string.
Am I able to use an already implemented library for this task? or it depends on my problem so I must implement it myself? Is there any python package that does this?
The best academic resource I've found on the subject is Chapter 3 of Mining of Massive Datasets, which gives an awesome overview of locality sensitive hashing and minhashing.
Very briefly, the idea is to take several strings, vectorize those strings, then pass a sliding window over the resulting vectors. If two vectors have the same value in the same window position, mark them as candidates for more fine-grained similarity analysis.
There's a great implementation in the Python datasketch library (pip install datasketch
). Here's an example that shows you can catch fuzzy string similarity:
from datasketch import MinHash, MinHashLSH
from nltk import ngrams
data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
'weights controls the relative importance between minizing false positive',
'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]
# Create an MinHashLSH index optimized for Jaccard threshold 0.5,
# that accepts MinHash objects with 128 permutations functions
lsh = MinHashLSH(threshold=0.4, num_perm=128)
# Create MinHash objects
minhashes = {}
for c, i in enumerate(data):
minhash = MinHash(num_perm=128)
for d in ngrams(i, 3):
minhash.update("".join(d).encode('utf-8'))
lsh.insert(c, minhash)
minhashes[c] = minhash
for i in xrange(len(minhashes.keys())):
result = lsh.query(minhashes[i])
print "Candidates with Jaccard similarity > 0.4 for input", i, ":", result
This returns:
Candidates with Jaccard similarity > 0.4 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.4 for input 3 : [2, 3]