I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I have to find the minimum hamming distance of all sequences in the list. I implemented a naive brute force algorithm, which has been running for more than a day and has not yet given a solution. My code is
dmin=len(trans[0])
for i in xrange(len(trans)):
for j in xrange(i+1,len(trans)):
dist=hamdist(trans[i][:-1], trans[j][:-1])
if dist < dmin:
dmin = dist
Is there a more efficient method to do this? Here hamdist is a function I wrote to find hamming distances. It is
def hamdist(str1, str2):
diffs = 0
if len(str1) != len(str2):
return max(len(str1),len(str2))
for ch1, ch2 in zip(str1, str2):
if ch1 != ch2:
diffs += 1
return diffs
You could optimize your hamdist
function by adding an optional parameter containing the minimum distance you have got so far, this way if diffs
reaches that value you stop calculating the distance because this comparison will give you a greater distance than the minimum:
def hamdist(str1, str2,prevMin=None):
diffs = 0
if len(str1) != len(str2):
return max(len(str1),len(str2))
for ch1, ch2 in zip(str1, str2):
if ch1 != ch2:
diffs += 1
if prevMin is not None and diffs>prevMin:
return None
return diffs
You will need to adapt your main loop to work with None
return value from hamdist
:
dmin=len(trans[0])
for i in xrange(len(trans)):
for j in xrange(i+1,len(trans)):
dist=hamdist(trans[i][:-1], trans[j][:-1])
if dist is not None and dist < dmin:
dmin = dist