Python: String clustering with scikit-learn's dbscan, using Levenshtein distance as metric:

KaziJehangir picture KaziJehangir · Aug 2, 2016 · Viewed 15k times · Source

I have been trying to cluster multiple datasets of URLs (around 1 million each), to find the original and the typos of each URL. I decided to use the levenshtein distance as a similarity metric, along with dbscan as the clustering algorithm as k-means algorithms won't work because I do not know the number of clusters.

I am facing some problems using Scikit-learn's implementation of dbscan.

This snippet below works on small datasets in the format I an using, but since it is precomputing the entire distance matrix, that takes O(n^2) space and time and is way too much for my large datasets. I have run this for many hours but it just ends up taking all the memory of my PC.

lev_similarity = -1*np.array([[distance.levenshtein(w1[0],w2[0]) for w1 in words] for w2 in words])
dbscan = sklearn.cluster.DBSCAN(eps = 7, min_samples = 1)
dbscan.fit(lev_similarity)

So I figured I needed some way to compute the similarity on the fly and hence tried this method.

dbscan = sklearn.cluster.DBSCAN(eps = 7, min_samples = 1, metric = distance.levenshtein)
dbscan.fit(words)

But this method ends up giving me an error:

ValueError: could not convert string to float: URL

Which I realize means that its trying to convert the inputs to the similarity function to floats. But I don't want it to do that. As far as I understand, it just needs a function that can take two arguments and return a float value that it can then compare to eps, which the levenshtein distance should do.

I am stuck at this point, as I do not know the implementation details of sklearn's dbscan to find why it is trying to convert it to float, and neither do I have any better idea on how to avoid the O(n^2) matrix computation.

Please let me know if there is any better or faster way to cluster these many strings that I may have overlooked.

Answer

Luke picture Luke · Nov 2, 2016

From the scikit-learn faq you can do this by making a custom metric:

from leven import levenshtein       
import numpy as np
from sklearn.cluster import dbscan
data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return levenshtein(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)