scikit-learn: clustering text documents using DBSCAN

Christian picture Christian · Aug 9, 2014 · Viewed 11.9k times · Source

I'm tryin to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have my problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as clustering algorithm. Adopting these example with k-means to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I read so far -- please correct me here if needed -- DBSCAN or MeanShift seem the be more appropriate in my case. The scikit-learn website provides examples for each cluster algorithm. The problem is now, that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.

My minimal code is as follows:

docs = []
for item in [database]:
    docs.append(item)

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)

X = X.todense() # <-- This line was needed to resolve the isse

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...

(My documents are already processed, i.e., stopwords have been removed and an Porter Stemmer has been applied.)

When I run this code, I get the following error when instatiating DBSCAN and calling fit():

...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range

Clicking on the line in dbscan_.py that throws the error, I noticed the following line

...
X = np.asarray(X)
n = X.shape[0]
...

When I use these to lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after the command X.shape = (). Hence X.shape[0] bombs -- before, X.shape[0] correctly refers to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, something is computing heavily. But after some seconds, I get another error:

...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.

Answer

cyniphile picture cyniphile · Oct 27, 2015

It looks like sparse representations for DBSCAN are supported as of Jan. 2015.

I upgraded sklearn to 0.16.1 and it worked for me on text.