Scikit-Learn: Predicting new points with DBSCAN

slaw picture slaw · Jan 7, 2015 · Viewed 17k times · Source

I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7):

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(random_state=0)
dbscan.fit(X)

However, I found that there was no built-in function (aside from "fit_predict") that could assign the new data points, Y, to the clusters identified in the original data, X. The K-means method has a "predict" function but I want to be able to do the same with DBSCAN. Something like this:

dbscan.predict(X, Y)

So that the density can be inferred from X but the return values (cluster assignments/labels) are only for Y. From what I can tell, this capability is available in R so I assume that it is also somehow available in Python. I just can't seem to find any documentation for this.

Also, I have tried searching for reasons as to why DBSCAN may not be used for labeling new data but I haven't found any justifications.

Answer

kidmose picture kidmose · Feb 17, 2016

While Anony-Mousse has some good points (Clustering is indeed not classifying) I think the ability of assigning new points has it's usefulness. *

Based on the original paper on DBSCAN and robertlaytons ideas on github.com/scikit-learn, I suggest running through core points and assigning to the cluster of the first core point that is within eps of you new point. Then it is guaranteed that your point will at least be a border point of the assigned cluster according to the definitions used for the clustering. (Be aware that your point might be deemed noise and not assigned to a cluster)

I've done a quick implementation:

import numpy as np
import scipy as sp

def dbscan_predict(dbscan_model, X_new, metric=sp.spatial.distance.cosine):
    # Result is noise by default
    y_new = np.ones(shape=len(X_new), dtype=int)*-1 

    # Iterate all input samples for a label
    for j, x_new in enumerate(X_new):
        # Find a core sample closer than EPS
        for i, x_core in enumerate(dbscan_model.components_): 
            if metric(x_new, x_core) < dbscan_model.eps:
                # Assign label of x_core to x_new
                y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                break

    return y_new

The labels obtained by clustering (dbscan_model = DBSCAN(...).fit(X) and the labels obtained from the same model on the same data (dbscan_predict(dbscan_model, X)) sometimes differ. I'm not quite certain if this is a bug somewhere or a result of randomness.

EDIT: I Think the above problem of differing prediction outcomes could stem from the possibility that a border point can be close to multiple clusters. Please update if you test this and find an answer. Ambiguity might be solved by shuffling core points every time or by picking the closest instead of the first core point.

*) Case at hand: I'd like to evaluate if the clusters obtained from a subset of my data makes sense for other subset or is simply a special case. If it generalises it supports the validity of the clusters and the earlier steps of pre-processing applied.