How to allow sklearn K Nearest Neighbors to take custom distance metric?

makansij · Dec 22, 2015 · Viewed 7.4k times

I have a custom distance metric that I need to use for KNN, K Nearest Neighbors.

I tried following this, but I cannot get it to work for some reason.

I would assume that the distance metric is supposed to take two vectors/arrays of the same length, as I have written below:

import sklearn 
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def d(a,b,L):
    # Inputs: a and b are rows from a data matrix   
    return a+b+2+L

knn=NearestNeighbors(n_neighbors=1,
                 algorithm='auto',
                 metric='pyfunc',
                 func=lambda a,b: d(a,b,L)
                 )


X=pd.DataFrame({'b':[0,3,2],'c':[1.0,4.3,2.2]})
knn.fit(X)

However, when I call: knn.kneighbors(), it doesn't seem to like the custom function. Here is the bottom of the error stack:

ValueError: Unknown metric pyfunc. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski'], or 'precomputed', or a callable

However, I see the exact same usage in the question I cited. Any ideas on how to make this work on sklearn version 0.14? I'm not aware of any differences between the versions.

Thanks.

Answer

user707650 · Dec 22, 2015

The documentation is actually pretty clear on the use of the metric argument:

metric : string or callable, default ‘minkowski’

metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.

Thus (as the error message also says), metric should be a callable, not a string. It should accept two arguments (arrays) and return one value. That callable is your lambda function.
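To make the "two arrays in, one value out" contract concrete, here is a minimal sketch. The metric weighted_manhattan and its weights are hypothetical stand-ins, not part of the question, and algorithm='brute' is chosen only to keep the example simple.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_manhattan(a, b):
    # a and b are 1-D arrays (one row each); the return value is a single float.
    weights = np.array([1.0, 2.0])  # hypothetical per-column weights
    return float(np.sum(weights * np.abs(a - b)))

X = np.array([[0.0, 1.0],
              [3.0, 4.3],
              [2.0, 2.2]])

# Pass the function object itself as the metric, not a string name.
knn = NearestNeighbors(n_neighbors=1, algorithm="brute", metric=weighted_manhattan)
knn.fit(X)

# Nearest stored row to a single query point.
distances, indices = knn.kneighbors(np.array([[2.1, 2.0]]))
print(indices, distances)
```

The key point is that the function returns a scalar; a function that returns an array (for example a+b) does not satisfy this contract.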

Thus, your code can be simplified to:

import sklearn
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def d(a,b,L):
    return a+b+2+L

knn=NearestNeighbors(n_neighbors=1,
                 algorithm='auto',
                 metric=lambda a,b: d(a,b,L)
                 )
X=pd.DataFrame({'b':[0,3,2],'c':[1.0,4.3,2.2]})
knn.fit(X)
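
Note that in the snippet above L is never defined, and d returns an array rather than a single number, so kneighbors() would still fail. One possible way to get a runnable version is sketched below. It assumes a stand-in metric (Euclidean distance scaled by L) and an arbitrary value for L, neither of which comes from the question, and it uses functools.partial in place of the lambda to bind the extra parameter.

```python
from functools import partial

import numpy as np
from sklearn.neighbors import NearestNeighbors

def d(a, b, L):
    # Stand-in metric (an assumption): Euclidean distance scaled by L.
    # It reduces the two rows to one number, as the metric contract requires.
    return float(L * np.sqrt(np.sum((a - b) ** 2)))

L = 2.0  # hypothetical value; the question never defines L

X = np.array([[0.0, 1.0],
              [3.0, 4.3],
              [2.0, 2.2]])

# partial(d, L=L) is a callable of two arguments, like lambda a, b: d(a, b, L).
knn = NearestNeighbors(n_neighbors=1, algorithm="brute", metric=partial(d, L=L))
knn.fit(X)

distances, indices = knn.kneighbors(np.array([[3.1, 4.0]]))
print(indices, distances)
```

The lambda from the answer works equally well once L is defined; partial just makes the bound parameter explicit.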