I have a custom distance metric that I need to use for KNN (K Nearest Neighbors).
I tried following this, but I cannot get it to work for some reason.
I would assume that the distance metric is supposed to take two vectors/arrays of the same length, as I have written below:
import sklearn
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def d(a,b,L):
    # Inputs: a and b are rows from a data matrix
    return a+b+2+L

knn=NearestNeighbors(n_neighbors=1,
                     algorithm='auto',
                     metric='pyfunc',
                     func=lambda a,b: d(a,b,L)
                     )
X=pd.DataFrame({'b':[0,3,2],'c':[1.0,4.3,2.2]})
knn.fit(X)
However, when I call knn.kneighbors(), it doesn't seem to like the custom function. Here is the bottom of the error stack:
ValueError: Unknown metric pyfunc. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski'], or 'precomputed', or a callable
However, I see exactly the same usage in the question I cited. Any ideas on how to make this work on sklearn version 0.14? I'm not aware of any differences between the versions.
Thanks.
The documentation is actually pretty clear on the use of the metric argument:
metric : string or callable, default ‘minkowski’
metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.
If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.
Thus (as the error message also says), metric should be a callable, not a string. It should accept two arguments (arrays) and return one value. That callable is your lambda function.
Thus, your code can be simplified to:
import sklearn
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def d(a,b,L):
    return a+b+2+L

knn=NearestNeighbors(n_neighbors=1,
                     algorithm='auto',
                     metric=lambda a,b: d(a,b,L)
                     )
X=pd.DataFrame({'b':[0,3,2],'c':[1.0,4.3,2.2]})
knn.fit(X)
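Two caveats about the snippet above: L is never defined, and a+b+2+L is an elementwise sum, so d returns an array rather than the single value the documentation asks for. The following is only a minimal sketch of a callable metric that satisfies that contract. It assumes a scikit-learn version whose metric argument accepts a callable, and it swaps in a scaled Manhattan distance as a stand-in. The function name scaled_manhattan, the value L = 2.0, and the query point are made up for illustration. The data is the same as in X, written as a plain NumPy array.

```python
from functools import partial

import numpy as np
from sklearn.neighbors import NearestNeighbors


def scaled_manhattan(a, b, L):
    # a and b are 1-D arrays (one row each); the result is a single float.
    # Hypothetical stand-in for the real metric: Manhattan distance times L.
    return float(L * np.sum(np.abs(a - b)))


L = 2.0  # illustrative value; the extra parameter must exist before fitting

# partial() binds L, leaving a two-argument callable as the docs require.
knn = NearestNeighbors(n_neighbors=1,
                       algorithm='brute',
                       metric=partial(scaled_manhattan, L=L))

# Same values as the DataFrame in the question (columns 'b' and 'c').
X = np.array([[0, 1.0],
              [3, 4.3],
              [2, 2.2]])
knn.fit(X)

# Query one made-up point; kneighbors returns (distances, indices).
distances, indices = knn.kneighbors(np.array([[2.5, 3.0]]))
print(distances, indices)
```

Binding L with functools.partial (or a lambda, as in the answer) keeps the callable at two arguments. algorithm='brute' is used here only to keep the sketch simple; it is not required for callable metrics.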