Faster kNN algorithm in Python

Eoin Ó Coinnigh picture Eoin Ó Coinnigh · Aug 4, 2018 · Viewed 8.2k times · Source

I want to code my own kNN algorithm from scratch, the reason is that I need to weight the features. The problem is that my program is still really slow despite removing for loops and using built in numpy functionality.

Can anyone suggest a way to speed this up? I don't use np.sqrt for the L2 distance because it's unnecessary and actually slows it all up quite a bit.

class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights

    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):        
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.

        Returns a list of predictions for k-NN classifier
        """

        np.fromiter((self.__helper(qc) for qc in testing_data), float)  
        return self.predictions


    def __helper(self, qc):
        neighbours = np.fromiter((self.__weighted_euclidean(qc, x) for x in self.X_train), float)
        neighbours = np.array([neighbours]).T 
        indexes = np.array([range(len(self.X_train))]).T
        neighbours = np.append(indexes, neighbours, axis=1)

        # Sort by second column - distances
        neighbours = neighbours[neighbours[:,1].argsort()]  
        k_cases = neighbours[ :self.k]
        indexes = [x[0] for x in k_cases]

        y_answers = [self.y_train[int(x)] for x in indexes]
        answer = max(set(y_answers), key=y_answers.count)  # get most common value
        self.predictions.append(answer)


    def __weighted_euclidean(self, qc, other):
        """
        Custom weighted euclidean distance

        returns: floating point number
        """

        return np.sum( ((qc - other)**2) * self.weights )

Answer

jakevdp picture jakevdp · Aug 4, 2018

Scikit-learn uses a KD Tree or Ball Tree to compute nearest neighbors in O[N log(N)] time. Your algorithm is a direct approach that requires O[N^2] time, and also uses nested for-loops within Python generator expressions which will add significant computational overhead compared to optimized code.

If you'd like to compute weighted k-neighbors classification using a fast O[N log(N)] implementation, you can use sklearn.neighbors.KNeighborsClassifier with the weighted minkowski metric, setting p=2 (for euclidean distance) and setting w to your desired weights. For example:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)