I've been testing how well PCA and LDA work for classifying 3 different types of image tags I want to identify automatically. In my code, X is my data matrix, where each row contains the pixels from one image, and y is a 1D array giving the class of each row.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
plt.figure(figsize=(35, 20))
plt.scatter(X_r[:, 0], X_r[:, 1], c=y, s=200)
lda = LDA(n_components=2)
X_lda = lda.fit(X, y).transform(X)
plt.figure(figsize=(35, 20))
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, s=200)
With the LDA, I end up with 3 clearly distinguishable clusters with only slight overlap between them. Now, if I have a new image I want to classify, once I turn it into a 1D array, how do I predict which cluster it falls into, and if it falls too far from the centre, how can I say that the classification is "inconclusive"? I was also curious what the .transform(X) function does to my data once I have fit the model.
After you have trained your LDA model with some data X, you may want to project some other data, Z. In that case, what you should do is:
lda = LDA(n_components=2)        # create an LDA object
lda = lda.fit(X, y)              # learn the projection matrix
X_lda = lda.transform(X)         # use the model to project X
# .... getting Z as test data ....
Z_lda = lda.transform(Z)         # use the model to project Z
z_labels = lda.predict(Z)        # the predicted label for each sample
z_prob = lda.predict_proba(Z)    # the probability of each sample belonging to each class
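To answer the "inconclusive" part of the question: you can threshold the output of predict_proba, and treat any sample whose highest class probability falls below a cutoff as inconclusive. A minimal sketch, using synthetic data from make_classification as a stand-in for your image matrix, and a hypothetical cutoff of 0.8 that you would tune for your own data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Stand-in for your image data: 300 samples, 3 classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

lda = LDA(n_components=2).fit(X, y)

Z = X[:10]                       # pretend these are new images
probs = lda.predict_proba(Z)     # shape (n_samples, n_classes)
labels = lda.predict(Z)

THRESHOLD = 0.8                  # hypothetical cutoff; tune for your data
confident = probs.max(axis=1) >= THRESHOLD
results = [lbl if ok else "inconclusive"
           for lbl, ok in zip(labels, confident)]
```

Each entry of results is then either a predicted class label or the string "inconclusive".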
Note that 'fit' is used for fitting the model, not fitting the data.
So transform is used to build the representation (the projection, in this case), and predict is used to predict the label of each sample. (This applies to ALL classes that inherit from BaseEstimator in sklearn.) You can read the documentation for further options and properties.
Also, sklearn's API allows you to do pca.fit_transform(X) instead of pca.fit(X).transform(X). Use this version when you are not interested in the model itself after that point in the code.
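A quick sketch of the equivalence, again on synthetic stand-in data: the two spellings produce the same projection up to floating-point noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=100, n_features=10, random_state=0)

X_a = PCA(n_components=2).fit(X).transform(X)   # fit, then transform
X_b = PCA(n_components=2).fit_transform(X)      # one call, same result

assert np.allclose(X_a, X_b)
```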
A few comments: since PCA is an unsupervised approach while LDA uses the labels, LDA is better suited for the kind of "visual" classification you are currently doing. Moreover, if you are interested in classification, you may consider using different types of classifiers, not necessarily LDA, although it is a great approach for visualization.
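As one example of trying a different classifier, here is a minimal sketch that cross-validates a LogisticRegression on synthetic stand-in data (the data and parameters are assumptions; substitute your own X and y):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for your image data: 300 samples, 3 classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)   # one accuracy score per fold
```

Comparing such cross-validated scores across a few classifier types is a more reliable basis for choosing a model than eyeballing the 2D projections.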