I am trying to use the mca package to do multiple correspondence analysis in Python.
I am a bit confused about how to use it. With PCA, I would expect to fit some data (i.e. find the principal components for those data) and then use those components later to transform unseen data.
Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what any of the cryptically named properties and methods do (e.g. .E, .L, .K, .k, etc.).
So far, if I have a DataFrame with a single column containing strings, I would do something like
import pandas as pd
import mca
ca = mca.MCA(pd.get_dummies(df, drop_first=True))
From what I can gather, ca.fs_r(1) is the transformation of the data in df, and ca.L is supposed to be the eigenvalues (although I get a vector of 1s that is one element shorter than my number of features).
Now, if I had some more data with the same features, say df_new, and assuming I have already converted it correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data?
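On the "converted correctly to dummy variables" assumption: one way to make sure df_new ends up with exactly the same dummy columns as df is to reindex against the training columns. This is only a sketch with made-up data; the column name color and its values are hypothetical, and it uses the full indicator matrix rather than drop_first=True.

```python
import pandas as pd

# Hypothetical training and new data with one categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
df_new = pd.DataFrame({"color": ["green", "purple"]})

# Encode the training data and remember its column layout.
train_dummies = pd.get_dummies(df, dtype=int)

# Encode the new data, then force the same columns in the same order.
# Categories unseen during training ("purple") are dropped, and
# categories missing from df_new are filled with 0.
new_dummies = pd.get_dummies(df_new, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(new_dummies)
```

Note that a row whose category was never seen in training becomes all zeros, which has to be handled separately before projecting it.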
Another option is the prince library, which makes tools such as MCA easy to use.
You can begin by installing it with:
pip install --user prince
Using MCA is fairly simple and takes only a couple of steps (just like sklearn's PCA). We first build our dataframe.
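import pandas as pd
import prince
# Note: read_csv treats the first line of the file as a header unless header=None is passed.
X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
mca = prince.MCA()  # create the estimator; it is fitted in the next step
print(X.head())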
# outputs
>> Color Size Action Age Inflated
0 YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH CHILD F
2 YELLOW SMALL DIP ADULT F
3 YELLOW SMALL DIP CHILD F
4 YELLOW LARGE STRETCH ADULT T
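Then call the fit and transform methods. Keep the fitted estimator in mca and store the result of transform under a different name, so the estimator can still be used afterwards.
mca = mca.fit(X)  # learn the components from X
coords = mca.transform(X)  # row coordinates, roughly the counterpart of ca.fs_r(1); pass another DataFrame with the same columns (e.g. df_new) to get the counterpart of ca.fs_r_sup(df_new)
print(coords)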
# outputs
>> 0 1
0 0.705387 8.373126e-15
1 -0.386586 8.336230e-15
2 -0.386586 6.335675e-15
3 -0.852014 6.726393e-15
4 0.783539 -6.333333e-01
5 0.783539 -6.333333e-01
6 -0.308434 -6.333333e-01
7 -0.308434 -6.333333e-01
8 -0.773862 -6.333333e-01
9 0.783539 6.333333e-01
10 0.783539 6.333333e-01
11 -0.308434 6.333333e-01
12 -0.308434 6.333333e-01
13 -0.773862 6.333333e-01
14 0.861691 -5.893240e-15
15 0.861691 -5.893240e-15
16 -0.230282 -5.930136e-15
17 -0.230282 -7.930691e-15
18 -0.695710 -7.539973e-15
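For intuition about what fit and transform compute, here is a minimal NumPy sketch of plain correspondence analysis on an indicator matrix, including the projection of supplementary rows. It is not the internals of mca or prince (neither library's corrections or options are reproduced), and the names fit_ca, transform_ca and the toy matrix Z are hypothetical.

```python
import numpy as np

def fit_ca(Z):
    """Plain correspondence analysis of an indicator (or count) matrix via SVD."""
    Z = np.asarray(Z, dtype=float)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    keep = s > 1e-10                     # drop trivial (zero) dimensions
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    return {
        "eigenvalues": s ** 2,                           # principal inertias
        "row_coords": (U / np.sqrt(r)[:, None]) * s,     # principal row coordinates
        "col_std": Vt.T / np.sqrt(c)[:, None],           # standard column coordinates
    }

def transform_ca(model, Z_new):
    """Project rows (seen or unseen) using the fitted column coordinates."""
    Z_new = np.asarray(Z_new, dtype=float)
    profiles = Z_new / Z_new.sum(axis=1, keepdims=True)  # row profiles
    return profiles @ model["col_std"]

# Toy indicator matrix: two categorical variables with two levels each.
Z = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
])
model = fit_ca(Z)
projected = transform_ca(model, Z[:2])   # treat the first two rows as "new" data
print(model["eigenvalues"])
print(projected)
```

Projecting the training rows through transform_ca reproduces their fitted coordinates, which is the property that makes the same formula usable for unseen rows with the same dummy columns.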
You can also plot the coordinates, since prince builds on the matplotlib library.
ax = mca.plot_coordinates(
    X=X,
    ax=None,
    figsize=(6, 6),
    show_row_points=True,
    row_points_size=10,
    show_row_labels=False,
    show_column_points=True,
    column_points_size=30,
    show_column_labels=False,
    legend_n_cols=1
)
ax.get_figure().savefig('images/mca_coordinates.svg')