I am struggling to figure out how to develop a square matrix given a format like
a a 0
a b 3
a c 4
a d 12
b a 3
b b 0
b c 2
...
To something like:
a b c d e
a 0 3 4 12 ...
b 3 0 2 7 ...
c 4 3 0 .. .
d 12 ...
e . ..
in pandas. I developed a method which I thinks works but takes forever to run because it has to iterate through each column and row for every value starting from the beginning each time using for loops. I feel like I'm definitely reinventing the wheel here. This also isnt realistic for my dataset given how many columns and rows there are. Is there something similar to R's cast function in python which can do this significantly faster?
You could use df.pivot
:
import pandas as pd
df = pd.DataFrame([['a', 'a', 0],
['a', 'b', 3],
['a', 'c', 4],
['a', 'd', 12],
['b', 'a', 3],
['b', 'b', 0],
['b', 'c', 2]], columns=['X','Y','Z'])
print(df.pivot(index='X', columns='Y', values='Z'))
yields
Y a b c d
X
a 0.0 3.0 4.0 12.0
b 3.0 0.0 2.0 NaN
Here, index='X'
tells df.pivot
to use the column labeled 'X'
as the index, and columns='Y'
tells it to use the column labeled 'Y'
as the column index.
See the docs for more on pivot
and other reshaping methods.
Alternatively, you could use pd.crosstab
:
print(pd.crosstab(index=df.iloc[:,0], columns=df.iloc[:,1],
values=df.iloc[:,2], aggfunc='sum'))
Unlike df.pivot
which expects each (a1, a2)
pair to be unique, pd.crosstab
(with agfunc='sum'
) will aggregate duplicate pairs by summing the associated
values. Although there are no duplicate pairs in your posted example, specifying
how duplicates are supposed to be aggregated is required when the values
parameter is used.
Also, whereas df.pivot
is passed column labels, pd.crosstab
is passed
array-likes (such as whole columns of df
). df.iloc[:, i]
is the i
th column
of df
.