How to create a square dataframe/matrix given 3 columns - Python

WolVes picture WolVes · Dec 6, 2017 · Viewed 9.7k times · Source

I am struggling to figure out how to develop a square matrix given a format like

a a 0
a b 3
a c 4
a d 12
b a 3 
b b 0
b c 2
...

To something like:

  a b c d e
a 0 3 4 12 ... 
b 3 0 2 7 ... 
c 4 3 0 .. .
d 12 ...  
e . .. 

in pandas. I developed a method which I thinks works but takes forever to run because it has to iterate through each column and row for every value starting from the beginning each time using for loops. I feel like I'm definitely reinventing the wheel here. This also isnt realistic for my dataset given how many columns and rows there are. Is there something similar to R's cast function in python which can do this significantly faster?

Answer

unutbu picture unutbu · Dec 6, 2017

You could use df.pivot:

import pandas as pd

df = pd.DataFrame([['a', 'a', 0],
                   ['a', 'b', 3],
                   ['a', 'c', 4],
                   ['a', 'd', 12],
                   ['b', 'a', 3],
                   ['b', 'b', 0],
                   ['b', 'c', 2]], columns=['X','Y','Z'])

print(df.pivot(index='X', columns='Y', values='Z'))

yields

Y    a    b    c     d
X                     
a  0.0  3.0  4.0  12.0
b  3.0  0.0  2.0   NaN

Here, index='X' tells df.pivot to use the column labeled 'X' as the index, and columns='Y' tells it to use the column labeled 'Y' as the column index.

See the docs for more on pivot and other reshaping methods.


Alternatively, you could use pd.crosstab:

print(pd.crosstab(index=df.iloc[:,0], columns=df.iloc[:,1], 
                  values=df.iloc[:,2], aggfunc='sum'))

Unlike df.pivot which expects each (a1, a2) pair to be unique, pd.crosstab (with agfunc='sum') will aggregate duplicate pairs by summing the associated values. Although there are no duplicate pairs in your posted example, specifying how duplicates are supposed to be aggregated is required when the values parameter is used.

Also, whereas df.pivot is passed column labels, pd.crosstab is passed array-likes (such as whole columns of df). df.iloc[:, i] is the ith column of df.