I use pandas.cut() to discretise a continuous variable into a range, and then group by the result.
After a lot of swearing because I couldn't figure out what was wrong, I have learnt that, if I don't supply custom labels to the cut() function, but rely on the default, then the output cannot be exported to excel. If I try this:
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('test.xlsx')
wk = writer.book.add_worksheet('Test')
df= df= pd.DataFrame(np.random.randint(1,10,(10000,5)), columns=['a','b','c','d','e'])
df['range'] = pd.cut( df['a'],[-np.inf,3,8,np.inf] )
grouped=df.groupby('range').sum()
grouped.to_excel(writer, 'Export')
writer.close()
I get:
raise TypeError("Unsupported type %s in write()" % type(token))
TypeError: Unsupported type <class 'pandas._libs.interval.Interval'> in write()
which it took me a while to decypher.
If instead I do assign labels:
df['range'] = pd.cut( df['a'],[-np.inf,3,8,np.inf], labels =['<3','3-8','>8'] )
then it all runs fine. Any suggestions on how to handle this without assigning custom labels? In the initial phase of my work I tend not to assign labels, because I still don't know how many bins I want - it's a trial and error approach, and assigning labels at each attempt would be time-consuming.
I am not sure if this can count as a bug, but at the very least it seems like a poorly documented annoyance!
Use astype(str)
:
writer = pd.ExcelWriter('test.xlsx')
wk = writer.book.add_worksheet('Test')
df= df= pd.DataFrame(np.random.randint(1,10,(10000,5)), columns=['a','b','c','d','e'])
df['range'] = pd.cut( df['a'],[-np.inf,3,8,np.inf] ).astype(str)
grouped=df.groupby('range').sum()
grouped.to_excel(writer, 'Export')
writer.close()
Output in excel:
range a b c d e
(-inf, 3.0] 6798 17277 16979 17266 16949
(3.0, 8.0] 33150 28051 27551 27692 27719
(8.0, inf] 9513 5153 5318 5106 5412