I have a pandas DataFrame where each observation has a date (stored in a column of dtype datetime64[ns]). These dates are spread over a period of about 5 years. I would like to draw a kernel-density plot of the dates of all the observations, with the years labelled on the x-axis.
I have figured out how to create a time-delta relative to some reference date and then create a density plot of the number of hours/days/years between each observation and the reference date:
df['relativeDate'].astype('timedelta64[D]').plot(kind='kde')
But this isn't exactly what I want: if I convert to year-deltas, the x-axis is right but I lose the within-year variation, and if I use a smaller unit of time such as hours or days, the x-axis labels are much harder to interpret.
What's the simplest way to make this work in Pandas?
Inspired by @JohnE's answer, an alternative way to convert the dates to numeric values is to use .toordinal().
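As a quick illustration of what the ordinal represents, here is a minimal sketch using only the standard library. It assumes nothing beyond `datetime`; the variable names are just for illustration. The ordinal is a whole-day count, so the conversion round-trips for dates but discards any time-of-day component.

```python
import datetime

# A date maps to an integer day count (proleptic Gregorian ordinal).
d = datetime.date(2010, 1, 13)
n = d.toordinal()

# fromordinal() is the inverse for plain dates.
back = datetime.date.fromordinal(n)

# A datetime with a time component maps to the same ordinal as its date,
# so sub-day information is lost in the conversion.
n_with_time = datetime.datetime(2010, 1, 13, 18, 30).toordinal()

print(n, back, n_with_time == n)
```

Day resolution is usually enough for a density estimate over several years, but if the time of day matters a finer numeric representation would be needed.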
import datetime  # needed for datetime.datetime.fromordinal() below
import numpy as np
import pandas as pd
# simulate some artificial data
# ===============================
np.random.seed(0)
dates = pd.date_range('2010-01-01', periods=31, freq='D')
df = pd.DataFrame(np.random.choice(dates,100), columns=['dates'])
# use toordinal() to get datenum
df['ordinal'] = [x.toordinal() for x in df.dates]
print(df)
        dates  ordinal
0  2010-01-13   733785
1  2010-01-16   733788
2  2010-01-22   733794
3  2010-01-01   733773
4  2010-01-04   733776
5  2010-01-28   733800
6  2010-01-04   733776
7  2010-01-08   733780
8  2010-01-10   733782
9  2010-01-20   733792
..        ...      ...
90 2010-01-19   733791
91 2010-01-28   733800
92 2010-01-01   733773
93 2010-01-15   733787
94 2010-01-04   733776
95 2010-01-22   733794
96 2010-01-13   733785
97 2010-01-26   733798
98 2010-01-11   733783
99 2010-01-21   733793
[100 rows x 2 columns]
# plot non-parametric kde on numeric datenum
ax = df['ordinal'].plot(kind='kde')
# rename the xticks with labels
x_ticks = ax.get_xticks()
ax.set_xticks(x_ticks[::2])
xlabels = [datetime.datetime.fromordinal(int(x)).strftime('%Y-%m-%d') for x in x_ticks[::2]]
ax.set_xticklabels(xlabels)
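For data spanning several years, as in the question, it may be more readable to put one tick on each 1 January and label it with the year only. The following is a minimal sketch of that idea; `year_ticks` is a hypothetical helper name introduced here, not part of pandas or matplotlib, and it assumes the x-axis is in ordinal days as above.

```python
import datetime


def year_ticks(min_ordinal, max_ordinal):
    """Return (positions, labels) for every 1 January within the ordinal range.

    Hypothetical helper: positions are ordinal day numbers, labels are
    year strings. The bounds may be floats (e.g. from ax.get_xlim()).
    """
    first_year = datetime.date.fromordinal(int(min_ordinal)).year
    last_year = datetime.date.fromordinal(int(max_ordinal)).year
    positions, labels = [], []
    for year in range(first_year, last_year + 1):
        pos = datetime.date(year, 1, 1).toordinal()
        # Keep only year starts that fall inside the plotted range.
        if min_ordinal <= pos <= max_ordinal:
            positions.append(pos)
            labels.append(str(year))
    return positions, labels


# Example: a range from mid-2010 to early 2015.
start = datetime.date(2010, 6, 1).toordinal()
end = datetime.date(2015, 3, 1).toordinal()
positions, labels = year_ticks(start, end)
print(positions, labels)
```

With a plot already drawn, this could be applied as `positions, labels = year_ticks(*ax.get_xlim())`, followed by `ax.set_xticks(positions)` and `ax.set_xticklabels(labels)`. The density is still estimated at day resolution, so the within-year variation is kept while the axis shows only years.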