What is the best way to make a series of scatter plots using matplotlib from a pandas dataframe in Python?
For example, if I have a dataframe df that has some columns of interest, I typically find myself converting everything to arrays:
import matplotlib.pyplot as plt
# df is a DataFrame: fetch col1 and col2
# and drop na rows if any of the columns are NA
mydata = df[["col1", "col2"]].dropna(how="any")
# Now plot with matplotlib
vals = mydata.values
plt.scatter(vals[:, 0], vals[:, 1])
The problem with converting everything to arrays before plotting is that it forces you to break out of dataframes.
Consider these two use cases where having the full dataframe is essential to plotting:
For example, what if you now wanted to look at the values of col3 for the points that you plotted in the call to scatter, and color (or size) each point by that value? You'd have to go back, pull out the rows where col1 and col2 are not NA, and look up their corresponding col3 values.
Is there a way to plot while preserving the dataframe? For example:
mydata = df.dropna(how="any", subset=["col1", "col2"])
# plot a scatter of col1 by col2, with sizes according to col3
scatter(mydata[["col1", "col2"]], s=mydata["col3"])
Similarly, imagine that you wanted to filter or color each point differently depending on the values of some of its columns. E.g. what if you wanted to automatically plot the labels of the points that meet a certain cutoff on col1, col2 alongside them (where the labels are stored in another column of the df), or color these points differently, like people do with dataframes in R? For example:
mydata = df.dropna(how="any", subset=["col1", "col2"])
myscatter = scatter(mydata[["col1", "col2"]], s=1)
# Plot in red, with smaller size, all the points that
# have a col2 value greater than 0.5
myscatter.replot(mydata["col2"] > 0.5, color="red", s=0.5)
How can this be done?
EDIT Reply to crewbum:
You say that the best way is to plot each condition (like subset_a, subset_b) separately. What if you have many conditions, e.g. you want to split up the scatters into 4 types of points or even more, plotting each in a different shape/color? How can you elegantly apply conditions a, b, c, etc. and make sure you then plot "the rest" (things not in any of these conditions) as the last step?
Similarly, in your example where you plot col1, col2 differently based on col3, what if there are NA values that break the association between col1, col2 and col3? For example, if you want to plot all col2 values based on their col3 values, but some rows have an NA value in either col1 or col3, you are forced to use dropna first. So you would do:
mydata = df.dropna(how="any", subset=["col1", "col2", "col3"])
Then you can plot using mydata like you show, plotting the scatter between col1, col2 using the values of col3. But mydata will be missing some points that have values for col1, col2 but are NA for col3, and those still have to be plotted. So how would you plot "the rest" of the data, i.e. the points that are not in the filtered set mydata?
Try passing columns of the DataFrame directly to matplotlib, as in the examples below, instead of extracting them as numpy arrays.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2), columns=['col1', 'col2'])
df['col3'] = np.arange(len(df))**2 * 100 + 100
In [5]: df
Out[5]:
       col1      col2  col3
0 -1.000075 -0.759910   100
1  0.510382  0.972615   200
2  1.872067 -0.731010   500
3  0.131612  1.075142  1000
4  1.497820  0.237024  1700
# Vary the point size based on another column (col3)
plt.scatter(df.col1, df.col2, s=df.col3)
# OR (with pandas 0.13 and up)
df.plot(kind='scatter', x='col1', y='col2', s=df.col3)
# Vary the point color based on another column (col3)
colors = np.where(df.col3 > 300, 'r', 'k')
plt.scatter(df.col1, df.col2, s=120, c=colors)
# OR (with pandas 0.13 and up)
df.plot(kind='scatter', x='col1', y='col2', s=120, c=colors)
However, the easiest way I've found to create a scatter plot with a legend is to call plt.scatter once for each point type.
cond = df.col3 > 300
subset_a = df[cond].dropna()
subset_b = df[~cond].dropna()
plt.scatter(subset_a.col1, subset_a.col2, s=120, c='b', label='col3 > 300')
plt.scatter(subset_b.col1, subset_b.col2, s=60, c='r', label='col3 <= 300')
plt.legend()
From what I can tell, matplotlib simply skips points with NA x/y coordinates or NA style settings (e.g., color/size). To find points skipped due to NA, try the isnull method: df[df.col3.isnull()]
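To plot "the rest" explicitly rather than relying on matplotlib dropping NA points, one option is to split the plottable rows on whether the styling column is present. The following is only a minimal sketch: the data values, the gray "x" marker for rows without col3, and the names plottable, styled and rest are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: col3 is missing for two rows, col1 for one row.
df = pd.DataFrame({
    "col1": [0.1, 0.4, 0.5, np.nan, 0.9],
    "col2": [1.0, 0.8, 0.3, 0.2, 0.6],
    "col3": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# Keep only the rows that have both coordinates.
plottable = df.dropna(subset=["col1", "col2"])

# Split on whether the styling column (col3) is available.
has_size = plottable["col3"].notnull()
styled = plottable[has_size]
rest = plottable[~has_size]

fig, ax = plt.subplots()
# Points with a known col3 are sized by it.
ax.scatter(styled["col1"], styled["col2"], s=styled["col3"], c="b",
           label="col3 known")
# Points without col3 get a fixed size and a distinct marker.
ax.scatter(rest["col1"], rest["col2"], s=40, c="gray", marker="x",
           label="col3 missing")
ax.legend()
```

Because the split is done on the dataframe, the index of rest still points back at the original rows, so other columns remain available for them.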
To split a list of points into many types, take a look at numpy select, which is a vectorized if-then-else implementation and accepts an optional default value. For example:
df['subset'] = np.select([df.col3 < 150, df.col3 < 400, df.col3 < 600],
                         [0, 1, 2], -1)
for color, label in zip('bgrm', [0, 1, 2, -1]):
    subset = df[df.subset == label]
    plt.scatter(subset.col1, subset.col2, s=120, c=color, label=str(label))
plt.legend()
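A variation on the same idea, in case readable labels are preferred over integer codes: np.select takes the first condition that matches each row, and its default catches everything else, so "the rest" needs no condition of its own. This is a sketch under assumptions: the group names, the styles mapping and the sample values are hypothetical, and groupby is swapped in for the explicit boolean filter used above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data with the same col3 values as the first rows shown earlier.
df = pd.DataFrame({
    "col1": [0.0, 1.0, 2.0, 3.0, 4.0],
    "col2": [4.0, 3.0, 2.0, 1.0, 0.0],
    "col3": [100, 200, 500, 1000, 1700],
})

# Conditions are checked in order; the first match wins.
conditions = [df["col3"] < 150, df["col3"] < 400, df["col3"] < 600]
names = ["low", "medium", "high"]  # hypothetical group names

# Rows matching none of the conditions fall through to the default.
df["group"] = np.select(conditions, names, default="rest")

# Hypothetical mapping from group name to point color.
styles = {"low": "b", "medium": "g", "high": "r", "rest": "m"}

fig, ax = plt.subplots()
# One scatter call per group, so each group gets its own legend entry.
for name, subset in df.groupby("group"):
    ax.scatter(subset["col1"], subset["col2"], s=120, c=styles[name],
               label=name)
ax.legend()
```

Adding a fifth or sixth point type then only means appending a condition, a name and a style entry.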