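In Python, what is the best way to make a series of scatter plots from a pandas DataFrame using matplotlib?
For example, if I have a DataFrame df with some columns of interest, I usually end up converting everything to arrays: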
import matplotlib.pyplot as plt
# df is a DataFrame: fetch col1 and col2
# and drop na rows if any of the columns are NA
mydata = df[["col1", "col2"]].dropna(how="any")
# Now plot with matplotlib
vals = mydata.values
plt.scatter(vals[:, 0], vals[:, 1])
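The problem with converting everything to an array before plotting is that it forces you to break out of the DataFrame.
Consider these two use cases where having the full DataFrame is essential to plotting:
For example, what if you now wanted to look up the col3 value of every point drawn in the call to scatter, and color (or size) each point by that value? You would have to go back, pull out the rows where col1 and col2 are not NA, and look up their corresponding col3 values.
Is there a way to plot while preserving the DataFrame? For example: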
mydata = df.dropna(how="any", subset=["col1", "col2"])
# plot a scatter of col1 by col2, with sizes according to col3
scatter(mydata[["col1", "col2"]], s=mydata["col3"])
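For reference, here is a minimal sketch of something close to this that already works, assuming plain matplotlib and made-up data (the values in df are purely illustrative). ax.scatter accepts pandas Series directly, so col3 stays aligned with col1 and col2 through the DataFrame's own rows:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data, purely for illustration
df = pd.DataFrame({
    "col1": [0.1, 0.4, np.nan, 0.8, 0.9],
    "col2": [0.2, np.nan, 0.5, 0.6, 0.7],
    "col3": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Drop rows only where col1 or col2 is missing; col3 stays aligned row by row
mydata = df.dropna(how="any", subset=["col1", "col2"])

fig, ax = plt.subplots()
# Columns are passed as Series; no conversion to a bare array is needed
points = ax.scatter(mydata["col1"], mydata["col2"], s=mydata["col3"])
ax.set_xlabel("col1")
ax.set_ylabel("col2")
```

This is not the scatter(mydata[[...]]) signature imagined above, only the nearest equivalent I am aware of.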
Similarly, imagine that you wanted to filter or color each point differently depending on the values of some of its columns. E.g. what if you wanted to automatically plot the labels of the points that meet a certain cutoff on col1, col2 alongside them (where the labels are stored in another column of the df), or color these points differently, like people do with dataframes in R. For example:
mydata = df.dropna(how="any", subset=["col1", "col2"])
myscatter = scatter(mydata[["col1", "col2"]], s=1)
# Plot in red, with smaller size, all the points that
# have a col2 value greater than 0.5
myscatter.replot(mydata["col2"] > 0.5, color="red", s=0.5)
How can this be done?
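As a rough sketch of the behavior I am after, assuming a boolean mask is an acceptable stand-in for the hypothetical replot call, and with a made-up label column and made-up sizes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data; "label" is a hypothetical column holding the point labels
df = pd.DataFrame({
    "col1": [0.1, 0.4, 0.6, 0.8],
    "col2": [0.2, 0.7, 0.3, 0.9],
    "label": ["a", "b", "c", "d"],
})
mydata = df.dropna(how="any", subset=["col1", "col2"])

# Boolean mask that shares mydata's index
above = mydata["col2"] > 0.5

fig, ax = plt.subplots()
# Points at or below the cutoff, then points above it in red and smaller
ax.scatter(mydata.loc[~above, "col1"], mydata.loc[~above, "col2"], s=20, color="blue")
ax.scatter(mydata.loc[above, "col1"], mydata.loc[above, "col2"], s=10, color="red")

# Write each label next to the points that meet the cutoff
for row in mydata[above].itertuples():
    ax.annotate(row.label, (row.col1, row.col2))
```

This works, but it means one scatter call per subset, which is what prompts the follow-up below.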
EDIT Reply to crewbum:
You say that the best way is to plot each condition (like subset_a, subset_b) separately. What if you have many conditions, e.g. you want to split up the scatters into 4 types of points or even more, plotting each in a different shape/color? How can you elegantly apply condition a, b, c, etc. and make sure you then plot "the rest" (things not in any of these conditions) as the last step?
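To make the question concrete, here is one possible sketch, not necessarily the most elegant one. It assumes the conditions are kept in a dict of boolean masks and that a point belongs to the first condition it meets; the names conditions, styles, and claimed are my own, and the data is made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data, purely for illustration
mydata = pd.DataFrame({
    "col1": [0.1, 0.2, 0.5, 0.6, 0.9, 0.4],
    "col2": [0.1, 0.9, 0.5, 0.8, 0.2, 0.4],
})

# Hypothetical conditions a, b, c and the style used for each
conditions = {
    "a": mydata["col1"] < 0.3,
    "b": mydata["col2"] > 0.7,
    "c": mydata["col1"].between(0.45, 0.6),
}
styles = {"a": ("red", "o"), "b": ("green", "^"), "c": ("blue", "s")}

# Tracks which rows have already been drawn by an earlier condition
claimed = pd.Series(False, index=mydata.index)

fig, ax = plt.subplots()
for name, mask in conditions.items():
    mask = mask & ~claimed  # a point is drawn only for the first condition it meets
    color, marker = styles[name]
    subset = mydata[mask]
    ax.scatter(subset["col1"], subset["col2"], color=color, marker=marker, label=name)
    claimed |= mask

# "The rest": rows that matched none of the conditions
rest = mydata[~claimed]
ax.scatter(rest["col1"], rest["col2"], color="gray", marker="x", label="rest")
ax.legend()
```

The claimed mask is what guarantees that the last step draws exactly the points no condition picked up, but it still feels like bookkeeping that a plotting call should handle.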
Similarly, in your example where you plot col1,col2 differently based on col3, what if there are NA values that break the association between col1,col2,col3? For example, if you want to plot all col2 values based on their col3 values, but some rows have an NA value in either col1 or col3, forcing you to use dropna first. So you would do:
mydata = df.dropna(how="any", subset=["col1", "col2", "col3"])
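Then you can plot using mydata as you show, drawing the scatter of col1,col2 using the values of col3. But mydata will be missing some points that have values for col1,col2 but are NA for col3, and those still have to be plotted... so how would you plot "the rest" of the data, i.e. the points that are not in the filtered set mydata?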
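A sketch of what I mean by "the rest", assuming it is acceptable to drop NA on col1,col2 first and split on col3 afterwards; the names plottable, has_col3, and rest are my own, and the data is made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data, purely for illustration
df = pd.DataFrame({
    "col1": [0.1, 0.4, 0.6, np.nan, 0.9],
    "col2": [0.2, 0.7, 0.3, 0.5, 0.8],
    "col3": [1.0, np.nan, 3.0, 4.0, np.nan],
})

# Rows that can be placed on the axes at all
plottable = df.dropna(how="any", subset=["col1", "col2"])

has_col3 = plottable["col3"].notna()
mydata = plottable[has_col3]  # the filtered set from the question
rest = plottable[~has_col3]   # col1 and col2 present, col3 missing

fig, ax = plt.subplots()
# Color by col3 where it exists, and draw the remaining points in a neutral color
colored = ax.scatter(mydata["col1"], mydata["col2"], c=mydata["col3"], cmap="viridis")
ax.scatter(rest["col1"], rest["col2"], color="lightgray", label="col3 is NA")
fig.colorbar(colored, ax=ax, label="col3")
ax.legend()

# Alternative: recover the rest from an existing mydata through the index
rest_by_index = plottable.loc[plottable.index.difference(mydata.index)]
```

Both routes give the same rows here, but I would like to know whether there is a more direct way.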