Making matplotlib scatter plots from dataframes in Python's pandas

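What is the best way to make a series of scatter plots using matplotlib from a pandas dataframe in Python?

For example, if I have a dataframe df that has some columns of interest, I find myself typically converting everything to arrays: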

import matplotlib.pyplot as plt
# df is a DataFrame: fetch col1 and col2
# and drop na rows if any of the columns are NA
mydata = df[["col1", "col2"]].dropna(how="any")
# Now plot with matplotlib
vals = mydata.values
plt.scatter(vals[:, 0], vals[:, 1])

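The problem with converting everything to an array before plotting is that it forces you to break out of the dataframe.

Consider these two use cases where having the full dataframe is essential to plotting:

  1. For example, what if you now wanted to look up the col3 value for each point plotted in the call to scatter, and color (or size) each point by that value? You would have to go back, pull out the non-NA values of col1, col2, and check their corresponding col3 values.

    Is there a way to plot while preserving the dataframe? For example: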

    mydata = df.dropna(how="any", subset=["col1", "col2"])
    # plot a scatter of col1 by col2, with sizes according to col3
    scatter(mydata[["col1", "col2"]], s=mydata["col3"])
    
  2. Similarly, imagine that you wanted to filter or color each point differently depending on the values of some of its columns. E.g. what if you wanted to automatically plot the labels of the points that meet a certain cutoff on col1, col2 alongside them (where the labels are stored in another column of the df), or color these points differently, like people do with dataframes in R. For example:

    mydata = df.dropna(how="any", subset=["col1", "col2"])
    myscatter = scatter(mydata[["col1", "col2"]], s=1)
    # Plot in red, with smaller size, all the points that
    # have a col2 value greater than 0.5
    myscatter.replot(mydata["col2"] > 0.5, color="red", s=0.5)
    

How can this be done?

EDIT Reply to crewbum:

You say that the best way is to plot each condition (like subset_a, subset_b) separately. What if you have many conditions, e.g. you want to split up the scatters into 4 types of points or even more, plotting each in different shape/color. How can you elegantly apply condition a, b, c, etc. and make sure you then plot "the rest" (things not in any of these conditions) as the last step?

Similarly in your example where you plot col1,col2 differently based on col3, what if there are NA values that break the association between col1,col2,col3? For example if you want to plot all col2 values based on their col3 values, but some rows have an NA value in either col1 or col3, forcing you to use dropna first. So you would do:

mydata = df.dropna(how="any", subset=["col1", "col2", "col3"])

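and then you can plot using mydata as you show, plotting the scatter between col1, col2 using the values of col3. But mydata will be missing some points that have values for col1, col2 but are NA for col3, and those still have to be plotted. So how would you plot "the rest" of the data, i.e. the points that are not in the filtered set mydata?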


Try passing columns of the DataFrame directly to matplotlib, as in the examples below, instead of extracting them as numpy arrays.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,2), columns=['col1','col2'])
df['col3'] = np.arange(len(df))**2 * 100 + 100


In [5]: df
Out[5]:
       col1      col2  col3
0 -1.000075 -0.759910   100
1  0.510382  0.972615   200
2  1.872067 -0.731010   500
3  0.131612  1.075142  1000
4  1.497820  0.237024  1700

Vary scatter point size based on another column

plt.scatter(df.col1, df.col2, s=df.col3)
# OR (with pandas 0.13 and up)
df.plot(kind='scatter', x='col1', y='col2', s=df.col3)


Vary scatter point color based on another column

colors = np.where(df.col3 > 300, 'r', 'k')
plt.scatter(df.col1, df.col2, s=120, c=colors)
# OR (with pandas 0.13 and up)
df.plot(kind='scatter', x='col1', y='col2', s=120, c=colors)
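
If col3 is numeric and you want a continuous color scale rather than two fixed colors, one option is to pass the column itself as c together with a colormap. This is only a minimal sketch: the seeded sample data and the choice of the "viridis" colormap are my own assumptions, not part of the example above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data shaped like the frame above (values are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

fig, ax = plt.subplots()
# c= takes the numeric column; cmap maps each value to a color.
points = ax.scatter(df["col1"], df["col2"], s=120, c=df["col3"], cmap="viridis")
# The colorbar shows which color corresponds to which col3 value.
fig.colorbar(points, ax=ax, label="col3")
```

Call plt.show() or fig.savefig(...) afterwards to see the result.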


Scatter plot with legend

However, the easiest way I've found to create a scatter plot with legend is to call plt.scatter once for each point type.

cond = df.col3 > 300
subset_a = df[cond].dropna()
subset_b = df[~cond].dropna()
plt.scatter(subset_a.col1, subset_a.col2, s=120, c='b', label='col3 > 300')
plt.scatter(subset_b.col1, subset_b.col2, s=60, c='r', label='col3 <= 300')
plt.legend()
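
The question also asks about printing labels next to the points that meet a cutoff. A rough sketch of how that might look with the same subset approach is below. The "name" column, the sample values and the 0.5 cutoff are all hypothetical and only there for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: "name" holds the text label for each point.
df = pd.DataFrame({
    "col1": [0.1, 0.7, 0.4, 0.9],
    "col2": [0.2, 0.8, 0.6, 0.3],
    "name": ["a", "b", "c", "d"],
})

fig, ax = plt.subplots()
# Draw every point in black first.
ax.scatter(df["col1"], df["col2"], s=60, c="k")

# Rows that meet the cutoff; the 0.5 threshold is an arbitrary example.
highlighted = df[df["col2"] > 0.5]
# Redraw the selected points in red on top of the black ones.
ax.scatter(highlighted["col1"], highlighted["col2"], s=60, c="r")
for row in highlighted.itertuples(index=False):
    # Offset the text slightly so it does not sit on top of the marker.
    ax.annotate(row.name, (row.col1, row.col2),
                xytext=(5, 5), textcoords="offset points")
```

Because the filter is just a boolean mask on the dataframe, the label column stays aligned with the coordinates without any manual bookkeeping.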


Update

From what I can tell, matplotlib simply skips points with NA x/y coordinates or NA style settings (e.g., color/size). To find points skipped due to NA, try the isnull method: df[df.col3.isnull()]
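
To plot "the rest" explicitly instead of letting those points be skipped, you could split the frame on whether col3 is present and draw the two parts separately. This is a sketch under assumptions: the sample values, the gray "x" marker and the fixed size of 40 for the NA points are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data with NA values in different columns.
df = pd.DataFrame({
    "col1": [0.1, 0.5, np.nan, 0.9, 0.3],
    "col2": [0.2, 0.6, 0.4, 0.8, 0.7],
    "col3": [100.0, np.nan, 300.0, 400.0, np.nan],
})

# A point needs both coordinates to be drawn at all.
plottable = df.dropna(subset=["col1", "col2"])
has_col3 = plottable["col3"].notna()
sized = plottable[has_col3]   # coordinates and col3 all present
rest = plottable[~has_col3]   # coordinates present, col3 is NA

fig, ax = plt.subplots()
ax.scatter(sized["col1"], sized["col2"], s=sized["col3"], label="col3 known")
# Give the remaining points a fixed size and a distinct marker.
ax.scatter(rest["col1"], rest["col2"], s=40, c="gray", marker="x",
           label="col3 missing")
ax.legend()
```

Here rows 0 and 3 are drawn with sizes from col3, rows 1 and 4 are drawn as "the rest", and row 2 is dropped because it has no col1 value.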

To split a list of points into many types, take a look at numpy select, which is a vectorized if-then-else implementation and accepts an optional default value. For example:

df['subset'] = np.select([df.col3 < 150, df.col3 < 400, df.col3 < 600],
                         [0, 1, 2], -1)
for color, label in zip('bgrm', [0, 1, 2, -1]):
    subset = df[df.subset == label]
    plt.scatter(subset.col1, subset.col2, s=120, c=color, label=str(label))
plt.legend()
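
A possible variation on the same idea, if you would rather not maintain the list of labels and colors by hand, is to let groupby iterate over whatever labels np.select produced. Treat this as a sketch: the names "a", "b", "c" and "rest" and the seeded sample data are assumptions for illustration, and the colors simply come from matplotlib's default color cycle.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["col1", "col2"])
df["col3"] = np.arange(len(df)) ** 2 * 100 + 100

# Conditions are checked in order and the first match wins; rows that
# match none of them fall into the default bucket "rest".
conditions = [df["col3"] < 150, df["col3"] < 400, df["col3"] < 600]
names = ["a", "b", "c"]
df["subset"] = np.select(conditions, names, default="rest")

fig, ax = plt.subplots()
# One scatter call per label, so each group gets its own legend entry.
for name, group in df.groupby("subset"):
    ax.scatter(group["col1"], group["col2"], s=120, label=name)
ax.legend()
```

Adding another condition then only means appending to conditions and names; the default bucket still catches everything left over.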


There is little to be added to Garrett's great answer, but pandas also has a scatter method. Using that, it's as easy as

df = pd.DataFrame(np.random.randn(10,2), columns=['col1','col2'])
df['col3'] = np.arange(len(df))**2 * 100 + 100
df.plot.scatter('col1', 'col2', s=df['col3'])

This plots col1 against col2, with the marker sizes taken from col3.

I would recommend an alternative method using seaborn, which is a more powerful tool for data plotting. You can use seaborn's scatterplot and set column 3 as both hue and size.

Working code:

import pandas as pd
import seaborn as sns
import numpy as np


# create sample data
sample_data = {'col_name_1': np.random.rand(20),
               'col_name_2': np.random.rand(20),
               'col_name_3': np.arange(20) * 100}
df = pd.DataFrame(sample_data)
sns.scatterplot(x="col_name_1", y="col_name_2", data=df,
                hue="col_name_3", size="col_name_3")
