Create a value_count column in a pandas DataFrame


df['Counts'] = df.groupby(['Color'])['Value'].transform('count')

For example,

In [102]: df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})


In [103]: df
Out[103]:
  Color  Value
0   Red    100
1   Red    150
2  Blue     50


In [104]: df['Counts'] = df.groupby(['Color'])['Value'].transform('count')


In [105]: df
Out[105]:
  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

Note that transform('count') ignores NaNs. If you want NaNs to be counted as well, use transform(len).
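The difference between the two is easiest to see on a column that actually contains a missing value. The following is a minimal sketch; the sample data and the column names NonNull and Rows are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one of the two 'Red' rows has a missing Value.
df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'],
                   'Value': [100, np.nan, 50]})

# 'count' only counts non-NaN entries in each group.
non_null = df.groupby('Color')['Value'].transform('count')
# len counts every row in the group, NaN or not.
rows = df.groupby('Color')['Value'].transform(len)

df['NonNull'] = non_null   # 1, 1, 1 (the NaN in the 'Red' group is skipped)
df['Rows'] = rows          # 2, 2, 1 (both 'Red' rows are counted)
print(df)
```

If the counted column never contains NaN, the two calls give the same numbers.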

To the anonymous editor: if you get an error when using transform('count'), your version of pandas may be too old. The above works with pandas 0.15 or newer.


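My initial thought was to use a list comprehension as shown below. However, as pointed out in the comments, this is slower than the groupby and transform approach. I am leaving this answer up to demonstrate what NOT to do: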

In [94]: df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})
In [95]: df['Counts'] = [sum(df['Color'] == df['Color'][i]) for i in range(len(df))]  # range, not xrange, on Python 3
In [96]: df
Out[96]:
  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1


[3 rows x 3 columns]

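@unutbu's method gets complicated for DataFrames with several columns, which makes this approach simpler to code. If you are working with a small DataFrame, this is faster (see below); otherwise, you should NOT use it.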

In [97]: %timeit df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]}); df['Counts'] = df.groupby(['Color']).transform('count')
100 loops, best of 3: 2.87 ms per loop
In [98]: %timeit df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]}); df['Counts'] = [sum(df['Color'] == df['Color'][i]) for i in range(len(df))]
1000 loops, best of 3: 1.03 ms per loop
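The timings above use a three-row frame. To see how the comparison behaves on something larger, a rough sketch along these lines can be run locally. The frame size, the random data, and the helper names with_transform and with_comprehension are assumptions chosen for illustration; no timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame: 300 rows, three possible colors.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'Color': rng.choice(['Red', 'Blue', 'Green'], size=300),
    'Value': rng.integers(0, 100, size=300),
})

def with_transform(frame):
    # One grouped pass over the data.
    return frame.groupby('Color')['Value'].transform('count')

def with_comprehension(frame):
    # One full comparison of the Color column per row.
    return [sum(frame['Color'] == frame['Color'][i]) for i in range(len(frame))]

# Both approaches should produce the same counts.
same_result = with_transform(big).tolist() == [int(c) for c in with_comprehension(big)]

t_transform = timeit.timeit(lambda: with_transform(big), number=3)
t_comprehension = timeit.timeit(lambda: with_comprehension(big), number=3)
print('same result:', same_result)
print('transform:     %.4f s' % t_transform)
print('comprehension: %.4f s' % t_comprehension)
```

The list comprehension compares the whole column once per row, so its cost grows roughly with the square of the row count, while the grouped version makes a single pass.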


df['Counts'] = df.Color.groupby(df.Color).transform('count')

You can do this with any Series: group it by itself, then call transform('count'):

>>> series = pd.Series(['Red', 'Red', 'Blue'])
>>> series.groupby(series).transform('count')
0    2
1    2
2    1
dtype: int64


Another option:

z = df['Color'].value_counts()  # frequency of each distinct value

z1 = z.to_dict()  # converts to dictionary

df['Count_Column'] = df['Color'].map(z1)  # look up each row's count

This option gives you a column of repeated count values, where each row holds the frequency of its own value in the 'Color' column.


This answer uses Series.map and Series.value_counts, and was tested with pandas 1.1.

df['counts'] = df['attribute'].map(df['attribute'].value_counts())

Credit: comment by sacuL
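As a quick check that this one-liner agrees with the groupby and transform approach, here is a small sketch. The sample values in the attribute column are invented for illustration, and the via_transform name is only a local helper variable.

```python
import pandas as pd

# Hypothetical data for the 'attribute' column.
df = pd.DataFrame({'attribute': ['a', 'b', 'a', 'c', 'a']})

# value_counts() returns a Series indexed by the distinct values;
# map() looks up each row's value in that index.
df['counts'] = df['attribute'].map(df['attribute'].value_counts())

# Same counts computed by grouping the column by itself.
via_transform = df['attribute'].groupby(df['attribute']).transform('count')

print(df)
print((df['counts'] == via_transform).all())
```

One thing to keep in mind: value_counts() drops NaN by default, so rows whose attribute is missing end up with NaN in the new column rather than a count.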


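Create a column containing the count of repeated values, where the values themselves are a temporary calculation derived from other columns. Very fast. Thanks to @ZakS.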

sum_A_B = df['A'] + df['B']
sum_A_B_dict = sum_A_B.value_counts().to_dict()
df['sum_A_B'] = sum_A_B.map(sum_A_B_dict)
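For a version that can be pasted and run as-is, the snippet needs a frame with columns A and B. The numbers below are made up for illustration: the row sums are 4, 4 and 8, so the first two rows share a count of 2.

```python
import pandas as pd

# Hypothetical input; only the column names A and B come from the answer.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 5]})

# Temporary key computed from other columns (not stored in the frame).
sum_A_B = df['A'] + df['B']                        # 4, 4, 8

# How often each computed key occurs, as a plain dict.
sum_A_B_dict = sum_A_B.value_counts().to_dict()    # {4: 2, 8: 1}

# Map every row's key to its frequency.
df['sum_A_B'] = sum_A_B.map(sum_A_B_dict)
print(df)
```

Note that the new column stores how often each sum occurs, not the sum itself, despite its name.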