Counting unique values with pandas groupby

I have data of the following form:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param': ['a', 'a', 'b', np.nan, 'a', 'a', np.nan]
})
print(df)


#    group param
# 0      1     a
# 1      1     a
# 2      2     b
# 3      3   NaN
# 4      3     a
# 5      3     a
# 6      4   NaN

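The non-null values within a group are always the same. I want to count the non-null value once for each group (where one exists), and then find the total count of each value.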

I'm currently doing this in the following (clunky and inefficient) way:

param = []
# Take the single non-null value from each group that has one
for _, group in df[df.param.notnull()].groupby('group'):
    param.append(group.param.unique()[0])
print(pd.DataFrame({'param': param}).param.value_counts())


# a    2
# b    1

I'm sure there is a cleaner way to do this without a loop, but I just can't seem to work it out. Any help would be much appreciated.


I think you can use SeriesGroupBy.nunique:

print (df.groupby('param')['group'].nunique())
param
a    2
b    1
Name: group, dtype: int64
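
This works for the question's data because groupby drops rows whose key is NaN by default, so the null param rows never form a group. Here is a minimal sketch of that behaviour. The `dropna=False` variant is only for comparison and assumes pandas 1.1 or later, where that argument exists.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param': ['a', 'a', 'b', np.nan, 'a', 'a', np.nan],
})

# Default: rows whose grouping key ('param') is NaN are excluded,
# so only 'a' and 'b' appear in the result.
counts = df.groupby('param')['group'].nunique()

# Assumption: pandas >= 1.1. With dropna=False the NaN key becomes
# its own group and is counted like any other value.
counts_with_nan = df.groupby('param', dropna=False)['group'].nunique()

print(counts)
print(counts_with_nan)
```

If you ever need the number of groups that have no value at all, the second form gives it to you as the NaN row.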

Another solution uses unique: build a new DataFrame from the result with DataFrame.from_records, reshape it to a Series with stack, and finally call value_counts:

a = df[df.param.notnull()].groupby('group')['param'].unique()
print (pd.DataFrame.from_records(a.values.tolist()).stack().value_counts())
a    2
b    1
dtype: int64
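
A possibly simpler variant of the same idea, offered as a sketch rather than a tested drop-in: remove the null rows, keep one row per (group, param) pair with drop_duplicates, then call value_counts. The variable names `per_group` and `value_counts` are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param': ['a', 'a', 'b', np.nan, 'a', 'a', np.nan],
})

# Keep one row per (group, param) pair, ignoring rows with no param.
per_group = df.dropna(subset=['param']).drop_duplicates(['group', 'param'])

# Each remaining row is one group "voting" for its value.
value_counts = per_group['param'].value_counts()
print(value_counts)
```

This relies on the question's guarantee that the non-null values within a group are always the same. If a group could hold two different values, both would be counted.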

This is just an add-on to the solution in case you want to compute not only unique values but other aggregate functions:

df.groupby(['group']).agg(['min', 'max', 'count', 'nunique'])
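
If you want flat, readable column names instead of the MultiIndex columns that a list of functions produces, named aggregation may help. This is a sketch assuming pandas 0.25 or later. The output names `n_rows`, `n_non_null` and `n_unique` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param': ['a', 'a', 'b', np.nan, 'a', 'a', np.nan],
})

# Named aggregation: output_column=(input_column, function).
summary = df.groupby('group').agg(
    n_rows=('param', 'size'),        # all rows, including NaN
    n_non_null=('param', 'count'),   # non-null rows only
    n_unique=('param', 'nunique'),   # distinct non-null values
)
print(summary)
```

Note the difference between size and count: group 4 has one row but no non-null value, so it shows 1, 0 and 0.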

I know it has been a while since this was posted, but I think this will help too. I wanted to count unique values and filter the groups by the number of unique values. This is how I did it:

df.groupby('group').agg(['min','max','count','nunique']).reset_index(drop=False)
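
The filtering step itself could look something like the sketch below. The threshold `min_unique = 1` is an assumption chosen for the question's data, not something from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param': ['a', 'a', 'b', np.nan, 'a', 'a', np.nan],
})

min_unique = 1  # hypothetical threshold, adjust as needed

# Number of distinct non-null param values in each group.
unique_per_group = df.groupby('group')['param'].nunique()

# Labels of the groups that meet the threshold.
groups_with_value = unique_per_group[unique_per_group >= min_unique].index

# Keep only the rows that belong to those groups.
filtered = df[df['group'].isin(groups_with_value)]
print(filtered)
```

Here group 4 is dropped because it has no non-null value, while the NaN row inside group 3 stays because the group as a whole qualifies.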

The above answers work too, but if you want to add a column with the unique counts to your existing data frame, you can do that using transform:

df['distinct_count'] = df.groupby(['param'])['group'].transform('nunique')

output:

   group param  distinct_count
0      1     a             2.0
1      1     a             2.0
2      2     b             1.0
3      3   NaN             NaN
4      3     a             2.0
5      3     a             2.0
6      4   NaN             NaN

To check the group counts, as highlighted by @jezrael:

print (df.groupby('param')['group'].nunique())
param
a    2
b    1
Name: group, dtype: int64
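
An equivalent way to build the same column, sketched here in case it reads more clearly: compute the counts once, then map them back onto param. The column name `distinct_count_int` is made up for the example, and the cast assumes a pandas version with the nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param': ['a', 'a', 'b', np.nan, 'a', 'a', np.nan],
})

# One count per param value; NaN keys are dropped by groupby.
counts = df.groupby('param')['group'].nunique()

# Look each row's param up in counts; NaN has no match and stays NaN,
# which is why the column ends up as float.
df['distinct_count'] = df['param'].map(counts)

# Optional: the nullable integer dtype keeps whole numbers and shows <NA>.
df['distinct_count_int'] = df['distinct_count'].astype('Int64')
print(df)
```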

I find this way faster and more convenient (I have not benchmarked it against nunique here):

df.groupby('param').agg({'group':lambda x: len(pd.unique(x))})
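
One caveat worth checking before swapping this in for nunique: pd.unique keeps NaN as a value, while nunique skips it by default. The sketch below uses a made-up Series `s` to show the difference, then confirms that both approaches agree on the question's data, where the group column has no nulls.

```python
import numpy as np
import pandas as pd

# Made-up example Series containing a NaN.
s = pd.Series([1.0, np.nan, 1.0, 2.0])

n_with_nan = len(pd.unique(s))             # counts NaN as a value
n_without_nan = s.nunique()                # skips NaN by default
n_nunique_keep = s.nunique(dropna=False)   # matches pd.unique

df = pd.DataFrame({
    'group': [1, 1, 2, 3, 3, 3, 4],
    'param': ['a', 'a', 'b', np.nan, 'a', 'a', np.nan],
})

# On the question's data both approaches give the same counts.
via_lambda = df.groupby('param').agg({'group': lambda x: len(pd.unique(x))})['group']
via_nunique = df.groupby('param')['group'].nunique()
print(via_lambda.tolist(), via_nunique.tolist())
```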