GroupBy pandas DataFrame and select the most common value

I have a DataFrame with three string columns. I know that exactly one value in the third column is valid for every combination of the first two. To clean the data, I have to group the DataFrame by the first two columns and select the most common value of the third column for each combination.

My code:

import pandas as pd
from scipy import stats


source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})


source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

The last line of code doesn't work; it raises KeyError: 'Short name', and if I try to group only by City, I get an AssertionError. What can I do to fix it?


With agg, the lambda function receives a Series, which does not have a 'Short name' attribute.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in that tuple.
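A minimal sketch of what that tuple looks like (this assumes an older SciPy; recent releases changed mode's return shape and dropped support for non-numeric data):

from scipy import stats

result = stats.mode(['NY', 'New', 'NY'])
# result is ModeResult(mode=array(['NY'], ...), count=array([2])),
# so result[0] is the array of modes and result[0][0] is 'NY'.
print(result[0][0])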

With these two simple changes:

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

This returns:

                         Short name
Country City
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

You can use value_counts() to get a Series of counts, sorted in descending order, and take its first index:

source.groupby(['Country','City']).agg(lambda x: x.value_counts().index[0])

If you want to perform other aggregate functions within .agg(), try this:

# Let's add a new col, "account"
source['account'] = [1, 2, 3, 3]


source.groupby(['Country','City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'))
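With the example data, this should print something along these lines (the USA/New-York group has accounts 1, 2, 3, so its mean is 2.0):

                          mod  avg
Country City
Russia  Sankt-Petersburg  Spb  3.0
USA     New-York           NY  2.0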

A bit late to the game here, but I was running into some performance problems with HYRY's solution, so I had to come up with another one.

It works by finding the frequency of each key-value pair, and then, for each key, keeping only the value that appears with it most often.

There is also an additional solution that supports multiple modes.

In a scale test representative of the data I'm working with, this cut the runtime from 37.4 seconds down to 0.5 seconds!

Here is the code for the solution, some example usage, and the scale test:

import numpy as np
import pandas as pd
import random
import time


test_input = pd.DataFrame(columns=['key', 'value'],
                          data=[[1, 'A'],
                                [1, 'B'],
                                [1, 'B'],
                                [1, np.nan],
                                [2, np.nan],
                                [3, 'C'],
                                [3, 'C'],
                                [3, 'D'],
                                [3, 'D']])


def mode(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the mode.

    The output is a DataFrame with a record per group that has at least one mode
    (null values are not counted). The `key_cols` are included as columns, `value_col`
    contains a mode (ties are broken arbitrarily and deterministically) for each
    group, and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)


def modes(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the modes.

    The output is a DataFrame with a record per group that has at least
    one mode (null values are not counted). The `key_cols` are included as
    columns, `value_col` contains lists indicating the modes for each group,
    and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)


print(test_input)
print(mode(test_input, ['key'], 'value', 'count'))
print(modes(test_input, ['key'], 'value', 'count'))


scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))]
                   for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)


start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)


start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)


start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print(time.time() - start)

Running this code will print something like:

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

Hope this helps!

Formally, the correct answer is @eumiro's solution. The problem with @HYRY's solution is that when you have a sequence of numbers like [1, 2, 3, 4], the solution is wrong; that is, you don't have the mode. For example:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
...         'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
...         'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
...     }
... )

If you compute it like @HYRY does, you get:

>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

This is clearly wrong (see the A value, which should be 1 and not 4) because it cannot deal with unique values.
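You can see why: for client 'A', every total occurs exactly once, so there is no true mode and the top of value_counts() is effectively arbitrary:

# Each total for client 'A' appears once; value_counts() still
# returns a "top" row, but .index[0] is not a meaningful mode.
print(df[df['client'] == 'A']['total'].value_counts())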

Hence, the other solution is correct:

>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20

For larger datasets, a slightly clumsier but faster approach involves getting the counts for the column of interest, sorting the counts from highest to lowest, and then de-duplicating on a subset to keep only the largest cases. A code example follows:

>>> import pandas as pd
>>> source = pd.DataFrame(
...     {
...         'Country': ['USA', 'USA', 'Russia', 'USA'],
...         'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
...         'Short name': ['NY', 'New', 'Spb', 'NY']
...     }
... )
>>> grouped_df = source\
...     .groupby(['Country','City','Short name'])[['Short name']]\
...     .count()\
...     .rename(columns={'Short name':'count'})\
...     .reset_index()\
...     .sort_values('count', ascending=False)\
...     .drop_duplicates(subset=['Country', 'City'])\
...     .drop('count', axis=1)
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb

The problem here is performance; if you have a lot of rows, it will be a problem.

If that's your case, try this:

import pandas as pd


source = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                       'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short_name': ['NY', 'New', 'Spb', 'NY']})


source.groupby(['Country','City']).agg(lambda x: x.value_counts().index[0])


source.groupby(['Country','City']).Short_name.value_counts().groupby(['Country','City']).first()

If you want another approach to solve this that does not depend on value_counts or scipy.stats, you can use the Counter collection:

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key=lambda x: x[1])[0]

It can be applied to the example above like this:

src = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                    'Short_name': ['NY', 'New', 'Spb', 'NY']})


src.groupby(['Country','City']).agg(get_most_common)

pandas >= 0.16

pd.Series.mode is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)


Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If you need it as a DataFrame, use:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()


Short name
Country City
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing groupby output. It is also faster.
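A quick illustration of that return type, before the timings:

# pd.Series.mode returns a Series even when there is a single mode,
# which lets agg stitch the per-group results back together cleanly.
pd.Series(['NY', 'NY', 'New']).mode()
# 0    NY
# dtype: object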

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)


5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dealing with multiple modes

Series.mode also does a good job when there are multiple modes:

# DataFrame.append was removed in pandas 2.0; pd.concat is the
# modern equivalent of the original source.append(...) call.
source2 = pd.concat(
    [source, pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
    ignore_index=True)


# Now `source2` has two modes for the
# ("USA", "New-York") group, they are "NY" and "New".
source2


  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)


Country  City
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Alternatively, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)


Country  City
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don't care which mode is returned, as long as it's one of them, then you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])


Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

Alternatives to (not) consider

You can also use statistics.mode from pure Python, but...

import statistics

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)


Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

...it does not deal well with multiple modes; a StatisticsError is raised. This is mentioned in the docs:

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

But you can see for yourself...

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values

The top two answers here suggest:

df.groupby(cols).agg(lambda x:x.value_counts().index[0])

or, preferably,

df.groupby(cols).agg(pd.Series.mode)

However, both of these fail in simple edge cases, as shown here:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'client_id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date': ['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location': ['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan]
})

The first:

df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

raises an IndexError (because group C returns an empty Series). The second:

df.groupby(['client_id', 'date']).agg(pd.Series.mode)

returns ValueError: Function does not reduce, since the first group returns a list of two (because there are two modes). (As documented here, if the first group returned a single mode, this would work!)

Two possible solutions for this case are:

import scipy.stats

df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 in the comments here:

def foo(x):
    m = pd.Series.mode(x)
    return m.values[0] if not m.empty else np.nan

df.groupby(['client_id', 'date']).agg(foo)

However, all of these are slow and unsuited to large datasets. The solution I ended up going with, which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33's answer (which should be higher):

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.nan))  # restore NaNs


group_cols = ['client_id', 'date']
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

Essentially, the method works on one column at a time and outputs a DataFrame, so instead of concat, which is intensive, you treat the first column's result as the DataFrame and then iteratively add each output array (values.flatten()) as a new column.
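A minimal usage sketch with the edge-case df defined above, one column at a time:

# Mode of 'location' per (client_id, date) group; group C keeps its
# NaN because the -1 placeholder is restored at the end.
print(get_mode_per_column(df, ['client_id', 'date'], 'location'))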

If you don't want to include NaN values, using Counter is much, much faster than pd.Series.mode or pd.Series.value_counts()[0]:

from collections import Counter

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]


df.groupby(col).agg(get_most_common)

should work. This will fail when you have NaN values, though, as each NaN will be counted separately.
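A quick sketch of why: each NaN pulled out of a float Series is typically a fresh object, and NaN != NaN, so Counter keeps them as separate keys and most_common() never sees a combined NaN count:

import numpy as np
import pandas as pd
from collections import Counter

s = pd.Series([1.0, np.nan, np.nan])
# Usually prints two separate nan keys with a count of 1 each,
# rather than one nan key with a count of 2.
print(Counter(list(s)))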

A fast solution using DataFrame.value_counts

The top three answers here:

  • source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
  • source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
  • source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0])

are incredibly slow for large datasets.

The solution using collections.Counter is much faster (20-40 times faster than the top three methods):

  • source.groupby(['Country', 'City'])['Short name'].agg(lambda srs: Counter(list(srs)).most_common(1)[0][0])

but still slow.

The solutions by abw333 and Josh Friedlander are much faster still (about 10 times faster than the method using Counter). These solutions can be further optimized by using value_counts instead (DataFrame.value_counts has been available since pandas 1.1.0):

source.value_counts(['Country', 'City', 'Short name']).pipe(lambda x: x[~x.droplevel('Short name').index.duplicated()]).reset_index(name='Count')

To make the function work with NaNs like Josh Friedlander's function does, just turn off the dropna parameter:

source.value_counts(['Country', 'City', 'Short name'], dropna=False).pipe(lambda x: x[~x.droplevel('Short name').index.duplicated()]).reset_index(name='Count')

Using abw333's setup, if we test the runtime difference for a DataFrame with 1 million rows, value_counts is about 10% faster than abw333's solution:

scale_test_data = [[random.randint(1, 100),
                    str(random.randint(100, 900)),
                    str(random.randint(0, 2))] for i in range(1000000)]
source = pd.DataFrame(data=scale_test_data, columns=['Country', 'City', 'Short name'])
keys = ['Country', 'City']
vals = ['Short name']


%timeit source.value_counts(keys+vals).pipe(lambda x: x[~x.droplevel(vals).index.duplicated()]).reset_index(name='Count')
# 376 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


%timeit mode(source, ['Country', 'City'], 'Short name', 'Count')
# 415 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

For ease of use, I wrapped this solution in a function that you can easily copy-paste and use in your own environment. This function can also find group modes of multiple columns.

def get_groupby_modes(source, keys, values, dropna=True, return_counts=False):
    """
    A function that groups a pandas dataframe by some of its columns (keys) and
    returns the most common value of each group for some of its columns (values).
    The output is sorted by the counts of the first column in values (because it
    uses pd.DataFrame.value_counts internally).
    An equivalent one-liner if values is a singleton list is:
    (
        source
        .value_counts(keys+values)
        .pipe(lambda x: x[~x.droplevel(values).index.duplicated()])
        .reset_index(name=f"{values[0]}_count")
    )
    If there are multiple modes for some group, it returns the value with the
    lowest Unicode value (because under the hood, it drops duplicate indexes in a
    sorted dataframe), unlike, e.g. df.groupby(keys)[values].agg(pd.Series.mode).
    Must have pandas 1.1.0 or later for the function to work and must have
    pandas 1.3.0 or later for the dropna parameter to work.
    -----------------------------------------------------------------------------
    Parameters:
    -----------
    source: pandas dataframe.
        A pandas dataframe with at least two columns.
    keys: list.
        A list of column names of the pandas dataframe passed as source. It is
        used to determine the groups for the groupby.
    values: list.
        A list of column names of the pandas dataframe passed as source.
        If it is a singleton list, the output contains the mode of each group
        for this column. If it is a list longer than 1, then the modes of each
        group for the additional columns are assigned as new columns.
    dropna: bool, default: True.
        Whether to count NaN values as the same or not. If True, NaN values are
        treated by their default property, NaN != NaN. If False, NaN values in
        each group are counted as the same values (NaN could potentially be a
        most common value).
    return_counts: bool, default: False.
        Whether to include the counts of each group's mode. If True, the output
        contains a column for the counts of each mode for every column in values.
        If False, the output only contains the modes of each group for each
        column in values.
    -----------------------------------------------------------------------------
    Returns:
    --------
    a pandas dataframe.
    -----------------------------------------------------------------------------
    Example:
    --------
    get_groupby_modes(source=df,
                      keys=df.columns[:2].tolist(),
                      values=df.columns[-2:].tolist(),
                      dropna=True,
                      return_counts=False)
    """

    def _get_counts(df, keys, v, dropna):
        c = df.value_counts(keys+v, dropna=dropna)
        return c[~c.droplevel(v).index.duplicated()]

    counts = _get_counts(source, keys, values[:1], dropna)

    if len(values) == 1:
        if return_counts:
            final = counts.reset_index(name=f"{values[0]}_count")
        else:
            final = counts.reset_index()[keys+values[:1]]
    else:
        final = counts.reset_index(name=f"{values[0]}_count", level=values[0])
        if not return_counts:
            final = final.drop(columns=f"{values[0]}_count")
        for v in values:
            counts = _get_counts(source, keys, [v], dropna).reset_index(level=v)
            if return_counts:
                final[[v, f"{v}_count"]] = counts
            else:
                final[v] = counts[v]
        final = final.reset_index()
    return final
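For instance, with the scale-test source frame built above, a call might look like this:

# One row per (Country, City) group with the group mode of 'Short name'.
result = get_groupby_modes(source, keys=['Country', 'City'], values=['Short name'])
print(result.head())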

Instead of ".agg", try ".apply", which is faster and gives results across the columns.

source = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                       'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short name': ['NY', 'New', 'Spb', 'NY']})
source.groupby(['Country', 'City'])['Short name'].apply(pd.Series.mode).reset_index()