Keep other columns when doing groupby

I'm using groupby on a pandas DataFrame to drop all rows that don't have the minimum of a specific column. Something like this:

df1 = df.groupby("item", as_index=False)["diff"].min()

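However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or do I have to find a different way to drop the rows?

My data looks like this: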

    item    diff   otherstuff
0   1       2            1
1   1       1            2
2   1       3            7
3   2      -1            0
4   2       1            3
5   2       4            9
6   2      -6            2
7   3       0            0
8   3       2            9

The result should look like this:

    item   diff  otherstuff
0   1      1           2
1   2     -6           2
2   3      0           0

But what I get is:

    item   diff
0   1      1
1   2     -6
2   3      0

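I've looked through the documentation and couldn't find anything. I tried:

df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()


df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]


df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()

But none of these work (for the last one, I realized that syntax is meant for aggregating after the groups have been created).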


Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.
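To make the index difference concrete, here is a minimal self-contained sketch that rebuilds the question's frame and runs both methods side by side. The variable names `by_idxmin` and `by_sort` are my own, not part of the answer.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    "item":       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff":       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Method #1: selects the original rows, so their original labels survive.
by_idxmin = df.loc[df.groupby("item")["diff"].idxmin()]

# Method #2: as_index=False builds a fresh 0..n-1 index.
by_sort = df.sort_values("diff").groupby("item", as_index=False).first()

print(list(by_idxmin.index))  # [1, 6, 7]
print(list(by_sort.index))    # [0, 1, 2]

# The row content matches once the index is ignored.
print(by_idxmin.values.tolist() == by_sort.values.tolist())  # True
```

One caveat to keep in mind: `first()` takes the first non-null value of each column, so on data with missing values Method #2 can combine values from different rows, while Method #1 always returns whole rows.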

The above answer works great if there is only one minimum per group, or if you only want one. In my case there could be multiple minima and I wanted all rows equal to the minimum, which .idxmin() doesn't give you. This worked:

import pandas as pd

def filter_group(dfg, col):
    # keep every row whose value equals the group's minimum
    return dfg[dfg[col] == dfg[col].min()]


df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))

As an aside, .filter() is also relevant to this question but didn't work for me.

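If you know that all of your "items" have more than one record, you can sort and then use duplicated. Note that duplicated returns a boolean mask (True for every row after the first of each item), so negate it to select the minimum rows:

s = df.sort_values(by='diff')
s[~s.duplicated(subset='item', keep='first')]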

You can use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2

If there can be multiple minimal values per group and you want all of the minimum rows, use boolean indexing with transform to get the minimum value per group:

print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9


print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64


df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0
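As a quick check of the tie behaviour, the following sketch rebuilds the tied frame shown above and compares the transform mask with idxmin. The names `mask`, `all_mins` and `one_min` are illustrative only.

```python
import pandas as pd

# The tied frame: rows 1 and 2 share the minimum diff for item 1.
df = pd.DataFrame({
    "item":       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff":       [2, 1, 1, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# transform('min') broadcasts each group's minimum back onto every row,
# so the comparison gives a boolean mask aligned with df.
mask = df["diff"] == df.groupby("item")["diff"].transform("min")
all_mins = df[mask]

# idxmin yields one label per group: the first occurrence of the minimum.
one_min = df.loc[df.groupby("item")["diff"].idxmin()]

print(list(all_mins.index))  # [1, 2, 6, 7]
print(list(one_min.index))   # [1, 6, 7]
```

Because the mask is aligned with the original frame, the selected rows keep their original index labels and all other columns.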

I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.

df.sort_values(by='diff', inplace=True, ignore_index=True)
df.drop_duplicates(subset='item', inplace=True, ignore_index=True)
df.sort_values(by=['item'], inplace=True, ignore_index=True)

For a little more explanation:

  1. Sort the rows by the column whose minimum you want (diff)
  2. Drop the duplicates of the grouping column (item), which keeps the first, i.e. smallest, row of each group
  3. Re-sort by the grouping column, because the data is still ordered by the minimum values
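The three steps above can also be sketched as a single chained expression that leaves the input untouched. This reads the steps as: sort by the value column, drop duplicates of the grouping column, then re-sort by the grouping column. `min_rows` is a hypothetical helper name, and `ignore_index` assumes pandas 1.0 or later.

```python
import pandas as pd

def min_rows(df, group_col, value_col):
    """Return one row per group: the row with the smallest value_col."""
    return (
        df.sort_values(by=value_col)            # step 1: smallest values first
          .drop_duplicates(subset=group_col)    # step 2: keep the first row per group
          .sort_values(by=group_col, ignore_index=True)  # step 3: back to group order
    )

# The DataFrame from the question.
df = pd.DataFrame({
    "item":       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff":       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

result = min_rows(df, "item", "diff")
print(result)
```

Avoiding `inplace=True` keeps the original `df` available, which helps when comparing the result against the other approaches in this thread.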