Group a pandas DataFrame and select the latest row in each group

小开

Best answer

Use idxmax in the groupby, then slice df with loc:

df.loc[df.groupby('id').date.idxmax()]


    id  product       date
2  220     6647 2014-10-16
5  826     3380 2015-05-19
8  901     4555 2014-11-01
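To try this end to end, here is a minimal sketch. The input frame is made up (the answer does not show it) and is chosen only so that the latest rows land at index 2, 5 and 8, matching the output above.

```python
import pandas as pd

# Hypothetical sample data: three ids with three dated rows each.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# For each id, idxmax gives the index label of the row with the largest date.
latest_idx = df.groupby("id")["date"].idxmax()

# loc then pulls those rows out of the original frame.
latest = df.loc[latest_idx]
print(latest)
```

Note that this relies on df having a unique index; with duplicate index labels, loc would return every row sharing a label.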

小开

You can also use tail with groupby to get the last n values of each group:

df.sort_values('date').groupby('id').tail(1)


    id  product       date
2  220     6647 2014-10-16
8  901     4555 2014-11-01
5  826     3380 2015-05-19
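As a sketch of the "last n" case, the snippet below keeps the two most recent rows per id. The sample frame is hypothetical, not taken from the question.

```python
import pandas as pd

# Hypothetical sample data: three ids with three dated rows each.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# Sort by date so that "last" means "latest", then keep the final 2 rows per id.
last_two = df.sort_values("date").groupby("id").tail(2)
print(last_two)
```

The rows come back in the order of the sorted frame and keep their original index labels, so call sort_index() afterwards if the original order matters.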

小开

Use .tail() as the aggregation method and keep the grouping intact:

df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))


        id  product       date
id
220 2  220     6647 2014-10-16
826 5  826     3380 2015-05-19
901 8  901     4555 2014-11-01

小开

I had a similar problem and ended up using drop_duplicates rather than groupby.

It seems to run significantly faster on large datasets than the other methods suggested above.

df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")


    id  product       date
2  220     6647 2014-10-16
8  901     4555 2014-11-01
5  826     3380 2015-05-19
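A quick way to convince yourself that this selects the same rows as the idxmax approach is to compare the two results on a small frame. The data below is hypothetical, and the sketch checks only equivalence, not speed.

```python
import pandas as pd

# Hypothetical sample data: three ids with three dated rows each.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# Sort by date, then keep only the last (latest) row seen for each id.
by_dedup = df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")

# The groupby/idxmax version, for comparison.
by_idxmax = df.loc[df.groupby("id")["date"].idxmax()]

# Row order differs (date order vs. id order), so compare after sorting by index.
same = by_dedup.sort_index().equals(by_idxmax.sort_index())
print(same)
```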

小开

Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:

Like this:

df.groupby(['id','product']).last()

Like this:

df.groupby(['id','product']).nth(-1)

Or like this:

df.groupby(['id','product']).max()

If you don't want id and product to appear as the index, use groupby(['id', 'product'], as_index=False). Alternatively, use:

df.groupby(['id','product']).tail(1)
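The difference between the aggregating form and the tail form is easiest to see side by side. This is a sketch on a hypothetical frame; it sorts by date first, since the answer assumes the data is already in date order.

```python
import pandas as pd

# Hypothetical sample data: three ids with three dated rows each.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

by_date = df.sort_values("date")

# last() aggregates: one row per group, keys kept as ordinary columns
# because of as_index=False, and a fresh 0..n-1 index.
flat = by_date.groupby(["id", "product"], as_index=False).last()

# tail(1) filters: it returns the original rows with their original index labels.
rows = by_date.groupby(["id", "product"]).tail(1)

print(flat)
print(rows)
```

One caveat worth knowing: last() takes the last non-null value of each column separately, so on data with missing values it can combine values from different rows, whereas tail(1) always returns whole rows.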

小开

# Import the datetime class.
from datetime import datetime as dt

# Convert the date column to ordinals (or create a temporary column holding the ordinals).
df['date'] = df.date.apply(lambda date: date.toordinal())

# Aggregate as needed: latest or earliest date per id.
latest_date = df.groupby('id').agg(latest=('date', 'max'))
earliest_date = df.groupby('id').agg(earliest=('date', 'min'))

# Convert the column from ordinals back to dates.
# Note: latest_date and earliest_date still hold ordinals; convert them the same way if you need dates.
df['date'] = df.date.apply(lambda date: dt.fromordinal(date))

# This operation may take seconds on millions of records.
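If the date column is already a datetime64 column, the same per-id summary can be sketched without the ordinal round trip, because min and max work on datetimes directly. This is a variation on the answer above rather than its method, and the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: three ids with three dated rows each.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# Named aggregation: one output column per (source column, function) pair.
summary = df.groupby("id").agg(
    earliest=("date", "min"),
    latest=("date", "max"),
)
print(summary)
```

Unlike the row-selecting answers, this returns only the dates per id, not the other columns of the latest row.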