How to group pandas DataFrame entries by date in a non-unique column

A pandas DataFrame contains a column named "date" that holds non-unique datetime values. I can group the rows in this frame using:

data.groupby(data['date'])

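However, this splits the data by the individual datetime values. I would like to group the data by the year stored in the "date" column. This page shows how to group by year when timestamps are used as the index, which is not true in my case.

How do I achieve this grouping?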


This should work:

data.groupby(lambda x: data['date'][x].year)

ecatmur's solution will work fine. This will perform better on large datasets, though:

data.groupby(data['date'].map(lambda x: x.year))

I'm using pandas 0.16.2. This has better performance on my large dataset:

data.groupby(data.date.dt.year)

Using the dt accessor also makes it far easier to experiment with other properties such as weekofyear, dayofweek, etc.
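As a rough sketch of that idea, the same pattern works with other dt properties. The sample frame below is made up for illustration. Note that, as far as I know, dt.weekofyear has been removed in pandas 2.x, so the sketch assumes pandas 1.1 or later and uses dt.isocalendar().week instead.

```python
import pandas as pd

# Made-up sample data for illustration only
data = pd.DataFrame({
    "date": pd.to_datetime(
        ["2012-01-01", "2012-06-01", "2015-01-01", "2015-02-01", "2015-03-01"]
    ),
    "a": [9, 5, 1, 2, 3],
})

# Group by day of week (Monday=0 ... Sunday=6)
by_dow = data.groupby(data["date"].dt.dayofweek)["a"].sum()

# Group by calendar month, ignoring the year
by_month = data.groupby(data["date"].dt.month)["a"].sum()

# Group by ISO week number (assumes pandas >= 1.1, where isocalendar() exists)
iso_week = data["date"].dt.isocalendar().week
by_week = data.groupby(iso_week)["a"].sum()

print(by_dow)
print(by_month)
print(by_week)
```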

This will also work:

data.groupby(data['date'].dt.year)

This might be easier to explain with a sample dataset.

Create Sample Data

Let's assume we have a single column of Timestamps, date, and another column that we would like to aggregate, a.

import pandas as pd

df = pd.DataFrame({'date': pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1', '2015-2-1', '2015-3-1']),
                   'a': [9, 5, 1, 2, 3]}, columns=['date', 'a'])


df


        date  a
0 2012-01-01  9
1 2012-06-01  5
2 2015-01-01  1
3 2015-02-01  2
4 2015-03-01  3

There are several ways to group by year:

  • Use the dt accessor with year property
  • Put date in index and use anonymous function to access year
  • Use resample method
  • Convert to pandas Period

.dt accessor with year property

When you have a column (and not an index) of pandas Timestamps, you can access many additional properties and methods through the dt accessor. For instance:

df['date'].dt.year


0    2012
1    2012
2    2015
3    2015
4    2015
Name: date, dtype: int64

We can use this to form our groups and calculate some aggregations on a particular column:

df.groupby(df['date'].dt.year)['a'].agg(['sum', 'mean', 'max'])


      sum  mean  max
date
2012   14     7    9
2015    6     2    3

Put date in index and use anonymous function to access year

If you set the date column as the index, it becomes a DatetimeIndex with the same properties and methods that the dt accessor gives normal columns:

df1 = df.set_index('date')
df1.index.year


Int64Index([2012, 2012, 2015, 2015, 2015], dtype='int64', name='date')

Interestingly, you can pass a function to the groupby method. pandas applies this function to the values of the DataFrame's index and groups by the results. So, we can get the same result as above with the following:

df1.groupby(lambda x: x.year)['a'].agg(['sum', 'mean', 'max'])


      sum  mean  max
2012   14     7    9
2015    6     2    3
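A minimal sketch of an alternative, assuming the date is already in the index: instead of a function, you can pass the index's year attribute directly as the group key. The frame is rebuilt here so the snippet runs on its own, and the two results should match.

```python
import pandas as pd

# Rebuild the sample frame with the dates in the index
df1 = pd.DataFrame(
    {"a": [9, 5, 1, 2, 3]},
    index=pd.DatetimeIndex(
        ["2012-01-01", "2012-06-01", "2015-01-01", "2015-02-01", "2015-03-01"],
        name="date",
    ),
)

# Group with a function that is applied to the index values
by_func = df1.groupby(lambda x: x.year)["a"].agg(["sum", "mean", "max"])

# Group with the precomputed years of the index
by_attr = df1.groupby(df1.index.year)["a"].agg(["sum", "mean", "max"])

print(by_func)
print(by_attr)
```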

Use the resample method

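If your date column is not in the index, you must specify the column with the on parameter. You also need to specify the offset alias as a string. 'YS' means year start; older pandas versions also accept the equivalent 'AS', which recent releases deprecate. Unlike groupby, resample also returns a row for every year in the range, including years with no data.

df.resample('YS', on='date')['a'].agg(['sum', 'mean', 'max'])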


             sum  mean  max
date
2012-01-01  14.0   7.0  9.0
2013-01-01   NaN   NaN  NaN
2014-01-01   NaN   NaN  NaN
2015-01-01   6.0   2.0  3.0
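If the empty years are not wanted, one possible approach (a sketch, not the only way) is to add a count aggregation and keep only the bins that contain at least one row. This assumes a pandas version that accepts the 'YS' alias; how sum reports empty bins (NaN or 0) may depend on the pandas version, which is why the filter relies on count.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2012-01-01", "2012-06-01", "2015-01-01", "2015-02-01", "2015-03-01"]
    ),
    "a": [9, 5, 1, 2, 3],
})

# Resample by year start; every year from 2012 to 2015 gets a row
yearly = df.resample("YS", on="date")["a"].agg(["count", "sum", "mean", "max"])

# Keep only the years that actually contain data
non_empty = yearly[yearly["count"] > 0]

print(yearly)
print(non_empty)
```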

Convert to pandas Period

You can also convert the date column to pandas Period objects. We must pass in the offset alias as a string to determine the length of the Period; 'Y' gives yearly periods.

df['date'].dt.to_period('Y')


0   2012
1   2012
2   2015
3   2015
4   2015
Name: date, dtype: object

We can then use this as the group key:

df.groupby(df['date'].dt.to_period('Y'))['a'].agg(['sum', 'mean', 'max'])




      sum  mean  max
2012   14     7    9
2015    6     2    3
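One reason to reach for periods is that the same pattern extends to other lengths. As an illustrative sketch, passing 'Q' instead of 'Y' groups by calendar quarter while keeping the year, so 2012Q1 and 2015Q1 stay separate:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2012-01-01", "2012-06-01", "2015-01-01", "2015-02-01", "2015-03-01"]
    ),
    "a": [9, 5, 1, 2, 3],
})

# Convert each timestamp to the quarter it falls in, then group on that
quarters = df["date"].dt.to_period("Q")
by_quarter = df.groupby(quarters)["a"].sum()

print(by_quarter)
```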

Use:

data.groupby(['col1', data.date.dt.year]).agg({'col2': 'agg_func'}).reset_index()

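Use this when you want to group by the year of a datetime column together with another column of a different type (col1). Here 'agg_func' is a placeholder for the aggregation you actually want, such as 'sum'.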
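A small runnable sketch of this, using made-up data and 'sum' as the aggregation. The column names col1 and col2 follow the snippet above; renaming the year Series to "year" is my own addition so that the output column is not also called "date".

```python
import pandas as pd

# Made-up data for illustration only
data = pd.DataFrame({
    "col1": ["x", "x", "y", "y"],
    "date": pd.to_datetime(
        ["2012-01-01", "2012-06-01", "2012-03-01", "2015-01-01"]
    ),
    "col2": [1, 2, 3, 4],
})

# Year extracted from the datetime column, renamed for a clearer output column
year = data["date"].dt.year.rename("year")

# Group by the string column and the year together, then sum col2
result = data.groupby(["col1", year]).agg({"col2": "sum"}).reset_index()

print(result)
```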