Pandas: group by month and year

I have the following DataFrame:

Date        abc    xyz
01-Jun-13   100    200
03-Jun-13   -20    50
15-Aug-13   40     -5
20-Jan-14   25     15
21-Feb-14   60     80

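I need to group the data by year and month, e.g. Jan 2013, Feb 2013, Mar 2013, and so on.

I will use the newly grouped data to create a plot showing abc vs xyz per year/month.

I've tried various combinations of groupby and sum, but I can't seem to get anything to work. How can I do this?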


There are different ways to do that.

  • I created the data frame to showcase the different techniques to filter your data.

      import pandas as pd
      df = pd.DataFrame({'Date': ['01-Jun-13', '03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'],
                         'abc': [100, -20, 40, 25, 60], 'xyz': [200, 50, -5, 15, 80]})
    
  • I wrote helper functions that extract the day, month, year, and a combined month-year key from each date string.

      def getMonth(s):
          return s.split("-")[1]


      def getDay(s):
          return s.split("-")[0]


      def getYear(s):
          return s.split("-")[2]


      def getYearMonth(s):
          return s.split("-")[1] + "-" + s.split("-")[2]
    
  • I created new columns: year, month, day and 'YearMonth'. In your case, you need only one of the two approaches: group by the two columns 'year' and 'month', or by the single column 'YearMonth'.

      df['year'] = df['Date'].apply(getYear)
      df['month'] = df['Date'].apply(getMonth)
      df['day'] = df['Date'].apply(getDay)
      df['YearMonth'] = df['Date'].apply(getYearMonth)
    

    Output:

            Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    2  15-Aug-13   40   -5   13   Aug  15    Aug-13
    3  20-Jan-14   25   15   14   Jan  20    Jan-14
    4  21-Feb-14   60   80   14   Feb  21    Feb-14
    
  • You can go through the different groups in groupby(..) items.

    In this case, we are grouping by two columns:

      for key, g in df.groupby(['year', 'month']):
          print(key, g)
    

    Output:

    ('13', 'Jun')         Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    ('13', 'Aug')         Date  abc  xyz year month day YearMonth
    2  15-Aug-13   40   -5   13   Aug  15    Aug-13
    ('14', 'Jan')         Date  abc  xyz year month day YearMonth
    3  20-Jan-14   25   15   14   Jan  20    Jan-14
    ('14', 'Feb')         Date  abc  xyz year month day YearMonth
    4  21-Feb-14   60   80   14   Feb  21    Feb-14
    

    In this case, we are grouping by one column:

      for key, g in df.groupby('YearMonth'):
          print(key, g)
    

    Output:

    Jun-13         Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    Aug-13         Date  abc  xyz year month day YearMonth
    2  15-Aug-13   40   -5   13   Aug  15    Aug-13
    Jan-14         Date  abc  xyz year month day YearMonth
    3  20-Jan-14   25   15   14   Jan  20    Jan-14
    Feb-14         Date  abc  xyz year month day YearMonth
    4  21-Feb-14   60   80   14   Feb  21    Feb-14
    
  • In case you want to access a specific item, you can use get_group

      print(df.groupby('YearMonth').get_group('Jun-13'))
    

    Output:

            Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    
  • Similar to get_group, a boolean mask filters the rows that belong to one group and gives the same result.

      print(df[df['YearMonth'] == 'Jun-13'])
    

    Output:

            Date  abc  xyz year month day YearMonth
    0  01-Jun-13  100  200   13   Jun  01    Jun-13
    1  03-Jun-13  -20   50   13   Jun  03    Jun-13
    

    You can select the abc or xyz values for Jun-13 as arrays:

      print(df[df['YearMonth'] == 'Jun-13'].abc.values)
      print(df[df['YearMonth'] == 'Jun-13'].xyz.values)
    

    Output:

    [100 -20]  #abc values
    [200  50]  #xyz values
    

    You can use this to loop over each "year-month" key and pull out the related data:

      for x in df['YearMonth'].unique():  # unique() keeps first-appearance order, unlike set()
          print(df[df['YearMonth'] == x].abc.values)
          print(df[df['YearMonth'] == x].xyz.values)
    

I also recommend checking this answer.
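The answer above builds the groups but never aggregates them. A minimal sketch of the summing step, assuming the same "dd-Mon-yy" strings as in the question; the variable name `totals` is only illustrative, and `str.split(..., expand=True)` is swapped in for the helper functions:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Split "dd-Mon-yy" into three string columns in one vectorized call.
df[["day", "month", "year"]] = df["Date"].str.split("-", expand=True)

# sort=False keeps groups in order of first appearance. The default
# (sort=True) orders the string keys alphabetically, so 'Aug' would
# come before 'Jun'.
totals = df.groupby(["year", "month"], sort=False)[["abc", "xyz"]].sum()
print(totals)
```

Because the keys are plain strings, chronological order here relies on the rows already being in date order; parsing the dates, as the later answers do, avoids that assumption.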

You can use either resample or Grouper (which resamples under the hood).

First make sure that the datetime column actually holds datetimes (convert it with pd.to_datetime). It's easier if it's a DatetimeIndex:

In [11]: df1
Out[11]:
            abc  xyz
Date
2013-06-01  100  200
2013-06-03  -20   50
2013-08-15   40   -5
2014-01-20   25   15
2014-02-21   60   80
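For reference, one way a frame like `df1` could be built from the question's data. This is a sketch that assumes the dates arrive as "dd-Mon-yy" strings; the intermediate name `raw` is hypothetical:

```python
import pandas as pd

raw = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Parse the strings into real datetimes; an explicit format avoids
# day/month ambiguity.
raw["Date"] = pd.to_datetime(raw["Date"], format="%d-%b-%y")

# Move the parsed dates into the index so it becomes a DatetimeIndex.
df1 = raw.set_index("Date")
print(df1)
```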


In [12]: g = df1.groupby(pd.Grouper(freq="M"))  # DataFrameGroupBy (grouped by Month)


In [13]: g.sum()
Out[13]:
            abc  xyz
Date
2013-06-30   80  250
2013-07-31  NaN  NaN
2013-08-31   40   -5
2013-09-30  NaN  NaN
2013-10-31  NaN  NaN
2013-11-30  NaN  NaN
2013-12-31  NaN  NaN
2014-01-31   25   15
2014-02-28   60   80


In [14]: df1.resample("M").sum()  # the same (older pandas spelled this resample("M", how='sum'))
Out[14]:
            abc  xyz
Date
2013-06-30   80  250
2013-07-31  NaN  NaN
2013-08-31   40   -5
2013-09-30  NaN  NaN
2013-10-31  NaN  NaN
2013-11-30  NaN  NaN
2013-12-31  NaN  NaN
2014-01-31   25   15
2014-02-28   60   80

Note: Previously pd.Grouper(freq="M") was written as pd.TimeGrouper("M"). The latter has been deprecated since 0.21. In pandas 2.2 and later, the month-end alias "M" is itself deprecated in favor of "ME".
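A version-tolerant sketch of the two calls above. Passing a `pd.offsets.MonthEnd()` object instead of an alias string sidesteps the "M"/"ME" rename. In recent pandas, `sum()` returns 0 for empty bins, so `min_count=1` is used here to reproduce the NaN rows shown in the outputs; the names `by_group` and `by_resample` are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"abc": [100, -20, 40, 25, 60], "xyz": [200, 50, -5, 15, 80]},
    index=pd.to_datetime(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
    ),
)
df1.index.name = "Date"

month_end = pd.offsets.MonthEnd()  # offset object instead of the "M"/"ME" alias

# min_count=1 makes a month with no rows sum to NaN rather than 0.
by_group = df1.groupby(pd.Grouper(freq=month_end)).sum(min_count=1)
by_resample = df1.resample(month_end).sum(min_count=1)

print(by_group)
print(by_resample)
```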


I had thought the following would work, but it doesn't (possibly because as_index is not respected; I'm not sure). I'm including it for interest's sake.

If it's a column (it has to be a datetime64 column! as I say, hit it with to_datetime), you can use the PeriodIndex:

In [21]: df
Out[21]:
        Date  abc  xyz
0 2013-06-01  100  200
1 2013-06-03  -20   50
2 2013-08-15   40   -5
3 2014-01-20   25   15
4 2014-02-21   60   80


In [22]: pd.DatetimeIndex(df.Date).to_period("M")  # old way
Out[22]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-06, ..., 2014-02]
Length: 5, Freq: M


In [23]: per = df.Date.dt.to_period("M")  # new way to get the same


In [24]: g = df.groupby(per)


In [25]: g.sum(numeric_only=True)  # dang, not quite what we want (doesn't fill in the gaps)
Out[25]:
         abc  xyz
2013-06   80  250
2013-08   40   -5
2014-01   25   15
2014-02   60   80

To get the desired result we have to reindex...
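The reindexing step is not spelled out above. A minimal sketch of one way to do it, assuming the goal is a row for every month between the first and last observation; the names `monthly`, `full_range`, and `filled` are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
    ),
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Group on the monthly period of each date; only months with data appear.
per = df["Date"].dt.to_period("M")
monthly = df.groupby(per)[["abc", "xyz"]].sum()

# Build every month between the first and last period, then reindex so
# the missing months show up as NaN rows.
full_range = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
filled = monthly.reindex(full_range)
print(filled)
```

If zeros are preferable to NaN for the empty months, `monthly.reindex(full_range, fill_value=0)` is an alternative.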

Keep it simple:

GB = DF.groupby([(DF.index.year), (DF.index.month)]).sum()

giving you,

print(GB)
        abc  xyz
2013 6   80  250
     8   40   -5
2014 1   25   15
     2   60   80

and then you can plot as asked using,

GB.plot('abc', 'xyz', kind='scatter')
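The snippet above assumes `DF` already has a DatetimeIndex. A self-contained sketch under that assumption, which also names the two index levels (the labels "year" and "month" are illustrative choices, not part of the answer) so the result is easier to flatten or label later:

```python
import pandas as pd

DF = pd.DataFrame(
    {"abc": [100, -20, 40, 25, 60], "xyz": [200, 50, -5, 15, 80]},
    index=pd.to_datetime(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
    ),
)

# Group on the year and month components of the DatetimeIndex.
GB = DF.groupby([DF.index.year, DF.index.month]).sum()

# Give the MultiIndex levels explicit names.
GB.index.names = ["year", "month"]

# reset_index turns the levels into ordinary columns, e.g. for axis labels.
flat = GB.reset_index()
print(flat)
```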

You can also do it by creating a string column with the year and month as follows:

df['date'] = df.index
df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
grouped = df.groupby('year-month')

However, this doesn't preserve chronological order when you loop over the groups, because the string keys are sorted alphabetically, e.g.

for name, group in grouped:
    print(name)

Will give:

2007 11
2007 12
2008 1
2008 10
2008 11
2008 12
2008 2
2008 3
2008 4
2008 5
2008 6
2008 7
2008 8
2008 9
2009 1
2009 10

So then, if you want to preserve the order, you must do as suggested by @Q-man above:

grouped = df.groupby([df.index.year, df.index.month])

This will preserve the order in the above loop:

(2007, 11)
(2007, 12)
(2008, 1)
(2008, 2)
(2008, 3)
(2008, 4)
(2008, 5)
(2008, 6)
(2008, 7)
(2008, 8)
(2008, 9)
(2008, 10)
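If a single string key is still preferred, zero-padding the month makes alphabetical order coincide with chronological order. A small sketch with made-up dates and a hypothetical `value` column, comparing the unpadded key from this answer with a `strftime("%Y-%m")` key:

```python
import pandas as pd

idx = pd.to_datetime(["2007-11-05", "2008-01-10", "2008-02-01", "2008-10-20"])
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

# Unpadded key, as in the answer above: "2008 10" sorts before "2008 2".
unpadded = pd.Index([f"{ts.year} {ts.month}" for ts in df.index])
unpadded_order = [name for name, _ in df.groupby(unpadded)]

# Zero-padded key: "2008-02" sorts before "2008-10", matching the calendar.
padded = df.index.strftime("%Y-%m")
padded_order = [name for name, _ in df.groupby(padded)]

print(unpadded_order)
print(padded_order)
```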

Some of the answers are using Date as an index instead of a column (and there's nothing wrong with doing that).

However, for anyone who has the dates stored as a column (instead of an index), remember to access the column's dt attribute. That is:

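# First make sure `Date` is a datetime column
df['Date'] = pd.to_datetime(
    arg=df['Date'],
    format='%d-%b-%y'  # Assuming dd-Mon-yy format
)


# Group by year and month, summing only the numeric columns
# (recent pandas raises if asked to sum the datetime column itself)
df.groupby(
    [
        df['Date'].dt.year,
        df['Date'].dt.month
    ]
)[['abc', 'xyz']].sum()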
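When the dates live in a column, `pd.Grouper` can also point at that column through its `key` parameter, which gives one row per calendar month without touching the index. A sketch using the question's data; the name `monthly` is illustrative, and an offset object is passed instead of an alias string so it does not depend on the "M"/"ME" spelling:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")

# key="Date" tells the Grouper to bin on the column rather than the index.
monthly = df.groupby(
    pd.Grouper(key="Date", freq=pd.offsets.MonthEnd())
)[["abc", "xyz"]].sum()
print(monthly)
```

Unlike grouping on `dt.year` and `dt.month`, this includes the months with no rows; with the default `sum()` in recent pandas they appear as 0.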