Groupby weighted average and sum in pandas dataframe

I have a dataframe:

    Out[78]:
contract month year  buys  adjusted_lots    price
0         W     Z    5  Sell             -5   554.85
1         C     Z    5  Sell             -3   424.50
2         C     Z    5  Sell             -2   424.00
3         C     Z    5  Sell             -2   423.75
4         C     Z    5  Sell             -3   423.50
5         C     Z    5  Sell             -2   425.50
6         C     Z    5  Sell             -3   425.25
7         C     Z    5  Sell             -2   426.00
8         C     Z    5  Sell             -2   426.75
9        CC     U    5   Buy              5  3328.00
10       SB     V    5   Buy              5    11.65
11       SB     V    5   Buy              5    11.64
12       SB     V    5   Buy              2    11.60

I need the weighted average of price (weighted by adjusted_lots) and the sum of adjusted_lots, grouped by all the other columns, i.e. grouped by (contract, month, year and buys).

A similar solution in R is achieved with the following dplyr code, but I have been unable to do the same in pandas.

> newdf = df %>%
select ( contract , month , year , buys , adjusted_lots , price ) %>%
group_by( contract , month , year ,  buys) %>%
summarise(qty = sum( adjusted_lots) , avgpx = weighted.mean(x = price , w = adjusted_lots) , comdty = "Comdty" )


> newdf
Source: local data frame [4 x 6]


contract month year comdty qty     avgpx
1        C     Z    5 Comdty -19  424.8289
2       CC     U    5 Comdty   5 3328.0000
3       SB     V    5 Comdty  12   11.6375
4        W     Z    5 Comdty  -5  554.8500

Is it possible to achieve the same with groupby, or some other solution?


EDIT: updated the aggregation so it works with recent versions of pandas

To pass multiple functions to a groupby object, pass a tuple of (column, aggregation function) for each output column:

import numpy as np

# Define a lambda function to compute the weighted mean,
# using "adjusted_lots" as the weights:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])


# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)


# Group by and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
price_weighted_mean=("price", wm))


adjusted_lots  price_weighted_mean
contract month year buys
C        Z     5    Sell            -19           424.828947
CC       U     5    Buy               5          3328.000000
SB       V     5    Buy              12            11.637500
W        Z     5    Sell             -5           554.850000

Hope this helps.

[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling

Doing the weighted average with groupby(...).apply(...) can be very slow (about 100x slower than the following). See my answer (and those of others) on this thread.

import pandas as pd


def weighted_average(df, data_col, weight_col, by_col):
    # Precompute data * weight, and zero out weights where data is missing,
    # so NaNs in data_col do not bias the result:
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
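
For the question's data, a call might look like this (a sketch; the column names are taken from the question):

# Hypothetical usage on the question's dataframe:
weighted_average(df, 'price', 'adjusted_lots', ['contract', 'month', 'year', 'buys'])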

The solution that uses a dict of aggregation functions for renaming is deprecated (since pandas 0.20) and emits a FutureWarning (here under version 0.22):

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)

Use groupby apply and return a Series to rename columns, as discussed in: Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")

def my_agg(x):
    # Weighted mean of price, weighted by adjusted_lots:
    names = {'weighted_ave_price': (x['adjusted_lots'] * x['price']).sum() / x['adjusted_lots'].sum()}
    return pd.Series(names, index=['weighted_ave_price'])

produces the same result:

df.groupby(["contract", "month", "year", "buys"]).apply(my_agg)


weighted_ave_price
contract month year buys
C        Z     5    Sell          424.828947
CC       U     5    Buy          3328.000000
SB       V     5    Buy            11.637500
W        Z     5    Sell          554.850000

Wouldn't it be a lot simpler to do this?

  1. Multiply (adjusted_lots * price) into a new column "X"
  2. Use groupby().sum() on columns "X" and "adjusted_lots" to get the grouped df_grouped
  3. Compute the weighted average on df_grouped as df_grouped['X'] / df_grouped['adjusted_lots'], as sketched below
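
A minimal sketch of those three steps on the question's data (the temporary column "X" and the result column "weighted_avg_price" are illustrative names, not from the original):

# 1. Multiply weight by value into a temporary column:
df['X'] = df['adjusted_lots'] * df['price']
# 2. Group and sum both the products and the weights:
df_grouped = df.groupby(['contract', 'month', 'year', 'buys'])[['X', 'adjusted_lots']].sum()
# 3. Divide to get the weighted average per group:
df_grouped['weighted_avg_price'] = df_grouped['X'] / df_grouped['adjusted_lots']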

With datar, you don't have to learn pandas APIs to port your R code:

>>> from datar.all import f, tibble, c, rep, select, summarise, sum, weighted_mean, group_by
>>> df = tibble(
...     contract=c('W', rep('C', 8), 'CC', rep('SB', 3)),
...     month=c(rep('Z', 9), 'U', rep('V', 3)),
...     year=5,
...     buys=c(rep('Sell', 9), rep('Buy', 4)),
...     adjusted_lots=[-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
...     price=[554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25, 426.00, 426.75, 3328.00, 11.65, 11.64, 11.60]
... )
>>> df
contract month  year  buys  adjusted_lots    price
0         W     Z     5  Sell             -5   554.85
1         C     Z     5  Sell             -3   424.50
2         C     Z     5  Sell             -2   424.00
3         C     Z     5  Sell             -2   423.75
4         C     Z     5  Sell             -3   423.50
5         C     Z     5  Sell             -2   425.50
6         C     Z     5  Sell             -3   425.25
7         C     Z     5  Sell             -2   426.00
8         C     Z     5  Sell             -2   426.75
9        CC     U     5   Buy              5  3328.00
10       SB     V     5   Buy              5    11.65
11       SB     V     5   Buy              5    11.64
12       SB     V     5   Buy              2    11.60
>>> newdf = df >> \
...   select(f.contract, f.month, f.year, f.buys, f.adjusted_lots, f.price) >> \
...   group_by(f.contract, f.month, f.year, f.buys) >> \
...   summarise(
...       qty = sum(f.adjusted_lots),
...       avgpx = weighted_mean(x = f.price , w = f.adjusted_lots),
...       comdty = "Comdty"
...   )
[2021-05-24 13:11:03][datar][   INFO] `summarise()` has grouped output by ['contract', 'month', 'year'] (override with `_groups` argument)
>>>
>>> newdf
contract month  year  buys  qty        avgpx  comdty
0        C     Z     5  Sell  -19   424.828947  Comdty
1       CC     U     5   Buy    5  3328.000000  Comdty
2       SB     V     5   Buy   12    11.637500  Comdty
3        W     Z     5  Sell   -5   554.850000  Comdty
[Groups: ['contract', 'month', 'year'] (n=4)]

I am the author of the package. Feel free to submit issues if you have any questions.

ErnestScribbler's answer is much faster than the accepted solution. Here is a multivariate analogue:

import pandas as pd


def weighted_average(df, data_col, weight_col, by_col):
    '''Now data_col can be a list of variables.'''
    # Products of data and weights; weights are zeroed where data is missing:
    df_data = df[data_col].multiply(df[weight_col], axis='index')
    df_weight = pd.notnull(df[data_col]).multiply(df[weight_col], axis='index')
    df_data[by_col] = df[by_col]
    df_weight[by_col] = df[by_col]
    result = df_data.groupby(by_col).sum() / df_weight.groupby(by_col).sum()
    return result
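
For the question's data, a hypothetical call (note that data_col is now a list, here with a single element) would be:

weighted_average(df, ['price'], 'adjusted_lots', ['contract', 'month', 'year', 'buys'])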

I came across this thread when confronted with a similar problem. In my case, I wanted to compute a weighted quarterback rating in case more than one quarterback attempted a pass in a given NFL game.

I may change the code if I start running into significant performance issues as I scale. For now, I preferred squeezing my solution into the .agg function alongside other transforms. Happy to see if someone has a simpler solution to achieve the same end. Ultimately, I employed a closure pattern.

The magic of the closure approach, if this is an unfamiliar pattern to a future reader, is that I can still return a simple function to pandas' .agg() method, but I get to do so with some additional information preconfigured from the top-level factory function.

import numpy as np


def weighted_mean_factory(*args, **kwargs):
    # Capture the weights once; the returned function closes over them.
    weights = kwargs.get('w').copy()

    def weighted_mean(x):
        # Mask out missing values so they don't enter the average:
        x_mask = ~np.isnan(x)
        w = weights.loc[x.index]

        # Fail loudly if the group contains no non-missing values:
        if not x_mask.any():
            raise ValueError('there are no non-missing x variable values')

        return np.average(x[x_mask], weights=w[x_mask])

    return weighted_mean


res_df = df.groupby(['game_id', 'team'])\
    .agg(pass_player_cnt=('attempts', count_is_not_zero),  # count_is_not_zero: the author's own helper, not shown here
         completions=('completions', 'sum'),
         attempts=('attempts', 'sum'),
         pass_yds=('pass_yards', 'sum'),
         pass_tds=('pass_tds', 'sum'),
         pass_int=('pass_int', 'sum'),
         sack_taken=('sacks_taken', 'sum'),
         sack_yds_loss=('sack_yds_loss', 'sum'),
         longest_completion=('longest_completion', 'max'),
         qbr_w_avg=('qb_rating', weighted_mean_factory(x='qb_rating', w=df['attempts'])))

Some basic benchmarking stats on a DataFrame with the shape (5436, 31) are below and are not cause for concern on my end in terms of performance at this stage:

149 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This combines the original approach by jrjc with the closure approach by MB. It has the advantage of being able to reuse the closure function.

import numpy as np
import pandas as pd


def group_weighted_mean_factory(df: pd.DataFrame, weight_col_name: str):
    # Ref: https://stackoverflow.com/a/69787938/
    def group_weighted_mean(x):
        try:
            return np.average(x, weights=df.loc[x.index, weight_col_name])
        except ZeroDivisionError:
            # All weights in the group are zero; fall back to the unweighted mean.
            return np.average(x)
    return group_weighted_mean


df = ...  # Define
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(...)  # Define
agg_df = g.agg({'price': group_weighted_mean})
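
For the question's data, the two "Define" placeholders might be filled in like this (the grouping columns here are an assumption taken from the question, not part of the original answer):

group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(["contract", "month", "year", "buys"])
agg_df = g.agg({'price': group_weighted_mean})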