Groupby weighted average and sum in pandas dataframe

I have a dataframe:

    Out[78]:
contract month year  buys  adjusted_lots    price
0         W     Z    5  Sell             -5   554.85
1         C     Z    5  Sell             -3   424.50
2         C     Z    5  Sell             -2   424.00
3         C     Z    5  Sell             -2   423.75
4         C     Z    5  Sell             -3   423.50
5         C     Z    5  Sell             -2   425.50
6         C     Z    5  Sell             -3   425.25
7         C     Z    5  Sell             -2   426.00
8         C     Z    5  Sell             -2   426.75
9        CC     U    5   Buy              5  3328.00
10       SB     V    5   Buy              5    11.65
11       SB     V    5   Buy              5    11.64
12       SB     V    5   Buy              2    11.60

I need the weighted average of price (weighted by adjusted_lots) and the sum of adjusted_lots, grouped by all the other columns, i.e. grouped by (contract, month, year and buys).

A similar solution in R is achieved with the following dplyr code, but I have been unable to do the same in pandas.

> newdf = df %>%
select ( contract , month , year , buys , adjusted_lots , price ) %>%
group_by( contract , month , year ,  buys) %>%
summarise(qty = sum( adjusted_lots) , avgpx = weighted.mean(x = price , w = adjusted_lots) , comdty = "Comdty" )


> newdf
Source: local data frame [4 x 6]


contract month year comdty qty     avgpx
1        C     Z    5 Comdty -19  424.8289
2       CC     U    5 Comdty   5 3328.0000
3       SB     V    5 Comdty  12   11.6375
4        W     Z    5 Comdty  -5  554.8500

Is it possible to achieve the same with groupby, or some other solution?


EDIT: updated the aggregation so it works with recent versions of pandas

To pass multiple functions to a groupby object, pass a tuple of (column, aggregation function) for each output column:

import numpy as np

# Define a lambda function to compute the weighted mean,
# using "adjusted_lots" as the weights:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])


# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)


# Group by and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
price_weighted_mean=("price", wm))


adjusted_lots  price_weighted_mean
contract month year buys
C        Z     5    Sell            -19           424.828947
CC       U     5    Buy               5          3328.000000
SB       V     5    Buy              12            11.637500
W        Z     5    Sell             -5           554.850000

Hope this helps.

[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling

Doing the weighted average with groupby(...).apply(...) can be very slow (about 100x slower than the following). See my answer (and those of others) on this thread.

import pandas as pd


def weighted_average(df, data_col, weight_col, by_col):
    # Precompute data * weight, and zero out weights where data is missing,
    # so NaNs in data_col do not bias the result:
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
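
For the question's data, a call might look like this (a sketch; the column names are taken from the question):

# Hypothetical usage on the question's dataframe:
weighted_average(df, 'price', 'adjusted_lots', ['contract', 'month', 'year', 'buys'])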

The solution that uses a dict of aggregation functions for renaming is deprecated (since pandas 0.20) and emits a FutureWarning (here under version 0.22):

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)

Use groupby apply and return a Series to rename columns, as discussed in: Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")

def my_agg(x):
    # Weighted mean of price, weighted by adjusted_lots:
    names = {'weighted_ave_price': (x['adjusted_lots'] * x['price']).sum() / x['adjusted_lots'].sum()}
    return pd.Series(names, index=['weighted_ave_price'])

produces the same result:

df.groupby(["contract", "month", "year", "buys"]).apply(my_agg)


weighted_ave_price
contract month year buys
C        Z     5    Sell          424.828947
CC       U     5    Buy          3328.000000
SB       V     5    Buy            11.637500
W        Z     5    Sell          554.850000

Wouldn't it be a lot simpler to do this?

  1. Multiply (adjusted_lots * price) into a new column "X"
  2. Use groupby().sum() on columns "X" and "adjusted_lots" to get the grouped df_grouped
  3. Compute the weighted average on df_grouped as df_grouped['X'] / df_grouped['adjusted_lots'], as sketched below
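
A minimal sketch of those three steps on the question's data (the temporary column "X" and the result column "weighted_avg_price" are illustrative names, not from the original):

# 1. Multiply weight by value into a temporary column:
df['X'] = df['adjusted_lots'] * df['price']
# 2. Group and sum both the products and the weights:
df_grouped = df.groupby(['contract', 'month', 'year', 'buys'])[['X', 'adjusted_lots']].sum()
# 3. Divide to get the weighted average per group:
df_grouped['weighted_avg_price'] = df_grouped['X'] / df_grouped['adjusted_lots']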

With datar, you don't have to learn pandas APIs to port your R code:

>>> from datar.all import f, tibble, c, rep, select, summarise, sum, weighted_mean, group_by
>>> df = tibble(
...     contract=c('W', rep('C', 8), 'CC', rep('SB', 3)),
...     month=c(rep('Z', 9), 'U', rep('V', 3)),
...     year=5,
...     buys=c(rep('Sell', 9), rep('Buy', 4)),
...     adjusted_lots=[-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
...     price=[554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25, 426.00, 426.75, 3328.00, 11.65, 11.64, 11.60]
... )
>>> df
contract month  year  buys  adjusted_lots    price
0         W     Z     5  Sell             -5   554.85
1         C     Z     5  Sell             -3   424.50
2         C     Z     5  Sell             -2   424.00
3         C     Z     5  Sell             -2   423.75
4         C     Z     5  Sell             -3   423.50
5         C     Z     5  Sell             -2   425.50
6         C     Z     5  Sell             -3   425.25
7         C     Z     5  Sell             -2   426.00
8         C     Z     5  Sell             -2   426.75
9        CC     U     5   Buy              5  3328.00
10       SB     V     5   Buy              5    11.65
11       SB     V     5   Buy              5    11.64
12       SB     V     5   Buy              2    11.60
>>> newdf = df >> \
...   select(f.contract, f.month, f.year, f.buys, f.adjusted_lots, f.price) >> \
...   group_by(f.contract, f.month, f.year, f.buys) >> \
...   summarise(
...       qty = sum(f.adjusted_lots),
...       avgpx = weighted_mean(x = f.price , w = f.adjusted_lots),
...       comdty = "Comdty"
...   )
[2021-05-24 13:11:03][datar][   INFO] `summarise()` has grouped output by ['contract', 'month', 'year'] (override with `_groups` argument)
>>>
>>> newdf
contract month  year  buys  qty        avgpx  comdty
0        C     Z     5  Sell  -19   424.828947  Comdty
1       CC     U     5   Buy    5  3328.000000  Comdty
2       SB     V     5   Buy   12    11.637500  Comdty
3        W     Z     5  Sell   -5   554.850000  Comdty
[Groups: ['contract', 'month', 'year'] (n=4)]

I am the author of the package. Feel free to submit issues if you have any questions.

ErnestScribbler's answer is much faster than the accepted solution. Here is a multivariate analogue:

import pandas as pd


def weighted_average(df, data_col, weight_col, by_col):
    '''Now data_col can be a list of variables.'''
    # Products of data and weights; weights are zeroed where data is missing:
    df_data = df[data_col].multiply(df[weight_col], axis='index')
    df_weight = pd.notnull(df[data_col]).multiply(df[weight_col], axis='index')
    df_data[by_col] = df[by_col]
    df_weight[by_col] = df[by_col]
    result = df_data.groupby(by_col).sum() / df_weight.groupby(by_col).sum()
    return result
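
For the question's data, a hypothetical call (note that data_col is now a list, here with a single element) would be:

weighted_average(df, ['price'], 'adjusted_lots', ['contract', 'month', 'year', 'buys'])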

I came across this thread when confronted with a similar problem. In my case, I wanted to compute a weighted quarterback rating in case more than one quarterback attempted a pass in a given NFL game.

I may change the code if I start running into significant performance issues as I scale. For now, I preferred squeezing my solution into the .agg function alongside other transforms. Happy to see if someone has a simpler solution to achieve the same end. Ultimately, I employed a closure pattern.

The magic of the closure approach, if this is an unfamiliar pattern to a future reader, is that I can still return a simple function to pandas' .agg() method, but I get to do so with some additional information preconfigured from the top-level factory function.

import numpy as np


def weighted_mean_factory(*args, **kwargs):
    # Capture the weights once; the returned function closes over them.
    weights = kwargs.get('w').copy()

    def weighted_mean(x):
        # Mask out missing values so they don't enter the average:
        x_mask = ~np.isnan(x)
        w = weights.loc[x.index]

        # Fail loudly if the group contains no non-missing values:
        if not x_mask.any():
            raise ValueError('there are no non-missing x variable values')

        return np.average(x[x_mask], weights=w[x_mask])

    return weighted_mean


res_df = df.groupby(['game_id', 'team'])\
    .agg(pass_player_cnt=('attempts', count_is_not_zero),  # count_is_not_zero: the author's own helper, not shown here
         completions=('completions', 'sum'),
         attempts=('attempts', 'sum'),
         pass_yds=('pass_yards', 'sum'),
         pass_tds=('pass_tds', 'sum'),
         pass_int=('pass_int', 'sum'),
         sack_taken=('sacks_taken', 'sum'),
         sack_yds_loss=('sack_yds_loss', 'sum'),
         longest_completion=('longest_completion', 'max'),
         qbr_w_avg=('qb_rating', weighted_mean_factory(x='qb_rating', w=df['attempts'])))

Some basic benchmarking stats on a DataFrame with the shape (5436, 31) are below and are not cause for concern on my end in terms of performance at this stage:

149 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This combines the original approach by jrjc with the closure approach by MB. It has the advantage of being able to reuse the closure function.

import numpy as np
import pandas as pd


def group_weighted_mean_factory(df: pd.DataFrame, weight_col_name: str):
    # Ref: https://stackoverflow.com/a/69787938/
    def group_weighted_mean(x):
        try:
            return np.average(x, weights=df.loc[x.index, weight_col_name])
        except ZeroDivisionError:
            # All weights in the group are zero; fall back to the unweighted mean.
            return np.average(x)
    return group_weighted_mean


df = ...  # Define
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(...)  # Define
agg_df = g.agg({'price': group_weighted_mean})
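
For the question's data, the two "Define" placeholders might be filled in like this (the grouping columns here are an assumption taken from the question, not part of the original answer):

group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(["contract", "month", "year", "buys"])
agg_df = g.agg({'price': group_weighted_mean})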