Pandas DataFrame aggregate function using multiple columns

Is there a way to write an aggregation function, as used in the DataFrame.agg method, that has access to more than one column of the data being aggregated? Typical use cases would be weighted average and weighted standard deviation functions.

I would like to be able to write something like:

def wAvg(c, w):
    return ((c * w).sum() / w.sum())


df = DataFrame(....)  # df has columns c and w; I want the weighted
                      # average of c using w as the weight.
df.aggregate({"c": wAvg})  # and somehow tell it to use the w column as weights ...

Yes; use the .apply(...) function, which will be called on each sub-DataFrame. For example:

grouped = df.groupby(keys)


def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()


grouped.apply(wavg)
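For instance, a quick check with a small frame (the column names 'data' and 'weights' and the key 'category' are made up for this sketch):

import pandas as pd

df = pd.DataFrame({'category': ['x', 'x', 'y', 'y'],
                   'data': [1.0, 3.0, 5.0, 7.0],
                   'weights': [0.25, 0.75, 0.5, 0.5]})

grouped = df.groupby('category')
grouped.apply(wavg)
# category
# x    2.5
# y    6.0
# dtype: float64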

The following (based on Wes McKinney's answer) accomplishes exactly what I was looking for. I'd be happy to learn if there's a simpler way of doing this within pandas.

def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg




def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol: sum})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # handle non-numeric columns
            df_ret[dcol] = grouped.agg({dcol: min})
    return df_ret

The df_wavg() function returns a dataframe grouped by the "groupby" column, in which the weights column is aggregated as a sum. The other columns hold the weighted averages or, for non-numeric columns, the result of the min() aggregation.
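For illustration, a hypothetical call (the frame and column names below are invented; 'name' is a non-numeric column that falls through to the min() branch):

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'w': [1.0, 3.0, 2.0, 2.0],
                   'c1': [10.0, 20.0, 30.0, 50.0],
                   'name': ['foo', 'bar', 'baz', 'qux']})

df_wavg(df, 'g', 'w')
#      w    c1 name
# g
# a  4.0  17.5  bar
# b  4.0  40.0  baz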

I do this a lot and found the following quite handy:

def weighed_average(grp):
    return grp._get_numeric_data().multiply(grp['COUNT'], axis=0).sum() / grp['COUNT'].sum()


df.groupby('SOME_COL').apply(weighed_average)

This will compute the weighted average of all the numerical columns in the df and drop non-numeric ones.
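For example, with hypothetical columns (note that COUNT itself is numeric, so its own weighted average shows up in the output too):

df = pd.DataFrame({'SOME_COL': ['a', 'a', 'b', 'b'],
                   'x': [1.0, 2.0, 3.0, 4.0],
                   'COUNT': [1, 3, 2, 2]})

df.groupby('SOME_COL').apply(weighed_average)
#              x  COUNT
# SOME_COL
# a         1.75    2.5
# b         3.50    2.0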

Accomplishing this via groupby(...).apply(...) is non-performant. Here's a solution that I use all the time (essentially using kalu's logic).

def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
    """
    :param values: column(s) to take the average of
    :param weights: column to weight on
    :param groupby_args: args to pass into groupby (e.g. the level you want to group on)
    :param groupby_kwargs: kwargs to pass into groupby
    :return: pandas.Series or pandas.DataFrame
    """
    if isinstance(values, str):
        values = [values]

    ss = []
    for value_col in values:
        df = self.copy()
        prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
        weights_name = 'weights_{w}'.format(w=weights)

        df[prod_name] = df[value_col] * df[weights]
        df[weights_name] = df[weights].where(~df[prod_name].isnull())
        df = df.groupby(*groupby_args, **groupby_kwargs).sum()
        s = df[prod_name] / df[weights_name]
        s.name = value_col
        ss.append(s)
    df = pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]
    return df


pandas.DataFrame.grouped_weighted_average = grouped_weighted_average
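After the monkey-patch, usage would look something like this (a sketch, reusing the small frame from the example further down this page):

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'd1': [5, 10, 100, 30],
                   'weights': [.2, .8, .4, .6]})

df.grouped_weighted_average('d1', 'weights', 'group')
# group
# a     9.0
# b    58.0
# Name: d1, dtype: float64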

My solution is similar to Nathaniel's, only it's for a single column and it doesn't deep-copy the entire data frame each time, which could be prohibitively slow. The performance gain over the groupby(...).apply(...) solution is about 100x(!)

def weighted_average(df, data_col, weight_col, by_col):
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
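A quick sanity check with invented names (the NaN in 'v' is deliberately excluded from both the numerator and the weight total):

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'v': [1.0, 3.0, 5.0, None],
                   'w': [1.0, 1.0, 2.0, 2.0]})

weighted_average(df, 'v', 'w', 'g')
# g
# a    2.0
# b    5.0
# dtype: float64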

It is possible to return any number of aggregated values from a groupby object with apply. Simply return a Series, and the index values will become the new column names.

Let's see a quick example:

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'd1': [5, 10, 100, 30],
                   'd2': [7, 1, 3, 20],
                   'weights': [.2, .8, .4, .6]},
                  columns=['group', 'd1', 'd2', 'weights'])
df


  group   d1  d2  weights
0     a    5   7      0.2
1     a   10   1      0.8
2     b  100   3      0.4
3     b   30  20      0.6

Define a custom function that will be passed to apply. It implicitly accepts a DataFrame - meaning the data parameter is a DataFrame. Notice how it uses multiple columns, which is not possible with the agg groupby method:

import numpy as np

def weighted_average(data):
    d = {}
    d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
    d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
    return pd.Series(d)

Call the groupby apply method with our custom function:

df.groupby('group').apply(weighted_average)


       d1_wa  d2_wa
group
a        9.0    2.2
b       58.0   13.2

You can get better performance by precalculating the weighted totals into new DataFrame columns as explained in other answers and avoid using apply altogether.
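As a sketch of that idea with the frame above (column names like '_d1w' are made up here), the weighted sums can be computed column-wise and then divided by the summed weights, with no apply involved:

tmp = df.assign(_d1w=df['d1'] * df['weights'],
                _d2w=df['d2'] * df['weights'])
g = tmp.groupby('group')[['_d1w', '_d2w', 'weights']].sum()
result = g[['_d1w', '_d2w']].div(g['weights'], axis=0)
result.columns = ['d1_wa', 'd2_wa']
result
#        d1_wa  d2_wa
# group
# a        9.0    2.2
# b       58.0   13.2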

Here's a solution which has the following benefits:

  1. You don't need to define a function in advance
  2. You can use it within a pipe (since it's using lambda)
  3. You can name the resulting column

df.groupby('group') \
  .apply(lambda x: pd.Series({
      'weighted_average': np.average(x.data, weights=x.weights)}))

You can also use the same code to perform multiple aggregations:

df.groupby('group') \
  .apply(lambda x: pd.Series({
      'weighted_average': np.average(x.data, weights=x.weights),
      'regular_average': np.average(x.data)}))

You can implement this function in the following way:

(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()

For example:

df = pd.DataFrame({'groups': [1, 1, 2, 2], 'c': [3, 3, 4, 4], 'w': [5, 5, 6, 6]})
(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()

Result:

groups
1    3.0
2    4.0
dtype: float64

Adding to Wes McKinney's answer, this will rename the aggregated column:

grouped = df.groupby(keys)


def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()


grouped.apply(wavg).reset_index().rename(columns={0: "wavg"})