考虑以下数据框架:
columns = ['A', 'B', 'C', 'D']
records = [
['foo', 'one', 0.162003, 0.087469],
['bar', 'one', -1.156319, -1.5262719999999999],
['foo', 'two', 0.833892, -1.666304],
['bar', 'three', -2.026673, -0.32205700000000004],
['foo', 'two', 0.41145200000000004, -0.9543709999999999],
['bar', 'two', 0.765878, -0.095968],
['foo', 'one', -0.65489, 0.678091],
['foo', 'three', -1.789842, -1.130922]
]
df = pd.DataFrame.from_records(records, columns=columns)
"""
A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
"""
可以使用以下命令:
df.groupby('A').apply(lambda x: (x['C'] - x['D']))
df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())
但没有下列工作:
df.groupby('A').transform(lambda x: (x['C'] - x['D']))
# KeyError or ValueError: could not broadcast input array from shape (5) into shape (5,3)
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
# KeyError or TypeError: cannot concatenate a non-NDFrame object
为什么? 这个例子就说明文档了似乎表明在组上调用transform
允许执行逐行操作处理:
# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
换句话说,我认为转换本质上是一种特定类型的应用(不聚合的应用)。我哪里错了?
作为参考,下面是上面原始数据框架的构造:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : randn(8), 'D' : randn(8)})