Pandas apply function that returns multiple values to rows in a pandas DataFrame

I have a DataFrame with a time index and 3 columns containing the coordinates of a 3D vector:

                         x             y             z
ts
2014-05-15 10:38         0.120117      0.987305      0.116211
2014-05-15 10:39         0.117188      0.984375      0.122070
2014-05-15 10:40         0.119141      0.987305      0.119141
2014-05-15 10:41         0.116211      0.984375      0.120117
2014-05-15 10:42         0.119141      0.983398      0.118164

I want to apply a transformation to each row that also returns a vector:

def myfunc(a, b, c):
    # do something
    return e, f, g

But if I do:

df.apply(myfunc, axis=1)

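I end up with a pandas Series whose elements are tuples. This is because apply takes the result of myfunc without unpacking it. How can I change myfunc so that I obtain a new df with 3 columns?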
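The behaviour described above can be reproduced with a small sketch. The transformation inside `myfunc` is a hypothetical placeholder, since the question does not specify one. Note also that with `axis=1` the function receives each row as a single Series, not as three separate arguments:

```python
import pandas as pd

# Two rows of the sample data are enough to show the effect.
df = pd.DataFrame(
    {"x": [0.120117, 0.117188], "y": [0.987305, 0.984375], "z": [0.116211, 0.122070]},
    index=pd.to_datetime(["2014-05-15 10:38", "2014-05-15 10:39"]),
)
df.index.name = "ts"


def myfunc(row):
    # Hypothetical transformation, chosen only for illustration.
    return row["x"] + 1, row["y"] + 1, row["z"] + 1


result = df.apply(myfunc, axis=1)
print(type(result).__name__)          # Series
print(type(result.iloc[0]).__name__)  # tuple
```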

Edit:

All the solutions below work. The Series solution allows column names, while the list solution seems to execute faster.

def myfunc1(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return pd.Series([e,f,g], index=['a', 'b', 'c'])


def myfunc2(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return [e,f,g]


%timeit df.apply(myfunc1 ,axis=1)


100 loops, best of 3: 4.51 ms per loop


%timeit df.apply(myfunc2 ,axis=1)


100 loops, best of 3: 2.75 ms per loop

Return a Series and apply will put the values in a DataFrame.

def myfunc(a, b, c):
    # do something
    return pd.Series([e, f, g])

This has the bonus that you can give labels to each of the resulting columns. If you return a DataFrame it just inserts multiple rows for the group.

Found a possible solution, by changing myfunc to return an np.array like this:

import numpy as np


def myfunc(a, b, c):
    # do something
    return np.array((e, f, g))

Is there any better solution?

Just return a list instead of a tuple.

In [81]: df
Out[81]:
                            x         y         z
ts
2014-05-15 10:38:00  0.120117  0.987305  0.116211
2014-05-15 10:39:00  0.117188  0.984375  0.122070
2014-05-15 10:40:00  0.119141  0.987305  0.119141
2014-05-15 10:41:00  0.116211  0.984375  0.120117
2014-05-15 10:42:00  0.119141  0.983398  0.118164


[5 rows x 3 columns]


In [82]: def myfunc(args):
   ....:        e=args[0] + 2*args[1]
   ....:        f=args[1]*args[2] +1
   ....:        g=args[2] + args[0] * args[1]
   ....:        return [e,f,g]
   ....:


In [83]: df.apply(myfunc ,axis=1)
Out[83]:
                            x         y         z
ts
2014-05-15 10:38:00  2.094727  1.114736  0.234803
2014-05-15 10:39:00  2.085938  1.120163  0.237427
2014-05-15 10:40:00  2.093751  1.117629  0.236770
2014-05-15 10:41:00  2.084961  1.118240  0.234512
2014-05-15 10:42:00  2.085937  1.116202  0.235327

Based on the excellent answer by @U2EF1, I've created a handy function that applies a specified function that returns tuples to a dataframe field, and expands the result back to the dataframe.

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)

Usage:

df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['A'])
print(df)
#    A
# a  1
# b  2
# c  3


def func(x):
    return x*x, x*x*x


print(apply_and_concat(df, 'A', func, ['x^2', 'x^3']))
#    A  x^2  x^3
# a  1    1    1
# b  2    4    8
# c  3    9   27

Hope it helps someone.

I tried returning a tuple (I was using functions like scipy.stats.pearsonr, which return that kind of structure), but it returned a 1D Series instead of the DataFrame I expected. Creating a Series manually performed worse, so I fixed it using result_type, as explained in the official API documentation:

Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.

So you could edit your code this way:

def myfunc(a, b, c):
    # do something
    return (e, f, g)


df.apply(myfunc, axis=1, result_type='expand')
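As a runnable sketch of this approach: the transformation and the column names `e`, `f`, `g` below are hypothetical placeholders, and the function takes the row as a single argument because that is how apply calls it with `axis=1`. The expanded columns come back labelled 0, 1, 2, so they are renamed explicitly here:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})


def myfunc(row):
    # Hypothetical transformation; returns a plain tuple per row.
    return row["x"] + row["y"], row["y"] * row["z"], row["z"] - row["x"]


expanded = df.apply(myfunc, axis=1, result_type="expand")
# The expanded columns are labelled 0, 1, 2 by default; rename them explicitly.
expanded.columns = ["e", "f", "g"]
print(expanded)
```

A tuple-returning library function can be used the same way, provided it is wrapped so that it accepts one row.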

In pandas 1.0.5, DataFrame.apply has a result_type parameter that can help here. From the docs:

These only act when axis=1 (columns):


‘expand’ : list-like results will be turned into columns.

‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.

‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
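The three options quoted above can be compared side by side. This is only a sketch: the doubling function is a hypothetical stand-in, and it returns one value per input column so that 'broadcast' has a matching shape to fill:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})


def myfunc(row):
    # Hypothetical transformation returning one value per input column.
    return [row["x"] * 2, row["y"] * 2, row["z"] * 2]


# DataFrame whose columns are labelled 0, 1, 2
expanded = df.apply(myfunc, axis=1, result_type="expand")
# Series whose elements are the returned lists
reduced = df.apply(myfunc, axis=1, result_type="reduce")
# DataFrame that keeps the original columns x, y, z and the original index
broadcast = df.apply(myfunc, axis=1, result_type="broadcast")

print(expanded.columns.tolist())
print(type(reduced).__name__)
print(broadcast.columns.tolist())
```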

Some of the other answers contain mistakes, so I have summarized the working approaches below.

Prepare the dataset. The pandas version used is 1.1.5.

import numpy as np
import pandas as pd
import timeit


# check pandas version
print(pd.__version__)
# 1.1.5


# prepare DataFrame
df = pd.DataFrame({
    'x': [0.120117, 0.117188, 0.119141, 0.116211, 0.119141],
    'y': [0.987305, 0.984375, 0.987305, 0.984375, 0.983398],
    'z': [0.116211, 0.122070, 0.119141, 0.120117, 0.118164]},
    index=[
        '2014-05-15 10:38',
        '2014-05-15 10:39',
        '2014-05-15 10:40',
        '2014-05-15 10:41',
        '2014-05-15 10:42'],
    columns=['x', 'y', 'z'])
df.index.name = 'ts'
#                          x         y         z
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211
# 2014-05-15 10:39  0.117188  0.984375  0.122070
# 2014-05-15 10:40  0.119141  0.987305  0.119141
# 2014-05-15 10:41  0.116211  0.984375  0.120117
# 2014-05-15 10:42  0.119141  0.983398  0.118164

Solution 01.

Return a pd.Series from the applied function.

def myfunc1(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return pd.Series([e, f, g])


df[['e', 'f', 'g']] = df.apply(myfunc1, axis=1)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327


t1 = timeit.timeit(
    'df.apply(myfunc1, axis=1)',
    globals=dict(df=df, myfunc1=myfunc1), number=10000)
print(round(t1, 3), 'seconds')
# 14.571 seconds

Solution 02.

Use result_type='expand' when applying.

def myfunc2(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return [e, f, g]


df[['e', 'f', 'g']] = df.apply(myfunc2, axis=1, result_type='expand')
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327


t2 = timeit.timeit(
    "df.apply(myfunc2, axis=1, result_type='expand')",
    globals=dict(df=df, myfunc2=myfunc2), number=10000)
print(round(t2, 3), 'seconds')
# 9.907 seconds

Solution 03.

If you want it to run faster, use np.vectorize. Note that with np.vectorize the function must take each column as a separate argument instead of a single args row.

def myfunc3(args0, args1, args2):
    e = args0 + 2*args1
    f = args1*args2 + 1
    g = args2 + args0 * args1
    return [e, f, g]


df[['e', 'f', 'g']] = pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327


t3 = timeit.timeit(
    "pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)",
    globals=dict(pd=pd, np=np, df=df, myfunc3=myfunc3), number=10000)
print(round(t3, 3), 'seconds')
# 1.598 seconds
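As a side note, when the transformation is plain arithmetic like the formulas above, it may not need a per-row function at all. The following is a minimal sketch, not one of the timed solutions, that assumes the same three formulas and computes them on whole columns, which avoids calling a Python function once per row:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [0.120117, 0.117188],
    "y": [0.987305, 0.984375],
    "z": [0.116211, 0.122070],
})

# Same formulas as myfunc3, applied to entire columns at once.
result = pd.DataFrame({
    "e": df["x"] + 2 * df["y"],
    "f": df["y"] * df["z"] + 1,
    "g": df["z"] + df["x"] * df["y"],
})
print(result.round(6))
```

This only works when every step can be expressed with column-wise operations; functions with branching or external calls still need one of the approaches above.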