pandas apply function that returns multiple values to rows in a pandas DataFrame

小开

Return a Series and apply will put the values into a DataFrame.

import pandas as pd

def myfunc(a, b, c):
    # do something that computes e, f, g
    return pd.Series([e, f, g])

This has the bonus that you can give labels to each of the resulting columns. If you return a DataFrame it just inserts multiple rows for the group.
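As a rough sketch of the labeling point: if the function returns a Series built from a dict, the keys should show up as the column names. The toy frame and the names `total` and `product` are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical toy data, not from the original question.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})


def myfunc(row):
    # The index of the returned Series becomes the column labels.
    return pd.Series({"total": row["a"] + row["b"],
                      "product": row["a"] * row["b"]})


result = df.apply(myfunc, axis=1)
print(result)
```

The result keeps the original row index, so it can be joined back to `df` with `pd.concat([df, result], axis=1)` if needed.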

小开

Found a possible solution, by changing myfunc to return an np.array like this:

import numpy as np


def myfunc(a, b, c):
    # do something that computes e, f, g
    return np.array((e, f, g))

Is there a better solution?

小开

Best answer

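Just return a list instead of a tuple. (The output below is from an older pandas release; since pandas 0.23 a list result is kept as a Series of lists unless you pass result_type='expand'.)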

In [81]: df
Out[81]:
                            x         y         z
ts
2014-05-15 10:38:00  0.120117  0.987305  0.116211
2014-05-15 10:39:00  0.117188  0.984375  0.122070
2014-05-15 10:40:00  0.119141  0.987305  0.119141
2014-05-15 10:41:00  0.116211  0.984375  0.120117
2014-05-15 10:42:00  0.119141  0.983398  0.118164


[5 rows x 3 columns]


In [82]: def myfunc(args):
....:        e=args[0] + 2*args[1]
....:        f=args[1]*args[2] +1
....:        g=args[2] + args[0] * args[1]
....:        return [e,f,g]
....:


In [83]: df.apply(myfunc, axis=1)
Out[83]:
                            x         y         z
ts
2014-05-15 10:38:00  2.094727  1.114736  0.234803
2014-05-15 10:39:00  2.085938  1.120163  0.237427
2014-05-15 10:40:00  2.093751  1.117629  0.236770
2014-05-15 10:41:00  2.084961  1.118240  0.234512
2014-05-15 10:42:00  2.085937  1.116202  0.235327
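On a recent pandas the same call may not expand into columns by default. A minimal sketch, assuming pandas 0.23 or later and using made-up numbers, of what comes back and one way to turn it into a DataFrame after the fact (the column names `e`, `f`, `g` are just illustrative labels):

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})


def myfunc(row):
    e = row["x"] + 2 * row["y"]
    f = row["y"] * row["z"] + 1
    g = row["z"] + row["x"] * row["y"]
    return [e, f, g]


# On recent pandas this is a Series whose values are Python lists.
as_series = df.apply(myfunc, axis=1)

# Rebuild a DataFrame from those lists, keeping the original index.
as_frame = pd.DataFrame(as_series.tolist(), index=df.index,
                        columns=["e", "f", "g"])
print(as_frame)
```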

小开

Based on the excellent answer by @U2EF1, I've created a handy function that applies a specified function that returns tuples to a dataframe field, and expands the result back to the dataframe.

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)

Usage:

df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['A'])
print(df)
   A
a  1
b  2
c  3


def func(x):
    return x*x, x*x*x


print(apply_and_concat(df, 'A', func, ['x^2', 'x^3']))


   A  x^2  x^3
a  1    1    1
b  2    4    8
c  3    9   27

Hope it helps someone.

小开

I tried returning a tuple (I was using functions like scipy.stats.pearsonr, which return that kind of structure), but apply returned a 1D Series instead of the DataFrame I expected. Building a Series manually inside the function was slower, so I fixed it with result_type, as explained in the official API documentation:

Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.

So you could edit your code this way:

def myfunc(a, b, c):
    # do something
    return (e, f, g)


df.apply(myfunc, axis=1, result_type='expand')
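For instance, here is a small sketch with a tuple-returning function applied row by row. `row_min_max` and the column names `lo` and `hi` are hypothetical stand-ins for something like pearsonr, chosen only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"a": [1, 5], "b": [4, 2], "c": [3, 9]})


def row_min_max(row):
    # Returns a plain tuple, like many SciPy statistics functions do.
    return (row.min(), row.max())


# result_type='expand' turns each tuple into one row of a DataFrame,
# which is then assigned to two new columns.
df[["lo", "hi"]] = df.apply(row_min_max, axis=1, result_type="expand")
print(df)
```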

小开

In pandas 1.0.5, DataFrame.apply has a result_type parameter that can help here. From the docs:

These only act when axis=1 (columns):

‘expand’ : list-like results will be turned into columns.

‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.

‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
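To make the three options concrete, here is a minimal sketch on a made-up two-column frame. The function `pair` is hypothetical; it returns a list whose length equals the number of columns, which 'broadcast' needs.

```python
import pandas as pd

# Made-up data; two columns so a two-item result can be broadcast.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})


def pair(row):
    return [row["x"] + row["y"], row["x"] * row["y"]]


# 'expand': DataFrame with new integer column labels 0 and 1.
expanded = df.apply(pair, axis=1, result_type="expand")

# 'reduce': Series whose values are the returned lists.
reduced = df.apply(pair, axis=1, result_type="reduce")

# 'broadcast': DataFrame that keeps df's index and column labels.
broadcast = df.apply(pair, axis=1, result_type="broadcast")

print(expanded, reduced, broadcast, sep="\n\n")
```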

小开

Some of the other answers contain mistakes, so I've summarized the working solutions below.

Prepare the dataset. The pandas version used is 1.1.5.

import numpy as np
import pandas as pd
import timeit


# check pandas version
print(pd.__version__)
# 1.1.5


# prepare DataFrame
df = pd.DataFrame({
    'x': [0.120117, 0.117188, 0.119141, 0.116211, 0.119141],
    'y': [0.987305, 0.984375, 0.987305, 0.984375, 0.983398],
    'z': [0.116211, 0.122070, 0.119141, 0.120117, 0.118164]},
    index=[
        '2014-05-15 10:38',
        '2014-05-15 10:39',
        '2014-05-15 10:40',
        '2014-05-15 10:41',
        '2014-05-15 10:42'],
    columns=['x', 'y', 'z'])
df.index.name = 'ts'
#                          x         y         z
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211
# 2014-05-15 10:39  0.117188  0.984375  0.122070
# 2014-05-15 10:40  0.119141  0.987305  0.119141
# 2014-05-15 10:41  0.116211  0.984375  0.120117
# 2014-05-15 10:42  0.119141  0.983398  0.118164

Solution 01.

Return a pd.Series from the applied function.

def myfunc1(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return pd.Series([e, f, g])


df[['e', 'f', 'g']] = df.apply(myfunc1, axis=1)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327


t1 = timeit.timeit(
    'df.apply(myfunc1, axis=1)',
    globals=dict(df=df, myfunc1=myfunc1), number=10000)
print(round(t1, 3), 'seconds')
# 14.571 seconds

Solution 02.

Use result_type='expand' when applying.

def myfunc2(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return [e, f, g]


df[['e', 'f', 'g']] = df.apply(myfunc2, axis=1, result_type='expand')
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327


t2 = timeit.timeit(
    "df.apply(myfunc2, axis=1, result_type='expand')",
    globals=dict(df=df, myfunc2=myfunc2), number=10000)
print(round(t2, 3), 'seconds')
# 9.907 seconds

Solution 03.

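If you want to make it faster, use np.vectorize. Note that with np.vectorize the function cannot take the whole row as a single argument; each column has to be passed as a separate argument. (np.row_stack, used below, is an alias of np.vstack and is deprecated as of NumPy 2.0.)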

def myfunc3(args0, args1, args2):
    e = args0 + 2*args1
    f = args1*args2 + 1
    g = args2 + args0 * args1
    return [e, f, g]


vfunc = np.vectorize(myfunc3, otypes=['O'])
df[['e', 'f', 'g']] = pd.DataFrame(
    np.row_stack(vfunc(df['x'], df['y'], df['z'])), index=df.index)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327


t3 = timeit.timeit(
    "pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)",
    globals=dict(pd=pd, np=np, df=df, myfunc3=myfunc3), number=10000)
print(round(t3, 3), 'seconds')
# 1.598 seconds
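One caveat: the NumPy docs describe np.vectorize as essentially a for loop provided for convenience rather than performance. When the function is plain arithmetic, as it is here, a sketch like the following computes each new column with whole-column operations and needs no per-row call at all. I have not timed it against the solutions above, and it only applies when the logic can be written as column expressions. The frame below reuses the first two rows of the dataset.

```python
import pandas as pd

# First two rows of the dataset above, enough to show the idea.
df = pd.DataFrame({
    "x": [0.120117, 0.117188],
    "y": [0.987305, 0.984375],
    "z": [0.116211, 0.122070]})

# Whole-column arithmetic: each expression is evaluated over all rows at once.
df["e"] = df["x"] + 2 * df["y"]
df["f"] = df["y"] * df["z"] + 1
df["g"] = df["z"] + df["x"] * df["y"]
print(df)
```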