Return multiple columns from pandas apply()

I have a pandas dataframe, df_test. It contains a column 'size' which represents size in bytes. I've calculated KB, MB, and GB using the following code:

import locale
import pandas as pd

locale.setlocale(locale.LC_ALL, '')

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')


df_test




             dir       size       size_kb   size_mb size_gb
0  /Users/uname1     994933      971.6 KB    0.9 MB  0.0 GB
1  /Users/uname2  109338711  106,776.1 KB  104.3 MB  0.1 GB


[2 rows x 5 columns]

I've run this over 120,000 rows, and according to %timeit it takes about 2.97 seconds per column × 3 = ~9 seconds.

Is there any way to make this faster? For example, instead of returning one column at a time from apply and running it three times, can I return all three columns in one pass to insert back into the original dataframe?

The other questions I've found all want to take multiple values and return a single value. I want to take one value and return multiple columns.


Usually, to return multiple values, this is what I do:

import numpy as np

def gimmeMultiple(group):
    x1 = 1
    x2 = 2
    return np.array([[x1, x2]])

def gimmeMultipleDf(group):
    x1 = 1
    x2 = 2
    return pd.DataFrame(np.array([[x1, x2]]), columns=['x1', 'x2'])

df['size'].astype(int).apply(gimmeMultiple)
df['size'].astype(int).apply(gimmeMultipleDf)

Returning a dataframe definitely has its benefits, but sometimes it isn't required. You can look at what apply() returns and play with the function a bit ;)

You can return a Series from the applied function that contains the new data, avoiding the need to iterate three times. Passing axis=1 to apply applies the sizes function to each row of the dataframe, returning a series to add to a new dataframe. This series, s, contains the new values as well as the original data.

def sizes(s):
    s['size_kb'] = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    s['size_mb'] = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    s['size_gb'] = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return s


df_test = df_test.apply(sizes, axis=1)

Using apply and zip will be about 3x as fast as the Series method.

def sizes(s):
    return locale.format("%.1f", s / 1024.0, grouping=True) + ' KB', \
           locale.format("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
           locale.format("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'

df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))

Test results:

Separate df.apply():


100 loops, best of 3: 1.43 ms per loop


Return Series:


100 loops, best of 3: 2.61 ms per loop


Return tuple:


1000 loops, best of 3: 819 µs per loop

Some of the current replies work fine, but I'd like to offer another, maybe more "pandifyed" option. This works for me with current pandas 0.23 (not sure whether it works in earlier versions):

import locale
import pandas as pd

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])


def sizes(s):
    a = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c


df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")

Note that the trick is the result_type parameter of apply, which expands the result into a DataFrame that can be directly assigned to new/existing columns.

Just another readable way. This code adds three new columns and their values, returning a Series from the applied function without extra parameters.

def sizes(s):
    val_kb = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    val_mb = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    val_gb = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return pd.Series([val_kb, val_mb, val_gb], index=['size_kb', 'size_mb', 'size_gb'])


df[['size_kb', 'size_mb', 'size_gb']] = df.apply(lambda x: sizes(x), axis=1)

A general example from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html:

df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)

#    foo  bar
# 0    1    2
# 1    1    2
# 2    1    2

Really cool answers! Thanks Jesse and jaumebonet! Here are some observations of mine regarding:

  • zip(* ...
  • ... result_type="expand")

While expand is somewhat more elegant (pandifyed), zip is at least 2x faster. In the simple example below, I got it 4x faster.

import pandas as pd

dat = [[i, 10 * i] for i in range(1000)]

df = pd.DataFrame(dat, columns=["a", "b"])


def add_and_sub(row):
    add = row["a"] + row["b"]
    sub = row["a"] - row["b"]
    return add, sub


df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))
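The exact speedup depends on your pandas version, data size, and hardware, so it's worth re-measuring yourself. A small sketch with `timeit` that times both variants from the example above (no fixed numbers claimed here):

```python
import timeit

import pandas as pd

df = pd.DataFrame([[i, 10 * i] for i in range(1000)], columns=["a", "b"])

def add_and_sub(row):
    return row["a"] + row["b"], row["a"] - row["b"]

# Time the result_type="expand" variant.
t_expand = timeit.timeit(
    lambda: df.apply(add_and_sub, axis=1, result_type="expand"), number=10)

# Time the zip(*...) variant.
t_zip = timeit.timeit(
    lambda: list(zip(*df.apply(add_and_sub, axis=1))), number=10)

print(f"expand: {t_expand:.4f}s, zip: {t_zip:.4f}s")
```

Both variants produce the same values, so the comparison only measures how the results are unpacked back into columns.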

This gives a new dataframe with two columns from the original one.

import pandas as pd
df = ...
df_with_two_columns = df.apply(lambda row: pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']), axis=1)

The performance between the top answers differs significantly, and Jesse & famaral42 have already discussed this, but it is worth sharing a fair comparison between the top answers, and elaborating on a subtle but important detail of Jesse's answer: the argument passed in to the function also affects performance.

(Python 3.7.4, Pandas 1.0.3)

import pandas as pd
import locale
import timeit


def create_new_df_test():
    df_test = pd.DataFrame([
        {'dir': '/Users/uname1', 'size': 994933},
        {'dir': '/Users/uname2', 'size': 109338711},
    ])
    return df_test


def sizes_pass_series_return_series(series):
    series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return series


def sizes_pass_series_return_tuple(series):
    a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c


def sizes_pass_value_return_tuple(value):
    a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c

Here are the findings:

# 1 - Accepted (Nels11 Answer) - (pass series, return series):
9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Notice how returning tuples is the fastest method, but what is passed in as the argument also affects performance. The differences in code are subtle, but the performance improvement is significant.

Test #4 (passing in a single value) is twice as fast as test #3 (passing in a series), even though the operations performed are ostensibly identical.

But there's more...

# 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.

Here is the code used to run the tests:

# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('\nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)


print('\nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")


print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))


print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))

I believe version 1.1 breaks the behavior suggested in the top answer here.

import pandas as pd

def test_func(row):
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row


df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)

The above code running on pandas 1.1.0 returns:

   a  b   c  d
0  1  i  1i  2
1  1  i  1i  2
2  1  i  1i  2

While in pandas 1.0.5 it returned:

   a   b    c  d
0  1   i   1i  2
1  2   j   2j  3
2  3   k   3k  4

Which I guess is what you'd expect.

Not sure how the release notes explain this behavior, but as explained here, avoiding mutation of the original rows by copying them resurrects the old behavior. Namely:

def test_func(row):
    row = row.copy()   #  <---- Avoid mutating the original reference
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

This is a very fast way to do it with apply and to_list(). Just return the multiple values as a list, then use to_list():

import time

import pandas as pd

dat = [[i, 10 * i] for i in range(100000)]

df = pd.DataFrame(dat, columns=["a", "b"])


def add_and_div(x):
    add = x + 3
    div = x / 3
    return [add, div]


start = time.time()
df[['c', 'd']] = df['a'].apply(lambda x: add_and_div(x)).to_list()
end = time.time()

print(end - start)  # output: 0.27606

Plain and simple:

def func(item_df):
    return [1, 'Label 1'] if item_df['col_0'] > 0 else [0, 'Label 0']


my_df[['col_1', 'col2']] = my_df.apply(func, axis=1, result_type='expand')

If you use numpy for the math, you can get more than 40x faster than the answers above. Adapting @Rocky K's top two answers. The main difference is running on an actual df of 120k rows. Numpy is way faster at math when you apply your functions array-wise (instead of value-wise). The best answer by far is the third one because it uses numpy for the math. Also note that it only calculates 1024**2 and 1024**3 once each, instead of once for every row, saving 240k calculations. Here are the timings on my machine:

Tuples (pass value, return tuple then zip, new columns dont exist):
Runtime: 10.935037851333618


Tuples (pass value, return tuple then zip, new columns exist):
Runtime: 11.120025157928467


Use numpy for math portions:
Runtime: 0.24799370765686035

And here is the script I used (adapted from Rocky K) to calculate those times:

import numpy as np
import pandas as pd
import locale
import time


size = np.random.random(120000) * 1000000000
data = pd.DataFrame({'Size': size})


def sizes_pass_value_return_tuple(value):
    a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c


print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df1 = data.copy()
start = time.time()
df1['size_kb'],  df1['size_mb'], df1['size_gb'] = zip(*df1['Size'].apply(sizes_pass_value_return_tuple))
end = time.time()
print('Runtime:', end - start, '\n')


print('Tuples (pass value, return tuple then zip, new columns exist):')
df2 = data.copy()
start = time.time()
df2 = pd.concat([df2, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
df2['size_kb'],  df2['size_mb'], df2['size_gb'] = zip(*df2['Size'].apply(sizes_pass_value_return_tuple))
end = time.time()
print('Runtime:', end - start, '\n')


print('Use numpy for math portions:')
df3 = data.copy()
start = time.time()
df3['size_kb'] = (df3.Size.values / 1024).round(1)
df3['size_kb'] = df3.size_kb.astype(str) + ' KB'
df3['size_mb'] = (df3.Size.values / 1024 ** 2).round(1)
df3['size_mb'] = df3.size_mb.astype(str) + ' MB'
df3['size_gb'] = (df3.Size.values / 1024 ** 3).round(1)
df3['size_gb'] = df3.size_gb.astype(str) + ' GB'
end = time.time()
print('Runtime:', end - start, '\n')

I wanted to use apply on a groupby. I tried the approaches suggested here. They helped me, but not all the way.

Adding result_type='expand' didn't work (because I was using apply on a Series rather than a DataFrame?), and with zip(*___) I lost the index.

In case anyone else runs into the same problem, here's how I (finally) solved it:

dfg = df.groupby(by=['Column1','Column2']).Column3.apply(myfunc)
dfres = pd.DataFrame()
dfres['a'], dfres['b'], dfres['c'] = (dfg.apply(lambda x: x[0]), dfg.apply(lambda x: x[1]), dfg.apply(lambda x: x[2]))

Or, if you know a better way, tell me.

And if this is out of scope here, let me know too.
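One possibly simpler variant that keeps the group index: the groupby-apply above yields a Series of tuples, so the result frame can be built from it in one step instead of three extra apply passes. A minimal sketch (the frame and `myfunc` below are hypothetical stand-ins; `myfunc` is assumed to return three values per group):

```python
import pandas as pd

# Toy data standing in for df.
df = pd.DataFrame({
    'Column1': ['x', 'x', 'y', 'y'],
    'Column2': ['p', 'p', 'q', 'q'],
    'Column3': [1, 2, 3, 4],
})

# Hypothetical stand-in returning three values per group.
def myfunc(s):
    return s.sum(), s.mean(), s.max()

dfg = df.groupby(by=['Column1', 'Column2']).Column3.apply(myfunc)

# dfg is a Series of 3-tuples indexed by the group keys, so a DataFrame
# can be built directly from it, preserving the MultiIndex.
dfres = pd.DataFrame(dfg.tolist(), index=dfg.index, columns=['a', 'b', 'c'])
```

This keeps the ('Column1', 'Column2') MultiIndex on dfres while only running the grouped apply once.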