如何确定大熊猫数据框列中列表的长度

如何在不进行迭代的情况下确定列中列表的长度?

我有一个这样的数据框架:

                                                    CreationDate
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]

我正在计算 CreationDate栏中列表的长度,然后创建一个新的 Length栏,如下所示:

df['Length'] = df.CreationDate.apply(lambda x: len(x))

这就给了我这个:

                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

还有比这更简单的方法吗?

90176 次浏览

You can use the str accessor for some list operations as well. In this example,

df['CreationDate'].str.len()

returns the length of each list. See the docs for str.len.

df['Length'] = df['CreationDate'].str.len()
df
Out:
CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

For these operations, vanilla Python is generally faster. pandas handles NaNs though. Here are timings:

ser = pd.Series([random.sample(string.ascii_letters,
random.randint(1, 20)) for _ in range(10**6)])


%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop


%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop


%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop


%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
import pandas as pd


data = {'os': [['ubuntu', 'mac-osx', 'syslinux'], ['ubuntu', 'mod-rewrite', 'laconica', 'apache-2.2'], ['ubuntu', 'nat', 'squid', 'mikrotik']]}
index = ['2013-12-22 15:25:02', '2009-12-14 14:29:32', '2013-12-22 15:42:00']


df = pd.DataFrame(data, index)


# create Length column
df['Length'] = df.os.map(len)


# display(df)
os  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

%timeit

import pandas as pd
import random
import string


random.seed(365)


ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])


%timeit ser.str.len()
252 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%timeit ser.map(len)
220 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%timeit ser.apply(len)
222 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)