Find the length of the longest string in a pandas DataFrame column


Best answer

DSM's suggestion seems to be about the best you can get without doing some manual micro-optimization:

%timeit -n 100 df.col1.str.len().max()
100 loops, best of 3: 11.7 ms per loop


%timeit -n 100 df.col1.map(lambda x: len(x)).max()
100 loops, best of 3: 16.4 ms per loop


%timeit -n 100 df.col1.map(len).max()
100 loops, best of 3: 10.1 ms per loop

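Note that explicitly using the str.len() method does not seem to be much of an improvement. If you're not familiar with IPython, which is where the very convenient %timeit syntax comes from, I'd definitely suggest giving it a try for quick testing of things like this.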
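If IPython is not available, a similar comparison can be sketched with the standard-library timeit module. This is only a rough equivalent of the %timeit calls above: the sample DataFrame, the repeat counts, and the names `candidates` and `timings` are assumptions made for illustration, and the numbers will differ from machine to machine.

```python
import timeit

import pandas as pd

# Hypothetical sample data; the DataFrame used in the original timings is not shown.
df = pd.DataFrame({"col1": ["ab", "bcd", "dfe", "efghik"] * 1000})

# The three approaches compared in the answer above.
candidates = {
    "str.len": lambda: df["col1"].str.len().max(),
    "map(lambda)": lambda: df["col1"].map(lambda x: len(x)).max(),
    "map(len)": lambda: df["col1"].map(len).max(),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 20 calls each, converted to milliseconds per call.
    best = min(timeit.repeat(func, number=20, repeat=3)) / 20
    timings[name] = best * 1e3
    print(f"{name:12s} {timings[name]:.3f} ms per loop")
```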


As a minor addition, you might want to loop through all object columns in a DataFrame:

for c in df:
    if df[c].dtype == 'object':
        print('Max length of column %s: %s\n' % (c, df[c].map(len).max()))

This prevents errors being thrown by bool, int and similar column types.

It could be extended to other non-numeric types such as 'string_' and 'unicode_', i.e.:

if df[c].dtype in ('object', 'string_', 'unicode_'):
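The same idea can be sketched as a small helper that returns the lengths instead of printing them. This is not the original answer's code: the function name `max_string_lengths` and the sample frame are hypothetical, the dtype check uses helpers from pandas.api.types rather than string comparison, and it assumes text-like columns hold only strings or missing values.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def max_string_lengths(df):
    """Return {column: longest string length} for text-like columns only."""
    result = {}
    for col in df.columns:
        series = df[col]
        # Skip numeric, bool, datetime and other non-text columns.
        if not (is_object_dtype(series) or is_string_dtype(series)):
            continue
        # .str.len() yields a missing value for missing entries instead of raising.
        longest = series.str.len().max()
        if pd.notna(longest):
            result[col] = int(longest)
    return result


# Hypothetical sample frame for illustration.
sample = pd.DataFrame(
    {"name": ["ab", None, "efghik"], "count": [1, 2, 3], "flag": [True, False, True]}
)
lengths = max_string_lengths(sample)
print(lengths)
```

Using .str.len() rather than map(len) is a deliberate choice here, since map(len) raises a TypeError as soon as a column contains a missing value.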


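Sometimes you want the length of the longest string in bytes. This matters for strings that contain fancy Unicode characters, in which case the length in bytes is greater than the regular length. It can be very relevant in specific situations, e.g. for database writes.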

col_bytes_len = int(df[col_name].astype(bytes).str.len().max())

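Notes:

- Using astype(bytes) is more reliable than using str.encode(encoding='utf-8'), because astype(bytes) also works correctly with a column of mixed dtypes.
- The output is wrapped in int() because it is otherwise a NumPy object rather than a plain Python int.
- If there is an encoding error, then instead of df[col_name].astype(bytes), consider:
  - df[col_name].str.encode('utf-8')
  - df[col_name].str.encode('ascii', errors='backslashreplace') (last choice)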
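To make the difference between character length and byte length concrete, here is a minimal sketch using the str.encode('utf-8') alternative listed above. The sample Series and the names `char_max` and `byte_max` are assumptions for illustration; the column is assumed to contain only strings.

```python
import pandas as pd

# Hypothetical sample column mixing ASCII and non-ASCII text
# ("héllo" and three CJK characters, written with escapes to be unambiguous).
s = pd.Series(["abc", "h\u00e9llo", "\u65e5\u672c\u8a9e"])

# Length in characters (code points).
char_max = int(s.str.len().max())

# Length in bytes after UTF-8 encoding; each encoded value is a bytes object.
byte_max = int(s.str.encode("utf-8").map(len).max())

print(char_max, byte_max)  # 5 9
```

The longest entry by characters is the five-letter word, while the longest by bytes is the three-character CJK string, because each of its characters takes three bytes in UTF-8.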


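Excellent answers, in particular those from Marius and Ricky, which were very helpful.

Given that most of us are optimising for coding time, here is a quick extension to those answers that returns the maximum item length of every column as a Series, sorted by that maximum length: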

mx_dct = {c: df[c].map(lambda x: len(str(x))).max() for c in df.columns}
pd.Series(mx_dct).sort_values(ascending =False)

Or as a one-liner:

pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending =False)

Adapting the original sample, this can be demonstrated as follows:

import pandas as pd


x = [['ab', 'bcd'], ['dfe', 'efghik']]
df = pd.DataFrame(x, columns=['col1','col2'])


print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending =False))

Output:

col2    6
col1    3
dtype: int64


import pandas as pd
import numpy as np


x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10)
df = pd.DataFrame(x, columns=['col1'])


# get longest string index from column
indx = df["col1"].str.len().idxmax()


# get longest string value
df["col1"][indx] # <---------------------
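Since idxmax() returns only the first matching label, a slight variation can be sketched for cases where every row tied for the longest length is wanted. The variable names `lengths`, `longest` and `all_longest` are introduced here for illustration, and the lookup uses .loc rather than the chained indexing shown above.

```python
import numpy as np
import pandas as pd

x = np.repeat(["ab", "bcd", "dfe", "efghik"], 10)
df = pd.DataFrame(x, columns=["col1"])

lengths = df["col1"].str.len()

# Single value, looked up by label with .loc instead of chained indexing.
longest = df.loc[lengths.idxmax(), "col1"]

# All rows whose length equals the maximum (idxmax alone returns only the first).
all_longest = df.loc[lengths == lengths.max(), "col1"]

print(longest, len(all_longest))  # efghik 10
```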


This might be faster (depending on the size of your DataFrame):

    maxsize=[df[x].astype('string').array.astype('U').dtype.itemsize // 4 for x in df.columns]

Or

    maxsize=[df[x].array.astype('U').dtype.itemsize // 4 for x in df.columns]

For small DataFrames it is not needed:

x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10)
df = pd.DataFrame(x, columns=['col1'])
%timeit -n 100 df.col1.str.len().max()
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
%timeit -n 100 df.col1.map(len).max()
%timeit -n 100 [df[x].astype('string').array.astype('U').dtype.itemsize // 4 for x in df.columns]
171 µs ± 5.92 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
126 µs ± 4.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
124 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
143 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But the larger the DataFrame, the faster it is compared with the other approaches:

x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 1000)
df = pd.DataFrame(x, columns=['col1'])
%timeit -n 100 df.col1.str.len().max()
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
%timeit -n 100 df.col1.map(len).max()
%timeit -n 100 [df[x].astype('string').array.astype('U').dtype.itemsize // 4 for x in df.columns]
1.08 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.2 ms ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
878 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
705 µs ± 3.33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)




x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10000)
df = pd.DataFrame(x, columns=['col1'])
%timeit -n 100 df.col1.str.len().max()
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
%timeit -n 100 df.col1.map(len).max()
%timeit -n 100 [df[x].astype('string').array.astype('U').dtype.itemsize // 4 for x in df.columns]
8.87 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.88 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.81 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

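Because I wanted to test the different methods on the same DataFrame, I had to convert the dtype first (df[x].astype('string')).

If the Series already has a string dtype, it is about 10% faster: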

%timeit -n 100 [df[x].array.astype('U').dtype.itemsize // 4 for x in df.columns]
5.26 ms ± 95.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This one is even faster:

%timeit -n 100 [df[x].astype('string').array.astype('S').dtype.itemsize for x in df.columns]
3.89 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%timeit -n 100 [df[x].array.astype('S').dtype.itemsize for x in df.columns]
3.26 ms ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But you might get encoding errors:

UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 15: ordinal not in range(128)

If you let NumPy decide which dtype to use, you automatically know the length of the largest entry:

df.col1.astype('string').array.astype('S')
Out[173]:
array([b'ab', b'ab', b'ab', ..., b'efghik', b'efghik', b'efghik'],
dtype='|S6')
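The reason this works is that NumPy's fixed-width string dtypes are sized to fit the longest element: 'S' uses one byte per character, while 'U' uses four bytes per character, hence the division by 4. A minimal sketch combining both, with a fallback for the encoding error shown above, might look like this. The function name `max_len_via_dtype` is hypothetical, and the input is assumed to contain only strings.

```python
import numpy as np


def max_len_via_dtype(values):
    """Longest string length, read from the fixed-width dtype NumPy picks."""
    arr = np.asarray(values, dtype=object)
    try:
        # 'S' is a byte string dtype: 1 byte per character, ASCII-only input.
        return arr.astype("S").dtype.itemsize
    except UnicodeEncodeError:
        # 'U' stores 4 bytes per character, so divide the item size by 4.
        return arr.astype("U").dtype.itemsize // 4


print(max_len_via_dtype(["ab", "bcd", "efghik"]))  # 6
print(max_len_via_dtype(["ab", "caf\u00f3s"]))     # 5
```

With a DataFrame this could be called as max_len_via_dtype(df[x].to_numpy()) for each column x. Missing values are not handled: converting them to strings would count the length of their text representation.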


You can use this approach to find the longest string itself (not just its index):

import pandas as pd


df = pd.DataFrame(['a', 'aaa', 'aaaaa'], columns=['A'])


# 1. Get index of longest string in column
idx = df.A.str.len().idxmax()
# Index: 2


# 2. Get longest string using df['A'][idx]
print('Longest string in column:', df['A'][idx])
# Longest string in column: aaaaa

Source: https://blog.finxter.com/python-find-longest-string-in-a-dataframe-column/