Filter string data based on its string length


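If the rows contain numbers, they will be parsed as floats.

Convert all values to strings after importing from CSV. For better performance, split the lambdas across multiple threads.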
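The float conversion mentioned above can also be avoided at read time. The following is a minimal sketch, assuming an in-memory CSV; the `csv_text` sample is made up for illustration and is not from the original question:

```python
import io

import pandas as pd

# Hypothetical sample data; column A has a missing value in the second row.
csv_text = "A,B\n123,abc\n,abcd\n1234567890,abcdefghij\n"

# Default parsing: the gap in A forces the whole column to float64.
df_default = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every parsed value as text; the gap stays missing (NaN).
df_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Length of each string in A; missing values stay missing instead of being measured.
lengths = df_text["A"].str.len()
print(df_default["A"].dtype)
print(lengths)
```

Reading as text up front means the digits are measured as typed, rather than the string form of a float such as `123.0`.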


Best answer

import pandas as pd


df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)

Applied to filex.csv:

A,B
123,abc
1234,abcd
1234567890,abcdefghij

the code above prints

            A           B
2  1234567890  abcdefghij


A more Pythonic way of filtering out rows based on conditions on other columns and their values:

Assuming a df of:

import pandas as pd

data = {
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
}

df = pd.DataFrame(data)
df:
age        cars  names
0   1       Civic  Alice
1   4         BMW    Zac
2   2  Mitsubishi   Anna
3   0        Benz      O

Then:

df[
    df["names"].apply(lambda x: len(x) > 1)
    & df["cars"].apply(lambda x: "i" in x)
    & df["age"].apply(lambda x: int(x) < 2)
]

We will have:

  age   cars  names
0   1  Civic  Alice

The conditions above check, in order, the length of each string in names, whether the letter "i" occurs in each string in cars, and whether the integer value of age is less than 2.
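The same three conditions can also be expressed without lambdas, using the vectorized string accessor. This is a sketch of an alternative, not the answer's own method; it rebuilds the sample frame so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
})

mask = (
    (df["names"].str.len() > 1)                   # name longer than one character
    & df["cars"].str.contains("i", regex=False)   # literal "i" somewhere in the car name
    & (df["age"].astype(int) < 2)                 # age is stored as text, so cast first
)
result = df[mask]
print(result)
```

Each comparison is wrapped in parentheses because `&` binds more tightly than `>` and `<` in Python.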


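You can use df['A'].apply(len) to get the length of each string in a column and compare it with the target length. Note that df.apply(len) on the whole DataFrame returns the number of rows in each column, not the string lengths.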
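To make the behaviour of `apply(len)` concrete, here is a small sketch on a made-up two-row frame. It shows that applying `len` to a whole DataFrame measures each column, while applying it to one column measures each string:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"A": ["123", "1234567890"], "B": ["abc", "abcdefghij"]})

per_column = df.apply(len)      # len() of each column Series, i.e. the row count
per_value = df["A"].apply(len)  # len() of each string in column A

# The per-value lengths can be compared to build a row filter.
filtered = df[per_value == 10]
print(per_column)
print(per_value)
print(filtered)
```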


I personally found this way to be the easiest:

# Keep only the rows whose value is exactly 10 characters long
df = df[df['column_name'].str.len() == 10]


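To filter out values whose length is not 10 from columns A and B, I pass a lambda expression to the map() function. Here map() is always applied to a Series object.

df = df[df['A'].map(lambda x: len(str(x)) == 10)]
df = df[df['B'].map(lambda x: len(str(x)) == 10)]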
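One caveat with the `len(str(x))` pattern is that a missing value is first turned into the text `nan`, which has length 3. The sketch below uses a made-up Series and a target length of 3 to show the difference from `.str.len()`, which leaves missing values unmeasured:

```python
import pandas as pd

# Hypothetical object-dtype column with one missing value.
s = pd.Series(["abc", float("nan"), "abcdefghij"], dtype=object)

via_str = s.map(lambda x: len(str(x)))  # NaN becomes "nan", so its length is 3
via_accessor = s.str.len()              # NaN stays NaN

mask_str = via_str == 3            # the missing row passes the filter
mask_accessor = via_accessor == 3  # the missing row is excluded
print(mask_str.tolist())
print(mask_accessor.tolist())
```

With a target length of 10 the two approaches agree on this data; the difference only matters when the target happens to match the length of the stringified missing value.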


You can use applymap to test all the columns you need at once, then use the .all() method to keep only the rows where both columns are True.

# The mask variable is a DataFrame of booleans, giving True or False for the selected condition
mask = df[['A','B']].applymap(lambda x: len(str(x)) == 10)

# Use the mask to filter the rows: .all() keeps only rows that are all True; .any() could be used for other needs
df = df[mask.all(axis=1)]
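In pandas 2.1 and later, `applymap` is deprecated in favour of `DataFrame.map`. A minimal sketch of the same filter that works on either side of that change, using the sample data from the accepted answer (the `elementwise` name is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["123", "1234", "1234567890"],
    "B": ["abc", "abcd", "abcdefghij"],
})

subset = df[["A", "B"]]

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap.
elementwise = subset.map if hasattr(subset, "map") else subset.applymap

mask = elementwise(lambda x: len(str(x)) == 10)  # DataFrame of booleans
filtered = df[mask.all(axis=1)]                  # rows where every selected column is True
print(filtered)
```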


You can also use query:

df.query('A.str.len() == 10 & B.str.len() == 10')
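A runnable variant of the query approach is sketched below, with each comparison parenthesized and `engine="python"` passed explicitly. Passing the engine is a cautious assumption: some pandas setups may need it for `.str` methods inside `query`, and it is harmless where they do not.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["123", "1234", "1234567890"],
    "B": ["abc", "abcd", "abcdefghij"],
})

# The column names are resolved inside the expression string.
result = df.query(
    "(A.str.len() == 10) and (B.str.len() == 10)",
    engine="python",
)
print(result)
```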


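For string operations such as this, vanilla Python using built-in functions (without lambda) is much faster than apply() or str.len().

Building a boolean mask by mapping len to each string inside a list comprehension is approximately 40-70% faster than apply() and str.len() respectively.

For multiple columns, zip() lets you evaluate the values from different columns together, row by row.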

col_A_len = map(len, df['A'].astype(str))
col_B_len = map(len, df['B'].astype(str))
m = [a==3 and b==3 for a,b in zip(col_A_len, col_B_len)]
df1 = df[m]

For a single column, drop zip() and loop over the column, checking whether the length is equal to 3:

df2 = df[[a==3 for a in map(len, df['A'].astype(str))]]

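This code can be written a little more concisely using the Series.map() method (but it is a little slower than the list comprehension because of pandas overhead):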

df2 = df[df['A'].astype(str).map(len)==3]
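The speed claims above depend on data size, dtype and pandas version, so it is worth measuring on your own data. The following sketch times the three approaches on a made-up column of 100,000 short strings; the function names and the data are assumptions for illustration, and no particular ranking is implied:

```python
import timeit

import pandas as pd

# Hypothetical test data: 100,000 short strings.
df = pd.DataFrame({"A": ["abc", "abcd", "ab", "xyz"] * 25_000})

def with_listcomp():
    # Plain Python: map len over the strings and build a list of booleans.
    return df[[n == 3 for n in map(len, df["A"].astype(str))]]

def with_str_len():
    # Vectorized string accessor.
    return df[df["A"].astype(str).str.len() == 3]

def with_apply():
    # Series.apply with the built-in len.
    return df[df["A"].astype(str).apply(len) == 3]

# Best of 3 repeats, 5 calls per repeat, for each approach.
timings = {
    func.__name__: min(timeit.repeat(func, number=5, repeat=3))
    for func in (with_listcomp, with_str_len, with_apply)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 5 calls")
```

All three functions return the same rows, so the choice between them comes down to readability and the timings you observe.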