Check if a string in a pandas DataFrame column is in a list of strings

Suppose I have a frame like this:

frame = pd.DataFrame({
    "a": ["the cat is blue", "the sky is green", "the dog is black"]
})

To check whether each of those rows contains a certain word, I just have to do this:

frame["b"] = (
    frame.a.str.contains("dog") |
    frame.a.str.contains("cat") |
    frame.a.str.contains("fish")
)

frame["b"] outputs:

0     True
1    False
2     True
Name: b, dtype: bool

If I decide to make a list:

mylist = ["dog", "cat", "fish"]

How do I check whether each row contains any of the words in the list?

frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})


frame
                  a
0   the cat is blue
1  the sky is green
2  the dog is black

The str.contains method accepts a regular expression pattern:

mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)


pattern
'dog|cat|fish'


frame.a.str.contains(pattern)
0     True
1    False
2     True
Name: a, dtype: bool
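One caveat when joining the list directly: an item that contains regex metacharacters (such as '.' or '+') is interpreted as a pattern, not as literal text. A minimal sketch, assuming the items are meant literally, is to pass each one through re.escape before joining. The sample data and the names words and safe_pattern are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical data: "3.50" as a regex would also match "3850".
frame = pd.DataFrame({"a": ["price is 3.50", "price is 3850", "c++ is fun"]})
words = ["3.50", "c++"]  # items containing regex metacharacters

# Escape each item so '.' and '+' are matched literally, then join with '|'.
safe_pattern = "|".join(re.escape(w) for w in words)

mask = frame["a"].str.contains(safe_pattern)
print(mask.tolist())  # [True, False, True]
```

Without the escaping step, the second row would also be flagged, because an unescaped '.' matches any character.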

Because regex patterns are supported, you can also embed flags:

frame = pd.DataFrame({'a' : ['Cat Mr. Nibbles is blue', 'the sky is green', 'the dog is black']})


frame
                         a
0  Cat Mr. Nibbles is blue
1         the sky is green
2         the dog is black


pattern = '(?i)' + '|'.join(mylist)  # a global inline flag must come first in the pattern (an error elsewhere since Python 3.11)


pattern
'(?i)dog|cat|fish'
 

frame.a.str.contains(pattern)
0     True  # Because of the (?i) flag, 'Cat' is also matched to 'cat'
1    False
2     True
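If case-insensitivity is the only flag needed, str.contains also takes a case parameter, which avoids editing the pattern at all. A short sketch on the same sample data:

```python
import pandas as pd

frame = pd.DataFrame(
    {"a": ["Cat Mr. Nibbles is blue", "the sky is green", "the dog is black"]}
)
mylist = ["dog", "cat", "fish"]

# case=False makes the whole regex match case-insensitively,
# so 'Cat' is matched by 'cat' without an inline (?i) flag.
mask = frame["a"].str.contains("|".join(mylist), case=False)
print(mask.tolist())  # [True, False, True]
```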

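For a list of exact values, isin should work. Note that it tests whether the whole cell equals one of the list items, not whether the cell contains one of them: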

print(frame[frame["a"].isin(mylist)])

See pandas.Series.isin() (frame["a"] is a Series).
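To make the difference between the two approaches concrete, here is a small comparison. The sample frame is slightly modified (one cell is exactly "dog") purely for illustration, and the names exact and partial are hypothetical.

```python
import pandas as pd

# Hypothetical variation of the sample data: the second cell is exactly "dog".
frame = pd.DataFrame({"a": ["the cat is blue", "dog", "the dog is black"]})
mylist = ["dog", "cat", "fish"]

# isin: True only where the entire cell equals a list item.
exact = frame["a"].isin(mylist)

# str.contains: True where any list item occurs as a substring.
partial = frame["a"].str.contains("|".join(mylist))

print(exact.tolist())    # [False, True, False]
print(partial.tolist())  # [True, True, True]
```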

Following the comments on the accepted answer about extracting the matched string, this approach can also be tried.

frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})


frame
                  a
0   the cat is blue
1  the sky is green
2  the dog is black

Let us create the list of strings that need to be matched and extracted.

mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)

Now let us create a function that finds and extracts the matching substring.

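import re

def pattern_searcher(search_str: str, search_list: str) -> str:
    # search_list is the regex pattern built above, e.g. 'dog|cat|fish'
    search_obj = re.search(search_list, search_str)
    if search_obj:
        # Slice out the first matched substring
        return_str = search_str[search_obj.start(): search_obj.end()]
    else:
        return_str = 'NA'
    return return_str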

We will use this function with pandas.Series.apply:

frame['matched_str'] = frame['a'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern))

Result:

                  a  matched_str
0   the cat is blue          cat
1  the sky is green           NA
2  the dog is black          dog
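A vectorized alternative to apply, offered as a sketch and not as the answer's own method, is str.extract with a single capture group. It yields NaN where nothing matches, so the 'NA' placeholder is filled in afterwards to mirror the result above.

```python
import pandas as pd

frame = pd.DataFrame(
    {"a": ["the cat is blue", "the sky is green", "the dog is black"]}
)
mylist = ["dog", "cat", "fish"]
pattern = "|".join(mylist)

# One capture group plus expand=False returns a Series holding the
# first match per row (NaN where there is no match).
matched = frame["a"].str.extract(f"({pattern})", expand=False)

# Replace missing values with the same 'NA' placeholder used above.
frame["matched_str"] = matched.fillna("NA")
print(frame["matched_str"].tolist())  # ['cat', 'NA', 'dog']
```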

We can check for several patterns at once by separating them with a pipe (|), for example:

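import re

# Assumes df holds the text in its second column (position 1)
# and the Yes/No result goes in its third column (position 2).
for i in range(len(df)):
    if re.findall(r'car|oxide|gen', df.iat[i, 1]):
        df.iat[i, 2] = 'Yes'
    else:
        df.iat[i, 2] = 'No'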
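The same Yes/No labelling can be sketched without an explicit loop by combining str.contains with numpy.where. The column names text and flag and the sample rows below are assumptions made for illustration, since the original df is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the df used in the loop.
df = pd.DataFrame({"text": ["carbon oxide", "plain water", "hydrogen gas"]})

# Boolean mask: True where any of the three patterns occurs in the text.
mask = df["text"].str.contains(r"car|oxide|gen")

# Map the mask to 'Yes' / 'No' for the whole column at once.
df["flag"] = np.where(mask, "Yes", "No")
print(df["flag"].tolist())  # ['Yes', 'No', 'Yes']
```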