# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar
Sometimes a regex search is not required, so specify regex=False to disable it.
# select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.

      col
0     foo
1  foobar
Performance-wise, regex search is slower than substring search:
df2 = pd.concat([df1] * 1000, ignore_index=True)
%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoid regex-based search if you don't need it.
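The `%timeit` lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The contents of `df1` here are an assumption inferred from the outputs shown earlier, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

# Assumed sample data: the original df1 is not defined in this section.
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df2 = pd.concat([df1] * 1000, ignore_index=True)

# Both calls should select the same rows; only the speed differs.
regex_mask = df2['col'].str.contains('foo')
literal_mask = df2['col'].str.contains('foo', regex=False)
same_rows = regex_mask.equals(literal_mask)

# Time 20 repetitions of each search.
regex_time = timeit.timeit(
    lambda: df2['col'].str.contains('foo'), number=20)
literal_time = timeit.timeit(
    lambda: df2['col'].str.contains('foo', regex=False), number=20)

print(f"same rows: {same_rows}")
print(f"regex:   {regex_time:.4f} s for 20 runs")
print(f"literal: {literal_time:.4f} s for 20 runs")
```

The equality check is there because `regex=False` is only a safe swap when the pattern has no regex metacharacters that you actually rely on.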
Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will raise
ValueError: cannot index with vector containing NA / NaN values
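This error typically appears when the column holds missing values: `str.contains` can return NaN for those rows, and a mask containing NaN cannot be used for boolean indexing. One way around it, sketched below, is the `na` parameter of `str.contains`, which sets the result for missing entries. The frame `df_nan` is a made-up example, not data from the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in the searched column.
df_nan = pd.DataFrame({'col': ['foo', 'foobar', np.nan, 'baz']})

# na=False treats missing values as "no match", so the mask is purely boolean.
mask = df_nan['col'].str.contains('foo', na=False)
result = df_nan[mask]

print(result)
```

Passing `na=False` keeps the vectorised string method and simply drops the rows that have no value to search.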
import re

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        # non-string values (e.g. NaN) cannot be searched; treat them as no match
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar
df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)

            col1       col2  TrueFalse
0     cat andhat     1000.0       True
1            hat  2000000.0      False
2  the small dog     1000.0       True
3            fog   330000.0      False
4          pet 3    30000.0      False
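The value of `searchfor` is not shown here. As an illustration only, the sketch below assumes it is an alternation pattern built from a list of terms. The terms `'cat'` and `'dog'` are hypothetical, chosen because they happen to reproduce the True/False column in the table above.

```python
import pandas as pd

# Data copied from the table above.
df = pd.DataFrame({
    'col1': ['cat andhat', 'hat', 'the small dog', 'fog', 'pet 3'],
    'col2': [1000.0, 2000000.0, 1000.0, 330000.0, 30000.0],
})

# Hypothetical search terms, joined with "|" so the regex matches any of them.
searchfor = '|'.join(['cat', 'dog'])

df['TrueFalse'] = df['col1'].str.contains(searchfor, regex=True)
print(df)
```

If the terms may contain regex metacharacters such as `.` or `+`, pass each one through `re.escape` before joining.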