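How to select rows in a DataFrame between two values, in Python Pandas?

I am trying to modify a DataFrame df so that it only contains rows whose values in the closing_price column are between 99 and 101, and I am trying to do this with the code below.

However, I get the error

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I am wondering if there is a way to do this without using loops.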

df = df[(99 <= df['closing_price'] <= 101)]

You should use () to group your boolean vectors to remove the ambiguity.

df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
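
To see why the original expression fails, it may help to recall that Python evaluates a chained comparison `a <= x <= b` as `(a <= x) and (x <= b)`, and `and` needs a single True/False from the first Series. A minimal sketch of both the failure and the fix; the sample prices are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [104, 99, 98, 101, 100]})

# The chained comparison expands to `(99 <= s) and (s <= 101)`.
# `and` asks the first boolean Series for one truth value, which raises.
try:
    df[(99 <= df["closing_price"] <= 101)]
    chained_ok = True
except ValueError:
    chained_ok = False

# Element-wise `&` combines the two boolean Series row by row instead.
mask = (df["closing_price"] >= 99) & (df["closing_price"] <= 101)
filtered = df[mask]
print(filtered)
```
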

There is also a nicer alternative: use the query() method:

In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})


In [59]: df
Out[59]:
   closing_price
0            104
1             99
2             98
3             95
4            103
5            101
6            101
7             99
8             95
9             96


In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
   closing_price
1             99
5            101
6            101
7             99

UPDATE: answering the comment:

"I like the syntax here, but I ran into trouble when trying to combine it with an expression: df.query('(mean + 2 * sd) <= closing_price <= (mean + 2 * sd)')"

In [161]: qry = "(closing_price.mean() - 2*closing_price.std())" +\
...:       " <= closing_price <= " + \
...:       "(closing_price.mean() + 2*closing_price.std())"
...:


In [162]: df.query(qry)
Out[162]:
   closing_price
0             97
1            101
2             97
3             95
4            100
5             99
6            100
7            101
8             99
9             95

Also consider Series.between:

df = df[df['closing_price'].between(99, 101)]
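
A short sketch of how between() behaves, assuming a small made-up DataFrame. With its default arguments both bounds are included, and since the result is an ordinary boolean mask it can be negated with ~:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [104, 99, 98, 101, 100]})

# Default behaviour: both bounds are included (99 <= closing_price <= 101).
in_range = df[df["closing_price"].between(99, 101)]

# between() returns a boolean Series, so ~ selects the rows outside the range.
out_of_range = df[~df["closing_price"].between(99, 101)]

print(in_range)
print(out_of_range)
```
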
newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')

or

mean = df['closing_price'].mean()
std = df['closing_price'].std()


newdf = df.query('@mean <= closing_price <= @std')
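
As a possible variation on the same idea, the bounds can be computed beforehand and referenced inside query() with the @ prefix. The sketch below keeps rows within two standard deviations of the mean; the names lower and upper and the sample data are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical sample data with one obvious outlier (500).
df = pd.DataFrame(
    {"closing_price": [95, 99, 98, 101, 100, 99, 100, 101, 99, 500]}
)

# Compute the bounds once, outside the query string.
mean = df["closing_price"].mean()
std = df["closing_price"].std()
lower = mean - 2 * std
upper = mean + 2 * std

# @name refers to a local Python variable inside the query expression.
newdf = df.query("@lower <= closing_price <= @upper")
print(newdf)
```

Precomputing the bounds keeps the query string short and avoids recomputing mean() and std() inside the expression.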

Instead of this

df = df[(99 <= df['closing_price'] <= 101)]

you should use this

df = df[(df['closing_price']>=99 ) & (df['closing_price']<=101)]

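We have to use NumPy's bitwise logic operators |, &, ~, ^ to combine conditions. The parentheses also matter, because these operators bind more tightly than the comparison operators.

For more information, see: Comparisons, Masks, and Boolean Logic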
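
The operators and the precedence issue mentioned above can be sketched as follows. The sample data is made up; the last part shows what I would expect to happen when the parentheses are left out:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [104, 99, 98, 101, 100]})
price = df["closing_price"]

inside = (price >= 99) & (price <= 101)   # element-wise AND
outside = (price < 99) | (price > 101)    # element-wise OR
also_outside = ~inside                    # element-wise NOT

# Without parentheses, & is applied before the comparisons, so
#   price >= 99 & price <= 101
# is parsed as the chained comparison
#   price >= (99 & price) <= 101
# which again asks a Series for a single truth value and raises.
try:
    price >= 99 & price <= 101
    unparenthesized_ok = True
except ValueError:
    unparenthesized_ok = False

print(df[inside])
```
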

If you are dealing with multiple values and multiple inputs, you could also set up an apply function like this. In this case, it filters a DataFrame for GPS locations that fall within certain ranges.

def filter_values(lat, lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False




df = df[df.apply(lambda x: filter_values(x['lat'],x['lon']),axis=1)]
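
For larger frames, the same filter can probably be written without apply, using vectorized boolean masks as in the other answers. This is a different technique from the apply approach above, sketched here with made-up coordinates; the names near_first and near_second are hypothetical:

```python
import pandas as pd

# Hypothetical coordinates, for illustration only.
df = pd.DataFrame({
    "lat": [33.771, 37.795, 40.000, 33.900],
    "lon": [-118.165, -122.385, -74.000, -118.160],
})

# One boolean mask per target location, built column-wise.
near_first = ((df["lat"] - 33.77).abs() < 0.01) & ((df["lon"] - -118.16).abs() < 0.01)
near_second = ((df["lat"] - 37.79).abs() < 0.01) & ((df["lon"] - -122.39).abs() < 0.01)

# A row is kept if it is close to either location.
filtered = df[near_first | near_second]
print(filtered)
```
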

If one has to call pd.Series.between(l, r) repeatedly (for different bounds l and r), a lot of work is repeated unnecessarily. In this case, it is beneficial to sort the frame/series once and then use pd.Series.searchsorted(). I measured a speedup of up to 25x, see below.

def between_indices(x, lower, upper, inclusive=True):
    """
    Returns positions i, j such that x.iloc[i:j] contains exactly the
    values with lower <= x <= upper, under the assumption that x is sorted.
    """
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j


# Sort x once before repeated calls of between()
x = x.sort_values().reset_index(drop=True)
# x = x.sort_values(ignore_index=True) # for pandas>=1.0
ret1 = between_indices(x, lower=0.1, upper=0.9)
ret2 = between_indices(x, lower=0.2, upper=0.8)
ret3 = ...
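
A small usage sketch of the helper above, with made-up values. Note that the second returned position is an exclusive end, so the selection is taken with x.iloc[i:j]:

```python
import pandas as pd


def between_indices(x, lower, upper, inclusive=True):
    # Assumption: x is sorted in ascending order.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j


# Hypothetical sample values; sorted once up front.
x = pd.Series([0.5, 0.1, 0.9, 0.3, 0.7]).sort_values().reset_index(drop=True)

# Inclusive bounds: keeps values with 0.3 <= x <= 0.7.
i, j = between_indices(x, lower=0.3, upper=0.7)
selected = x.iloc[i:j]

# Exclusive bounds: keeps values with 0.3 < x < 0.7.
i2, j2 = between_indices(x, lower=0.3, upper=0.7, inclusive=False)
selected_exclusive = x.iloc[i2:j2]

print(selected.tolist(), selected_exclusive.tolist())
```
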

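Benchmark

Measure repeated evaluations (n_reps=100) of pd.Series.between() as well as the method based on pd.Series.searchsorted(), for different arguments lower and upper. On my MacBook Pro 2015 with Python v3.8.0 and Pandas v1.0.3, the code below produces the following output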

# pd.Series.searchsorted()
# 5.87 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pd.Series.between(lower, upper)
# 155 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Logical expressions: (x>=lower) & (x<=upper)
# 153 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numpy as np
import pandas as pd


def between_indices(x, lower, upper, inclusive=True):
    # Assumption: x is sorted.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j


def between_fast(x, lower, upper, inclusive=True):
    """
    Equivalent to pd.Series.between() under the assumption that x is sorted.
    """
    i, j = between_indices(x, lower, upper, inclusive)
    if True:
        return x.iloc[i:j]
    else:
        # Mask creation is slow.
        mask = np.zeros_like(x, dtype=bool)
        mask[i:j] = True
        mask = pd.Series(mask, index=x.index)
        return x[mask]


def between(x, lower, upper, inclusive=True):
    # Note: a boolean `inclusive` matches pandas 1.0.x as used here;
    # pandas >= 1.3 expects "both" / "neither" instead.
    mask = x.between(lower, upper, inclusive=inclusive)
    return x[mask]


def between_expr(x, lower, upper, inclusive=True):
    if inclusive:
        mask = (x>=lower) & (x<=upper)
    else:
        mask = (x>lower) & (x<upper)
    return x[mask]


def benchmark(func, x, lowers, uppers):
    for l,u in zip(lowers, uppers):
        func(x,lower=l,upper=u)


n_samples = 1000
n_reps = 100
x = pd.Series(np.random.randn(n_samples))
# Sort the Series.
# For pandas>=1.0:
# x = x.sort_values(ignore_index=True)
x = x.sort_values().reset_index(drop=True)


# Assert equivalence of different methods.
assert(between_fast(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_expr(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_fast(x, 0, 1, False).equals(between(x, 0, 1, False)))
assert(between_expr(x, 0, 1, False).equals(between(x, 0, 1, False)))


# Benchmark repeated evaluations of between().
uppers = np.linspace(0, 3, n_reps)
lowers = -uppers
%timeit benchmark(between_fast, x, lowers, uppers)
%timeit benchmark(between, x, lowers, uppers)
%timeit benchmark(between_expr, x, lowers, uppers)