pandas equivalent of np.where

np.where has the semantics of a vectorized if/else (similar to Apache Spark's when/otherwise DataFrame method). I know that I can use np.where on pandas.Series, but pandas often defines its own API to use instead of raw numpy functions, which is usually more convenient with pd.Series/pd.DataFrame.

Sure enough, I found pandas.DataFrame.where. However, at first glance, it has completely different semantics. I could not find a way to rewrite the most basic example of np.where using pandas where:

# df is pd.DataFrame
# how to write this using df.where?
df['C'] = np.where((df['A']<0) | (df['B']>0), df['A']+df['B'], df['A']/df['B'])

Am I missing something obvious? Or is pandas' where intended for a completely different use case, despite having the same name as np.where?


Try:

(df['A'] + df['B']).where((df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])

The difference between numpy's where and DataFrame's where is that the default values are supplied by the DataFrame that the where method is called on (docs).

That is,

np.where(m, A, B)

is roughly equivalent to

A.where(m, B)
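As a quick check of that rough equivalence, here is a minimal sketch on a made-up three-row frame. The data and the names m, via_numpy and via_pandas are illustrative assumptions, not taken from the question. One difference worth noticing: np.where returns a plain ndarray, while the pandas method returns a Series that keeps the index.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})

m = (df["A"] < 0) | (df["B"] > 0)  # boolean condition, one value per row

# numpy: take the sum where m is True, the ratio elsewhere
via_numpy = np.where(m, df["A"] + df["B"], df["A"] / df["B"])

# pandas: keep the sum where m is True, replace it with the ratio elsewhere
via_pandas = (df["A"] + df["B"]).where(m, df["A"] / df["B"])

# Compare the values; via_numpy is an ndarray, via_pandas is a Series
print(np.allclose(via_numpy, via_pandas.to_numpy()))
```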

If you want a similar call signature using pandas, you can take advantage of the way method calls work in Python:

pd.DataFrame.where(cond=(df['A'] < 0) | (df['B'] > 0), self=df['A'] + df['B'], other=df['A'] / df['B'])

or without kwargs (note: the positional order of the arguments differs from the numpy where argument order):

pd.DataFrame.where(df['A'] + df['B'], (df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])
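If the numpy argument order is what you are after, one option is a thin wrapper around the unbound call. The helper name where_like_numpy below is hypothetical (it is not a pandas function), and the sketch goes through pd.Series.where on the assumption that the operands are Series, as they are in the question.

```python
import pandas as pd


def where_like_numpy(cond, x, y):
    """Hypothetical helper: np.where argument order, pandas Series result.

    x is assumed to be a pandas Series. Its values are kept where cond is
    True and replaced by the corresponding values of y elsewhere.
    """
    # Calling the method through the class makes `self` an ordinary argument.
    return pd.Series.where(x, cond, y)


# Made-up data for illustration.
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})
df["C"] = where_like_numpy(
    (df["A"] < 0) | (df["B"] > 0),
    df["A"] + df["B"],
    df["A"] / df["B"],
)
print(df)
```

Unlike np.where, this returns a Series, so the result stays aligned with the frame's index when it is assigned back.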

I prefer using pandas' mask over where, since it is less counterintuitive (at least to me).

(df['A']/df['B']).mask((df['A']<0) | (df['B']>0), df['A']+df['B'])

Here, columns A and B are added where the condition holds; otherwise their ratio is left untouched.
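To make the relationship concrete: mask replaces values where the condition is True, so it behaves like where with the roles of the two operands swapped. A small sketch with made-up data (the frame and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})
cond = (df["A"] < 0) | (df["B"] > 0)

# mask: start from the ratio, overwrite it with the sum where cond is True
via_mask = (df["A"] / df["B"]).mask(cond, df["A"] + df["B"])

# where: start from the sum, keep it where cond is True, else use the ratio
via_where = (df["A"] + df["B"]).where(cond, df["A"] / df["B"])

# Both spellings should describe the same column
print(via_mask.equals(via_where))
```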