Boolean Series key will be reindexed to match DataFrame index

Here is how I got the warning:

df.loc[a_list][df.a_col.isnull()]

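a_list is of type Int64Index and contains a list of row indexes. All of these row indexes belong to df.

The df.a_col.isnull() part is a condition I need for filtering.

If I run the following commands individually, I do not get any warning: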

df.loc[a_list]
df[df.a_col.isnull()]

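But if I put them together, I get a warning message (although I can still see the result):

Boolean Series key will be reindexed to match DataFrame index

What does this warning message mean? Does it affect the result returned?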

Your approach will work despite the warning, but it's best not to rely on implicit, unclear behavior.

Solution 1, make the selection of indices in a_list a boolean mask:

df[df.index.isin(a_list) & df.a_col.isnull()]

Solution 2, do it in two steps:

df2 = df.loc[a_list]
df2[df2.a_col.isnull()]

Solution 3, if you want a one-liner, use the fact that NaN is never equal to itself:

df.loc[a_list].query('a_col != a_col')
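
To check that the three solutions agree, here is a minimal sketch on made-up data. The frame contents and the labels in a_list are assumptions for illustration only; the names df, a_col and a_list follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a_col is NaN at labels 1 and 3.
df = pd.DataFrame({"a_col": [1.0, np.nan, 3.0, np.nan, 5.0]}, index=[0, 1, 2, 3, 4])
a_list = pd.Index([1, 2, 4])  # row labels to keep, all present in df

# Solution 1: a single boolean mask built on df's own index
s1 = df[df.index.isin(a_list) & df.a_col.isnull()]

# Solution 2: select the rows first, then build the mask from the subset
df2 = df.loc[a_list]
s2 = df2[df2.a_col.isnull()]

# Solution 3: NaN != NaN, so the query keeps only rows where a_col is NaN
s3 = df.loc[a_list].query("a_col != a_col")

print(s1.index.tolist(), s2.index.tolist(), s3.index.tolist())
```

Note that Solution 1 returns rows in the order of df, while Solutions 2 and 3 follow the order of a_list, so the results can differ in row order when a_list is not sorted like df.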

The warning comes from the fact that the boolean vector df.a_col.isnull() is the length of df, while df.loc[a_list] is of the length of a_list, i.e. shorter. Therefore, some indices in df.a_col.isnull() are not in df.loc[a_list].

What pandas does is reindex the boolean series on the index of the calling dataframe. In effect, it gets from df.a_col.isnull() the values corresponding to the indices in a_list. This works, but the behavior is implicit, and could easily change in the future, so that's what the warning is about.
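
The implicit alignment described above can be made explicit with Series.reindex. This is only a sketch on made-up data (the labels 10 to 13 are arbitrary), and it mirrors the effect of what pandas does rather than its internal code path.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a_col": [1.0, np.nan, 3.0, np.nan]}, index=[10, 11, 12, 13])
a_list = pd.Index([11, 12])

sub = df.loc[a_list]       # 2 rows
mask = df.a_col.isnull()   # 4 entries, indexed like df

# Explicit version: align the mask to the subset's index before filtering.
aligned = mask.reindex(sub.index)
explicit = sub[aligned]    # indexes match, so no warning is expected

# Implicit version from the question; the warning is silenced only for this demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    implicit = sub[mask]

print(explicit.equals(implicit))
```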

If you get this warning, using .loc[] instead of [] suppresses it. [1]

df.loc[boolean_mask]           # <--------- OK
df[boolean_mask]               # <--------- warning
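
One way to confirm this on your own pandas version is to record warnings around each call. The helper count_reindex_warnings below is a hypothetical name introduced for this sketch, and the data is made up.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({"a_col": [1.0, np.nan, 3.0]}, index=[0, 1, 2])
sub = df.loc[[0, 1]]               # subset with a shorter index
boolean_mask = df.a_col.isnull()   # mask still indexed like the full df


def count_reindex_warnings(func):
    """Run func and count the 'will be reindexed' warnings it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return sum("reindexed" in str(w.message) for w in caught)


n_loc = count_reindex_warnings(lambda: sub.loc[boolean_mask])
n_getitem = count_reindex_warnings(lambda: sub[boolean_mask])
print(n_loc, n_getitem)
```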

For the particular case in the OP, you can chain .loc[] indexers:

df.loc[a_list].loc[df['a_col'].isna()]

or chain all conditions using and inside query():

# if a_list is a list of indices of df
df.query("index in @a_list and a_col != a_col")


# if a_list is a list of values in some other column such as b_col
df.query("b_col in @a_list and a_col != a_col")

or chain all conditions using & inside [] (as in @IanS's post).
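
As a rough check of the chained .loc[] and query() variants, the sketch below runs them on made-up data. The column b_col, its values, and the name b_values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"a_col": [1.0, np.nan, np.nan, 4.0], "b_col": ["x", "y", "z", "y"]},
    index=[0, 1, 2, 3],
)
a_list = [1, 3]     # row labels of df
b_values = ["y"]    # values to match in b_col

# Chained .loc[]: the second .loc aligns the full-length mask without warning.
chained = df.loc[a_list].loc[df["a_col"].isna()]

# query(): "@name" refers to a local Python variable, "index" to the row index.
by_index = df.query("index in @a_list and a_col != a_col")
by_column = df.query("b_col in @b_values and a_col != a_col")

print(chained.index.tolist(), by_index.index.tolist(), by_column.index.tolist())
```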


This warning occurs if

  • the index of the boolean mask is not in the same order as the index of the dataframe it is filtering.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a_col':[1, 2, np.nan]}, index=[0, 1, 2])
    m1 = pd.Series([True, False, True], index=[2, 1, 0])
    df.loc[m1]       # <--------- OK
    df[m1]           # <--------- warning
    
  • the index of the boolean mask is a superset of the index of the dataframe it is filtering. For example:

    m2 = pd.Series([True, False, True, True], np.r_[df.index, 10])
    df.loc[m2]       # <--------- OK
    df[m2]           # <--------- warning
    

1: If we look at the source code of [] and loc[], literally the only difference when the index of the boolean mask is a (weak) superset of the index of the dataframe is that [] shows this warning (via the _getitem_bool_array method) and loc[] does not.
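
The superset condition in the note matters: as far as I know, a mask that is missing some of the dataframe's labels cannot be aligned, and pandas raises an error instead of warning. A small sketch on made-up data, catching a generic Exception because the import location of the specific error class (IndexingError) has varied between pandas versions:

```python
import pandas as pd

df = pd.DataFrame({"a_col": [1, 2, 3]}, index=[0, 1, 2])

# The mask's index lacks label 2, so it is NOT a superset of df.index.
partial_mask = pd.Series([True, False], index=[0, 1])

try:
    df.loc[partial_mask]
    raised = None
except Exception as exc:  # expected to be pandas' IndexingError
    raised = type(exc).__name__

print(raised)
```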