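Slice a pandas DataFrame by index values that are (not) in a list

I have a pandas DataFrame, df.

I want to select all rows of df whose index labels are not in a list, blacklist.

Currently, I use a list comprehension to build the labels to slice with.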

ix=[i for i in df.index if i not in blacklist]
df_select=df.loc[ix]

This works fine, but it may get clumsy if I need to do it often.

Is there a better way to do this?

import pandas as pd
df = pd.DataFrame(data=[5,6,7,8], index=[1,2,3,4], columns=['D',])
blacklist = [2,3]
#your current way ...
ix=[i for i in df.index if i not in blacklist]
df_select=df.loc[ix]


# use a mask
mask = [x not in blacklist for x in df.index]  # one boolean per row, same length as df.index
df.loc[mask]

See http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-label. Both loc and iloc accept a boolean array, in this case the mask. Once built, the mask can be reused, which should be more efficient than rebuilding the list of labels each time.
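As a minimal sketch of that reuse, assuming the same df and blacklist as above plus a second, hypothetical frame named other that shares df's index, the mask can be built once and applied to both frames:

```python
import pandas as pd

df = pd.DataFrame(data=[5, 6, 7, 8], index=[1, 2, 3, 4], columns=['D'])
# 'other' is a hypothetical second frame with the same index as df
other = pd.DataFrame(data=[50, 60, 70, 80], index=[1, 2, 3, 4], columns=['E'])
blacklist = [2, 3]

# One boolean per row: True where the label is NOT blacklisted
mask = [label not in blacklist for label in df.index]

# The same mask selects the matching rows in both frames
df_select = df.loc[mask]
other_select = other.loc[mask]
```

The mask is positional, so it only makes sense to reuse it on frames whose rows line up with df.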

You could use set() to create the difference between your original indices and those that you want to remove:

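df.loc[list(set(df.index) - set(blacklist))]

It has the advantage of being parsimonious, as well as being easier to read than a list comprehension. Note that sets are unordered, so the row order of the result is not guaranteed, and that newer pandas versions no longer accept a bare set as an indexer, hence the list() call.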

Use isin on the index and invert the boolean index to perform label selection:

In [239]:
df = pd.DataFrame({'a':np.random.randn(5)})
df

Out[239]:
          a
0 -0.548275
1 -0.411741
2 -1.187369
3  1.028967
4 -2.755030

In [240]:
t = [2,4]
df.loc[~df.index.isin(t)]

Out[240]:
          a
0 -0.548275
1 -0.411741
3  1.028967
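Since the question mentions needing this often, the isin approach could be wrapped in a small helper. This is only a sketch: the function name select_not_in is made up for illustration, and the data is fixed rather than random so the result is reproducible.

```python
import pandas as pd

def select_not_in(frame, labels):
    """Return the rows of frame whose index label is not in labels."""
    return frame.loc[~frame.index.isin(labels)]

df = pd.DataFrame({'a': [10, 20, 30, 40, 50]})
t = [2, 4, 99]  # 99 is not in the index; isin simply never matches it

result = select_not_in(df, t)
print(result)
```

Unlike label-based lookups, isin does not raise when a blacklisted label is absent from the index, and the surviving rows keep their original order.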

Thanks to ASGM; I found that I needed to turn the set into a list to make it work with a MultiIndex:

mi1 = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])
df1 = pd.DataFrame(data={"aaa":[1,2,3,4]}, index=mi1)
setValid = set(df1.index) - set([("a", 2)])
df1.loc[list(setValid)] # works
df1.loc[setValid] # fails

(sorry can't comment, insufficient rep)

If you are looking for a way to select all rows that are outside a condition you can use np.invert() given that the condition returns an array of booleans.

df.loc[np.invert((condition_1) & (condition_2))]  # condition_1, condition_2: boolean arrays
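To make the placeholder conditions concrete, here is a small sketch. The two conditions (cond_1 and cond_2) are invented for illustration and are not from the answer above; any pair of boolean Series aligned with df would do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[5, 6, 7, 8], index=[1, 2, 3, 4], columns=['D'])

# Two illustrative conditions, each a boolean Series aligned with df
cond_1 = df['D'] > 5
cond_2 = df['D'] < 8

# Keep the rows for which NOT (cond_1 AND cond_2) holds
outside = df.loc[np.invert(cond_1 & cond_2)]
print(outside)
```

For boolean data, np.invert gives the same result as the ~ operator, so df.loc[~(cond_1 & cond_2)] is an equivalent spelling.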
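You can also drop the blacklisted labels directly with drop:

df = pd.DataFrame(data=[5,6,7,8], index=[1,2,3,4], columns=['D',])
blacklist = [2,3]

# axis=0 drops rows by index label; newer pandas versions require axis as a keyword
df.drop(blacklist, axis=0)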
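One thing to watch with drop is that it raises a KeyError when a blacklisted label is missing from the index. A hedged sketch of how that might be handled, using the errors parameter of DataFrame.drop and an extra, deliberately missing label (99) added for illustration:

```python
import pandas as pd

df = pd.DataFrame(data=[5, 6, 7, 8], index=[1, 2, 3, 4], columns=['D'])
blacklist = [2, 3, 99]  # 99 does not exist in df.index

# errors='ignore' skips missing labels instead of raising KeyError
df_select = df.drop(index=blacklist, errors='ignore')
print(df_select)
```

drop returns a new DataFrame here, so df itself is left unchanged.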

You can use the np.setdiff1d function which finds the set difference of two arrays.

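index = np.array(blacklist)
# setdiff1d returns the sorted, unique labels of df.index that are not in blacklist
not_index = np.setdiff1d(df.index.to_numpy(), index)
# these are labels, not positions, so select with loc rather than iloc
df.loc[not_index]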

You could use difference() to obtain the difference between your original indices and those that you want to exclude:

df.loc[df.index.difference(blacklist), :]

It has the advantage of being easier to read.
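One detail worth knowing, shown here as a sketch on a deliberately unsorted index (the data is made up for illustration): by default Index.difference returns the remaining labels sorted when they are sortable, so the row order can differ from the original frame. Passing sort=False is one way to keep the original order.

```python
import pandas as pd

# Illustrative frame with an unsorted index
df = pd.DataFrame({'D': [30, 10, 20]}, index=[3, 1, 2])
blacklist = [2]

# Default behaviour: the remaining labels come back sorted
sorted_rows = df.loc[df.index.difference(blacklist), :]

# sort=False keeps the labels in their original order
original_order = df.loc[df.index.difference(blacklist, sort=False), :]

print(sorted_rows)
print(original_order)
```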

Inside query you can access your variable blacklist using @:

df.query('index != @blacklist')


# Or alternatively:
df.query('index not in @blacklist')
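A convenient side effect of query is that the index filter can be combined with column conditions in one expression. The following is a minimal sketch on the same small frame used in the earlier answers; the extra condition on column D is just an illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[5, 6, 7, 8], index=[1, 2, 3, 4], columns=['D'])
blacklist = [2, 3]

# Index filter only
kept = df.query('index not in @blacklist')

# Index filter combined with an illustrative column condition
kept_large = df.query('index not in @blacklist and D > 5')

print(kept)
print(kept_large)
```

This relies on the index being unnamed, so that it can be referred to as index inside the expression.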