Filtering multiple items in a multi-indexed Python pandas DataFrame

小开

Best answer

You can use get_level_values in conjunction with Boolean indexing.

In [50]: print(df[np.isin(df.index.get_level_values(1), ['Lake', 'River', 'Upland'])])
                          Area
NSRCODE PBL_AWI
CM      Lake      57124.819333
        River      1603.906642
LBH     Lake     258046.508310
        River     44262.807900

The same idea can be expressed in many different ways, such as df[df.index.get_level_values('PBL_AWI').isin(['Lake', 'River', 'Upland'])]

Note that your data contains 'upland' rather than 'Upland', which is why no upland rows appear in the result above.
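For anyone who wants to run this end to end, here is a minimal sketch of the boolean-mask approach. The frame is rebuilt from the values printed elsewhere in this thread (rounded to two decimals), so treat the exact data as an assumption rather than the asker's real table.

```python
import pandas as pd

# Sample data reconstructed from the outputs shown in this thread (an assumption).
index = pd.MultiIndex.from_tuples(
    [("CM", "Lake"), ("CM", "River"), ("CM", "upland"),
     ("LBH", "Lake"), ("LBH", "River"), ("LBH", "upland")],
    names=["NSRCODE", "PBL_AWI"],
)
df = pd.DataFrame(
    {"Area": [57124.82, 1603.91, 2701289.27, 258046.51, 44262.81, 23843453.64]},
    index=index,
)

wanted = ["Lake", "River", "Upland"]

# Build a boolean mask from the second index level and use it to select rows.
mask = df.index.get_level_values("PBL_AWI").isin(wanted)
result = df[mask]
print(result)
```

The comparison is case-sensitive, so the two 'upland' rows are dropped here as well.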

Also (from here):

def filter_by(df, constraints):
    """Filter MultiIndex by sublevels."""
    indexer = [constraints[name] if name in constraints else slice(None)
               for name in df.index.names]
    return df.loc[tuple(indexer)] if len(df.shape) == 1 else df.loc[tuple(indexer),]


pd.Series.filter_by = filter_by
pd.DataFrame.filter_by = filter_by

... to be used as

df.filter_by({'PBL_AWI' : ['Lake', 'River', 'Upland']})

(untested with Panels and higher dimension elements, but I do expect it to work)
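If you would rather not attach methods to pandas classes, the same idea also works as a plain function. This is only a sketch: the sample frame is rebuilt from the values printed in this thread (an assumption), and I only pass labels that actually exist in the index, because how .loc treats missing labels in a list may differ between pandas versions.

```python
import pandas as pd

def filter_by(obj, constraints):
    """Select rows whose named index levels match the given labels."""
    # One entry per index level: the requested labels, or slice(None) for "keep all".
    indexer = tuple(constraints.get(name, slice(None)) for name in obj.index.names)
    # A Series takes the level indexer alone; a DataFrame also needs a column slice.
    return obj.loc[indexer] if obj.ndim == 1 else obj.loc[indexer, :]

# Sample data reconstructed from the outputs shown in this thread (an assumption).
index = pd.MultiIndex.from_tuples(
    [("CM", "Lake"), ("CM", "River"), ("CM", "upland"),
     ("LBH", "Lake"), ("LBH", "River"), ("LBH", "upland")],
    names=["NSRCODE", "PBL_AWI"],
)
df = pd.DataFrame(
    {"Area": [57124.82, 1603.91, 2701289.27, 258046.51, 44262.81, 23843453.64]},
    index=index,
)

frame_result = filter_by(df, {"PBL_AWI": ["Lake", "River"]})
series_result = filter_by(df["Area"], {"PBL_AWI": ["Lake"]})
print(frame_result)
print(series_result)
```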

Another (maybe cleaner) way might be this one:

print(df[df.index.isin(['Lake', 'River', 'Upland'], level=1)])

The level parameter takes either the position of the index level (starting at 0) or its name (here: level='PBL_AWI').
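A small sketch of the same call using the level name instead of the position, plus the inverted mask. The sample frame is rebuilt from the values printed in this thread, so the data itself is an assumption.

```python
import pandas as pd

# Sample data reconstructed from the outputs shown in this thread (an assumption).
index = pd.MultiIndex.from_tuples(
    [("CM", "Lake"), ("CM", "River"), ("CM", "upland"),
     ("LBH", "Lake"), ("LBH", "River"), ("LBH", "upland")],
    names=["NSRCODE", "PBL_AWI"],
)
df = pd.DataFrame(
    {"Area": [57124.82, 1603.91, 2701289.27, 258046.51, 44262.81, 23843453.64]},
    index=index,
)

# Address the level by name; this returns a boolean NumPy array.
mask = df.index.isin(["Lake", "River", "Upland"], level="PBL_AWI")

kept = df[mask]       # rows whose PBL_AWI label is in the list
dropped = df[~mask]   # everything else (here: the lower-case 'upland' rows)
print(kept)
print(dropped)
```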

This is an answer to a slight variant of the question asked that might save someone else a little time. If you are looking for a wildcard type match to a label whose exact value you don't know, you can use something like this:

q_labels = [label for label in df.index.levels[1] if label.startswith('Q')]
new_df = df[df.index.isin(q_labels, level=1)]
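A possible variant of the prefix match that uses the vectorized string methods on the level values instead of a list comprehension. The sample frame is rebuilt from the values printed in this thread and has no labels starting with 'Q', so the prefix 'L' below is just a hypothetical stand-in.

```python
import pandas as pd

# Sample data reconstructed from the outputs shown in this thread (an assumption).
index = pd.MultiIndex.from_tuples(
    [("CM", "Lake"), ("CM", "River"), ("CM", "upland"),
     ("LBH", "Lake"), ("LBH", "River"), ("LBH", "upland")],
    names=["NSRCODE", "PBL_AWI"],
)
df = pd.DataFrame(
    {"Area": [57124.82, 1603.91, 2701289.27, 258046.51, 44262.81, 23843453.64]},
    index=index,
)

# Test every row's label in the second level against the prefix.
# 'L' is a hypothetical prefix chosen to fit the sample data.
mask = df.index.get_level_values(1).str.startswith("L")
new_df = df[mask]
print(new_df)
```

One reason to prefer get_level_values here: df.index.levels can still list labels that no longer occur in any row after earlier filtering, whereas get_level_values only reflects the rows that are actually present.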

df.filter(regex=..., axis=...) is even more succinct, because it works on both the index (axis=0) and the columns (axis=1). You do not need to worry about levels, and you can be lazy with the regex. Complete example for filtering on the index:

df.filter(regex='Lake|River|Upland',axis=0)

If you transpose the frame and filter on the columns (axis=1 is the default for a DataFrame), it works as well:

df.T.filter(regex='Lake|River|Upland')

With a regex you can also easily handle the upper/lower case issue with Upland:

import re

upland = re.compile('Upland', re.IGNORECASE)
df.filter(regex=upland, axis=0)
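A sketch that combines the three labels and the case fix in one pattern, using the inline (?i) flag. My understanding is that filter searches the pattern against the string form of the whole row label, so on a MultiIndex it can match text in any level; the second call illustrates that. The sample frame is rebuilt from the values printed in this thread (an assumption).

```python
import pandas as pd

# Sample data reconstructed from the outputs shown in this thread (an assumption).
index = pd.MultiIndex.from_tuples(
    [("CM", "Lake"), ("CM", "River"), ("CM", "upland"),
     ("LBH", "Lake"), ("LBH", "River"), ("LBH", "upland")],
    names=["NSRCODE", "PBL_AWI"],
)
df = pd.DataFrame(
    {"Area": [57124.82, 1603.91, 2701289.27, 258046.51, 44262.81, 23843453.64]},
    index=index,
)

# (?i) makes the whole pattern case-insensitive, so 'upland' matches too.
any_case = df.filter(regex=r"(?i)lake|river|upland", axis=0)

# The pattern is not tied to one level: 'CM' only occurs in the first level.
by_code = df.filter(regex="CM", axis=0)

print(any_case)
print(by_code)
```

The flip side of not having to think about levels is that a short pattern may match a label in a level you did not intend, so anchor or lengthen the pattern if your levels share substrings.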

This is the command to read the input table above:

df = pd.read_csv(io.StringIO(input_table), sep=r"\s{2,}", engine="python").set_index(['NSRCODE', 'PBL_AWI'])
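Here is a self-contained sketch of that read step. The text table below is my own reconstruction from the values printed in this thread, with columns separated by two spaces; the asker's real input may be laid out differently. A raw string keeps the backslash in the separator intact, and engine="python" is passed explicitly because regex separators are handled by the Python parser.

```python
import io

import pandas as pd

# Hypothetical input text, reconstructed from the outputs shown in this thread.
input_table = """NSRCODE  PBL_AWI  Area
CM  Lake  57124.82
CM  River  1603.91
CM  upland  2701289.27
LBH  Lake  258046.51
LBH  River  44262.81
LBH  upland  23843453.64
"""

# Split columns on runs of two or more whitespace characters.
df = pd.read_csv(io.StringIO(input_table), sep=r"\s{2,}", engine="python")

# Turn the two key columns into a two-level row index.
df = df.set_index(["NSRCODE", "PBL_AWI"])
print(df)
```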

A simpler approach using .loc would be

df.loc[(slice(None),['Lake', 'River', 'Upland']),:]

or for Series

df.loc[(slice(None),['Lake', 'River', 'Upland'])]

slice(None) means no filtering on the first index level. The second level is filtered with the list of values ['Lake', 'River', 'Upland'].
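The same selection can also be written with pd.IndexSlice, which lets you use : in place of slice(None). This is a sketch on a sample frame rebuilt from the values printed in this thread (an assumption). I only list labels that exist in the index, since the handling of missing labels in a list may depend on the pandas version, and I sort the index first as a precaution.

```python
import pandas as pd

# Sample data reconstructed from the outputs shown in this thread (an assumption).
index = pd.MultiIndex.from_tuples(
    [("CM", "Lake"), ("CM", "River"), ("CM", "upland"),
     ("LBH", "Lake"), ("LBH", "River"), ("LBH", "upland")],
    names=["NSRCODE", "PBL_AWI"],
)
df = pd.DataFrame(
    {"Area": [57124.82, 1603.91, 2701289.27, 258046.51, 44262.81, 23843453.64]},
    index=index,
).sort_index()

idx = pd.IndexSlice

# DataFrame: all first-level labels, selected second-level labels, all columns.
frame_result = df.loc[idx[:, ["Lake", "River"]], :]

# Series: the same row indexer without the column part.
series_result = df["Area"].loc[idx[:, ["Lake", "River"]]]

print(frame_result)
print(series_result)
```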

You can also use query:

In [9]: df.query("PBL_AWI == ['Lake', 'River', 'Upland']")
Out[9]:
                     Area
NSRCODE PBL_AWI
CM      Lake     57124.82
        River     1603.91
LBH     Lake    258046.51
        River    44262.81

However, because the comparison is case-sensitive, 'upland' (lower case) won't be found. I therefore recommend using fullmatch with case=False:

In [10]: df.query("PBL_AWI.str.fullmatch('Lake|River|Upland', case=False).values")
Out[10]:
                       Area
NSRCODE PBL_AWI
CM      Lake       57124.82
        River       1603.91
        upland   2701289.27
LBH     Lake      258046.51
        River      44262.81
        upland  23843453.64
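If you prefer to stay outside query, a case-insensitive match can also be sketched by lower-casing the level values before comparing. The sample frame is rebuilt from the values printed in this thread, so the data is an assumption.

```python
import pandas as pd

# Sample data reconstructed from the outputs shown in this thread (an assumption).
index = pd.MultiIndex.from_tuples(
    [("CM", "Lake"), ("CM", "River"), ("CM", "upland"),
     ("LBH", "Lake"), ("LBH", "River"), ("LBH", "upland")],
    names=["NSRCODE", "PBL_AWI"],
)
df = pd.DataFrame(
    {"Area": [57124.82, 1603.91, 2701289.27, 258046.51, 44262.81, 23843453.64]},
    index=index,
)

wanted = ["Lake", "River", "Upland"]

# Lower-case both sides so 'Upland' in the list matches 'upland' in the index.
labels = df.index.get_level_values("PBL_AWI").str.lower()
mask = labels.isin([w.lower() for w in wanted])
result = df[mask]
print(result)
```

The original labels are left untouched in the result; only the comparison is done on lower-cased copies.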