Pandas: Get unique MultiIndex level values by label

Suppose you have this MultiIndex-ed DataFrame:

import pandas as pd

df = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR'],
                   'biome': ['Lake', 'Forest', 'Lake', 'Forest'],
                   'area': [10, 20, 30, 40],
                   'count': [7, 5, 2, 3]})
df = df.set_index(['country', 'biome'])

It looks like this:

                area  count
country biome
DE      Lake      10      7
        Forest    20      5
FR      Lake      30      2
        Forest    40      3

I want to retrieve the unique values of each index level. This can be done using

>>> df.index.levels[0]
['DE', 'FR']


>>> df.index.levels[1]
['Lake', 'Forest']

What I really want to do is retrieve these lists by addressing the levels by name, i.e. 'country' and 'biome'. The two shortest ways I could find are these:

>>> list(set(df.index.get_level_values('country')))
['DE', 'FR']


>>> df.index.levels[df.index.names.index('country')]
['DE', 'FR']

But neither is very elegant. Is there a shorter or more expressive way?


I guess you want the unique values in a certain level of a MultiIndex, selected by level name. I usually do the following, which is a bit long.

In [11]: df.index.get_level_values('country').unique()
Out[11]: array(['DE', 'FR'], dtype=object)
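As a minimal sketch, assuming the example DataFrame from the question, the same call works for any level name. Unlike list(set(...)), unique() keeps the values in order of first appearance, so the result is deterministic.

```python
import pandas as pd

# Rebuild the example DataFrame from the question.
df = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR'],
                   'biome': ['Lake', 'Forest', 'Lake', 'Forest'],
                   'area': [10, 20, 30, 40],
                   'count': [7, 5, 2, 3]}).set_index(['country', 'biome'])

# get_level_values returns one label per row; unique() drops the repeats.
countries = df.index.get_level_values('country').unique()
biomes = df.index.get_level_values('biome').unique()

print(list(countries))  # ['DE', 'FR']
print(list(biomes))     # ['Lake', 'Forest']
```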

If you're going to do the level lookup repeatedly, you could create a map of your index level names to level unique values with:

df_level_value_map = {
    name: level
    for name, level in zip(df.index.names, df.index.levels)
}
df_level_value_map['country']  # look up a level by its name

But this is not in any way more efficient (or shorter) than your original attempts if you're only going to do this lookup once.

I really wish there was a method on indexes that returned such a dictionary (or series?) with a name like:

df.index.get_level_map(levels={...})

Where the levels parameter can limit the map to a subset of the existing levels. I could do without the parameter if it could be a property like:

df.index.level_map
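Until something like that exists, a small helper can approximate it. The sketch below is only an illustration: level_map and its levels parameter are hypothetical names chosen to mirror the wish above, not part of pandas.

```python
import pandas as pd

def level_map(index, levels=None):
    """Hypothetical helper (not a pandas API): map level names to unique values.

    `levels` optionally restricts the map to a subset of level names.
    """
    names = index.names if levels is None else levels
    return {name: index.get_level_values(name).unique() for name in names}

# Example DataFrame from the question.
df = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR'],
                   'biome': ['Lake', 'Forest', 'Lake', 'Forest'],
                   'area': [10, 20, 30, 40],
                   'count': [7, 5, 2, 3]}).set_index(['country', 'biome'])

full_map = level_map(df.index)
only_biome = level_map(df.index, levels=['biome'])

print(list(full_map['country']))  # ['DE', 'FR']
print(list(only_biome))           # ['biome']
```

Building the map from get_level_values rather than df.index.levels means it only contains values that actually occur in the index.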

An alternative approach is to retrieve the level's values by calling df.index.levels[level_index], where level_index can be inferred from df.index.names.index(level_name). In the above example, level_name = 'country'.

The answer proposed by @Happy001 computes the unique values from the full array of level values, which may be computationally intensive.
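One caveat with reading df.index.levels directly is that the levels are not pruned when rows are filtered out, so they can list values that no longer occur. The sketch below, assuming the example DataFrame from the question, shows the difference and one way to drop the unused values with MultiIndex.remove_unused_levels().

```python
import pandas as pd

# Example DataFrame from the question.
df = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR'],
                   'biome': ['Lake', 'Forest', 'Lake', 'Forest'],
                   'area': [10, 20, 30, 40],
                   'count': [7, 5, 2, 3]}).set_index(['country', 'biome'])

# Translate the level name into its position, then read the stored level.
level_index = df.index.names.index('country')
all_countries = list(df.index.levels[level_index])

# Keep only the FR rows; the stored levels still mention 'DE'.
subset = df[df['area'] > 20]
stale = list(subset.index.levels[level_index])

# remove_unused_levels() returns a MultiIndex whose levels are pruned.
pruned = list(subset.index.remove_unused_levels().levels[level_index])

print(all_countries)  # ['DE', 'FR']
print(stale)          # ['DE', 'FR']
print(pruned)         # ['FR']
```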

Pandas 0.23.0 finally introduced a much cleaner solution to this problem: the level argument to Index.unique():

In [3]: df.index.unique(level='country')
Out[3]: Index(['DE', 'FR'], dtype='object', name='country')

This is now the recommended solution. It is far more efficient because it avoids creating a complete representation of the level values in memory, and re-scanning it.
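A short sketch of this call on the example DataFrame from the question, assuming pandas 0.23.0 or later. The level can be given by name or by position, and the result only contains values that are actually present in the index.

```python
import pandas as pd

# Example DataFrame from the question.
df = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR'],
                   'biome': ['Lake', 'Forest', 'Lake', 'Forest'],
                   'area': [10, 20, 30, 40],
                   'count': [7, 5, 2, 3]}).set_index(['country', 'biome'])

by_name = df.index.unique(level='country')   # select the level by name
by_position = df.index.unique(level=1)       # or by integer position

# After filtering, only the remaining values are reported.
subset_countries = df[df['area'] > 20].index.unique(level='country')

print(list(by_name))           # ['DE', 'FR']
print(list(by_position))       # ['Lake', 'Forest']
print(list(subset_countries))  # ['FR']
```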

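If you already know the index names, is it not straightforward to simply do df['country'].unique()? (This only works when 'country' is a column, for example after df.reset_index(); in the DataFrame above it is an index level.)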