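pandas MultiIndex: how do I select the second level when it is used for columns?

I have a DataFrame with this index:

index = pd.MultiIndex.from_product([['stock1','stock2'...],['price','volume'...]])

It is a useful structure because I can do df['stock1'], but how do I select all of the price data? I can't make sense of the documentation.

I tried the following, without success:

If this indexing style is generally considered a bad idea for whatever reason, what would be a better choice? Should I put a MultiIndex for the stocks on the index of the time series rather than at the column level?

Many thanks

EDIT: I am using the MultiIndex for the columns, not the index (the wording got the better of me). The examples in the documentation focus on multi-level indexes rather than column structures.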


df.unstack() will "tear off" the last level of your MultiIndex and make your DataFrame a lot more conventional, with one column per type of data. For example:

import pandas as pd

index = pd.MultiIndex.from_product([['stock1','stock2','stock3'],['price','volume']])
df = pd.DataFrame([1,2,3,4,5,6], index)
print(df.unstack())

Gives you:

           0
       price volume
stock1     1      2
stock2     3      4
stock3     5      6
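If the values are held in a Series rather than a one-column DataFrame, unstack() yields flat column names, so the price data becomes a plain column lookup. A minimal sketch, assuming the same sample data; the names `values` and `wide` are illustrative and not from the question:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["stock1", "stock2", "stock3"], ["price", "volume"]]
)
# Assumption: the data is stored as a Series, not a one-column DataFrame.
values = pd.Series([1, 2, 3, 4, 5, 6], index=index)

# Move the last index level ("price"/"volume") into the columns.
wide = values.unstack()

# With flat columns, selecting all prices is an ordinary column access.
prices = wide["price"]
print(prices)
```

Whether this layout suits the data depends on how often you need per-stock access versus per-attribute access.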

I also noticed you missed this option, which works when "price" is a top-level column label:

df.loc[:,"price"]

As for best practice with your time data, keep it in a column with one value per row, preferably as a datetime object in Python (pandas has built-in support for it). You can use the mask syntax to select only the times you are interested in.

df.loc[:,"price"] is how you access a single column of your DataFrame. For multiple columns, pass a list, or a colon to get all of them:

df.loc[:,["price","volume"]]
#or
df.loc[:,:]

A useful (and fast) way to query is to build a boolean mask that marks which rows meet the condition you want:

mask = df.loc[:, "price"] > 50.0
df.loc[mask, "stock"]  # the "stock" column for rows whose price is greater than 50
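To make the mask idea concrete, here is a small runnable sketch. It assumes a flat table with hypothetical "stock", "price", and "volume" columns and made-up numbers; none of this data comes from the question:

```python
import pandas as pd

# Hypothetical flat table: one row per stock, one column per attribute.
flat = pd.DataFrame(
    {
        "stock": ["stock1", "stock2", "stock3"],
        "price": [40.0, 55.0, 70.0],
        "volume": [100, 200, 300],
    }
)

# Boolean Series with one entry per row: True where the price exceeds 50.
mask = flat.loc[:, "price"] > 50.0

# Use the mask to pick rows, and a label to pick the column.
expensive = flat.loc[mask, "stock"]
print(expensive)
```

The mask is just a Series of booleans, so several conditions can be combined with & and | before passing it to loc.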

Hope this helps, and as always feel free to follow up on this answer if I completely misunderstood your question, I'd love to help further.

Using @JohnZwinck's data sample:

In [132]: df
Out[132]:
               0
stock1 price   1
       volume  2
stock2 price   3
       volume  4
stock3 price   5
       volume  6

Option 1:

In [133]: df.loc[(slice(None), 'price'), :]
Out[133]:
              0
stock1 price  1
stock2 price  3
stock3 price  5

Option 2:

In [134]: df.loc[pd.IndexSlice[:, 'price'], :]
Out[134]:
              0
stock1 price  1
stock2 price  3
stock3 price  5
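Since the question is about MultiIndex columns, the same IndexSlice idea can be applied on the column axis instead of the row axis. A small sketch, assuming a one-row frame shaped like the one in the question (the name `wide` and the values are made up here):

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["stock1", "stock2", "stock3"], ["price", "volume"]]
)
# Hypothetical one-row frame with (stock, attribute) column pairs.
wide = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=columns)

idx = pd.IndexSlice
# All rows; every first-level label; only "price" on the second level.
prices = wide.loc[:, idx[:, "price"]]
print(prices)
```

Slicing a MultiIndex this way generally expects the index to be sorted; from_product with sorted inputs, as here, satisfies that.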

UPDATE:

But what if, for the second index level, I want to select everything but price, and there are too many values to enumerate? Is there something like slice(~'price')?

First, let's name the index levels:

df = df.rename_axis(["lvl0", "lvl1"])

Now we can use the df.query() method:

In [18]: df.query("lvl1 != 'price'")
Out[18]:
               0
lvl0   lvl1
stock1 volume  2
stock2 volume  4
stock3 volume  6
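When the MultiIndex is on the columns, one possible alternative to query() is a boolean mask built from the level values. This is a sketch under assumptions: the frame `wide` and the extra "open" attribute are invented for illustration.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["stock1", "stock2"], ["price", "volume", "open"]]
)
# Hypothetical one-row frame; "open" is an invented third attribute.
wide = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=columns)

# True for every column whose second-level label is not "price".
not_price = wide.columns.get_level_values(1) != "price"

# Keep all rows and only the columns flagged by the mask.
rest = wide.loc[:, not_price]
print(rest)
```

Because the mask is computed from the labels themselves, nothing has to be enumerated by hand.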

Also using John's data sample:

Using xs() is another way to slice a MultiIndex:

df
               0
stock1 price   1
       volume  2
stock2 price   3
       volume  4
stock3 price   5
       volume  6

df.xs('price', level=1, drop_level=False)
              0
stock1 price  1
stock2 price  3
stock3 price  5

Alternatively if you have a MultiIndex in place of columns:

df
  stock1        stock2        stock3
   price volume  price volume  price volume
0      1      2      3      4      5      6

df.xs('price', axis=1, level=1, drop_level=False)
  stock1 stock2 stock3
   price  price  price
0      1      3      5
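If the repeated "price" label is not needed in the result, leaving drop_level at its default of True removes that level, so the result has one flat column per stock. A minimal sketch, assuming a frame like the one above (the name `wide` is illustrative):

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["stock1", "stock2", "stock3"], ["price", "volume"]]
)
# Hypothetical one-row frame matching the layout shown above.
wide = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=columns)

# drop_level defaults to True, so the "price" level is removed
# and the columns collapse to the stock names.
prices = wide.xs("price", axis=1, level=1)
print(prices)
```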

I have found the most intuitive solution for accessing a second-level column in a DataFrame with MultiIndex columns is using .loc together with slice().

For your DataFrame

df
  stock1        stock2        stock3
   price volume  price volume  price volume
0      1      2      3      4      5      6
1      2      3      4      5      6      7

using

df.loc[:, (slice(None), "price")]

returns every column whose second-level label is "price":

  stock1 stock2 stock3
   price  price  price
0      1      3      5
1      2      4      6

In df.loc[:, (slice(None), "price")], the first argument to loc, :, selects all rows. The second argument, (slice(None), "price"), is a tuple that selects every first-level column (slice(None)) and, within those, the second-level columns named "price".
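The same tuple form can also narrow the first level by replacing slice(None) with a list of labels. The following is a sketch only, assuming the two-row frame shown in this answer; the choice of stock1 and stock3 is arbitrary:

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["stock1", "stock2", "stock3"], ["price", "volume"]]
)
# Two-row frame with the same values as the example in this answer.
wide = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7]], columns=columns
)

# All rows; only stock1 and stock3 on the first level; "price" on the second.
subset = wide.loc[:, (["stock1", "stock3"], "price")]
print(subset)
```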

While @MaxU's is the better answer, I want to point out that we can also reset_index a single level of a MultiIndex. For example, suppose:

df = pd.DataFrame({
    'price': range(6),
    'stocks': [f'stock{i//2}' for i in range(6)],
    'attr': ['price', 'volume'] * 3
}).set_index(['stocks', 'attr'])

leading to df:

               price
stocks attr
stock0 price       0
       volume      1
stock1 price       2
       volume      3
stock2 price       4
       volume      5

Then e.g.:

df_rst = df.reset_index('attr')
df_rst[df_rst['attr']=='price']

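will keep only the rows whose attr is 'price', that is, stock0, stock1, and stock2 with prices 0, 2, and 4. attr is now an ordinary column and stocks remains the index.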