大熊猫长到宽的形状,由两个变量决定

我有长格式的数据,并试图重塑宽,但似乎没有一个直接的方法来做到这一点,使用融化/堆栈/解除堆栈:

Salesman  Height   product      price
Knut      6        bat          5
Knut      6        ball         1
Knut      6        wand         3
Steve     5        pen          2

成为:

Salesman  Height    product_1  price_1  product_2 price_2 product_3 price_3
Knut      6        bat          5       ball      1        wand      3
Steve     5        pen          2        NA       NA        NA       NA

我认为 Stata 可以通过重塑指令做到这一点。

159909 次浏览
pivoted = df.pivot('salesman', 'product', 'price')

pg. 192 Python for Data Analysis

A simple pivot might be sufficient for your needs but this is what I did to reproduce your desired output:

df['idx'] = df.groupby('Salesman').cumcount()

Just adding a within group counter/index will get you most of the way there but the column labels will not be as you desired:

print df.pivot(index='Salesman',columns='idx')[['product','price']]


product              price
idx            0     1     2      0   1   2
Salesman
Knut         bat  ball  wand      5   1   3
Steve        pen   NaN   NaN      2 NaN NaN

To get closer to your desired output I added the following:

df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)


product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')


reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print reshape


product_0 product_1 product_2  price_0  price_1  price_2  Height
Salesman
Knut           bat      ball      wand        5        1        3       6
Steve          pen       NaN       NaN        2      NaN      NaN       5

Edit: if you want to generalize the procedure to more variables I think you could do something like the following (although it might not be efficient enough):

df['idx'] = df.groupby('Salesman').cumcount()


tmp = []
for var in ['product','price']:
df['tmp_idx'] = var + '_' + df.idx.astype(str)
tmp.append(df.pivot(index='Salesman',columns='tmp_idx',values=var))


reshape = pd.concat(tmp,axis=1)

@Luke said:

I think Stata can do something like this with the reshape command.

You can but I think you also need a within group counter to get the reshape in stata to get your desired output:

     +-------------------------------------------+
| salesman   idx   height   product   price |
|-------------------------------------------|
1. |     Knut     0        6       bat       5 |
2. |     Knut     1        6      ball       1 |
3. |     Knut     2        6      wand       3 |
4. |    Steve     0        5       pen       2 |
+-------------------------------------------+

If you add idx then you could do reshape in stata:

reshape wide product price, i(salesman) j(idx)

A bit old but I will post this for other people.

What you want can be achieved, but you probably shouldn't want it ;) Pandas supports hierarchical indexes for both rows and columns. In Python 2.7.x ...

from StringIO import StringIO


raw = '''Salesman  Height   product      price
Knut      6        bat          5
Knut      6        ball         1
Knut      6        wand         3
Steve     5        pen          2'''
dff = pd.read_csv(StringIO(raw), sep='\s+')


print dff.set_index(['Salesman', 'Height', 'product']).unstack('product')

Produces a probably more convenient representation than what you were looking for

                price
product          ball bat pen wand
Salesman Height
Knut     6          1   5 NaN    3
Steve    5        NaN NaN   2  NaN

The advantage of using set_index and unstacking vs a single function as pivot is that you can break the operations down into clear small steps, which simplifies debugging.

Here's another solution more fleshed out, taken from Chris Albon's site.

Create "long" dataframe

raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': [6252, 24243, 2345, 2342, 23525]}


df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])

Make a "wide" data

df.pivot(index='patient', columns='obs', values='score')

Karl D's solution gets at the heart of the problem. But I find it's far easier to pivot everything (with .pivot_table because of the two index columns) and then sort and assign the columns to collapse the MultiIndex:

df['idx'] = df.groupby('Salesman').cumcount()+1
df = df.pivot_table(index=['Salesman', 'Height'], columns='idx',
values=['product', 'price'], aggfunc='first')


df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x,y in df.columns]
df = df.reset_index()

Output:

  Salesman  Height  price_1 product_1  price_2 product_2  price_3 product_3
0     Knut       6      5.0       bat      1.0      ball      3.0      wand
1    Steve       5      2.0       pen      NaN       NaN      NaN       NaN

An old question; this is an addition to the already excellent answers. pivot_wider from pyjanitor may be helpful as an abstraction for reshaping from long to wide (it is a wrapper around pd.pivot):

# pip install pyjanitor
import pandas as pd
import janitor


idx = df.groupby(['Salesman', 'Height']).cumcount().add(1)


(df.assign(idx = idx)
.pivot_wider(index = ['Salesman', 'Height'], names_from = 'idx')
)
 

Salesman  Height product_1 product_2 product_3  price_1  price_2  price_3
0     Knut       6       bat      ball      wand      5.0      1.0      3.0
1    Steve       5       pen       NaN       NaN      2.0      NaN      NaN