Groupby 和 pivot_table 之间的区别

我刚开始学习熊猫,想知道是否有任何区别之间的 groupbypivot_table的功能。谁能告诉我它们之间的区别?

64268 次浏览

Both pivot_table and groupby are used to aggregate your dataframe. The difference is only with regard to the shape of the result.

Using pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum) a table is created where a is on the row axis, b is on the column axis, and the values are the sum of c.

Example:

df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)


b         1         2
a
1  0.528470  0.484766
2  0.187277  0.144326
3  0.866832  0.650100

Using groupby, the dimensions given are placed into columns, and rows are created for each combination of those dimensions.

In this example, we create a series of the sum of values c, grouped by all unique combinations of a and b.

df.groupby(['a','b'])['c'].sum()


a  b
1  1    0.528470
2    0.484766
2  1    0.187277
2    0.144326
3  1    0.866832
2    0.650100
Name: c, dtype: float64

A similar usage of groupby is if we omit the ['c']. In this case, it creates a dataframe (not a series) of the sums of all remaining columns grouped by unique values of a and b.

print df.groupby(["a","b"]).sum()
c
a b
1 1  0.528470
2  0.484766
2 1  0.187277
2  0.144326
3 1  0.866832
2  0.650100

It's more appropriate to use .pivot_table() instead of .groupby() when you need to show aggregates with both rows and column labels.

.pivot_table() makes it easy to create row and column labels at the same time and is preferable, even though you can get similar results using .groupby() with few extra steps.

pivot_table = groupby + unstack and groupby = pivot_table + stack hold True.

In particular, if columns parameter of pivot_table() is not used, then groupby() and pivot_table() both produce the same result (if the same aggregator function is used).

# sample
df = pd.DataFrame({"a": [1,1,1,2,2,2], "b": [1,1,2,2,3,3], "c": [0,0.5,1,1,2,2]})


# example
gb = df.groupby(['a','b'])[['c']].sum()
pt = df.pivot_table(index=['a','b'], values=['c'], aggfunc='sum')


# equality test
gb.equals(pt) #True

In general, if we check the source code, pivot_table() internally calls __internal_pivot_table(). This function creates a single flat list out of index and columns and calls groupby() with this list as the grouper. Then after aggregation, calls unstack() on the list of columns.

If columns are never passed, there is nothing to unstack on, so groupby and pivot_table trivially produce the same output.

A demonstration of this function is:

gb = (
df
.groupby(['a','b'])[['c']].sum()
.unstack(['b'])
)
pt = df.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')


gb.equals(pt) # True

As stack() is the inverse operation of unstack(), the following holds True as well:

(
df
.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')
.stack(['b'])
.equals(
df.groupby(['a','b'])[['c']].sum()
)
) # True

In conclusion, depending on the use case, one is more convenient than the other but they can both be used instead of the other and after correctly applying stack()/unstack(), both will result in the same output.

However, there's a performance difference between the two methods. In short, pivot_table() is slower than groupby().agg().unstack(). You can read more about it from this answer.

Difference between pivot_table and groupby

pivot_table

pivot_table

groupby

groupby