Pandas groupby without turning the grouped-by column into the index

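By default, pandas groupby turns the grouped-by columns into the index and removes them from the DataFrame's list of columns. For instance, say I have a DataFrame with these columns: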

col1|col2|col3|col4

If I apply a groupby on columns col2 and col3 this way:

df.groupby(['col2','col3']).sum()

The result no longer has ['col2','col3'] in its list of columns. They are automatically turned into the index of the resulting DataFrame.
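To make the behavior concrete, here is a minimal sketch. The sample values are hypothetical; only the column names come from the description above.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["x", "x", "y", "y"],
    "col3": ["a", "a", "b", "c"],
    "col4": [10.0, 20.0, 30.0, 40.0],
})

result = df.groupby(["col2", "col3"]).sum()

# The grouping keys are now index levels of the result, not columns.
index_names = list(result.index.names)
remaining_columns = list(result.columns)
print(index_names)        # ['col2', 'col3']
print(remaining_columns)  # ['col1', 'col4']
```

Note that df itself is not modified; only the returned result has the grouping keys in its index.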

My question is: how can I perform a groupby on a column and still keep that column in the DataFrame?

df.groupby(['col2','col3'], as_index=False).sum()

Another way to do this would be:

df.groupby(['col2', 'col3']).sum().reset_index()

The following, somewhat detailed answer is added to help those who are still unsure which of the suggested variants to use.

First, the two suggested solutions to this problem are:

  • Solution 1: df.groupby(['A', 'B'], as_index=False).sum()
  • Solution 2: df.groupby(['A', 'B']).sum().reset_index()

Both give the expected result.
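As a quick sanity check, the equivalence can be verified programmatically. This is a minimal sketch on a small hypothetical frame (the values are made up for illustration), using pandas.testing.assert_frame_equal to compare the two results.

```python
import pandas as pd

# Hypothetical sample data, using the same column names (A, B, C, D) as this answer.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b"],
    "B": [1, 1, 2, 3],
    "C": [0.5, 1.5, 2.0, 3.0],
    "D": [1.0, 1.0, 4.0, 5.0],
})

solution_1 = df.groupby(["A", "B"], as_index=False).sum()
solution_2 = df.groupby(["A", "B"]).sum().reset_index()

# Raises AssertionError if the two frames differ in values, dtypes, or labels.
pd.testing.assert_frame_equal(solution_1, solution_2)
print(list(solution_1.columns))  # ['A', 'B', 'C', 'D']
```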


Solution 1:

As explained in the documentation, as_index=False requests SQL-style grouped output, which tells pandas to keep the grouped-by columns as regular columns in the output instead of moving them into the index.

as_index: bool, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

Example:

Given the following DataFrame:

     A     B      C      D
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.454158  0.723124
4    B     4  0.323326  0.895858
5    C     2  0.672375  0.356736
6    C     5  0.929655  0.371913
7    D     5  0.212634  0.540736
8    D     5  0.471418  0.268270
9    E     1  0.061270  0.739610
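To follow along, the frame above can be rebuilt roughly like this. The values are copied from the printed table, so they are rounded to six decimals; sums computed from them may differ from the outputs shown in this answer in the last digit.

```python
import pandas as pd

# Values copied from the printed table above (rounded to six decimals).
df = pd.DataFrame({
    "A": ["A", "A", "B", "B", "B", "C", "C", "D", "D", "E"],
    "B": [1, 3, 2, 4, 4, 2, 5, 5, 5, 1],
    "C": [0.502130, 0.335416, 0.067308, 0.454158, 0.323326,
          0.672375, 0.929655, 0.212634, 0.471418, 0.061270],
    "D": [0.959404, 0.087215, 0.084595, 0.723124, 0.895858,
          0.356736, 0.371913, 0.540736, 0.268270, 0.739610],
})
print(df.shape)  # (10, 4)
```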

Applying the first solution gives:

>>> df.groupby(["A", "B"], as_index=False).sum()
     A     B      C         D
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.777483  1.618982
4    C     2  0.672375  0.356736
5    C     5  0.929655  0.371913
6    D     5  0.684052  0.809006
7    E     1  0.061270  0.739610

Where the groupby columns are preserved correctly.


Solution 2:

To understand the second solution, let's look at the output of the previous command with as_index=True, which is the default behavior of pandas.DataFrame.groupby (see the documentation):

>>> df.groupby(["A", "B"], as_index=True).sum()
                  C         D
A    B
A    1     0.502130  0.959404
     3     0.335416  0.087215
B    2     0.067308  0.084595
     4     0.777483  1.618982
C    2     0.672375  0.356736
     5     0.929655  0.371913
D    5     0.684052  0.809006
E    1     0.061270  0.739610

As you can see, the groupby keys become the index of the DataFrame. Using pandas.DataFrame.reset_index (see the documentation), we can move the index levels back into columns and get a default integer index, which leads to the same result as in the previous step:

>>> df.groupby(['A', 'B']).sum().reset_index()
     A     B      C         D
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.777483  1.618982
4    C     2  0.672375  0.356736
5    C     5  0.929655  0.371913
6    D     5  0.684052  0.809006
7    E     1  0.061270  0.739610

Benchmark

Notice that since the first solution meets the requirement in one step versus two steps in the second, the former is slightly faster in this measurement, although the gap is close to the run-to-run variation:

%timeit df.groupby(["A", "B"], as_index=False).sum()
3.38 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%timeit df.groupby(["A", "B"]).sum().reset_index()
3.9 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
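Outside IPython, a similar comparison can be run with the standard timeit module. This is only a sketch: the synthetic frame, its size, and the helper function names are assumptions, not the setup used for the timings above, and the numbers will vary with the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption; not the frame used for the timings above).
rng = np.random.default_rng(0)
n_rows = 10_000
df = pd.DataFrame({
    "A": rng.choice(list("ABCDE"), size=n_rows),
    "B": rng.integers(1, 6, size=n_rows),
    "C": rng.random(n_rows),
    "D": rng.random(n_rows),
})


def with_as_index_false():
    # Solution 1: keep the grouping keys as columns directly.
    return df.groupby(["A", "B"], as_index=False).sum()


def with_reset_index():
    # Solution 2: group with the default index, then move it back to columns.
    return df.groupby(["A", "B"]).sum().reset_index()


repeats = 20
t1 = timeit.timeit(with_as_index_false, number=repeats) / repeats
t2 = timeit.timeit(with_reset_index, number=repeats) / repeats
print(f"as_index=False: {t1 * 1e3:.2f} ms per call")
print(f"reset_index():  {t2 * 1e3:.2f} ms per call")
```

Either way, the difference is small enough that readability is a reasonable basis for choosing between the two.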