Best Answer

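In the process of answering this question for myself, I learned many things, and I wanted to put together a catalog of examples and some explanation.

The specific answer to the point of the levels argument will come towards the end.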

`pandas.concat`: The Missing Manual

Imports and defining objects

import pandas as pd


d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])


s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])
s3 = pd.Series([5, 6], index=[1, 3])

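Arguments

`objs`

The first argument we come across is objs:

objs: a sequence or mapping of Series, DataFrame, or Panel objects. If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None, in which case a ValueError will be raised.

We typically see this used with a list of Series or DataFrame objects.
I'll show that a dict can be very useful as well.
Generators may also be used, which can be handy when using map, as in map(f, list_of_df).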
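A minimal sketch of the map pattern mentioned above, assuming two small illustrative frames. The helper add_double and the sample data are hypothetical and only stand in for f and list_of_df.

```python
import pandas as pd

# Two small frames standing in for list_of_df (illustrative data).
list_of_df = [
    pd.DataFrame({"A": [1, 2]}),
    pd.DataFrame({"A": [3, 4]}),
]


def add_double(df):
    # Hypothetical helper: adds a column B holding twice the value of A.
    return df.assign(B=df["A"] * 2)


# map returns a lazy iterator; pd.concat consumes it just like a list.
result = pd.concat(map(add_double, list_of_df), ignore_index=True)
print(result)
```

Passing the iterator directly avoids building an intermediate list by hand.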

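For now, we'll stick with a list of some of the DataFrame and Series objects defined above. Later I'll show how dictionaries can be leveraged to give very useful MultiIndex results.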

pd.concat([d1, d2])


     A    B    C    D
2  0.1  0.2  0.3  NaN
3  0.1  0.2  0.3  NaN
1  NaN  0.4  0.5  0.6
2  NaN  0.4  0.5  0.6

`axis`

The second argument we encounter is axis, whose default value is 0:

axis: {0/'index', 1/'columns'}, default 0. The axis to concatenate along.

Two `DataFrame`s with `axis=0` (stacked)

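For values of 0 or index we mean to say: "Align along the columns and add to the index".

As shown above, where we used axis=0 because 0 is the default value, the index of d2 extends the index of d1 despite the overlap of the value 2: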

pd.concat([d1, d2], axis=0)


     A    B    C    D
2  0.1  0.2  0.3  NaN
3  0.1  0.2  0.3  NaN
1  NaN  0.4  0.5  0.6
2  NaN  0.4  0.5  0.6

Two `DataFrame`s with `axis=1` (side by side)

For values of 1 or columns we mean to say: "Align along the index and add to the columns".

pd.concat([d1, d2], axis=1)


     A    B    C    B    C    D
1  NaN  NaN  NaN  0.4  0.5  0.6
2  0.1  0.2  0.3  0.4  0.5  0.6
3  0.1  0.2  0.3  NaN  NaN  NaN

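We can see that the resulting index is the union of the indices, and the resulting columns are the columns of d1 extended by the columns of d2.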
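A short sketch of the two axis settings using the string aliases instead of 0 and 1. Row order along the non-concatenated axis can vary between pandas versions, so only shapes and column labels are printed here.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

# 'index' is an alias for axis=0: rows are appended, columns are aligned.
stacked = pd.concat([d1, d2], axis="index")

# 'columns' is an alias for axis=1: columns are appended, rows are aligned.
side_by_side = pd.concat([d1, d2], axis="columns")

print(stacked.shape)               # four rows, one per input row
print(list(side_by_side.columns))  # columns of d1 followed by columns of d2
```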

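Two (or three) `Series` with `axis=0` (stacked)

When combining pandas.Series along axis=0, we get back a pandas.Series. The name of the resulting Series will be None unless all Series being combined share the same name. Pay attention to 'Name: A' when we print out the resulting Series. When it is not present, we can assume the Series name is None.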

               |                       |                        |  pd.concat(
               |  pd.concat(           |  pd.concat(            |      [s1.rename('A'),
 pd.concat(    |      [s1.rename('A'), |      [s1.rename('A'),  |       s2.rename('B'),
     [s1, s2]) |       s2])            |       s2.rename('A')]) |       s3.rename('A')])
-------------- | --------------------- | ---------------------- | ----------------------
2    1         | 2    1                | 2    1                 | 2    1
3    2         | 3    2                | 3    2                 | 3    2
1    3         | 1    3                | 1    3                 | 1    3
2    4         | 2    4                | 2    4                 | 2    4
dtype: int64   | dtype: int64          | Name: A, dtype: int64  | 1    5
               |                       |                        | 3    6
               |                       |                        | dtype: int64
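The naming rule can be checked directly on the name attribute rather than by reading the printed output. A small sketch using the same s1 and s2; the variable names are illustrative only.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])

# All inputs share the name 'A', so the result keeps it.
same_name = pd.concat([s1.rename("A"), s2.rename("A")])

# Names disagree, or one input is unnamed: the result has no name.
mixed_names = pd.concat([s1.rename("A"), s2.rename("B")])
one_unnamed = pd.concat([s1.rename("A"), s2])

print(same_name.name, mixed_names.name, one_unnamed.name)
```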

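Two (or three) `Series` with `axis=1` (side by side)

When combining pandas.Series along axis=1, it is the name attribute that we refer to in order to infer a column name in the resulting pandas.DataFrame.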

                       |                       |  pd.concat(
                       |  pd.concat(           |      [s1.rename('X'),
 pd.concat(            |      [s1.rename('X'), |       s2.rename('Y'),
     [s1, s2], axis=1) |       s2], axis=1)    |       s3.rename('Z')], axis=1)
---------------------- | --------------------- | ------------------------------
     0    1            |      X    0           |      X    Y    Z
1  NaN  3.0            | 1  NaN  3.0           | 1  NaN  3.0  5.0
2  1.0  4.0            | 2  1.0  4.0           | 2  1.0  4.0  NaN
3  2.0  NaN            | 3  2.0  NaN           | 3  2.0  NaN  6.0
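A quick sketch that inspects the inferred column labels directly. It assumes the same s1 and s2 as above and only looks at the columns, not at row order.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])

# No names at all: columns are numbered 0, 1, ...
unnamed = pd.concat([s1, s2], axis=1)

# One named Series: its name is used, the unnamed one is numbered from 0.
wide = pd.concat([s1.rename("X"), s2], axis=1)

print(list(unnamed.columns))
print(list(wide.columns))
```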

Mixed `Series` and `DataFrame` with `axis=0` (stacked)

When performing a concatenation of a Series and a DataFrame along axis=0, we convert all Series to single-column DataFrames.

Take special note that this is a concatenation along axis=0; that means extending the index (rows) while aligning the columns. In the examples below, we see the index becomes [2, 3, 2, 3] which is an indiscriminate appending of indices. The columns do not overlap unless I force the naming of the Series column with the argument to to_frame:

 pd.concat(               |
     [s1.to_frame(), d1]) |  pd.concat([s1, d1])
------------------------- | ---------------------
     0    A    B    C     |      0    A    B    C
2  1.0  NaN  NaN  NaN     | 2  1.0  NaN  NaN  NaN
3  2.0  NaN  NaN  NaN     | 3  2.0  NaN  NaN  NaN
2  NaN  0.1  0.2  0.3     | 2  NaN  0.1  0.2  0.3
3  NaN  0.1  0.2  0.3     | 3  NaN  0.1  0.2  0.3

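You can see that the result of pd.concat([s1, d1]) is the same as if I had performed the to_frame myself.

However, I can control the name of the resulting column with a parameter to to_frame. Renaming the Series with the rename method does not control the column name in the resulting DataFrame.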

 # Effectively renames       |                            |
 # `s1` but does not align   |  # Does not rename.  So    |  # Renames to something
 # with columns in `d1`      |  # Pandas defaults to `0`  |  # that does align with `d1`
 pd.concat(                  |  pd.concat(                |  pd.concat(
     [s1.to_frame('X'), d1]) |      [s1.rename('X'), d1]) |      [s1.to_frame('B'), d1])
---------------------------- | -------------------------- | ----------------------------
     A    B    C    X        |      0    A    B    C      |      A    B    C
2  NaN  NaN  NaN  1.0        | 2  1.0  NaN  NaN  NaN      | 2  NaN  1.0  NaN
3  NaN  NaN  NaN  2.0        | 3  2.0  NaN  NaN  NaN      | 3  NaN  2.0  NaN
2  0.1  0.2  0.3  NaN        | 2  NaN  0.1  0.2  0.3      | 2  0.1  0.2  0.3
3  0.1  0.2  0.3  NaN        | 3  NaN  0.1  0.2  0.3      | 3  0.1  0.2  0.3
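A small sketch of the to_frame('B') case, where the Series is made to line up with an existing column of d1. Column order in the result can differ between pandas versions, so the sketch compares sorted labels rather than relying on a particular order.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
s1 = pd.Series([1, 2], index=[2, 3])

# Naming the single column 'B' makes it align with d1's column 'B'.
aligned = pd.concat([s1.to_frame("B"), d1])

print(sorted(aligned.columns))  # no extra column is created
print(aligned["B"].tolist())    # values of s1 followed by d1's B values
```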

Mixed `Series` and `DataFrame` with `axis=1` (side by side)

This is fairly intuitive. Series column name defaults to an enumeration of such Series objects when a name attribute is not available.

                    |  pd.concat(
 pd.concat(         |      [s1.rename('X'),
     [s1, d1],      |       s2, s3, d1],
     axis=1)        |      axis=1)
------------------- | -------------------------------
   0    A    B    C |      X    0    1    A    B    C
2  1  0.1  0.2  0.3 | 1  NaN  3.0  5.0  NaN  NaN  NaN
3  2  0.1  0.2  0.3 | 2  1.0  4.0  NaN  0.1  0.2  0.3
                    | 3  2.0  NaN  6.0  0.1  0.2  0.3
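The enumeration of unnamed Series can be confirmed by looking at the resulting column labels. A sketch under the same definitions as above:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])
s3 = pd.Series([5, 6], index=[1, 3])

# s1 keeps its name 'X'; the unnamed s2 and s3 are numbered 0 and 1.
mixed = pd.concat([s1.rename("X"), s2, s3, d1], axis=1)

print(list(mixed.columns))
```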

`join`

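The third argument is join, which describes whether the resulting merge should be an outer merge (the default) or an inner merge.

join: {'inner', 'outer'}, default 'outer'
How to handle indexes on the other axis(es).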

It turns out, there is no left or right option as pd.concat can handle more than just two objects to merge.

In the case of d1 and d2, the options look like:

`outer`

pd.concat([d1, d2], axis=1, join='outer')


     A    B    C    B    C    D
1  NaN  NaN  NaN  0.4  0.5  0.6
2  0.1  0.2  0.3  0.4  0.5  0.6
3  0.1  0.2  0.3  NaN  NaN  NaN

`inner`

pd.concat([d1, d2], axis=1, join='inner')


     A    B    C    B    C    D
2  0.1  0.2  0.3  0.4  0.5  0.6
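A compact sketch contrasting the two join modes on the row labels. The outer result is compared after sorting, since the order of the unioned index may depend on the pandas version.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

outer = pd.concat([d1, d2], axis=1, join="outer")  # union of row labels
inner = pd.concat([d1, d2], axis=1, join="inner")  # intersection of row labels

print(sorted(outer.index))
print(list(inner.index))
```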

`join_axes`

The fourth argument is the one that allows us to do our left merge and more.

join_axes: list of Index objects
Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic.

Left Merge

pd.concat([d1, d2, d3], axis=1, join_axes=[d1.index])


     A    B    C    B    C    D    A    B    D
2  0.1  0.2  0.3  0.4  0.5  0.6  NaN  NaN  NaN
3  0.1  0.2  0.3  NaN  NaN  NaN  0.7  0.8  0.9

Right Merge

pd.concat([d1, d2, d3], axis=1, join_axes=[d3.index])


     A    B    C    B    C    D    A    B    D
1  NaN  NaN  NaN  0.4  0.5  0.6  0.7  0.8  0.9
3  0.1  0.2  0.3  NaN  NaN  NaN  0.7  0.8  0.9
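As far as I know, join_axes was deprecated in pandas 0.25 and removed in 1.0, so the two calls above only run on older releases. A minimal sketch of the usual replacement, assuming a recent pandas: concatenate with the default outer join, then reindex the rows to the index you want to keep.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])

# "Left merge": keep only the row labels of d1.
left = pd.concat([d1, d2, d3], axis=1).reindex(d1.index)

# "Right merge": keep only the row labels of d3.
right = pd.concat([d1, d2, d3], axis=1).reindex(d3.index)

print(left)
print(right)
```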

`ignore_index`

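ignore_index: boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note that the index values on the other axes are still respected in the join.

As when I stack d1 on top of d2, if I don't care about the index values, I can reset them or ignore them.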

                      |  pd.concat(             |  pd.concat(
                      |      [d1, d2],          |      [d1, d2]
 pd.concat([d1, d2])  |      ignore_index=True) |  ).reset_index(drop=True)
--------------------- | ----------------------- | -------------------------
     A    B    C    D |      A    B    C    D   |      A    B    C    D
2  0.1  0.2  0.3  NaN | 0  0.1  0.2  0.3  NaN   | 0  0.1  0.2  0.3  NaN
3  0.1  0.2  0.3  NaN | 1  0.1  0.2  0.3  NaN   | 1  0.1  0.2  0.3  NaN
1  NaN  0.4  0.5  0.6 | 2  NaN  0.4  0.5  0.6   | 2  NaN  0.4  0.5  0.6
2  NaN  0.4  0.5  0.6 | 3  NaN  0.4  0.5  0.6   | 3  NaN  0.4  0.5  0.6

And when using axis=1:

                                   |     pd.concat(
                                   |         [d1, d2], axis=1,
 pd.concat([d1, d2], axis=1)       |         ignore_index=True)
-------------------------------    |    -------------------------------
     A    B    C    B    C    D    |         0    1    2    3    4    5
1  NaN  NaN  NaN  0.4  0.5  0.6    |    1  NaN  NaN  NaN  0.4  0.5  0.6
2  0.1  0.2  0.3  0.4  0.5  0.6    |    2  0.1  0.2  0.3  0.4  0.5  0.6
3  0.1  0.2  0.3  NaN  NaN  NaN    |    3  0.1  0.2  0.3  NaN  NaN  NaN
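A brief sketch confirming that ignore_index relabels only the concatenation axis, and that for stacking it matches a reset_index(drop=True) afterwards. Variable names are illustrative.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

ignored = pd.concat([d1, d2], ignore_index=True)
reset = pd.concat([d1, d2]).reset_index(drop=True)

# With axis=1 it is the column labels that get replaced by 0..n-1.
ignored_cols = pd.concat([d1, d2], axis=1, ignore_index=True)

print(ignored.equals(reset))
print(list(ignored_cols.columns))
```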

`keys`

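We can pass a list of scalar values or tuples in order to assign tuple or scalar values to the corresponding MultiIndex. The length of the passed list must be the same as the number of items being concatenated.

keys: sequence, default None
If multiple levels are passed, it should contain tuples. Construct a hierarchical index using the passed keys as the outermost level.

`axis=0`

When concatenating Series objects along axis=0 (extending the index), those keys become a new initial level of a MultiIndex object in the index attribute.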

 #           length 3             length 3           #         length 2        length 2
 #          /--------\         /-----------\         #          /----\         /------\
 pd.concat([s1, s2, s3], keys=['A', 'B', 'C'])       pd.concat([s1, s2], keys=['A', 'B'])
----------------------------------------------      -------------------------------------
A  2    1                                           A  2    1
   3    2                                              3    2
B  1    3                                           B  1    3
   2    4                                              2    4
C  1    5                                           dtype: int64
   3    6
dtype: int64

However, we can use more than scalar values in the keys argument to create an even deeper MultiIndex. Here we pass tuples of length 2 to prepend two new levels of a MultiIndex:

 pd.concat(
     [s1, s2, s3],
     keys=[('A', 'X'), ('A', 'Y'), ('B', 'X')])
-----------------------------------------------
A  X  2    1
      3    2
   Y  1    3
      2    4
B  X  1    5
      3    6
dtype: int64
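One practical payoff of keys is that each original piece can be pulled back out by its key. A sketch, using the same three Series, that also checks how many index levels each variant produces:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])
s3 = pd.Series([5, 6], index=[1, 3])

# Scalar keys add one level; tuple keys of length 2 add two levels.
flat = pd.concat([s1, s2, s3], keys=["A", "B", "C"])
deep = pd.concat([s1, s2, s3], keys=[("A", "X"), ("A", "Y"), ("B", "X")])

print(flat.index.nlevels, deep.index.nlevels)
print(flat.loc["B"].tolist())          # the values that came from s2
print(deep.loc[("A", "Y")].tolist())   # also the values that came from s2
```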

`axis=1`

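It's a bit different when extending along columns. When we used axis=0 (see above), our keys acted as MultiIndex levels in addition to the existing index. For axis=1, we are referring to an axis that Series objects don't have, namely the columns attribute.

Variations of two Series with axis=1

Notice that naming s1 and s2 matters so long as no keys are passed, but it gets overridden if keys are passed.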

               |                       |                        |  pd.concat(
               |  pd.concat(           |  pd.concat(            |      [s1.rename('U'),
 pd.concat(    |      [s1, s2],        |      [s1.rename('U'),  |       s2.rename('V')],
     [s1, s2], |      axis=1,          |       s2.rename('V')], |       axis=1,
     axis=1)   |      keys=['X', 'Y']) |       axis=1)          |       keys=['X', 'Y'])
-------------- | --------------------- | ---------------------- | ----------------------
     0    1    |      X    Y           |      U    V            |      X    Y
1  NaN  3.0    | 1  NaN  3.0           | 1  NaN  3.0            | 1  NaN  3.0
2  1.0  4.0    | 2  1.0  4.0           | 2  1.0  4.0            | 2  1.0  4.0
3  2.0  NaN    | 3  2.0  NaN           | 3  2.0  NaN            | 3  2.0  NaN
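The override can be seen by comparing the column labels with and without keys. A minimal sketch with the same Series:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])

# Without keys, the Series names become the column labels.
named = pd.concat([s1.rename("U"), s2.rename("V")], axis=1)

# With keys, the names are ignored in favor of the keys.
keyed = pd.concat([s1.rename("U"), s2.rename("V")], axis=1, keys=["X", "Y"])

print(list(named.columns))
print(list(keyed.columns))
```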

MultiIndex with Series and axis=1

 pd.concat(
     [s1, s2],
     axis=1,
     keys=[('W', 'X'), ('W', 'Y')])
-----------------------------------
     W
     X    Y
1  NaN  3.0
2  1.0  4.0
3  2.0  NaN

Two DataFrames with axis=1

As with the axis=0 examples, keys add levels to a MultiIndex, but this time to the object stored in the columns attribute.

 pd.concat(                     |  pd.concat(
     [d1, d2],                  |      [d1, d2],
     axis=1,                    |      axis=1,
     keys=['X', 'Y'])           |      keys=[('First', 'X'), ('Second', 'X')])
------------------------------- | --------------------------------------------
     X              Y           |   First           Second
     A    B    C    B    C    D |       X                X
1  NaN  NaN  NaN  0.4  0.5  0.6 |       A    B    C      B    C    D
2  0.1  0.2  0.3  0.4  0.5  0.6 | 1   NaN  NaN  NaN    0.4  0.5  0.6
3  0.1  0.2  0.3  NaN  NaN  NaN | 2   0.1  0.2  0.3    0.4  0.5  0.6
                                | 3   0.1  0.2  0.3    NaN  NaN  NaN

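Series and DataFrame with axis=1

This is tricky. In this case, a scalar key value cannot act as the only level of index for the Series object when it becomes a column while also acting as the first level of a MultiIndex for the DataFrame. So Pandas will again use the name attribute of the Series object as the source of the column name.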

 pd.concat(           |  pd.concat(
     [s1, d1],        |      [s1.rename('Z'), d1],
     axis=1,          |      axis=1,
     keys=['X', 'Y']) |      keys=['X', 'Y'])
--------------------- | --------------------------
   X    Y             |    X    Y
   0    A    B    C   |    Z    A    B    C
2  1  0.1  0.2  0.3   | 2  1  0.1  0.2  0.3
3  2  0.1  0.2  0.3   | 3  2  0.1  0.2  0.3

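Limitations of keys and MultiIndex inference

Pandas only seems to infer column names from a Series name; it will not fill in the blanks when doing an analogous concatenation among data frames with a different number of column levels.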

d1_ = pd.concat(
    [d1], axis=1,
    keys=['One'])
d1_


   One
     A    B    C
2  0.1  0.2  0.3
3  0.1  0.2  0.3

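Then concatenate this with another data frame that has only one level in its columns object, and Pandas will refuse to try to make tuples for the MultiIndex object. Instead it combines all the data frames as if the columns were a single level of objects, mixing scalars and tuples.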

pd.concat([d1_, d2], axis=1)


   (One, A)  (One, B)  (One, C)    B    C    D
1       NaN       NaN       NaN  0.4  0.5  0.6
2       0.1       0.2       0.3  0.4  0.5  0.6
3       0.1       0.2       0.3  NaN  NaN  NaN
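One way around this limitation, offered as a sketch rather than the only option, is to give every frame the same number of column levels before concatenating. The key 'Two' below is a made-up label for illustration.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

# Wrap each frame so that both carry two column levels.
d1_ = pd.concat([d1], axis=1, keys=["One"])
d2_ = pd.concat([d2], axis=1, keys=["Two"])  # 'Two' is an illustrative key

both = pd.concat([d1_, d2_], axis=1)

print(both.columns.nlevels)
print(both.columns.tolist())
```

With matching level counts the result keeps a proper two-level MultiIndex on the columns.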

Passing a `dict` instead of a `list`

When passing a dictionary, pandas.concat will use the keys from the dictionary as the keys parameter.

 # axis=0               |  # axis=1
 pd.concat(             |  pd.concat(
     {0: d1, 1: d2})    |      {0: d1, 1: d2}, axis=1)
----------------------- | -------------------------------
       A    B    C    D |      0              1
0 2  0.1  0.2  0.3  NaN |      A    B    C    B    C    D
  3  0.1  0.2  0.3  NaN | 1  NaN  NaN  NaN  0.4  0.5  0.6
1 1  NaN  0.4  0.5  0.6 | 2  0.1  0.2  0.3  0.4  0.5  0.6
  2  NaN  0.4  0.5  0.6 | 3  0.1  0.2  0.3  NaN  NaN  NaN
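A sketch with string keys, which makes it easier to see that the dictionary keys end up as the outer index level. The labels 'x' and 'y' are arbitrary.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

# The dictionary keys play the role of the keys argument.
by_key = pd.concat({"x": d1, "y": d2})

print(by_key.index.tolist())
print(by_key.loc["y"])  # the rows that came from d2
```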

`levels`

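This is used in conjunction with the keys argument. When levels is left as its default value of None, Pandas will take the unique values of each level of the resulting MultiIndex and use them as the object in the resulting index.levels attribute.

levels: list of sequences, default None
Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.

If Pandas already infers what these levels should be, what advantage is there in specifying them ourselves? I'll show one example and leave it up to you to think of other reasons why this might be useful.

Example

Per the documentation, the levels argument is a list of sequences. This means that we can use another pandas.Index as one of those sequences.

Consider the data frame df that is the concatenation of d1, d2 and d3: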

df = pd.concat(
    [d1, d2, d3], axis=1,
    keys=['First', 'Second', 'Fourth'])


df


  First           Second           Fourth
      A    B    C      B    C    D      A    B    D
1   NaN  NaN  NaN    0.4  0.5  0.6    0.7  0.8  0.9
2   0.1  0.2  0.3    0.4  0.5  0.6    NaN  NaN  NaN
3   0.1  0.2  0.3    NaN  NaN  NaN    0.7  0.8  0.9

The levels of the columns object are:

print(*df.columns.levels, sep='\n')


Index(['First', 'Second', 'Fourth'], dtype='object')
Index(['A', 'B', 'C', 'D'], dtype='object')

If we use sum within a groupby we get:

df.groupby(axis=1, level=0).sum()


   First  Fourth  Second
1    0.0     2.4     1.5
2    0.6     0.0     1.5
3    0.6     2.4     0.0

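But what if, instead of just ['First', 'Second', 'Fourth'], there were other missing categories named Third and Fifth, and I wanted them included in the results of a groupby aggregation? We can do this if we have a pandas.CategoricalIndex, and we can specify that ahead of time with the levels argument.

So instead, let's define df as: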

cats = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
lvl = pd.CategoricalIndex(cats, categories=cats, ordered=True)


df = pd.concat(
    [d1, d2, d3], axis=1,
    keys=['First', 'Second', 'Fourth'],
    levels=[lvl]
)


df


  First           Second           Fourth
      A    B    C      B    C    D      A    B    D
1   NaN  NaN  NaN    0.4  0.5  0.6    0.7  0.8  0.9
2   0.1  0.2  0.3    0.4  0.5  0.6    NaN  NaN  NaN
3   0.1  0.2  0.3    NaN  NaN  NaN    0.7  0.8  0.9

But the first level of the columns object is:

df.columns.levels[0]


CategoricalIndex(
['First', 'Second', 'Third', 'Fourth', 'Fifth'],
categories=['First', 'Second', 'Third', 'Fourth', 'Fifth'],
ordered=True, dtype='category')

And our groupby summation looks like:

df.groupby(axis=1, level=0).sum()


   First  Second  Third  Fourth  Fifth
1    0.0     1.5    0.0     2.4    0.0
2    0.6     1.5    0.0     0.0    0.0
3    0.6     0.0    0.0     2.4    0.0
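To see what levels changes without relying on groupby, the stored levels can be compared with the labels that are actually in use. This is a sketch that passes a plain list for the level, which is an assumption on my part that a list works in place of the CategoricalIndex; it keeps the unused labels but not the categorical ordering.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])

cats = ["First", "Second", "Third", "Fourth", "Fifth"]
keys = ["First", "Second", "Fourth"]

inferred = pd.concat([d1, d2, d3], axis=1, keys=keys)
explicit = pd.concat([d1, d2, d3], axis=1, keys=keys, levels=[cats])

# Levels inferred from the keys only contain the keys themselves.
print(list(inferred.columns.levels[0]))

# Explicit levels also carry the labels that no column uses yet.
print(list(explicit.columns.levels[0]))

# The labels actually present in the columns are the same in both frames.
used = explicit.columns.get_level_values(0).unique().tolist()
print(used)
```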

`names`

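This is used to name the levels of the resulting MultiIndex. The length of the names list should match the number of levels in the resulting MultiIndex.

names: list, default None
Names for the levels in the resulting hierarchical index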

 # axis=0                     |  # axis=1
 pd.concat(                   |  pd.concat(
     [d1, d2],                |      [d1, d2],
     keys=[0, 1],             |      axis=1, keys=[0, 1],
     names=['lvl0', 'lvl1'])  |      names=['lvl0', 'lvl1'])
----------------------------- | ----------------------------------
             A    B    C    D | lvl0    0              1
lvl0 lvl1                     | lvl1    A    B    C    B    C    D
0    2     0.1  0.2  0.3  NaN | 1     NaN  NaN  NaN  0.4  0.5  0.6
     3     0.1  0.2  0.3  NaN | 2     0.1  0.2  0.3  0.4  0.5  0.6
1    1     NaN  0.4  0.5  0.6 | 3     0.1  0.2  0.3  NaN  NaN  NaN
     2     NaN  0.4  0.5  0.6 |
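The assigned names can be read back from the names attribute of the index or columns. A short sketch with the same inputs:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

stacked = pd.concat([d1, d2], keys=[0, 1], names=["lvl0", "lvl1"])
wide = pd.concat([d1, d2], axis=1, keys=[0, 1], names=["lvl0", "lvl1"])

print(list(stacked.index.names))  # names land on the row MultiIndex
print(list(wide.columns.names))   # names land on the column MultiIndex
```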

`verify_integrity`

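Self-explanatory documentation

verify_integrity: boolean, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

The index that results from concatenating d1 and d2 is not unique, so it fails the integrity check.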

pd.concat([d1, d2])


     A    B    C    D
2  0.1  0.2  0.3  NaN
3  0.1  0.2  0.3  NaN
1  NaN  0.4  0.5  0.6
2  NaN  0.4  0.5  0.6

And

pd.concat([d1, d2], verify_integrity=True)

> ValueError: Indexes have overlapping values: [2]
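A sketch of handling the failure in code. The exact wording of the error message may vary between pandas versions, so the example only relies on the exception type. The relabeling of d2 is purely illustrative.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

try:
    pd.concat([d1, d2], verify_integrity=True)
    raised = False
except ValueError as exc:
    # The label 2 appears in both inputs, so the check fails.
    raised = True
    print(exc)

# After relabeling the overlapping row, the same check passes.
ok = pd.concat([d1, d2.rename(index={2: 4})], verify_integrity=True)
print(list(ok.index))
```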

What are the 'levels', 'keys', and names arguments for in Pandas' concat function?
