Pandas merging two DataFrames with different columns

I'm surely missing something simple. I'm trying to merge two DataFrames in pandas that have mostly identical column names, but the right DataFrame has some columns that the left doesn't have, and vice versa.

>df_may


   id  quantity  attr_1  attr_2
0   1        20       0       1
1   2        23       1       1
2   3        19       1       1
3   4        19       0       0


>df_jun


   id  quantity  attr_1  attr_3
0   5         8       1       0
1   6        13       0       1
2   7        20       1       1
3   8        25       1       1

I tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

but the result is:

Left data columns not unique: Index([....

I also tried specifying a single column to join on (e.g. on="id"), but that duplicates every column except id, as attr_1_x and attr_1_y, which is not ideal. I also passed the entire list of columns (there are many) to on:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

which results in:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, and attr_3 filled in where possible, with NaN where those columns are absent for a row. This seems like a pretty typical data-munging workflow, but I'm stuck.

Thanks in advance.


I think in this case concat is what you want:

In [12]:


pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

By passing axis=0 here you are stacking the DataFrames on top of each other, which I believe is what you want; NaN values are produced wherever a column is absent from the respective DataFrame.
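Applied to the question's two frames (reconstructed below from the tables in the question), the call looks like this:

```python
import pandas as pd

# The question's two frames, reconstructed
df_may = pd.DataFrame({'id': [1, 2, 3, 4],
                       'quantity': [20, 23, 19, 19],
                       'attr_1': [0, 1, 1, 0],
                       'attr_2': [1, 1, 1, 0]})
df_jun = pd.DataFrame({'id': [5, 6, 7, 8],
                       'quantity': [8, 13, 20, 25],
                       'attr_1': [1, 0, 1, 1],
                       'attr_3': [0, 1, 1, 1]})

# Stack the rows; columns missing from one frame are filled with NaN
result = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
```

`result` has all eight rows, with NaN in `attr_2` for the June rows and in `attr_3` for the May rows.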

I had this problem today and couldn't solve it with any of concat, append, or merge directly. I got around it by adding a sequentially numbered helper column and then doing an outer join:

helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
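The same helper-column trick can be written without the loops. A minimal sketch, using two small illustrative frames (the names `df1`, `df2`, and `merged` follow the answer above; the data is made up):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'attr_1': [0, 1]})
df2 = pd.DataFrame({'id': [3, 4], 'attr_3': [1, 0]})

# Number the rows consecutively across both frames so the keys never collide
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)

# Outer join on the unique key, then drop the helper column;
# note that columns shared by both frames still get _x/_y suffixes
merged = df1.merge(df2, on='helper', how='outer').drop(columns='helper')
```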

The accepted answer will break if there are duplicate headers:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

For example, here A has three trial columns, which prevents concat:

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1


B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6


pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

To fix this, deduplicate the column names before concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})


for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)


pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a one-liner but less readable:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)

Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
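Since `_maybe_dedup_names` is a private pandas internal that can move or disappear between releases, a small hand-rolled helper avoids the dependency entirely. This is a sketch, not a pandas API; `dedup_columns` is a name I made up:

```python
from collections import Counter

import pandas as pd

def dedup_columns(cols):
    """Rename duplicate labels pandas-style: trial, trial.1, trial.2, ..."""
    seen = Counter()
    out = []
    for c in cols:
        out.append(c if seen[c] == 0 else f"{c}.{seen[c]}")
        seen[c] += 1
    return out

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])

# Deduplicate the column labels, then concat works as before
for df in (A, B):
    df.columns = dedup_columns(df.columns)

result = pd.concat([A, B], ignore_index=True)
```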