Pandas merging two DataFrames with different columns

I'm surely missing something simple. I'm trying to merge two DataFrames in pandas that have mostly identical column names, but the right DataFrame has some columns that the left doesn't have, and vice versa.

>df_may


   id  quantity  attr_1  attr_2
0   1        20       0       1
1   2        23       1       1
2   3        19       1       1
3   4        19       0       0


>df_jun


   id  quantity  attr_1  attr_3
0   5         8       1       0
1   6        13       0       1
2   7        20       1       1
3   8        25       1       1

I tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

but the result is:

Left data columns not unique: Index([....

I also tried specifying a single column to join on (e.g. on="id"), but that duplicates every column except id, as attr_1_x and attr_1_y, which is not ideal. I also passed the entire list of columns (there are many) to on:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

which results in:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, and attr_3 filled in where possible, with NaN where those columns are absent for a row. This seems like a pretty typical data-munging workflow, but I'm stuck.

Thanks in advance.


I think in this case concat is what you want:

In [12]:


pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

By passing axis=0 here you are stacking the DataFrames on top of each other, which I believe is what you want; NaN values are produced wherever a column is absent from the respective DataFrame.
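Applied to the question's two frames (reconstructed below from the tables in the question), the call looks like this:

```python
import pandas as pd

# The question's two frames, reconstructed
df_may = pd.DataFrame({'id': [1, 2, 3, 4],
                       'quantity': [20, 23, 19, 19],
                       'attr_1': [0, 1, 1, 0],
                       'attr_2': [1, 1, 1, 0]})
df_jun = pd.DataFrame({'id': [5, 6, 7, 8],
                       'quantity': [8, 13, 20, 25],
                       'attr_1': [1, 0, 1, 1],
                       'attr_3': [0, 1, 1, 1]})

# Stack the rows; columns missing from one frame are filled with NaN
result = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
```

`result` has all eight rows, with NaN in `attr_2` for the June rows and in `attr_3` for the May rows.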

I had this problem today and couldn't solve it with any of concat, append, or merge directly. I got around it by adding a sequentially numbered helper column and then doing an outer join:

helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
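The same helper-column trick can be written without the loops. A minimal sketch, using two small illustrative frames (the names `df1`, `df2`, and `merged` follow the answer above; the data is made up):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'attr_1': [0, 1]})
df2 = pd.DataFrame({'id': [3, 4], 'attr_3': [1, 0]})

# Number the rows consecutively across both frames so the keys never collide
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)

# Outer join on the unique key, then drop the helper column;
# note that columns shared by both frames still get _x/_y suffixes
merged = df1.merge(df2, on='helper', how='outer').drop(columns='helper')
```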

The accepted answer will break if there are duplicate headers:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

For example, here A has three trial columns, which prevents concat:

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1


B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6


pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

To fix this, deduplicate the column names before concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})


for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)


pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a one-liner but less readable:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)

Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
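Since `_maybe_dedup_names` is a private pandas internal that can move or disappear between releases, a small hand-rolled helper avoids the dependency entirely. This is a sketch, not a pandas API; `dedup_columns` is a name I made up:

```python
from collections import Counter

import pandas as pd

def dedup_columns(cols):
    """Rename duplicate labels pandas-style: trial, trial.1, trial.2, ..."""
    seen = Counter()
    out = []
    for c in cols:
        out.append(c if seen[c] == 0 else f"{c}.{seen[c]}")
        seen[c] += 1
    return out

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])

# Deduplicate the column labels, then concat works as before
for df in (A, B):
    df.columns = dedup_columns(df.columns)

result = pd.concat([A, B], ignore_index=True)
```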