小开

Best answer

pd.concat needs unique index labels whenever it has to align (reindex) the inputs against each other, for example on the row index when axis=1. To remove rows with duplicate indices, use

df = df.loc[~df.index.duplicated(keep='first')]

import pandas as pd


df1 = pd.DataFrame(
{'price': [0.7286, 0.7286, 0.7286, 0.7286],
'side': [2, 2, 2, 2],
'timestamp': [1451865675631331, 1451865675631400,
1451865675631861, 1451865675631866]},
index=pd.DatetimeIndex(['2000-1-1', '2000-1-1', '2001-1-1', '2002-1-1']))




df2 = pd.DataFrame(
{'bid': [0.7284, 0.7284, 0.7284, 0.7285, 0.7285],
'bid_size': [4000000, 4000000, 5000000, 1000000, 4000000],
'offer': [0.7285, 0.729, 0.7286, 0.7286, 0.729],
'offer_size': [1000000, 4000000, 4000000, 4000000, 4000000]},
index=pd.DatetimeIndex(['2000-1-1', '2001-1-1', '2002-1-1', '2003-1-1', '2004-1-1']))




df1 = df1.loc[~df1.index.duplicated(keep='first')]
#              price  side         timestamp
# 2000-01-01  0.7286     2  1451865675631331
# 2001-01-01  0.7286     2  1451865675631861
# 2002-01-01  0.7286     2  1451865675631866


df2 = df2.loc[~df2.index.duplicated(keep='first')]
#                bid  bid_size   offer  offer_size
# 2000-01-01  0.7284   4000000  0.7285     1000000
# 2001-01-01  0.7284   4000000  0.7290     4000000
# 2002-01-01  0.7284   5000000  0.7286     4000000
# 2003-01-01  0.7285   1000000  0.7286     4000000
# 2004-01-01  0.7285   4000000  0.7290     4000000


result = pd.concat([df1, df2], axis=0)
print(result)
#                bid  bid_size   offer  offer_size   price  side     timestamp
# 2000-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
# 2001-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
# 2002-01-01     NaN       NaN     NaN         NaN  0.7286     2  1.451866e+15
# 2000-01-01  0.7284   4000000  0.7285     1000000     NaN   NaN           NaN
# 2001-01-01  0.7284   4000000  0.7290     4000000     NaN   NaN           NaN
# 2002-01-01  0.7284   5000000  0.7286     4000000     NaN   NaN           NaN
# 2003-01-01  0.7285   1000000  0.7286     4000000     NaN   NaN           NaN
# 2004-01-01  0.7285   4000000  0.7290     4000000     NaN   NaN           NaN
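The keep argument of Index.duplicated decides which of the repeated labels survives. Here is a minimal sketch on a made-up three-row frame (the labels and values are illustrative only, not taken from the question):

```python
import pandas as pd

# Toy frame with a repeated index label 'a' (illustrative data).
df = pd.DataFrame({'v': [1, 2, 3]}, index=['a', 'a', 'b'])

# keep='first' marks later repeats as duplicates, so the first 'a' survives.
first = df.loc[~df.index.duplicated(keep='first')]
# keep='last' marks earlier repeats as duplicates, so the last 'a' survives.
last = df.loc[~df.index.duplicated(keep='last')]
# keep=False marks every repeated label, so no 'a' row survives.
unique_only = df.loc[~df.index.duplicated(keep=False)]

print(first['v'].tolist())
print(last['v'].tolist())
print(unique_only['v'].tolist())
```

Which variant is appropriate depends on whether the repeated rows are redundant copies or distinct observations that happen to share a label.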

Note there is also DataFrame.join, which joins DataFrames on their indices and handles non-unique indices according to the how parameter. Rows with a duplicate index are not removed. The examples below use the original df1, before its duplicate rows were dropped.

In [94]: df1.join(df2)
Out[94]:
             price  side         timestamp     bid  bid_size   offer  \
2000-01-01  0.7286     2  1451865675631331  0.7284   4000000  0.7285
2000-01-01  0.7286     2  1451865675631400  0.7284   4000000  0.7285
2001-01-01  0.7286     2  1451865675631861  0.7284   4000000  0.7290
2002-01-01  0.7286     2  1451865675631866  0.7284   5000000  0.7286


            offer_size
2000-01-01     1000000
2000-01-01     1000000
2001-01-01     4000000
2002-01-01     4000000


In [95]: df1.join(df2, how='outer')
Out[95]:
             price  side     timestamp     bid  bid_size   offer  offer_size
2000-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7285     1000000
2000-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7285     1000000
2001-01-01  0.7286     2  1.451866e+15  0.7284   4000000  0.7290     4000000
2002-01-01  0.7286     2  1.451866e+15  0.7284   5000000  0.7286     4000000
2003-01-01     NaN   NaN           NaN  0.7285   1000000  0.7286     4000000
2004-01-01     NaN   NaN           NaN  0.7285   4000000  0.7290     4000000

小开

You can mitigate this error without having to change your data or remove duplicates. Just create a new index with DataFrame.reset_index:

df = df.reset_index()

The old index is kept as a column in your dataframe, but if you don't need it you can do:

df = df.reset_index(drop=True)

Some prefer:

df.reset_index(inplace=True, drop=True)
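As a rough sketch of how this plays out with a side-by-side concat (the frames and column names here are made up for illustration), note that after the reset the rows are paired purely by position, not by their original labels:

```python
import pandas as pd

# Illustrative frames; df1 has a repeated row label.
df1 = pd.DataFrame({'price': [0.1, 0.2, 0.3]}, index=[0, 0, 1])
df2 = pd.DataFrame({'bid': [1.0, 2.0, 3.0]}, index=[0, 1, 2])

# Replace both indices with a fresh 0..n-1 range before aligning side by side.
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# Rows are now matched by position: row 0 with row 0, and so on.
result = pd.concat([df1, df2], axis=1)
print(result)
```

If the original labels carried meaning (timestamps, for instance), positional pairing may silently line up unrelated rows, so this is only a fit when row order is what should be matched.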

小开

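The best solution I found comes from this page: https://pandas.pydata.org/pandas-docs/version/0.20/merging.html. Note that join_axes was deprecated in pandas 0.25 and removed in 1.0, so the line below only works on older versions.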

df = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
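On pandas versions without join_axes, reindexing the concatenated result against df1.index should give the same rows. A minimal sketch with made-up frames, assuming both indices are unique (duplicate labels would still trip the alignment step):

```python
import pandas as pd

# Illustrative frames with partially overlapping, unique labels.
df1 = pd.DataFrame({'price': [0.1, 0.2]}, index=['a', 'b'])
df2 = pd.DataFrame({'bid': [1.0, 2.0]}, index=['b', 'c'])

# Outer-align on the row index, then keep only the rows labelled in df1.
result = pd.concat([df1, df2], axis=1).reindex(df1.index)
print(result)
```

For this particular shape of problem, df1.join(df2) is another option, since a left join also keeps exactly the labels of df1.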

小开

This post comes up at the top when you search for the error, but the answers are not complete, so let me add mine. There is another reason this error can happen: if you have duplicate column names in your DataFrames, the concatenation can fail with this same error. In fact, even in the original question there are two columns called timestamp. So it is worth checking that len(df.columns) == len(set(df.columns)) holds for every DataFrame you are trying to concatenate.
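A small sketch of that check, using Index.duplicated to also report which labels repeat. The helper name duplicated_columns and the sample frame are made up for illustration:

```python
import pandas as pd

def duplicated_columns(df):
    """Return the column labels that occur more than once (hypothetical helper)."""
    return df.columns[df.columns.duplicated()].unique().tolist()

# Illustrative frame with two columns called 'timestamp'.
df = pd.DataFrame([[1, 2, 3]], columns=['timestamp', 'price', 'timestamp'])

print(duplicated_columns(df))
# Index.is_unique is a quick yes/no version of the same check.
print(df.columns.is_unique)
```

Running the helper over every frame before calling pd.concat makes it easier to see which input is responsible.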

小开

As a complement to Nicholas Morley's answer, if you find that even this does not work:

df = df.reset_index(drop=True)

you should check whether the column names are unique. When they are not, even resetting the index does not help. Duplicated columns have to be removed first.

小开

Duplicated column names!

In my case the problem was because I had duplicated column names.

小开

This also happens when you have duplicates in the column names.

小开

Same Indices Between the Two DFs

Another reason for this issue might be that df1 and df2 share index labels. For example, both DataFrames might contain the same label idx1.

To check whether this is the issue, see if the following prints a non-empty list:

print([org_name for org_name in cum_df.index if org_name in df_from_2002.index])

My suggested solution is to rename the overlapping label in one of the frames (df1 keeps idx1, and idx1 becomes idx2 in df2), concatenate (df1 = pd.concat([df1, df2])), and then, if you need the sum of the two rows, combine them with this code (using .loc, since idx1 and idx2 are labels rather than positions):

df1.loc[idx1] = df1.loc[[idx1, idx2]].sum()

and then remove idx2:

df1.drop([idx2], inplace=True)
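As an alternative sketch, not the method described above: Index.intersection finds the shared labels directly, and a groupby on the index after a row-wise concat sums the rows that share a label without any renaming. The frames below are made up, and summing is assumed to be the desired way to combine overlapping rows:

```python
import pandas as pd

# Illustrative frames that share the label 'y'.
df1 = pd.DataFrame({'v': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'v': [10, 20]}, index=['y', 'z'])

# Labels present in both frames.
overlap = df1.index.intersection(df2.index)
print(overlap.tolist())

# Stack the frames row-wise, then sum the rows that share an index label.
combined = pd.concat([df1, df2]).groupby(level=0).sum()
print(combined)
```

The groupby also sorts the result by label, which may or may not be wanted.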

小开

This happened to me when I was trying to concat two dataframes that have duplicated column names!

Let's say that I want to remove the first duplicated column:

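duplicated_column = 'column'

# Select every column with this name (a DataFrame) and transpose so each copy is a row
df_tmp = df[duplicated_column].T
# Drop the first copy and keep the rest
df_tmp = df_tmp.iloc[1:, :]

# Remove all copies from df, then add back the ones that were kept
df = df.drop([duplicated_column], axis=1)
df = pd.concat([df, df_tmp.T], axis=1)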

小开

The answers here helped, but concat worked fine for me in some cases even when duplicate columns were present. In other cases it raised InvalidIndexError.

It turned out that it works if the duplicate columns appear in the same order in both DataFrames, but raises an error if their order differs.

Example where it works fine:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7], 'c': [9, 10, 11]})
df1 = pd.DataFrame({'a': [12], 'b': [13], 'c': [14]})
df.rename(columns={
'c': 'b'
}, inplace=True)
df1.rename(columns={
'c': 'b'
}, inplace=True)
print(pd.concat([df, df1]))


Output:
    a   b   b
0   1   5   9
1   2   6  10
2   3   7  11
0  12  13  14

Example where it doesn't work:

df = pd.DataFrame({'b': [1, 2, 3], 'a': [5, 6, 7], 'c': [9, 10, 11]})
df1 = pd.DataFrame({'a': [12], 'b': [13], 'c': [14]})
df.rename(columns={
'c': 'b'
}, inplace=True)
df1.rename(columns={
'c': 'b'
}, inplace=True)
print(pd.concat([df, df1]))


Output:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely
valued Index objects

小开

This is because you have duplicated columns. Before concatenating, drop the duplicated columns in each DataFrame as follows:

df = df.loc[:,~df.columns.duplicated()].reset_index(drop=True)
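A minimal sketch of applying this to every frame before a row-wise concat. The two frames are made up, with duplicate column names in a different order in each, and keeping only the first copy of each name is assumed to be acceptable:

```python
import pandas as pd

# Illustrative frames: 'b' appears twice in each, in different positions.
df = pd.DataFrame([[1, 5, 9], [2, 6, 10]], columns=['b', 'a', 'b'])
df1 = pd.DataFrame([[12, 13, 14]], columns=['a', 'b', 'b'])

# Keep only the first column for each repeated name in every frame.
frames = [f.loc[:, ~f.columns.duplicated()] for f in (df, df1)]

# With unique column names, the frames can be aligned and stacked.
result = pd.concat(frames, ignore_index=True)
print(result)
```

The later copies of each duplicated column are discarded here, so this only fits when they are redundant.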

小开

The problem for me was duplicate column labels, as many others here have mentioned. To keep only the first column of each set of duplicates, I used the following:

df=df.T[~df.T.index.duplicated(keep='first')].T
