Pandas "Can only compare identically-labeled DataFrame objects" error

I'm using Pandas to compare the outputs of two files loaded into two data frames (uat, prod): ...

uat = uat[['Customer Number','Product']]
prod = prod[['Customer Number','Product']]
print(uat['Customer Number'] == prod['Customer Number'])
print(uat['Product'] == prod['Product'])
print(uat == prod)


The first two match exactly:
74357    True
74356    True
Name: Customer Number, dtype: bool
74357    True
74356    True
Name: Product, dtype: bool

For the third print, I get an error: Can only compare identically-labeled DataFrame objects. If the first two compared fine, what's wrong with the 3rd?

Thanks


Here's a small example to demonstrate this (it only applied to DataFrames, not Series, until pandas 0.19, where it applies to both):

In [1]: df1 = pd.DataFrame([[1, 2], [3, 4]])


In [2]: df2 = pd.DataFrame([[3, 4], [1, 2]], index=[1, 0])


In [3]: df1 == df2
Exception: Can only compare identically-labeled DataFrame objects

One solution is to sort the index first (Note: some functions require sorted indexes):

In [4]: df2.sort_index(inplace=True)


In [5]: df1 == df2
Out[5]:
      0     1
0  True  True
1  True  True

Note: == is also sensitive to the order of columns, so you may have to use sort_index(axis=1):

In [11]: df1.sort_index().sort_index(axis=1) == df2.sort_index().sort_index(axis=1)
Out[11]:
      0     1
0  True  True
1  True  True

Note: this can still raise (if the index/columns aren't identically labelled after sorting).
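To avoid the exception in that case, you could check the labels before comparing. This is only a minimal sketch: `compare_aligned` is a hypothetical helper name (not a pandas function), and it assumes the index and column labels are unique and sortable.

```python
import pandas as pd


def compare_aligned(left, right):
    """Element-wise == that tolerates a different row/column order.

    Hypothetical helper: returns None when the two frames do not carry
    the same set of labels, instead of raising.
    """
    same_rows = left.index.sort_values().equals(right.index.sort_values())
    same_cols = left.columns.sort_values().equals(right.columns.sort_values())
    if not (same_rows and same_cols):
        return None
    # reindex_like reorders right's rows and columns to match left
    return left == right.reindex_like(left)


df1 = pd.DataFrame([[1, 2], [3, 4]])
df2 = pd.DataFrame([[3, 4], [1, 2]], index=[1, 0])   # same labels, other order
df3 = pd.DataFrame([[1, 2], [3, 4]], index=[0, 5])   # different labels

result = compare_aligned(df1, df2)
mismatch = compare_aligned(df1, df3)
print(result)
print(mismatch)
```

Returning `None` is just one way to signal the mismatch; raising your own error with a clearer message would work equally well.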

If you don't need the index in the comparison, you can also drop it:

print(df1.reset_index(drop=True) == df2.reset_index(drop=True))

I have used this same technique in a unit test like so:

from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated


assert_frame_equal(actual.reset_index(drop=True), expected.reset_index(drop=True))
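Note that `reset_index(drop=True)` throws the labels away and compares by position. If the labels matter and only their order differs, `assert_frame_equal` also has a `check_like` parameter that ignores the order of index and columns. A small sketch with made-up data:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
# Same labelled data, but rows and columns arrive in a different order
actual = pd.DataFrame({"b": [4, 3], "a": [2, 1]}, index=[1, 0])

# Passes: check_like=True ignores the order of index and columns,
# while each label must still carry the same data
assert_frame_equal(actual, expected, check_like=True)
```

Without `check_like=True` the same call fails, because the default comparison is order-sensitive.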

At the time this question was asked, pandas didn't have another function to test equality, but one was added a while ago: DataFrame.equals

You use it like this:

df1.equals(df2)

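Some differences from == are:

  • You don't get the error described in the question.
  • It returns a simple boolean.
  • NaN values in the same location are considered equal.
  • The 2 DataFrames need to have the same dtype to be considered equal, see this Stack Overflow question.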
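The NaN and dtype points can be checked with a short sketch (the frames here are invented for illustration):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan]})
b = pd.DataFrame({"x": [1.0, np.nan]})

# Element-wise ==: NaN never equals NaN, so the second row compares False
elementwise_all = bool((a == b).all().all())
# equals: NaNs in the same location count as equal
whole_frame = a.equals(b)

ints = pd.DataFrame({"x": [1, 2]})
floats = pd.DataFrame({"x": [1.0, 2.0]})

# The values match element-wise...
same_values = bool((ints == floats).all().all())
# ...but equals also requires matching dtypes (int64 vs float64 here)
same_frame = ints.equals(floats)

print(elementwise_all, whole_frame, same_values, same_frame)
```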

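EDIT:
As pointed out in @paperskilltrees' answer, index alignment is important. Apart from the solution provided there, another option is to sort the index of the DataFrames before comparing them. For df1 that would be df1.sort_index(inplace=True).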

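When comparing two DataFrames, you must make sure that the number of records in the first DataFrame matches the number of records in the second. In our example, each of the two DataFrames has 4 records, with 4 products and 4 prices.

If, for example, one of the DataFrames had 5 products while the other had 4, and you tried to run the comparison, you would get the following error: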

ValueError: Can only compare identically-labeled Series objects
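A quick way to see this for yourself, with two made-up price Series of different lengths. The sketch also shows `Series.eq`, which aligns on labels rather than raising, so the unmatched row simply compares as False:

```python
import pandas as pd

four = pd.Series([1200, 800, 200, 350])
five = pd.Series([1200, 800, 200, 350, 99])   # one extra record

message = None
try:
    four == five          # indexes differ (0..3 vs 0..4), so == raises
except ValueError as exc:
    message = str(exc)
print(message)

# eq aligns the two Series on their labels first; label 4 has no
# counterpart in `four`, so it is compared against NaN and is False
aligned = four.eq(five)
print(aligned)
```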

This should work:

import pandas as pd
import numpy as np


firstProductSet = {'Product1': ['Computer','Phone','Printer','Desk'],
'Price1': [1200,800,200,350]
}
df1 = pd.DataFrame(firstProductSet,columns= ['Product1', 'Price1'])




secondProductSet = {'Product2': ['Computer','Phone','Printer','Desk'],
'Price2': [900,800,300,350]
}
df2 = pd.DataFrame(secondProductSet,columns= ['Product2', 'Price2'])




df1['Price2'] = df2['Price2'] #add the Price2 column from df2 to df1


df1['pricesMatch?'] = np.where(df1['Price1'] == df2['Price2'], 'True', 'False')  #create new column in df1 to check if prices match
df1['priceDiff?'] = np.where(df1['Price1'] == df2['Price2'], 0, df1['Price1'] - df2['Price2']) #create new column in df1 for price diff
print (df1)

Example from https://datatofish.com/compare-values-dataframes/

Flyingdutchman's answer is great but wrong: it uses DataFrame.equals, which will return False in your case. Instead, you want to use DataFrame.eq, which will return True.

It seems that DataFrame.equals does not align on the dataframe's index (it compares the labels and values in the order they are stored), while DataFrame.eq uses the dataframes' indexes for alignment and then compares the aligned values. This is an occasion to quote the central gotcha of Pandas:

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.

As we can see in the following examples, the data alignment is neither broken, nor enforced, unless explicitly requested. So we have three different situations.

  1. No explicit instruction is given regarding the alignment: == aka DataFrame.__eq__,

In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame(index=[0, 1, 2], data={'col1':list('abc')})
In [3]: df2 = pd.DataFrame(index=[2, 0, 1], data={'col1':list('cab')})
In [4]: df1 == df2
---------------------------------------------------------------------------
...
ValueError: Can only compare identically-labeled DataFrame objects


  2. Alignment is explicitly broken: DataFrame.equals, DataFrame.values, DataFrame.reset_index(),

In [5]: df1.equals(df2)
Out[5]: False


In [9]: df1.values == df2.values
Out[9]:
array([[False],
       [False],
       [False]])


In [10]: (df1.values == df2.values).all().all()
Out[10]: False


  3. Alignment is explicitly enforced: DataFrame.eq, DataFrame.sort_index(),

In [6]: df1.eq(df2)
Out[6]:
   col1
0  True
1  True
2  True


In [8]: df1.eq(df2).all().all()
Out[8]: True
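The three situations can be put side by side in one script. This is a sketch reusing the same df1 and df2; the `sort_index()` lines are my own addition to illustrate the other enforcing option named in item 3:

```python
import pandas as pd

df1 = pd.DataFrame(index=[0, 1, 2], data={'col1': list('abc')})
df2 = pd.DataFrame(index=[2, 0, 1], data={'col1': list('cab')})

# Alignment enforced: eq matches rows by label before comparing
by_label = bool(df1.eq(df2).all().all())

# Alignment broken: raw arrays are compared position by position
by_position = bool((df1.values == df2.values).all())

# Alignment enforced by hand: sort df2's index so the labels line up,
# after which both == and equals work
after_sort = bool((df1 == df2.sort_index()).all().all())
equals_after_sort = df1.equals(df2.sort_index())

print(by_label, by_position, after_sort, equals_after_sort)
```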
    

My answer is for pandas version 1.0.3.

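Here I show a complete example of how to handle this error. I pad the shorter dataframe with rows of zeros. Your dataframes can come from CSV or any other source.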

import pandas as pd
import numpy as np




# df1 with 9 rows
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
'Age':[23,45,12,34,27,44,28,39,40]})


# df2 with 8 rows
df2 = pd.DataFrame({'Name':['John','Mike','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
'Age':[25,45,14,34,26,44,29,42]})




# get lengths of df1 and df2
df1_len = len(df1)
df2_len = len(df2)




diff = df1_len - df2_len


rows_to_be_added1 = rows_to_be_added2 = 0
# rows_to_be_added1 = np.zeros(diff)


if diff < 0:
    rows_to_be_added1 = abs(diff)
else:
    rows_to_be_added2 = diff


# add empty rows to df1 (DataFrame.append was removed in pandas 2.0, so use pd.concat)
if rows_to_be_added1 > 0:
    df1 = pd.concat([df1, pd.DataFrame(np.zeros((rows_to_be_added1, len(df1.columns))), columns=df1.columns)])


# add empty rows to df2
if rows_to_be_added2 > 0:
    df2 = pd.concat([df2, pd.DataFrame(np.zeros((rows_to_be_added2, len(df2.columns))), columns=df2.columns)])


# at this point we have two dataframes with the same number of rows, and maybe different indexes
# drop the indexes of both, so we can compare the dataframes and other operations like update etc.
df2.reset_index(drop=True, inplace=True)
df1.reset_index(drop=True, inplace=True)


# add a new column to df1
df1['New_age'] = None


# compare the Age column of df1 and df2, and update the New_age column of df1 with the Age column of df2 if they match, else None
df1['New_age'] = np.where(df1['Age'] == df2['Age'], df2['Age'], None)


# drop rows where Name is 0.0
df2 = df2.drop(df2[df2['Name'] == 0.0].index)


# now we don't get the error ValueError: Can only compare identically-labeled Series objects

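In my case, I found where the error came from:

The problem was that the list of column names was accidentally wrapped in another list.

Consider the following example: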

column_names=['warrior','eat','ok','monkeys']


df_good = pd.DataFrame(np.ones(shape=(6,4)),columns=column_names)
df_good['ok'] < df_good['monkeys']


>>> 0    False
1    False
2    False
3    False
4    False
5    False


df_bad = pd.DataFrame(np.ones(shape=(6,4)),columns=[column_names])
df_bad ['ok'] < df_bad ['monkeys']


>>> ValueError: Can only compare identically-labeled DataFrame objects

The problem is that you can't visually tell the bad dataframe from the good one.
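Although the printed frames look alike, the difference does show up if you inspect the columns object. A small sketch of one possible check, with the flattening step at the end as my own suggestion rather than part of the original fix:

```python
import numpy as np
import pandas as pd

column_names = ['warrior', 'eat', 'ok', 'monkeys']
df_good = pd.DataFrame(np.ones(shape=(6, 4)), columns=column_names)
df_bad = pd.DataFrame(np.ones(shape=(6, 4)), columns=[column_names])

# The nested list turns the columns into a one-level MultiIndex
good_is_multi = isinstance(df_good.columns, pd.MultiIndex)
bad_is_multi = isinstance(df_bad.columns, pd.MultiIndex)

# Selecting a column therefore yields a DataFrame instead of a Series,
# which is why the error mentions DataFrame objects
good_type = type(df_good['ok']).__name__
bad_type = type(df_bad['ok']).__name__
print(good_is_multi, bad_is_multi, good_type, bad_type)

# One way to repair it: keep only the first (and only) level
df_fixed = df_bad.copy()
df_fixed.columns = df_fixed.columns.get_level_values(0)
fixed_result = df_fixed['ok'] < df_fixed['monkeys']
```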

In my case, I just passed the columns parameter directly when creating the DataFrame, because the data from one SQL query came with column names and the data from the other did not.