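How to check whether two data frames are equal

Say I have large datasets in R and I just want to know whether two of them are the same. I often need this when I'm experimenting with different algorithms that should produce the same result. For example, say we have the following datasets: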

df1 <- data.frame(num = 1:5, let = letters[1:5])
df2 <- df1
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3

And this is how I compare them:

table(x == y, useNA = 'ifany')

This works great when the datasets have no NAs:

> table(df1 == df2, useNA = 'ifany')
TRUE
10

But not so much when they do have NAs:

> table(df3 == df4, useNA = 'ifany')
TRUE <NA>
11    1

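In this example it's easy to dismiss the NA as harmless, since we know that both data frames are equal. The problem is that NA == <anything> yields NA, so whenever one of the datasets has an NA, the result at that position is always NA, no matter what the other dataset has there.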
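A minimal sketch of why that NA is more than a cosmetic issue. The data frames `a` and `b` and the `all(..., na.rm = TRUE)` shortcut are illustrative assumptions, not part of the example above:

```r
# Comparing NA with anything, including another NA, gives NA
NA == 1
NA == NA

# Two small data frames that differ only where `a` has an NA
a <- data.frame(num = c(1, NA))
b <- data.frame(num = c(1, 99))

cmp <- a == b                               # logical matrix: TRUE, NA
hidden <- all(cmp, na.rm = TRUE)            # TRUE: dropping NAs hides the mismatch
same   <- identical(a, b)                   # FALSE: the frames really differ
```

So simply ignoring the NA cells can report a match for data frames that are not equal.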

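So using table() to compare datasets doesn't seem ideal to me.

Note: this is not a duplicate of "Comparing several datasets", "Comparing two datasets in R", or "Compare datasets in R".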


Look up all.equal(). It has some caveats, but it might work for you.

all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE
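Two of those caveats, sketched under my own assumptions (the data frames `x`, `y` and `z` are made up for illustration): all.equal() compares numeric values up to a small tolerance, and on a mismatch it returns a character description rather than FALSE, so it is usually wrapped in isTRUE() when a plain logical is needed.

```r
x <- data.frame(v = c(1, 2, 3))
y <- data.frame(v = c(1, 2, 3 + 1e-10))   # tiny numeric difference
z <- data.frame(v = c(1, 2, 4))           # real difference

exact_xy <- identical(x, y)               # FALSE: exact comparison
near_xy  <- isTRUE(all.equal(x, y))       # TRUE: within the default tolerance

res_xz  <- all.equal(x, z)                # character vector describing the difference
near_xz <- isTRUE(res_xz)                 # FALSE, safe to use inside if()
```

Whether the tolerance is a feature or a problem depends on whether "equal up to rounding error" counts as equal for your use case.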

As Metrics pointed out, one could also use identical() to compare the datasets. The difference between this approach and that of Codoremifa is that identical() will just yield TRUE or FALSE, depending on whether the objects being compared are identical or not, whereas all.equal() will either return TRUE or hints about the differences between the objects. For instance, consider the following:

> identical(df1, df3)
[1] FALSE


> all.equal(df1, df3)
[1] "Attributes: < Component 2: Numeric: lengths (5, 6) differ >"
[2] "Component 1: Numeric: lengths (5, 6) differ"
[3] "Component 2: Lengths: 5, 6"
[4] "Component 2: Attributes: < Component 2: Lengths (5, 6) differ (string compare on first 5) >"
[5] "Component 2: Lengths (5, 6) differ (string compare on first 5)"

Moreover, from what I've tested, identical() seems to run much faster than all.equal().
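A rough way to check this on your own machine, as a sketch only: the size `n` and the data frames `big1` and `big2` are arbitrary choices of mine, and the timings will vary with hardware and R version, so no numbers are shown.

```r
set.seed(1)
n <- 1e5
big1 <- data.frame(id = seq_len(n), val = rnorm(n))
# Build big2 from fresh vectors so it is a real copy, not the same object in memory
big2 <- data.frame(id = seq_len(n), val = big1$val + 0)

t_identical <- system.time(res_identical <- identical(big1, big2))
t_all_equal <- system.time(res_all_equal <- isTRUE(all.equal(big1, big2)))

print(t_identical)
print(t_all_equal)
```

For anything more careful than a quick look, repeating each call many times and comparing the distributions would be more reliable than a single system.time() run.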