比较两个Data.Frame以查找Data.Frame 1中的行,这些行不在Data.Frame2中

我有以下2个Data.Frame:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

我想找出A1有而A2没有的行。

这种类型的操作是否有内置功能?

(附言:我确实为它写了一个解决方案,我只是好奇是否有人已经做了一个更精巧的代码)

以下是我的解决方案:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])


rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)
420142 次浏览

对于这个特定的目的,它肯定不是有效的,但在这些情况下,我经常做的是在每个Data.Frame中插入指示符变量,然后合并:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

所包括的_A1中的缺失值将注意到A1中缺少哪些行。A2也是如此。

您的解决方案的一个问题是列顺序必须匹配。另一个问题是,很容易想象这样的情况,即行被编码为相同,而实际上是不同的。使用合并的优点是,您可以免费获得一个好的解决方案所必需的所有错误检查。

这并没有直接回答你的问题,但它会给你共同的元素。这可以通过Paul Murrell的软件包compare来完成:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
#  a b
#1 1 a
#2 2 b
#3 3 c

函数compare在允许的比较类型方面为您提供了很大的灵活性(例如,更改每个向量的元素顺序、更改变量的顺序和名称、缩短变量、更改字符串的大小写)。从这一点,你应该能够找出一个或另一个缺少什么。例如(这不是很优雅):

difference <-
data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
#  a b
#1 4 d
#2 5 e

我调整了merge函数以获得此功能。在较大的DataFrame上,它比完全合并解决方案使用更少的内存。我可以处理关键列的名称。

另一种解决方案是使用库prob

#  Derived from src/library/base/R/merge.R
#  Part of the R package, http://www.R-project.org
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/


XinY <-
function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
notin = FALSE, incomparables = NULL,
...)
{
fix.by <- function(by, df)
{
## fix up 'by' to be a valid set of cols by number: 0 is row.names
if(is.null(by)) by <- numeric(0L)
by <- as.vector(by)
nc <- ncol(df)
if(is.character(by))
by <- match(by, c("row.names", names(df))) - 1L
else if(is.numeric(by)) {
if(any(by < 0L) || any(by > nc))
stop("'by' must match numbers of columns")
} else if(is.logical(by)) {
if(length(by) != nc) stop("'by' must match number of columns")
by <- seq_along(by)[by]
} else stop("'by' must specify column(s) as numbers, names or logical")
if(any(is.na(by))) stop("'by' must specify valid column(s)")
unique(by)
}


nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
by.x <- fix.by(by.x, x)
by.y <- fix.by(by.y, y)
if((l.b <- length(by.x)) != length(by.y))
stop("'by.x' and 'by.y' specify different numbers of columns")
if(l.b == 0L) {
## was: stop("no columns to match on")
## returns x
x
}
else {
if(any(by.x == 0L)) {
x <- cbind(Row.names = I(row.names(x)), x)
by.x <- by.x + 1L
}
if(any(by.y == 0L)) {
y <- cbind(Row.names = I(row.names(y)), y)
by.y <- by.y + 1L
}
## create keys from 'by' columns:
if(l.b == 1L) {                  # (be faster)
bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
} else {
## Do these together for consistency in as.character.
## Use same set of names.
bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
bx <- bz[seq_len(nx)]
by <- bz[nx + seq_len(ny)]
}
comm <- match(bx, by, 0L)
if (notin) {
res <- x[comm == 0,]
} else {
res <- x[comm > 0,]
}
}
## avoid a copy
## row.names(res) <- NULL
attr(res, "row.names") <- .set_row_names(nrow(res))
res
}




XnotinY <-
function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
notin = TRUE, incomparables = NULL,
...)
{
XinY(x,y,by,by.x,by.y,notin,incomparables)
}

sqldf提供了一个很好的解决方案。

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])


require(sqldf)


a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

以及在两个数据帧中的行:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

dplyr的新版本有一个函数,anti_join,正是用于这些类型的比较

require(dplyr)
anti_join(a1,a2)

semi_join,以筛选a1中也在a2中的行

semi_join(a1,a2)

我写了一个包(https://github.com/alexsanjoseph/comparedf),因为我有同样的问题。

  > df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
> df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
> df_compare = compare_df(df1, df2, "row")


> df_compare$comparison_df
row chng_type a b
1   4         + 4 d
2   5         + 5 e

一个更复杂的例子:

library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", "Duster 360", "Merc 240D"),
id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
hp = c(110, 110, 181, 110, 245, 62),
cyl = c(6, 6, 4, 6, 8, 4),
qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))


df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
hp = c(110, 110, 93, 110, 175, 105),
cyl = c(6, 6, 4, 6, 8, 6),
qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))


> df_compare$comparison_df
grp chng_type                id1 id2  hp cyl  qsec
1   1         -  Hornet Sportabout Dus 175   8 17.02
2   2         +         Datsun 710 Dat 181   4 33.00
3   2         -         Datsun 710 Dat  93   4 18.61
4   3         +         Duster 360 Dus 245   8 15.84
5   7         +          Merc 240D Mer  62   4 20.00
6   8         -            Valiant Val 105   6 20.22

该软件包还有一个用于快速检查的HTML_输出命令。

DF_比较$HTML_输出 enter image description here

基于PLYR中的匹配_DF的

又一解决方案。 这是Plyr的比赛_DF:

match_df <- function (x, y, on = NULL)
{
if (is.null(on)) {
on <- intersect(names(x), names(y))
message("Matching on: ", paste(on, collapse = ", "))
}
keys <- join.keys(x, y, on)
x[keys$x %in% keys$y, , drop = FALSE]
}

我们可以修改它来否定:

library(plyr)
negate_match_df <- function (x, y, on = NULL)
{
if (is.null(on)) {
on <- intersect(names(x), names(y))
message("Matching on: ", paste(on, collapse = ", "))
}
keys <- join.keys(x, y, on)
x[!(keys$x %in% keys$y), , drop = FALSE]
}

然后:

diff <- negate_match_df(a1,a2)

您的示例数据没有任何重复项,但您的解决方案会自动处理它们。这意味着在重复的情况下,某些答案可能与函数的结果不匹配。
这是我的解决方案,地址重复的方式与您的相同。它的规模也很大!

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}


library(data.table)
setDT(a1)
setDT(a2)


# no duplicates - as in example code
r <- fsetdiff(a1, a2)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE


# handling duplicates - make some duplicates
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)
r <- fsetdiff(a1, a2, all = TRUE)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

它需要数据。表1.9.8+

也许它过于简单,但我使用了这个解决方案,我发现当我有一个可以用来比较数据集的主键时,它非常有用。希望能有所帮助。

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
different.names <- (!a1$a %in% a2$a)
not.in.a2 <- a1[different.names,]

使用diffobj包装:

library(diffobj)


diffPrint(a1, a2)
diffObj(a1, a2)

enter image description here

enter image description here

德普利尔中:

setdiff(a1,a2)

基本上,setdiff(bigFrame, smallFrame)将获取第一个表中的额外记录。

在sqlverse中,这称为

Left Excluding Join Venn Diagram

为了更好地描述所有连接选项和集合主题,这是迄今为止我所见过的最好的总结之一:http://www.vertabelo.com/blog/technical-articles/sql-joins

但回到这个问题-以下是使用OP数据时setdiff()代码的结果:

> a1
a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e


> a2
a b
1 1 a
2 2 b
3 3 c


> setdiff(a1,a2)
a b
1 4 d
2 5 e

甚至anti_join(a1,a2)也会得到相同的结果。
更多信息:https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

您可以使用daff程序包(使用V8包装daff.js):

library(daff)


diff_data(data_ref = a2,
data = a1)

生成以下差异对象:

Daff Comparison: ‘a2’ vs. ‘a1’
First 6 and last 6 patch lines:
@@   a   b
1 ... ... ...
2       3   c
3 +++   4   d
4 +++   5   e
5 ... ... ...
6 ... ... ...
7       3   c
8 +++   4   d
9 +++   5   e

表格差异格式是在这里中描述的,应该是非常容易理解的。在第一列@@中具有+++的行是在a1中新的并且在a2中不存在的行。

差异对象可用于patch_data(),使用write_diff()存储差异用于文档目的,或使用render_diff()显示差异

render_diff(
diff_data(data_ref = a2,
data = a1)
)

生成简洁的HTML输出:

enter image description here

使用subset

missing<-subset(a1, !(a %in% a2$a))

以下代码同时使用data.tablefastmatch来提高速度。

library("data.table")
library("fastmatch")


a1 <- setDT(data.frame(a = 1:5, b=letters[1:5]))
a2 <- setDT(data.frame(a = 1:3, b=letters[1:3]))


compare_rows <- a1$a %fin% a2$a
# the %fin% function comes from the `fastmatch` package


added_rows <- a1[which(compare_rows == FALSE)]


added_rows


#    a b
# 1: 4 d
# 2: 5 e

真正快速的比较,以获得差异的计数。 使用特定的列名。

colname = "CreatedDate" # specify column name
index <- match(colname, names(source_df)) # get index name for column name
sel <- source_df[, index] == target_df[, index] # get differences, gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches

对于完整的数据帧,请不要提供列或索引名称

sel <- source_df[, ] == target_df[, ] # gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches