在 R 中查找重复值

我有一张21638行的表:

vocabulary <- read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Vocabulary.txt", header=T)

该表有五列,其中第一列包含响应者的 ID 号。我想检查是否有受访者出现两次,或者是否所有的受访者都是独一无二的。

来计算我可以使用的唯一 ID

length(unique(vocabulary$id))

并检查是否有任何复制品,我可能做

length(unique(vocabulary$id)) == nrow(vocabulary)

返回 TRUE,如果没有副本(没有副本)。

我的问题是:

有没有一种直接的方法来返回重复的值或行号?

一些进一步的解释:

使用函数 duplicated()有一个解释问题,因为它只返回严格意义上的重复,不包括“原始”。例如,sum(duplicated(vocabulary$id))dim(vocabulary[duplicated(vocabulary$id),])[1]可能返回“5”作为重复行数。问题是,如果只知道重复的行数,就不知道它们重复了多少行。“5”是指有五行,每行有一个副本,还是有一行有五个副本?而且由于您不会有重复的 ID 或行号,因此您不会有任何方法来找到“原件”。


* 我知道在这个调查中没有重复的 ID,但这是一个很好的例子,因为使用任何其他地方给出的答案,如 ABC0或 table(vocabulary$id)将输出一个干草堆到你的屏幕上,你将很难找到任何可能罕见的重复针。

360911 次浏览

This will give you duplicate rows:

vocabulary[duplicated(vocabulary$id),]

This will give you the number of duplicates:

dim(vocabulary[duplicated(vocabulary$id),])[1]

Example:

vocabulary2 <-rbind(vocabulary,vocabulary[1,]) #creates a duplicate at the end
vocabulary2[duplicated(vocabulary2$id),]
#            id year    sex education vocabulary
#21639 20040001 2004 Female         9          3
dim(vocabulary2[duplicated(vocabulary2$id),])[1]
#[1] 1 #=1 duplicate

EDIT

OK, with the additional information, here's what you should do: duplicated has a fromLast option which allows you to get duplicates from the end. If you combine this with the normal duplicated, you get all duplicates. The following example adds duplicates to the original vocabulary object (line 1 is duplicated twice and line 5 is duplicated once). I then use table to get the total number of duplicates per ID.

#Create vocabulary object with duplicates
voc.dups <-rbind(vocabulary,vocabulary[1,],vocabulary[1,],vocabulary[5,])


#List duplicates
dups <-voc.dups[duplicated(voc.dups$id)|duplicated(voc.dups$id, fromLast=TRUE),]
dups
#            id year    sex education vocabulary
#1     20040001 2004 Female         9          3
#5     20040008 2004   Male        14          1
#21639 20040001 2004 Female         9          3
#21640 20040001 2004 Female         9          3
#51000 20040008 2004   Male        14          1


#Count duplicates by id
table(dups$id)
#20040001 20040008
#       3        2

Here's a data.table solution that will list the duplicates along with the number of duplications (will be 1 if there are 2 copies, and so on - you can adjust that to suit your needs):

library(data.table)
dt = data.table(vocabulary)


dt[duplicated(id), cbind(.SD[1], number = .N), by = id]

You could use table, i.e.

n_occur <- data.frame(table(vocabulary$id))

gives you a data frame with a list of ids and the number of times they occurred.

n_occur[n_occur$Freq > 1,]

tells you which ids occurred more than once.

vocabulary[vocabulary$id %in% n_occur$Var1[n_occur$Freq > 1],]

returns the records with more than one occurrence.

A terser way, either with rev :

x[!(!duplicated(x) & rev(!duplicated(rev(x))))]

... rather than fromLast:

x[!(!duplicated(x) & !duplicated(x, fromLast = TRUE))]

... and as a helper function to provide either logical vector or elements from original vector :

duplicates <- function(x, as.bool = FALSE) {
is.dup <- !(!duplicated(x) & rev(!duplicated(rev(x))))
if (as.bool) { is.dup } else { x[is.dup] }
}

Treating vectors as data frames to pass to table is handy but can get difficult to read, and the data.table solution is fine but I'd prefer base R solutions for dealing with simple vectors like IDs.

Here, I summarize a few ways which may return different results to your question, so be careful:

# First assign your "id"s to an R object.
# Here's a hypothetical example:
id <- c("a","b","b","c","c","c","d","d","d","d")


#To return ALL MINUS ONE duplicated values:
id[duplicated(id)]
## [1] "b" "c" "c" "d" "d" "d"


#To return ALL duplicated values by specifying fromLast argument:
id[duplicated(id) | duplicated(id, fromLast=TRUE)]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"


#Yet another way to return ALL duplicated values, using %in% operator:
id[ id %in% id[duplicated(id)] ]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"

Hope these help. Good luck.