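Unique on a data frame using only selected columns

I have a data frame with more than 100 columns, and I want to find the unique rows by comparing only two of those columns. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.

In the example below, I would like to unique using only id and id2: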

data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))


id id2 somevalue
 1   1         x
 1   1         y
 3   4         z

I would like to obtain either of the following:

id id2 somevalue
 1   1         x
 3   4         z

or:

id id2 somevalue
 1   1         y
 3   4         z

(I have no preference as to which of the unique rows is kept)

r
unique

小开

Best answer

Ok, if it doesn't matter which value in the non-duplicated column you select, this should be pretty easy:

> dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
> dat[!duplicated(dat[,c('id','id2')]),]
  id id2 somevalue
1  1   1         x
3  3   4         z

Inside the duplicated call, I'm passing only those columns from dat that I don't want duplicates of. This code always keeps the first row of any set of duplicates (in this case, the row with x).
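
To make the mechanism visible, here is a minimal sketch that prints the logical vector duplicated returns and also shows how to keep the last row of each pair instead, which gives the second output the question would accept. The names dup_first, keep_first and keep_last are only illustrative and are not part of the original answer; fromLast is a standard argument of duplicated in base R.

```r
dat <- data.frame(id = c(1, 1, 3), id2 = c(1, 1, 4), somevalue = c("x", "y", "z"))

# duplicated() flags every row whose (id, id2) pair already appeared further up
dup_first <- duplicated(dat[, c("id", "id2")])
print(dup_first)

# Negating the flags keeps the first row of each (id, id2) pair
keep_first <- dat[!dup_first, ]
print(keep_first)

# fromLast = TRUE scans from the bottom, so the last row of each pair is kept
keep_last <- dat[!duplicated(dat[, c("id", "id2")], fromLast = TRUE), ]
print(keep_last)
```

With the sample data, keep_first holds the x and z rows and keep_last holds the y and z rows.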

小开

Using unique():

dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])),]

小开

A minor update to @Joran's code.
Using the code below, you can avoid the ambiguity in somevalue and get only the unique combinations of the two columns:

dat <- data.frame(id=c(1,1,3), id2=c(1,1,4) ,somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])), c("id", "id2")]
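
If only the two key columns are wanted, the row-name lookup is arguably not needed: calling unique directly on the two-column subset should return the same rows. The sketch below compares both forms; via_row_names and direct are illustrative names, not something from the answer above.

```r
dat <- data.frame(id = c(1, 1, 3), id2 = c(1, 1, 4), somevalue = c("x", "y", "z"))

# Form used in the answer: find the unique rows, then index dat by their row names
via_row_names <- dat[row.names(unique(dat[, c("id", "id2")])), c("id", "id2")]

# Shorter form: unique() on the two-column subset already drops repeated pairs
direct <- unique(dat[, c("id", "id2")])

print(via_row_names)
print(direct)
```

The row-name form remains useful when the other columns should be kept, as in the previous answer.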

小开

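Here are a few dplyr options that keep one row per combination of id and id2, dropping the later duplicates (dat is the data frame from the question):

library(dplyr)
# distinct() keeps the first row per (id, id2); .keep_all = TRUE retains the other columns
dat %>% distinct(id, id2, .keep_all = TRUE)
# The grouped variants return a grouped tibble; add ungroup() if that matters downstream
dat %>% group_by(id, id2) %>% filter(row_number() == 1)
dat %>% group_by(id, id2) %>% slice(1)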