Filtering out duplicated/non-unique rows in data.table

Edit 2019: This question was asked before the November 2016 change to data.table; see the accepted answer below for both the current and the previous methods.

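I have a data.table with about 2.5 million rows and two columns. I want to remove every row that duplicates another row in both columns. With a data.frame I would previously have done this: df <- unique(df[,c('V1', 'V2')]), but this does not work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE]), but it still seems to operate only on the key of the data.table rather than on the whole row.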

Any suggestions?

Cheers, Davy

Example

>dt
V1   V2
[1,]  A    B
[2,]  A    C
[3,]  A    D
[4,]  A    B
[5,]  B    A
[6,]  C    D
[7,]  C    D
[8,]  E    F
[9,]  G    G
[10,] A    B

In the data.table above, where V2 is the key of the table, only rows 4, 7, and 10 would be removed.

> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C",
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F",
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")

unique(df) works on your example.

For v1.9.8+ (released November 2016)

From ?unique.data.table: by default, all columns are used (which is consistent with ?unique.data.frame).

unique(dt)
V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G

Or use the by argument to get unique combinations of specific columns (the job that keys did previously):

unique(dt, by = "V2")
V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  E  F
6:  G  G
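
To check which rows would be dropped before actually dropping them, a minimal sketch along the following lines may help. It assumes data.table v1.9.8 or later and rebuilds the question's table in its displayed order; the variable names (dup_rows, n_unique_rows and so on) are illustrative only.

```r
library(data.table)

# The question's table, in the order it was displayed
dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# duplicated() flags every row that repeats an earlier row (all columns by default)
dup_rows <- which(duplicated(dt))        # 4 7 10

# uniqueN() counts distinct rows without materialising the result
n_unique_rows <- uniqueN(dt)             # 7
n_unique_v2   <- uniqueN(dt, by = "V2")  # 6

# Dropping the flagged rows gives the same rows as unique(dt)
deduped <- dt[!duplicated(dt)]
```

The flagged rows are 4, 7, and 10, which are the rows the question expects to lose.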

Prior to v1.9.8

From ?unique.data.table, it is clear that calling unique on a data.table only works on the key. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G
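
On v1.9.8 or later, where unique() uses all columns by default, the key-only result shown above can still be reproduced by passing the key explicitly through by. The following is a small sketch under that assumption; keyed_only and all_cols are names chosen here for illustration.

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)
setkey(dt, V2)

# Old (pre-1.9.8) behaviour: deduplicate on the key columns only
keyed_only <- unique(dt, by = key(dt))

# Current default: deduplicate on all columns, key or no key
all_cols <- unique(dt)

nrow(keyed_only)  # 6, the (C, D) row is lost because D already appeared with A
nrow(all_cols)    # 7
```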

With your example data.table...

> dt<-data.table(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"))
> setkey(dt,V2)

Consider the following tests:

> haskey(dt) # obviously dt has a key, since we just set it
[1] TRUE


> haskey(dt[,list(V1,V2)]) # ... but this is treated like a "new" table, and does not have a key
[1] FALSE


> haskey(dt[,.SD]) # note that this still has a key
[1] TRUE

So, you can list the columns of the table and then take the unique() of that, with no need for setting the key to all columns or dropping it (by setting it to NULL) as required by the solution from @Andrie (and edited by @MatthewDowle). The solutions suggested by @Pop and @Rahul didn't work for me.

See Try 3 below, which is very similar to your initial try. Your example was not clear so I'm not sure why it didn't work. Also it was a few months ago when you posted the question, so maybe data.table was updated?

> unique(dt) # Try 1: wrong answer (missing V1=C and V2=D)
V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G


> dt[!duplicated(dt)] # Try 2: wrong answer (missing V1=C and V2=D)
V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G


> unique(dt[,list(V1,V2)]) # Try 3: correct answer; does not require modifying key
V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G


> setkey(dt,NULL)
> unique(dt) # Try 4: correct answer; requires key to be removed
V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G

This should work for you

dt <- unique(dt, by = c('V1', 'V2'))
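
The by form matters mostly when the table has more columns than the ones being deduplicated. The sketch below uses a made-up table with a hypothetical third column, id, that is not part of the question's data: unique() with by keeps the first full row for each (V1, V2) pair, while selecting the columns first returns only the pairs themselves.

```r
library(data.table)

# Illustrative data only; `id` is a hypothetical extra column
dt <- data.table(
  V1 = c("A", "A", "B", "A"),
  V2 = c("B", "B", "A", "C"),
  id = 1:4
)

# Keeps all columns, first row per (V1, V2) combination
first_per_pair <- unique(dt, by = c("V1", "V2"))
first_per_pair$id  # 1 3 4

# Keeps only the two columns, then deduplicates them
pairs_only <- unique(dt[, .(V1, V2)])
```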

Keeping data.table notation you can use:

unique(df[, .(V1, V2, V3), nomatch=0 ])

As here https://stackoverflow.com/a/31875208/10087503

I haven't compared speed of this vs Magma's version.