过滤掉 data.table 中的重复/非唯一行

小开

unique(df) works on your example.

小开

最佳答案

For v1.9.8+ (released November 2016)

From ?unique.data.table By default all columns are being used (which is consistent with ?unique.data.frame)

unique(dt)
V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G

Or using the by argument in order to get unique combinations of specific columns (like previously keys were used for)

unique(dt, by = "V2")
V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  E  F
6:  G  G

Prior v1.9.8

From ?unique.data.table, it is clear that calling unique on a data table only works on the key. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G

小开

With your example data.table...

> dt<-data.table(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"))
> setkey(dt,V2)

Consider the following tests:

> haskey(dt) # obviously dt has a key, since we just set it
[1] TRUE


> haskey(dt[,list(V1,V2)]) # ... but this is treated like a "new" table, and does not have a key
[1] FALSE


> haskey(dt[,.SD]) # note that this still has a key
[1] TRUE

So, you can list the columns of the table and then take the unique() of that, with no need for setting the key to all columns or dropping it (by setting it to NULL) as required by the solution from @Andrie (and edited by @MatthewDowle). The solutions suggested by @Pop and @Rahul didn't work for me.

See Try 3 below, which is very similar to your initial try. Your example was not clear so I'm not sure why it didn't work. Also it was a few months ago when you posted the question, so maybe data.table was updated?

> unique(dt) # Try 1: wrong answer (missing V1=C and V2=D)
V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G


> dt[!duplicated(dt)] # Try 2: wrong answer (missing V1=C and V2=D)
V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G


> unique(dt[,list(V1,V2)]) # Try 3: correct answer; does not require modifying key
V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G


> setkey(dt,NULL)
> unique(dt) # Try 4: correct answer; requires key to be removed
V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G

小开

This should work for you

dt <- unique(dt, by = c('V1', 'V2'))

小开

Keeping data.table notation you can use:

unique(df[, .(V1, V2, V3), nomatch=0 ])

As here https://stackoverflow.com/a/31875208/10087503

I haven't compared speed of this vs Magma's version.