When should I use the := operator in data.table?

data.table objects now have a := operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?


Here is an example showing 10 minutes reduced to 1 second (from NEWS on homepage). It's like subassigning to a data.frame but doesn't copy the entire table each time.

library(data.table)

m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)


system.time(for (i in 1:1000) DF[i,1] <- i)
user  system elapsed
287.062 302.627 591.984


system.time(for (i in 1:1000) DT[i,V1:=i])
user  system elapsed
1.148   0.000   1.158     # about 511 times faster than the data.frame loop above
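
To make "doesn't copy the entire table" concrete, here is a minimal sketch on a toy table. The names alias and snapshot are made up for illustration. It assumes only that data.table is installed; copy() is data.table's own function for taking an independent copy.

```r
library(data.table)

# Toy table; column names are illustrative only
DT <- data.table(id = 1:5, v = c(10, 20, 30, 40, 50))

alias    <- DT          # a second name for the same table, not a copy
snapshot <- copy(DT)    # an explicit, independent copy

DT[2, v := 99]          # sub-assign one cell by reference

alias$v[2]              # the change is visible through the other name
snapshot$v[2]           # the explicit copy still holds the old value
```

Because nothing is copied, every name bound to the table sees the update. If the old values are still needed, take a copy() first.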

Putting := in j, as in DT[i,V1:=i], allows more idioms:

DT["a",done:=TRUE]   # binary search for group 'a' and set a flag (DT must be keyed, e.g. via setkey)
DT[,newcol:=42]      # add a new column by reference (no copy of existing data)
DT[,col:=NULL]       # remove a column by reference

and:

DT[,newcol:=sum(v),by=group]  # like a fast transform() by group
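
The one-liners above can be tried end to end on a small table. This is only a sketch: the data and the column names group, v, done and total are hypothetical, and the key is set with setkey() so that the character lookup in i can use binary search.

```r
library(data.table)

# Hypothetical toy data
DT <- data.table(group = c("a", "b", "a", "c", "b"), v = 1:5)
setkey(DT, group)                  # sorts by group and enables binary search in i

DT["a", done := TRUE]              # flag rows in group 'a'; other rows get NA
DT[, newcol := 42]                 # add a column by reference
DT[, total := sum(v), by = group]  # per-group sum, repeated on every row of the group
DT[, newcol := NULL]               # remove the column again, also by reference

print(DT)
```

Each call updates DT in place, so there is no need to assign the result back with <-.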

I can't think of any reason to avoid :=, other than inside a for loop. Since := appears inside DT[...], it comes with the small overhead of the [.data.table method, e.g. S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch etc. So for use inside for loops there is a low-overhead, direct version of := called set. See ?":=" for more details and examples. The disadvantages of set are that i must be row numbers (no binary search) and you can't combine it with by. By imposing those restrictions, set can reduce the overhead dramatically.

system.time(for (i in 1:1000) set(DT,i,"V1",i))
user  system elapsed
0.016   0.000   0.018
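
As a rough sketch of how the two forms relate, the loop below fills the same column once with := and once with set() and checks that the results match. The table is a small made-up one (DT1 and DT2 are illustrative names), so it shows equivalence rather than timings.

```r
library(data.table)

n   <- 1000L
DT1 <- data.table(V1 = rep(0L, n), V2 = rep(0L, n))
DT2 <- copy(DT1)                             # independent copy for the set() version

for (i in seq_len(n)) DT1[i, V1 := i]        # goes through [.data.table on every iteration
for (i in seq_len(n)) set(DT2, i, "V1", i)   # arguments: table, row number(s), column, value

identical(DT1$V1, DT2$V1)                    # both loops produce the same column

# set() also accepts column positions in j, which is convenient when looping over columns;
# leaving i unset applies the assignment to all rows
for (col in seq_along(DT2)) set(DT2, j = col, value = DT2[[col]] * 2L)
```

Note that set() takes row numbers only, so a keyed lookup such as DT["a", ...] or a by= grouping still has to go through DT[...].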