重复数据行,帧 N 次

我有以下数据框:

data.frame(a = c(1,2,3),b = c(1,2,3))
a b
1 1 1
2 2 2
3 3 3

我想重复这些行 N次。例如,这里这些行重复3次:

  a b
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
7 1 1
8 2 2
9 3 3

在 R 中有一个简单的函数可以做到这一点吗? 谢谢!

96642 次浏览

EDIT: updated to a better modern R answer.

You can use replicate(), then rbind the result back together. The rownames are automatically altered to run from 1:nrows.

d <- data.frame(a = c(1,2,3),b = c(1,2,3))
n <- 3
do.call("rbind", replicate(n, d, simplify = FALSE))

A more traditional way is to use indexing, but here the rowname altering is not quite so neat (but more informative):

 d[rep(seq_len(nrow(d)), n), ]

Here are improvements on the above, the first two using purrr functional programming, idiomatic purrr:

purrr::map_dfr(seq_len(3), ~d)

and less idiomatic purrr (identical result, though more awkward):

purrr::map_dfr(seq_len(3), function(x) d)

and finally via indexing rather than list apply using dplyr:

d %>% slice(rep(row_number(), 3))
d <- data.frame(a = c(1,2,3),b = c(1,2,3))
r <- Reduce(rbind, list(d)[rep(1L, times=3L)])

For data.frame objects, this solution is several times faster than @mdsummer's and @wojciech-sobala's.

d[rep(seq_len(nrow(d)), n), ]

For data.table objects, @mdsummer's is a bit faster than applying the above after converting to data.frame. For large n this might flip. microbenchmark.

Full code:

packages <- c("data.table", "ggplot2", "RUnit", "microbenchmark")
lapply(packages, require, character.only=T)


Repeat1 <- function(d, n) {
return(do.call("rbind", replicate(n, d, simplify = FALSE)))
}


Repeat2 <- function(d, n) {
return(Reduce(rbind, list(d)[rep(1L, times=n)]))
}


Repeat3 <- function(d, n) {
if ("data.table" %in% class(d)) return(d[rep(seq_len(nrow(d)), n)])
return(d[rep(seq_len(nrow(d)), n), ])
}


Repeat3.dt.convert <- function(d, n) {
if ("data.table" %in% class(d)) d <- as.data.frame(d)
return(d[rep(seq_len(nrow(d)), n), ])
}


# Try with data.frames
mtcars1 <- Repeat1(mtcars, 3)
mtcars2 <- Repeat2(mtcars, 3)
mtcars3 <- Repeat3(mtcars, 3)


checkEquals(mtcars1, mtcars2)
#  Only difference is row.names having ".k" suffix instead of "k" from 1 & 2
checkEquals(mtcars1, mtcars3)


# Works with data.tables too
mtcars.dt <- data.table(mtcars)
mtcars.dt1 <- Repeat1(mtcars.dt, 3)
mtcars.dt2 <- Repeat2(mtcars.dt, 3)
mtcars.dt3 <- Repeat3(mtcars.dt, 3)


# No row.names mismatch since data.tables don't have row.names
checkEquals(mtcars.dt1, mtcars.dt2)
checkEquals(mtcars.dt1, mtcars.dt3)


# Time test
res <- microbenchmark(Repeat1(mtcars, 10),
Repeat2(mtcars, 10),
Repeat3(mtcars, 10),
Repeat1(mtcars.dt, 10),
Repeat2(mtcars.dt, 10),
Repeat3(mtcars.dt, 10),
Repeat3.dt.convert(mtcars.dt, 10))
print(res)
ggsave("repeat_microbenchmark.png", autoplot(res))

Just use simple indexing with repeat function.

mydata<-data.frame(a = c(1,2,3),b = c(1,2,3)) #creating your data frame
n<-10           #defining no. of time you want repetition of the rows of your dataframe


mydata<-mydata[rep(rownames(mydata),n),] #use rep function while doing indexing
rownames(mydata)<-1:NROW(mydata)    #rename rows just to get cleaner look of data

The package dplyr contains the function bind_rows() that directly combines all data frames in a list, such that there is no need to use do.call() together with rbind():

df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3))
library(dplyr)
bind_rows(replicate(3, df, simplify = FALSE))

For a large number of repetions bind_rows() is also much faster than rbind():

library(microbenchmark)
microbenchmark(rbind = do.call("rbind", replicate(1000, df, simplify = FALSE)),
bind_rows = bind_rows(replicate(1000, df, simplify = FALSE)),
times = 20)
## Unit: milliseconds
##       expr       min        lq      mean   median        uq       max neval cld
##      rbind 31.796100 33.017077 35.436753 34.32861 36.773017 43.556112    20   b
##  bind_rows  1.765956  1.818087  1.881697  1.86207  1.898839  2.321621    20  a

Even simpler:

library(data.table)
my_data <- data.frame(a = c(1,2,3),b = c(1,2,3))
rbindlist(replicate(n = 3, expr = my_data, simplify = FALSE)

With the -package, you could use the special symbol .I together with rep:

df <- data.frame(a = c(1,2,3), b = c(1,2,3))
dt <- as.data.table(df)


n <- 3


dt[rep(dt[, .I], n)]

which gives:

   a b
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
7: 1 1
8: 2 2
9: 3 3

For time execution purposes, i would like to suggest a comparison of different way of rbind:

> mydata <- data.frame(a=1:200,b=201:400,c=301:500)
> microbenchmark(rbind = do.call("rbind",replicate(n=100,mydata,simplify = FALSE)),
+                bind_rows = bind_rows(replicate(n=100,mydata,simplify = FALSE)),
+                rbindlist = rbindlist(replicate(n=100,exp= mydata,simplify = FALSE)),
+                times= 2000)
Unit: microseconds
expr    min      lq      mean  median      uq      max neval
rbind 5760.7 6723.10 8642.6930 7132.30 7761.05 240720.3  2000
bind_rows  976.4 1186.90 1430.7741 1308.85 1469.80  15817.9  2000
rbindlist  263.6  347.85  465.5894  392.90  459.95  10974.2  2000

You can use tidyr::uncount:

data.frame(a = c(1,2,3),b = c(1,2,3)) %>%
tidyr::uncount(3)
  a b
1 1 1
2 1 1
3 1 1
4 2 2
5 2 2
6 2 2
7 3 3
8 3 3
9 3 3