Repeat rows of a data.frame


Best answer

df <- data.frame(a = 1:2, b = letters[1:2])
df[rep(seq_len(nrow(df)), each = 2), ]
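
One side effect worth knowing: subsetting with repeated row numbers leaves de-duplicated row names such as 1, 1.1, 2, 2.1. A minimal sketch, assuming you would rather have plain sequential row names afterwards (the name `out` is only illustrative):

```r
df <- data.frame(a = 1:2, b = letters[1:2])

# Repeat each row twice by indexing with 1, 1, 2, 2
out <- df[rep(seq_len(nrow(df)), each = 2), ]
rownames(out)   # "1" "1.1" "2" "2.1"

# Reset to the default sequential row names
rownames(out) <- NULL
rownames(out)   # "1" "2" "3" "4"
```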


Try using, for example,

N <- 2
rep(1:4, each = N)

as the row index.


If you can repeat the whole thing, or subset it first and then repeat that, then this similar question may be helpful. Once again:

library(mefa)
rep(mtcars,10)

or simply

mefa:::rep.data.frame(mtcars)


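The rep.row function sometimes seems to create lists for columns, which leads to memory problems. I have written the following, which seems to work well: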

library(plyr)
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}


Adding to what @dardisco mentioned about mefa::rep.data.frame(): it is very flexible.

You can either repeat each row N times:

rep(df, each=N)

or repeat the entire data frame N times (think: like when you recycle a vectorized argument):

rep(df, times=N)

Two thumbs up for mefa! I had never heard of it until now, and I had to write manual code to do this.
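
If installing mefa is not an option, the same each/times distinction can be reproduced with a row index in base R. A rough sketch; `rep_rows` is a hypothetical helper written for this illustration, not something mefa provides:

```r
# Hypothetical helper (not part of mefa): repeat rows of a data frame
# by building a row index with rep() and subsetting with it.
rep_rows <- function(df, times = 1, each = 1) {
  idx <- rep(seq_len(nrow(df)), times = times, each = each)
  out <- df[idx, , drop = FALSE]
  rownames(out) <- NULL  # drop the "1.1"-style row names
  out
}

df <- data.frame(a = 1:2, b = letters[1:2])
rep_rows(df, each = 2)$a   # 1 1 2 2
rep_rows(df, times = 2)$a  # 1 2 1 2
```

With each, copies of the same row stay adjacent; with times, the whole data frame is stacked end to end.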


Another way to do this is to first record the row indices, append an extra copy of the df, and then order by the indices:

df$index = 1:nrow(df)
df = rbind(df,df)
df = df[order(df$index),][,-ncol(df)]

Although the other solutions may be shorter, this method may be more advantageous in certain situations.
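
The snippet above doubles the rows. A sketch of the same idea extended to N copies, assuming base R only; `N`, `stacked` and `out` are names chosen for this illustration:

```r
df <- data.frame(a = 1:2, b = letters[1:2])
N <- 3  # illustrative number of copies

# Tag each row with its original position
df$index <- seq_len(nrow(df))

# Stack N copies of the data frame on top of each other
stacked <- do.call(rbind, replicate(N, df, simplify = FALSE))

# Sort so copies of the same row sit together, then drop the helper column
out <- stacked[order(stacked$index), names(stacked) != "index", drop = FALSE]
rownames(out) <- NULL
out$a  # 1 1 1 2 2 2
```

Selecting columns by name instead of by position avoids depending on the helper column being the last one.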


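For reference, and adding to the answers that cite mefa, it may be worth looking at the implementation of mefa::rep.data.frame() in case you do not want to include the whole package: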

> data <- data.frame(a=letters[1:3], b=letters[4:6])
> data
  a b
1 a d
2 b e
3 c f
> as.data.frame(lapply(data, rep, 2))
  a b
1 a d
2 b e
3 c f
4 a d
5 b e
6 c f


My solution is similar to mefa:::rep.data.frame, but a little faster, and it takes care of row names:

rep.data.frame <- function(x, times) {
  rnames <- attr(x, "row.names")
  x <- lapply(x, rep.int, times = times)
  class(x) <- "data.frame"
  if (!is.numeric(rnames))
    attr(x, "row.names") <- make.unique(rep.int(rnames, times))
  else
    attr(x, "row.names") <- .set_row_names(length(rnames) * times)
  x
}

Comparing the solutions:

library(Lahman)
library(microbenchmark)
microbenchmark(
  mefa:::rep.data.frame(Batting, 10),
  rep.data.frame(Batting, 10),
  Batting[rep.int(seq_len(nrow(Batting)), 10), ],
  times = 10
)
#> Unit: milliseconds
#>                                            expr       min       lq     mean   median        uq       max neval cld
#>              mefa:::rep.data.frame(Batting, 10) 127.77786 135.3480 198.0240 148.1749  278.1066  356.3210    10  a
#>                     rep.data.frame(Batting, 10)  79.70335  82.8165 134.0974  87.2587  191.1713  307.4567    10  a
#>  Batting[rep.int(seq_len(nrow(Batting)), 10), ] 895.73750 922.7059 981.8891 956.3463 1018.2411 1127.3927    10   b
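
Timings only compare fairly if the approaches return the same data. A quick sanity check on a small data frame, sketched with base R only (`by_column` and `by_index` are illustrative names, and the column-wise line stands in for the functions benchmarked above):

```r
df <- data.frame(a = 1:3, b = letters[1:3])

# Column-wise approach: repeat every column, then rebuild the data frame
by_column <- as.data.frame(lapply(df, rep.int, times = 2))

# Index approach: repeat the row numbers and subset
by_index <- df[rep.int(seq_len(nrow(df)), 2), ]
rownames(by_index) <- NULL  # the index approach leaves "1.1"-style names

# TRUE when both approaches produce the same columns and values
isTRUE(all.equal(by_column, by_index))
```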


A clean dplyr solution, taken from here:

library(dplyr)
df <- tibble(x = 1:2, y = c("a", "b"))
df %>% slice(rep(1:n(), each = 2))


There is a lovely vectorized solution that repeats each row its own number of times, for example by adding an ntimes column to your data frame:

  A B   C ntimes
1 j i 100      2
2 K P 101      4
3 Z Z 102      1

Method:

df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2,4,1))
df <- as.data.frame(lapply(df, rep, df$ntimes))

Result:

  A B   C ntimes
1 j i 100      2
2 j i 100      2
3 K P 101      4
4 K P 101      4
5 K P 101      4
6 K P 101      4
7 Z Z 102      1

This is very similar to Josh O'Brien and Mark Miller's method:

df[rep(seq_len(nrow(df)), df$ntimes),]

However, that method appears to be considerably slower:

df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2000,3000,4000))

microbenchmark::microbenchmark(
  df[rep(seq_len(nrow(df)), df$ntimes),],
  as.data.frame(lapply(df, rep, df$ntimes)),
  times = 10
)

Result:

Unit: microseconds
                                      expr      min       lq      mean   median       uq      max neval
   df[rep(seq_len(nrow(df)), df$ntimes), ] 3563.113 3586.873 3683.7790 3613.702 3657.063 4326.757    10
 as.data.frame(lapply(df, rep, df$ntimes))  625.552  654.638  676.4067  668.094  681.929  799.893    10
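
Once the rows are expanded, the ntimes column is usually no longer needed. A small sketch, assuming you want to drop it in the same step; it reuses the column-wise approach, and `keep` and `out` are illustrative names:

```r
df <- data.frame(A = c("j", "K", "Z"), B = c("i", "P", "Z"),
                 C = c(100, 101, 102), ntimes = c(2, 4, 1))

# Repeat row i exactly ntimes[i] times, keeping every column except ntimes
keep <- setdiff(names(df), "ntimes")
out <- as.data.frame(lapply(df[keep], rep, times = df$ntimes))

nrow(out)  # 7, the sum of ntimes
out$C      # 100 100 101 101 101 101 102
```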