Split a large data frame into a list of data frames based on a common value in a column

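I have a data frame with 10 columns that records the actions of "users"; one of the columns (column 10) contains an ID (not unique, identifying the user). The data frame is about 750000 rows long. I am trying to extract individual data frames (so ending up with a list or vector of data frames), split by the column containing the "user" identifier, to isolate the actions of a single actor.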

ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
4  | aad   | bb4   | ... | u_002

resulting in

list(
ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
,
4  | aad   | bb4   | ... | u_002
...)

For a small sample (1000 rows), the following works very well:

paths = by(smallsampleMat, smallsampleMat[,"userID"], function(x) x)

and then accessing the element I want via paths[1], for instance.

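When applied to the original large data frame, or even to a matrix representation, this chokes my machine (4GB RAM, MacOSX 10.6, R 2.15) and never completes (I know a newer R version exists, but I believe that is not the main problem).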

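Alternatively, split seems more efficient and does complete after a long time, but I do not know (inferior R knowledge) how to piece the resulting list of vectors into a vector of matrices.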

path = split(smallsampleMat, smallsampleMat[,10])

I have also considered using big.matrix etc. to speed up the process, but without much success.


You can just as easily access each element in the list using e.g. path[[1]]. You can't put a set of matrices into an atomic vector and access each element. A matrix is an atomic vector with dimension attributes. I would use the list structure returned by split, it's what it was designed for. Each list element can hold data of different types and sizes so it's very versatile and you can use *apply functions to further operate on each element in the list. Example below.

#  For reproducible data
set.seed(1)


#  Make some data
userid <- rep(1:2,times=4)
data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
data2 <- sample(10,8)
df <- data.frame( userid , data1 , data2 )


#  Split on userid
out <- split( df , f = df$userid )
#$`1`
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5


#$`2`
#  userid data1 data2
#2      2   xfv     4
#4      2   bfe    10
#6      2   mrx     2
#8      2   fqd     9

Access each element using the [[ operator like this:

out[[1]]
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5
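The difference between single and double brackets matters here, since the question used paths[1]. A minimal sketch, assuming a small stand-in data frame whose column names mirror the example above (the data values are made up for illustration):

```r
# Small stand-in data; column names mirror the example above
df <- data.frame(userid = rep(1:2, times = 4), data2 = 1:8)
out <- split(df, f = df$userid)

names(out)        # split() names each piece after its group value
class(out[1])     # single brackets keep the list wrapper
class(out[[1]])   # double brackets extract the data frame itself

# Looking a piece up by its name returns the same data frame as by position
identical(out[[1]], out[["1"]])
```

Looking pieces up by name is usually safer than by position when the IDs are strings such as u_001, because the position depends on how the group values sort.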

Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

sapply( out , function(x) mean( x$data2 ) )
#   1    2
#3.75 6.25
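For the matrix representation mentioned in the question, split treats the matrix as a plain vector, which is probably why a list of vectors came back. One possible workaround, sketched here on a tiny hypothetical character matrix (the names m, row_groups and pieces are illustrative only), is to split the row indices and subset the matrix with them:

```r
# Hypothetical character matrix standing in for the question's matrix form
m <- cbind(Data1  = c("aaa", "aab", "aac", "aad"),
           UserID = c("u_001", "u_001", "u_001", "u_002"))

# Split the row numbers by user, then pull those rows out of the matrix.
# drop = FALSE keeps a one-row group as a matrix instead of a vector.
row_groups <- split(seq_len(nrow(m)), m[, "UserID"])
pieces <- lapply(row_groups, function(i) m[i, , drop = FALSE])

dim(pieces[["u_001"]])
dim(pieces[["u_002"]])
```

The result is still a list, with one matrix per user, so it can be used with the same [[ and *apply idioms as the data frame version.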

Stumbled across this answer and I actually wanted BOTH groups (data containing that one user and data containing everything but that one user). Not necessary for the specifics of this post, but I thought I would add in case someone was googling the same issue as me.

df <- data.frame(
ran_data1=rnorm(125),
ran_data2=rnorm(125),
g=rep(factor(LETTERS[1:5]), 25)
)


test_x = split(df,df$g)[['A']]
test_y = split(df,df$g!='A')[['TRUE']]

Here's what it looks like:

> head(test_x)
    ran_data1  ran_data2 g
1   1.1362198  1.2969541 A
6   0.5510307 -0.2512449 A
11  0.0321679  0.2358821 A
16  0.4734277 -1.2889081 A
21 -1.2686151  0.2524744 A


> head(test_y)
    ran_data1  ran_data2 g
2 -2.23477293  1.1514810 B
3 -0.46958938 -1.7434205 C
4  0.07365603  0.1111419 D
5 -1.08758355  0.4727281 E
7  0.28448637 -1.5124336 B
8  1.24117504  0.4928257 C
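Both groups can also be taken from a single split call, since splitting on a logical vector yields pieces named "FALSE" and "TRUE". A minimal sketch on the same kind of sample data (the seed and the names both, only_a and all_but_a are arbitrary choices for illustration):

```r
set.seed(42)  # arbitrary seed, only so the sample data is repeatable
df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# One split on a logical vector gives two pieces named "FALSE" and "TRUE"
both <- split(df, df$g == "A")
only_a    <- both[["TRUE"]]   # rows where g is "A"
all_but_a <- both[["FALSE"]]  # every other row

nrow(only_a)
nrow(all_but_a)
```

This avoids walking the data twice, which may matter on a data frame as large as the one in the question.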

From version 0.8.0, dplyr offers a handy function called group_split():

# On sample data from @Aus_10
library(dplyr)

df %>%
  group_split(g)


[[1]]
# A tibble: 25 x 3
ran_data1 ran_data2 g
<dbl>     <dbl> <fct>
1     2.04      0.627 A
2     0.530    -0.703 A
3    -0.475     0.541 A
4     1.20     -0.565 A
5    -0.380    -0.126 A
6     1.25     -1.69  A
7    -0.153    -1.02  A
8     1.52     -0.520 A
9     0.905    -0.976 A
10     0.517    -0.535 A
# … with 15 more rows


[[2]]
# A tibble: 25 x 3
ran_data1 ran_data2 g
<dbl>     <dbl> <fct>
1     1.61      0.858 B
2     1.05     -1.25  B
3    -0.440    -0.506 B
4    -1.17      1.81  B
5     1.47     -1.60  B
6    -0.682    -0.726 B
7    -2.21      0.282 B
8    -0.499     0.591 B
9     0.711    -1.21  B
10     0.705     0.960 B
# … with 15 more rows

To exclude the grouping column (the argument was named keep in dplyr 0.8.x; from dplyr 1.0.0 it is .keep and keep is deprecated):

df %>%
  group_split(g, .keep = FALSE)
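Note that group_split() returns an unnamed list. If the pieces should stay addressable by group value, a base R sketch along these lines may be an option; it drops the grouping column from the data but still splits by it (value_cols and pieces are illustrative names):

```r
df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# Keep every column except the grouping one, then split by the grouping values
value_cols <- setdiff(names(df), "g")
pieces <- split(df[value_cols], df$g)

names(pieces)          # one list element per level of g
names(pieces[["A"]])   # the grouping column is no longer in each piece
```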