Gather multiple sets of columns

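I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns; that is, Q3.2 in the survey will have columns Q3.2.1., Q3.2.2., and Q3.2.3.: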

df <- data.frame(
  id = 1:10,
  time = as.Date('2009-01-01') + 0:9,
  Q3.2.1. = rnorm(10, 0, 1),
  Q3.2.2. = rnorm(10, 0, 1),
  Q3.2.3. = rnorm(10, 0, 1),
  Q3.3.1. = rnorm(10, 0, 1),
  Q3.3.2. = rnorm(10, 0, 1),
  Q3.3.3. = rnorm(10, 0, 1)
)


# Sample data


id       time    Q3.2.1.     Q3.2.2.    Q3.2.3.     Q3.3.1.    Q3.3.2.     Q3.3.3.
1   1 2009-01-01 -0.2059165 -0.29177677 -0.7107192  1.52718069 -0.4484351 -1.21550600
2   2 2009-01-02 -0.1981136 -1.19813815  1.1750200 -0.40380049 -1.8376094  1.03588482
3   3 2009-01-03  0.3514795 -0.27425539  1.1171712 -1.02641801 -2.0646661 -0.35353058
...

I want to combine all the QN.N* columns into tidy individual QN.N columns, ultimately ending up with something like this:

   id       time loop_number        Q3.2        Q3.3
1   1 2009-01-01           1 -0.20591649  1.52718069
2   2 2009-01-02           1 -0.19811357 -0.40380049
3   3 2009-01-03           1  0.35147949 -1.02641801
...
11  1 2009-01-01           2 -0.29177677  -0.4484351
12  2 2009-01-02           2 -1.19813815  -1.8376094
13  3 2009-01-03           2 -0.27425539  -2.0646661
...
21  1 2009-01-01           3 -0.71071921 -1.21550600
22  2 2009-01-02           3  1.17501999  1.03588482
23  3 2009-01-03           3  1.11717121 -0.35353058
...

The tidyr library has the gather() function, which works well for combining one set of columns:

library(dplyr)
library(tidyr)
library(stringr)


df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>%
  mutate(loop_number = str_sub(loop_number,-2,-2)) %>%
  select(id, time, loop_number, Q3.2)




id       time loop_number        Q3.2
1   1 2009-01-01           1 -0.20591649
2   2 2009-01-02           1 -0.19811357
3   3 2009-01-03           1  0.35147949
...
29  9 2009-01-09           3 -0.58581232
30 10 2009-01-10           3 -2.33393981

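The resulting data frame has 30 rows, as expected (10 individuals, 3 loops each). However, gathering a second set of columns does not work correctly: it successfully creates the two combined columns Q3.2 and Q3.3, but ends up with 90 rows instead of 30 (all combinations of 10 individuals, 3 loops of Q3.2, and 3 loops of Q3.3; in the actual data the combinations grow substantially with each additional group of columns):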

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>%
  gather(loop_number, Q3.3, starts_with("Q3.3")) %>%
  mutate(loop_number = str_sub(loop_number,-2,-2))




id       time loop_number        Q3.2        Q3.3
1   1 2009-01-01           1 -0.20591649  1.52718069
2   2 2009-01-02           1 -0.19811357 -0.40380049
3   3 2009-01-03           1  0.35147949 -1.02641801
...
89  9 2009-01-09           3 -0.58581232 -0.13187024
90 10 2009-01-10           3 -2.33393981 -0.48502131
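
One way to see where the 90 rows come from: after the first gather() each of the 30 rows still carries all three Q3.3 columns, so the second gather() stacks them and every Q3.2 loop ends up paired with every Q3.3 loop. The base R sketch below only counts those combinations; the names `q32_loop` and `q33_loop` are made up for illustration and do not appear in the real data.

```r
# Every id paired with every Q3.2 loop and every Q3.3 loop, which is
# what two independent gather() calls effectively produce.
combos <- expand.grid(id = 1:10, q32_loop = 1:3, q33_loop = 1:3)
n_all <- nrow(combos)        # 10 * 3 * 3 combinations

# Only the rows where both loop numbers agree are actually wanted.
matched <- combos[combos$q32_loop == combos$q33_loop, ]
n_matched <- nrow(matched)   # 10 * 3 combinations

c(all = n_all, matched = n_matched)
```

So the desired result is the subset in which the loop number of the Q3.2 value matches the loop number of the Q3.3 value.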

Is there a way to use multiple calls to gather() like this, combining small subsets of columns while maintaining the correct number of rows?


This can be done with reshape, but it is also possible with dplyr.

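# Rename e.g. "Q3.2.1." to "Q3.2_1." so that "_" separates the stub from the loop number
colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))
# reshape() names its loop index column "time" by default, so rename the existing "time" column first
colnames(df)[2] <- "Date"
res <- reshape(df, idvar=c("id", "Date"), varying=3:8, direction="long", sep="_")
row.names(res) <- 1:nrow(res)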
  

head(res)
#  id       Date time       Q3.2       Q3.3
#1  1 2009-01-01    1  1.3709584  0.4554501
#2  2 2009-01-02    1 -0.5646982  0.7048373
#3  3 2009-01-03    1  0.3631284  1.0351035
#4  4 2009-01-04    1  0.6328626 -0.6089264
#5  5 2009-01-05    1  0.4042683  0.5049551
#6  6 2009-01-06    1 -0.1061245 -1.7170087

Or using dplyr:

library(tidyr)
library(dplyr)
colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))


df %>%
  gather(loop_number, "Q3", starts_with("Q3")) %>%
  separate(loop_number,c("L1", "L2"), sep="_") %>%
  spread(L1, Q3) %>%
  select(-L2) %>%
  head()
#  id       time       Q3.2       Q3.3
#1  1 2009-01-01  1.3709584  0.4554501
#2  1 2009-01-01  1.3048697  0.2059986
#3  1 2009-01-01 -0.3066386  0.3219253
#4  2 2009-01-02 -0.5646982  0.7048373
#5  2 2009-01-02  2.2866454 -0.3610573
#6  2 2009-01-02 -1.7813084 -0.7838389

Update

With newer versions of tidyr, we can use pivot_longer to reshape multiple columns. (This uses the column names as changed by the gsub call above.)

library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with("Q3"),
               names_to = c(".value", "Q3"), names_sep = "_") %>%
  select(-Q3)
# A tibble: 30 x 4
#      id time         Q3.2    Q3.3
#   <int> <date>      <dbl>   <dbl>
# 1     1 2009-01-01  0.974  1.47
# 2     1 2009-01-01 -0.849 -0.513
# 3     1 2009-01-01  0.894  0.0442
# 4     2 2009-01-02  2.04  -0.553
# 5     2 2009-01-02  0.694  0.0972
# 6     2 2009-01-02 -1.11   1.85
# 7     3 2009-01-03  0.413  0.733
# 8     3 2009-01-03 -0.896 -0.271
# 9     3 2009-01-03  0.509 -0.0512
#10     4 2009-01-04  1.81   0.668
# … with 20 more rows

Note: the values differ because no seed was set when creating the input dataset.
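
If the loop number should be kept, as in the output the question asks for, a variant along the following lines may work. This is a sketch rather than part of the answer above: it assumes tidyr 1.0.0 or later, uses the original column names (no gsub step), and relies on a `names_pattern` regex written for names shaped like `Q3.2.1.`. The seed value is arbitrary and only there to make the sample data reproducible.

```r
library(tidyr)

set.seed(42)  # arbitrary seed, only so the sample data is reproducible
df <- data.frame(
  id = 1:10,
  time = as.Date("2009-01-01") + 0:9,
  Q3.2.1. = rnorm(10), Q3.2.2. = rnorm(10), Q3.2.3. = rnorm(10),
  Q3.3.1. = rnorm(10), Q3.3.2. = rnorm(10), Q3.3.3. = rnorm(10)
)

# ".value" turns the first capture group ("Q3.2", "Q3.3") into output
# columns; the second capture group becomes the loop_number column.
long <- pivot_longer(
  df,
  cols = starts_with("Q3"),
  names_to = c(".value", "loop_number"),
  names_pattern = "^(Q\\d+\\.\\d+)\\.(\\d+)\\.$"
)

# The captured loop number is a character string; convert it to integer.
long$loop_number <- as.integer(long$loop_number)

head(long)
```

Because the pattern uses `\\d+`, it should also cope with question or loop numbers of more than one digit, which the fixed-width `.{2}` in the gsub call does not.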

This is not related to "tidyr" or "dplyr" at all, but here is another option to consider: merged.stack from my "splitstackshape" package, V1.4.0 and above.

library(splitstackshape)
merged.stack(df, id.vars = c("id", "time"),
             var.stubs = c("Q3.2.", "Q3.3."),
             sep = "var.stubs")
#     id       time .time_1       Q3.2.       Q3.3.
#  1:  1 2009-01-01      1. -0.62645381  1.35867955
#  2:  1 2009-01-01      2.  1.51178117 -0.16452360
#  3:  1 2009-01-01      3.  0.91897737  0.39810588
#  4:  2 2009-01-02      1.  0.18364332 -0.10278773
#  5:  2 2009-01-02      2.  0.38984324 -0.25336168
#  6:  2 2009-01-02      3.  0.78213630 -0.61202639
#  7:  3 2009-01-03      1. -0.83562861  0.38767161
# <<:::SNIP:::>>
# 24:  8 2009-01-08      3. -1.47075238 -1.04413463
# 25:  9 2009-01-09      1.  0.57578135  1.10002537
# 26:  9 2009-01-09      2.  0.82122120 -0.11234621
# 27:  9 2009-01-09      3. -0.47815006  0.56971963
# 28: 10 2009-01-10      1. -0.30538839  0.76317575
# 29: 10 2009-01-10      2.  0.59390132  0.88110773
# 30: 10 2009-01-10      3.  0.41794156 -0.13505460
#     id       time .time_1       Q3.2.       Q3.3.

This approach seems pretty natural to me:

df %>%
  gather(key, value, -id, -time) %>%
  extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%
  spread(question, value)

First gather all the question columns, use extract() to separate the key into question and loop_number, then spread() question back into columns.

#>    id       time loop_number         Q3.2        Q3.3
#> 1   1 2009-01-01           1  0.142259203 -0.35842736
#> 2   1 2009-01-01           2  0.061034802  0.79354061
#> 3   1 2009-01-01           3 -0.525686204 -0.67456611
#> 4   2 2009-01-02           1 -1.044461185 -1.19662936
#> 5   2 2009-01-02           2  0.393808163  0.42384717
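
To make the extract() step more concrete, the sketch below applies the same regular expression to a few key values on their own. The small `keys` data frame is made up for illustration; only the column names and the pattern come from the code above.

```r
library(tidyr)

# A few of the key values that gather() would produce
keys <- data.frame(key = c("Q3.2.1.", "Q3.2.3.", "Q3.3.2."),
                   stringsAsFactors = FALSE)

# Group 1 "(Q.\\..)" captures "Q", one character, a literal dot, one character.
# "\\." then skips the next dot, and group 2 "(.)" captures the loop number.
parts <- tidyr::extract(keys, key, c("question", "loop_number"),
                        "(Q.\\..)\\.(.)")
parts
```

Each capture group becomes one of the new columns, so "Q3.2.1." is split into question "Q3.2" and loop_number "1". Both columns come back as character unless `convert = TRUE` is passed.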

With the recent update to melt.data.table, we can now melt multiple columns:

require(data.table) ## 1.9.5
melt(setDT(df), id=1:2, measure=patterns("^Q3.2", "^Q3.3"),
     value.name=c("Q3.2", "Q3.3"), variable.name="loop_number")
#    id       time loop_number         Q3.2        Q3.3
# 1:  1 2009-01-01           1 -0.433978480  0.41227209
# 2:  2 2009-01-02           1 -0.567995351  0.30701144
# 3:  3 2009-01-03           1 -0.092041353 -0.96024077
# 4:  4 2009-01-04           1  1.137433487  0.60603396
# 5:  5 2009-01-05           1 -1.071498263 -0.01655584
# 6:  6 2009-01-06           1 -0.048376809  0.55889996
# 7:  7 2009-01-07           1 -0.007312176  0.69872938

This requires the development version of data.table (1.9.5 at the time of writing).

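If you are like me and cannot work out how to use a "regular expression with capturing groups" for extract, the following code replicates the extract(...) line in Hadley's answer: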

df %>%
  gather(question_number, value, starts_with("Q3.")) %>%
  mutate(loop_number = str_sub(question_number,-2,-2), question_number = str_sub(question_number,1,4)) %>%
  select(id, time, loop_number, question_number, value) %>%
  spread(key = question_number, value = value)

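The problem here is that the initial gather forms a key column that is actually a combination of two keys. In my original solution in the comments, I chose to use mutate to split this column into two columns carrying the equivalent information: a loop_number column and a question_number column. spread can then be used to transform the long-form data, which are key-value pairs (question_number, value), into wide-form data.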
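
One caveat with splitting the key by character position is that it assumes every question and loop number is a single digit. The sketch below compares the positional split with a regex-based split using base R's sub(). The second key, "Q3.10.12.", is a hypothetical column name that does not occur in the sample data, and the pattern is an assumption about how such names would look.

```r
library(stringr)

key <- c("Q3.2.1.", "Q3.10.12.")  # the second name is hypothetical

# Positional split, as in the mutate() call above
pos_question <- str_sub(key, 1, 4)    # first four characters
pos_loop     <- str_sub(key, -2, -2)  # second-to-last character

# Regex-based split that tolerates multi-digit numbers
pattern     <- "^(Q\\d+\\.\\d+)\\.(\\d+)\\.$"
rx_question <- sub(pattern, "\\1", key)
rx_loop     <- sub(pattern, "\\2", key)

data.frame(key, pos_question, pos_loop, rx_question, rx_loop)
```

For "Q3.2.1." both approaches agree, while for the hypothetical "Q3.10.12." the positional split yields "Q3.1" and "2" and the regex yields "Q3.10" and "12".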