dplyr mutate/replace several columns on a subset of rows

I'm trying out a dplyr-based workflow (as opposed to mostly using data.table, which I'm used to), and I've run into a problem that I can't find an equivalent dplyr solution for. I commonly run into situations where I need to conditionally update/replace several columns based on a single condition. Here's some example code with my data.table solution:

library(data.table)


# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))


# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit',
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

Is there a simple dplyr solution to this problem? I'd like to avoid using ifelse, because I don't want to have to type the condition multiple times; this is a simplified example, but there are sometimes many assignments that depend on a single condition.

Thanks in advance for your help!


You can do this with magrittr's two-way pipe %<>%:

library(dplyr)
library(magrittr)


dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
cf = 0,
delta.watts = 13)

This reduces the amount of typing, but is still much slower than data.table.

As eipi10 shows above, there's not a simple way to do a subset replacement in dplyr, because data.table uses pass-by-reference semantics while dplyr uses pass-by-value. dplyr requires the use of ifelse() on the whole vector, whereas data.table will do the subset and update by reference (returning the whole data.table). So, for this exercise, data.table will be substantially faster.
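For reference, here is a minimal sketch (my own illustration, not part of the original answer) of that whole-vector ifelse()/if_else() route; note that the condition has to be repeated once per column, which is exactly the typing the question wants to avoid:

library(dplyr)

# hypothetical illustration: the condition is restated in every if_else() call
dt %>% mutate(qty.exit    = if_else(measure == 'exit', qty, qty.exit),
              cf          = if_else(measure == 'exit', 0, cf),
              delta.watts = if_else(measure == 'exit', 13, delta.watts))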

You could alternatively subset first, then update, and finally recombine:

dt.sub <- dt[dt$measure == "exit", ] %>%
  mutate(qty.exit = qty, cf = 0, delta.watts = 13)


dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])

But data.table is going to be substantially faster (edited to use eipi10's new answer):

library(data.table)
library(dplyr)
library(magrittr)
library(microbenchmark)

microbenchmark(dt = {dt <- dt[measure == 'exit',
                              `:=`(qty.exit = qty,
                                   cf = 0,
                                   delta.watts = 13)]},
               eipi10 = {dt[dt$measure == "exit", ] %<>% mutate(qty.exit = qty,
                                                                cf = 0,
                                                                delta.watts = 13)},
               alex = {dt.sub <- dt[dt$measure == "exit", ] %>%
                         mutate(qty.exit = qty, cf = 0, delta.watts = 13)
                       dt.new <- rbind(dt.sub, dt[dt$measure != "exit", ])})




Unit: microseconds
   expr      min        lq      mean   median       uq      max neval cld
     dt  591.480  672.2565  747.0771  743.341  780.973 1837.539   100 a
 eipi10 3481.212 3677.1685 4008.0314 3796.909 3936.796 6857.509   100  b
   alex 3412.029 3637.6350 3867.0649 3726.204 3936.985 5424.427   100  b

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:

1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}


DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)

1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.

mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}




DF %>%
  group_by(is.exit = measure == 'exit') %>%
  mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
  ungroup() %>%
  select(-is.exit)

2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)


DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
  select(-is.exit)

3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.

library(sqldf)


DF %>%
  do(sqldf(c("update '.'
              set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13
              where measure = 'exit'",
             "select * from '.'")))

4) row_case_when Also check out row_case_when, defined in Returning a tibble: how to vectorize with case_when?. It uses a syntax similar to case_when but applies to rows.

library(dplyr)


DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )

Note 1: We used this as DF

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.

Here's a solution I like:

mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}

It lets you write things like:

mtcars %>% mutate_when(
  mpg > 22,    list(cyl = 100),
  disp == 160, list(cyl = 200)
)

which is quite readable -- although it may not be as performant as it could be.

I just stumbled across this and really like mutate_cond() by @G. Grothendieck, but thought it might also come in handy to handle new variables. So the version below has two additions:

Unrelated: the second-to-last line is made a bit more dplyr-ish by using filter().

Three new lines at the beginning get the variable names for use in mutate() and initialize any new variables in the data frame before mutate() occurs. New variables are initialized for the remainder of the data.frame using new_init, which defaults to missing (NA).

mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}

Here are some examples using the iris data:

Change Petal.Length to 88 where Species == "setosa". This will work in the original function as well as this new version.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)

Same as above, but also create a new variable x (NA in rows not included in the condition). Not possible before.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)

Same as above, but rows not included in the condition for x are set to FALSE.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)

This example shows how new_init can be set to a list to initialize multiple new variables with different values. Here, two new variables are created, with excluded rows being initialized using different values (x initialized as FALSE, y as NA).

iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
x = TRUE, y = Sepal.Length ^ 2,
new_init = list(FALSE, NA))

At the expense of breaking with the usual dplyr syntax, you can use within from base:

dt %>% within({
  qty.exit[measure == 'exit'] <- qty[measure == 'exit']
  delta.watts[measure == 'exit'] <- 13
})

It seems to integrate well with the pipe, and you can do pretty much anything you want inside it.
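For instance, a small sketch (my own illustration, assuming the dt sample data from the question) of within() sitting in the middle of an ordinary dplyr pipeline:

library(dplyr)

dt %>%
  mutate(cf = round(cf, 2)) %>%              # a regular dplyr verb before...
  within({
    qty.exit[measure == 'exit']    <- qty[measure == 'exit']
    cf[measure == 'exit']          <- 0
    delta.watts[measure == 'exit'] <- 13
  }) %>%
  arrange(desc(delta.watts))                 # ...and another one after within()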

mutate_cond is a great function, but it gives an error if there is an NA in the column(s) used to create the condition. I feel that a conditional mutate should simply leave such rows alone. This matches the behavior of filter(), which returns rows when the condition is TRUE, but omits both rows with FALSE and NA.

With this small change the function works like a charm:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  condition[is.na(condition)] <- FALSE
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

With the creation of rlang, a slightly modified version of Grothendieck's 1a example is possible, eliminating the need for the envir argument, as enquo() captures the environment that .p is created in automatically.

mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)
  .p_lgl <- rlang::eval_tidy(.p, .data)
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}


dt %>% mutate_rows(measure == "exit", qty.exit = qty, cf = 0, delta.watts = 13)

I don't actually see any changes to dplyr that would make this much easier. case_when is great when there are multiple different conditions and outcomes for one column, but it doesn't help in this case, where you want to change multiple columns based on one condition. Similarly, recode saves typing if you are replacing multiple different values in one column, but doesn't help with doing so in multiple columns at once. Finally, mutate_at etc. only apply conditions to the column names, not to the rows in the data frame. You could potentially write a function for mutate_at that would do it, but I can't figure out how you would make it behave differently for different columns.
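To make that concrete, here is a hedged sketch (my own illustration, using the question's dt) of what a case_when-based version would look like; the condition still has to be restated once per column:

library(dplyr)

dt %>% mutate(qty.exit    = case_when(measure == "exit" ~ qty,
                                      TRUE ~ qty.exit),
              cf          = case_when(measure == "exit" ~ 0,
                                      TRUE ~ cf),
              delta.watts = case_when(measure == "exit" ~ 13,
                                      TRUE ~ delta.watts))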

That said, here is how I would approach it using nest from tidyr and map from purrr.

library(data.table)
library(dplyr)
library(tidyr)
library(purrr)


# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))


dt2 <- dt %>%
  nest(-measure) %>%
  mutate(data = if_else(
    measure == "exit",
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()

You could split the dataset and do a regular mutate call on the TRUE part.

The split can be done with either dplyr::group_split() or base::split(). I like the base version better here since it preserves names; see the discussion at https://github.com/tidyverse/dplyr/issues/4223 .

df1 <- data.frame(site = sample(1:6, 50, replace=T),
                  space = sample(1:4, 50, replace=T),
                  measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                   replace=T),
                  qty = round(runif(50) * 30),
                  qty.exit = 0,
                  delta.watts = sample(10.5:100.5, 50, replace=T),
                  cf = runif(50),
                  stringsAsFactors = F)
library(tidyverse)
df1 %>%
  group_split(measure == "exit", .keep = FALSE) %>%
  modify_at(2, ~mutate(., qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()
#> # A tibble: 50 × 7
#>     site space measure   qty qty.exit delta.watts    cf
#>    <int> <int> <chr>   <dbl>    <dbl>       <dbl> <dbl>
#>  1     5     1 linear     22        0       100.  0.126
#>  2     3     3 led        12        0        61.5 0.161
#>  3     6     1 led        26        0        25.5 0.307
#>  4     5     2 cfl        16        0        26.5 0.865
#>  5     6     3 linear     19        0        57.5 0.684
#>  6     1     4 led        12        0        14.5 0.802
#>  7     6     4 led         5        0        90.5 0.547
#>  8     5     4 linear     28        0        54.5 0.171
#>  9     1     2 linear      5        0        24.5 0.775
#> 10     1     2 cfl        24        0        96.5 0.144
#> # … with 40 more rows
df1 %>%
  split(~measure == "exit") %>%
  modify_at("TRUE", ~mutate(., qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()
#>    site space measure qty qty.exit delta.watts          cf
#> 1     5     1  linear  22        0       100.5 0.125646491
#> 2     3     3     led  12        0        61.5 0.160692291
#> 3     6     1     led  26        0        25.5 0.307239765
#> 4     5     2     cfl  16        0        26.5 0.864969074
#> 5     6     3  linear  19        0        57.5 0.683945200
#> 6     1     4     led  12        0        14.5 0.802398642
#> 7     6     4     led   5        0        90.5 0.547211378
#> 8     5     4  linear  28        0        54.5 0.170614207
#> 9     1     2  linear   5        0        24.5 0.774603932
#> 10    1     2     cfl  24        0        96.5 0.144310557
#> 11    3     4  linear  21        0        93.5 0.682622390
#> 12    4     4     led   2        0        48.5 0.941718646
#> 13    4     4     cfl   2        0       100.5 0.918448627
#> 14    5     2     led  11        0        63.5 0.998143780
#> 15    4     1     led  21        0        53.5 0.644740176
#> 16    1     3     cfl   5        0        28.5 0.110610285
#> 17    1     3  linear  24        0        41.5 0.538868200
#> 18    4     3     led  29        0        19.5 0.998474289
#> 19    2     3     cfl   4        0        22.5 0.008167536
#> 20    5     1     led  20        0        56.5 0.740833476
#> 21    3     2     led   5        0        44.5 0.223967706
#> 22    1     4     led  27        0        32.5 0.199850583
#> 23    3     4     cfl  17        0        61.5 0.104023080
#> 24    1     3     cfl  11        0        34.5 0.399036247
#> 25    2     3  linear  29        0        65.5 0.600678235
#> 26    2     4     cfl  23        0        29.5 0.291611352
#> 27    6     2  linear  13        0        37.5 0.225021614
#> 28    2     3     led  17        0        62.5 0.879606956
#> 29    2     4     led  29        0        51.5 0.301759669
#> 30    5     1     led  11        0        54.5 0.793816856
#> 31    2     3     led  20        0        29.5 0.514759195
#> 32    3     4  linear   6        0        68.5 0.475085443
#> 33    1     4     led  21        0        34.5 0.133207588
#> 34    2     4  linear  25        0        80.5 0.164279355
#> 35    5     3     led   7        0        73.5 0.252937836
#> 36    6     2     led  15        0        99.5 0.554864929
#> 37    3     2  linear   6        0        44.5 0.377257874
#> 38    4     4    exit  15       15        13.0 0.000000000
#> 39    3     3    exit  10       10        13.0 0.000000000
#> 40    5     1    exit  15       15        13.0 0.000000000
#> 41    4     2    exit   1        1        13.0 0.000000000
#> 42    5     3    exit  10       10        13.0 0.000000000
#> 43    1     3    exit  14       14        13.0 0.000000000
#> 44    5     2    exit  12       12        13.0 0.000000000
#> 45    2     2    exit  30       30        13.0 0.000000000
#> 46    6     3    exit  28       28        13.0 0.000000000
#> 47    1     1    exit  14       14        13.0 0.000000000
#> 48    3     3    exit  21       21        13.0 0.000000000
#> 49    4     2    exit  13       13        13.0 0.000000000
#> 50    4     3    exit  12       12        13.0 0.000000000

Created on 2022-10-07 by the reprex package (v2.0.1)

I think this answer has not been mentioned before. It runs almost as fast as the 'default' data.table solution.

Use base::replace()

df %>% mutate(qty.exit = replace(qty.exit, measure == 'exit', qty[measure == 'exit']),
              cf = replace(cf, measure == 'exit', 0),
              delta.watts = replace(delta.watts, measure == 'exit', 13))

replace recycles the replacement value, so when you want the values of column qty entered into column qty.exit, you have to subset qty as well; hence the qty[measure == 'exit'] in the first replacement.

Now, you will probably not want to retype measure == 'exit' all the time, so you can create an index vector containing that selection and use it in the functions above.

# build an index vector matching the condition
index.v <- which(df$measure == 'exit')


df %>% mutate(qty.exit = replace(qty.exit, index.v, qty[index.v]),
              cf = replace(cf, index.v, 0),
              delta.watts = replace(delta.watts, index.v, 13))

Benchmarks:

# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
# data.table   1.005018 1.053370 1.137456 1.112871 1.186228 1.690996   100
# wimpel       1.061052 1.079128 1.218183 1.105037 1.137272 7.390613   100
# wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995   100
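The exact benchmarking code is not shown above; a sketch of a microbenchmark() call along these lines (assuming df holds the question's sample data as a data.frame and dt the same data as a data.table, with labels matching the output) would be:

library(data.table)
library(dplyr)
library(microbenchmark)

microbenchmark(
  data.table = dt[measure == 'exit',
                  `:=`(qty.exit = qty, cf = 0, delta.watts = 13)],
  wimpel = df %>%
    mutate(qty.exit = replace(qty.exit, measure == 'exit', qty[measure == 'exit']),
           cf = replace(cf, measure == 'exit', 0),
           delta.watts = replace(delta.watts, measure == 'exit', 13)),
  wimpel.index = {
    # build the index vector once, then reuse it in each replace()
    index.v <- which(df$measure == 'exit')
    df %>%
      mutate(qty.exit = replace(qty.exit, index.v, qty[index.v]),
             cf = replace(cf, index.v, 0),
             delta.watts = replace(delta.watts, index.v, 13))
  },
  times = 100
)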

One concise solution would be to do the mutation on the filtered subset and then add back the non-exit rows of the table:

library(dplyr)


dt %>%
  filter(measure == 'exit') %>%
  mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
  rbind(dt %>% filter(measure != 'exit'))