替换数据框架中的所有特定值

有了数据框架之后,如何替换所有行和列中的所有特定值。例如,我想用 NA替换所有的空记录(不需要输入位置) :

df <- data.frame(list(A=c("", "xyz", "jkl"), B=c(12, "", 100)))


A   B
1      12
2  xyz
3  jkl 100

预期结果:

    A   B
1  NA   12
2  xyz  NA
3  jkl  100
372141 次浏览

Like this:

> df[df==""]<-NA
> df
A    B
1 <NA>   12
2  xyz <NA>
3  jkl  100

Since PikkuKatja and glallen asked for a more general solution and I cannot comment yet, I'll write an answer. You can combine statements as in:

> df[df=="" | df==12] <- NA
> df
A    B
1  <NA> <NA>
2  xyz  <NA>
3  jkl  100

For factors, zxzak's code already yields factors:

> df <- data.frame(list(A=c("","xyz","jkl"), B=c(12,"",100)))
> str(df)
'data.frame':   3 obs. of  2 variables:
$ A: Factor w/ 3 levels "","jkl","xyz": 1 3 2
$ B: Factor w/ 3 levels "","100","12": 3 1 2

If in trouble, I'd suggest to temporarily drop the factors.

df[] <- lapply(df, as.character)

We can use data.table to get it quickly. First create df without factors,

df <- data.frame(list(A=c("","xyz","jkl"), B=c(12,"",100)), stringsAsFactors=F)

Now you can use

setDT(df)
for (jj in 1:ncol(df)) set(df, i = which(df[[jj]]==""), j = jj, v = NA)

and you can convert it back to a data.frame

setDF(df)

If you only want to use data.frame and keep factors it's more difficult, you need to work with

levels(df$value)[levels(df$value)==""] <- NA

where value is the name of every column. You need to insert it in a loop.

If you want to replace multiple values in a data frame, looping through all columns might help.

Say you want to replace "" and 100:

na_codes <- c(100, "")
for (i in seq_along(df)) {
df[[i]][df[[i]] %in% na_codes] <- NA
}

Here are a couple dplyr options:

library(dplyr)


# all columns:
df %>%
mutate_all(~na_if(., ''))


# specific column types:
df %>%
mutate_if(is.factor, ~na_if(., ''))


# specific columns:
df %>%
mutate_at(vars(A, B), ~na_if(., ''))


# or:
df %>%
mutate(A = replace(A, A == '', NA))


# replace can be used if you want something other than NA:
df %>%
mutate(A = as.character(A)) %>%
mutate(A = replace(A, A == '', 'used to be empty'))

It appears that a solution is missing for multiple values to be replaced and for factors, so I will add one.

Consider a data frame dat with various classes.

dat
#    character integer       Date factor               POSIX
# 1                  4 2022-07-10      B 2022-07-10 20:08:10
# 2                  1 2022-07-11    FOO 2022-07-10 21:08:10
# 3                 -2 2022-07-12        2022-07-10 22:08:10
# 4                  2 2022-07-13      B 2022-07-10 23:08:10
# 5          a       3 2022-07-14        2022-07-11 00:08:10
# 6          c       1 2022-07-15        2022-07-11 01:08:10
# 7          a      -1 2022-07-16    FOO 2022-07-11 02:08:10
# 8          a      -1 2022-07-17      A 2022-07-11 03:08:10
# 9                  4 2022-07-18    FOO 2022-07-11 04:08:10
# 10         c       0 2022-07-19    FOO 2022-07-11 05:08:10
# 11         b      -2 2022-07-20      B 2022-07-11 06:08:10
# 12         c      -2 2022-07-21      A 2022-07-11 07:08:10

We may put everything we want to convert to NA on a list to_na,

To_NA <- list('', -1, -2, 'c', 'FOO', as.Date('2022-07-17'), as.POSIXct('2022-07-11 00:08:10'))

and use it in a small function make_na based on replace. if the respective variable is.factor we may want to droplevels of values that have just been deleted.

make_na <- \(x, z) {x <- replace(x, x %in% z, NA); if (is.factor(x)) droplevels(x) else x}

We can apply it on a vector,

make_na(dat$character, To_NA)
# [1] NA  NA  NA  NA  "a" NA  "a" "a" NA  NA  "b" NA

or loop over the columns using lapply.

dat[] <- lapply(dat, make_na, To_NA)

Gives

dat
#    character integer       Date factor               POSIX
# 1       <NA>       4 2022-07-10      B 2022-07-10 20:08:10
# 2       <NA>       1 2022-07-11   <NA> 2022-07-10 21:08:10
# 3       <NA>      NA 2022-07-12   <NA> 2022-07-10 22:08:10
# 4       <NA>       2 2022-07-13      B 2022-07-10 23:08:10
# 5          a       3 2022-07-14   <NA>                <NA>
# 6       <NA>       1 2022-07-15   <NA> 2022-07-11 01:08:10
# 7          a      NA 2022-07-16   <NA> 2022-07-11 02:08:10
# 8          a      NA       <NA>      A 2022-07-11 03:08:10
# 9       <NA>       4 2022-07-18   <NA> 2022-07-11 04:08:10
# 10      <NA>       0 2022-07-19   <NA> 2022-07-11 05:08:10
# 11         b      NA 2022-07-20      B 2022-07-11 06:08:10
# 12      <NA>      NA 2022-07-21      A 2022-07-11 07:08:10

Where:

str(dat)
# 'data.frame': 12 obs. of  5 variables:
#  $ character: chr  NA NA NA NA ...
#  $ integer  : int  4 1 NA 2 3 1 NA NA 4 0 ...
#  $ Date     : Date, format: "2022-07-10" "2022-07-11" "2022-07-12" ...
#  $ factor   : Factor w/ 2 levels "A","B": 2 NA NA 2 NA NA NA 1 NA NA ...
#  $ POSIX    : POSIXct, format: "2022-07-10 20:08:10" "2022-07-10 21:08:10" "2022-07-10 22:08:10" ...

Data:

dat <- structure(list(character = c("", "", "", "", "a", "c", "a", "a",
"", "c", "b", "c"), integer = c(4L, 1L, -2L, 2L, 3L, 1L, -1L,
-1L, 4L, 0L, -2L, -2L), Date = structure(c(19183, 19184, 19185,
19186, 19187, 19188, 19189, 19190, 19191, 19192, 19193, 19194
), class = "Date"), factor = structure(c(3L, 4L, 1L, 3L, 1L,
1L, 4L, 2L, 4L, 4L, 3L, 2L), levels = c("", "A", "B", "FOO"), class = "factor"),
POSIX = structure(c(1657476490L, 1657480090L, 1657483690L,
1657487290L, 1657490890L, 1657494490L, 1657498090L, 1657501690L,
1657505290L, 1657508890L, 1657512490L, 1657516090L), class = c("POSIXct",
"POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-12L))

Another option is to use replace_with_na_all() from the naniar package, which allows you to replace all the values meeting a condition in the entire dataframe.

library(naniar)
library(dplyr)
    

df %>%
replace_with_na_all(condition = ~.x == "")

Output

  A     B
<chr> <chr>
1 NA    12
2 xyz   NA
3 jkl   100

The upside to this method is that if you also had some cells that also had spaces included, then we could provide both in the conditions argument. Although it would be better to first just trim the whitespace, then use the function above (i.e., adding mutate(across(everything(), ~ trimws(.x))) to the pipe).

df <- data.frame(list(A=c("", "xyz", "  "), B=c(12, "   ", 100)))


df %>%
replace_with_na_all(condition = ~.x %in% c("", "  ", "   "))


#  A     B
#  <chr> <chr>
#1 NA    12
#2 xyz   NA
#3 NA    100