filter for complete cases in data.frame using dplyr (case-wise deletion)

Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).

library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5)
)


df %>%
  filter(complete.cases(x1, x2))

Try this:

df[complete.cases(df),] #output to console

Or even this:

df.complete <- df[complete.cases(df),] #assign to a new data.frame

The commands above check for completeness across all the columns (variables) in your data frame.
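Because complete.cases() accepts a whole data frame, the same idea can be wrapped in a function that never needs to know the column names, which is the case the question raises. A minimal sketch in base R; keep_complete is a hypothetical helper name used only for illustration:

```r
# keep_complete() is a hypothetical helper, not part of base R or dplyr.
# It keeps only the rows of `data` that have no NA in any column.
keep_complete <- function(data) {
  # drop = FALSE keeps the result a data frame even with a single column
  data[complete.cases(data), , drop = FALSE]
}

df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5)
)

complete_rows <- keep_complete(df)
print(complete_rows)
```

The function works for any data frame passed to it, since it never refers to a column by name.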

Try this:

df %>% na.omit

Or this:

df %>% filter(complete.cases(.))

Or this:

library(tidyr)
df %>% drop_na

If you want to filter based on missingness in a single variable, use a condition:

df %>% filter(!is.na(x1))

or

df %>% drop_na(x1)

Other answers indicate that, of the solutions above, na.omit is much slower. That has to be balanced against the fact that it records the row indices of the omitted rows in the na.action attribute, whereas the other solutions above do not.

str(df %>% na.omit)
## 'data.frame':   2 obs. of  2 variables:
##  $ x1: num  1 2
##  $ x2: num  1 2
##  - attr(*, "na.action")= 'omit' Named int  3 4
##    ..- attr(*, "names")= chr  "3" "4"
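If the omitted rows matter later, the na.action attribute shown above can be read back with attr(). A small sketch in base R, assuming the same df as in the question; the variable names cleaned and dropped are only illustrative:

```r
df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5)
)

cleaned <- na.omit(df)

# Row indices that na.omit() removed, stored on the result as an attribute
dropped <- as.integer(attr(cleaned, "na.action"))

# Look up the removed rows in the original data frame
removed_rows <- df[dropped, ]
print(removed_rows)
```

Note that this attribute is only set when at least one row was removed.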

ADDED: Updated to reflect the latest version of dplyr and comments.

ADDED: Updated to reflect the latest version of tidyr and comments.

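Here are some benchmark results for Grothendieck's answer. na.omit() takes about 20 times as long as the other two solutions. I think it would be nice if dplyr had a function for this, perhaps as part of filter.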

library('rbenchmark')
library('dplyr')


n = 5e6
n.na = 100000
df = data.frame(
x1 = sample(1:10, n, replace=TRUE),
x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA




benchmark(
df %>% filter(complete.cases(x1,x2)),
df %>% na.omit(),
df %>% (function(x) filter(x, complete.cases(x)))()
, replications=50)


#                                                  test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))()           50   5.422    1.000
# 1               df %.% filter(complete.cases(x1, x2))           50   6.262    1.155
# 2                                    df %.% na.omit()           50 109.618   20.217

This works for me:

df %>%
filter(complete.cases(df))

Or a little more general:

library(dplyr) # 0.4
df %>% filter(complete.cases(.))

This has the advantage that the data can be modified earlier in the chain before being passed to filter.
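To illustrate that point, here is a sketch in which the data is narrowed with select() before the complete-case filter, so that missing values in an unrelated column do not cause rows to be dropped. The note column is a made-up example column, and the snippet assumes the magrittr pipe (%>%), since the `.` placeholder is a feature of that pipe:

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5),
  note = c("a", NA, "c", "d")  # hypothetical extra column with its own NA
)

result <- df %>%
  select(x1, x2) %>%            # modify the data earlier in the chain
  filter(complete.cases(.))     # `.` is the data as it looks at this step

print(result)
```

Writing filter(complete.cases(df)) here would instead test the original df, including the note column, rather than the selected columns.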

Another benchmark with more columns:

set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
na.omit = {df %>% na.omit},
filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
filter = {df %>% filter(complete.cases(.))},
times = 20L,
unit = "relative")


#Unit: relative
#             expr       min        lq    median         uq       max neval
#         na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233    20
#filter.anonymous  1.149305  1.022891  1.013779  0.9948659  4.668691    20
#         rowSums  2.281002  2.377807  2.420615  2.3467519  5.223077    20
#          filter  1.000000  1.000000  1.000000  1.0000000  1.000000    20

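Here is a short function that lets you specify the columns (basically anything dplyr::select can understand) that should not contain any NA values (modeled after pandas' dropna()):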

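drop_na <- function(data, ...) {
  if (missing(...)) {
    # No columns given: require every column to be non-missing
    f <- complete.cases(data)
  } else {
    # select() accepts the same column specifications passed via ...
    # (this replaces the deprecated select_() / lazyeval::lazy_dots() form)
    f <- complete.cases(select(data, ...))
  }
  filter(data, f)
}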

[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]

Examples:

library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs

Just for the sake of completeness, dplyr::filter can be avoided altogether while still composing chains, simply by using magrittr::extract (an alias of [):

library(magrittr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5))


df %>%
extract(complete.cases(.), )

An additional bonus is speed: this is the fastest method among the filter and na.omit variants (tested using @Miha Trošt's microbenchmarks).

dplyr >= 1.0.4

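if_any and if_all are available in newer versions of dplyr to apply across-like syntax inside the filter function. This is useful if your data frame contains other variables that are not part of what you consider a complete case. For example, if you only want rows with no missing values in the columns that start with "x":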

library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
y = c(NA, "A", "B", "C")
)


df %>%
dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))


  x1 x2    y
1  1  1 <NA>
2  2  2    A

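if_any, mentioned above, gives the complementary selection. As a sketch under the same assumptions (dplyr >= 1.0.4 and the df from this answer), the following keeps the rows that have at least one missing value in the columns starting with "x"; the name rows_with_gaps is only illustrative:

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5),
  y = c(NA, "A", "B", "C")
)

# Keep rows where ANY of the x-columns is NA (the rows if_all would drop)
rows_with_gaps <- df %>%
  filter(if_any(starts_with("x"), is.na))

print(rows_with_gaps)
```

This can be handy for inspecting what a complete-case filter would discard before applying it.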
For more information on these functions, see this link.