Remove empty rows of a data file in R

I have a dataset with empty rows. I would like to remove them:

myData<-myData[-which(apply(myData,1,function(x)all(is.na(x)))),]

Now I want to add a column to the data and initialize its first value:

myData$newCol[1] <- -999


Error in `$<-.data.frame`(`*tmp*`, "newCol", value = -999) :
replacement has 1 rows, data has 0

Unfortunately it doesn't work, and I really don't understand why, nor can I fix it. It did work when I removed one row at a time:

TgData = TgData[2:nrow(TgData),]

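or something similar.

It also works when I use only the first 13,000 rows.

But it doesn't work with my actual data, which has 32,000 rows.

What did I do wrong? It makes no sense to me.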


I assume you want to remove rows that are all NAs. Then you can do the following:

data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1   NA    4
[3,]    4    6    7
[4,]   NA   NA   NA
[5,]    4    8   NA


data[rowSums(is.na(data)) != ncol(data),]
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1   NA    4
[3,]    4    6    7
[4,]    4    8   NA
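One plausible explanation for the error in the question, offered as a sketch rather than a confirmed diagnosis: the message says the data frame has 0 rows, and negative indexing with -which(...) produces exactly that whenever no row matches, because negating an empty index still gives an empty index, which selects nothing. A logical index like the one above does not have this problem. The data frame d below is made up for illustration.

```r
# Hypothetical data frame in which no row is entirely NA
d <- data.frame(x = c(1, NA, 3), y = c(4, 5, NA))

all_na <- apply(d, 1, function(r) all(is.na(r)))  # FALSE FALSE FALSE

# which(all_na) is integer(0); -integer(0) is still integer(0),
# so this selects no rows instead of keeping every row
bad <- d[-which(all_na), ]
nrow(bad)   # 0

# A logical index keeps all rows when nothing matches
good <- d[!all_na, ]
nrow(good)  # 3
```

If that is what happened, the later assignment to myData$newCol[1] would fail because there is no first row left to assign to.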

If you want to remove rows that have at least one NA, just change the condition:

data[rowSums(is.na(data)) == 0,]
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    6    7

If the empty rows contain empty strings rather than NAs, you can do:

data[!apply(data == "", 1, all),]

To remove both (NAs and empty):

data <- data[!apply(is.na(data) | data == "", 1, all),]
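As a small illustration of the combined condition, here is a sketch on a made-up character data frame (raw is hypothetical) where one row is entirely NA, one is entirely empty strings, and one mixes the two. Comparing against "" alone yields NA wherever a cell is NA; combining it with is.na() through | avoids NAs in the row index, because TRUE | NA is TRUE.

```r
# Hypothetical input: rows 2, 3 and 4 carry no usable values
raw <- data.frame(
  id   = c("1", NA, "", "", "5"),
  name = c("a", NA, "", NA, ""),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell is NA or an empty string
blank <- is.na(raw) | raw == ""

# Keep rows that have at least one non-blank cell
cleaned <- raw[!apply(blank, 1, all), ]
cleaned
```

Rows 1 and 5 remain, since each has at least one cell that is neither NA nor an empty string.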

This is similar to some of the above answers, but it lets you remove rows whose share of missing values is greater than or equal to a given fraction (the argument pct):

drop_rows_all_na <- function(x, pct = 1) x[!(rowSums(is.na(x)) >= ncol(x) * pct), ]

Here x is a data frame and pct is the fraction of NA values at or above which a row is dropped.

pct = 1 removes rows in which 100% of the values are NA. pct = .5 removes rows in which at least half of the values are NA.
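To make the effect of pct concrete, here is a minimal sketch on a made-up matrix. The helper is restated under a hypothetical name (drop_rows_by_na_share) so the snippet runs on its own; it uses the equivalent < comparison and adds drop = FALSE, which is my addition to keep the result two-dimensional even when a single row or column remains.

```r
# Hypothetical restatement of the one-liner above
drop_rows_by_na_share <- function(x, pct = 1) {
  # keep rows whose NA count is below the threshold ncol(x) * pct
  x[rowSums(is.na(x)) < ncol(x) * pct, , drop = FALSE]
}

# NA counts per row: 0, 1, 2, 3
m <- rbind(c(1, 2, 3), c(1, NA, 4), c(NA, NA, 7), c(NA, NA, NA))

res_all  <- drop_rows_by_na_share(m)             # threshold 3 NAs
res_half <- drop_rows_by_na_share(m, pct = 0.5)  # threshold 1.5 NAs

nrow(res_all)   # 3: only the all-NA row is dropped
nrow(res_half)  # 2: rows with 2 or 3 NAs are dropped
```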

An alternative for rows that are all NA, using the janitor package:

library(janitor)
myData %>% remove_empty("rows")

Here are some dplyr options:

# sample data
df <- data.frame(a = c('1', NA, '3', NA), b = c('a', 'b', 'c', NA), c = c('e', 'f', 'g', NA))


library(dplyr)


# remove rows where all values are NA:
df %>% filter_all(any_vars(!is.na(.)))
df %>% filter_all(any_vars(complete.cases(.)))




# remove rows where any value is NA:
df %>% filter_all(all_vars(!is.na(.)))
df %>% filter_all(all_vars(complete.cases(.)))


# or more succinctly:
df %>% filter(complete.cases(.))
df %>% na.omit


# dplyr and tidyr:
library(tidyr)
df %>% drop_na

Here's yet another answer if you just want a handy function wrapper. Also, many of the above solutions remove a row with ANY NAs, whereas this one only removes rows that are ALL NAs.

data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data


rmNArows <- function(d) {
  # TRUE for rows that are not entirely NA
  goodRows <- apply(d, 1, function(x) sum(is.na(x)) != ncol(d))
  d[goodRows, ]
}


rmNArows(data)

Using dplyr's if_all/if_any

Drop rows with any NA, or equivalently select rows with no NA values.

df %>% filter(!if_any(a:c, is.na))


#  a b c
#1 1 a e
#2 3 c g


#Also
df %>% filter(if_all(a:c, Negate(is.na)))

Drop rows with all NA values or select rows with at least one non-NA value.

df %>% filter(!if_all(a:c, is.na))


#     a b c
#1    1 a e
#2 <NA> b f
#3    3 c g


#Also
df %>% filter(if_any(a:c, Negate(is.na)))

data

Using data from @sbha:

df <- data.frame(a = c('1', NA, '3', NA),
                 b = c('a', 'b', 'c', NA),
                 c = c('e', 'f', 'g', NA))