Convert all data frame character columns to factors

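Given a (pre-existing) data frame that has columns of various types, what is the simplest way to convert all of its character columns to factors, without affecting any columns of other types?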

Here is an example data.frame:

df <- data.frame(A = factor(LETTERS[1:5]),
B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
D = letters[1:5],
E = paste(LETTERS[1:5], letters[1:5]),
stringsAsFactors = FALSE)
df
#   A B     C D   E
# 1 A 1  TRUE a A a
# 2 B 2  TRUE b B b
# 3 C 3 FALSE c C c
# 4 D 4 FALSE d D d
# 5 E 5  TRUE e E e
str(df)
# 'data.frame':  5 obs. of  5 variables:
#  $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ B: int  1 2 3 4 5
#  $ C: logi  TRUE TRUE FALSE FALSE TRUE
#  $ D: chr  "a" "b" "c" "d" ...
#  $ E: chr  "A a" "B b" "C c" "D d" ...

I know I can do:

df$D <- as.factor(df$D)
df$E <- as.factor(df$E)

Is there a way to automate this process a bit more?

DF <- data.frame(x=letters[1:5], y=1:5, stringsAsFactors=FALSE)


str(DF)
#'data.frame':  5 obs. of  2 variables:
# $ x: chr  "a" "b" "c" "d" ...
# $ y: int  1 2 3 4 5

You can use as.data.frame to turn all character columns into factor columns:

DF <- as.data.frame(unclass(DF),stringsAsFactors=TRUE)
str(DF)
#'data.frame':  5 obs. of  2 variables:
# $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
# $ y: int  1 2 3 4 5

Roland's answer is great for this specific problem, but I thought I would share a more generalized approach.

DF <- data.frame(x = letters[1:5], y = 1:5, z = LETTERS[1:5],
stringsAsFactors=FALSE)
str(DF)
# 'data.frame':  5 obs. of  3 variables:
#  $ x: chr  "a" "b" "c" "d" ...
#  $ y: int  1 2 3 4 5
#  $ z: chr  "A" "B" "C" "D" ...


## The conversion
DF[sapply(DF, is.character)] <- lapply(DF[sapply(DF, is.character)],
as.factor)
str(DF)
# 'data.frame':  5 obs. of  3 variables:
#  $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
#  $ y: int  1 2 3 4 5
#  $ z: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5

For the conversion, the left-hand side of the assignment (DF[sapply(DF, is.character)]) selects the columns that are character. On the right-hand side, lapply applies whatever conversion you need to that same subset. R then replaces the original columns with the results.

The handy thing about this is if you wanted to go the other way or do other conversions, it's as simple as changing what you're looking for on the left and specifying what you want to change it to on the right.
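As a minimal sketch of "going the other way", the same pattern can turn factor columns back into character columns. The data frame below is a hypothetical example, not taken from the answer above, and the helper variable is_fct is introduced only for illustration.

```r
# Hypothetical example data: one factor, one integer, one character column
DF <- data.frame(x = factor(c("a", "b", "c")),
                 y = 1:3,
                 z = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# What to look for: a logical vector marking the factor columns
is_fct <- sapply(DF, is.factor)

# What to change it to: apply as.character to just those columns
DF[is_fct] <- lapply(DF[is_fct], as.character)

str(DF)
```

Computing the selector once and reusing it on both sides also avoids running sapply twice.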

As @Raf Z commented on this question, dplyr now has mutate_if. Super useful, simple and readable.

> str(df)
'data.frame':   5 obs. of  5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int  1 2 3 4 5
$ C: logi  TRUE TRUE FALSE FALSE TRUE
$ D: chr  "a" "b" "c" "d" ...
$ E: chr  "A a" "B b" "C c" "D d" ...


> df <- df %>% mutate_if(is.character,as.factor)


> str(df)
'data.frame':   5 obs. of  5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int  1 2 3 4 5
$ C: logi  TRUE TRUE FALSE FALSE TRUE
$ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
$ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5

I used to use a simple for loop. As in @A5C1D2H2I1M1N2O1R2T1's answer, lapply is a nice solution. But if you convert all the columns at once, you need to wrap the result in data.frame, otherwise you end up with a list. The differences in execution time are small.

mm2N=mm2New[,10:18]
str(mm2N)
'data.frame':   35487 obs. of  9 variables:
$ bb    : int  4 6 2 3 3 2 5 2 1 2 ...
$ vabb  : int  -3 -3 -2 -2 -3 -1 0 0 3 3 ...
$ bb55  : int  7 6 3 4 4 4 9 2 5 4 ...
$ vabb55: int  -3 -1 0 -1 -2 -2 -3 0 -1 3 ...
$ zr    : num  0 -2 -1 1 -1 -1 -1 1 1 0 ...
$ z55r  : num  -2 -2 0 1 -2 -2 -2 1 -1 1 ...
$ fechar: num  0 -1 1 0 1 1 0 0 1 0 ...
$ varr  : num  3 3 1 1 1 1 4 1 1 3 ...
$ minmax: int  3 0 4 6 6 6 0 6 6 1 ...


# for loop solution
t1=Sys.time()
for(i in 1:ncol(mm2N)) mm2N[,i]=as.factor(mm2N[,i])
Sys.time()-t1
Time difference of 0.2020121 secs
str(mm2N)
'data.frame':   35487 obs. of  9 variables:
$ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
$ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
$ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
$ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
$ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
$ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...


#lapply solution
mm2N=mm2New[,10:18]
t1=Sys.time()
mm2N <- lapply(mm2N, as.factor)
Sys.time()-t1
Time difference of 0.209012 secs
str(mm2N)
List of 9
$ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
$ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
$ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
$ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
$ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
$ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...


#data.frame lapply solution
mm2N=mm2New[,10:18]
t1=Sys.time()
mm2N <- data.frame(lapply(mm2N, as.factor))
Sys.time()-t1
Time difference of 0.2010119 secs
str(mm2N)
'data.frame':   35487 obs. of  9 variables:
$ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
$ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
$ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
$ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
$ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
$ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...
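A common base R alternative to wrapping the result in data.frame, not used in the answer above, is to assign into the empty-bracket form of the data frame, which keeps the data.frame class. The sketch below uses a small hypothetical data frame d as a stand-in for mm2N.

```r
# Hypothetical stand-in for mm2N: two numeric-like columns
d <- data.frame(bb = c(4L, 6L, 2L),
                zr = c(0, -2, -1))

# Assigning into d[] replaces the columns in place,
# so d stays a data.frame instead of becoming a plain list
d[] <- lapply(d, as.factor)

class(d)
str(d)
```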

The easiest way is to use the code below. It automates converting all the variables in a data frame to factors in R, and it worked perfectly fine for me. Note that it converts every column, not only the character ones. food_cat is the dataset I am using; change it to the one you are working on.

for (i in 1:ncol(food_cat)) {
  food_cat[,i] <- as.factor(food_cat[,i])
}
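Since the question asks to leave non-character columns untouched, a loop of this kind can be restricted with an is.character check. This is a minimal sketch, assuming a hypothetical two-column food_cat; the column names and values are invented for illustration.

```r
# Hypothetical stand-in for food_cat
food_cat <- data.frame(name = c("apple", "bread", "apple"),
                       calories = c(52, 265, 52),
                       stringsAsFactors = FALSE)

# seq_along() is safe even when the data frame has zero columns
for (i in seq_along(food_cat)) {
  # Convert only the character columns; leave everything else as is
  if (is.character(food_cat[[i]])) {
    food_cat[[i]] <- as.factor(food_cat[[i]])
  }
}

str(food_cat)
```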

I noticed that indexing columns with "[" fails to create the levels when iterating:

for (a_feature in convert.to.factors) {
  feature.df[a_feature] <- factor(feature.df[a_feature])
}

It creates, e.g. for the "Status" column:

Status : Factor w/ 1 level "c(\"Success\", \"Fail\")" : NA NA NA ...

Which is remedied by using "[[" indexing:

for (a_feature in convert.to.factors) {
  feature.df[[a_feature]] <- factor(feature.df[[a_feature]])
}

Giving instead, as desired:

Status : Factor w/ 2 levels "Success", "Fail" : 1 1 2 1 ...
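The difference comes from what each operator returns: "[" with a column name gives back a one-column data frame, while "[[" gives back the column vector itself. A minimal sketch, assuming a hypothetical feature.df with a single Status column:

```r
# Hypothetical data, invented for illustration
feature.df <- data.frame(Status = c("Success", "Success", "Fail", "Success"),
                         stringsAsFactors = FALSE)

class(feature.df["Status"])    # "data.frame": a one-column data frame
class(feature.df[["Status"]])  # "character": the column vector itself

# factor() needs the vector, so use "[[" on both sides
feature.df[["Status"]] <- factor(feature.df[["Status"]])
str(feature.df)
```

Note that factor() sorts levels alphabetically by default, so reproducing a specific level order would require passing the levels argument explicitly.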

Working with dplyr

library(dplyr)


df <- data.frame(A = factor(LETTERS[1:5]),
B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
D = letters[1:5],
E = paste(LETTERS[1:5], letters[1:5]),
stringsAsFactors = FALSE)


str(df)

We get:

'data.frame':   5 obs. of  5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int  1 2 3 4 5
$ C: logi  TRUE TRUE FALSE FALSE TRUE
$ D: chr  "a" "b" "c" "d" ...
$ E: chr  "A a" "B b" "C c" "D d" ...

Now, we can convert all chr to factors:

df <- df%>%mutate_if(is.character, as.factor)
str(df)

And we get:

'data.frame':   5 obs. of  5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int  1 2 3 4 5
$ C: logi  TRUE TRUE FALSE FALSE TRUE
$ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
$ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5

Here are some other solutions:

With base package:

df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)],
as.factor)

With dplyr 1.0.0 or later:

df <- df %>% mutate(across(where(is.character), as.factor))

With purrr package:

library(purrr)


df <- df %>% modify_if(is.character, as.factor)

Based on @Roland's answer and @Paul de Barros's comments, I came to the following conclusion:

df <- data.frame(A = factor(LETTERS[1:5]),
B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
D = letters[1:5],
E = paste(LETTERS[1:5], letters[1:5]),
stringsAsFactors = FALSE)
   

df<-as.data.frame(unclass(df),stringsAsFactors=TRUE)
str(df)

This is simple and seems to work in practice:

> str(df)
'data.frame':   5 obs. of  5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int  1 2 3 4 5
$ C: logi  TRUE TRUE FALSE FALSE TRUE
$ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
$ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5