All Levels of a Factor in a Model Matrix in R

I have a data.frame consisting of numeric and factor variables, as shown below.

testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
stringsAsFactors=TRUE) # needed in R >= 4.0 so that Fourth and Fifth are factors

I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables untouched.

model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

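As expected when running lm, this leaves out one level of each factor as the reference level. However, I want to build a matrix with a dummy/indicator variable for every level of every factor. I am building this matrix for glmnet, so I am not worried about multicollinearity.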

Is there a way to have model.matrix create a dummy for every level of each factor?

You need to reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F),
Fifth=contrasts(testFrame$Fifth, contrasts=F)))

or, with a little less typing but without the proper level names:

model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)),
Fifth=diag(nlevels(testFrame$Fifth))))
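To make the trade-off between the two calls concrete, here is a minimal sketch that builds both matrices and compares them. It assumes a reduced version of testFrame with Fourth and Fifth created explicitly as factors; the names `named`, `unnamed`, and `same_values` are introduced here for illustration only.

```r
# Reduced version of the example data; factor() is used explicitly so the
# code does not depend on the stringsAsFactors default.
set.seed(42)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# Version 1: full contrast matrices that carry the level names.
named <- model.matrix(~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(Fourth = contrasts(testFrame$Fourth, contrasts = FALSE),
                       Fifth  = contrasts(testFrame$Fifth,  contrasts = FALSE)))

# Version 2: plain identity matrices, which carry no level names.
unnamed <- model.matrix(~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(Fourth = diag(nlevels(testFrame$Fourth)),
                       Fifth  = diag(nlevels(testFrame$Fifth))))

# Inspect the column labels of each version.
colnames(named)
colnames(unnamed)

# The numeric content should be identical once the labels are ignored.
same_values <- all(unname(named) == unname(unnamed))
```

Both matrices have an intercept column plus one indicator per level; only the labels differ, because diag() has no level names to pass along.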

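(Trying to redeem myself...) In response to Jared's comment on @Fabian's answer about automating this, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrast matrix from it. We can therefore use lapply() to run contrasts() on each factor in the data set, e.g. for the testFrame example provided: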

> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
        Alice Bob Charlie David
Alice       1   0       0     0
Bob         0   1       0     0
Charlie     0   0       1     0
David       0   0       0     1


$Fifth
        Edward Frank Georgia Hank Isaac
Edward       1     0       0    0     0
Frank        0     1       0    0     0
Georgia      0     0       1    0     0
Hank         0     0       0    1     0
Isaac        0     0       0    0     1

This slots neatly into @Fabian's answer:

model.matrix(~ ., data=testFrame,
contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))

dummyVars from the caret package can also be used.

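OK. Just reading the above and putting it all together. Suppose you want a matrix, say "X.factors", that is multiplied by your coefficient vector to give the linear predictor. There are still a couple of extra steps: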

X.factors =
model.matrix( ~ ., data=X, contrasts.arg =
lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
contrasts, contrasts = FALSE))

(Note that if there is only one factor column, you need to turn X[*] back into a data frame.)
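One way to sidestep the single-factor-column problem is to subset with `drop = FALSE`, which keeps the result a data frame even when only one column is selected. The following is a minimal sketch on a small made-up data frame; `X`, `fac_cols`, and `full_contrasts` are illustrative names, not part of the answer above.

```r
# Toy data with exactly one factor column (hypothetical example data).
X <- data.frame(
  num = c(1.5, 2.0, 3.5, 4.0),
  grp = factor(c("a", "b", "a", "c"))
)

# Logical index of the factor columns.
fac_cols <- vapply(X, is.factor, logical(1))

# drop = FALSE keeps a one-column selection as a data frame, so lapply()
# still iterates over columns and the resulting list stays named.
full_contrasts <- lapply(X[, fac_cols, drop = FALSE], contrasts, contrasts = FALSE)

# Model matrix with an indicator column for every level of grp.
X.factors <- model.matrix(~ ., data = X, contrasts.arg = full_contrasts)
colnames(X.factors)
```

Without `drop = FALSE`, `X[, fac_cols]` would collapse to a bare factor and lapply() would iterate over its elements instead of over columns.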

Then say you get something like this:

attr(X.factors,"assign")
[1]  0  1  **2**  2  **3**  3  3  **4**  4  4  5  6  7  8  9 10 #emphasis added

We want to get rid of the marked reference level of each factor:

att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))

model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)

or

model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)

should be the most straightforward.
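One caveat is worth checking before relying on this: when the intercept is removed, R expands only the first factor in the formula into a full set of indicators, and later factors still lose their reference level. A minimal sketch, assuming a reduced testFrame with just the two factor columns (the name `mm` is illustrative):

```r
# Only the two factor columns are needed to show the behaviour.
testFrame <- data.frame(
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# Drop the intercept and inspect which indicator columns are produced.
mm <- model.matrix(~ Fourth + Fifth - 1, data = testFrame)
colnames(mm)

# Fourth keeps all four levels, but Fifth is still missing its first level.
"FourthAlice" %in% colnames(mm)
"FifthEdward" %in% colnames(mm)
```

So with more than one factor, the contrasts.arg approach shown in the earlier answers is still needed to get every level of every factor.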

Using the R package "CatEncoders":

library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))


fit <- OneHotEncoder.fit(testFrame)


z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output

caret implements a nice function, dummyVars, that achieves this in two lines:

library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Check the final columns:

colnames(testFrame2)


"First"  "Second"         "Third"          "Fourth.Alice"   "Fourth.Bob"     "Fourth.Charlie" "Fourth.David"   "Fifth.Edward"   "Fifth.Frank"   "Fifth.Georgia"  "Fifth.Hank"     "Fifth.Isaac"

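The nicest point here is that you get the original data frame plus the dummy variables, with the original factor columns used for the transformation already excluded.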

More info: http://amunategui.github.io/dummyVar-Walkthrough/

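I am currently learning about Lasso models and glmnet::cv.glmnet(), model.matrix(), and Matrix::sparse.model.matrix() (for high-dimensional matrices, using model.matrix will eat up our time, as the authors of glmnet point out).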
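As a rough sketch of how the sparse variant fits together with the contrasts trick from the earlier answers: Matrix::sparse.model.matrix() also accepts a contrasts.arg list, so the same named list of full contrast matrices can be passed to it. The seed and the helper names `is_fac`, `full_contrasts`, and `sparse_mm` are assumptions made for this illustration.

```r
library(Matrix)  # provides sparse.model.matrix()

set.seed(1)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = factor(rep(c("Alice", "Bob", "Charlie", "David"), 5)),
  Fifth  = factor(rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
)

# Full (identity) contrast matrices for every factor column.
is_fac <- vapply(testFrame, is.factor, logical(1))
full_contrasts <- lapply(testFrame[, is_fac, drop = FALSE],
                         contrasts, contrasts = FALSE)

# Sparse model matrix with an indicator column for every factor level.
sparse_mm <- sparse.model.matrix(~ ., data = testFrame,
                                 contrasts.arg = full_contrasts)
dim(sparse_mm)
```

The result is a sparse matrix that can be handed to glmnet functions as the predictor matrix; for a data set this small the benefit is negligible, but the same call scales to many factor levels.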

Just sharing that there is a tidy way to get the same result as @Fabian's and @Gavin's answers. Meanwhile, @asdf123 also introduced another package, library('CatEncoders').

> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
>
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (page 273)

A tidyverse answer:

library(dplyr)
library(tidyr)
result <- testFrame %>%
mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>%
mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

yields the desired result (same as @Gavin Simpson's answer):

> head(result, 6)
  First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1     1      5     4           0         0             1           0           0          1            0         0          0
2     1     14    10           0         0             0           1           0          0            1         0          0
3     2      2     9           0         1             0           0           1          0            0         0          0
4     2      5     4           0         0             0           1           0          1            0         0          0
5     2     13     5           0         0             1           0           1          0            0         0          0
6     2     15     7           1         0             0           0           1          0            0         0          0

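You can use the tidyverse to achieve this without specifying each column manually.

The trick is to make a "long" data frame.

Then munge a few things and spread it back to wide format to create the indicator/dummy variables.

Code: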

library(tidyverse)


## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)


testFrame %>%
## pivot to "long" format
gather(feature, value, -id) %>%
## add indicator value
mutate(indicator=1) %>%
## create feature name that unites a feature and its value
unite(feature, value, col="feature_value", sep="_") %>%
## convert to wide format, filling missing values with zero
spread(feature_value, indicator, fill=0)


Output:

   id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1   1            1           0             0          0           0       0       0       0
2   2            0           1             0          0           0       0       0       0
3   3            0           0             1          0           0       0       0       0
4   4            0           0             0          1           0       0       0       0
5   5            0           0             0          0           1       0       0       0
6   6            1           0             0          0           0       0       0       0
7   7            0           1             0          0           0       0       1       0
8   8            0           0             1          0           0       1       0       0
9   9            0           0             0          1           0       0       0       0
10 10            0           0             0          0           1       0       0       0
11 11            1           0             0          0           0       0       0       0
12 12            0           1             0          0           0       0       0       0
...

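I wrote a package called ModelMatrixModel to improve on the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns a class containing a sparse matrix with dummy variables for all levels, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned class also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most terms in an R formula, such as poly() and interactions. It also offers several other options, such as handling invalid factor levels and scaling the output.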

#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     7     17           1         0             0           0
## 2     9      7           0         1             0           0


#apply the same transformation to new data, note the dummy variables for 'Fourth' includes the levels not appearing in new data
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2)))
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     6      3           0         1             0           0
## 2     2     12           0         0             1           0