How to use the 'sweep' function

When I look at the source of R packages, I see the function sweep used quite often. Sometimes it's used where a simpler function would have sufficed (e.g., apply); other times, it's impossible to know exactly what it's doing without spending a fair amount of time stepping through the code block it's in.

The fact that I can reproduce sweep's effect using a simpler function suggests that I don't understand sweep's core use cases, and the fact that this function is used so often suggests that it's quite useful.

The context:

sweep is a function in R's standard library; its arguments are:

sweep(x, MARGIN, STATS, FUN="-", check.margin=TRUE, ...)


# x is the data
# MARGIN is the dimension STATS corresponds to (1 = rows, 2 = columns)
# STATS refers to the summary statistics which you wish to 'sweep out'
# FUN is the function used to carry out the sweep, "-" is the default

As you can see, the arguments are similar to those of apply, though sweep requires one more parameter, STATS.

Another key difference is that sweep returns an array of the same shape as the input array, whereas the result returned by apply depends on the function passed in.
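
As a rough sketch of that shape difference (the matrix m and the variable names below are made up for illustration, not taken from any package):

```r
# Hypothetical 4 x 3 matrix, used only for illustration
m <- matrix(1:12, ncol = 3)

# apply() over rows hands each result back as a COLUMN,
# so centering by row comes back transposed (3 x 4)
row_apply <- apply(m, 1, function(row) row - mean(row))

# sweep() keeps the shape of its input (4 x 3)
row_sweep <- sweep(m, 1, rowMeans(m), FUN = "-")

dim(row_apply)
dim(row_sweep)

# Same numbers once the apply() result is transposed back
all.equal(t(row_apply), row_sweep)
```

So the two can be made to agree, but with apply I have to remember the transpose myself.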

sweep in action:

# e.g., use 'sweep' to express a given matrix in terms of distance from
# the respective column mean


# create some data:
M = matrix( 1:12, ncol=3)


# calculate column-wise mean for M
dx = colMeans(M)


# now 'sweep' that summary statistic from M
sweep(M, 2, dx, FUN="-")


     [,1] [,2] [,3]
[1,] -1.5 -1.5 -1.5
[2,] -0.5 -0.5 -0.5
[3,]  0.5  0.5  0.5
[4,]  1.5  1.5  1.5

So in sum, what I'm looking for is an exemplary use case or two for sweep.

Please, do not recite or link to the R Documentation, mailing lists, or any of the 'primary' R sources--assume I've read them. What I'm interested in is how experienced R programmers/analysts use sweep in their own code.


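sweep() is typically used when you operate on a matrix by row or by column, and the other input of the operation is a different value for each row or column. Whether you operate by row or by column is defined by MARGIN, as for apply(). The values used for what I called "the other input" are defined by STATS. So, for each row (or column), you take one value from STATS and use it in the operation defined by FUN.

For instance, if you want to add 1 to the first row, 2 to the second, and so on, of the matrix you defined, you would do: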

sweep(M, 1, c(1:4), "+")
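
To make the pairing between rows and STATS concrete, here is a minimal sketch comparing that call with an explicit loop. The loop is only my illustration of the effect, not how sweep is implemented internally, and the names m, stats, by_sweep, and by_loop are made up:

```r
m <- matrix(1:12, ncol = 3)   # same values as M in the question
stats <- 1:4                  # one value per row

by_sweep <- sweep(m, 1, stats, "+")

# Explicit loop with the same effect: row i gets stats[i] added
by_loop <- m
for (i in seq_len(nrow(m))) {
  by_loop[i, ] <- m[i, ] + stats[i]
}

all.equal(by_sweep, by_loop)

# With MARGIN = 2 the pairing is per column instead
by_col <- sweep(m, 2, c(10, 20, 30), "+")
by_col[1, ]
```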

Frankly, I did not understand the definition in the R documentation either; I just learned it by looking up examples.

sweep() can be great for systematically manipulating a large matrix, either column by column or row by row, as shown below:

> print(size)
     Weight Waist Height
[1,]    130    26    140
[2,]    110    24    155
[3,]    118    25    142
[4,]    112    25    175
[5,]    128    26    170


> sweep(size, 2, c(10, 20, 30), "+")
     Weight Waist Height
[1,]    140    46    170
[2,]    120    44    185
[3,]    128    45    172
[4,]    122    45    205
[5,]    138    46    200

Granted, this example is simple, but by changing the STATS and FUN arguments, other manipulations are possible.
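
As a sketch of what changing STATS and FUN can look like, the snippet below rebuilds the size matrix from the printout above and tries two other operations. The per-column caps are hypothetical values chosen only for illustration:

```r
# Rebuild the matrix shown above
size <- matrix(c(130, 110, 118, 112, 128,
                 26, 24, 25, 25, 26,
                 140, 155, 142, 175, 170),
               ncol = 3,
               dimnames = list(NULL, c("Weight", "Waist", "Height")))

# Divide each column by its own maximum, so every column tops out at 1
col_max <- apply(size, 2, max)
scaled <- sweep(size, 2, col_max, "/")

# FUN does not have to be an arithmetic operator: sweep() calls it with x and
# a same-shaped array built from STATS, so a vectorised function such as
# pmin() works too. Here each column is capped at a hypothetical limit.
caps <- c(120, 25, 160)
capped <- sweep(size, 2, caps, pmin)

scaled
capped
```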

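This question is a bit old, but since I ran into this recently: a typical use of sweep can be found in the source code of the stats function cov.wt, which computes weighted covariance matrices. I'm looking at the code in R 3.0.1. There, sweep is used to subtract the column means before the covariance is computed. On line 19 of the code, the centering vector is derived: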

center <- if (center)
    colSums(wt * x)
else 0

and on line 54 it is swept out of the matrix:

x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)

The author of the code is using the default value FUN = "-", which confused me for a while.
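
Putting those two lines together, here is a minimal sketch that reproduces the "unbiased" result of cov.wt by hand. The data and weights are randomly generated for illustration, and the variable names xc and manual are mine, not from the cov.wt source:

```r
set.seed(42)
x  <- matrix(rnorm(20), ncol = 2)   # hypothetical data: 10 observations, 2 variables
wt <- runif(10)
wt <- wt / sum(wt)                  # cov.wt also rescales the weights to sum to 1

center <- colSums(wt * x)                   # weighted column means
xc <- sqrt(wt) * sweep(x, 2, center)        # default FUN = "-" subtracts them
manual <- crossprod(xc) / (1 - sum(wt^2))   # scaling used by method = "unbiased"

all.equal(manual, cov.wt(x, wt)$cov)
```

With equal weights the same steps reduce to the ordinary sample covariance, cov(x).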

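You can use the sweep function to center and scale data, as in the following code. Note that the means and sds here are arbitrary (you may have reference values on which you want to base the standardization):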

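# 25 distinct raw scores arranged as a 5 x 5 matrix
df=matrix(sample.int(150, size = 25, replace = FALSE),5,5)


# column means and standard deviations
df_means=colMeans(df)
df_sds=apply(df,2,sd)


# center, scale, then map to mean 50 and sd 10
df_T=sweep(sweep(df,2,df_means,"-"),2,df_sds,"/")*10+50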

This code converts raw scores to T scores (mean = 50, sd = 10):

> df
     [,1] [,2] [,3] [,4] [,5]
[1,]  109    8   89   69   15
[2,]   85   13   25  150   26
[3,]   30   79   48    1  125
[4,]   56   74   23  140  100
[5,]  136  110  112   12   43
> df_T
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 56.15561 39.03218 57.46965 49.22319 40.28305
[2,] 50.42946 40.15594 41.31905 60.87539 42.56695
[3,] 37.30704 54.98946 47.12317 39.44109 63.12203
[4,] 43.51037 53.86571 40.81435 59.43685 57.93136
[5,] 62.59752 61.95672 63.27377 41.02349 46.09661
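
As a sanity check on the double sweep, the sketch below compares it with base R's scale() on freshly generated data (the seed, the scores matrix, and the reference values ref_means and ref_sds are hypothetical, chosen only for illustration):

```r
set.seed(1)
scores <- matrix(sample.int(150, size = 25), 5, 5)   # hypothetical raw scores

# Center, then scale, with two nested sweep() calls
z_sweep <- sweep(sweep(scores, 2, colMeans(scores), "-"),
                 2, apply(scores, 2, sd), "/")

# scale() centers and scales columns by default; it also attaches attributes,
# so compare the values only
z_scale <- scale(scores)
all.equal(z_sweep, z_scale, check.attributes = FALSE)

# Reference values that do not come from the sample plug in the same way
ref_means <- rep(75, 5)   # hypothetical norms
ref_sds   <- rep(40, 5)
t_ref <- sweep(sweep(scores, 2, ref_means, "-"), 2, ref_sds, "/") * 10 + 50
```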

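One use is computing weighted sums for an array. Where rowSums or colSums can be assumed to mean "weights = 1", sweep can be applied beforehand to give a weighted result. This is particularly useful for arrays with >= 3 dimensions.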

This comes up e.g. when calculating a weighted covariance matrix as per @James King's example.

Here's another example based on a current project:

set.seed(1)
## 2x2x2 array
a1 <- array(as.integer(rnorm(8, 10, 5)), dim=c(2, 2, 2))
## 'element-wise' sum of matrices
## weights = 1
rowSums(a1, dims=2)
## weights
w1 <- c(3, 4)
## a1[, , 1] * 3;  a1[, , 2] * 4
a1 <- sweep(a1, MARGIN=3, STATS=w1, FUN="*")
rowSums(a1, dims=2)
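
To check that the sweep step does what the comments say, here is a short sketch comparing it with the weighted sum written out by hand (the names a, w, weighted, and by_hand are mine, introduced for illustration):

```r
set.seed(1)
a <- array(as.integer(rnorm(8, 10, 5)), dim = c(2, 2, 2))
w <- c(3, 4)   # one weight per slice along the third dimension

# Weight each slice with sweep(), then sum over the third dimension
weighted <- rowSums(sweep(a, MARGIN = 3, STATS = w, FUN = "*"), dims = 2)

# The same weighted sum written out slice by slice
by_hand <- a[, , 1] * w[1] + a[, , 2] * w[2]

all.equal(weighted, by_hand)
```

Writing it out by hand stops being practical once the third dimension has more than a few slices, which is where the sweep version pays off.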