Batch Normalization in Convolutional Neural Networks

I am new to convolutional neural networks and only have a rough idea of feature maps and how images are convolved to extract features. I would be glad to learn some details about applying batch normalization in a CNN.

I read this paper https://arxiv.org/pdf/1502.03167v3.pdf and could understand the BN algorithm as applied to a single data point, but at the end they mention that a slight modification is required when it is applied to a CNN:

For convolutional layers, we additionally want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations, so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.

When they say "so that different elements of the same feature map, at different locations, are normalized in the same way":

I know what a feature map means, and the different elements are the weights of every feature map. But I could not understand what location or spatial location means.

I could not understand the following sentence at all: "In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations".

I would be glad if someone could explain this to me in simpler terms.


I'm only 70% sure of what I say, so if it does not make sense, please edit or mention it before downvoting.

About location or spatial location: they mean the position of pixels in an image or feature map. A feature map is comparable to a sparse, modified version of the image in which concepts are represented.

About so that different elements of the same feature map, at different locations, are normalized in the same way: some normalisation algorithms are local, so they depend on their close surroundings (location) and not on things far apart in the image. They probably mean that every pixel, regardless of its location, is treated just like an element of a set, independently of its direct spatial surroundings.

About In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations: they build a flat list of every value of every training example in the mini-batch, and this list combines values regardless of their location in the feature map.
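To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) that builds that flat list for a single feature map of an activation tensor with mini-batch size m and p × q feature maps:

import numpy as np

m, p, q, C = 4, 5, 5, 3              # mini-batch size, feature map height/width, channel count
t = np.random.randn(m, p, q, C)      # activations of one conv layer

k = 0                                # pick a single feature map (channel)
B_k = t[:, :, :, k].reshape(-1)      # all values of that feature map, across the batch and all positions
assert B_k.size == m * p * q         # the paper's "effective mini-batch" m' = |B| = m * pq

mu, sigma = B_k.mean(), B_k.std()    # one mean and one std for the whole feature map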

Let's start with the terms. Remember that the output of the convolutional layer is a 4-rank tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, C is the number of channels. An index (x, y) where 0 <= x < H and 0 <= y < W is a spatial location.

Usual batchnorm

Now, here's how the batchnorm is applied in a usual way (in pseudo-code):

# t is the incoming tensor of shape [B, H, W, C]
# mean and stddev are computed along 0 axis and have shape [H, W, C]
mean = mean(t, axis=0)
stddev = stddev(t, axis=0)
for i in 0..B-1:
    out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)

Basically, it computes H*W*C means and H*W*C standard deviations across the B elements. Note that different elements at different spatial locations have their own mean and variance, each gathered over only B values.
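Here is a runnable NumPy version of the pseudo-code above (my sketch; it omits the learnable gamma/beta and adds a small epsilon for numerical stability):

import numpy as np

B, H, W, C = 8, 4, 4, 16
t = np.random.randn(B, H, W, C)

mean = t.mean(axis=0)                # shape [H, W, C]: separate statistics per location and channel
stddev = t.std(axis=0)               # shape [H, W, C]
out = (t - mean) / (stddev + 1e-5)   # broadcasts over the batch axis

print(mean.shape)                    # (4, 4, 16) -> H*W*C means, each computed over only B values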

Batchnorm in conv layer

This way is totally possible. But the convolutional layer has a special property: filter weights are shared across the input image (you can read about it in detail in this post). That's why it's reasonable to normalize the output in the same way, so that each output value is normalized with the mean and variance of B*H*W values, taken at different locations.

Here's what the code looks like in this case (again pseudo-code):

# t is still the incoming tensor of shape [B, H, W, C]
# but mean and stddev are computed along (0, 1, 2) axes and have just [C] shape
mean = mean(t, axis=(0, 1, 2))
stddev = stddev(t, axis=(0, 1, 2))
for i in 0..B-1, x in 0..H-1, y in 0..W-1:
    out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)

In total, there are only C means and standard deviations and each one of them is computed over B*H*W values. That's what they mean when they say "effective mini-batch": the difference between the two is only in axis selection (or equivalently "mini-batch selection").
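Again, a runnable NumPy sketch of the same idea (omitting gamma/beta, adding a small epsilon):

import numpy as np

B, H, W, C = 8, 4, 4, 16
t = np.random.randn(B, H, W, C)

mean = t.mean(axis=(0, 1, 2))        # shape [C]: one mean per feature map
stddev = t.std(axis=(0, 1, 2))       # shape [C]: one std per feature map
out = (t - mean) / (stddev + 1e-5)   # every location of a channel is normalized the same way

print(mean.shape)                    # (16,) -> C statistics, each computed over B*H*W values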

Some clarification on Maxim's answer.

I was puzzled to see that in Keras the axis you specify is the channels axis, as it doesn't make sense to normalize over the channels: every channel in a conv net is considered a different "feature". I.e. normalizing over all channels is equivalent to normalizing the number of bedrooms together with the size in square feet (the multivariate regression example from Andrew Ng's ML course). This is usually not what you want; instead you normalize every feature by itself. I.e. you normalize the number of bedrooms across all examples to have mu=0 and std=1, and you normalize the square feet across all examples to have mu=0 and std=1.

This is why you want C means and stds, because you want a mean and std per channel/feature.

After checking and testing it myself I realized the issue: there's a bit of a confusion/misconception here. The axis you specify in Keras is actually the axis which is excluded from the calculations, i.e. you average over every axis except the one specified by this argument. This is confusing, as it is exactly the opposite of how NumPy works, where the specified axis is the one you perform the operation on (e.g. np.mean, np.std, etc.).

I actually built a toy model with only BN and then calculated the BN manually: I took the mean and std across the first 3 dimensions [m, n_H, n_W] and got n_C results, calculated (X - mu)/std (using broadcasting), and got results identical to the Keras output.
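For reference, a sketch of that check (assuming TensorFlow's Keras; gamma and beta are still at their initial values of 1 and 0, and Keras uses epsilon = 1e-3 by default):

import numpy as np
import tensorflow as tf

X = np.random.randn(32, 7, 7, 16).astype(np.float32)    # [m, n_H, n_W, n_C]

bn = tf.keras.layers.BatchNormalization(axis=-1)         # axis=-1 keeps the channel axis out of the reduction
keras_out = bn(X, training=True).numpy()                 # training=True -> normalize with batch statistics

mu = X.mean(axis=(0, 1, 2))                              # average over every axis except channels
var = X.var(axis=(0, 1, 2))
manual_out = (X - mu) / np.sqrt(var + 1e-3)              # 1e-3 is the default Keras epsilon

print(np.allclose(keras_out, manual_out, atol=1e-4))     # True (up to float32 rounding)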

Hope this helps anyone who was confused as I was.

  1. Firstly, we need to make it clear that the depth of a kernel is determined by the previous feature map's channel count, and the number of kernels in this layer determines the channel count of the next feature map (the next layer).
  2. Then we should make it clear that each kernel (usually three-dimensional) generates just one channel of the feature map in the next layer.
  3. Thirdly, we should try to accept the idea that each point in the generated feature map (regardless of its position) is produced by the same kernel sliding over the previous layer. So those points can be seen as a distribution generated by that kernel, i.e. as samples of a single random variable, and they are averaged together to obtain the mean and then the variance. (This is not rigorous, it only helps understanding.) This is what they mean by "so that different elements of the same feature map, at different locations, are normalized in the same way"; see the small sketch after this list.
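To illustrate points 1-3, here is a small NumPy sketch (my own toy example, using a naive loop instead of a real conv implementation): a single 3x3 kernel of depth 3 slides over a mini-batch of RGB images and produces exactly one feature map per image; all of those values are then normalized with a single mean and variance.

import numpy as np

imgs = np.random.randn(4, 8, 8, 3)       # mini-batch of 4 images, 8x8 pixels, 3 input channels
kernel = np.random.randn(3, 3, 3)        # one kernel; its depth 3 matches the input channel count

fmap = np.zeros((4, 6, 6))               # 'valid' convolution output: 8 - 3 + 1 = 6
for b in range(4):
    for x in range(6):
        for y in range(6):
            # the same kernel is applied at every position of every image
            fmap[b, x, y] = np.sum(imgs[b, x:x+3, y:y+3, :] * kernel)

# every value in fmap was produced by the same kernel, so treat them all as
# samples of one distribution: a single mean/variance for the whole feature map
mu, var = fmap.mean(), fmap.var()
fmap_norm = (fmap - mu) / np.sqrt(var + 1e-5)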