Viola-Jones' face detection claims 180k features

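I've been implementing an adaptation of Viola-Jones' face detection algorithm. The technique relies on placing a 24x24 pixel subframe within an image, and then placing rectangular features inside it at every possible position.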

These features can consist of two, three or four rectangles.

Rectangle features

They claim the exhaustive set is more than 180k (section 2):

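Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.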

The following statements are not explicitly made in the paper, so they are assumptions on my part:

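  1. There are only 2 two-rectangle features, 2 three-rectangle features and 1 four-rectangle feature. The logic behind this is that we are observing the difference between the highlighted rectangles, not explicitly the color or luminance or anything of that sort.
  2. We cannot define feature type A as a 1x1 pixel block; it must be at least 1x2 pixels. Also, type D must be at least 2x2 pixels, and this rule applies accordingly to the other features.
  3. We cannot define feature type A as a 1x3 pixel block, because the middle pixel cannot be partitioned, and subtracting it from itself makes the feature identical to a 1x2 pixel block; this feature type is only defined for even widths. Also, the width of feature type C must be divisible by 3, and this rule applies accordingly to the other features.
  4. We cannot define a feature with a width and/or height of 0. Therefore, we iterate x and y up to 24 minus the size of the feature.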
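
Assumptions 2 and 3 amount to saying that a feature's width and height must be positive multiples of its base shape. A minimal sketch of that rule follows; `valid_sizes` and the (width, height) base shapes are my own illustrative choices, not something taken from the paper:

```python
def valid_sizes(base_w, base_h, frame_size=24):
    """List every (width, height) allowed for a base shape.

    Assumption: sizes are positive multiples of the base shape that fit
    inside the frame, e.g. a 2x1 base only admits even widths.
    """
    return [(w, h)
            for w in range(base_w, frame_size + 1, base_w)
            for h in range(base_h, frame_size + 1, base_h)]

two_rect = valid_sizes(2, 1)     # two rectangles side by side
three_rect = valid_sizes(3, 1)   # three rectangles side by side
four_rect = valid_sizes(2, 2)    # four rectangles

print(len(two_rect), len(three_rect), len(four_rect))  # 288 192 144
```

Sizes such as 1x1 or 3x1 never appear in `two_rect`, and the smallest entry in `four_rect` is 2x2.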

Based on these assumptions, I counted the exhaustive set:

const int frameSize = 24;
const int features = 5;
// All five feature types:
const int feature[features][2] = {{2,1}, {1,2}, {3,1}, {1,3}, {2,2}};

int count = 0;
// Each feature:
for (int i = 0; i < features; i++) {
    int sizeX = feature[i][0];
    int sizeY = feature[i][1];
    // Each position:
    for (int x = 0; x <= frameSize-sizeX; x++) {
        for (int y = 0; y <= frameSize-sizeY; y++) {
            // Each size fitting within the frameSize:
            for (int width = sizeX; width <= frameSize-x; width+=sizeX) {
                for (int height = sizeY; height <= frameSize-y; height+=sizeY) {
                    count++;
                }
            }
        }
    }
}

The result is 162,336.
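
As a cross-check on that number, the four nested loops factor into independent horizontal and vertical parts, so the count can also be computed per axis. This is a sketch under the same four assumptions; `axis_count` is a name I made up for illustration.

```python
def axis_count(base, frame_size=24):
    """Number of (position, extent) pairs along one axis.

    For each extent that is a positive multiple of `base`, there are
    frame_size - extent + 1 positions where it still fits.
    """
    return sum(frame_size - extent + 1
               for extent in range(base, frame_size + 1, base))

shapes = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
total = sum(axis_count(sx) * axis_count(sy) for sx, sy in shapes)

print(axis_count(1), axis_count(2), axis_count(3))  # 300 144 92
print(total)                                        # 162336
```

Written out, that is 2 * 144 * 300 + 2 * 92 * 300 + 144 * 144 = 86,400 + 55,200 + 20,736.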

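The only way I found to get close to the "over 180,000" that Viola & Jones mention is to drop assumption #4 and introduce bugs into the code. This involves changing the two size loops to: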

for (int width = 0; width < frameSize-x; width+=sizeX)
for (int height = 0; height < frameSize-y; height+=sizeY)

The result is then 180,625. (Note that this effectively prevents the features from ever touching the right and/or bottom edge of the subframe.)
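
To see what the modified loops actually count, the sketch below reruns them in Python and separates zero-area candidates from proper ones. It only illustrates the two loops quoted above and is not a claim about what Viola and Jones implemented; the variable names are mine.

```python
FRAME = 24
SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]

total = degenerate = 0
max_right = max_bottom = 0
for sx, sy in SHAPES:
    for x in range(FRAME - sx + 1):
        for y in range(FRAME - sy + 1):
            # Modified loops: start at 0 and stop *before* the frame edge.
            for width in range(0, FRAME - x, sx):
                for height in range(0, FRAME - y, sy):
                    total += 1
                    if width == 0 or height == 0:
                        degenerate += 1   # zero-area "feature"
                    max_right = max(max_right, x + width)
                    max_bottom = max(max_bottom, y + height)

print(total)                   # 180625
print(degenerate)              # how many of those have zero area
print(max_right, max_bottom)   # 23 23: never reaches the edge at 24
```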

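Now of course the question is: did they make a mistake in their implementation? Does it make any sense to consider features with zero area? Or am I looking at this the wrong way?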

Having not read the whole paper, the wording of your quote sticks out at me:

Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.

"The set of rectangle features is overcomplete" "Exhaustive set"

It sounds to me like a setup, where I expect the paper's authors to follow up with an explanation of how they cull the search space down to a more effective set by, for example, getting rid of trivial cases such as rectangles with zero surface area.

edit: or using some kind of machine learning algorithm, as the abstract hints at. Exhaustive set implies all possibilities, not just "reasonable" ones.

There is no guarantee that any author of any paper is correct in all their assumptions and findings. If you think that assumption #4 is valid, then keep that assumption, and try out your theory. You may be more successful than the original authors.

Upon closer look, your code looks correct to me; which makes one wonder whether the original authors had an off-by-one bug. I guess someone ought to look at how OpenCV implements it!

Nonetheless, one suggestion to make it easier to understand is to flip the order of the for loops by going over all sizes first, then looping over the possible locations given the size:

#include <stdio.h>

int main()
{
    int i, x, y, sizeX, sizeY, width, height, count, c;

    /* All five shape types */
    const int features = 5;
    const int feature[][2] = {{2,1}, {1,2}, {3,1}, {1,3}, {2,2}};
    const int frameSize = 24;

    count = 0;
    /* Each shape */
    for (i = 0; i < features; i++) {
        sizeX = feature[i][0];
        sizeY = feature[i][1];
        printf("%dx%d shapes:\n", sizeX, sizeY);

        /* each size (multiples of basic shapes) */
        for (width = sizeX; width <= frameSize; width+=sizeX) {
            for (height = sizeY; height <= frameSize; height+=sizeY) {
                printf("\tsize: %dx%d => ", width, height);
                c=count;

                /* each possible position given size */
                for (x = 0; x <= frameSize-width; x++) {
                    for (y = 0; y <= frameSize-height; y++) {
                        count++;
                    }
                }
                printf("count: %d\n", count-c);
            }
        }
    }
    printf("%d\n", count);

    return 0;
}

This gives the same result as before: 162336.

To verify it, I tested the case of a 4x4 window and manually checked all cases (easy to count since 1x2/2x1 and 1x3/3x1 shapes are the same only 90 degrees rotated):

2x1 shapes:
    size: 2x1 => count: 12
    size: 2x2 => count: 9
    size: 2x3 => count: 6
    size: 2x4 => count: 3
    size: 4x1 => count: 4
    size: 4x2 => count: 3
    size: 4x3 => count: 2
    size: 4x4 => count: 1
1x2 shapes:
    size: 1x2 => count: 12          +-----------------------+
    size: 1x4 => count: 4           |     |     |     |     |
    size: 2x2 => count: 9           |     |     |     |     |
    size: 2x4 => count: 3           +-----+-----+-----+-----+
    size: 3x2 => count: 6           |     |     |     |     |
    size: 3x4 => count: 2           |     |     |     |     |
    size: 4x2 => count: 3           +-----+-----+-----+-----+
    size: 4x4 => count: 1           |     |     |     |     |
3x1 shapes:                         |     |     |     |     |
    size: 3x1 => count: 8           +-----+-----+-----+-----+
    size: 3x2 => count: 6           |     |     |     |     |
    size: 3x3 => count: 4           |     |     |     |     |
    size: 3x4 => count: 2           +-----------------------+
1x3 shapes:
    size: 1x3 => count: 8               Total Count = 136
    size: 2x3 => count: 6
    size: 3x3 => count: 4
    size: 4x3 => count: 2
2x2 shapes:
    size: 2x2 => count: 9
    size: 2x4 => count: 3
    size: 4x2 => count: 3
    size: 4x4 => count: 1
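
The 4x4 check is easy to repeat for any window size. A small sketch follows; `count_features` and its `frame_size` parameter are my own naming, not something from the paper.

```python
def count_features(frame_size, shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count features by looping over sizes first, then positions."""
    count = 0
    for base_w, base_h in shapes:
        for width in range(base_w, frame_size + 1, base_w):
            for height in range(base_h, frame_size + 1, base_h):
                # Top-left corners for which the feature stays in the frame.
                count += (frame_size - width + 1) * (frame_size - height + 1)
    return count

print(count_features(4))    # 136
print(count_features(24))   # 162336
```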

There is still some confusion in Viola and Jones' papers.

In their CVPR'01 paper it is clearly stated that

"More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature".

In the IJCV'04 paper, exactly the same thing is said. So altogether, 4 features. But strangely enough, they stated this time that the exhaustive feature set is 45396! That does not seem to be the final version. Here I guess that some additional constraints were introduced, such as min_width, min_height, width/height ratio, and even position.

Note that both papers are downloadable on his webpage.

Quite good observation, but they might implicitly zero-pad the 24x24 frame, or "overflow" and start using first pixels when it gets out of bounds, as in rotational shifts, or as Breton said they might consider some features as "trivial features" and then discard them with the AdaBoost.
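
The wrap-around idea can be checked by counting. If features are allowed to wrap past the right and bottom edges, every top-left position is valid for every size, so the count becomes positions times sizes. This sketch covers only that one hypothesis, under my own reading of it; the amount of zero-padding is not specified anywhere, so that variant is left out.

```python
FRAME = 24
SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]

# Sizes are still positive multiples of the base shape, up to the frame size.
sizes = sum((FRAME // sx) * (FRAME // sy) for sx, sy in SHAPES)

# With wrap-around, all FRAME * FRAME top-left positions are allowed.
wrapped_total = FRAME * FRAME * sizes

print(sizes)           # 1104
print(wrapped_total)   # 635904
```

Under this reading the count lands far above 180,000, so wrap-around on its own does not seem to explain the paper's figure.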

In addition, I wrote Python and Matlab versions of your code so I could test it myself (easier for me to debug and follow), and I am posting them here in case anyone finds them useful.

Python:

frameSize = 24
features = 5
# All five feature types:
feature = [[2,1], [1,2], [3,1], [1,3], [2,2]]

count = 0
# Each feature:
for i in range(features):
    sizeX = feature[i][0]
    sizeY = feature[i][1]
    # Each position:
    for x in range(frameSize-sizeX+1):
        for y in range(frameSize-sizeY+1):
            # Each size fitting within the frameSize:
            for width in range(sizeX,frameSize-x+1,sizeX):
                for height in range(sizeY,frameSize-y+1,sizeY):
                    count=count+1
print(count)

Matlab:

frameSize = 24;
features = 5;
% All five feature types:
feature = [[2,1]; [1,2]; [3,1]; [1,3]; [2,2]];

count = 0;
% Each feature:
for ii = 1:features
    sizeX = feature(ii,1);
    sizeY = feature(ii,2);
    % Each position:
    for x = 0:frameSize-sizeX
        for y = 0:frameSize-sizeY
            % Each size fitting within the frameSize:
            for width = sizeX:sizeX:frameSize-x
                for height = sizeY:sizeY:frameSize-y
                    count=count+1;
                end
            end
        end
    end
end

display(count)

In their original 2001 paper they only state that they used three kinds of features:

we use three kinds of features

with two, three and four rectangles respectively.

Since each kind has two orientations (that differ by 90 degrees), perhaps for the computation of the total number of features they used 2*3 types of features: 2 two-rectangle features, 2 three-rectangle features and 2 four-rectangle features. With this assumption there are indeed over 180,000 features:

feature_types = [(1,2), (2,1), (1,3), (3,1), (2,2), (2,2)]
window_size = (24,24)

total_features = 0
for f_type in feature_types:
    for f_height in range(f_type[0], window_size[0] + 1, f_type[0]):
        for f_width in range(f_type[1], window_size[1] + 1, f_type[1]):
            total_features += (window_size[0] - f_height + 1) * (window_size[1] - f_width + 1)

print(total_features)
# 183072

The second four-rectangle feature differs from the first only by a sign, so there is no need to keep it. If we drop it, the total number of features reduces to 162,336.
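
The sign argument can be illustrated on a random patch. The sketch below evaluates a four-rectangle feature with both polarities; the function name, the random patch, and the polarity convention are assumptions made for illustration, not the paper's definitions.

```python
import random

def four_rect_value(img, x, y, width, height, polarity=1):
    """Sum of one diagonal pair of quadrants minus the other pair."""
    half_w, half_h = width // 2, height // 2
    value = 0
    for row in range(y, y + height):
        for col in range(x, x + width):
            top = row < y + half_h
            left = col < x + half_w
            # Top-left and bottom-right quadrants count positive.
            sign = 1 if top == left else -1
            value += sign * img[row][col]
    return polarity * value

random.seed(0)
img = [[random.randint(0, 255) for _ in range(24)] for _ in range(24)]

a = four_rect_value(img, 3, 5, 8, 6, polarity=1)
b = four_rect_value(img, 3, 5, 8, 6, polarity=-1)
print(a == -b)   # True
```

As long as the classifier built on top of the feature can learn the sign of its threshold comparison, the flipped variant carries no extra information.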