ansaurus

Question

Viola-Jones' face detection claims 180k features

Answer 1

+1 A:

I haven't read the whole paper, but the wording of your quote sticks out to me:

Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.

"The set of rectangle features is overcomplete" "Exhaustive set"

It sounds to me like a setup, where I expect the paper's author to follow up with an explanation of how they cull the search space down to a more effective set by, for example, getting rid of trivial cases such as rectangles with zero surface area.

edit: or using some kind of machine learning algorithm, as the abstract hints at. Exhaustive set implies all possibilities, not just "reasonable" ones.

Breton 2009-11-10 12:50:09

I should include the footnote after "overcomplete": "A complete basis has no linear dependence between basis elements and has the same number of elements as the image space, in this case 576. The full set of 180,000 thousand features is many times over-complete." They do not explicitly get rid of classifiers with no surface, they use AdaBoost to determine that "a very small number of these features can be combined to form an effective classifier". Ok, so the zero-surface features will be dropped immediately, but why consider them in the first place?

Paul Lammertsma 2009-11-10 12:57:06

Well it sounds like the reasoning of someone really into set theory.

Breton 2009-11-10 12:59:58

I agree, the exhaustive set would imply all possibilities. But consider that if you take 1 to 24 for *x* and width <= x, the feature will extend 1 pixel outside of the subframe!

Paul Lammertsma 2009-11-10 13:00:45

Are you sure your code isn't riddled with "off by one" bugs? I just had a closer look, and you sure do have a funny way of writing a for loop.

Breton 2009-11-10 13:03:40

I should qualify that: I just thought it over a bit, and if you have a rectangle that is 1 pixel tall, 2 pixels tall, 3 pixels tall, all the way to 24 pixels tall, you have 24 kinds of rectangle, all of which fit into a 24-pixel-high subframe. What overhangs?

Breton 2009-11-10 13:16:05

You're right; the for-loops were sloppy. I had confused the dimensions with the location of the feature. I've edited it in the OP. You're also right about the overhang: there is none. The only way I can replicate 180k+ is by setting the for-loops for the width and height to begin at 0.

Paul Lammertsma 2009-11-10 13:23:38

Paul Lammertsma 2009-11-10 13:25:50

Well, mostly, except that, as nikie points out above, you start your x/y coordinates at 1 instead of 0, which may account for your discrepancy.

Breton 2009-11-10 13:35:16

Answer 2

A:

There is no guarantee that any author of any paper is correct in all their assumptions and findings. If you think that assumption #4 is valid, then keep that assumption, and try out your theory. You may be more successful than the original authors.

Michael Dillon 2009-11-10 13:00:39

Experimentation shows that it seems to perform exactly the same. I believe AdaBoost simply drops those additional zero-surface features in the first cycle, but I haven't actually looked into this.

Paul Lammertsma 2009-11-10 13:03:09

Viola and Jones are very big names in computer vision. In fact, this particular paper is considered seminal. Everyone makes mistakes, but this particular algorithm has been proven to work very well.

Dima 2009-11-10 15:29:40

Definitely, and I don't doubt their method at all. It's efficient and works very well! The theory is sound, but I believe they might have mistakenly cropped their detector one pixel short and included needless zero-surface features. If not, I challenge you to demonstrate the 180k features!

Paul Lammertsma 2009-11-10 15:48:06

The fact is that everyone is human. Everyone makes mistakes. When a big name makes mistakes, they often lie hidden for generations because people are afraid to question the received wisdom. But true science follows the scientific method and does not worship anybody, no matter how big their name is. If it is science, then mere mortals can put in the effort, understand how it works and adapt it to their circumstances.

Michael Dillon 2009-11-10 16:18:22

We'll see; I've sent an e-mail to the author.

Paul Lammertsma 2009-11-10 16:43:30

Answer 3

A:

I am guessing your assumption 3 is incorrect. I doubt that features of type A, B, or D must have even widths (heights), or that features of type C must have the width be divisible by 3. I would think that you would divide the side in half (or by 3) using integer division, and you may have the halves (thirds) differ by a pixel, which would result in a slightly different feature.

Generally, if I were you I would consider taking a look at other people's implementations, like this one

Dima 2009-11-10 15:45:44

But then they would get 393576 features. I guess that's "over 180,000", but I don't think that's what they would have written then.

nikie 2009-11-10 15:49:53

I'm actually fairly certain of assumption #3. Bear in mind that they're inspecting the difference between the feature rectangles (shaded vs. unshaded). If you take a two-rectangle feature over 3 pixels, then to which rectangle does the middle pixel belong? The left? The right? Both? If both, you subtract its value from itself, and you effectively have the same as a 2-pixel version of the feature.

Paul Lammertsma 2009-11-10 16:01:25

I took a look at Ole Jensen's implementation in MATLAB, and discussed it with him. He has the same code as the above, but had mistakenly run to 23x23 pixels. Using my code or his, that sums up to 136,656.

Paul Lammertsma 2009-11-10 16:08:08

Answer 4

+10 A:

On closer inspection, your code looks correct to me, which makes one wonder whether the original authors had an off-by-one bug. I guess someone ought to look at how OpenCV implements it!

Nonetheless, one suggestion to make it easier to understand is to flip the order of the for loops by going over all sizes first, then looping over the possible locations given the size:

#include <stdio.h>
int main()
{
    int i, x, y, sizeX, sizeY, width, height, count, c;

    /* All five shape types */
    const int features = 5;
    const int feature[][2] = {{2,1}, {1,2}, {3,1}, {1,3}, {2,2}};
    const int frameSize = 24;

    count = 0;
    /* Each shape */
    for (i = 0; i < features; i++) {
        sizeX = feature[i][0];
        sizeY = feature[i][1];
        printf("%dx%d shapes:\n", sizeX, sizeY);

        /* each size (multiples of basic shapes) */
        for (width = sizeX; width <= frameSize; width+=sizeX) {
            for (height = sizeY; height <= frameSize; height+=sizeY) {
                printf("\tsize: %dx%d => ", width, height);
                c=count;

                /* each possible position given size */
                for (x = 0; x <= frameSize-width; x++) {
                    for (y = 0; y <= frameSize-height; y++) {
                        count++;
                    }
                }
                printf("count: %d\n", count-c);
            }
        }
    }
    printf("%d\n", count);

    return 0;
}

This gives the same result as the previous code: 162336.

To verify it, I tested the case of a 4x4 window and manually checked all cases (easy to count, since the 1x2/2x1 and 1x3/3x1 shapes are the same, only rotated by 90 degrees):

2x1 shapes:
        size: 2x1 => count: 12
        size: 2x2 => count: 9
        size: 2x3 => count: 6
        size: 2x4 => count: 3
        size: 4x1 => count: 4
        size: 4x2 => count: 3
        size: 4x3 => count: 2
        size: 4x4 => count: 1
1x2 shapes:
        size: 1x2 => count: 12             +-----------------------+
        size: 1x4 => count: 4              |     |     |     |     |
        size: 2x2 => count: 9              |     |     |     |     |
        size: 2x4 => count: 3              +-----+-----+-----+-----+
        size: 3x2 => count: 6              |     |     |     |     |
        size: 3x4 => count: 2              |     |     |     |     |
        size: 4x2 => count: 3              +-----+-----+-----+-----+
        size: 4x4 => count: 1              |     |     |     |     |
3x1 shapes:                                |     |     |     |     |
        size: 3x1 => count: 8              +-----+-----+-----+-----+
        size: 3x2 => count: 6              |     |     |     |     |
        size: 3x3 => count: 4              |     |     |     |     |
        size: 3x4 => count: 2              +-----------------------+
1x3 shapes:
        size: 1x3 => count: 8                  Total Count = 136
        size: 2x3 => count: 6
        size: 3x3 => count: 4
        size: 4x3 => count: 2
2x2 shapes:
        size: 2x2 => count: 9
        size: 2x4 => count: 3
        size: 4x2 => count: 3
        size: 4x4 => count: 1

Amro 2009-11-10 21:02:39
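The per-size counts in the listing above factor into independent horizontal and vertical terms, which gives a quick hand check. The notation below ($n$ for the window size, $A(s)$ for the placements along one axis of a side that grows in steps of $s$) is introduced here for illustration and does not come from the paper.

```latex
A(s) = \sum_{k=1}^{\lfloor n/s \rfloor} (n - ks + 1),
\qquad
N = \sum_{(s_x, s_y)} A(s_x)\, A(s_y),
\quad (s_x, s_y) \in \{(2,1), (1,2), (3,1), (1,3), (2,2)\}

% n = 4:  A(1) = 10,  A(2) = 4,  A(3) = 2
N = 2 \cdot 4 \cdot 10 + 2 \cdot 2 \cdot 10 + 4 \cdot 4 = 136

% n = 24: A(1) = 300, A(2) = 144, A(3) = 92
N = 2 \cdot 144 \cdot 300 + 2 \cdot 92 \cdot 300 + 144^2 = 162336
```

Both totals agree with the program output: 136 for the 4x4 window and 162336 for 24x24.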

Convincing. So convincing that I'm fairly sure that we're right. I've sent an e-mail to the author to see if I've made some fundamental mistake in my reasoning. We'll see if a guy that busy has time to respond.

Paul Lammertsma 2009-11-10 22:07:07

Keep in mind that this thing has been out for a couple of years now, and many improvements have been made since then.

Amro 2009-11-10 22:32:20

The original paper where the 180k was stated comes from the proceedings for the 2001 Conference on Computer Vision and Pattern Recognition. A revised paper, accepted in 2003 and published in the International Journal of Computer Vision in 2004, states on p. 139 (end of section 2): "the exhaustive set of rectangles is quite large, 160,000". Looks like we were right!

Paul Lammertsma 2009-11-17 11:16:25

Great, thanks for the update. For those interested, I found a link to the IJCV'04 paper: http://lear.inrialpes.fr/people/triggs/student/vj/viola-ijcv04.pdf

Amro 2009-11-17 16:53:02

Yes, that's it. 160k, not 180k.

Paul Lammertsma 2009-11-20 14:56:02

Answer 5

A:

Hey, can someone tell me where I can get the source for it?

venkat 2010-06-24 05:02:24

No. And please do not bump questions like this.

BoltClock 2010-06-24 05:05:45

Since it's your first day here: it can be found in [OpenCV](http://opencv.willowgarage.com/wiki/).

Paul Lammertsma 2010-06-25 19:21:04

Answer 6

+2 A:

Hi, all. There is still some confusion in Viola and Jones' papers.

In their CVPR'01 paper it is clearly stated that

"More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature".

In the IJCV'04 paper, exactly the same thing is said. So altogether, 4 features. But strangely enough, they stated this time that the exhaustive feature set is 45396! That does not seem to be the final version. Here I guess that some additional constraints were introduced, such as min_width, min_height, width/height ratio, and even position.

Note that both papers are downloadable on his webpage.

Laoma from Singapore 2010-07-21 12:42:54
