Hi all, I am very new to image processing and pattern recognition. I am trying to implement the SIFT algorithm. I can build the DoG pyramid and identify the local maxima and minima in each octave. What I don't understand is how to use these local maxima/minima once I have them for each octave. How do I combine these points?

My question may sound very trivial. I have read Lowe's paper, but could not really understand what he did after he built the DoG pyramid. Any help is appreciated.

Thank you

+1  A: 

Basically, after building the DoG pyramid he detects local extrema in those images. He then discards some of the detected extrema because they are likely to be unstable. These unstable keypoints/features are identified in two steps:

  1. rejecting points that have low contrast
  2. rejecting points that are poorly localized along an edge (that is, points with a strong edge response in one direction only)

To be able to do these steps, you first need the true (sub-pixel) location of each extremum, which you get by taking a Taylor series expansion of the DoG function around the detected sample point. That fit gives you the information needed for both rejection steps.
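For the two rejection steps, here is a minimal sketch in Python. It assumes the checks as given in Lowe's 2004 paper: the ratio-of-principal-curvatures test Tr(H)^2 / Det(H) < (r + 1)^2 / r with r = 10, and a contrast threshold of 0.03 for pixel values scaled to [0, 1]. The function names and the list-of-rows image layout are hypothetical choices for illustration, and the Taylor refinement itself is not shown.

```python
def passes_edge_test(dog, y, x, r=10.0):
    """Return True if the point at (y, x) does not look like an edge response.

    dog is a single DoG image stored as a list of rows (assumed layout);
    (y, x) must not lie on the image border.
    """
    centre = dog[y][x]
    # Second derivatives estimated by finite differences of neighbouring samples.
    dxx = dog[y][x + 1] + dog[y][x - 1] - 2.0 * centre
    dyy = dog[y + 1][x] + dog[y - 1][x] - 2.0 * centre
    dxy = (dog[y + 1][x + 1] - dog[y + 1][x - 1]
           - dog[y - 1][x + 1] + dog[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0.0:
        # Curvatures have different signs (or one is zero): reject the point.
        return False
    return trace * trace / det < (r + 1.0) ** 2 / r


def passes_contrast_test(refined_value, threshold=0.03):
    """Return True if the interpolated DoG value is large enough in magnitude.

    Assumes pixel values were scaled to the range [0, 1].
    """
    return abs(refined_value) >= threshold
```

A blob-like point has similar curvature in both directions and passes the edge test, while a point on a ridge has a large curvature across the ridge and a small one along it, so it fails.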

The final step is to build the descriptors ...

I'm in the process of studying this algorithm as well, and I don't find it trivial to understand either. Some details are not included in Lowe's paper, which makes it harder to follow. I haven't found many extra resources that explain the algorithm in more depth, but there are some open source implementations, so you could make use of those.

EDIT: more information :)

The paper you linked is his early work; you should get the newest version of the paper because there are some modifications. While searching for more resources I also read his patent, but it contains the old information too, so you shouldn't rely on it either.

So, my understanding of this scale-space extrema step is as follows. First, we need to build a Gaussian pyramid. The paper says that, for extrema detection to cover a complete octave, we need to build s+3 Gaussian images in each octave. From his experiments Lowe concluded that s = 3 gives the best results. That means we have 6 Gaussian images in each octave, from which we get 5 DoG images. Note that all these DoG images have the same resolution. Re-sampling is done only when passing to the next octave.
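The image counts above can be checked with a small sketch. It assumes the blur schedule from Lowe's 2004 paper, where adjacent Gaussian images are separated by a constant factor k = 2^(1/s) and the base blur is sigma = 1.6. The function name `octave_sigmas` is made up for this example.

```python
def octave_sigmas(s=3, sigma0=1.6):
    """Return the blur level of each of the s + 3 Gaussian images in one octave.

    Assumes adjacent images differ by the constant factor k = 2 ** (1 / s).
    """
    k = 2.0 ** (1.0 / s)
    return [sigma0 * k ** i for i in range(s + 3)]


sigmas = octave_sigmas()
num_gaussian = len(sigmas)      # s + 3 Gaussian images
num_dog = num_gaussian - 1      # one DoG image per adjacent pair
```

With s = 3 this gives 6 Gaussian images and 5 DoG images, and the blur doubles after s steps, which is where the next octave starts.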

The next step is finding the local extrema. Lowe proposes to search within a 26-neighborhood (8 neighbors in the same image plus 9 in the image above and 9 in the image below), which means that we should start our search from the second DoG image, because that's the first image for which a 26-neighborhood exists. Similarly, we stop our search at the fourth image. This process is repeated for each octave individually. For each extremum found, you should save at least its location and its scale. Once the extrema are found, the next step is more accurate localization, which is done with the Taylor series.
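A minimal sketch of the 26-neighbor comparison, assuming the DoG images of one octave are stored as nested lists indexed as `dog_stack[layer][y][x]` (a layout chosen only for this example; `is_extremum` is a hypothetical name):

```python
def is_extremum(dog_stack, layer, y, x):
    """Return True if the sample is strictly larger or strictly smaller
    than all 26 neighbours in the 3x3x3 block around it.

    layer, y and x must not lie on the boundary of the stack.
    """
    value = dog_stack[layer][y][x]
    neighbours = [
        dog_stack[layer + dl][y + dy][x + dx]
        for dl in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dl, dy, dx) != (0, 0, 0)   # skip the sample itself
    ]
    return value > max(neighbours) or value < min(neighbours)
```

The strict comparison means a flat region produces no extrema; whether ties should count is an implementation choice this sketch does not settle.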

This is my understanding of how this step works, and I hope I'm not too far from the truth :)

Hope this helped a little bit more.

Adi
Thanks for the answer and yes, I agree that the paper is not so clear.
Ahmet Keskin
A: 

vlfeat is an open source library implementing several computer vision algorithms, including SIFT. You should be able to look at that source code to get a better idea of what's being done.

If you're properly finding the extrema in each octave, you can then:

  1. Perform a more detailed fit for the scale and location of each extremum
  2. Reject low-contrast points and edge responses

For each feature remaining at this point,

  1. Compute the dominant orientation within a window whose size is relative to the scale of the detected feature
  2. Build the SIFT descriptor representation (by accumulating gradients into a spatial 4x4 grid of orientation histograms). This is described in Section 6.1 of the paper.
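As a rough illustration of the dominant-orientation step, here is a simplified sketch. It assumes a 36-bin histogram of gradient directions weighted by gradient magnitude, as in the 2004 paper, but it leaves out the Gaussian weighting of the window, the interpolation of the peak, and the handling of secondary peaks. The function names and the list-of-rows image layout are hypothetical.

```python
import math


def orientation_histogram(image, cy, cx, radius, num_bins=36):
    """Accumulate gradient magnitudes into orientation bins around (cy, cx).

    image is a list of rows; in SIFT it would be the Gaussian-blurred image
    closest to the keypoint's scale (that selection is not shown here).
    """
    hist = [0.0] * num_bins
    height, width = len(image), len(image[0])
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            # Skip pixels where a central difference cannot be computed.
            if y < 1 or y >= height - 1 or x < 1 or x >= width - 1:
                continue
            dx = image[y][x + 1] - image[y][x - 1]
            dy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            bin_index = int(angle / (360.0 / num_bins)) % num_bins
            hist[bin_index] += magnitude
    return hist


def dominant_orientation(hist):
    """Return the centre angle (in degrees) of the strongest histogram bin."""
    num_bins = len(hist)
    peak = max(range(num_bins), key=lambda i: hist[i])
    return (peak + 0.5) * 360.0 / num_bins
```

The descriptor in step 2 is built from the same kind of gradient data, rotated relative to this orientation, so the result does not change when the image is rotated.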

I'm not sure how much help this has been, because I don't know where you're getting hung up.

Sancho
Hi, thanks for the answer. Where I got stuck is how to match the maxima or minima across different levels of the pyramid after building the DoG pyramid and finding the local maxima and minima in each octave. I just realized that in his paper (http://www.cs.ubc.ca/~lowe/papers/iccv99.pdf) Lowe first builds the pyramid by resampling the images using bilinear interpolation with a pixel spacing of 1.5. Then, after comparing a pixel with its 26 neighbors, he calculates the closest pixel location at the next lowest level of the pyramid, taking the 1.5 times resampling into account.
Ahmet Keskin
But Lowe does not mention anything about the keypoints in the lower levels of the pyramid, or how to take them into account, in the paper you supplied (which is newer, 2004). That's where I was stuck. Maybe the Taylor expansion does the trick, I am not sure...
Ahmet Keskin
A: 

Hi Ahmet. We have two pyramids: a Gaussian pyramid and a DoG pyramid. The Gaussian pyramid has 6 blurred images. The DoG images are the differences of adjacent Gaussian images, so there are 5 images in the DoG pyramid. For finding extrema you work only with the DoG images, not the Gaussian ones. Note that all of these belong to the first octave! Once the first octave is built, halve the image size and build new pyramids for the second octave.

Let's say your original image is 512x512. In the first octave all images are 512x512, but in the second octave all images are 256x256. Again you have 6 images in the Gaussian pyramid and 5 in the DoG pyramid, but all of them are 256x256 in the second octave. The third octave follows the same pattern.

Now for the matching of minima and maxima (you are in the first octave): let's say you are looking for maxima in the first octave. You must use the DoG pyramid and start from the 2nd image. You take a pixel and check whether it is a maximum. In this check you should use the 1st, 2nd and 3rd images of the DoG pyramid. When that is done, find the maxima in the 3rd image by considering the 2nd, 3rd and 4th images. Lastly, find the maxima in the 4th image by considering the 3rd, 4th and 5th images. Now finding the maxima in the first octave is completed; go to the next octave and repeat these steps.
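On the question of combining the points: a common approach in open source implementations is to pool the extrema of all octaves into one list and scale each location back to the coordinates of the original image. The sketch below assumes octave 0 is at the original resolution and each later octave is half the size of the previous one. The function names and the tuple layout are hypothetical.

```python
def searchable_layers(num_dog_images):
    """Return the 0-based DoG layers that have a layer above and below."""
    return list(range(1, num_dog_images - 1))


def collect_keypoints(extrema_per_octave):
    """Pool per-octave extrema into one list in original-image coordinates.

    extrema_per_octave[o] is a list of (layer, y, x) tuples found in octave o.
    Assumes octave o is downsampled by a factor of 2 ** o.
    """
    keypoints = []
    for octave, extrema in enumerate(extrema_per_octave):
        scale_factor = 2 ** octave
        for layer, y, x in extrema:
            # Keep octave and layer so the scale of the keypoint is not lost.
            keypoints.append((octave, layer, y * scale_factor, x * scale_factor))
    return keypoints
```

For 5 DoG images, `searchable_layers(5)` gives layers 1, 2 and 3, which are the 2nd, 3rd and 4th images described above. Extrema are never compared across octaves in this scheme; each octave is searched on its own and the results are simply concatenated.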

For more information contact me at ramazanpolat at gmail dot com, I am in Turkey too.

algorian