Basically, after building the DoG pyramid he detects local extrema in those images. He then discards some of the detected extrema because they are likely to be unstable. Identifying those unstable keypoints/features is done in two steps:
- rejecting points that have low contrast
- rejecting points that are poorly localized along an edge (that is, they have a strong response in one direction only, so their position along the edge is poorly determined)
To be able to do these steps, you first need to find the true (interpolated) location of each extremum by fitting a Taylor series expansion of the DoG function around it. This gives you the information you need for those two steps.
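To make the two rejection tests more concrete, here is a minimal sketch in plain Python. It is not taken from any particular implementation: the function names are made up for illustration, `dog` is assumed to be a single DoG image stored as a list of rows, and `refined_value` is assumed to be the DoG value at the interpolated extremum that comes out of the Taylor fit. The defaults `threshold=0.03` (for pixel values scaled to [0, 1]) and `r=10` are the values I remember from Lowe's 2004 paper, so treat them as assumptions to tune.

```python
def hessian_2d(dog, y, x):
    """Finite-difference 2x2 spatial Hessian of one DoG image at (y, x)."""
    dxx = dog[y][x + 1] - 2.0 * dog[y][x] + dog[y][x - 1]
    dyy = dog[y + 1][x] - 2.0 * dog[y][x] + dog[y - 1][x]
    dxy = (dog[y + 1][x + 1] - dog[y + 1][x - 1]
           - dog[y - 1][x + 1] + dog[y - 1][x - 1]) / 4.0
    return dxx, dyy, dxy


def is_low_contrast(refined_value, threshold=0.03):
    """Step 1: reject if the interpolated DoG value is too small in magnitude."""
    return abs(refined_value) < threshold


def is_edge_like(dog, y, x, r=10.0):
    """Step 2: reject if the ratio of principal curvatures exceeds r.

    Uses the trace/determinant test, so no eigenvalues are computed.
    """
    dxx, dyy, dxy = hessian_2d(dog, y, x)
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0.0:
        # Curvatures have different signs (or one is zero): not a stable extremum.
        return True
    return trace * trace / det >= (r + 1.0) ** 2 / r
```

A keypoint would be kept only if both `is_low_contrast(...)` and `is_edge_like(...)` return `False`.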
The final step is to build descriptors ...
I'm in the process of studying this algorithm as well, and I don't find it trivial to understand. Some details are not included in Lowe's paper, which makes it harder to follow. I haven't found many extra resources that explain the algorithm in more depth, but there are some open source implementations, so you could make use of those as well.
EDIT: more information :)
The paper you linked is his early work; you should get the newest version of the paper because there are some modifications. While searching for more resources I also read his patent, but it contains the old information too, so you shouldn't look there either.
So, my understanding of the scale-space extrema step is as follows. First, we need to build a Gaussian pyramid. The paper says that we need to build s+3 Gaussian images in each octave so that extrema detection covers a complete octave. Based on his experiments, Lowe concluded that s = 3 gives the best results. That means we have 6 Gaussian images in each octave, from which we get 5 DoG images. Note that all DoG images within an octave have the same resolution. Re-sampling is done only when moving to the next octave.
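The counting above can be written down as a small sketch. The function name `octave_layout` is made up for illustration, and the base blur `sigma0=1.6` and the factor k = 2^(1/s) between adjacent scales are what I remember from the 2004 paper rather than something stated above, so take them as assumptions.

```python
def octave_layout(s=3, sigma0=1.6):
    """Return the blur levels of one octave and which DoG layers can be searched."""
    k = 2.0 ** (1.0 / s)                       # scale factor between adjacent images
    sigmas = [sigma0 * k ** i for i in range(s + 3)]   # s+3 Gaussian images
    n_dog = len(sigmas) - 1                    # adjacent pairs are subtracted
    # Zero-based DoG indices that have a layer both above and below them.
    searchable = list(range(1, n_dog - 1))
    return sigmas, n_dog, searchable


sigmas, n_dog, searchable = octave_layout()
print(len(sigmas), n_dog, searchable)
```

For s = 3 this gives 6 Gaussian images, 5 DoG images, and the zero-based DoG layers 1, 2 and 3, that is, the second through fourth image.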
The next step is finding the local extrema. Lowe proposes to compare each sample with its 26 neighbors (8 in the same DoG image and 9 in each of the images directly above and below), which means we start the search at the second DoG image because that is the first one for which a full 26-neighborhood exists. Similarly, we stop the search at the fourth image. This process is repeated for each octave individually. For each extremum found you should save at least its location and its scale. Once the extrema are found, the next step is the more accurate localization, which is done with the Taylor series.
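Here is how I would sketch the 26-neighborhood search for one octave in plain Python. It is only an illustration of my understanding, not code from any real implementation: `dogs` is assumed to be a list of same-sized DoG images (each a list of rows), the function names are made up, and I use strict comparisons so that flat regions are not reported.

```python
def is_extremum(dogs, i, y, x):
    """True if dogs[i][y][x] is strictly larger or smaller than all 26 neighbors."""
    center = dogs[i][y][x]
    neighbors = [
        dogs[i + di][y + dy][x + dx]
        for di in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (di, dy, dx) != (0, 0, 0)
    ]
    return center > max(neighbors) or center < min(neighbors)


def find_extrema(dogs):
    """Scan the interior DoG layers of one octave, skipping the image border."""
    height, width = len(dogs[0]), len(dogs[0][0])
    found = []
    for i in range(1, len(dogs) - 1):          # second through second-to-last layer
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                if is_extremum(dogs, i, y, x):
                    found.append((i, y, x))    # layer index stands in for scale
    return found
```

You would call `find_extrema` once per octave and then pass each `(i, y, x)` candidate on to the localization step.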
This is my understanding of how this step works, and I hope I'm not too far from the truth :)
Hope this helped a little bit more.