I'm building a small application that will crawl sites where the content is growing (like on Stack Overflow); the difference is that the content, once created, is rarely modified.

Now, in the first pass, I crawl all the pages on the site.

But on subsequent passes I don't want to re-crawl all of the site's paged content, just the latest additions.

So if the site has 500 pages on the first pass and 501 on the second, I would only crawl the first and second pages. Would this be a good way to handle the situation?

In the end, the crawled content will end up in Lucene, forming a custom search engine.

So I would like to avoid crawling the same content multiple times. Any better ideas?

EDIT:

Let's say the site has a page, Results, that is accessed like so:

Results?page=1, Results?page=2, etc.

I guess that keeping track of how many pages there were at the last crawl and crawling just the difference would be enough (maybe using a hash of each result on the page: if I start running into hashes I've already seen, I should stop).
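For example, a minimal sketch of that stopping rule (Java 17+; fetchPage and extractResults are hypothetical, site-specific helpers, and seenHashes would be persisted between crawls):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HexFormat;
    import java.util.List;
    import java.util.Set;

    public class IncrementalCrawler {
        // Hashes of every result seen so far; persist between crawls (file, DB, ...).
        private final Set<String> seenHashes;

        public IncrementalCrawler(Set<String> seenHashes) {
            this.seenHashes = seenHashes;
        }

        // Crawl Results?page=1, 2, ... and stop once a page holds only known results.
        public void crawlNewResults(String baseUrl) throws Exception {
            for (int page = 1; ; page++) {
                String html = fetchPage(baseUrl + "?page=" + page); // hypothetical helper
                List<String> results = extractResults(html);        // hypothetical helper
                if (results.isEmpty()) break;   // ran past the last page

                boolean anyNew = false;
                for (String result : results) {
                    if (seenHashes.add(sha256(result))) { // add() returns false if known
                        anyNew = true;
                        // hand the new result off to the Lucene indexer here
                    }
                }
                if (!anyNew) break; // every result on this page already seen: caught up
            }
        }

        private static String sha256(String s) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        }

        private String fetchPage(String url) throws Exception { return ""; }    // site-specific
        private List<String> extractResults(String html) { return List.of(); }  // site-specific
    }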

+3  A: 

If each piece of content is at a unique location, just feed those locations (probably URLs) into a hash field and check each one before "crawling" the content. The URL should probably be part of your stored data in Lucene anyway, so this should be easy to accomplish by searching before adding to the index.
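For example, a sketch against a recent Lucene API (the field names "url" and "content" are my own choices): store the URL as an untokenized StringField so an exact TermQuery can find it before you add a document.

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class UrlDeduplicator {
        // True if a document with this exact URL is already in the index.
        static boolean alreadyIndexed(IndexSearcher searcher, String url) throws IOException {
            return searcher.count(new TermQuery(new Term("url", url))) > 0;
        }

        static void indexIfNew(IndexWriter writer, IndexSearcher searcher,
                               String url, String content) throws IOException {
            if (alreadyIndexed(searcher, url)) {
                return; // already crawled; skip
            }
            Document doc = new Document();
            // StringField is indexed untokenized, so the TermQuery above matches exactly
            doc.add(new StringField("url", url, Field.Store.YES));
            doc.add(new TextField("content", content, Field.Store.YES));
            writer.addDocument(doc);
        }
    }

Alternatively, IndexWriter.updateDocument(new Term("url", url), doc) deletes any existing document with that term and adds the new one in a single call, which avoids the lookup entirely.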

dlamblin
Damn, you're right. I just realized that each result on the page must have a unique URL. Thanks :)
sirrocco
+1  A: 

My approach would be to store a hash/fingerprint of the content of each page seen. That way, when you refetch a page, you validate the fingerprint; if it matches, nothing has changed and no parsing is needed, since you have already processed the page and all the links on it.
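A minimal sketch of that idea (again SHA-256 via Java 17's HexFormat; the url-to-fingerprint map would be persisted between runs):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.HexFormat;
    import java.util.Map;

    public class PageFingerprints {
        // url -> fingerprint of the last version we parsed; persist between runs
        private final Map<String, String> fingerprints = new HashMap<>();

        // Returns true if the page content differs from what we last parsed.
        public boolean hasChanged(String url, String pageContent) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(pageContent.getBytes(StandardCharsets.UTF_8));
            String fingerprint = HexFormat.of().formatHex(digest);
            String previous = fingerprints.put(url, fingerprint);
            return !fingerprint.equals(previous); // null previous => new page => changed
        }
    }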

lexu
But actually fetching the page is probably slower than parsing it and adding it to the index. And you'd need to fetch it to hash it.
dlamblin
That's the conflict: unless you fetch it (or ask the server outright if it changed), you don't really know whether it changed. How does the OP know the page count has changed? Knowing the page name is not the same as knowing its content. At least, I understood the page was similar to SO, where pages DO change.
lexu
+1  A: 

Does the site issue effective ETags for each resource being fetched? If so, you could issue conditional GETs for known resources, and in the case that the server sends the resource back (i.e. it has changed) you could look for new links to crawl, update the content, etc.

Of course, this only works if the site issues ETags and responds to conditional GETs...
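A sketch of such a conditional GET with Java 11's java.net.http.HttpClient; the in-memory ETag cache here is an assumption, and in practice you would persist it between crawls:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    public class ConditionalFetcher {
        private final HttpClient client = HttpClient.newHttpClient();
        private final Map<String, String> etags = new HashMap<>(); // url -> last ETag

        // Returns the new body if the resource changed, or empty on 304 Not Modified.
        public Optional<String> fetchIfChanged(String url) throws Exception {
            HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url));
            String etag = etags.get(url);
            if (etag != null) {
                builder.header("If-None-Match", etag); // makes the GET conditional
            }
            HttpResponse<String> response =
                    client.send(builder.build(), HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 304) {
                return Optional.empty(); // unchanged, nothing to re-parse
            }
            response.headers().firstValue("ETag").ifPresent(e -> etags.put(url, e));
            return Optional.of(response.body());
        }
    }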

Jacob O'Reilly
A: 
  1. Do a standard site-wide crawl of the website to get all the historical content.
  2. Track the site's RSS feed to find new content (see the sketch below).
  3. Repeat the site-wide crawl periodically to pick up updated content.
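A minimal sketch of step 2, using only the JDK's built-in XML parser: it pulls the <link> out of every <item> in the feed, and you would then diff those links against the URLs already in your index.

    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class RssWatcher {
        // Returns the <link> of every <item> in the feed (newest first on most feeds).
        public static List<String> itemLinks(String feedUrl) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(feedUrl); // parse(String) accepts a URI and fetches it
            NodeList items = doc.getElementsByTagName("item");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList linkNodes = item.getElementsByTagName("link");
                if (linkNodes.getLength() > 0) {
                    links.add(linkNodes.item(0).getTextContent());
                }
            }
            return links;
        }
    }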
Plumo