I'm writing a crawler for Ruby, and I want to honour the headers that the server sends out in order to make the crawl more efficient. Is there a straightforward way in Ruby of determining whether a page needs to be re-downloaded by the client? I know I need to consider at least these headers:

  • Last-Modified
  • ETag
  • Cache-Control
  • Expires

What's the definitive way of determining this - is it specified anywhere?

A: 

You'll want to read about the head method in Net::HTTP -- http://www.ruby-doc.org/stdlib/
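
For example, a minimal sketch (the URL here is just a placeholder) that issues a HEAD request and prints whatever caching headers the server chose to send:

    require 'net/http'
    require 'uri'

    # Placeholder URL -- substitute the page you are crawling.
    uri = URI.parse('http://example.com/page.html')

    Net::HTTP.start(uri.host, uri.port) do |http|
      response = http.head(uri.request_uri)
      # Each of these will be nil if the server did not send that header.
      puts "Last-Modified: #{response['Last-Modified']}"
      puts "ETag:          #{response['ETag']}"
      puts "Cache-Control: #{response['Cache-Control']}"
      puts "Expires:       #{response['Expires']}"
    end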

glenn jackman
+1  A: 

You are right about the headers you will need to look at, but keep in mind that it is the server that sets them. If they are set correctly you can use them to make the decision, but none of them are required.

Personally, I would probably start by tracking the Expires value as I do the initial download, as well as logging the ETag. Then on the next pass I'd check Last-Modified, assuming the Expires or ETag values showed some sign that I might need to re-download (or if they aren't set at all). I wouldn't expect Cache-Control to be all that useful.
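
As a rough sketch of that approach (the stored values and URL below are made up; in a real crawler they would come from whatever you logged on the first fetch), you can send the saved values back as If-None-Match and If-Modified-Since and skip the download on a 304 Not Modified:

    require 'net/http'
    require 'uri'

    # Hypothetical values saved from the first crawl of this page.
    stored_etag          = '"686897696a7c876b7e"'
    stored_last_modified = 'Wed, 01 Sep 2010 12:00:00 GMT'

    uri = URI.parse('http://example.com/page.html')

    Net::HTTP.start(uri.host, uri.port) do |http|
      request = Net::HTTP::Get.new(uri.request_uri)
      # Ask the server to send the body only if the page has changed.
      request['If-None-Match']     = stored_etag          if stored_etag
      request['If-Modified-Since'] = stored_last_modified if stored_last_modified

      response = http.request(request)

      if response.is_a?(Net::HTTPNotModified)
        # 304 -- the cached copy is still fresh, no need to re-download.
        puts 'Not modified, skipping'
      else
        # 200 -- re-download and record the new ETag / Last-Modified.
        puts "Fetched #{response.body.length} bytes"
      end
    end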

danivovich