views: 15

answers: 2

I'm writing a simple crawler and, ideally, to save bandwidth I'd like to download only the text and links on the page. Can I do that using HTTP headers? I'm confused about how they work.

+2  A: 

You're on the right track to solving the problem.

I'm not sure how much you already know about HTTP headers, but basically an HTTP header is just a string formatted for a web server; it follows a protocol and is pretty straightforward in that respect. You write a request and receive a response. The requests look like what you see in the Firefox plugin LiveHTTPHeaders at https://addons.mozilla.org/en-US/firefox/addon/3829/.
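
For illustration, a request that asks only for HTML might look like this (the path and host are just placeholders):

    GET /page.html HTTP/1.1
    Host: example.com
    Accept: text/html
    Connection: close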

I wrote a small post at my site http://blog.gnucom.cc/2010/write-http-request-to-web-server-with-php/ that shows you how to write a request to a web server and then read the response. If you only accept text/html, you'll only receive a subset of what is available on the web (so yes, it will "optimize" your script to an extent). Note that this example is really low level; if you're going to write a spider, you may want to use an existing library like cURL or whatever other tools your implementation language offers.
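
Since you mention cURL, here is a minimal sketch in PHP of a single request that sends an Accept: text/html header. The URL is a placeholder and the options shown are just one reasonable configuration, not the only one:

    <?php
    // Minimal sketch: fetch one page with cURL and advertise that we only want HTML.
    // http://example.com/ is a placeholder URL.
    $ch = curl_init('http://example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);                    // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);                    // follow redirects
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: text/html'));  // request HTML only

    $body = curl_exec($ch);
    curl_close($ch);

    // $body can now be parsed for text and links.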

gnucom
Yes, I'm using multi-curl to fetch pages, so are you sure that sending Accept: text/html will make the server ignore all other media types?
gAMBOOKa
Absolutely. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
gnucom
+1  A: 

Yes, by using Accept: text/html you should only get HTML as a valid response. That's at least how it ought to be.

But in practice there is a huge difference between the standards and the actual implementations. And proper content negotiation (that’s what Accept is for) is one of the things that are barely supported.
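
So in practice it is safer to check what the server actually sent back rather than rely on Accept. A rough sketch, assuming PHP and cURL as in the question, where $ch is a handle that has already been executed:

    // Servers may ignore Accept, so verify the response Content-Type yourself.
    // $ch is assumed to be an already-executed cURL handle.
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);   // e.g. "text/html; charset=UTF-8", or NULL/false
    if (!is_string($type) || stripos($type, 'text/html') !== 0) {
        // not HTML (or unknown), so skip this response instead of parsing it
    }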

Gumbo