I'm tasked with writing a web pseudo-crawler to calculate certain statistics. I need to measure the percentage of HTML files that start with <!DOCTYPE versus those that do not, and compare this statistic between sites on different subjects. The idea is to search Google for different terms (like "Automobile", "Stock exchange", "Liposuction"...) and request the first 300 or so pages found.

I want the process to be very fast, yet I do not want to be banned by Google. I also want to minimize development time where possible; maybe some stupid Perl script would do.

Is there any ready-made solution that I can and should reuse? With Google I did not find anything suitable, because what I want to measure is not part of the HTML itself, yet it resides in HTML files.
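
To make it concrete, the counting itself is only a few lines once the pages are on disk. A rough sketch (Python rather than Perl, and the `pages` directory is just a placeholder for wherever the downloads for one search term end up):

```python
import pathlib

# Count how many downloaded files start with <!DOCTYPE.
# "pages" is a hypothetical directory holding the pages for one search term.
files = list(pathlib.Path("pages").rglob("*.htm*"))
with_doctype = sum(
    1
    for f in files
    if f.read_text(encoding="utf-8", errors="replace").lstrip().lower().startswith("<!doctype")
)
if files:
    print(f"{with_doctype}/{len(files)} files start with <!DOCTYPE "
          f"({100.0 * with_doctype / len(files):.1f}%)")
```

So the part I am really missing is fetching the result pages quickly without getting banned.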

+2  A: 

wget can do just about everything, including limiting your request rate.

John Paulett
+1 wget is awesome, I use it a lot. However, some people need a GUI :)
Sune Rievers
Ability to run headless is a bonus for me. In fact it was my original idea. Should I call wget from Perl in a loop over Google's result URLs and then run wget in a nested loop? I did not find how to set a quota for a single file in wget.
Muxecoid
I was thinking you could use `--wait=SECONDS` or `--random-wait`, possibly with the recursive flag, `-r`.
John Paulett
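
For example, putting those flags together and driving wget from a script (a sketch only; Python here instead of Perl, and `urls.txt` is a hypothetical file with one result URL per line):

```python
import subprocess

# Fetch every URL listed in urls.txt, pausing politely between requests.
subprocess.run(
    [
        "wget",
        "--wait=2",        # wait 2 seconds between retrievals
        "--random-wait",   # randomize the wait so the pattern looks less robotic
        "-i", "urls.txt",  # read the URLs to fetch from a file
        "-P", "pages",     # save the downloads under ./pages
    ],
    check=True,            # fail loudly if wget reports an error
)
```

The `--wait`/`--random-wait` pair is what keeps the request rate down.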
A: 

HTTrack is also pretty good and easy to use. Has a nice GUI and a lot of options.

Source is also available if you're looking for inspiration: here

Sune Rievers
wget is available for Windows: http://gnuwin32.sourceforge.net/packages/wget.htm
John Paulett
Nice, I thought it required Cygwin. Good to know, downloading now... :)
Sune Rievers
Edited my answer based on above comment from John Paulett ;)
Sune Rievers