Writing a pseudo-crawler for web statistics.

+1 Q:

I'm tasked with writing a web pseudo-crawler to calculate certain statistics. I need to measure the percentage of HTML files that start with <!DOCTYPE against the number of HTML files that do not have it, and compare this statistic between sites on different subjects. The idea is to search Google for different terms (like "Automobile", "Stock exchange", "Liposuction"...) and request the first 300 or so pages found.

I want the process to be very fast, yet I do not want to be banned by Google. I also want to minimize development time where possible. Maybe some simple Perl script would do.

Is there any ready-made solution that I can and should reuse? Searching Google, I did not find anything suitable, because what I want to measure is not part of the HTML markup itself, yet it resides in HTML files.
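
For reference, the measurement itself is small once the pages are on disk or in memory. Below is a minimal sketch in Python (not the Perl mentioned above), assuming each page is already available as a decoded string. The names `starts_with_doctype` and `doctype_percentage` are made up for illustration, and treating a leading byte-order mark or whitespace as acceptable before the declaration is my assumption about what "starts with" should mean.

```python
import re

# Optional BOM and leading whitespace, then "<!DOCTYPE" in any letter case.
# Allowing the BOM/whitespace prefix is an assumption, not a requirement
# stated in the question.
_DOCTYPE_RE = re.compile(r"^\ufeff?\s*<!doctype\b", re.IGNORECASE)


def starts_with_doctype(html: str) -> bool:
    """Return True if the document text begins with a DOCTYPE declaration."""
    return _DOCTYPE_RE.match(html) is not None


def doctype_percentage(documents) -> float:
    """Return the percentage of documents that begin with a DOCTYPE.

    An empty collection yields 0.0 rather than raising.
    """
    docs = list(documents)
    if not docs:
        return 0.0
    hits = sum(1 for doc in docs if starts_with_doctype(doc))
    return 100.0 * hits / len(docs)
```

Running `doctype_percentage` once per search term would give one figure per subject, which can then be compared across subjects.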

+2 A:

wget can do just about everything, including limiting your request rate.

John Paulett 2009-12-06 15:42:32

+1 wget is awesome, I use it a lot. However, some people need a GUI :)

Sune Rievers 2009-12-06 15:46:23

The ability to run headless is a bonus for me. In fact, it was my original idea. Should I call wget from Perl in a loop with Google's URL and then run wget in a nested loop? I did not find how to set a quota for a single file in wget.

Muxecoid 2009-12-06 16:03:59

I was thinking you could use `--wait=SECONDS` or `--random-wait`, possibly with the recursive flag, `-r`.

John Paulett 2009-12-06 16:12:57
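
As a rough illustration of the flags mentioned above, here is a minimal sketch that calls wget from a script with `--wait` and `--random-wait`. It is written in Python rather than Perl, and it reads URLs from a list with `--input-file` instead of recursing with `-r`; both choices are assumptions for the example. The function names, `urls.txt`, and the `pages` directory are hypothetical.

```python
import subprocess


def build_wget_command(url_file, out_dir, wait_seconds=2):
    """Build a wget argument list that throttles requests.

    --wait sets the base delay between retrievals, and --random-wait
    varies that delay so the requests are less regular.
    """
    return [
        "wget",
        f"--wait={wait_seconds}",
        "--random-wait",
        f"--input-file={url_file}",      # one URL per line
        f"--directory-prefix={out_dir}",  # where downloaded files are saved
    ]


def fetch(url_file, out_dir, wait_seconds=2):
    """Run wget and return its exit code (requires wget on the PATH)."""
    cmd = build_wget_command(url_file, out_dir, wait_seconds)
    return subprocess.run(cmd, check=False).returncode
```

For example, `fetch("urls.txt", "pages", wait_seconds=3)` would download every URL listed in `urls.txt` into `pages`, pausing between requests.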

HTTrack is also pretty good and easy to use. Has a nice GUI and a lot of options.

Source is also available if you're looking for inspiration: here

Sune Rievers 2009-12-06 15:44:11

wget is available for Windows: http://gnuwin32.sourceforge.net/packages/wget.htm

John Paulett 2009-12-06 15:51:00

Nice, I thought it required Cygwin. Nice to know, downloading now... :)

Sune Rievers 2009-12-06 15:54:38

Edited my answer based on above comment from John Paulett ;)

Sune Rievers 2009-12-06 15:56:49
