I'm tasked with writing a web pseudo-crawler to calculate certain statistics. I need to measure the percentage of HTML files that start with <!DOCTYPE
versus those that do not, and compare this statistic between sites on different subjects. The idea is to search Google for different terms (like "Automobile", "Stock exchange", "Liposuction"...) and request the first 300 or so pages found for each term.
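To make the measurement concrete, here is a minimal sketch of the check I have in mind, written in Python purely for illustration. The function names (`starts_with_doctype`, `doctype_percentage`, `fetch_head`) are hypothetical, and I am assuming that a UTF-8 byte-order mark and leading whitespace before the declaration should be ignored and that the match is case-insensitive. It does not cover the Google search step or any rate limiting.

```python
import urllib.request


def starts_with_doctype(head: bytes) -> bool:
    """Return True if the document begins with a <!DOCTYPE declaration."""
    # Assumption: skip a UTF-8 byte-order mark and leading ASCII whitespace.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    # Assumption: the declaration is matched case-insensitively.
    return head.lstrip().lower().startswith(b"<!doctype")


def doctype_percentage(heads) -> float:
    """Percentage of documents (given as leading bytes) that start with <!DOCTYPE."""
    heads = list(heads)
    if not heads:
        return 0.0
    hits = sum(1 for h in heads if starts_with_doctype(h))
    return 100.0 * hits / len(heads)


def fetch_head(url: str, nbytes: int = 1024, timeout: float = 10.0) -> bytes:
    """Download only the first nbytes of a page; enough to see the declaration."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read(nbytes)
```

Given a list of result URLs for one search term, the statistic for that term would then be something like `doctype_percentage(fetch_head(u) for u in urls)`. Reading only the first kilobyte of each page should keep the fetching step cheap.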
I want the process to be very fast, yet I do not want to be banned by Google. I also want to minimize development time where possible; perhaps a quick-and-dirty Perl script would do.
Is there any ready-made solution that I can and should reuse? Searching Google I did not find anything suitable, because what I want to measure is not part of the HTML markup itself, yet it resides in HTML files.