I am working on a browser plug-in for Firefox, and I would like to be able to do some automated testing to make sure that it's handling a variety of different HTML/JavaScript features correctly. Does anyone know of a good downloadable corpus of HTML and/or JavaScript pages that could be used for this type of testing?
I don't think so -- as far as I can tell, the Acid tests focus on standards compliance, especially w.r.t. DOM and JavaScript. I'd like more realistic pages that aren't completely compliant, have some other types of JavaScript features, etc.
Alex Jordan
2010-06-14 15:41:00
A:
This ECMAScript 5 test suite tests (almost?) all JavaScript features of the current standard. Only browser-specific features are not tested.
Marcel Korpel
2010-06-20 01:55:59
+1
A:
I don't know of a packaged-up, ready-to-go corpus of HTML/JavaScript documents (although it looks like some other SO people do). If I were in your situation, I'd build my own corpus: you'll know it's current and you'll know exactly what you're dealing with.
To build your own, you can snag one of the open source crawlers, or simply use wget recursively:
    wget -t 7 -w 5 --waitretry=14 --random-wait -l 2 -m -k -K -e robots=off http://stackoverflow.com -o ./myLog.log
Want to extend the above? Script up something that grabs a top-N list of sites from Google, and feed those URLs into the above wget command.
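As a rough illustration of that scripting idea, here is a minimal Python sketch that builds one wget command per site. The flags are copied from the command above; the `SITES` list, the `build_wget_command` helper, and the per-host log file names are assumptions made up for this example, and fetching the list of sites from a search engine is left out. `-e robots=off` is opt-in here, so the crawl respects robots.txt unless you ask otherwise.

```python
import shlex
from urllib.parse import urlparse

# Flags copied from the wget command above (robots=off handled separately).
BASE_FLAGS = ["-t", "7", "-w", "5", "--waitretry=14", "--random-wait",
              "-l", "2", "-m", "-k", "-K"]

# Hypothetical seed list; replace it with whatever top-N list you collect.
SITES = ["http://stackoverflow.com", "http://example.com"]


def build_wget_command(url, ignore_robots=False):
    """Return the wget argument list for mirroring a single site."""
    # Derive a log file name from the host so each site gets its own log.
    host = (urlparse(url).netloc or "site").replace(":", "_")
    cmd = ["wget", *BASE_FLAGS]
    if ignore_robots:
        cmd += ["-e", "robots=off"]
    cmd += ["-o", f"./{host}.log", url]
    return cmd


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for site in SITES:
        print(shlex.join(build_wget_command(site)))
```

To actually run the crawl, pass each argument list to `subprocess.run(cmd, check=False)` instead of printing it.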
labratmatt
2010-06-25 17:34:32
Do you know how to stop `wget` from downloading large files (ZIP, ISO, etc. linked on pages)? I tried `wget` once, but ended up sucking down a lot of non-HTML junk. Also, you shouldn't suggest `robots=off` for general crawling. That's not good netizenship.
porneL
2010-06-25 20:03:13
@pornel - A: I agree, robots=off is a bad idea for general crawling, but in single instances like the above, I don't see an issue. B: It seems that you might be able to add an option to wget to look at Content-Length in the header (if the server includes it in the response). I don't believe wget currently has this implemented, but I don't know a heck of a whole lot about wget. Anyone have any details on this?
labratmatt
2010-06-26 00:17:30
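The Content-Length idea from the comment above can be tried outside wget. The following is a minimal Python sketch, not a wget feature: it issues a HEAD request and decides from the response headers whether a URL looks like a reasonably small HTML or JavaScript resource. The names `should_download`, `fetch_headers`, the `MAX_BYTES` cap, and the list of accepted content types are all assumptions chosen for illustration.

```python
from urllib.request import Request, urlopen

MAX_BYTES = 2 * 1024 * 1024  # hypothetical size cap (2 MiB)

# Content types treated as worth keeping for an HTML/JavaScript corpus.
WANTED_TYPES = ("text/html", "application/xhtml+xml",
                "application/javascript", "text/javascript")


def should_download(headers, max_bytes=MAX_BYTES):
    """Decide from response headers whether a resource is worth fetching."""
    # Header names are case-insensitive, so normalize the keys first.
    h = {k.lower(): v for k, v in headers.items()}
    content_type = h.get("content-type", "")
    if content_type and not content_type.lower().startswith(WANTED_TYPES):
        return False
    length = h.get("content-length")
    if length is None:
        return True  # the server did not report a size; allow the download
    try:
        return int(length) <= max_bytes
    except ValueError:
        return True  # malformed header; do not reject on that alone


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

A crawler could call `should_download(fetch_headers(url))` before each fetch. For the simpler case of skipping archives by extension, wget's own `-R`/`--reject` option takes a comma-separated list of suffixes, for example `-R zip,iso`.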