I am working on a browser plug-in for Firefox, and I would like to be able to do some automated testing to make sure that it's handling a variety of different HTML/JavaScript features correctly. Does anyone know of a good downloadable corpus of HTML and/or JavaScript pages that could be used for this type of testing?

A: 

Do you mean like this page: http://acid3.acidtests.org/ ?

Rice Flour Cookies
I don't think so -- as far as I can tell, the Acid tests focus on standards compliance, especially w.r.t. DOM and JavaScript. I'd like more realistic pages that aren't completely compliant, have some other types of JavaScript features, etc.
Alex Jordan
A: 

The WebKit project uses SunSpider, which has tests based on "real-world" design patterns.

Ian Hickson's HTML test suite might have something along the lines of what you're looking for as well.

Mike
A: 

This ECMAScript 5 test suite tests (almost?) all JavaScript features of the current standard. Only browser-specific features are not tested.

Marcel Korpel
+2  A: 

Dotbot publishes a torrent file with 14 GB of HTML spidered in 2009.

porneL
This is pretty close to what I was thinking of. Thanks!
Alex Jordan
+1  A: 

I don't know of a packaged-up, ready-to-go corpus of HTML/JavaScript documents (although it looks like some other SO people do). If I were in your situation, I'd build my own corpus (you'll know it's current and you'll know exactly what you're dealing with).

To build your own, you can snag one of the open source crawlers, or simply use wget recursively:

wget -m -l 2 -t 7 -w 5 --waitretry=14 --random-wait -k -K -e robots=off http://stackoverflow.com -o ./myLog.log

Want to extend the above? Script up something that grabs a top-n list of sites from Google and injects those URLs into the above wget command.
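
For example, here's a minimal sketch, assuming you've already saved your top-n list as a plain-text urls.txt with one URL per line (pulling that list out of Google is left as its own exercise):

    #!/bin/bash
    # Minimal sketch: mirror each seed URL from urls.txt into its own directory under ./corpus.
    mkdir -p corpus
    while read -r url; do
        host=${url#*://}    # strip the scheme
        host=${host%%/*}    # keep just the hostname to name the output directory and log
        wget -m -l 2 -t 7 -w 5 --waitretry=14 --random-wait -k -K \
             -P "./corpus/$host" -o "./corpus/$host.log" "$url"   # robots.txt respected (no robots=off)
    done < urls.txt

Each site lands in its own directory with its own log, which keeps the corpus easy to slice up per site later.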

labratmatt
Do you know how to stop `wget` from downloading large files (ZIP, ISO, etc.) linked on pages? I tried `wget` once, but ended up sucking down a lot of non-HTML junk. Also, you shouldn't suggest `robots=off` for general crawling. That's not good netizenship.
porneL
@porneL - A: I agree, robots=off is a bad idea for general crawling, but in single instances like the above, I don't see an issue. B: It seems you might be able to add an option to wget to look at Content-Length in the header (if the server includes it in the response). I don't believe wget currently has this implemented, but I don't know a heck of a whole lot about wget. Anyone have any details on this?
labratmatt
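
For what it's worth, here's a rough sketch of keeping the junk down using only options wget is documented to have (the -R suffix reject list and the -Q total-download quota); as far as I know there's no per-file Content-Length cutoff, so the quota only bounds the crawl as a whole:

    # Untested sketch: skip common binary suffixes and cap the whole crawl at ~500 MB.
    wget -m -l 2 -w 5 --random-wait -k -K \
         -R 'zip,iso,gz,bz2,7z,exe,dmg,pdf,jpg,jpeg,png,gif,mp3,mp4,avi' \
         -Q 500m -o ./myLog.log http://stackoverflow.com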