I am working on a browser plug-in for Firefox, and I would like to be able to do some automated testing to make sure that it's handling a variety of different HTML/JavaScript features correctly. Does anyone know of a good downloadable corpus of HTML and/or JavaScript pages that could be used for this type of testing?

A: 

Do you mean like this page: http://acid3.acidtests.org/ ?

Rice Flour Cookies
I don't think so -- as far as I can tell, the Acid tests focus on standards compliance, especially w.r.t. DOM and JavaScript. I'd like more realistic pages that aren't completely compliant, have some other types of JavaScript features, etc.
Alex Jordan
A: 

The WebKit project uses SunSpider, which has tests based on "real-world" design patterns.

Ian Hickson's HTML test suite might have something along the lines of what you're looking for as well.

Mike
A: 

This ECMAScript 5 test suite tests (almost?) all JavaScript features of the current standard. Only browser-specific features are not tested.

Marcel Korpel
+2  A: 

Dotbot publishes a torrent file with 14 GB of HTML spidered in 2009.

porneL
This is pretty close to what I was thinking of. Thanks!
Alex Jordan
+1  A: 

I don't know of a packaged-up, ready-to-go corpus of HTML/JavaScript documents (although it looks like some other SO people do). If I were in your situation, I'd build my own corpus (you'll know it's current and you'll know exactly what you're dealing with).

To build your own, you can snag one of the open source crawlers, or simply use wget recursively:

wget -m -l 2 -t 7 -w 5 --waitretry=14 --random-wait -k -K -e robots=off http://stackoverflow.com -o ./myLog.log

Want to extend the above? Script up something that grabs a top-n list of sites from Google and injects those URLs into the above wget command.
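
For example, here's a minimal sketch, assuming you've already saved your top-n list as a plain-text urls.txt with one URL per line (pulling that list out of Google is left as its own exercise):

    #!/bin/bash
    # Minimal sketch: mirror each seed URL from urls.txt into its own directory under ./corpus.
    mkdir -p corpus
    while read -r url; do
        host=${url#*://}    # strip the scheme
        host=${host%%/*}    # keep just the hostname to name the output directory and log
        wget -m -l 2 -t 7 -w 5 --waitretry=14 --random-wait -k -K \
             -P "./corpus/$host" -o "./corpus/$host.log" "$url"   # robots.txt respected (no robots=off)
    done < urls.txt

Each site lands in its own directory with its own log, which keeps the corpus easy to slice up per site later.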

labratmatt
Do you know how to stop `wget` from downloading large files (ZIP, ISO, etc.) linked on pages? I tried `wget` once, but ended up sucking down a lot of non-HTML junk. Also, you shouldn't suggest `robots=off` for general crawling. That's not good netizenship.
porneL
@porneL - A: I agree, robots=off is a bad idea for general crawling, but in single instances like the above, I don't see an issue. B: It seems you might be able to add an option to wget to look at Content-Length in the header (if the server includes it in the response). I don't believe wget currently has this implemented, but I don't know a heck of a whole lot about wget. Anyone have any details on this?
labratmatt
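
For what it's worth, here's a rough sketch of keeping the junk down using only options wget is documented to have (the -R suffix reject list and the -Q total-download quota); as far as I know there's no per-file Content-Length cutoff, so the quota only bounds the crawl as a whole:

    # Untested sketch: skip common binary suffixes and cap the whole crawl at ~500 MB.
    wget -m -l 2 -w 5 --random-wait -k -K \
         -R 'zip,iso,gz,bz2,7z,exe,dmg,pdf,jpg,jpeg,png,gif,mp3,mp4,avi' \
         -Q 500m -o ./myLog.log http://stackoverflow.com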