I'm in the process of writing an HTML screen scraper. What would be the best way to create unit tests for this?

Is it "ok" to have a static html file and read it from disk on every test?

Do you have any suggestions?

Thanks

A: 

I don't see why it matters where the HTML originates from as far as your unit tests are concerned. To clarify: your unit test is processing the HTML content; where that content comes from is immaterial, so reading it from a file is fine for your unit tests. As you say in your comment, you certainly don't want to hit the network for every test, as that is just overhead.

You might also want to add an integration test or two to check that you're processing URLs correctly (i.e. that you are able to connect to and process external URLs).
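For example, a minimal sketch of the file-based unit test in Ruby with minitest and Nokogiri (the Scraper class and fixture path here are just placeholders, not anything from the question):

    require "minitest/autorun"
    require "nokogiri"

    # Stand-in scraper: parsing logic only, no HTTP anywhere.
    class Scraper
      def headings(html)
        Nokogiri::HTML(html).css("h1").map(&:text)
      end
    end

    class ScraperTest < Minitest::Test
      def test_extracts_headings_from_saved_page
        html = File.read(File.join(__dir__, "fixtures", "sample_page.html"))
        assert_equal ["Expected Heading"], Scraper.new.headings(html)
      end
    end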

Rich Seller
Well, I'm wondering what the best practice is in these situations. I certainly don't want to do an HTTP request on every test.
alexn
A: 

You should probably query a static page on disk for all but one or two tests. But don't forget those tests that touch the web!

C. Ross
A: 

What you're suggesting sounds sensible. I'd perhaps have a directory of suitable test HTML files, plus data on what to expect for each one. You can further populate that with known problematic pages as/when you come across them, to form a complete regression test suite.

You should also perform integration tests for actually talking HTTP (including not just successful page fetches, but also 404 errors, unresponsive servers etc.)
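One hedged sketch of that fixture layout in Ruby: each saved HTML file sits next to a YAML file describing what the scraper should extract from it, and a test is generated per pair (the file layout and the title check are assumptions for illustration):

    require "minitest/autorun"
    require "nokogiri"
    require "yaml"

    class RegressionSuiteTest < Minitest::Test
      FIXTURE_DIR = File.join(__dir__, "fixtures")

      # One generated test per saved page; add a new .html/.yml pair to
      # extend the regression suite with a newly discovered problem page.
      Dir.glob(File.join(FIXTURE_DIR, "*.html")).each do |html_path|
        expected = YAML.load_file(html_path.sub(/\.html\z/, ".yml"))

        define_method("test_#{File.basename(html_path, '.html')}") do
          doc = Nokogiri::HTML(File.read(html_path))
          assert_equal expected["title"], doc.at_css("title")&.text
        end
      end
    end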

Brian Agnew
A: 

I would say that depends on how many different tests you need to run.

If you need to check for a large number of different things in your unit test, you might be better off generating HTML output as part of your test initialization. It would still be file-based, but you would have an extensible pattern:

  • Initialize HTML file with fragments for Test A
  • Execute Test A
  • Delete HTML file

That way when you add test ZZZZZ down the road, you would have a consistent way of providing test data.
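A rough sketch of that initialize/execute/delete cycle using minitest's setup and teardown hooks (the fragment content and the price check are made up for illustration):

    require "minitest/autorun"
    require "tempfile"
    require "nokogiri"

    class GeneratedFixtureTest < Minitest::Test
      def setup
        # Initialize the HTML file with the fragments this test needs.
        @file = Tempfile.new(["test_a", ".html"])
        @file.write("<html><body><div class='price'>9.99</div></body></html>")
        @file.rewind
      end

      def teardown
        # Delete the generated HTML file.
        @file.close
        @file.unlink
      end

      def test_a_extracts_price
        doc = Nokogiri::HTML(File.read(@file.path))
        assert_equal "9.99", doc.at_css(".price").text
      end
    end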

If you are just running a limited number of tests, and it will stay that way, a few pre-written static HTML files should be fine.

Certainly do some integration tests as Rich suggests.

Eric J.
+3  A: 

To guarantee that the test can be run over and over again, you should have a static page to test against (i.e. reading it from disk is OK).

If you write a test that touches the live page on the web, that's probably not a unit test but an integration test. You could have those too.

Arjan Einbu
Thanks, this answers my question.
alexn
A: 

To create your unit tests, you need to know how your scraper works and what sorts of information you think it should be extracting. Using simple web pages as unit tests could be OK depending on the complexity of your scraper.

For regression testing, you should absolutely keep files on disk.

But if your ultimate goal is to scrape the web, you should also keep a record of common queries and the HTML that comes back. This way, when your application fails, you can quickly capture all past queries of interest (using, say, wget or curl) and find out if and how the HTML has changed.

In other words, regression test both against known HTML and against unknown HTML from known queries. If you issue a known query and the HTML that comes back is identical to what's in your database, you don't need to test it twice.
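Sketched in Ruby, that bookkeeping might look something like the script below (the query list, URL and cache layout are invented for the example):

    require "net/http"
    require "uri"
    require "digest"

    QUERIES = { "widget_search" => "https://example.com/search?q=widget" }

    QUERIES.each do |name, url|
      live   = Net::HTTP.get(URI(url))
      cached = File.read("cache/#{name}.html")

      if Digest::SHA256.hexdigest(live) == Digest::SHA256.hexdigest(cached)
        puts "#{name}: HTML unchanged, nothing new to test"
      else
        puts "#{name}: HTML changed, re-run the scraper tests against it"
        File.write("cache/#{name}.live.html", live) # keep the new copy for diffing
      end
    end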

Incidentally, I've had much better luck screen scraping ever since I stopped trying to scrape raw HTML and started instead to scrape the output of w3m -dump, which is ASCII and is so much easier to deal with!

Norman Ramsey
+1  A: 

You're creating an external dependency, which is going to be fragile.

Why not create a TestContent project, populated with a bunch of resource files? Copy 'n' paste your source HTML into the resource file(s) and then you can reference them in your unit tests.

48klocs
A: 

You need to think about what it is you are scraping.

  • Static Html (html that is not bound to change drastically and break your scraper)
  • Dynamic Html (Loose term, html that may drastically change)
  • Unknown (html that you pull specific data from, regardless of format)

If the html is static, then I would just use a couple different local copies on disk. Since you know the html is not bound to change drastically and break your scraper, you can confidently write your test using a local file.

If the html is dynamic (again, loose term), then you may want to go ahead and use live requests in the test. If you use a local copy in this scenario and the test passes, you may expect the live html to behave the same, when in fact it may fail. In this case, by testing against the live html every time, you immediately know whether your screen scraper is up to par or not, before deployment.

Now if you simply don't care what format the html is, the order of the elements, or the structure, because you are simply pulling out individual elements based on some matching mechanism (regex/other), then a local copy may be fine, but you may still want to lean towards testing against live html. If the live html changes, specifically the parts you are looking for, then your test may pass with a local copy but fail come deployment.

My opinion would be to test against live html if you can. This will prevent your local tests from passing when the live html would fail, and vice versa. I don't think there is a single best practice with screen scrapers, because screen scrapers in themselves are unusual little buggers. If a website or web service does not expose an API, a screen scraper is sort of a cheesy workaround for getting the data you want.

David Anderson
unusual little buggers...lol
mhd
A: 

Sounds like you have several components here:

  • Something that fetches your HTML content
  • Something that strips away the chaff and produces just the text that must be scraped
  • Something that actually looks at the content and transforms it into your database/whatever

You should test (and probably implement) these parts of the scraper independently.

There's no reason you shouldn't be able to get the content from anywhere (i.e. without HTTP).

There's no reason you wouldn't want to strip away the chaff for purposes other than scraping.

There's no reason to only store data into your database via scraping.

So... there's no reason to build and test all these pieces of your code as a single large program.
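A hedged sketch of that separation in Ruby (all class names are illustrative): only the fetcher knows about HTTP, so the extractor and the store can each be tested with plain strings and in-memory objects.

    require "net/http"
    require "uri"
    require "nokogiri"

    class Fetcher                  # talks HTTP, nothing else
      def fetch(url)
        Net::HTTP.get(URI(url))
      end
    end

    class Extractor                # turns raw HTML into plain data
      def extract(html)
        Nokogiri::HTML(html).css(".item").map { |node| node.text.strip }
      end
    end

    class Store                    # persists extracted data (here: in memory)
      attr_reader :rows

      def initialize
        @rows = []
      end

      def save(rows)
        @rows.concat(rows)
      end
    end

    # The extractor can be unit tested with no network at all:
    #   Extractor.new.extract("<div class='item'> a </div>")  #=> ["a"]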

Then again... maybe we're over complicating things?

Armentage
+2  A: 

Files are ok, but remember: your screen scraper processes text. You should have various unit tests that "scrape" different pieces of text hard-coded within each unit test. Each piece should "provoke" the various parts of your scraper method.

This way you completely remove dependencies on anything external, both files and web pages. And your tests will be easier to maintain individually, since they no longer depend on external files. Your unit tests will also execute (slightly) faster ;)
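For instance, a small sketch with minitest where every fixture lives inside the test itself (the ItemScraper class is a placeholder for the real scraper):

    require "minitest/autorun"
    require "nokogiri"

    class ItemScraper
      def prices(html)
        Nokogiri::HTML(html).css(".price").map(&:text)
      end
    end

    class InlineFixtureTest < Minitest::Test
      def test_single_price
        assert_equal ["10"], ItemScraper.new.prices("<div class='price'>10</div>")
      end

      def test_page_without_prices_returns_empty_list
        assert_equal [], ItemScraper.new.prices("<div class='title'>no price here</div>")
      end
    end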

Peter Lillevold
A: 

For my ruby+mechanize scrapers I've been experimenting with integration tests that transparently test against as many versions of the target page as possible.

Inside the tests I'm overloading the scraper's HTTP fetch method to automatically re-cache a newer version of the page, in addition to an "original" copy saved manually. Then each integration test runs against:

  • the original manually-saved page (somewhat like a unit test)
  • the freshest version of the page we have
  • a live copy from the site right now (which is skipped if offline)

... and raises an exception if the number of fields returned by each of them differs (e.g. they've changed the name of a thumbnail class), but it still provides some resilience against tests breaking because the target site is down.
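A rough reconstruction of that idea in Ruby (not the actual code; the URL, file paths and the thumbnail count stand in for whatever fields the real scraper extracts):

    require "minitest/autorun"
    require "net/http"
    require "uri"
    require "nokogiri"

    class TargetPageIntegrationTest < Minitest::Test
      URL = URI("https://example.com/listing")

      def field_count(html)
        Nokogiri::HTML(html).css(".thumbnail").size
      end

      def test_field_count_is_stable_across_page_versions
        counts = {
          original: field_count(File.read("fixtures/listing.original.html")),
          cached:   field_count(File.read("cache/listing.latest.html")),
        }

        begin
          counts[:live] = field_count(Net::HTTP.get(URL))
        rescue SocketError, Timeout::Error
          # Site unreachable: skip the live comparison instead of failing.
        end

        assert_equal 1, counts.values.uniq.size,
                     "field counts differ between page versions: #{counts}"
      end
    end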

jamiew
This sounds very interesting, as I'm not that familiar with integration tests. Are you able to provide an example of this?
alexn