I'm in the process of writing an HTML screen scraper. What would be the best way to create unit tests for this?

Is it "ok" to have a static html file and read it from disk on every test?

Do you have any suggestions?

Thanks

A: 

I don't see why it matters where the HTML originates from as far as your unit tests are concerned. To clarify: your unit test is processing the HTML content; where that content comes from is immaterial, so reading it from a file is fine for your unit tests. As you say in your comment, you certainly don't want to hit the network for every test, as that is just overhead.

You might also want to add an integration test or two to check that you're processing URLs correctly (i.e. that you are able to connect to and process external URLs).
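For example, a minimal sketch of the file-based unit test in Ruby with minitest and Nokogiri (the Scraper class and fixture path here are just placeholders, not anything from the question):

    require "minitest/autorun"
    require "nokogiri"

    # Stand-in scraper: parsing logic only, no HTTP anywhere.
    class Scraper
      def headings(html)
        Nokogiri::HTML(html).css("h1").map(&:text)
      end
    end

    class ScraperTest < Minitest::Test
      def test_extracts_headings_from_saved_page
        html = File.read(File.join(__dir__, "fixtures", "sample_page.html"))
        assert_equal ["Expected Heading"], Scraper.new.headings(html)
      end
    end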

Rich Seller
Well, I'm wondering what the best practice is in these situations. I certainly don't want to do an HTTP request on every test.
alexn
A: 

You should probably query a static page on disk for all but one or two tests. But don't forget those tests that touch the web!

C. Ross
A: 

What you're suggesting sounds sensible. I'd perhaps have a directory of suitable test HTML files, plus data on what to expect for each one. You can further populate that with known problematic pages as/when you come across them, to form a complete regression test suite.

You should also perform integration tests for actually talking HTTP (including not just successful page fetches, but also 404 errors, unresponsive servers etc.)
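One hedged sketch of that fixture layout in Ruby: each saved HTML file sits next to a YAML file describing what the scraper should extract from it, and a test is generated per pair (the file layout and the title check are assumptions for illustration):

    require "minitest/autorun"
    require "nokogiri"
    require "yaml"

    class RegressionSuiteTest < Minitest::Test
      FIXTURE_DIR = File.join(__dir__, "fixtures")

      # One generated test per saved page; add a new .html/.yml pair to
      # extend the regression suite with a newly discovered problem page.
      Dir.glob(File.join(FIXTURE_DIR, "*.html")).each do |html_path|
        expected = YAML.load_file(html_path.sub(/\.html\z/, ".yml"))

        define_method("test_#{File.basename(html_path, '.html')}") do
          doc = Nokogiri::HTML(File.read(html_path))
          assert_equal expected["title"], doc.at_css("title")&.text
        end
      end
    end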

Brian Agnew
A: 

I would say that depends on how many different tests you need to run.

If you need to check for a large number of different things in your unit test, you might be better off generating HTML output as part of your test initialization. It would still be file-based, but you would have an extensible pattern:

  • Initialize HTML file with fragments for Test A
  • Execute Test A
  • Delete HTML file

That way when you add test ZZZZZ down the road, you would have a consistent way of providing test data.
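A rough sketch of that initialize/execute/delete cycle using minitest's setup and teardown hooks (the fragment content and the price check are made up for illustration):

    require "minitest/autorun"
    require "tempfile"
    require "nokogiri"

    class GeneratedFixtureTest < Minitest::Test
      def setup
        # Initialize the HTML file with the fragments this test needs.
        @file = Tempfile.new(["test_a", ".html"])
        @file.write("<html><body><div class='price'>9.99</div></body></html>")
        @file.rewind
      end

      def teardown
        # Delete the generated HTML file.
        @file.close
        @file.unlink
      end

      def test_a_extracts_price
        doc = Nokogiri::HTML(File.read(@file.path))
        assert_equal "9.99", doc.at_css(".price").text
      end
    end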

If you are just running a limited number of tests, and it will stay that way, a few pre-written static HTML files should be fine.

Certainly do some integration tests as Rich suggests.

Eric J.
+3  A: 

To guarantee that the test can be run over and over again, you should have a static page to test against (i.e. reading it from disk is OK).

If you write a test that touches the live page on the web, that's probably not a unit test but an integration test. You could have those too.

Arjan Einbu
Thanks, this answers my question.
alexn
A: 

To create your unit tests, you need to know how your scraper works and what sorts of information you think it should be extracting. Using simple web pages as unit tests could be OK depending on the complexity of your scraper.

For regression testing, you should absolutely keep files on disk.

But if your ultimate goal is to scrape the web, you should also keep a record of common queries and the HTML that comes back. This way, when your application fails, you can quickly capture all past queries of interest (using, say, wget or curl) and find out if and how the HTML has changed.

In other words, regression test both against known HTML and against unknown HTML from known queries. If you issue a known query and the HTML that comes back is identical to what's in your database, you don't need to test it twice.
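Sketched in Ruby, that bookkeeping might look something like the script below (the query list, URL and cache layout are invented for the example):

    require "net/http"
    require "uri"
    require "digest"

    QUERIES = { "widget_search" => "https://example.com/search?q=widget" }

    QUERIES.each do |name, url|
      live   = Net::HTTP.get(URI(url))
      cached = File.read("cache/#{name}.html")

      if Digest::SHA256.hexdigest(live) == Digest::SHA256.hexdigest(cached)
        puts "#{name}: HTML unchanged, nothing new to test"
      else
        puts "#{name}: HTML changed, re-run the scraper tests against it"
        File.write("cache/#{name}.live.html", live) # keep the new copy for diffing
      end
    end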

Incidentally, I've had much better luck screen scraping ever since I stopped trying to scrape raw HTML and started instead to scrape the output of w3m -dump, which is ASCII and is so much easier to deal with!

Norman Ramsey
+1  A: 

You're creating an external dependency, which is going to be fragile.

Why not create a TestContent project, populated with a bunch of resource files? Copy 'n' paste your source HTML into the resource file(s) and then you can reference them in your unit tests.

48klocs
A: 

You need to think about what it is you are scraping.

  • Static Html (html that is not bound to change drastically and break your scraper)
  • Dynamic Html (Loose term, html that may drastically change)
  • Unknown (html that you pull specific data from, regardless of format)

If the html is static, then I would just use a couple different local copies on disk. Since you know the html is not bound to change drastically and break your scraper, you can confidently write your test using a local file.

If the html is dynamic (again, loose term), then you may want to go ahead and use live requests in the test. If you use a local copy in this scenario and the test passes, you may expect the live html to behave the same, when in fact it may fail. In this case, by testing against the live html every time, you immediately know whether your screen scraper is up to par or not, before deployment.

Now if you simply don't care what format the html is, the order of the elements, or the structure, because you are simply pulling out individual elements based on some matching mechanism (regex/other), then a local copy may be fine, but you may still want to lean towards testing against live html. If the live html changes, specifically the parts you are looking for, then your test may pass with a local copy but fail come deployment.

My opinion would be to test against live html if you can. This will prevent your local tests from passing when the live html would fail, and vice versa. I don't think there is a single best practice with screen scrapers, because screen scrapers in themselves are unusual little buggers. If a website or web service does not expose an API, a screen scraper is sort of a cheesy workaround for getting the data you want.

David Anderson
unusual little buggers...lol
mhd
A: 

Sounds like you have several components here:

  • Something that fetches your HTML content
  • Something that strips away the chaff and produces just the text that must be scraped
  • Something that actually looks at the content and transforms it into your database/whatever

You should test (and probably implement) these parts of the scraper independently.

There's no reason you shouldn't be able to get the content from anywhere (i.e. without HTTP).

There's no reason you wouldn't want to strip away the chaff for purposes other than scraping.

There's no reason to only store data into your database via scraping.

So... there's no reason to build and test all these pieces of your code as a single large program.
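A hedged sketch of that separation in Ruby (all class names are illustrative): only the fetcher knows about HTTP, so the extractor and the store can each be tested with plain strings and in-memory objects.

    require "net/http"
    require "uri"
    require "nokogiri"

    class Fetcher                  # talks HTTP, nothing else
      def fetch(url)
        Net::HTTP.get(URI(url))
      end
    end

    class Extractor                # turns raw HTML into plain data
      def extract(html)
        Nokogiri::HTML(html).css(".item").map { |node| node.text.strip }
      end
    end

    class Store                    # persists extracted data (here: in memory)
      attr_reader :rows

      def initialize
        @rows = []
      end

      def save(rows)
        @rows.concat(rows)
      end
    end

    # The extractor can be unit tested with no network at all:
    #   Extractor.new.extract("<div class='item'> a </div>")  #=> ["a"]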

Then again... maybe we're over complicating things?

Armentage
+2  A: 

Files are ok, but remember: your screen scraper processes text. You should have various unit tests that "scrape" different pieces of text hard-coded within each unit test. Each piece should "provoke" the various parts of your scraper method.

This way you completely remove dependencies on anything external, both files and web pages. And your tests will be easier to maintain individually, since they no longer depend on external files. Your unit tests will also execute (slightly) faster ;)
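For instance, a small sketch with minitest where every fixture lives inside the test itself (the ItemScraper class is a placeholder for the real scraper):

    require "minitest/autorun"
    require "nokogiri"

    class ItemScraper
      def prices(html)
        Nokogiri::HTML(html).css(".price").map(&:text)
      end
    end

    class InlineFixtureTest < Minitest::Test
      def test_single_price
        assert_equal ["10"], ItemScraper.new.prices("<div class='price'>10</div>")
      end

      def test_page_without_prices_returns_empty_list
        assert_equal [], ItemScraper.new.prices("<div class='title'>no price here</div>")
      end
    end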

Peter Lillevold
A: 

For my ruby+mechanize scrapers I've been experimenting with integration tests that transparently test against as many versions of the target page as possible.

Inside the tests I'm overloading the scraper's HTTP fetch method to automatically re-cache a newer version of the page, in addition to an "original" copy saved manually. Then each integration test runs against:

  • the original manually-saved page (somewhat like a unit test)
  • the freshest version of the page we have
  • a live copy from the site right now (which is skipped if offline)

... and raises an exception if the number of fields returned by each of them differs (e.g. they've changed the name of a thumbnail class), but it still provides some resilience against tests breaking because the target site is down.
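A rough reconstruction of that idea in Ruby (not the actual code; the URL, file paths and the thumbnail count stand in for whatever fields the real scraper extracts):

    require "minitest/autorun"
    require "net/http"
    require "uri"
    require "nokogiri"

    class TargetPageIntegrationTest < Minitest::Test
      URL = URI("https://example.com/listing")

      def field_count(html)
        Nokogiri::HTML(html).css(".thumbnail").size
      end

      def test_field_count_is_stable_across_page_versions
        counts = {
          original: field_count(File.read("fixtures/listing.original.html")),
          cached:   field_count(File.read("cache/listing.latest.html")),
        }

        begin
          counts[:live] = field_count(Net::HTTP.get(URL))
        rescue SocketError, Timeout::Error
          # Site unreachable: skip the live comparison instead of failing.
        end

        assert_equal 1, counts.values.uniq.size,
                     "field counts differ between page versions: #{counts}"
      end
    end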

jamiew
This sounds very interesting, as I'm not that familiar with integration tests. Are you able to provide an example of this?
alexn