I was answering some quiz questions for an interview, and one question asked how I would do screen scraping: that is, picking content out of a web page, assuming you don't have a better-structured way to query the information directly (e.g. a web service).

My solution was to use an XQuery expression. The expression was fairly long because the content I needed was pretty deep in the HTML hierarchy. I had to search up through the ancestors a fair way before I found an element with an id attribute. For example, scraping an Amazon.com page for Product Dimensions looks like this:

//a[@id="productDetails"]
/following-sibling::table
//h2[contains(child::text(), "Product Details")]
/following-sibling::div
//li
/b[contains(child::text(), "Product Dimensions:")]
/following-sibling::text()

That's a pretty nasty expression, but that's why Amazon provides a web service API. Anyway, it's just one example. The question was not about Amazon; it was about screen scraping.

The interviewer didn't like my solution. He thought it was fragile, because a change to the page design by Amazon could require rewriting the XQuery expression. Debugging an XQuery expression that doesn't match anything in the page it's applied against is hard.

I did not disagree with his statements, but I didn't think his solution was any improvement: he thought it was better to use a regular expression and search for content and markup near the product dimensions. For example, using Perl:

$html =~ m{<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>}s;

My counter-argument was that this is also susceptible to Amazon changing their HTML code. They could spell HTML tags in capitals (<LI>), add CSS attributes, change <b> to <span>, change the label "Product Dimensions:" to "Dimensions:", or make many other kinds of changes. My point was that regular expressions don't solve the weaknesses he called out in my XQuery solution.

In addition, regular expressions can find false positives unless you add enough context to the expression. They can also unintentionally match content that happens to be inside a comment, an attribute string, or a CDATA section.
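To make both objections concrete, here is a minimal sketch in Python (standard library only). The pattern is a direct translation of the Perl one above; the page fragments, the `extract` helper, and the dimension values are made up for illustration and are not real Amazon markup.

```python
import re

# Python translation of the Perl pattern quoted above; re.S mirrors Perl's /s flag.
PATTERN = re.compile(
    r"<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>", re.S
)

def extract(html):
    """Return the first captured value, stripped, or None if nothing matches."""
    m = PATTERN.search(html)
    return m.group(1).strip() if m else None

# Hypothetical page fragments, invented for this example.
plain = "<ul><li><b>Product Dimensions:</b> 8 x 5 x 1 inches</li></ul>"
upper = "<UL><LI><B>Product Dimensions:</B> 8 x 5 x 1 inches</LI></UL>"
styled = '<ul><li class="dim"><b>Product Dimensions:</b> 8 x 5 x 1 inches</li></ul>'
commented = (
    "<!-- <li><b>Product Dimensions:</b> 1 x 1 x 1 inches</li> -->"
    "<ul><li><b>Product Dimensions:</b> 8 x 5 x 1 inches</li></ul>"
)

print(extract(plain))      # 8 x 5 x 1 inches
print(extract(upper))      # None: tags in capitals no longer match
print(extract(styled))     # None: an added attribute breaks <li>
print(extract(commented))  # 1 x 1 x 1 inches: stale value from inside a comment
```

The last case is the worrying one: the first two failures at least return nothing, while the commented-out markup silently yields the wrong value.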

My question is, what technology do you use to do screen scraping? Why did you choose that solution? Is there some compelling reason to use one? Or never use the other? Is there a third choice besides those I showed above?

PS: Assume for the sake of argument that there is no web service API or other more direct way to acquire the desired content.

+3  A: 

I'd use a regular expression, for the reasons the manager gave, plus a few more (more portable, easier for outside programmers to follow, etc.).

Your counter-argument misses the point that his solution is fragile with regard to local changes, while yours is fragile with regard to global changes. Anything that breaks his will probably break yours, but not vice versa.

Finally, it's a lot easier to build slop / flex into his solution (if, for example, you have to deal with multiple minor variations in the input).
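As a rough illustration of what building in that slop might look like, here is a Python sketch (standard library only). The set of tolerated variations (tag case, extra attributes, `<b>`/`<strong>`/`<span>`, an optional "Product" prefix) is an assumption chosen for the example, as are the names `LOOSE` and `extract_loose`.

```python
import re

# A deliberately forgiving pattern. re.X allows the layout and comments,
# re.I ignores tag/label case, re.S lets the value span several lines.
LOOSE = re.compile(
    r"""
    <li\b[^>]*>\s*                  # <li> in any case, optional attributes
    <(b|strong|span)\b[^>]*>\s*     # label wrapper: a few plausible tags
    (?:Product\s+)?Dimensions:?\s*  # label, with or without the prefix
    </\1\s*>\s*                     # close tag matching the wrapper
    (.*?)                           # the value we want
    </li\s*>
    """,
    re.S | re.I | re.X,
)

def extract_loose(html):
    """Return the captured value, stripped, or None if nothing matches."""
    m = LOOSE.search(html)
    return m.group(2).strip() if m else None

# Hypothetical variations of the same fragment.
print(extract_loose("<li><b>Product Dimensions:</b> 8 x 5 x 1 inches</li>"))
print(extract_loose("<LI><B>Product Dimensions:</B> 8 x 5 x 1 inches</LI>"))
print(extract_loose(
    '<li class="dim"><span class="label">Dimensions:</span> 8 x 5 x 1 inches</li>'
))
```

All three calls print the same value. The looser the pattern gets, though, the more room there is for false matches, so this trades one kind of fragility for another.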

MarkusQ
+4  A: 

I'd use a regular expression, but only because most HTML pages are not valid XML, so you'd never get the XQuery to work.

I don't know XQuery, but that looks like an XPath expression to me. If so, it looks a bit expensive with so many "//" operators in it.

John Saunders
This is the principal reason for avoiding XML-based web scraping. No XML processor is as forgiving as a browser. The page need only break one well-formedness rule to render XQuery completely useless.
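A small Python sketch of that failure mode, using the standard library's XML parser as a stand-in for whatever XML processor would sit underneath an XQuery engine. The two fragments and the `parses_as_xml` helper are made up for the example; the second fragment differs only by an unclosed `<br>`, which browsers accept without complaint.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: identical except for one unclosed <br>.
well_formed = "<ul><li><b>Product Dimensions:</b> 8 x 5 x 1 inches</li></ul>"
tag_soup = "<ul><li><b>Product Dimensions:</b> 8 x 5 x 1 inches<br></li></ul>"

def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(parses_as_xml(well_formed))  # True
print(parses_as_xml(tag_soup))     # False: </li> closes while <br> is still open
```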
harpo
+1  A: 

I'm using BeautifulSoup for scraping.

Anonymous
+1  A: 

I actually find CSS search expressions easier to read than either. There probably exists at least one library in the language of your choice that will parse a page and allow you to write CSS directives for locating particular elements. If there's an appropriate class or ID hook nearby then the expression is pretty trivial. Otherwise, grab the elements that seem appropriate and iterate through them to find the ones that you need.

As for fragile, well, they're all fragile. Screen-scraping is by definition dependent on the author of that page not changing its layout drastically. Go with a solution that's readable and can be easily changed later.

Brian Guthrie
+1  A: 

A non-brittle solution for screen-scraping? Good luck to the interviewer with that. As you noted, just because regular expressions toss away a lot of context doesn't mean they are any less brittle, just that they are brittle in other ways. Brittleness may not even be a drawback: if something changes in the source web page, you are frequently better off if your solution raises an alarm rather than trying to compensate in a clever (and unpredictable) way. These things always depend on your assumptions: in this case, on what constitutes a likely change.

I'm rather fond of the HTML Agility Pack: you get tolerance of non-XHTML-compliant web pages combined with the expressive power of XPath.
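For readers outside .NET, here is a sketch of the same two ideas (a forgiving parser, plus failing loudly when the page changes) using only Python's standard `html.parser`. This is not the HTML Agility Pack and it has no XPath; the class `LabelValueParser`, the function `scrape`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

class LabelValueParser(HTMLParser):
    """Find the first text that follows <b>LABEL</b> in tolerant HTML."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_b = False      # currently inside a <b> element
        self.capture = False   # label seen; next non-blank text is the value
        self.value = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <B> and <b> look the same here.
        if tag == "b":
            self.in_b = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_b = False

    def handle_data(self, data):
        if self.value is not None:
            return
        text = data.strip()
        if self.in_b:
            if text == self.label:
                self.capture = True
        elif self.capture and text:
            self.value = text
            self.capture = False

def scrape(html, label):
    """Return the value for label, or raise so that a layout change is noticed."""
    parser = LabelValueParser(label)
    parser.feed(html)
    parser.close()
    if parser.value is None:
        raise LookupError("label not found: " + label)
    return parser.value

# Hypothetical tag soup: capitals, unquoted attribute, unclosed <li> and <br>,
# and a stale copy of the data hidden in a comment.
page = (
    "<!-- <li><b>Product Dimensions:</b> 1 x 1 x 1 inches</li> -->"
    "<UL><LI CLASS=dim><B>Product Dimensions:</B> 8 x 5 x 1 inches<br>"
    "<LI><B>Shipping Weight:</B> 1 pound</UL>"
)

print(scrape(page, "Product Dimensions:"))  # 8 x 5 x 1 inches
print(scrape(page, "Shipping Weight:"))     # 1 pound
```

Comment contents never reach `handle_data`, so the stale value is ignored, and a missing label raises `LookupError` instead of returning a guess.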

Pontus Gagge
+1  A: 

Try JTidy or BeautifulSoup; both work fine for me. An XPath expression full of // is certainly quite costly for scraping.

+1  A: 

Regular expressions are really fast and work with non-XML documents. Those are really good points against XQuery. However, I think that using a converter to XHTML, such as tidy, together with a somewhat simpler XQuery expression, such as only the last part of yours:

//b[contains(child::text(), "Product Dimensions:")]/following-sibling::text()

is a very good alternative.
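A rough Python sketch of the second half of that pipeline, assuming the page has already been converted to well-formed XHTML (the tidy step is not shown, and the sample document is invented). The standard library's ElementTree supports only a subset of XPath, without `contains()` or `following-sibling`, so the sketch walks the `<b>` elements and reads each one's `tail`, which is the text immediately after the closing tag. The helper names `local_name` and `text_after_label` are hypothetical.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip an XML namespace prefix such as {http://www.w3.org/1999/xhtml}."""
    return tag.rsplit("}", 1)[-1]

def text_after_label(xhtml, label):
    """Return the text right after the first <b> whose text contains label."""
    root = ET.fromstring(xhtml)
    for el in root.iter():
        if local_name(el.tag) == "b" and label in (el.text or ""):
            return (el.tail or "").strip()
    return None

# Hypothetical tidy output: well-formed and in the XHTML namespace.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><ul>'
    "<li><b>Product Dimensions:</b> 8 x 5 x 1 inches</li>"
    "<li><b>Shipping Weight:</b> 1 pound</li>"
    "</ul></body></html>"
)

print(text_after_label(xhtml, "Product Dimensions:"))  # 8 x 5 x 1 inches
```

Note that `tail` covers only the first text node after the element, whereas `following-sibling::text()` would select every later text sibling; for a simple label/value list item the two coincide.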

Regards,

Rafal Rusin
