I was answering some quiz questions for an interview, and the question was about how would I do screen scraping. That is, picking content out of a web page, assuming you don't have a better structured way to query the information directly (e.g. a web service).
My solution was to use an XQuery expression. The expression was fairly long because the content I needed was pretty deep in the HTML hierarchy. I had to search up through the ancestors a fair way before I found an element with an id
attribute. For example, scraping an Amazon.com page for Product Dimensions looks like this:
//a[@id="productDetails"]
/following-sibling::table
//h2[contains(child::text(), "Product Details")]
/following-sibling::div
//li
/b[contains(child::text(), "Product Dimensions:")]
/following-sibling::text()
That's a pretty nasty expression, but that's why Amazon provides a web service API. Anyway, it's just one example. The question was not about Amazon, it's about screen scraping.
The interviewer didn't like my solution. He thought it was fragile, because a change to the page design by Amazon could require rewriting the XQuery expression. Debugging an XQuery expression that doesn't match anything in the page it's applied against is hard.
I did not disagree with his statements, but I didn't think his solution was any improvement: he thought it's better to use a regular expression, and search for content and markup near the shipping weight. For example, using Perl:
$html =~ m{<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>}s;
My counter-argument was that this is also susceptible to Amazon changing their HTML code. They could spell HTML tags in capitals (<LI>
), or add CSS attributes or change <b>
to <span>
or change the label "Product Dimensions:" to "Dimensions:" or many other kinds of changes. My point was that regular expressions don't solve the weaknesses he called out in my XQuery solution.
But in addition, regular expressions can find false positives, unless you add enough context to the expression. It can also unintentionally match content that happens to be inside a comment, or an attribute string, or a CDATA section.
My question is, what technology do you use to do screen scraping? Why did you choose that solution? Is there some compelling reason to use one? Or never use the other? Is there a third choice besides those I showed above?
PS: Assume for the sake of argument that there is no web service API or other more direct way to acquire the desired content.