I've got a list of websites for each US Congress member that I'm programmatically crawling to scrape addresses. The sites vary in their underlying markup, which wasn't a problem until I noticed that hundreds of them were not returning the expected results for the script I had written.

After taking some more time to evaluate potential causes, I found that calling strip_tags() on the results of file_get_contents() was often erasing most of the page's source. It was removing not only the HTML, but also the non-HTML text that I wanted to scrape!

So I removed the call to strip_tags(), substituted a call that strips all non-alphanumeric characters, and gave the process another run. It turned up more results, but many were still missing. This time the cause was that my regular expressions weren't matching the desired patterns: looking at the returned code, I saw remnants of HTML attributes interspersed throughout the text, breaking my patterns.
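For reference, the failing approach looks roughly like this (a reconstruction; the URL and the address pattern are placeholders, not my actual script):

    <?php
    // Rough reconstruction of the approach described above.
    $html = file_get_contents('http://example.house.gov/contact');

    // Second attempt: replace every non-alphanumeric run with a space
    // instead of calling strip_tags().
    $text = preg_replace('/[^a-zA-Z0-9]+/', ' ', $html);

    // Leftover attribute fragments (class names, ids, URLs) end up in
    // $text and break patterns like this one.
    preg_match('/\d+ [A-Za-z ]+ (Street|St|Avenue|Ave)/', $text, $matches);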

Is there a way around this? Is it the result of malformed HTML? Can I do anything about it?

+4  A: 

There's a warning in the PHP manual that reads:

Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.

Since you are scraping many different sites and can't account for the validity of their HTML, this will always be a problem. Unfortunately, regular expressions aren't going to do it for you either; they simply aren't cut out to be document parsers.
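As a quick illustration of how a broken tag can swallow the data you actually want (the sample strings are invented):

    <?php
    // A well-formed tag is stripped cleanly:
    echo strip_tags('District office: <b>123 Main St.</b>');
    // "District office: 123 Main St."

    // A broken tag (missing ">") makes strip_tags() discard everything
    // after the "<" -- including the address:
    echo strip_tags('District office: <b 123 Main St.');
    // "District office: "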

I would use something like PHP Simple HTML DOM Parser, or even the built-in DOMDocument::loadHTML() method.
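Something like this, for instance (a minimal sketch; the URL and the XPath query are placeholders):

    <?php
    $doc = new DOMDocument();
    // libxml repairs the tree as it parses, so this works even on
    // broken markup; @ silences the resulting parse warnings.
    @$doc->loadHTML(file_get_contents('http://example.house.gov/contact'));

    // textContent returns the text with every tag removed, unlike
    // strip_tags() on raw source.
    $text = $doc->documentElement->textContent;

    // Or pull a specific element once you know where the address lives:
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@id="contact"]') as $node) {
        echo trim($node->textContent), "\n";
    }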

You could also keep a small database that records each page you want to scrape and where the information is found within that page's structure. Each time you scrape a page, do a quick check to see whether its structure has changed; if it has, update the database with the new path for your DOM parser and pick the data up on the next scrape.
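A minimal version of that idea might look like the following, assuming a hypothetical sites table holding one XPath expression per page (the schema and column names are invented):

    <?php
    $pdo = new PDO('sqlite:scraper.db');

    foreach ($pdo->query('SELECT url, xpath FROM sites') as $site) {
        $doc = new DOMDocument();
        @$doc->loadHTMLFile($site['url']);

        $nodes = (new DOMXPath($doc))->query($site['xpath']);
        if ($nodes->length === 0) {
            // The stored path no longer matches anything: the page's
            // structure has probably changed, so flag it for an update.
            echo "Structure changed: {$site['url']}\n";
            continue;
        }
        echo trim($nodes->item(0)->textContent), "\n";
    }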

zombat
A: 

Malformed HTML may very well be the cause.
You could try loading the pages via DOMDocument::loadHTMLFile(); it may be able to "fix" the errors.
Also take a look at libxml_use_internal_errors(), as it might help you identify and handle the problems.
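Combining the two, something along these lines (the URL is a placeholder):

    <?php
    // Collect libxml's parse errors instead of letting them surface as
    // PHP warnings, then inspect what was wrong with the markup.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTMLFile('http://example.house.gov/contact');

    foreach (libxml_get_errors() as $error) {
        // $error is a LibXMLError with ->line, ->column, ->message, ->level
        printf("line %d, col %d: %s", $error->line, $error->column, $error->message);
    }
    libxml_clear_errors();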

VolkerK