I've got a list of websites for each US Congress member that I'm programmatically crawling to scrape addresses. The sites vary in their underlying markup, which wasn't a problem until I noticed that hundreds of them weren't giving the expected results for the script I had written.
After taking some more time to evaluate potential causes, I found that calling strip_tags() on the result of file_get_contents() was erasing most of the page source on many of the sites. It wasn't just removing the HTML; it was also removing the non-HTML text that I wanted to scrape.
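For context, here is a simplified sketch of that first approach; the URL and the address regex are placeholders rather than my actual list and patterns:

```php
<?php
// Sketch of the original fetch-and-strip step (placeholder URL and pattern).
$url  = 'https://example.house.gov/contact';
$html = file_get_contents($url);

// This is where the content disappears: strip_tags() removes anything it
// treats as markup, and on some of these pages that includes large chunks
// of the visible text containing the addresses.
$text = strip_tags($html);

// Address extraction then runs against $text with regexes roughly like this:
if (preg_match('/\d+\s+\w+\s+(Street|St|Avenue|Ave)\b/i', $text, $match)) {
    echo $match[0], "\n";
}
```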
So I removed the call to strip_tags(), substituted a call that strips all non-alphanumeric characters instead, and gave the process another run. That turned up more results but still missed many, this time because my regular expressions weren't matching the desired patterns. Looking at the returned text, I realized that remnants of HTML attributes were interspersed throughout it, breaking my patterns.
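A similarly simplified sketch of the second attempt, showing how the leftover tag names and attributes end up mixed into the text (placeholder URL again):

```php
<?php
// Second attempt (sketch): skip strip_tags() and blank out non-alphanumerics.
$html = file_get_contents('https://example.house.gov/contact');

// Replace everything that is not a letter, digit, or whitespace with a space.
$text = preg_replace('/[^A-Za-z0-9\s]/', ' ', $html);

// The problem: tag names and attribute values survive as plain words, e.g.
//   <div class="office-address">123 Main Street</div>
// comes out as
//   div class office address 123 Main Street div
// so attribute remnants sit right next to the address text and my
// patterns no longer match cleanly.
```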
Is there a way around this? Is it the result of malformed HTML? Can I do anything about it?