I'm using PHP to scrape a website and collect some data. It's all done without regex; instead, I use PHP's explode() function to find particular HTML tags.

If the structure of the website changes (CSS, HTML), the scraper may collect the wrong data. So the question is: how do I know if the HTML structure has changed? How can I detect this before storing anything in my database, so that wrong data is never stored?

A: 

First, in some cases you may want to compare a hash of the original HTML with a hash of the new HTML. MD5 and SHA1 are two popular hash functions. This may not be valid in all circumstances, but it is something you should be familiar with. It tells you whether anything at all has changed: content, tags, or anything else.

To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
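As a rough illustration of the histogram and tag-order comparison described above, here is a minimal sketch. It is written in Python rather than PHP, uses only the standard library, and the names TagCollector, tag_profile and structure_changed are made up for this example. The same idea should carry over to any HTML parser.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every opening tag in document order, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_profile(html):
    """Return (tag histogram, ordered list of tags) for an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return Counter(collector.tags), collector.tags


def structure_changed(reference_html, new_html, check_order=False):
    """Compare tag counts, or the exact tag sequence if check_order is True."""
    ref_hist, ref_order = tag_profile(reference_html)
    new_hist, new_order = tag_profile(new_html)
    if check_order:
        return ref_order != new_order
    return ref_hist != new_hist


if __name__ == "__main__":
    old = "<div><h1>Price</h1><p>10</p></div>"
    new = "<section><h2>Price</h2><span>42</span></section>"
    print(structure_changed(old, new))
```

On pages where the number of rows or list items varies with the data, an exact comparison like this would probably be too strict. In that case you might compare only the set of tag names, or only the part of the page you actually scrape.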

PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.

BrianLy
@BrianLy: Just because the new HTML file has a different hash does not mean the HTML structure has changed.
codaddict
The hashes will *always* differ because the data I'm scraping changes on an hourly basis! What I meant was, what if they changed the design of the site, how can that be detected in an efficient way?
Yeti
Dynamic pages will consistently produce different hashes, usually without major structural changes.
Tim Post
You are correct, I misread the original question. I've added more detail on how you might compare the structure. I would tend to use the hash as an initial check before doing something more complicated to save on some processing. The value is going to depend on the number of pages.
BrianLy
Hashing is not a valid strategy.
systempuntoout
+4  A: 

I don't think there is a clean solution if you are scraping a page whose content changes.

I have developed several Python scrapers, and I know how frustrating it can be when a site makes even a subtle change to its layout.

You could try a solution à la mechanize (I don't know the PHP counterpart), and if you are lucky you could isolate the content you need to extract (links?).

Another possible approach would be to code some constraints and check them before storing anything in the database.

For example, if you are scraping URLs, verify that what the scraper has parsed is a well-formed URL; do the same for integer IDs or anything else you scrape that can be recognized as valid.
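A minimal sketch of such constraint checks, in Python with only the standard library. The field names "url" and "id" and the helper names is_valid_url, is_valid_id and validate_record are assumptions made for illustration, not part of any particular scraper.

```python
from urllib.parse import urlparse


def is_valid_url(value):
    """True if value looks like an absolute http(s) URL with no markup debris."""
    if not isinstance(value, str):
        return False
    value = value.strip()
    # Whitespace, quotes or angle brackets usually mean leftover HTML was captured.
    if any(ch.isspace() or ch in '<>"' for ch in value):
        return False
    try:
        parts = urlparse(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_valid_id(value):
    """True if value is a non-empty string of ASCII digits."""
    return isinstance(value, str) and value.isascii() and value.isdigit()


def validate_record(record):
    """Return the names of the fields that failed their check (empty means OK)."""
    checks = {"url": is_valid_url, "id": is_valid_id}
    return [name for name, check in checks.items() if not check(record.get(name))]


if __name__ == "__main__":
    scraped = {"url": "<td class='x'>", "id": "7"}
    failed = validate_record(scraped)
    if failed:
        print("Refusing to store record; failed checks:", failed)
```

If a check fails for most records in a single run, that is a reasonable hint that the layout has changed and not just that one row is bad.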

If you are scraping plain text, it will be more difficult to check.

systempuntoout
Why the downvote? Please add a comment if you downvote.
systempuntoout
Hey, that was me. Sorry, it was not intentional! I pressed the wrong button and now I'm not able to change it. It says "vote too old to be changed, unless this answer is edited". Sorry again; please make some change to the answer so that I can vote it up.
Yeti
Edited. No problem at all :)
systempuntoout
A: 

explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.

spender
Anyone care to explain their downvote?
spender
A: 

If you want to know changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with new one.

There are a lot of ways you can do it: a SAX parser, a DOM parser, etc.

I have a small blog which will give some pointers to what I mean http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html

Or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or a DOM utility parser.

You need to be very careful when trying to use XML parsers with HTML. They tend to blow up at the slightest malformed HTML.
BrianLy
+1  A: 

Speaking out of my ass here, but it's possible you might want to look at some of PHP's Document Object Model methods.

http://php.net/manual/en/book.dom.php

If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
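To make that idea concrete, here is a hedged sketch in Python (not PHP, and not the DOM extension linked above) that reduces a page to a skeleton of tag names and nesting depths, drops all text, and hashes the result. The names SkeletonBuilder and structure_fingerprint are invented for this example, and the depth tracking is naive, so badly nested HTML will produce a less meaningful (though still repeatable) skeleton.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class SkeletonBuilder(HTMLParser):
    """Collects 'depth:tag' entries and ignores text and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append(f"{self.depth}:{tag}")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def structure_fingerprint(html):
    """Hash of the tag skeleton; unchanged when only the text content changes."""
    builder = SkeletonBuilder()
    builder.feed(html)
    builder.close()
    skeleton = "\n".join(builder.lines)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    hourly_1 = "<table><tr><td>10:00</td><td>5</td></tr></table>"
    hourly_2 = "<table><tr><td>11:00</td><td>9</td></tr></table>"
    print(structure_fingerprint(hourly_1) == structure_fingerprint(hourly_2))
```

Storing the fingerprint from a known-good scrape and comparing it on each run gives a cheap check before any data reaches the database. Unlike a hash of the raw page, it should not change just because the hourly data did.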

(By the way, when I wanted an email notification as soon as the bar exam results were posted on a particular page, I just compared file_get_contents() values. Surprisingly, it worked flawlessly: no false positives, and it emailed me as soon as the site posted the content.)

phphelpplz