I need to compare a webpage's DOM structure at various points in time. What are the ways to retrieve and snapshot it?

I need the DOM on server-side for processing.

I basically need to track structural changes to a webpage, such as the removal of a div tag or the insertion of a p tag. Changes to the data (innerHTML) inside those tags should not be seen as a difference.

+4  A: 
$html_page = file_get_contents("http://awesomesite.com");
$html_dom = new DOMDocument();
$html_dom->loadHTML($html_page);

That uses PHP DOM. Very simple and actually a bit fun to use. Reference

EDIT: After clarification, a better answer lies here.
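
For the structure-only comparison the question asks about, one possible approach is to reduce each parsed document to a tag-only outline and compare the outlines as strings. The following is a minimal sketch, assuming PHP 7.1+ with the DOM extension enabled. `domSkeleton` and `htmlSkeleton` are hypothetical helper names, not part of PHP DOM, and the sketch assumes that attributes and text content should both be ignored.

```php
<?php
// Hypothetical helper: reduce a DOM subtree to an indented, tag-only outline.
function domSkeleton(DOMNode $node, int $depth = 0): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        // Only element nodes count as structure; text, comments and the
        // doctype are skipped, so changed innerHTML text is invisible here.
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        $out .= str_repeat('  ', $depth) . $child->nodeName . "\n";
        $out .= domSkeleton($child, $depth + 1);
    }
    return $out;
}

// Hypothetical helper: parse an HTML string and return its outline.
function htmlSkeleton(string $html): string
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    return domSkeleton($dom);
}
```

Two snapshots then have the same structure exactly when `htmlSkeleton($old) === htmlSkeleton($new)`. Because the outline is plain text with one tag per line, it could also be written to a file and fed to a line-based tool such as `diff` to see where the structure changed.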

Codeacula
Nice! How then do I compare two DOM objects?
gAMBOOKa
Define compare for me.
Codeacula
I basically need to track structural changes to a webpage, such as the removal of a div tag or the insertion of a p tag. Changes to the data (innerHTML) inside those tags should not be seen as a difference.
gAMBOOKa
You should update your question with this information, then, because it goes beyond what you originally asked. I can easily tell you how to retrieve the DOM, but I'm at a loss for an easy way to compare two DOMs. I would likely end up iterating over the DOM myself in a recursive function and comparing it against the last saved instance.
Codeacula
@Codeacula: maybe an "easy" way to compare the DOM would be to iterate through DOM, output the nodes in plain text format, and then use a diff?
MainMa
If you don't get an answer, a 'simple' way would be to use a readily available tool like `diff` and load the output into the program.
Codeacula
@MainMa You got in while I was answering. They wouldn't even need to go through the DOM: just load the site using `file_get_contents` and compare it with the last saved version.
Codeacula
Dear downvoter: Why? I can't fix the answer if you ninja in a downvote and leave like an AC off to another question.
Codeacula
@Codeacula: if I understand the original question correctly, it *is* required to pass through the DOM, since only the structure, and not the HTML, is compared; the contents of the elements don't matter. In other words, there is no difference between `<div>Hello</div>` and `<div>Hello World!</div>`, but there is a difference between `<div>A</div>` and `<div><span>A</span></div>`.
MainMa
Duh. I totally forgot that stipulation.
Codeacula
+2  A: 

Perform the following steps on server-side:

  • Retrieve a snapshot of the webpage via HTTP GET
  • Save consecutive snapshots of a page with different names for later comparison
  • Compare the files with an HTML-aware diff tool (see HtmlDiff tool listing page on ESW wiki).

As a proof-of-concept example with Linux shell, you can perform this comparison as follows:

wget --output-document=snapshot1.html http://example.com/
wget --output-document=snapshot2.html http://example.com/
diff snapshot1.html snapshot2.html

You can of course wrap up these commands into a server-side program or a script.
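
As one way to wrap the "save consecutive snapshots with different names" step in PHP, the sketch below stores each snapshot under a UTC-timestamped file name and looks up the two newest ones for comparison. The helper names `saveSnapshot` and `latestTwoSnapshots`, the `snapshot-*.html` naming scheme, and the directory layout are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: store one snapshot under a timestamped name.
function saveSnapshot(string $html, string $dir, ?int $time = null): string
{
    $time = $time ?? time();
    // gmdate() keeps names in UTC, so they sort chronologically as strings.
    $path = rtrim($dir, '/') . '/snapshot-' . gmdate('Ymd-His', $time) . '.html';
    if (file_put_contents($path, $html) === false) {
        throw new RuntimeException("Could not write $path");
    }
    return $path;
}

// Hypothetical helper: return the paths of the two newest snapshots, oldest first.
function latestTwoSnapshots(string $dir): array
{
    $files = glob(rtrim($dir, '/') . '/snapshot-*.html') ?: [];
    sort($files); // timestamped names sort in chronological order
    return array_slice($files, -2);
}
```

A scheduled job could call `saveSnapshot(file_get_contents('http://example.com/'), $dir)` on each run and then pass the two files returned by `latestTwoSnapshots($dir)` to whichever diff tool is used.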

For PHP, I suggest taking a look at daisydiff-php. It provides a PHP class that lets you build an HTML-aware diff tool easily. Example:

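<?php
require_once('HTMLDiff.php');
// Load the two saved snapshots that should be compared.
$file1 = file_get_contents('snapshot1.html');
$file2 = file_get_contents('snapshot2.html');
// Call syntax as reported by the asker in the comments; check HTMLDiff.php
// for how HTMLDiffer has to be instantiated before this call.
HTMLDiffer->htmlDiffer( $file1, $file2 );
?>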

Note that file_get_contents can also retrieve data directly from a URL.

DaisyDiff itself is also a very fine tool for visualising structural changes.

jsalonen
I'd prefer not to make changes to the actual pages, but rather retrieve them remotely and track the changes.
gAMBOOKa
Thank you for the additional information. I changed my answer accordingly. The point is that the same approach applies also for server-side only processing.
jsalonen
PHP. I mentioned it as a tag; I guess I should have been clearer.
gAMBOOKa
I admire your dedication to help me out. Although I'm struggling with a weird 'No memory' error, I must say I'm now on the right track with this tool. Just a note though. The actual parsing syntax for that tool is `HTMLDiffer->htmlDiffer( $file1, $file2 );`
gAMBOOKa
Thank you for the syntax correction. Good luck with this work. Please ask for additional details or open a new question if you run into trouble again!
jsalonen
A: 

If you use Firefox, Firebug lets you view the DOM structure of any web page.

luq
I know, but I need to implement this in an application and process the DOM there.
gAMBOOKa