I need to take two blocks of text containing HTML tags and render a comparison: merge the two blocks, then highlight what was added or removed from one version to the next.

I have used the PEAR Text_Diff class to successfully render comparisons of plain text, but when I feed it text containing HTML tags, it gets UGLY. Because the class uses word- and character-based comparison algorithms, HTML tags get broken and I end up with ugly output like <p><span class="new"> </</span>p>. It slaughters the HTML.

Is there a way to generate a text comparison while preserving the original, valid HTML markup?

Thanks for the help. I've been working on this for weeks :[

This is the best solution I could think of: find/replace each type of HTML tag with one special non-standard character, like the Apple logo (Opt+Shift+K), render the comparison against this kind of primitive markup, then convert the non-standard characters back into tags. Any feedback?

A: 

Try running your HTML blocks through this function first:

htmlentities();

That should convert every "<" and ">" into its corresponding entity (&lt; and &gt;), perhaps fixing your problem.

//Example:
require_once 'Text/Diff.php';
require_once 'Text/Diff/Renderer.php';

$html_1 = "<html><head></head><body>Something</body></html>";
$html_2 = "<html><head></head><body><p id='abc'>Something Else</p></body></html>";

//Below code adapted from http://www.go4expert.com/forums/showthread.php?t=4189.
//Not sure if/how it works exactly

//Text_Diff compares arrays of lines, so split each escaped string on newlines
$diff = new Text_Diff('auto', array(
    explode("\n", htmlentities($html_1)),
    explode("\n", htmlentities($html_2)),
));
$renderer = new Text_Diff_Renderer();
echo $renderer->render($diff);
Mike Trpcic
Thanks for the swift answer... but that would actually make the problem worse :/ because then the tags would be converted into even longer multi-character strings, which the compare class will break apart. The end result needs to be valid HTML markup so that it can be shown on a webpage. I don't want the end user to see any HTML tags; they need to see rendered HTML on a page. The text I'm dealing with can be thought of as blog articles: just h, p, a, and img tags. I just want to add highlighting to show what changed.
Stephen Gacka
+1  A: 

Simple Diff, by Paul Butler, looks as though it's designed to do exactly what you need: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php

Notice in his PHP code that there's an HTML wrapper: htmlDiff($old, $new)

(His blog post on it: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/)

micahwittman
This algorithm works much better than the PEAR one. Thanks for pointing out the resource.
Stephen Gacka
Great. You're most welcome.
micahwittman
+1  A: 

The problem seems to be that your diff program should be treating existing HTML tags as atomic tokens rather than as individual characters.

If your engine has the ability to limit itself to working on word boundaries, see if you can override the function that determines word boundaries so it recognizes and treats HTML tags as a single "word".
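As a rough illustration of treating tags as single "words", here is a minimal sketch of a tokenizer. The function name tokenizeHtml is hypothetical (it is not part of PEAR Text_Diff or any other library), and the regular expression assumes simple tags whose attribute values do not contain ">".

```php
<?php
// Hypothetical helper, not part of any diff library: splits an HTML string
// into tokens so that every tag is one token and plain text is split into words.
// Assumption: attribute values never contain a ">" character.
function tokenizeHtml(string $html): array
{
    // The capture group makes preg_split keep the tags themselves in the result.
    $parts = preg_split(
        '/(<[^>]+>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $tokens = [];
    foreach ($parts as $part) {
        if (preg_match('/^<[^>]+>$/', $part)) {
            // A complete tag stays atomic, attributes and all.
            $tokens[] = $part;
        } else {
            // Plain text between tags is split on whitespace into words.
            foreach (preg_split('/\s+/', $part, -1, PREG_SPLIT_NO_EMPTY) as $word) {
                $tokens[] = $word;
            }
        }
    }
    return $tokens;
}

print_r(tokenizeHtml('<p id="abc">Something <b>else</b></p>'));
```

The resulting arrays could then be handed to whatever diff routine you use, provided it compares arrays of tokens rather than raw strings.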

You could also do as you are saying and create a lookup dictionary of distinct HTML tags that replaces each with a distinct unused Unicode value (Unicode reserves Private Use Areas, such as U+E000 to U+F8FF, for exactly this kind of thing). However, if you do this, any change to the markup will be treated as a change to the previous or following word, because the Unicode character becomes part of that word as far as the tokenizer is concerned. Adding a space before and after each of your token characters would keep the HTML tag changes separate from the plain text changes.
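The lookup-dictionary idea might look something like the sketch below. All three function names are hypothetical, the placeholders are drawn from the Private Use Area starting at U+E000, and the tag pattern again assumes no ">" inside attribute values. It is one possible shape for the swap, not a tested production routine.

```php
<?php
// Hypothetical helpers illustrating the tag <-> placeholder swap.

// Encode the n-th Private Use Area code point (U+E000 + n) as UTF-8 by hand,
// so the sketch does not depend on the mbstring extension.
function placeholderChar(int $index): string
{
    $cp = 0xE000 + $index; // stays inside the PUA for up to 6400 distinct tags
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Replace every tag with a space-padded placeholder; $map collects tag => char.
function tagsToPlaceholders(string $html, array &$map): string
{
    return preg_replace_callback('/<[^>]+>/', function (array $m) use (&$map) {
        $tag = $m[0];
        if (!isset($map[$tag])) {
            $map[$tag] = placeholderChar(count($map));
        }
        // The padding keeps the placeholder a separate "word" for the diff.
        return ' ' . $map[$tag] . ' ';
    }, $html);
}

// Reverse the swap: restore padded placeholders first, then any bare ones
// whose padding was consumed by the diff output.
function placeholdersToTags(string $text, array $map): string
{
    foreach ($map as $tag => $char) {
        $text = str_replace([' ' . $char . ' ', $char], $tag, $text);
    }
    return $text;
}

$map = [];
$old = tagsToPlaceholders('<p>Something</p>', $map);
$new = tagsToPlaceholders('<p id="abc">Something else</p>', $map);
// ... run the diff on $old and $new here, then restore the tags:
echo placeholdersToTags($old, $map), "\n";
```

Sharing one $map across both versions means identical tags get identical placeholders, so an unchanged tag compares as unchanged.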

richardtallent
The unicode token find/replace is what finally worked. I just did a key=>value array with each opening and closing tag and its associated unicode character. Then I generated the comparison, and reversed the token/tag swap.
Stephen Gacka
I also found Paul Butler's Simple Diff script to work much better for long text than the PEAR package. PEAR compares word by word, whereas Butler's script produced better output, with differences kept together in chunks. Link: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php
Stephen Gacka
A: 

Use the Pretty Diff tool, which has options for both markup and diffing. Please read the documentation to check whether any of its limitations are incompatible with your needs.

http://mailmarkup.org/prettydiff/prettydiff.html

A: 

What about running an HTML tidier/formatter on each block first? This will create a standard "structure" which your diff might find easier to swallow.

Steve
A: 

A copy of my own answer from here.


What about DaisyDiff (Java and PHP versions available)?

The following features are really nice:

  • Works with badly formed HTML that can be found "in the wild".
  • The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
  • In addition to the default visual diff, HTML source can be diffed coherently.
  • Provides easy to understand descriptions of the changes.
  • The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
elhoim