Hi,
I have to compare different versions of HTML pages for formatting and text changes. Unfortunately the guy/company who creates them uses some kind of HTML editor that re-wraps all the HTML every time (and adds tons of whitespace), which makes the pages hard to diff. So I am looking for a tool (preferably a Java library) that can reformat my HTML so that all insignificant spaces and newlines are removed.
That means, in
<h1>First Headline</h1> <h2>Second headline</h2>
the space between </h1> and <h2> should be removed, but in
<b>formatted</b> <i>text</i>
the whitespace must not be removed. I do not care about <pre>, <textarea> or <script> blocks, nor about CSS white-space properties that can change the behavior. I am just looking for a solution that strips most of the unnecessary whitespace (and I would rather leave too much whitespace in than too little).
(I am already collapsing runs of whitespace into a single space and replacing the space before each tag with a newline to make the text more readable. But there are still too many cases where, for example, an added newline between headlines or table cells/rows breaks my simple "solution".)
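To make the rule I have in mind concrete, here is a rough sketch of it in Java. It is only an illustration under assumptions: the class name HtmlWhitespaceNormalizer, the method normalize and the list of block-level tags are all made up by me, the tag list is incomplete, and the regular expressions do not handle <pre>, <textarea>, <script>, comments or a ">" inside attribute values.

```java
import java.util.regex.Pattern;

class HtmlWhitespaceNormalizer {

    // Hand-picked, incomplete list of block-level tag names (an assumption; extend as needed).
    private static final String BLOCK =
        "(?:html|head|body|title|div|p|h[1-6]|ul|ol|li|dl|dt|dd|"
        + "table|thead|tbody|tfoot|tr|td|th|form|hr|blockquote)";

    // An opening, closing or self-closing block-level tag, e.g. <h1 class="x">, </td>, <hr/>.
    private static final String BLOCK_TAG = "</?" + BLOCK + "(?:\\s[^>]*)?/?>";

    private static final Pattern AFTER_BLOCK =
        Pattern.compile("(" + BLOCK_TAG + ")\\s+", Pattern.CASE_INSENSITIVE);
    private static final Pattern BEFORE_BLOCK =
        Pattern.compile("\\s+(" + BLOCK_TAG + ")", Pattern.CASE_INSENSITIVE);

    static String normalize(String html) {
        // Step 1: collapse every run of whitespace (including newlines) into one space.
        String s = html.replaceAll("\\s+", " ");
        // Step 2: drop whitespace directly after and directly before a block-level tag.
        // Whitespace next to inline tags such as <b> or <i> is left alone.
        s = AFTER_BLOCK.matcher(s).replaceAll("$1");
        s = BEFORE_BLOCK.matcher(s).replaceAll("$1");
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("<h1>First Headline</h1> <h2>Second headline</h2>"));
        System.out.println(normalize("<b>formatted</b> <i>text</i>"));
    }
}
```

This errs on the side of keeping whitespace: anything next to a tag that is not in the list stays as a single space. What I am hoping for is a library that does the same thing with a real HTML parser instead of regular expressions.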