How can I remove the comments and their contents from an HTML file using Java, where the comments are written like:

<!--

Any ideas or help would be appreciated.

+5  A: 

Take a look at JTidy, the Java port of HTML Tidy. You could override the print methods of the PPrint object to ignore the comment tags.

Kees de Kooter
+4  A: 

If you don't have valid XHTML (a comment reminded me of this case), you should first run the HTML through JTidy to tidy it up and turn it into valid XHTML.

See this for example code on JTidy.

Then I'd convert the HTML to a DOM instance.

Like so:

// 'string' holds the tidied XHTML; the parser calls below throw checked exceptions on failure
final DocumentBuilderFactory newFactory = DocumentBuilderFactory.newInstance();
final DocumentBuilder documentBuilder = newFactory.newDocumentBuilder();
Document document = documentBuilder.parse( new InputSource( new StringReader( string ) ) );

Then I'd navigate through the document tree and modify nodes as needed.
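
The tree walk could look roughly like the following. This is only a minimal sketch: it assumes the input is already well-formed XHTML (for example after a JTidy pass) and carries no DOCTYPE that would make the parser fetch a DTD. The class name DomCommentStripper and its method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of JTidy or the JDK.
class DomCommentStripper {

    /** Parses well-formed XHTML, drops every comment node, and serializes the result. */
    static String stripComments(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xhtml)));
            removeComments(document);

            // Identity transform: writes the modified tree back out as text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException("Could not process document", e);
        }
    }

    /** Recursively removes all comment nodes below the given node. */
    private static void removeComments(Node node) {
        NodeList children = node.getChildNodes();
        // Snapshot the children first: the NodeList is live and shrinks as nodes are removed.
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            snapshot.add(children.item(i));
        }
        for (Node child : snapshot) {
            if (child.getNodeType() == Node.COMMENT_NODE) {
                node.removeChild(child);
            } else {
                removeComments(child);
            }
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><!-- head note --><body><p>hi<!-- inline --></p></body></html>";
        System.out.println(stripComments(xhtml));
    }
}
```

If the comments never need to be inspected, calling setIgnoringComments(true) on the DocumentBuilderFactory before parsing should give the same result without the manual walk.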

dhiller
Most HTML around is still not XHTML, so JTidy should probably be the first option, not an afterthought.
Joachim Sauer
+1  A: 

Try a simple regex like:

String commentless = pageString.replaceAll("<!--[\\w\\W]*?-->", "");

edit: to explain the regex:

  • <!-- matches the literal comment start
  • [\w\W] matches every character (even newlines) which will be inside the comment
  • *? matches multiple of the 'any character' but matches the smallest amount possible (not greedy)
  • --> closes the comment
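
As a quick check of the explanation above, here is a small self-contained sketch that applies the pattern to a comment spanning two lines. The class name RegexCommentStripper and the sample page are made up for illustration. Note that inside a Java string literal the backslashes have to be doubled.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class RegexCommentStripper {

    // [\w\W] matches any character, line terminators included; *? is reluctant,
    // so each match stops at the first "-->" that follows its "<!--".
    private static final Pattern COMMENT = Pattern.compile("<!--[\\w\\W]*?-->");

    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>one</p><!-- first\n     comment --><p>two</p><!-- second -->";
        System.out.println(stripComments(page)); // prints <p>one</p><p>two</p>
    }
}
```

An equivalent spelling is Pattern.compile("<!--.*?-->", Pattern.DOTALL), where the DOTALL flag lets '.' match line terminators as well.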
cobbal
A simple regex should be able to do the job - but this one doesn't ... comments are not always opened and closed on the same line. I just found this link on Google that seems better: http://ostermiller.org/findhtmlcomment.html
Simon Groenewolt
If you try this, it works. The \w\W catches everything, including newlines, unlike '.'.
cobbal
Not exactly sure why this is downvoted. Regardless of whether or not this particular RegEx works, RegEx IS the way to go here.
Dalin Seivewright
No, it isn't. It would remove "comment" from this too: <input type="text" value="<!-- Hello world -->">, which would be incorrect. <!-- doesn't always start the comment.
Peter Štibraný
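
The objection above can be reproduced with a few lines. This is only an illustrative sketch (the class name RegexFalsePositive is made up): the regex has no notion of attribute values, so it deletes the text between the quotes.

```java
// Hypothetical demo class for illustration only.
class RegexFalsePositive {

    static String stripComments(String html) {
        // Same pattern as in the answer; backslashes doubled for the Java string literal.
        return html.replaceAll("<!--[\\w\\W]*?-->", "");
    }

    public static void main(String[] args) {
        String html = "<input type=\"text\" value=\"<!-- Hello world -->\">";
        // The attribute value is not a comment, but the regex removes it anyway.
        System.out.println(stripComments(html)); // prints <input type="text" value="">
    }
}
```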
Good point. Is it legal to have < in a string? I'm fairly sure > will throw most browsers.
cobbal
Hmm, the ones I tried (FF, IE) simply displayed a text field with the <!-- Hello world --> text inside. But you're right ... < should not be unescaped, I guess. (In XML it would be an error... On the other hand, in XML a simple counter-example would be <![CDATA[<!-- hello world -->]]> :-) )
Peter Štibraný
Or a processing instruction. And native HTML comments have a more complicated syntax than just <!--...--> anyway (e.g. multiple instances of ‘--’). Regex is in general not powerful enough to parse even valid XML/HTML, let alone real-world tag soup.
bobince
The question asked specifically about <!-- though. I'm not saying that a regex will work for everything, but that it may be all that you need in some cases.
cobbal
Ah - I didn't really catch the \W bit that would allow for whitespace/line breaks etc.
Simon Groenewolt