I have a text file that was produced by converting HTML to plain text. I need to get rid of blocks that look like XHTML comments, such as the following:

<!--
if (!document.phpAds_used)
 document.phpAds_used = ',';
 phpAds_random = new String
 (Math.random()); phpAds_random =
 phpAds_random.substring(2,11);
 document.write ("<" + "script
 language='JavaScript'
 type='text/javascript' src='");
 document.write
 ("http://www.writers.net/Openads/adjs.php?n="
 + phpAds_random); document.write ("&what=zone:5&target=_blank");
 document.write ("&exclude=" +
 document.phpAds_used); if
 (document.referrer) document.write
 ("&referer=" +
 escape(document.referrer));
 document.write ("'><" + "/script>");
 // -->

How can I get rid of anything between <!-- and //--> using Java?

+1  A: 

A simple solution would be to use the String.replaceAll() method.

For example, something like the following code should work:

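String x = "wow <!-- // --> zip, here's <!-- comment here //--> another one";
// (?s) turns on DOTALL so that . also matches line breaks inside a comment
x = x.replaceAll("(?s)<!--.*?//\\s*-->", "");
System.out.println(x);  // prints out "wow  zip, here's  another one"

The \\s* matches zero or more whitespace characters, since your example had a space before --> but your description did not. The .*? makes this a non-greedy match, so it stops at the first //--> instead of running on to the last one. The (?s) flag is needed because your comments span several lines and . does not match line terminators by default.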
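To make the difference between greedy and non-greedy matching concrete, here is a minimal sketch that runs both variants on the same one-line input. The class name GreedyVsLazy and its two helper methods are made up for illustration only.

```java
final class GreedyVsLazy {
    // Greedy: .* grabs as much as possible, so the match runs from the
    // first <!-- to the LAST //--> on the line.
    static String greedy(String s) {
        return s.replaceAll("<!--.*//\\s*-->", "");
    }

    // Non-greedy: .*? grabs as little as possible, so each match stops
    // at the first //--> that follows its <!--.
    static String lazy(String s) {
        return s.replaceAll("<!--.*?//\\s*-->", "");
    }

    public static void main(String[] args) {
        String x = "wow <!-- // --> zip, here's <!-- comment here //--> another one";
        System.out.println(greedy(x)); // text between the two comments is lost
        System.out.println(lazy(x));   // only the comments themselves are removed
    }
}
```

The greedy version swallows "zip, here's" along with both comments, which is why the non-greedy form is the one you want here.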

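If you are running this over and over, compile the Pattern once and create a new matcher for each block you are processing. Pattern.DOTALL is the flag that lets . match line breaks, which multi-line comments need:

Pattern p = Pattern.compile("<!--.*?//\\s*-->", Pattern.DOTALL);  // compile once
x = p.matcher(x).replaceAll("");  // reuse p for every block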
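
Putting the pieces together, a self-contained sketch might look like the following. It assumes the file contents are already in a String; the class name CommentStripper, the constant AD_COMMENT, and the method strip are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

final class CommentStripper {
    // Compiled once and reused for every block of text.
    // DOTALL lets '.' match line terminators, so multi-line comments are covered.
    private static final Pattern AD_COMMENT =
            Pattern.compile("<!--.*?//\\s*-->", Pattern.DOTALL);

    // Removes every span that starts with <!-- and ends with //--> or // -->.
    static String strip(String text) {
        return AD_COMMENT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "before\n"
                + "<!--\n"
                + "if (!document.phpAds_used)\n"
                + " document.write (\"http://www.writers.net/Openads/adjs.php?n=\");\n"
                + " // -->\n"
                + "after";
        System.out.println(strip(text));
    }
}
```

Note that the // inside the URL does not end the match early, because the pattern only stops at a // that is followed by optional whitespace and then -->.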
Gray