I'm trying to write a regex that removes all but a handful of closing XML tags.

The code seems simple enough:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String stringToParse = "<body><xml>some stuff</xml></body>";
Pattern pattern = Pattern.compile("</[^(a|em|li)]*?>");
Matcher matcher = pattern.matcher(stringToParse);
stringToParse = matcher.replaceAll("");

However, when this runs, it skips the "xml" closing tag. It seems to skip any tag where there is a matching character in the compiled group (a|em|li), i.e. if I remove the "l" from "li", it works.

I would expect this to return the following string: "<body><xml>some stuff" (I am doing additional parsing to remove the opening tags but keeping it simple for the example).

+4  A: 

You probably shouldn't use regex for this task, but let's see what happens...

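Your problem is that you are using a negated character class, and a character class can only contain individual characters (and ranges), not sub-expressions. [^(a|em|li)] therefore means "any single character except (, a, |, e, m, l, i or )", which is why </xml> is skipped: its name contains m and l. You could try a negative lookahead instead: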

"</(?!a|em|li).*?>"

But this won't handle a number of cases correctly:

  • Comments containing things that look like tags.
  • Tags as strings in attributes.
  • Tags that start with a, em, or li but are actually other tags (e.g. </address> or </link> would be kept).
  • Tag names in capital letters (e.g. </A> or </LI> would be removed).
  • etc...

You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.
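
If you do stay with a regex, here is a minimal sketch of how two of the problems above (prefix matches and capital letters) might be tightened up. It assumes the names to keep are exactly a, em and li, and the class and method names are made up for illustration. It still does nothing about comments or tags inside attribute values.

```java
import java.util.regex.Pattern;

class ClosingTagFilter {
    // (?i)                 ignore case, so </LI> is treated like </li>
    // (?!(?:a|em|li)\s*>)  block the match only when the whole name is a, em or li
    // [^>]*>               otherwise consume the rest of the closing tag
    private static final Pattern UNWANTED_CLOSING_TAG =
            Pattern.compile("(?i)</(?!(?:a|em|li)\\s*>)[^>]*>");

    static String stripClosingTags(String input) {
        return UNWANTED_CLOSING_TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripClosingTags("<body><xml>some stuff</xml></body>"));
        System.out.println(stripClosingTags("<li><a>x</a></li><address>y</address>"));
    }
}
```

The difference from the simpler lookahead is the \s*> inside it: the match is blocked only when the tag name ends right after a, em or li, so </address> is no longer mistaken for </a>.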

Mark Byers
+1 for the explanation and the push in the right direction
akf
Awesome, Mark, thanks for the explanation. I did not understand that aspect of character classes.
Chris B
A: 

You cannot use an alternation inside a character class. A character class always matches a single character.
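
To illustrate, here is a small sketch (the class name is made up for this example) that checks which single characters the class from the question accepts, and why the full pattern therefore fails on </xml> but works on </body>. Inside the brackets the parentheses and pipes are literal characters, not grouping or alternation.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // The character class from the question, on its own.
    static final Pattern FROM_QUESTION = Pattern.compile("[^(a|em|li)]");
    // The full pattern from the question.
    static final Pattern ORIGINAL = Pattern.compile("</[^(a|em|li)]*?>");

    public static void main(String[] args) {
        // Each test string is one character; matches() tells us whether the class accepts it.
        for (String s : new String[] {"x", "b", "m", "l", "(", "|"}) {
            System.out.println("'" + s + "' accepted: " + FROM_QUESTION.matcher(s).matches());
        }
        // "xml" contains 'm' and 'l', both excluded by the class, so nothing matches.
        System.out.println("</xml> found:  " + ORIGINAL.matcher("</xml>").find());
        // "body" contains none of the excluded characters, so it matches.
        System.out.println("</body> found: " + ORIGINAL.matcher("</body>").find());
    }
}
```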

You likely want to use a negative lookahead or lookbehind instead:

"</(?!a|em|li).*?>"
Anon.
+1  A: 

I would really use a proper parser for this (e.g. JTidy). You can't reliably parse XML/HTML with regular expressions: neither is a regular language, and there is no end of edge cases. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable third-party library such as JTidy, and configure the output accordingly.
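
As a rough sketch of what the JAXP route could look like, the code below parses the input into a DOM and "unwraps" every element whose name is not in the keep list, i.e. it drops the tag but keeps its children. This assumes the input is well-formed XML and that the unwanted tags should disappear entirely (opening and closing). The class name, method names and the synthetic wrapper element are all invented for this example; real-world HTML would need something like JTidy first.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TagUnwrapper {
    // Element names to keep (assumption: same list as in the question).
    private static final Set<String> KEEP = new HashSet<>(Arrays.asList("a", "em", "li"));

    // Depth-first: replace each unwanted child element with its own children.
    static void unwrapChildren(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before we modify the tree
            unwrapChildren(child);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && !KEEP.contains(child.getNodeName().toLowerCase(Locale.ROOT))) {
                while (child.getFirstChild() != null) {
                    parent.insertBefore(child.getFirstChild(), child);
                }
                parent.removeChild(child);
            }
            child = next;
        }
    }

    // Returns the cleaned markup, still enclosed in the synthetic <wrapper> element.
    static String strip(String xmlFragment) {
        try {
            // A synthetic root lets the fragment's own top-level element be unwrapped too.
            String wrapped = "<wrapper>" + xmlFragment + "</wrapper>";
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            unwrapChildren(doc.getDocumentElement());

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(strip("<body><xml>some <em>stuff</em></xml></body>"));
    }
}
```

Because the parser has already dealt with comments, attribute values and case, none of those need special handling here; the price is that the input has to be well-formed.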

See this answer for more passionate info re. parsing XML/HTML via regexps.

Brian Agnew