ansaurus

Question

Java regex to retain specific closing tags

Answer 1

+4 A:

You probably shouldn't use regex for this task, but let's see what happens...

Your problem is that you are using a negated character class, and a character class can only contain individual characters (or ranges of them), not complex expressions. You could try a negative lookahead instead:

"</(?!a|em|li).*?>"

But this won't handle a number of cases correctly:

Comments containing things that look like tags.
Tags as strings in attributes.
Tags whose names start with a, em, or li but are actually other tags (for example, </abbr> or </link>).
Capital letters (for example, </EM>).
etc...

You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.
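As a rough illustration, two of the problems listed above (names that merely start with a, em, or li, and capital letters) can be handled by making the lookahead check the whole tag name and compiling the pattern case-insensitively. This is only a sketch: the class and method names are made up for the example, and it assumes the goal is to delete every closing tag other than </a>, </em>, and </li>. Comments and attribute strings that contain tag-like text are still not handled.

```java
import java.util.regex.Pattern;

class ClosingTagFilter {
    // Matches any closing tag whose name is NOT exactly a, em, or li.
    // The lookahead requires the name to be followed by optional whitespace
    // and '>', so </abbr> or </link> are not mistaken for </a> or </li>.
    private static final Pattern OTHER_CLOSING_TAGS = Pattern.compile(
            "</(?!(?:a|em|li)\\s*>)[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Assumption: "retain" means removing all other closing tags from the input.
    static String stripOtherClosingTags(String html) {
        return OTHER_CLOSING_TAGS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(
                stripOtherClosingTags("<p><a href=\"#\">x</a> <EM>y</EM></p><abbr>z</abbr>"));
    }
}
```

With the sample input in main, </a> and </EM> should survive while </p> and </abbr> are removed.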

Mark Byers 2010-02-02 22:52:06

+1 for the explanation and the push in the right direction

akf 2010-02-02 22:53:13

Awesome, Mark, thanks for the explanation. I did not understand that aspect of character classes.

Chris B 2010-02-02 22:59:47

Answer 2

A:

You cannot use an alternation inside a character class. A character class always matches a single character.
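To make that concrete, here is a small sketch. The pattern in it is hypothetical (the regex from the question is not reproduced here); it just shows what happens when alternation syntax is written inside square brackets: the | becomes an ordinary member of the class, and the class still consumes exactly one character.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical attempt at "not a, em, or li".
    // [^a|em|li] is ONE negated class over the characters a, |, e, m, l, i.
    static final Pattern CLASS_ATTEMPT = Pattern.compile("</[^a|em|li]>");

    // True only if the whole input matches the pattern.
    static boolean matchesWhole(String input) {
        return CLASS_ATTEMPT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("</p>"));   // one character outside the class
        System.out.println(matchesWhole("</div>")); // three characters, the class takes only one
        System.out.println(matchesWhole("</m>"));   // 'm' is a member of the class
    }
}
```

So a tag like </div> is never matched, and single-letter tags such as </m> are excluded only because the letter happens to appear in the class.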

You likely want to use a negative lookahead or lookbehind instead:

"</(?!a|em|li).*?>"

Anon. 2010-02-02 22:52:58

Answer 3

+1 A:

I would really use a proper parser for this (e.g. JTidy). You can't reliably parse XML/HTML using regular expressions, because the language is not regular and the edge cases never end. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable third-party library (such as JTidy, mentioned above) and configure your output accordingly.
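A minimal sketch of the JAXP route might look like the following. It rests on several assumptions: the input is a well-formed XML fragment (real-world HTML would first need cleaning with something like JTidy), the goal is to keep only a, em, and li elements while preserving the text of everything else, and the class name, method name, and wrapper element "root" are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TagUnwrapper {
    // Element names to keep; everything else is replaced by its children.
    private static final Set<String> KEEP = Set.of("a", "em", "li");

    static String unwrapOtherTags(String xmlFragment) {
        try {
            // Wrap the fragment in a single root so it parses as a document.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + xmlFragment + "</root>")));
            unwrapChildren(doc.getDocumentElement());

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    private static void unwrapChildren(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before modifying the tree
            if (child instanceof Element) {
                unwrapChildren(child); // handle descendants first
                if (!KEEP.contains(child.getNodeName().toLowerCase())) {
                    // Move the element's children up, then drop the element itself.
                    while (child.getFirstChild() != null) {
                        parent.insertBefore(child.getFirstChild(), child);
                    }
                    parent.removeChild(child);
                }
            }
            child = next;
        }
    }
}
```

Note that this removes both the opening and the closing tag of each unwanted element, which is usually what is wanted once a real parser is involved; the result is still wrapped in the hypothetical root element.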

See this answer for more passionate info re. parsing XML/HTML via regexps.

Brian Agnew 2010-02-02 23:10:43
