ansaurus

Question

Regular expression to match empty HTML tags that may contain embedded JSTL?

Answer 1

+2 A:

It's not a good idea to use regexes for HTML, as there are many constructs that most regex systems cannot match. Also, much HTML (as opposed to XHTML) contains many difficult constructs. I suggest you use an HTML parser. (This has been frequently addressed on SO, and the universal answer is: don't use regex.)

peter.murray.rust 2009-11-10 05:14:26

Thx, but see edit.

Keith Bentrup 2009-11-10 05:24:31

Why is this the answer to every regexp question? This is not the correct answer. He doesn't want to parse HTML; he wants to match a pattern in text, which is exactly what regular expressions were designed for.

Rob 2009-11-10 21:45:32

It's not the answer to every regex question. It's just the answer when regexes are the inferior tool.

brian d foy 2009-11-10 22:05:04

The OP asked for "a regular expression to look for empty html tags that may have embedded JSTL" and this was my answer. Looking for any HTML tags with a regex is a poor idea. If the Q had been "how can I parse this JSTL with a regex", the answer might have been different.

peter.murray.rust 2009-11-10 22:07:03

Answer 2

+1 A:

Using an HTML parser doesn't mean you're interpreting or running the content: it means you are transforming it from a string of characters into a nested object. HTML is not regular, so regular expressions aren't the best solution to this problem.

See the docs for HTML::TreeBuilder as a good place to start. Other good resources include HTML::Parser and of course this site. :)
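To make the "string of characters into a nested structure" point concrete, here is a minimal sketch that swaps in Python's standard-library html.parser in place of the Perl modules named above. The class name EmptyTagFinder and the sample markup are made up for illustration, and the sketch assumes well-nested input with no unclosed void elements such as a bare <br>.

```python
from html.parser import HTMLParser


class EmptyTagFinder(HTMLParser):
    """Collect names of elements that contain only whitespace (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one [tag, has_content] entry per currently open element
        self.empty = []   # names of elements that closed without any content

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][1] = True      # a child element counts as content
        self.stack.append([tag, False])

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.stack[-1][1] = True      # non-whitespace text counts as content

    def handle_endtag(self, tag):
        # Assumes well-nested input: the end tag closes the innermost open element.
        if self.stack and self.stack[-1][0] == tag:
            _, has_content = self.stack.pop()
            if not has_content:
                self.empty.append(tag)


def find_empty_elements(markup):
    finder = EmptyTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.empty


# Made-up sample with an EL expression inside an attribute value.
sample = '<div id="${item.id}"><p>  </p><span>text</span></div>'
print(find_empty_elements(sample))
```

The parser tracks nesting with a stack, so attribute values (including embedded expressions) never need to be described by the matching logic itself.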

Edit: I'll pretend that your question has nothing to do with HTML and is just an interesting regex puzzle, and as such will ponder it... ~~...[still thinking.. edit coming]~~ (puzzle abandoned in the face of a really awesome solution presented above)

Ether 2009-11-10 06:07:41

He doesn't want to parse html, he wants to match a pattern in text. I hear regexps are good at that.

Rob 2009-11-10 21:46:32

Technically he should be parsing HTML and then using regexps to search inside the content of particular tags.

Ether 2009-11-10 22:00:57

Answer 3

+3 A:

Try

<(\w+)(?:\s+\w+="[^"]+(?:"\$[^"]+"[^"]+)?")*>\s*</\1>

A short explanation:

<            # match a '<'
(\w+)        # match one or more a-z, A-Z, 0-9 or '_' and store it in group 1 
(?:          # open non-matching-group 1
  \s+        #   match one or more white space characters 
  \w+        #   match one or more a-z, A-Z, 0-9 or '_'
  ="         #   match '="'
  [^"]+      #   match one or more characters other than '"'
  (?:        #   open non-matching-group 2
    "\$      #     match '"$'
    [^"]+    #     match one or more characters other than '"'
    "        #     match '"'
    [^"]+    #     match one or more characters other than '"'
  )?         #   close non-matching-group 2, and make it optional
  "          #   match '"'
)*           # close non-matching-group 1, and make repeat itself zero or more times
>            # match '>'
\s*          # match zero or more white space characters
</\1>        # match '</X>' where `X` is what is captured in group 1

This works for both your examples, but I am sure someone can construct HTML that you want to match but that the regex will not match. After reading your 'edit', though, it seems you are aware of that.
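As a quick way to try the pattern, here is a small Python sketch. The pattern is copied unchanged from above; the helper name find_empty_tags and the sample strings are made up for illustration and are not the examples from the question.

```python
import re

# The pattern from this answer, unchanged.
EMPTY_TAG = re.compile(r'<(\w+)(?:\s+\w+="[^"]+(?:"\$[^"]+"[^"]+)?")*>\s*</\1>')


def find_empty_tags(text):
    """Return the tag names of the empty elements the pattern finds in text."""
    return [m.group(1) for m in EMPTY_TAG.finditer(text)]


# Hypothetical inputs, chosen only to exercise the pattern.
samples = [
    '<div></div>',                            # no attributes
    '<td class="x" id="${item.id}">  </td>',  # attributes, whitespace-only body
    '<p>text</p>',                            # not empty
    '<div></span>',                           # closing tag does not match group 1
    '<div class=""></div>',                   # empty attribute value: [^"]+ needs one character
]

for s in samples:
    print(s, find_empty_tags(s))
```

The last sample shows one of the gaps mentioned above: an attribute with an empty value is not matched, because [^"]+ requires at least one character.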

Bart Kiers 2009-11-10 07:18:15

+1 for the explanation.

BalusC 2009-11-10 11:09:09

Answer 4

A:

If you assume that your input is valid XML, as you say, my tool of choice would be XML::Twig.

brian d foy 2009-11-10 21:39:08

Answer 5

A:

Based on what I have read, I believe (?: is a non-capturing group, not a non-matching group, so the comment on the regex should be changed.

A non-matching group would be (?! (a negative lookahead).
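The difference between the two constructs can be checked with a short Python sketch; the patterns and input strings here are made up for illustration.

```python
import re

# (?:...) groups without capturing: only the parenthesised (c) becomes a numbered group.
non_capturing = re.match(r"(?:ab)+(c)", "ababc")
captured = non_capturing.groups()

# (?!...) is a negative lookahead: it succeeds only if the sub-pattern does NOT match next,
# and it consumes no characters.
blocked = re.match(r"foo(?!bar)", "foobar")
allowed = re.match(r"foo(?!bar)", "foobaz")

print(captured)         # the single captured group
print(blocked)          # no match, because "bar" follows "foo"
print(allowed.group())  # only "foo" is consumed
```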

2010-04-01 22:18:56
