I'm trying to construct a regular expression to look for empty html tags that may have embedded JSTL. I'm using Perl for my matching.

So far I can match any empty html tag that does not contain JSTL with the following:

/<\w+\b(?!:)[^<]*?>\s*<\/\w+/si

The \b(?!:) avoids matching an opening JSTL tag, but it doesn't address whether JSTL may appear within the HTML tag itself (which is allowable). I only want to know whether the HTML tag has no children (empty or whitespace only). So I'm looking for a pattern that would match both of the following:

<div id="my-id"> 
</div>
<div class="<c:out var="${my.property}" />"></div>

Currently the first div matches. The second does not. Is it doable? I tried several variations using lookahead assertions, and I'm starting to think it's not. However, I can't say for certain or articulate why it's not.
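
For anyone who wants to reproduce this, here is a minimal sketch that runs the pattern above against the two sample tags. The variable names are only illustrative. As far as I can tell, the second tag fails because [^<]*? can never step over the < that opens the embedded c:out tag, so the engine never reaches the > that closes the div.

```perl
use strict;
use warnings;

# Pattern copied from the question.
my $empty_tag = qr/<\w+\b(?!:)[^<]*?>\s*<\/\w+/si;

# The two sample tags from the question.
my $plain = qq{<div id="my-id"> \n</div>};
my $jstl  = q{<div class="<c:out var="${my.property}" />"></div>};

# 1 if the pattern matches the sample, 0 otherwise.
my $plain_matches = $plain =~ $empty_tag ? 1 : 0;
my $jstl_matches  = $jstl  =~ $empty_tag ? 1 : 0;

print "plain div: $plain_matches\n";
print "JSTL div:  $jstl_matches\n";
```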

Edit: I'm not writing something to interpret the code, and I'm not interested in using a parser. I'm writing a script to point out potential issues/oversights. And at this point, I'm curious, too, to see if there is something clever with lookaheads or lookbehinds that I may be missing. If it bothers you that I'm trying to "solve" a problem this way, don't think of it as looking for a solution. To me it's more of a challenge now, and an opportunity to learn more about regular expressions.

Also, if it helps, you can assume that the html is xhtml strict.

+2  A: 

It's not a good idea to use regexes for HTML, as there are many constructs that cannot be matched by most regex systems. Also, much HTML (as opposed to XHTML) has many difficult constructs. I suggest you use an HTML parser. (This has been frequently addressed on SO, and the universal answer is: don't use regex.)

peter.murray.rust
Thx, but see edit.
Keith Bentrup
Why is this the answer to every regexp question? This is not the correct answer. He doesn't want to parse html; he wants to match a pattern in text, which is exactly what regular expressions were designed for.
Rob
It's not the answer to every regex question. It's just the answer when regex are the inferior tool.
brian d foy
The OP asked for "a regular expression to look for empty html tags that may have embedded JSTL" and this was my answer. Looking for any HTML tags with a regex is a poor idea. If the Q had been "how can I parse this JSTL with a regex", the answer might have been different.
peter.murray.rust
+1  A: 

Using an HTML parser doesn't mean you're interpreting or running the content: it means you are transforming it from a string of characters into a nested object. HTML is not regular, so regular expressions aren't the best solution to this problem.

See the docs for HTML::TreeBuilder as a good place to start. Other good resources include HTML::Parser and of course this site. :)

Edit: I'll pretend that your question has nothing to do with HTML and is just an interesting regex puzzle, and as such will ponder it... ...[still thinking.. edit coming] (puzzle abandoned in the face of a really awesome solution presented above)

Ether
He doesn't want to parse html, he wants to match a pattern in text. I hear regexps are good at that.
Rob
Technically he should be parsing html and then using regexps to search inside the content of particular tags.
Ether
+3  A: 

Try

<(\w+)(?:\s+\w+="[^"]+(?:"\$[^"]+"[^"]+)?")*>\s*</\1>

A short explanation:

<            # match a '<'
(\w+)        # match one or more a-z, A-Z, 0-9 or '_' and store it in group 1 
(?:          # open non-matching-group 1
  \s+        #   match one or more white space characters 
  \w+        #   match one or more a-z, A-Z, 0-9 or '_'
  ="         #   match '="'
  [^"]+      #   match one or more characters other than '"'
  (?:        #   open non-matching-group 2
    "\$      #     match '"$'
    [^"]+    #     match one or more characters other than '"'
    "        #     match '"'
    [^"]+    #     match one or more characters other than '"'
  )?         #   close non-matching-group 2, and make it optional
  "          #   match '"'
)*           # close non-matching-group 1, and make repeat itself zero or more times
>            # match '>'
\s*          # match zero or more white space characters
</\1>        # match '</X>' where `X` is what is captured in group 1

This works for both of your examples, but I am sure someone can construct html that you want to match but that will not be matched by the regex. But after reading your 'edit', it seems you are aware of that.
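
To try the pattern from Perl, something along these lines should do. It is only a sketch: is_empty_tag and the sample strings are names made up for illustration, and the /x flag is used so the pattern can be laid out over several lines.

```perl
use strict;
use warnings;

# The pattern from this answer, spread out with /x so comments can sit inline.
my $empty_tag = qr{
    < (\w+)                        # tag name, captured in group 1
    (?:                            # one attribute, repeated zero or more times
        \s+ \w+ = "                #   whitespace, attribute name, opening quote
        [^"]+                      #   value text up to the next double quote
        (?: " \$ [^"]+ " [^"]+ )?  #   optional quoted expression plus trailing text
        "                          #   closing quote
    )*
    > \s* </ \1 >                  # end of start tag, whitespace, matching end tag
}x;

# Hypothetical helper: 1 if the string contains an empty tag, 0 otherwise.
sub is_empty_tag {
    my ($html) = @_;
    return $html =~ $empty_tag ? 1 : 0;
}

my $plain     = qq{<div id="my-id"> \n</div>};
my $jstl      = q{<div class="<c:out var="${my.property}" />"></div>};
my $with_text = q{<div id="x">text</div>};
my $empty_att = q{<div class=""></div>};

print "plain:     ", is_empty_tag($plain),     "\n";
print "jstl:      ", is_empty_tag($jstl),      "\n";
print "with text: ", is_empty_tag($with_text), "\n";
print "empty att: ", is_empty_tag($empty_att), "\n";
```

The last sample is one concrete case that slips through: an attribute with an empty value is not matched, because [^"]+ needs at least one character between the quotes.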

Bart Kiers
+1 for the explanation.
BalusC
A: 

If you assume that your input is valid XML, as you say, my tool of choice would be XML::Twig.

brian d foy
A: 

Based on what I have read, I believe (?: opens a non-capturing group, not a non-matching group, so the comments on the regex should be changed.

A non-matching group would be (?! (a negative lookahead).
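
A small sketch of the difference, using made-up strings: (?:...) still consumes text but does not create a capture, while (?!...) consumes nothing and only succeeds when its contents do not match at that position. The \b(?!:) in the question is an example of the second kind.

```perl
use strict;
use warnings;

# (?:...) groups without capturing: "bar" must be present, but only (foo) is captured.
my @caps = 'foobar' =~ /(foo)(?:bar)/;
print "captures: @caps\n";

# (?!...) is a negative lookahead: "foo" matches only when "bar" does NOT follow.
my $before_bar = 'foobar' =~ /foo(?!bar)/ ? 1 : 0;
my $before_baz = 'foobaz' =~ /foo(?!bar)/ ? 1 : 0;
print "foo before bar: $before_bar\n";
print "foo before baz: $before_baz\n";
```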