Hey all!

I need to match all 'tags' (e.g. %thisIsATag%) that occur within XML attributes. (Note: I'm guaranteed to receive valid XML, so there is no need for full DOM traversal.) My regex works, except that when a single attribute contains two tags, only the last one is returned.

In other words, this regex should find tag1, tag2, ..., tag6. However, it omits tag2 and tag5.

Here's a fun little test harness for you (PHP):

<?php

$xml = <<<XML
<data>
 <slideshow width="625" height="250">

  <screen delay="%tag1%">
   <text x="30%" y="50%" animatefromx="800">
    <line fontsize="32" fontstyle="bold" text="Screen One!%tag2% %tag3%"/>
   </text>
  </screen>

  <screen delay='%tag4%'>
   <text x="30%" y="50%" animatefromx="800">
    <line fontsize='32' fontstyle='bold' text='Screen 2!%tag5%%tag6%'/>
   </text>
  </screen>

  <screen>
   <text x="30%" y="50%" animatefromx="800">
    <line fontsize="32" fontstyle="bold"  text="Screen Tres!"/>
   </text>
  </screen>

  <screen>
   <text x="30%" y="50%" animatefromx="800">
    <line fontsize="32" fontstyle="bold"  text="Screen FOURRRR!"/>
   </text>
  </screen>

 </slideshow>
</data>
XML;

$matches = null;
preg_match_all('#<[^>]+("([^%>"]*%([^%>"]+)%[^%>"]*)+"|\'([^%>\']*%([^%>\']+)%[^%>\']*)+\')[^>]*>#i', $xml, $matches);

print_r($matches);
?>

Thanks! :)

+2  A: 

Is this:

(%[a-zA-Z0-9]+%)

not enough? In your example, tags don't appear anywhere outside of attribute values - can they?
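As a rough sketch of how that pattern could be used with preg_match_all (the input is an excerpt of the question's XML, and the extra capture group around the name is an addition for illustration, not part of the pattern above):

```php
<?php
// Sample input: an excerpt of the question's XML (an assumption for brevity).
$xml = <<<XML
<screen delay="%tag1%">
 <text x="30%" y="50%" animatefromx="800">
  <line fontsize="32" fontstyle="bold" text="Screen One!%tag2% %tag3%"/>
 </text>
</screen>
XML;

// Group 1 captures the name between the percent signs.
preg_match_all('/%([a-zA-Z0-9]+)%/', $xml, $matches);
$names = $matches[1];

print_r($names); // tag1, tag2, tag3
```

Values such as x="30%" and y="50%" are not picked up here, because the characters between those two percent signs include quotes and spaces, which the character class rejects.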

RichieHindle
+1 haha! Funny how sometimes we overlook the simplest solutions... :) I suppose this would work for most cases. The only thing that makes me nervous is that the XML **does** get more complex, and it's possible that tag-like text could also appear within the body of an element... But again, this is probably a sufficient solution for now. Thanks! :)
+2  A: 

%\w+% would be an even simpler way of doing this.
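One difference worth keeping in mind: \w also accepts the underscore, so the two patterns can disagree. A small sketch, using a made-up input (the %my_tag% name is hypothetical and does not appear in the question):

```php
<?php
// Hypothetical input; the underscore in the second tag name is an assumption.
$text = 'a="%tag1%" b="%my_tag%"';

preg_match_all('/%[a-zA-Z0-9]+%/', $text, $alnum);
preg_match_all('/%\w+%/', $text, $word);

print_r($alnum[0]); // %tag1%
print_r($word[0]);  // %tag1%, %my_tag%
```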

Mentee
+1 for simplifying things even further.
The Mentee is the ultimate regex guru
Dan
+1  A: 

What you're trying to do is recover intermediate captures from groups that match more than once per regex match. As far as I know, only .NET and Perl 6 provide that capability. You'll have to do the job in two stages: match an attribute value with one or more %tag% sequences in it, then break out the individual sequences.
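To make the "last capture only" behaviour concrete, here is a minimal sketch. The pattern is a simplified stand-in for the one in the question (double quotes only), not the original regex:

```php
<?php
// Sample attribute taken from the question's XML.
$attr = 'text="Screen One!%tag2% %tag3%"';

// The inner group ([^%"]+) takes part in two iterations of the outer (?:...)+,
// but PCRE keeps only what it captured on the final iteration.
preg_match('/"(?:[^%"]*%([^%"]+)%)+[^"]*"/', $attr, $m);
$lastOnly = $m[1];

echo $lastOnly, "\n"; // tag3
```

The whole attribute value is matched, yet tag2 is gone from the result, which is the symptom described in the question.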

You don't seem to care which XML tag or attribute the values are associated with, so you could use this somewhat simpler regex to find the values with %tag% sequences in them:

'#"([^"%<>]*+%[^%"]++%[^"]*+)"|\'([^\'%<>]*+%[^%\']++%[^\']*+)\'#'

EDIT: That regex captures the attribute value in group 1 or group 2, depending on which quotes it used. Here's another version that merges the alternatives so it can always save the value in group 2:

'#(["\'])((?:(?![%<>]|\1).)*+%(?:(?!%|\1).)++%(?:(?!\1).)*+)\1#'
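The two stages could be wired together roughly as follows. This is only a sketch: the input is an excerpt of the question's XML, the variable names are made up, and the second-stage pattern (which allows any non-% characters in a tag name) is an assumption rather than something specified above.

```php
<?php
// Excerpt of the question's XML, mixing both quote styles.
$xml = <<<XML
<screen delay="%tag1%">
 <text x="30%" y="50%" animatefromx="800">
  <line fontsize="32" fontstyle="bold" text="Screen One!%tag2% %tag3%"/>
 </text>
</screen>
<screen delay='%tag4%'>
 <line fontsize='32' fontstyle='bold' text='Screen 2!%tag5%%tag6%'/>
</screen>
XML;

// Stage 1: the merged regex from above; group 2 holds the attribute value.
$valueRegex = '#(["\'])((?:(?![%<>]|\1).)*+%(?:(?!%|\1).)++%(?:(?!\1).)*+)\1#';
preg_match_all($valueRegex, $xml, $values);

// Stage 2: break each value into its individual %tag% names.
// Treating any run of non-% characters as a name is an assumption.
$tags = [];
foreach ($values[2] as $value) {
    preg_match_all('/%([^%]+)%/', $value, $found);
    $tags = array_merge($tags, $found[1]);
}

print_r($tags); // tag1, tag2, tag3, tag4, tag5, tag6
```

If tag names are known to be word characters only, the second-stage pattern could be tightened to %(\w+)% so that stray percent signs inside a value are not paired up by mistake.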
Alan Moore
While the other solutions are much simpler and still solve the same essential problem, this one solves the mystery at the core of my question. The key takeaway is that in PHP (and most languages), I can't "recover intermediate captures". Makes sense, I suppose! Good to know. :)
The other answers also assume `%tag%` names can consist only of alphanumeric or "word" characters, and that `%ThingsThatLookLikeTags%` will always in fact be tags, no matter where they appear. Mine only matches them in quoted strings--which assumes **they** will always be attribute values. But I could extend it to match the strings only within (XML) tags.
Alan Moore