Background


Consider the following input:

<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
>

After processing, I need it to look like the following:

<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
    CustomAttribute="TRUE"
>

Implementation


This is all I need to do for no more than 5 files, so using anything other than a regular expression seems like overkill. Anyway, I came up with the following (Perl) regular expression to accomplish this:

$data =~ s/(<\s*Foo)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;

Problems


This works well; however, there is one obvious problem. The pattern is "dumb": if CustomAttribute has already been added, the substitution above will blindly append another CustomAttribute=... to the tag.
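
To make the problem concrete, here is a minimal sketch that applies the same substitution twice, as a second run of the script would. The shortened sample tag and the `$count` variable are made up for illustration:

```perl
use strict;
use warnings;

# Shortened, hypothetical sample tag (one attribute instead of three).
my $data = qq{<Foo\n    Bar="bar"\n>};

# Apply the substitution twice, simulating two runs over the same file.
for (1 .. 2) {
    $data =~ s/(<\s*Foo)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
}

# Count how many times the attribute now appears.
my $count = () = $data =~ /CustomAttribute=/g;
print "CustomAttribute appears $count times\n";
```

Each pass matches the tag again, so every run adds one more copy of the attribute.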

A simple solution, of course, is to write a secondary expression that checks for CustomAttribute before running the replacement.

Questions


Since I'm rather new to the scripting language and regular expression worlds, I'm wondering whether it's possible to solve this problem without introducing any host language constructs (i.e., an if-statement in Perl), and simply use a more "intelligent" version of what I wrote above?

+5  A: 

I won't beat you over the head with how you should not use a regex for this. I mean, you shouldn't, but you obviously know that from what you said in your question, so moving on...

Something that will accomplish what you're asking for is called a negative lookahead assertion (usually (?!...)), which basically says that you don't want the match to apply if the pattern inside the assertion is found ahead of this point. In your example, you don't want it to apply if CustomAttribute is already present, so:

$data =~ s/(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
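
A quick way to check the behaviour, as a sketch: wrap the substitution in a helper and apply it twice. The name `add_custom_attribute` and the one-line sample tag are hypothetical, not part of any module. The substitution here has a space before the new attribute so that a single-line tag stays well-formed:

```perl
use strict;
use warnings;

# Hypothetical helper: adds CustomAttribute to Foo tags that lack it.
sub add_custom_attribute {
    my ($data) = @_;
    # The negative lookahead rejects any Foo tag that already contains
    # "CustomAttribute=" somewhere before its closing ">".
    $data =~ s/(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
    return $data;
}

my $once  = add_custom_attribute(qq{<Foo Bar="bar">});
my $twice = add_custom_attribute($once);

print "$once\n";
print "second pass changed nothing\n" if $once eq $twice;
```

The second pass leaves the string alone because the lookahead finds CustomAttribute= before the closing >. This is still only a text match, so it knows nothing about quoting inside attribute values.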
Adam Bellaire
Awesome, thanks for actually answering the question and not lecturing about RegEx + XML.
HCO2
XML allows plain `>` in attribute values, so `<Foo Bar=">">` would fail. And `<Foo Bar="CustomAttribute=">` would also fail.
Gumbo
@Gumbo: Good point, +1 to your answer.
Adam Bellaire
@Gumbo: Yes! Thank you for pointing out why regexp for HTML is inherently fragile. Without descent parsing, one is bound to find these sorts of anomalies, probably months or years after the fragile code has been put into use by users who expected it to be more robust.
Jim Dennis
A: 

You can send your matches through a function with the 'e' modifier for more processing.

my $str = qq`
<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
    CustomAttribute="TRUE"
>
<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
>
`;

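# Append the attribute unless the tag's contents already mention it.
sub foo {
    my $guts = shift;
    $guts .= qq` CustomAttribute="TRUE"` if $guts !~ m/CustomAttribute/;
    return $guts;
}
# /e evaluates the replacement as Perl code. The /x flag is dropped because it
# makes the regex ignore the literal space after Foo; \b stops <Foobar matching.
$str =~ s/(<Foo\b)([^>]*)(>)/$1.foo($2).$3/sge;
print $str;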
Rob
What about `<Foo Bar="CustomAttribute">`?
Gumbo
+5  A: 

This sounds like it might be a job for XML::Twig, which can process XML and change parts of it as it runs into them, including adding attributes to tags. I suspect you'd spend as much time getting used to Twig as you would finding a regex solution that only mostly worked. And, at the end, you'd know enough Twig to use it on the next project. :)

brian d foy
+3  A: 

Time for a lecture I guess ;--)

I am not sure why you think using a full-blown XML processor is overkill. It is actually easier to write the code using the proper tool. A regexp will be more complex and will rely on unwritten assumptions about the data, which is dangerous. Some of those assumptions are likely to be: no '>' in attribute values, no CDATA sections, no non-ASCII characters in tag or attribute names, consistent attribute value quoting...

The only thing a regexp will give you is the assurance that the output keeps the original format of the data (in your case, the fact that each attribute is on a separate line). But if your format is consistent, that can be reproduced with an XML tool too, and if not, it should not matter, unless you keep your XML in a line-oriented revision control system.

Here is an example with XML::Twig. It assumes you have enough memory to keep any entire Foo element in memory, and it works even on the admittedly contrived bit of XML in the DATA section. It would probably be just as easy to do with XML::LibXML (read the XML into memory, select all Foo elements, add the attribute to each of them, output: that's 5 easy-to-understand lines by my count).

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my( $tag, $att, $val)= ( 'Foo', 'CustomAttribute', 'TRUE');

XML::Twig->new(                 # only process those elements
                twig_roots => { $tag => sub { 
                                              # add/set attribute
                                              $_->set_att( $att => $val); 
                                              # output and free memory
                                              $_->flush;
                                            }
                              },
                twig_print_outside_roots => 1, # output everything else
                pretty_print => 'cvs',         # seems to be the right format
              )
         ->parse( \*DATA)  # use parsefile( $file) if parsing... a file
         ->flush;          # not needed in XML::Twig 3.33
__DATA__
<doc>
  <Foo
      Bar="bar"
      Baz="1"
      Bax="bax"
  >
  here is some text
  </Foo>
  <Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
  <bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
  <Foo><![CDATA[<Foo no_att="1" CustomAttribute="TRUE">tricked?</Foo>]]></Foo>
  <Foo
      Bar=">"
      Baz="1"
      Bax="bax"
  ></Foo>
  <Foo
      Bar="
>"
      Baz="1"
      Bax="bax"
  ></Foo>
  <Foo
      Bar=">"
      Baz="1"
      Bax="bax"
      CustomAttribute="TRUE"
  ></Foo>
  <Foo
      Bar="
>"
      Baz="1"
      Bax="b
ax"
      CustomAttribute="TR
UE"
  ></Foo>
</doc>
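
For comparison, the XML::LibXML route mentioned above might look roughly like this. It is a sketch, not tested against the full DATA section, and the shortened `$xml` sample is made up for illustration. Unlike the Twig version, it loads the whole document into memory and makes no attempt to keep the one-attribute-per-line layout:

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened, hypothetical sample based on the DATA section above.
my $xml = <<'XML';
<doc>
  <Foo Bar=">" Baz="1"/>
  <Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
  <bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
</doc>
XML

# Read the whole document into memory.
my $doc = XML::LibXML->load_xml( string => $xml );

# Select every Foo element and add (or overwrite) the attribute.
$_->setAttribute( CustomAttribute => 'TRUE' ) for $doc->findnodes('//Foo');

# Output the modified document.
print $doc->toString;
```

Because the parser works on elements rather than text, the Foo inside the CDATA section is left alone and setting an attribute that already exists does not duplicate it.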
mirod