ansaurus

Question

Answer 1

+5 A:

I won't beat you over the head with how you should not use a regex for this. I mean, you shouldn't, but you obviously know that from what you said in your question, so moving on...

Something that will accomplish what you're asking for is called a negative lookahead assertion (usually (?!...)), which basically says that you don't want the match to apply if the pattern inside the assertion is found ahead of this point. In your example, you don't want it to apply if CustomAttribute is already present, so:

$data =~ s/(<\s*Foo\b)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
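To see the lookahead at work, here is a small sketch that runs a variant of that substitution, spelled out with /x, over two made-up tags in the question's layout. The sample data is an assumption for illustration only. Only the tag that lacks CustomAttribute should change.

```perl
use strict;
use warnings;

# Hypothetical sample: the first tag already has the attribute, the second does not.
my $data = <<'XML';
<Foo
    Bar="bar"
    CustomAttribute="TRUE"
>
<Foo
    Bar="bar"
>
XML

$data =~ s{
    (<Foo\b)                        # start of a Foo tag
    (?! [^>]* \bCustomAttribute= )  # give up if the attribute shows up before the next '>'
    ([^>]*)                         # everything else inside the tag
    >
}{$1$2 CustomAttribute="TRUE">}gx;

print $data;
```

The leading space in the replacement keeps the new attribute separate from whatever precedes it when the tag is written on a single line.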

Adam Bellaire 2009-12-11 18:02:29

Awesome, thanks for actually answering the question and not lecturing about RegEx + XML.

HCO2 2009-12-11 18:05:55

XML allows plain `>` in attribute values, so `<Foo Bar=">">` would fail. And `<Foo Bar="CustomAttribute=">` would also fail.

Gumbo 2009-12-11 18:17:20

@Gumbo: Good point, +1 to your answer.

Adam Bellaire 2009-12-11 19:37:08

@Gumbo: Yes! Thank you for pointing out why regexp for HTML is inherently fragile. Without descent parsing, one is bound to find these sorts of anomalies, probably months or years after the fragile code has been in use by users who expected it to be more robust.

Jim Dennis 2009-12-11 20:58:16

Answer 2

A:

You can send your matches through a function with the 'e' modifier for more processing.

my $str = qq`
<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
    CustomAttribute="TRUE"
>
<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
>
`;

sub foo {
    my $guts = shift;
    $guts .= qq` CustomAttribute="TRUE"` if $guts !~ m/CustomAttribute/;
    return $guts;
}
# \b ends the tag name, so longer names such as <Foobar are left alone
$str =~ s/(<Foo\b)([^>]*)(>)/$1.foo($2).$3/xsge;

Rob 2009-12-11 18:12:53

What about `<Foo Bar="CustomAttribute">`?

Gumbo 2009-12-11 18:15:56
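One possible way to handle cases like this, sketched under assumptions: let the pattern walk the tag one attribute at a time, so quoted values (including ones that contain `>` or the text CustomAttribute) are consumed as values and only real attribute names are compared. The helper name add_attr and the sample tags are hypothetical. The sketch still knows nothing about comments or CDATA sections, so it is no substitute for a parser.

```perl
use strict;
use warnings;

my $name  = qr/[\w:.-]+/;            # rough approximation of an XML attribute name
my $value = qr/"[^"]*"|'[^']*'/;     # double- or single-quoted value
my $attr  = qr/\s+$name\s*=\s*(?:$value)/;

# Hypothetical helper: append the attribute unless an attribute with that name exists.
sub add_attr {
    my ($attrs) = @_;                                   # copy before running any regex
    my @names = $attrs =~ /\s+($name)\s*=\s*(?:$value)/g;
    return $attrs if grep { $_ eq 'CustomAttribute' } @names;
    return $attrs . ' CustomAttribute="TRUE"';
}

# Made-up test tags, including the two problem cases from the comments.
my $str = join "\n",
    '<Foo Bar="CustomAttribute">',
    '<Foo Bar=">">',
    '<Foo CustomAttribute="TRUE">';

# $1 = tag start, $2 = all attributes, $3 = optional whitespace, optional '/', then '>'
$str =~ s{(<Foo\b)((?:$attr)*)(\s*/?>)}{$1 . add_attr($2) . $3}ge;

print "$str\n";
```

A tag whose attributes do not fit the name="value" shape simply fails to match and is left untouched, which seems a safer failure mode than a partial rewrite.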

Answer 3

+5 A:

This sounds like it might be a job for XML::Twig, which can process XML and change parts of it as it runs into them, including adding attributes to tags. I suspect you'd spend as much time getting used to Twig as you would finding a regex solution that only mostly worked. And, at the end, you'd know enough Twig to use it on the next project. :)

brian d foy 2009-12-11 19:41:03

Answer 4

+3 A:

Time for a lecture I guess ;--)

I am not sure why you think using a full-blown XML processor is overkill. It is actually easier to write the code using the proper tool. A regexp will be more complex and will rely on unwritten assumptions about the data, which is dangerous. Some of those assumptions are likely to be: no '>' in attribute values, no CDATA sections, no non-ASCII characters in tag or attribute names, consistent attribute value quoting...

The only thing a regexp will give you is the assurance that the output keeps the original format of the data (in your case, the fact that the attributes are each on a separate line). But if your format is consistent, a parser can reproduce it on output, and if it is not, it should not matter, unless you keep your XML in a line-oriented revision control system.

Here is an example with XML::Twig. It assumes you have enough memory to keep any entire Foo element in memory, and it works even on the admittedly contrived bit of XML in the DATA section. It would probably be just as easy to do with XML::LibXML (read the XML in memory, select all Foo elements, add attribute to each of them, output, that's 5 easy to understand lines by my count).

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my( $tag, $att, $val)= ( 'Foo', 'CustomAttribute', 'TRUE');

XML::Twig->new(                 # only process those elements
                twig_roots => { $tag => sub { 
                                              # add/set attribute
                                              $_->set_att( $att => $val); 
                                              # output and free memory
                                              $_->flush;
                                            }
                              },
                twig_print_outside_roots => 1, # output everything else
                pretty_print => 'cvs',         # seems to be the right format
              )
         ->parse( \*DATA)  # use parsefile( $file) if parsing... a file
         ->flush;          # not needed in XML::Twig 3.33
__DATA__
<doc>
  <Foo
      Bar="bar"
      Baz="1"
      Bax="bax"
  >
  here is some text
  </Foo>
  <Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
  <bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
  <Foo><![CDATA[<Foo no_att="1" CustomAttribute="TRUE">tricked?</Foo>]]></Foo>
  <Foo
      Bar=">"
      Baz="1"
      Bax="bax"
  ></Foo>
  <Foo
      Bar="
>"
      Baz="1"
      Bax="bax"
  ></Foo>
  <Foo
      Bar=">"
      Baz="1"
      Bax="bax"
      CustomAttribute="TRUE"
  ></Foo>
  <Foo
      Bar="
>"
      Baz="1"
      Bax="b
ax"
      CustomAttribute="TR
UE"
  ></Foo>
</doc>
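
The XML::LibXML route mentioned above (read the XML into memory, select all Foo elements, add the attribute, output) could look roughly like the sketch below. The sample document is made up, and unlike the Twig version this does not try to preserve the one-attribute-per-line layout.

```perl
use strict;
use warnings;

use XML::LibXML;

# Hypothetical input; for a file, use load_xml( location => $file ) instead.
my $xml = <<'XML';
<doc>
  <Foo Bar=">" Baz="1"/>
  <Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
  <bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
</doc>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# setAttribute adds the attribute, or overwrites its value if it is already there.
$_->setAttribute( CustomAttribute => 'TRUE' ) for $doc->findnodes('//Foo');

print $doc->toString;
```

Because the text inside the CDATA section is character data rather than markup, the XPath query never sees it as a Foo element.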

mirod 2009-12-12 08:22:18


Intelligent RegEx in Perl?
