tags:

views:

108

answers:

5

Suppose you want to match text that is delimited by double characters like so:

a = <<
Hello
World!
>>

The regular expression /<<(.*)>>/ would seem to do it, but unfortunately when these can be repeated greedy matching gets too much:

a = <<
Hello
World!
>>

b = <<
Goodbye
World!
>>

The previous regexp will capture

Hello
World!
>>

b = <<
Goodbye
World!

The obvious answer is to make the regexp non-greedy: /<<(.*?)>>/

Unfortunately this has extreme performance problems for long strings (in Perl at least). If the delimiters were single characters, then we could use a character class (everything but the character) to solve the greedy problem.

Any ideas on a regular expression to make this match without the performance penalty?

NB: I have to use Perl and this has to be a regular expression because of the larger system it's embedded in.

Thanks.

+1  A: 

Please see if the performance of a dedicated parser (such as Text::Balanced) would be acceptable in this case. It's not regex, but without more details on your "NB" poststcriptum it sounds like you might have an XY problem when looking for a regex-only solution.

If you absolutely must use a regex, please look at using a look-ahead functionality - it may improve the speed.

DVK
When you say "look-ahead functionality" can you elaborate? I think that the capturing group would have the greediness quantifier and the lookahead does not fix this, no? ie, any variant of positive lookahead, look behind, will still need to limit the scope of the match. `/<<(.*)(?=>>)/` will still have the same problem of matching to the final delimiter. The only easy solution I see is a negative character class on the first character of the closing delimiter. `<<([^>]*)>>` is as efficient as `<<(.*)>>` in terms of total probe count and gives the right answer.
drewk
Text::Balanced is cool but I don't think it really fits this problem, both because OP didn't ask for nesting and because Text::Balanced is really meant to deal with single-char delimiters and constructs like backslashing.
hobbs
BTW, I tried lookahead and it works, but with identical performance to the non-greedy regexp specification.
Phil Windley
+2  A: 

Using a negated character class in this case will work:

/<<([^>]*)>>/ is the same probe count as /<<(.*)>>/ so should be just as fast with less backtracking as /<<(.*?)>>/

I do agree with DVK however; is a regex the only way?

drewk
Except that > can occur inside the <<...>> delimiters:a = <<<a href="...">hey</a>>>
Phil Windley
Ahmm, that was not specified in your post. Are your parsing HTML with a regex? Please look at http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for some discussion on why this is not a great idea.
drewk
No, not trying to parse HTML. I understand the dangers there. Just trying to consume everything between the delimiters (including white space). Sorry not to have specified the need for > inside the delimiters.
Phil Windley
+4  A: 

Expanding drewk's answer so it actually works:

/<<((?:(?>[^>]+)|>(?!>))*)>>/

Match "<<", then a sequence of 0 or more chunks which are either any number of non-">" characters, or a single ">" not followed by another ">", then finally ">>".

hobbs
+1: Yes that works, but almost the same number of probes as `<<(.*?)>>`, no? At least on my regex sim...
drewk
If that's the case, then I'm pretty sure they're both as efficient as it gets, because this is really a minimal representation of the state machine for matching `<<`..`>>`-delimited stuff.
hobbs
@drewk I made a change that should reduce the amount of backtracking that the pattern is allowed to make -- does that help any?
hobbs
@Hobbs: Yes, better!
drewk
+1  A: 

Say you have a simple grammar

my $p = Parse::RecDescent->new(<<'EOGrammar');
  program: assignment(s)

  assignment: id '=' '<<' angle_text '>>'
              { $return = [ $item{id}, $item{angle_text} ] }

  angle_text: <skip:undef> / ( [^>] | >(?!>) )* /x

  id: /\w+/
EOGrammar

and a source text of

a = <<
Hello

World!

>>

b = <<


Goodbye
World!
>>

When you process the result with

for (@{ $p->program($text) }) {
  my($name,$what) = @$_;
  print "$name: [[[$what]]]\n";
}

you'll see output of

a: [[[
Hello

World!

]]]
b: [[[


Goodbye
World!
]]]
Greg Bacon
+3  A: 

Are you using Perl 5.10? Try this:

/<<([^>]*+(?:>(?!>)[^>]*+)*+)>>/

Like the regex @hobbs posted, this one performs lookahead only after it finds a > (as opposed to the non-greedy quantifier, which effectively does a lookahead at every position). But this one uses Friedl's "unrolled loop" technique, which should be slightly faster than the alternation approach. Also, all quantifiers are possessive, so it doesn't bother saving the state information that would make backtracking possible.

Alan Moore
+1: Sweet regex! It handles ever case I can think of and it as fast as `/<<(.*)>>/` Should be the answer IMHO.
drewk