ansaurus

Question

Matching double char delimited strings with regular expressions

Answer 1

+1 A:

Please see if the performance of a dedicated parser (such as Text::Balanced) would be acceptable in this case. It's not a regex, but without more details on your "NB" postscript, it sounds like you might have an XY problem in looking for a regex-only solution.

If you absolutely must use a regex, consider using a look-ahead; it may improve the speed.

DVK 2010-07-12 20:42:06

When you say "look-ahead functionality", can you elaborate? I think the capturing group would still have the greedy quantifier, and the lookahead does not fix this, no? I.e., any variant of positive lookahead or lookbehind will still need to limit the scope of the match. `/<<(.*)(?=>>)/` will still have the same problem of matching to the final delimiter. The only easy solution I see is a negated character class on the first character of the closing delimiter. `<<([^>]*)>>` is as efficient as `<<(.*)>>` in terms of total probe count and gives the right answer.

drewk 2010-07-12 21:48:53
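
As a rough illustration of the point in this comment, the sketch below compares the greedy-plus-lookahead pattern with the negated character class. The sample string and the sub names (greedy_lookahead, negated_class) are made up for the example and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample input with two delimited fields on one line.
my $text = 'a = <<one>> b = <<two>>';

# Greedy capture plus lookahead: .* still runs on to the last ">>".
sub greedy_lookahead {
    my ($s) = @_;
    return $s =~ /<<(.*)(?=>>)/ ? $1 : undef;
}

# Negated character class: the capture stops at the first ">".
sub negated_class {
    my ($s) = @_;
    return $s =~ /<<([^>]*)>>/ ? $1 : undef;
}

print "lookahead: ", greedy_lookahead($text), "\n";   # one>> b = <<two
print "negated:   ", negated_class($text), "\n";      # one
```

The lookahead only checks what follows the capture; it does not stop the greedy `.*` from swallowing the first closing delimiter.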

Text::Balanced is cool but I don't think it really fits this problem, both because OP didn't ask for nesting and because Text::Balanced is really meant to deal with single-char delimiters and constructs like backslashing.

hobbs 2010-07-12 22:06:37

BTW, I tried lookahead and it works, but with identical performance to the non-greedy regexp specification.

Phil Windley 2010-07-12 22:42:46

Answer 2

+2 A:

Using a negated character class in this case will work:

/<<([^>]*)>>/ has the same probe count as /<<(.*)>>/, so it should be just as fast, with less backtracking than /<<(.*?)>>/

I do agree with DVK however; is a regex the only way?

drewk 2010-07-12 21:14:37

Except that > can occur inside the <<...>> delimiters: a = <<<a href="...">hey</a>>>

Phil Windley 2010-07-12 21:46:17
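
To make the objection concrete, here is a small sketch that runs the negated-character-class pattern against the sample from this comment. The sub name negated_class_match is hypothetical.

```perl
use strict;
use warnings;

# Sample input taken from the comment above: the body contains ">" characters.
my $text = 'a = <<<a href="...">hey</a>>>';

# Returns the captured body, or undef when the pattern does not match.
sub negated_class_match {
    my ($s) = @_;
    return $s =~ /<<([^>]*)>>/ ? $1 : undef;
}

my $got = negated_class_match($text);
print defined $got ? "matched: $got\n" : "no match\n";   # no match
```

Because `[^>]*` cannot cross a `>`, the pattern never reaches a `>>` from either `<<` start position, so it fails on this input.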

Ahmm, that was not specified in your post. Are you parsing HTML with a regex? Please look at http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for some discussion on why this is not a great idea.

drewk 2010-07-12 22:24:36

No, not trying to parse HTML. I understand the dangers there. Just trying to consume everything between the delimiters (including white space). Sorry not to have specified the need for > inside the delimiters.

Phil Windley 2010-07-12 22:41:58

Answer 3

+4 A:

Expanding drewk's answer so it actually works:

/<<((?:(?>[^>]+)|>(?!>))*)>>/

Match "<<", then a sequence of 0 or more chunks, each of which is either a run of one or more non-">" characters or a single ">" not followed by another ">", then finally ">>".

hobbs 2010-07-12 22:00:45
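
The same pattern can be written with the /x flag so that each piece carries a comment. This is only a sketch: the input string and the sub name bodies are assumptions made for the example.

```perl
use strict;
use warnings;

# The pattern from this answer, spread out under the x flag.
my $re = qr/
    <<                  # opening delimiter
    (                   # capture the body
      (?:
          (?> [^>]+ )   # a run of non-">" characters, matched atomically
        | > (?! > )     # or a single ">" that is not followed by ">"
      )*
    )
    >>                  # closing delimiter
/x;

# Returns every captured body found in the string.
sub bodies {
    my ($s) = @_;
    my @found = $s =~ /$re/g;
    return @found;
}

# Hypothetical input: the first body contains a lone ">".
my @b = bodies('a = <<x > y>> b = <<z>>');
print join('|', @b), "\n";   # x > y|z
```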

+1: Yes that works, but almost the same number of probes as `<<(.*?)>>`, no? At least on my regex sim...

drewk 2010-07-12 22:19:35

If that's the case, then I'm pretty sure they're both as efficient as it gets, because this is really a minimal representation of the state machine for matching `<<`..`>>`-delimited stuff.

hobbs 2010-07-12 23:13:06

@drewk I made a change that should reduce the amount of backtracking that the pattern is allowed to make -- does that help any?

hobbs 2010-07-12 23:22:06

@Hobbs: Yes, better!

drewk 2010-07-13 16:54:45

Answer 4

+1 A:

Say you have a simple grammar

use Parse::RecDescent;  # CPAN module, not part of core Perl

my $p = Parse::RecDescent->new(<<'EOGrammar');
  program: assignment(s)

  assignment: id '=' '<<' angle_text '>>'
              { $return = [ $item{id}, $item{angle_text} ] }

  angle_text: <skip:undef> / ( [^>] | >(?!>) )* /x

  id: /\w+/
EOGrammar

and a source text of

a = <<
Hello

World!

>>

b = <<


Goodbye
World!
>>

When you process the result with

for (@{ $p->program($text) }) {
  my($name,$what) = @$_;
  print "$name: [[[$what]]]\n";
}

you'll see output of

a: [[[
Hello

World!

]]]
b: [[[


Goodbye
World!
]]]

Greg Bacon 2010-07-12 22:43:31

Answer 5

+3 A:

Are you using Perl 5.10? Try this:

/<<([^>]*+(?:>(?!>)[^>]*+)*+)>>/

Like the regex @hobbs posted, this one performs lookahead only after it finds a > (as opposed to the non-greedy quantifier, which effectively does a lookahead at every position). But this one uses Friedl's "unrolled loop" technique, which should be slightly faster than the alternation approach. Also, all quantifiers are possessive, so it doesn't bother saving the state information that would make backtracking possible.

Alan Moore 2010-07-13 08:12:35
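
A commented version of this pattern, as a sketch of how the unrolled loop is laid out. The sub name unrolled_bodies and the test string are made up for illustration; the multi-line part mirrors the kind of input shown in the Parse::RecDescent answer.

```perl
use strict;
use warnings;
use 5.010;   # possessive quantifiers need Perl 5.10 or later

# "normal* (special normal*)*" layout of the unrolled loop.
my $re = qr/
    <<
    (
      [^>]*+            # normal run: anything but ">"
      (?:
        > (?! > )       # special case: a lone ">"
        [^>]*+          # followed by another normal run
      )*+
    )
    >>
/x;

# Returns every captured body found in the string.
sub unrolled_bodies {
    my ($s) = @_;
    my @found = $s =~ /$re/g;
    return @found;
}

# Hypothetical input: one body with a lone ">", one spanning several lines.
my @u = unrolled_bodies("a = <<x > y>> b = <<\nGoodbye\nWorld!\n>>");
print scalar(@u), " bodies, first is [$u[0]]\n";   # 2 bodies, first is [x > y]
```

Since `[^>]` also matches newlines, no /s flag is needed for multi-line bodies.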

+1: Sweet regex! It handles every case I can think of and it is as fast as `/<<(.*)>>/`. Should be the answer IMHO.

drewk 2010-07-13 15:45:51
