Hey guys - I'm tearing my hair out trying to create a regular expression to match something like:

{TextOrNumber{MoreTextOrNumber}}

Note the matching number of open/close {}. Is this even possible?

Many thanks.

A: 

This is not possible with a single regex unless a recursive extension is available. Without one, you will have to apply a regex like the following one multiple times:

/\{[a-z0-9]+([a-z0-9\{\}]+)?\}/i

Capture the "MoreTextOrNumber" part and match against it again until you are through or the match fails.
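
A minimal Ruby sketch of this repeated matching, assuming strictly nested input (one group inside another, no siblings). The pattern is a slightly tightened variant of the one above, and the names LAYER and nested_parts are made up for illustration:

```ruby
# Matches one brace level: group 1 is the text at this level,
# group 2 is the optional nested group that follows it.
LAYER = /\A\{([a-z0-9]+)(\{.*\})?\}\z/i

# Returns the text found at each nesting level, outermost first,
# or nil if the braces do not balance or unexpected characters appear.
def nested_parts(input)
  parts = []
  rest = input
  while rest
    m = LAYER.match(rest)
    return nil unless m
    parts << m[1]   # text at the current level
    rest = m[2]     # nested group, or nil at the innermost level
  end
  parts
end

p nested_parts("{TextOrNumber{MoreTextOrNumber}}")
```

Each pass strips exactly one pair of braces, so the loop runs once per nesting level and stops as soon as a level fails to match.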

Etan
A: 

Note the matching number of open/close {}. Is this even possible?

Historically, no. However, modern regular expressions aren’t actually regular and some allow such constructs:

\{TextOrNumber(?R)?\}

(?R) recursively inserts the whole pattern again at that point. Note that not many regex engines support this (yet).
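
As one concrete case, here is a sketch in Ruby, whose regex engine spells the recursive call \g<name> rather than (?R). The character class standing in for "TextOrNumber" and the names BALANCED and balanced? are assumptions for illustration:

```ruby
# The named group calls itself via \g<group>, so each level may contain
# one further brace group. The anchors sit outside the recursive part.
BALANCED = /\A(?<group>\{[A-Za-z0-9]+\g<group>?\})\z/

def balanced?(text)
  BALANCED.match?(text)
end

puts balanced?("{TextOrNumber{MoreTextOrNumber}}")
puts balanced?("{TextOrNumber{MoreTextOrNumber}")
```

The recursion is placed in a named group instead of the whole pattern so that \A and \z are not repeated at every nesting level.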

Konrad Rudolph
A: 

If you need to handle an arbitrary number of braces, you can use a parser generator, or apply a regex inside a recursive function. The following is an example of such a recursive function in Ruby.

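def parse(s)
  # $1 is the text at this level, $2 the optional nested {...} group.
  return unless s =~ /\A\{([A-Za-z0-9]*)(\{.*\})?\}\z/
  puts $1
  parse($2) if $2  # recurse only when a nested group was found
end

parse("{foo{bar{baz}}}")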
brianegge
A: 

Not easy but possible

Officially, regular expressions are not designed for parsing nested paired brackets, and if you try to do this, you run into all sorts of problems. There are other tools (like parser generators, e.g. yacc or bison) that are designed for such structures and can handle them well. But it can be done, and if you do it right it may even be simpler than a yacc grammar with all the support code to work around the problems of yacc.

Here are some hints:

First of all, my suggestions work best if you have some characters that will never appear in the input. Often, characters like \01 and \02 should never appear, so you can do

s/[\01\02]/ /g;

to make sure they are not there. Otherwise, you may want to escape them (e.g. convert them to text like %1 and %2) with an expression like

s/([\01\02%])/"%".ord($1)/ge;

Note that I also escaped the escape character "%" itself (it becomes %37).
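
For comparison, a rough Ruby equivalent of that escaping step, together with a matching unescape. The helper names are hypothetical, and the unescape relies on the assumption that only the codes 1, 2 and 37 are ever produced:

```ruby
# Replace the two marker characters and "%" with "%" plus their character code,
# so "\x01" becomes "%1", "\x02" becomes "%2" and "%" becomes "%37".
def escape_markers(text)
  text.gsub(/[\x01\x02%]/) { |ch| "%#{ch.ord}" }
end

# Reverse the escaping. "37" is listed first so "%37" is never read as "%3" + "7".
def unescape_markers(text)
  text.gsub(/%(37|1|2)/) { $1.to_i.chr }
end

escaped = escape_markers("50% \x01 done \x02")
puts escaped
puts unescape_markers(escaped) == "50% \x01 done \x02"
```

After escaping, the characters \x01 and \x02 are guaranteed to be free for use as markers, and the original text can be restored once parsing is finished.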

Now, I suggest parsing brackets from the inside out: replace any substring "{ text }" where "text" does not contain any brackets with a placeholder "\01$number\02" and store the enclosed text in $array[$number]:

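$number = 1;
# Replace the innermost {...} group with a numbered marker until none remain.
while (s/\{([^{}]*)\}/"\01$number\02"/e) { $array[$number] = $1; $number++; }
$array[0] = $_;  # $array[0] is the top level of your input, with groups replaced by markers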
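
The same loop can be sketched in Ruby to make the intermediate results easier to inspect. This is an illustrative port, not part of the original parsers, and extract_groups is a made-up name:

```ruby
# Repeatedly replace the innermost {...} group with a numbered marker
# and remember the text that the marker stands for.
def extract_groups(input)
  text = input.dup
  groups = []
  number = 1
  # sub! returns nil once no bracket-free {...} group is left.
  while text.sub!(/\{([^{}]*)\}/) { groups[number] = $1; "\x01#{number}\x02" }
    number += 1
  end
  groups[0] = text   # the top level, with every group replaced by a marker
  groups
end

p extract_groups("{foo{bar{baz}}}")
```

For the input "{foo{bar{baz}}}" the innermost text "baz" is stored first, and each outer entry then refers to the inner one only through its marker.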

As a final step, you may want to process each element in @array to pull out and process the "\01$number\02" markers. This is easy because they are no longer nested.
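
One possible shape for that final step, sketched in Ruby: expand each marker back into a nested array. The helper name expand is hypothetical, and the sample groups array is written out by hand to match what the inside-out replacement would yield for "{foo{bar{baz}}}":

```ruby
# Turn a string containing "\x01<number>\x02" markers into a nested array,
# looking up each marker's text in groups and expanding it in turn.
def expand(text, groups)
  # split with a capture group puts the marker numbers at the odd indices.
  pieces = text.split(/\x01(\d+)\x02/)
  tree = pieces.each_with_index.map do |piece, i|
    i.odd? ? expand(groups[piece.to_i], groups) : piece
  end
  tree.reject { |piece| piece == "" }   # drop empty text between markers
end

groups = ["\x013\x02", "baz", "bar\x011\x02", "foo\x012\x02"]
p expand(groups[0], groups)
```

Because every stored string is free of braces, a single non-recursive pattern is enough at each level, and the nesting is recovered by following the marker numbers.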

I happily use this idea in a few parsers (including separating matching bracket types like "(){}[]" etc).

But before you go down this road, make sure you have used regular expressions in simpler applications: you will run into many small problems, and you need experience to resolve them (rather than turning one small problem into two small problems, etc.).

Yaakov Belch