Hello,

I've built a complex (for me) regex to parse some file names, and it broadly works, except for the case where there are additional brackets inside the outer brackets.

(?'field'F[0-9]{1,4})(?'term'\(.*?\))(?'operator'_(OR|NOT|AND)_)?

In the following examples, I need to get the groups shown after the // comment, but in the 3rd example I am getting ((brackets) instead of ((brackets)are valid).

For the life of me I can't work out how to extend it to search for the final bracket.

C:\Temp\[DB_3][DT_2][F30(green)].vsl // F30 (green)
C:\Temp\[DB_3][DT_2][F21(red)_OR_F21(blue)_NOT_F21(pink)].vsl // F21 (red) _OR_ OR
C:\Temp\[DB_3][DT_2][F21((brackets)are valid)].vsl // F21 ((brackets)are valid)
C:\Temp\[DB_3][DT_2][F21(any old brackets)))))are valid)].vsl // F21 (any old brackets)))))are valid)
C:\Temp\[DB_3][DT_2][F21(brackets))))))_OR_F21(blue)].vsl // F21 (brackets)))))) _OR_ OR

Thanks


UPDATE: I'm using RegExr to experiment, then implementing in C# like this:

Regex r = new Regex(pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

foreach(Match m in r.Matches(foo))
{
    //etc
}


UPDATE 2: I don't need to match up the brackets. Inside the one set of brackets can be any data, I just need it to terminate with the outside bracket.


UPDATE 3:

Another attempt: this works with extra brackets (examples 3 and 4), but it still fails to split out the extra terms (example 5), and unfortunately it includes the terminating ] in the group. How can I get it to search for (but not include) either )_ or )] as the delimiter, and include only the bracket?

(?'field'F[0-9]{1,4})(?'term'\(.*?\)[\]])(?'operator'_(OR|NOT|AND)_)?
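One way to "search for but not include" a delimiter is a lookahead. The following is a minimal Python sketch, not a tested C# solution: it assumes a term ends at the first ) that is immediately followed by _OR_, _NOT_, _AND_ or the closing ]. The name parse_terms is made up for illustration, and Python spells named groups (?P<name>...) where .NET accepts (?'name'...).

```python
import re

# Assumption: a term ends at the first ")" that is directly followed by
# an operator ("_OR_", "_NOT_", "_AND_") or by the closing "]".
PATTERN = re.compile(
    r"(?P<field>F[0-9]{1,4})"
    r"(?P<term>\(.*?\))"            # lazy: grow only as far as needed
    r"(?=_(?:OR|NOT|AND)_|\])"      # lookahead: checked but not consumed
    r"(?:_(?P<operator>OR|NOT|AND)_)?"
)


def parse_terms(file_name):
    """Return (field, term, operator) tuples; operator is None on the last term."""
    return [
        (m.group("field"), m.group("term"), m.group("operator"))
        for m in PATTERN.finditer(file_name)
    ]


if __name__ == "__main__":
    for name in (
        r"C:\Temp\[DB_3][DT_2][F21((brackets)are valid)].vsl",
        r"C:\Temp\[DB_3][DT_2][F21(brackets))))))_OR_F21(blue)].vsl",
    ):
        print(parse_terms(name))
```

Because the term can contain literally anything, a term that itself contains )_AND_ or )] would still be cut short by this approach; the format is ambiguous in that case.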


Final update: I've decided it's not worth the effort in trying to parse this stupid format, so I'm going to ditch support for it and do something more productive with my time. Thank you all for your help, I have now seen the light!

+2  A: 

Matching nested parentheses with regex is either a) not possible*, or b) results in a regex that is unmaintainable.

If you're simply trying to match from the first ( to the last ) (without checking that the opening and closing parentheses properly match), then just remove the ? from .*? so that it becomes the greedy .*.

* Depending on which regex flavour you're using.
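The difference between the lazy and the greedy quantifier can be seen on two of the example file names. This is a small Python sketch for illustration only; first_term is a made-up helper, and the patterns are cut down to the field and term parts.

```python
import re

LAZY = re.compile(r"F[0-9]{1,4}(\(.*?\))")   # stops at the first ")"
GREEDY = re.compile(r"F[0-9]{1,4}(\(.*\))")  # runs to the last ")" on the line


def first_term(pattern, text):
    """Return the bracketed term of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    nested = "[F21((brackets)are valid)].vsl"
    chained = "[F21(red)_OR_F21(blue)_NOT_F21(pink)].vsl"
    for text in (nested, chained):
        print(first_term(LAZY, text), "|", first_term(GREEDY, text))
```

The greedy form fixes the nested example but swallows every following term in the chained one, so on its own it only helps when a file name holds a single term.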

Bart Kiers
I don't need to match up the inside data, just use the outside bracket as a terminator. Am I right in thinking you mean it should look like this? `(?'field'F[0-9]{1,4})(?'term'\(.*\))(?'operator'_(OR|NOT|AND)_)?` That does actually resolve the problem for my 3rd example, but it breaks the groups in the second example (i.e. group 2 then catches `(red)_OR_F21(blue)_NOT_F21(pink)`).
Colin Pickard
After reading your update, I must confess that it is unclear to me what the rules are for matching the substrings you've indicated in the comments. It seems that sometimes you want to match up parentheses (`((brackets)are valid)`) and sometimes you don't (`(brackets))))))`).
Bart Kiers
Note that your title *"using surrounding brackets as delimiters while ignoring any inside brackets"* is a bit misleading. It suggests that `((brackets)` would be a valid match.
Bart Kiers
`((brackets)` is unfortunately a valid match too. The format is `F##(*)_OR|NOT|AND_F##(*)` where the * can be literally anything. This does actually mean that it could be e.g. `)_AND_(` ....
Colin Pickard
Simply using a greedy quantifier won't work in example #2, if I understand Colin's needs correctly.
kemp
+1  A: 

Hmm, this usually isn't possible with most regex engines, although it is possible in Perl:

PerlMonks

By using a recursive regexp:

use strict;
use warnings;

my $textInner =
  '(outer(inner(most "this (shouldn\'t match)" inner)))';
my $innerRe;
my $idx=0;
my(@match);

$innerRe = qr/
                \(
                (
                   (?:
                      [^()"]+
                   |
                      "[^"]*"
                   |
                      (??{$innerRe})
                   )*
                )
                \)(?{$match[$idx++]=$1;})
             /sx;

$textInner =~ /^$innerRe/g;

print "inner: $match[0]\n";

It's also possible to do it in most regex engines, provided that you only need a fixed depth of bracket nesting. I wrote something in Java a while ago that would construct a regex matching brackets up to 6 deep.

Here's my Java function for producing the regex:

public static String generateParensMatchStr(int depth, char openParen, char closeParen)
{
 if (depth == 0)
  return ".*?";
 else
  return "(?:\\" + openParen + generateParensMatchStr(depth - 1, openParen, closeParen) + "\\" +closeParen + "|.*?)+?";
}
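For comparison, here is a Python sketch of the same fixed-depth idea. It is not a port of the function above: it is a different construction that uses negated character classes instead of .*?, so every alternative consumes at least one character. The names build_nested_pattern and is_balanced_within are invented for this example.

```python
import re


def build_nested_pattern(depth, open_paren="(", close_paren=")"):
    """Build a pattern for text whose bracket pairs nest at most `depth` deep."""
    o, c = re.escape(open_paren), re.escape(close_paren)
    plain = f"[^{o}{c}]"           # any single non-bracket character
    pattern = f"{plain}*"          # depth 0: no brackets allowed at all
    for _ in range(depth):
        # Each level allows plain characters or one more bracketed layer.
        pattern = f"(?:{plain}|{o}{pattern}{c})*"
    return pattern


def is_balanced_within(text, depth):
    """True if the whole text is balanced and nested no deeper than `depth`."""
    return re.fullmatch(build_nested_pattern(depth), text) is not None


if __name__ == "__main__":
    print(is_balanced_within("F21((brackets)are valid)", 2))
    print(is_balanced_within("F21(brackets))))))", 6))
```

Note that this only accepts balanced brackets, so strings such as `(brackets))))))` from the question are rejected whatever depth is chosen.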
Benj
A: 
re.findall(r"((?:F[0-9]{1,4}\(.*\))(?:_(?:OR|NOT|AND)_)?)+?",YOURTEXT)

gives

['F30(green)', 'F21(red)_OR_F21(blue)_NOT_F21(pink)', 'F21((brackets)are valid)', 'F21(any old brackets)))))are valid)', 'F21(brackets))))))_OR_F21(blue)']

in Python. What do you think?

S.Mark
+1  A: 

Try this:

/(F[0-9]{1,4})(\([^_\]]+\))(?:_(OR|NOT|AND)_)?/

Tested with PHP, it seems to give the expected results (as long as the strings inside the round brackets don't contain _ or ]).
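The same pattern can be tried from Python as a quick check. This is only a sketch, with split_terms as a made-up helper name; it also shows the stated limitation, since a term containing _ produces no match at all.

```python
import re

# Same pattern as above, without the PHP delimiters.
PATTERN = re.compile(r"(F[0-9]{1,4})(\([^_\]]+\))(?:_(OR|NOT|AND)_)?")


def split_terms(file_name):
    """Return (field, term, operator) tuples; operator is '' when absent."""
    return PATTERN.findall(file_name)


if __name__ == "__main__":
    print(split_terms(r"C:\Temp\[DB_3][DT_2][F21(red)_OR_F21(blue)_NOT_F21(pink)].vsl"))
    # A term containing "_" cannot match, because "_" is excluded
    # from the character class.
    print(split_terms(r"C:\Temp\[DB_3][DT_2][F21(a_b)].vsl"))
```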

kemp
+1  A: 

Here are some more test results of mine in Python:

import re

# Raw string so the backslashes in the paths are kept literally.
x = r"""C:\Temp\[DB_3][DT_2][F30(green)].vsl // F30 (green)
C:\Temp\[DB_3][DT_2][F21(red)_OR_F21(blue)_NOT_F21(pink)].vsl // F21 (red) _OR_ OR
C:\Temp\[DB_3][DT_2][F21((brackets)are valid)].vsl // F21 ((brackets)are valid)
C:\Temp\[DB_3][DT_2][F21(any old brackets)))))are valid)].vsl // F21 (any old brackets)))))are valid)
C:\Temp\[DB_3][DT_2][F21(brackets))))))_OR_F21(blue)].vsl // F21 (brackets)))))) _OR_ OR"""
x = re.sub(r"//.*", "", x)                                # strip the trailing comments
x = re.sub(r"(_(OR|NOT|AND)_).*?]", r" \1 \2]", x)        # keep only the first operator
matches = re.findall(r"(?:F[0-9]{1,4}\(.*\).*(?=]))", x)  # take everything up to the closing ]
for match in matches:
    print(match)

This gives:

F30(green)
F21(red) _OR_ OR
F21((brackets)are valid)
F21(any old brackets)))))are valid)
F21(brackets)))))) _OR_ OR

Does that meet your expected result?

S.Mark