I have a C# Regex class matching multiple subgroups such as

(?<g1>abc)|(?<g2>def)|(?<g3>ghi)

but with much more complicated sub-patterns. I basically want to match anything that doesn't belong to any of those groups, in addition to existing groups.

I tried

(?<g1>abc)|(?<g2>def)|(?<g3>ghi)|(.+?)

but it turned out too slow. I can't do negation because I don't want to copy those complex subpatterns redundantly. Using just (.+) overrides all other groups as expected.

Is there any other way? If that doesn't work I'll have to write an ad-hoc parser.

Additional details: All these groups are evaluated against a MatchEvaluator. So a Regex class behavior that sends "unmatched strings" to the MatchEvaluator will also work.

A sample text would be

.......abc........ghi.....def.....abc....def...ghi......abc.......

I want to catch the parts in between.

A: 

If your regex is four pages long, writing a state machine yourself would probably be a better idea...

Rex M
He said "Such as" ... "but with *much more complicated* sub-patterns"
Daniel LeCheminant
If I had pasted the actual Regex, this question would be four pages long :)
ssg
If your regular expression is 4 pages long, you shouldn't be using a regular expression
Gareth
Because of performance? I'm very happy with its performance without the last case.
ssg
If your Regex is 4 pages long, something is horribly wrong on more than one level.
Chris Ballance
A regex is implemented as a state machine. The state machine is constructed internally when you create the Regex object.
Bartek Szabat
I don't understand "horribly wrong" part. Why do you think it's horribly wrong?
ssg
If your Regex takes up "four pages," you need to go home and rethink your life.
Chris Ballance
I need to go home and rethink my life.
ssg
@ssg or at least rethink your regex approach!
Rex M
Folks, I understand why you are upset by a very long Regex, but you are assuming two things: that my regex doesn't contain any code formatting (indentation, multiple lines, comments), and that a long regex will always perform badly. Both assumptions are wrong. But I still feel like I should rethink my life. :)
ssg
No one said a long regex will perform poorly. My point is that regex complexity increases considerably with size from maintainability and modifiability points of view. Creating the state machine yourself, instead of relying on the regex abstraction, is often better for such complicated parsing.
Rex M
+2  A: 

but it turned out too slow. I can't do negation because I don't want to copy those complex subpatterns redundantly.

Why not something like:

const string COMPLEX_REGEX_PATTERN = @"\Gobbel[dy]go0\k";
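One way to read this suggestion is to keep each sub-pattern in a named constant and build both the token alternatives and a negated "everything else" alternative from the same constants, so nothing is copied by hand. The sketch below assumes that reading. The names `G1`, `G2`, `G3`, `AnyToken`, the `other` group, and the `Label` helper are hypothetical, and the short literals stand in for the real sub-patterns. It also assumes the sub-patterns are safe to repeat inside a lookahead (for example, no named groups that would clash when duplicated).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ComposedPatternDemo
{
    // Placeholder sub-patterns; the real ones would be far more complex.
    const string G1 = "abc";
    const string G2 = "def";
    const string G3 = "ghi";

    // Any of the sub-patterns, reused inside the negative lookahead below.
    static readonly string AnyToken = "(?:" + G1 + "|" + G2 + "|" + G3 + ")";

    // Tokens first, then a run of characters none of which starts a token.
    static readonly Regex Pattern = new Regex(
        "(?<g1>" + G1 + ")|(?<g2>" + G2 + ")|(?<g3>" + G3 + ")" +
        "|(?<other>(?:(?!" + AnyToken + ").)+)",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Returns one "group:text" entry per match, in input order.
    public static List<string> Label(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            string name = m.Groups["g1"].Success ? "g1"
                        : m.Groups["g2"].Success ? "g2"
                        : m.Groups["g3"].Success ? "g3"
                        : "other";
            result.Add(name + ":" + m.Value);
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Label("..abc...ghi")));
    }
}
```

Each run of unmatched text arrives as a single `other` match rather than one match per character, at the cost of evaluating the lookahead at every position inside that run.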

Ryan
Not a bad idea at all. I'll think about this if I don't receive any better answer.
ssg
+1  A: 

Have you tried setting the regex option to be compiled? I find using a static compiled regex can speed things up considerably.

Nicholas Mancuso
It's still slow even when compiled. Around 6 times slower than the one without the last group.
ssg
Compiling the regex helps, but if you're only using it once, then you don't gain anything from compiling it. Only gains are through reuse of the same regex.
Chris Ballance
Exactly. I wasn't sure how many times he would be matching. If I know I'm going to use it multiple times and the pattern is static, I usually use static readonly Regex rx = new Regex("somepattern", RegexOptions.Compiled);
Nicholas Mancuso
I'm using it multiple times and I'm compiling it only once. But as I said, 6x speed difference does not change.
ssg
+2  A: 

Your regex generates a separate match for every single character outside g1, g2, and g3. So when you use it with a MatchEvaluator, it generates a lot of evaluator calls. That's why it's slow.
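To see the effect, one can count how often the evaluator is invoked. The following is a minimal sketch with a hypothetical `CountEvaluatorCalls` helper and a short made-up input. It compares the question's pattern with and without the trailing `(.+?)` alternative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorCallCounter
{
    // Runs Regex.Replace with an evaluator that only counts its invocations.
    public static int CountEvaluatorCalls(string pattern, string input)
    {
        int calls = 0;
        Regex.Replace(input, pattern, m =>
        {
            calls++;
            return m.Value; // leave the text unchanged
        });
        return calls;
    }

    public static void Main()
    {
        string input = "..abc..ghi";
        string tokensOnly = @"(?<g1>abc)|(?<g2>def)|(?<g3>ghi)";
        string withCatchAll = tokensOnly + @"|(.+?)";

        // Only the two tokens are matched.
        Console.WriteLine(CountEvaluatorCalls(tokensOnly, input));
        // The lazy (.+?) stops after one character, so each of the four
        // dots becomes its own match in addition to the two tokens.
        Console.WriteLine(CountEvaluatorCalls(withCatchAll, input));
    }
}
```

On real input, where the text between tokens is long, the number of calls grows with the number of unmatched characters rather than with the number of tokens.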

If you try the following regex:

(?<rest>.*?)((?<g1>abc)|(?<g2>def)|(?<g3>ghi)|$)

you will get a single "rest" group match for each entire fragment of text that doesn't contain a "g" group.

Regex C# code:

Regex regex = new Regex(
    @"(?<rest>.*?)((?<g1>abc)|(?<g2>def)|(?<g3>ghi)|$)",
    RegexOptions.Singleline
    | RegexOptions.Compiled
    );
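Since the question feeds every match to a MatchEvaluator, here is a hedged sketch of how that could look with this pattern. The `Transform` method and its behavior (bracketing the unmatched text and upper-casing the token) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RestGroupDemo
{
    // Same pattern as above: a lazy "rest" run followed by a token or end of input.
    static readonly Regex Pattern = new Regex(
        @"(?<rest>.*?)((?<g1>abc)|(?<g2>def)|(?<g3>ghi)|$)",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Hypothetical evaluator: wraps unmatched text in [ ] and upper-cases the token.
    public static string Transform(string input)
    {
        return Pattern.Replace(input, m =>
        {
            string rest = m.Groups["rest"].Value;
            // The token (possibly empty at end of input) is whatever follows "rest".
            string token = m.Value.Substring(rest.Length);
            string left = rest.Length > 0 ? "[" + rest + "]" : "";
            return left + token.ToUpperInvariant();
        });
    }

    public static void Main()
    {
        Console.WriteLine(Transform("..abc...ghi.def.."));
    }
}
```

Note that the pattern can also produce an empty match at the very end of the input (empty "rest", then `$`), so the evaluator should tolerate both parts being empty, as the length check above does.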
Bartek Szabat
Yes this worked, thanks!
ssg