ansaurus

Question

Answer 1

+14 A:

The fastest (as shown below) is a simple string split:

line.Split(new char[] { '{', '<', '>', '}', ' ', '^', '"' },
           StringSplitOptions.RemoveEmptyEntries);

The next fastest is an anchored regular expression (ugly):

Regex lineParse
    = new Regex(@"^\{(<([^>]+)>\s*){3,4}(""([^""]+)""\^\^<([^>]+)>\s*)?\}$",
                RegexOptions.Compiled);
Match m = lineParse.Match(line);
if (m.Groups[2].Captures.Count == 3)
{
    Data data = new Data { C = m.Groups[2].Captures[0].Value,
        S = m.Groups[2].Captures[1].Value, P = m.Groups[2].Captures[2].Value,
        O = m.Groups[4].Value, T = m.Groups[5].Value };
} else {
    Data data = new Data { C = m.Groups[2].Captures[0].Value,
        S = m.Groups[2].Captures[1].Value, P = m.Groups[2].Captures[2].Value,
        O = m.Groups[2].Captures[3].Value, T = String.Empty };
}

Timings for 1M lines of random data (String.Split as the baseline):

Method                #1  Wall ( Diff)     #2  Wall ( Diff)
------------------------------------------------------------
line.Split                3.6s (1.00x)         3.1s (1.00x)
myRegex.Match             5.1s (1.43x)         3.3s (1.10x)
itDependsRegex.Matches    6.8s (1.85x)         4.4s (1.44x)
stateMachine              8.4s (2.34x)         5.6s (1.82x)
alanM.Matches             9.1s (2.52x)         7.8s (2.56x)
yourRegex.Matches        18.3s (5.06x)        12.1s (1.82x)

Updated to include @AlanM and @itdepends regular expressions. It appears that Regex.Matches is slower than Regex.Match, however, the more context clues you give the parser, the better it performs. The single negative character class used by @AlanM is the simplest to read, yet slower than the most cryptic (mine). Hats off to @itdepends for the simplest regex that produces the fastest time. Ok, and while I thought it would be crazy to write a state machine to parse the line, it actually doesn't perform poorly at all...kudos to @RexM for the suggestion. I've also added times from my Q6600 at home (#2) v. an older Xeon at work (#1).

sixlettervariables 2009-02-04 23:39:17

Thank you for turning this into a really good post.

it depends 2009-02-05 05:00:49

really great answer

spoon16 2009-02-05 10:21:28

Answer 2

+2 A:

After some testing I came up with:

@"<(?<capture>[^>]+)>|""(?<capture>[^""]+)"""

The value needs to be acessed with match.Groups[1].Value.

In my unscientific test it was around 75-80% faster then the one in the original question.

Match vs Matches

In production I usually use Match, but used Matches for the above. I've never really considered the performance implications so did a little testing, so with exactly the same regex:

for(Match match = regex.Match(input); match.Success; match = match.NextMatch())
// min 5.01 sec
// max 5.15 sec

foreach(Match match in regex.Matches(input))
// min 5.66 sec
// max 6.07 sec

So Match certainly can be faster than Matches.

it depends 2009-02-05 00:10:31

Added timing for yours to mine, shaves roughly 3s from OP's regex

sixlettervariables 2009-02-05 00:15:46

+1, I think it is a combination of backtracking and context provided in the regular expression. The angle brackets give your regex context that just a negative character class would not give.

sixlettervariables 2009-02-05 15:54:56

Answer 3

+4 A:

Sometimes a state machine is significantly faster than a Regex.

Rex M 2009-02-05 02:50:26

+1, I was going to scoff at your answer, until I wrote it up myself. It isn't the worst, nor is it the best. Surprisingly decent for the amount of work required...

sixlettervariables 2009-02-05 04:12:51

Since regex's are just a special case of a state-machine, with the right optimizations, you should be able to beat any regex engine.

Eclipse 2009-02-05 05:00:57

Answer 4

+1 A:

As far as I can see, the regexes that have been offered so far are much more complicated than they need to be. If @sixlettervariables' Split approach works, Matches should work with this regex:

@"[^{}<> ^""]+"

But I would still expect the String.Split approach to be faster.

Alan Moore 2009-02-05 02:52:41

On average 9.1s, I wonder if .Matches is just slower?

sixlettervariables 2009-02-05 03:37:01

I think what makes String.Split faster is that it doesn't use regexes. Try it with Regex.Split.

Alan Moore 2009-02-05 04:20:15

Oh, I see you're talking about Matches vs. Match performance. But you're only doing one match for each line, and your regex has virtually no backtracking opportunities. All the others have to perform several matches per line.

Alan Moore 2009-02-05 04:36:04

ansaurus

tags:

views:

answers:

A more efficient Regex or alternative?

related questions