Hi! Again a regex question.

Which is more efficient: chaining a lot of Regex.Replace calls, each with its own specific pattern, or a single Regex.Replace with an or'ed pattern (pattern1|pattern2|...)?

Thanks in advance, Fabian

+3  A: 

My answer sucks but: it depends. How many do you have? Will the few milliseconds you save really make a difference? Which solution is the most readable, the easiest to maintain, the most scalable?

Try both methods with your specific requirements at hand and you will see. You could be surprised.
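
A rough way to "try both" is a small timing harness. The sketch below uses Python's `re` and `timeit` as a stand-in for .NET's `Regex.Replace`. The patterns, replacement string and sample text are made up for illustration, so the numbers mean nothing until you swap in your own patterns and data.

```python
import re
import timeit

# Hypothetical patterns, replacement and input, chosen only for illustration.
PATTERNS = [r"\bfoo\b", r"\bbar\b", r"\bbaz\b"]
REPLACEMENT = "X"
TEXT = "foo spam bar eggs baz ham " * 1000

# Approach 1: one compiled regex per pattern, applied one after another.
SEPARATE = [re.compile(p) for p in PATTERNS]

def cascade(text):
    for rx in SEPARATE:
        text = rx.sub(REPLACEMENT, text)
    return text

# Approach 2: a single regex that ORs the patterns together.
COMBINED = re.compile("|".join(f"(?:{p})" for p in PATTERNS))

def combined(text):
    return COMBINED.sub(REPLACEMENT, text)

# The timings are only comparable if both approaches give the same result.
assert cascade(TEXT) == combined(TEXT)

if __name__ == "__main__":
    for fn in (cascade, combined):
        seconds = timeit.timeit(lambda: fn(TEXT), number=20)
        print(f"{fn.__name__}: {seconds:.4f}s")
```

Note that the two approaches are only interchangeable when no replacement can itself be matched by a later pattern. The cascade rescans its own output, while the combined regex makes a single pass.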

Coincoin
+1  A: 

Depends entirely on the pattern and the implementation logic - if simple (and I imagine most real world cases would be) the single combined regex will be faster, if complex the multiple operations might be, but benchmarking is the answer if it's a situation where this actually matters.

Otherwise the two will be close enough that you shouldn't care, premature optimisation and all that.

annakata
+1  A: 

It depends on how big your text is and how many matches you expect. If at all possible, put a text literal or anchor (e.g. ^) at the front of the Regex. The .NET Regex engine will optimize this so that it searches for that text using a fast Boyer-Moore algorithm (which can skip characters) rather than a standard IndexOf that looks at each character. In the case that you have several patterns with literal text at the front, there is an optimization to create a set of possible start characters. Positions that start with any other character are skipped quickly.

In general, you might want to consider reading Mastering Regular Expressions which highlights general optimizations to get an idea for better performance (especially chapter 6).

I'd say you might get faster perf if you put everything in one Regex, but put the most likely option first, followed by the second most likely, etc. The number one thing to watch out for is backtracking. If you do something like

".*"

to match a quoted string, realize that once it finds the first " then it will always go to the end of the line by default and then start backing up until it finds another ".
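
To make the greedy behaviour concrete, here is a small sketch in Python's `re` (used as a stand-in; greedy `.*` behaves the same way in this respect). The sample text and variable names are made up for illustration. It compares the greedy pattern with two common alternatives: a negated character class and a lazy quantifier.

```python
import re

# Made-up sample line containing two quoted strings.
text = 'say "a" and then "b"'

greedy = re.compile(r'".*"')      # .* runs to the end of the line, then backs up
negated = re.compile(r'"[^"]*"')  # can never run past the closing quote
lazy = re.compile(r'".*?"')       # expands one character at a time instead

greedy_matches = greedy.findall(text)    # ['"a" and then "b"']
negated_matches = negated.findall(text)  # ['"a"', '"b"']
lazy_matches = lazy.findall(text)        # ['"a"', '"b"']

print(greedy_matches, negated_matches, lazy_matches)
```

The greedy version swallows everything between the first and the last quote on the line. The negated class usually needs the least backtracking because it stops at the first closing quote.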

The Mastering Regular Expressions book goes into great detail on how to avoid this.

Jeff Moser
After some benchmarking, separate regexes turned out to be more efficient. Plus, they all start and finish with the same pattern, so I first check whether the generic pattern is found before processing; otherwise I just skip the string.
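
A minimal sketch of that pre-check idea, in Python's `re` rather than .NET. The `{{...}}` delimiters, the rules and the `expand` helper are all hypothetical, since the actual patterns were not posted.

```python
import re

# Hypothetical shared shape: every specific pattern starts with "{{" and ends with "}}".
GUARD = re.compile(r"\{\{.*?\}\}")

# Hypothetical specific patterns with their replacements.
RULES = [
    (re.compile(r"\{\{greeting\}\}"), "Hello"),
    (re.compile(r"\{\{target\}\}"), "world"),
]

def expand(text):
    # Cheap test first: if the generic pattern is absent, skip the string entirely.
    if GUARD.search(text) is None:
        return text
    # Only strings that pass the guard pay for the full cascade.
    for rx, replacement in RULES:
        text = rx.sub(replacement, text)
    return text
```

For example, `expand("{{greeting}}, {{target}}!")` runs both rules, while `expand("plain text")` returns after the single guard search.
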
Fabian Vilers
A: 

I'm surprised that your benchmarking showed multiple separate expressions to be faster, and I'd be curious to see an example of the regexes you are using. Basic regular expressions (i.e. without advanced features such as backreferences, which require backtracking) can be compiled to "finite state machines" whose running time is O(n) in the length of the string being searched, and independent of the length of the pattern. So running 10 different regular expressions should, on average, take about 10 times as long as a single regular expression which combines those patterns with "|".
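
As a sketch of the combined form, the example below (Python's `re`, with a made-up replacement table and a hypothetical `replace_all` helper) joins literal patterns with "|" and picks the replacement per match in a callback, so the input is scanned once.

```python
import re

# Made-up replacement table, for illustration only.
REPLACEMENTS = {"cat": "dog", "red": "blue", "one": "two"}

# Join the escaped keys with "|"; longer keys go first so that a key
# which is a prefix of another cannot shadow it.
COMBINED = re.compile(
    "|".join(re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True))
)

def replace_all(text):
    # One pass over the text; the callback looks up what was actually matched.
    return COMBINED.sub(lambda match: REPLACEMENTS[match.group(0)], text)

print(replace_all("one red cat"))  # two blue dog
```

One behavioural difference to keep in mind: the single pass never rescans text it has already replaced, whereas a cascade of separate replacements can match the output of an earlier step.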

(I know this is an old question from March, but I couldn't resist adding my 2 cents :)

Todd Owen