views:

131

answers:

4

I was wondering if there are any general guidelines for when to use regex VS "string".contains("anotherString") and/or other String API calls?

While above given decision for .contains() is trivial (why bother with regex if you can do this in a single call), real life brings more complex choices to make. For example, is it better to do two .contains() calls or a single regex?

My rule of thumb was to always use regex, unless this can be replaced with a single API call. This prevents code against bloating, but is probably not so good from code readability point of view, especially if regex tends to get big.

Another, often overlooked, argument is performance. How do I know how many iterations (as in "Big O") does this regex require? Would it be faster than sheer iteration? Somehow everybody assumes that once regex looks shorter than 5 if statements, it must be quicker. But is this always the case? This is especially relevant if regex cannot be pre-compiled in advance.

+1  A: 

I would strongly suggest that you write the code for both and time it. It's pretty simple to do this and you'll get an answers that is not a generic "rule of thumb" but instead a very specific answer that holds for your problem domain.

Vance Morrison has an excellent post about micro benchmarking, and has a tool that makes it really simple for you to answer questions like this...

http://msdn.microsoft.com/en-us/magazine/cc500596.aspx

If you want my personal "rule of thumb" then it's that RegEx is often slower for this sort of thing, but you should ignore me and measure it yourself :-)

If, for non-performance reasons, you continue to use Regular Expressions then I can really recommend two things. Get a profiler (such as ANTS) and see what your code does in production. Then, get a copy of the Regular Expression Cookbook...

http://www.amazon.co.uk/Regular-Expressions-Cookbook-Jan-Goyvaerts/dp/0596520689/ref=sr%5F1%5F1?ie=UTF8&s=books&qid=1259147763&sr=8-1

... as it has loads of tips on speeding up RegEx code. I've optimized RegEx code by a factor of 10 following tips from this book.

Martin Peck
I'm glad to hear you like Regular Expressions Cookbook. If any of your friends don't have a copy yet, O'Reilly and I are doing a giveaway at regexguru.com in which anyone can participate until the end of the month (28 Feb 2010).
Jan Goyvaerts
@Jay Cool. I'll forward this on. Thanks for the tip-off.
Martin Peck
+2  A: 

It's hard to estimate performance without using a profiler, generally the best strategy is to write what makes the most logical sense and is easier to understand/read.

If two .contains() calls are easier to logically understand then that's the better route, the same logic applies if a regex makes more sense.

It's also important to consider that other developers on your team may not have a great understanding of regex.

If at a later time in production the use of regex over .contains() (or vice versa) is identified as a bottleneck, try and profile both.

Rule of thumb: Write code to be readable, use a profiler to identify bottlenecks and only then replace the readable code with faster code.

VoDurden
+1 Don't optimize prematurely.
Tim Pietzcker
A: 

The answer (as usual) is that it depends.

In your particular case, I guess the alternative would be to do the regex "this|that" and then do a find. This particular construct really pokes at regex's weaknesses. The "OR" in this case doesn't really know what the sub-patterns are trying to do and so can't easily optimize. It ends up doing the equivalent of (in pseudo code):

for( i = 0; i < stringLength; i++ ) {
    if( stringAt pos i starts with "this" )
       found!
    if( stringAt pos i starts with "that" )
       found!
}

There almost isn't a slower way to do it. In this case, two contains() calls will be much faster.

On the other hand, a full match on: ".*this.*|.*that.*" may optimize better.

To me, regex should be used when the code to do otherwise is complicated or unwieldy. So if you want to find one of two or three strings in a target string then just use contains. But if you wanted to find words starting with 'A' or 'B' and ending in 'g'-'m'... then use regex.

And then you won't be so worried about a few cycles here and there.

PSpeed
Your answer makes no sense at all. The regex this|that does one linear search through the string, with only a little bit of extra logic when "th" is encountered, stopping at the first match of this or that. Making two contains() calls does two linear searches through the string, requiring the full string to be searched if it doesn't contain the first word. This will always have worse performance. .*this.*|.*that.* will definitely not optimize better than the much simpler this|that because the initial .* matches the whole string until the end, then backtracking to find the words.
Jan Goyvaerts
The worst cases are the same either way, every pattern will be tried at every relevant character position. For a handful of straight matches, I agree that "this|that" has more optimal cases (for example, 'that' occurring in the string but not 'this'). As the list of patterns grows and the chances of false starts grows it changes. In this case, I'm probably off base. Straight literal matches may always favor the regex (though Java's particular implementation seems to perform abismally for several hundred patterns from experience).
PSpeed
For non-literal patterns where the matching itself is expensive, it may pay to run several separate operations instead of one big regex... especially if you don't care about the earliest match (position-wise).
PSpeed
+1  A: 

RegexBuddy has a built-in regular expressions debugger. It shows how many steps the regular expression engine needed to find a match or to fail to find a match. By using the debugger on strings of different lengths, you can get an idea of the complexity (big O) of the regular expression. If you look up "benchmark" in the index of RegexBuddy's help file you'll get some more tips on how to interpret this.

When judging the performance of a regular expression, it is particularly important to test situations where the regex fails to find a match. It is very easy to write a regular expression that finds its matches in linear time, but fails in exponential time in a situation that I call catastrophic backtracking.

To use your 5 if statements as an example, the regex one|two|three|four|five scans the input string once, doing a little bit of extra work when an o, t, or f is encountered. But 5 if statements checking if the string contains a word will search the entire string 5 times if none of the words can be found. If five occurs at the start of the string, then the regex finds the match instantly, while the first 4 if statements scan the whole string in vain before the 5th if statement finds the match.

Jan Goyvaerts