tags:

views:

2734

answers:

8

If I'm trying to match a quote-delimited string with a regex, which of the following is "better" (where "better" means both more efficient and less likely to do something unexpected):

/"[^"]+"/ # match quote, then everything that's not a quote, then a quote

or

/".+?"/   # match quote, then *anything* (non-greedy), then a quote

Assume for this question that empty strings (i.e. "") are not an issue. It seems to me (no regex newbie, but certainly no expert) that these will be equivalent.

Update: Upon reflection, I think changing the + characters to * will handle empty strings correctly anyway.

+3  A: 

I'd go for number two since it's much easier to read. But I'd still like to match empty strings so I would use:

/".*?"/
PEZ
@Graeme Perrow: .*? is the defacto standard of non greedy matching
slf
A: 

I prefer the first regex, but it's certainly a matter of taste.

The first one might be more efficient?

Search for double-quote
add double-quote to group
for each char:
    if double-quote:
        break
    add to group
add double-quote to group

Vs something a bit more complicated involving back-tracking?

Douglas Leeder
+4  A: 

I'd say the second one is better, because it fails faster when the terminating " is missing. The first one will backtrack over the string, a potentially expensive operation. An alternative regexp if you are using perl 5.10 would be /"[^"]++"/. It conveys the same meaning as version 1 does, but is as fast as version two.

Leon Timmermans
Why is the second one able to fail faster?
innaM
I added the explanation a few seconds before you asked. It the second one doesn't backtrack.
Leon Timmermans
If you want to get really fancy and support escaped quotes in the regexp, you could do this: /"(?:[^"]|(?<!\\)(?>\\\\)*\\")++"/. I've explained that one at http://stackoverflow.com/questions/56554/what-is-the-proper-regular-expression-for-an-unescaped-backslash-before-a-chara#56671
Leon Timmermans
Good point regarding the backtracking. For my "long running loop" example, this would surely cause more hurt than the lazy quantifier. Also good that you mentioned possessive quantifiers. +1
Tomalak
Leon is wrong about the backtracking. .*? backtracks for each and every character in the string, when the closing " is present. When the closing " is missing, both regexes backtrack.
Jan Goyvaerts
"forward" tracking has some cost, but about the same as the character class in the first regexp. AFAIK it doesn't have to backtrack .*? itself because it has already "forward" tracked that, but I don't know the details of the implementation. It would have to backtrack the first " though :-/.
Leon Timmermans
I've done a quick benchmark. They're pretty much the same on both matching and nonmatching strings on perl 5.8, but on perl 5.10 the second one is about 3 times faster on matching strings. What I don't understand is that non-matching is significantly faster :-/.
Leon Timmermans
The term "backtracking" means that the regex engine goes back to a previous token in the regular expression. It doesn't say anything about going backwards or forwards in the subject string. In my answer I give general theory. Some regex engines may have optimizations for specific cases.
Jan Goyvaerts
A: 

Considering that I didn't even know about the "*?" thing until today, and I've been using regular expressions for 20+ years, I'd vote in favour of the first. It certainly makes it clear what you're trying to do - you're trying to match a string that doesn't include quotes.

Paul Tomblin
I've been using regular expressions for many years too, and I knew there was a way to do things in a non-greedy way, but didn't realize how easy it was until today. That's why I asked - I find it easier to read (now that I know what it means), so is there a reason I shouldn't use it?
Graeme Perrow
The only limitation is in your regex engine. There's the remote possibility that you are facing one that does not support non-greedy quantifiers. Modern ones generally do.
Tomalak
Some regex engines supports changing the default greediness (making .* be nongreedy and .*? be greedy). In PHP you can use the U regex modifier for this. I have had use of that when scraping html.
PEZ
PEZ: I strongly recomment you use /.*?/ instead of /.*/U Most people will recognize .*? as a lazy quantifier, or at least as something they don't know. A /U tucked away at the end of the regex is easily missed. It's about keeping your code human-readable.
Jan Goyvaerts
+1  A: 

From a performance perspective (extremely heavy, long-running loop over long strings), I could imagine that

"[^"]*"

is faster than

".*?"

because the latter would do an additional check for each step: peeking at the next character. The former would be able to mindlessly roll over the string.

As I said, in real-world scenarios this would hardly be noticeable. Therefore I would go with number two (if my current regex flavor supports it, that is) because it is much more readable. Otherwise with number one, of course.

Tomalak
+2  A: 

I would suggest:

([\"'])(?:\\\1|.)*?\1

But only because it handles escaped quote chars and allows both the ' and " to be the quote char. I would also suggest looking at this article that goes into this problem in depth:

http://blog.stevenlevithan.com/archives/match-quoted-string

However, unless you have a serious performance issue or cannot be sure of embedded quotes, go with the simpler and more readable:

/".*?"/

I must admit that non-greedy patterns are not the basic Unix-style 'ed' regular expression, but they are getting pretty common. I still am not used to group operators like (?:stuff).

Harold Bamford
What's the "(:?stuff)" you mention supposed to mean? I know "(?:stuff)", but not the other.
Tomalak
In Perl, these are called "Extended Patterns". Check out http://perldoc.perl.org/perlre.html under "Extended Patterns" (about 1/3 of the way down). In this case, it is just like (stuff) except there is no capture ($1 or \1) involved.
Harold Bamford
I think it's just a typo.
PEZ
It was. I just now fixed it. I didn't understand the original comment (I thought there was some obscure escaping bug in SO). Sorry for the confusion.
Harold Bamford
No worries. ;-) Essentially, you can delete this whole dialog as it has no substance anymore.
Tomalak
+7  A: 

You should use number one, because number two is bad practice. Consider that the developer who comes after you wants to match strings that are followed by an exclamation point. Should he use:

"[^"]*"!

or:

".*?"!

The difference appears when you have the subject:

"one" "two"!

The first regex matches:

"two!"

while the second regex matches:

"one" "two"!

Always be as specific as you can. Use the negated character class when you can.

Another difference is that [^"]* can span across lines, while .* doesn't unless you use single line mode. [^"\n]* excludes the line breaks too.

As for backtracking, the second regex backtracks for each and every character in every string that it matches. If the closing quote is missing, both regexes will backtrack through the entire file. Only the order in which then backtrack is different. Thus, in theory, the first regex is faster. In practice, you won't notice the difference.

Jan Goyvaerts
This counterexample is exactly what I was looking for when I wrote the question. Thanks Jan
Graeme Perrow
+1  A: 

Using the negated character class prevents matching when the boundary character (doublequotes, in your example) is present elsewhere in the input.

Your example #1:

/"[^"]+"/ # match quote, then everything that's not a quote, then a quote

matches only the smallest pair of matched quotes -- excellent, and most of the time that's all you'll need. However, if you have nested quotes, and you're interested in the largest pair of matched quotes (or in all the matched quotes), you're in a much more complicated situation.

Luckily Damian Conway is ready with the rescue: Text::Balanced is there for you, if you find that there are multiple matched quote marks. It also has the virtue of matching other paired punctuation, e.g. parentheses.

Trochee