ansaurus

Question

How can I match a quote-delimited string with a regex?

Answer 1

+3 A:

I'd go for number two since it's much easier to read. But I'd still like to match empty strings so I would use:

/".*?"/

PEZ 2008-12-17 16:27:23

@Graeme Perrow: .*? is the defacto standard of non greedy matching

slf 2008-12-17 16:35:11

Answer 2

A:

I prefer the first regex, but it's certainly a matter of taste.

The first one might be more efficient?

Search for double-quote
add double-quote to group
for each char:
    if double-quote:
        break
    add to group
add double-quote to group

Vs something a bit more complicated involving back-tracking?

Douglas Leeder 2008-12-17 16:32:02

Answer 3

+4 A:

I'd say the second one is better, because it fails faster when the terminating " is missing. The first one will backtrack over the string, a potentially expensive operation. An alternative regexp if you are using perl 5.10 would be /"[^"]++"/. It conveys the same meaning as version 1 does, but is as fast as version two.

Leon Timmermans 2008-12-17 16:38:14

Why is the second one able to fail faster?

innaM 2008-12-17 16:46:50

I added the explanation a few seconds before you asked. It the second one doesn't backtrack.

Leon Timmermans 2008-12-17 16:48:31

If you want to get really fancy and support escaped quotes in the regexp, you could do this: /"(?:[^"]|(?<!\\)(?>\\\\)*\\")++"/. I've explained that one at http://stackoverflow.com/questions/56554/what-is-the-proper-regular-expression-for-an-unescaped-backslash-before-a-chara#56671

Leon Timmermans 2008-12-17 17:06:31

Good point regarding the backtracking. For my "long running loop" example, this would surely cause more hurt than the lazy quantifier. Also good that you mentioned possessive quantifiers. +1

Tomalak 2008-12-17 19:37:03

Leon is wrong about the backtracking. .*? backtracks for each and every character in the string, when the closing " is present. When the closing " is missing, both regexes backtrack.

Jan Goyvaerts 2008-12-18 10:47:41

"forward" tracking has some cost, but about the same as the character class in the first regexp. AFAIK it doesn't have to backtrack .*? itself because it has already "forward" tracked that, but I don't know the details of the implementation. It would have to backtrack the first " though :-/.

Leon Timmermans 2008-12-18 12:22:18

I've done a quick benchmark. They're pretty much the same on both matching and nonmatching strings on perl 5.8, but on perl 5.10 the second one is about 3 times faster on matching strings. What I don't understand is that non-matching is significantly faster :-/.

Leon Timmermans 2008-12-18 13:51:34

The term "backtracking" means that the regex engine goes back to a previous token in the regular expression. It doesn't say anything about going backwards or forwards in the subject string. In my answer I give general theory. Some regex engines may have optimizations for specific cases.

Jan Goyvaerts 2008-12-30 13:46:32

Answer 4

A:

Considering that I didn't even know about the "*?" thing until today, and I've been using regular expressions for 20+ years, I'd vote in favour of the first. It certainly makes it clear what you're trying to do - you're trying to match a string that doesn't include quotes.

Paul Tomblin 2008-12-17 16:40:27

I've been using regular expressions for many years too, and I knew there was a way to do things in a non-greedy way, but didn't realize how easy it was until today. That's why I asked - I find it easier to read (now that I know what it means), so is there a reason I shouldn't use it?

Graeme Perrow 2008-12-17 16:46:37

The only limitation is in your regex engine. There's the remote possibility that you are facing one that does not support non-greedy quantifiers. Modern ones generally do.

Tomalak 2008-12-17 17:39:22

Some regex engines supports changing the default greediness (making .* be nongreedy and .*? be greedy). In PHP you can use the U regex modifier for this. I have had use of that when scraping html.

PEZ 2008-12-17 18:05:58

PEZ: I strongly recomment you use /.*?/ instead of /.*/U Most people will recognize .*? as a lazy quantifier, or at least as something they don't know. A /U tucked away at the end of the regex is easily missed. It's about keeping your code human-readable.

Jan Goyvaerts 2008-12-19 00:17:55

Answer 5

+1 A:

From a performance perspective (extremely heavy, long-running loop over long strings), I could imagine that

"[^"]*"

is faster than

".*?"

because the latter would do an additional check for each step: peeking at the next character. The former would be able to mindlessly roll over the string.

As I said, in real-world scenarios this would hardly be noticeable. Therefore I would go with number two (if my current regex flavor supports it, that is) because it is much more readable. Otherwise with number one, of course.

Tomalak 2008-12-17 17:27:17

Answer 6

+2 A:

I would suggest:

([\"'])(?:\\\1|.)*?\1

But only because it handles escaped quote chars and allows both the ' and " to be the quote char. I would also suggest looking at this article that goes into this problem in depth:

http://blog.stevenlevithan.com/archives/match-quoted-string

However, unless you have a serious performance issue or cannot be sure of embedded quotes, go with the simpler and more readable:

/".*?"/

I must admit that non-greedy patterns are not the basic Unix-style 'ed' regular expression, but they are getting pretty common. I still am not used to group operators like (?:stuff).

Harold Bamford 2008-12-17 17:37:24

What's the "(:?stuff)" you mention supposed to mean? I know "(?:stuff)", but not the other.

Tomalak 2008-12-17 17:44:24

In Perl, these are called "Extended Patterns". Check out http://perldoc.perl.org/perlre.html under "Extended Patterns" (about 1/3 of the way down). In this case, it is just like (stuff) except there is no capture ($1 or \1) involved.

Harold Bamford 2008-12-17 17:51:37

I think it's just a typo.

PEZ 2008-12-17 18:09:11

It was. I just now fixed it. I didn't understand the original comment (I thought there was some obscure escaping bug in SO). Sorry for the confusion.

Harold Bamford 2008-12-17 18:38:50

No worries. ;-) Essentially, you can delete this whole dialog as it has no substance anymore.

Tomalak 2008-12-17 20:08:58

Answer 7

+7 A:

You should use number one, because number two is bad practice. Consider that the developer who comes after you wants to match strings that are followed by an exclamation point. Should he use:

"[^"]*"!

or:

".*?"!

The difference appears when you have the subject:

"one" "two"!

The first regex matches:

"two!"

while the second regex matches:

"one" "two"!

Always be as specific as you can. Use the negated character class when you can.

Another difference is that [^"]* can span across lines, while .* doesn't unless you use single line mode. [^"\n]* excludes the line breaks too.

As for backtracking, the second regex backtracks for each and every character in every string that it matches. If the closing quote is missing, both regexes will backtrack through the entire file. Only the order in which then backtrack is different. Thus, in theory, the first regex is faster. In practice, you won't notice the difference.

Jan Goyvaerts 2008-12-18 10:45:54

This counterexample is exactly what I was looking for when I wrote the question. Thanks Jan

Graeme Perrow 2008-12-18 20:24:10

Answer 8

+1 A:

Using the negated character class prevents matching when the boundary character (doublequotes, in your example) is present elsewhere in the input.

Your example #1:

/"[^"]+"/ # match quote, then everything that's not a quote, then a quote

matches only the smallest pair of matched quotes -- excellent, and most of the time that's all you'll need. However, if you have nested quotes, and you're interested in the largest pair of matched quotes (or in all the matched quotes), you're in a much more complicated situation.

Luckily Damian Conway is ready with the rescue: Text::Balanced is there for you, if you find that there are multiple matched quote marks. It also has the virtue of matching other paired punctuation, e.g. parentheses.

Trochee 2008-12-29 18:04:30

ansaurus

tags:

views:

answers:

How can I match a quote-delimited string with a regex?

related questions