I'm looking for a regex that can pull out quoted sections in a string, both single and double quotes.

For example:

"This is 'an example', \"of an input string\""

Matches:

  • an example
  • of an input string

I wrote up this:

 [\"|'][A-Za-z0-9\\W]+[\"|']

It works but does anyone see any flaws with it?

EDIT: The main issue I see is that it can't handle nested quotes.

A: 

It works but doesn't match other characters in quotes (e.g., non-alphanumeric, like binary or foreign language chars). How about this:

[\"']([^\"']*)[\"']

My C# regex is a little rusty so go easy on me if that's not exactly right :)
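For anyone who wants to try the negated-character-class idea quickly, here is a minimal sketch in Python rather than C# (an assumption on my part; the pattern itself carries over unchanged). The `text` value is the example string from the question.

```python
import re

# Sample input taken from the question.
text = "This is 'an example', \"of an input string\""

# An opening quote, then any run of characters that are not quotes
# (captured), then a closing quote of either kind.
pattern = re.compile(r"""["']([^"']*)["']""")

# With a single capture group, findall returns just the captured text.
matches = pattern.findall(text)
print(matches)  # ['an example', 'of an input string']
```

Note that this form does not require the closing quote to be the same kind as the opening one, and it stops at the first quote of either kind, so an apostrophe inside a double-quoted section ends the match early.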

Chris Bunch
That doesn't return any matches at all.
FlySwat
I changed it to use parens instead of the [], since I think it was treating the period as a literal period instead of a wildcard. I tested it out in Ruby with your example string and it seems to match them fine.
Chris Bunch
But the greedy star runs over any quotes there are, so you get the longest match, not the right match.
Tomalak
In that case, the first match also contains the rest of my test string.
FlySwat
Ah, I missed that one. This regex seems to work better: just capture everything that's not a quote.
Chris Bunch
+1  A: 

Like that?

"([\"'])(.*?)\\1"

Your desired match would be in group 2, and the kind of quote in group 1.

The flaws in your regex are 1) the greedy "+" and 2) [A-Za-z0-9], which does not match very much. Many characters are not in that range.
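A rough sketch of how the two groups come out, written in Python for convenience (my choice, not part of the answer; the backreference `\1` behaves the same way in .NET). The second input, `She said "it's fine"`, is a made-up example to show the other kind of quote nested inside.

```python
import re

# Sample input taken from the question.
text = "This is 'an example', \"of an input string\""

# Group 1: the opening quote. Group 2: the quoted text, matched lazily.
# \1 forces the closing quote to be the same character as the opening one.
pattern = re.compile(r"""(["'])(.*?)\1""")

results = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
print(results)  # [("'", 'an example'), ('"', 'of an input string')]

# Hypothetical input: an apostrophe inside double quotes does not end the match.
mixed = pattern.findall('She said "it\'s fine"')
print(mixed)  # [('"', "it's fine")]
```

Because the closing quote must equal group 1, the other kind of quote can appear inside the match. Quotes of the same kind nested inside each other are still not handled, and `.` will not cross a newline unless the dot-matches-newline option is enabled.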

Tomalak
I think you mean "\1", not "$1".
Michael Carman
Corrected that already. Sometimes I confuse regex dialects a bit; "$1" is the back reference of the VBScript regex implementation.
Tomalak
+2  A: 
Bill the Lizard
This does not work for strings like this one: "foo foo \"match\" foo \"match\" foo", where it returns "\"match\" foo \"match\"" as the only match.
Tomalak
That's because \W is the non-word character class, not the whitespace class, as I thought. My memory's not what it used to be.
Bill the Lizard
No. :-) It is because the "+" greedily matches to the end of the string, before backtracking occurs and the last applicable quote is given to back-reference "\1".
Tomalak
And for that matter, with the "\s" you now have in place it is not going to match punctuation, or accented characters, or Greek characters, etc...
Tomalak
Okay, that's my fault. I misunderstood what was to be matched. I thought it was matching alphanumerics and spaces. So changing to a reluctant quantifier is the ticket here.
Bill the Lizard
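To make the greedy versus reluctant difference from the comments above concrete, here is a small Python sketch. The two patterns are stand-ins I chose for illustration (the answer's original regex is not shown here); they differ only in `+` versus `+?`, and the input is the test string from the first comment.

```python
import re

# Test string from the comment thread.
text = 'foo foo "match" foo "match" foo'

# Greedy: ".+" runs to the end of the string, then backtracks to the LAST
# quote that satisfies the backreference \1.
greedy = re.findall(r"""(["'])(.+)\1""", text)

# Reluctant: ".+?" stops at the FIRST quote that satisfies \1.
lazy = re.findall(r"""(["'])(.+?)\1""", text)

print(greedy)  # [('"', 'match" foo "match')]
print(lazy)    # [('"', 'match'), ('"', 'match')]
```

The greedy version returns one long match that swallows everything between the first and last quote, while the reluctant version returns each quoted section separately.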
A: 
@"(""|')(.*?)\1"
J.F. Sebastian
A: 

You might already have one of these, but, in case not, here's a free, open source tool I use all the time to test my regular expressions. I typically have the general idea of what the expression should look like, but need to fiddle around with some of the particulars.

http://renschler.net/RegexBuilder/

joshua.ewer