I'm looking for a regular expression that matches either single-quoted or double-quoted strings, and allows the opposite quote character within the string. For example, both of the following would be legal strings: "hello 'there' world" and 'hello "there" world'

The regexp I'm using uses negative lookahead and is as follows:

(['"])(?:(?!\1).)*\1

I think this works, but what if the language didn't support negative lookahead? Is there any other way to do this, without alternation?

EDIT:

I know I can use alternation. This was more of just a hypothetical question. Say I had 20 different characters in the initial character class. I wouldn't want to write out 20 different alternations. I'm trying to actually negate the captured character, without using lookahead, lookbehind, or alternation.
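
For reference, the lookahead pattern from the question can be tried out as below. This is only a sketch in Python's re module (the question does not name a language); the helper names quoted_pattern and find_quoted are made up for illustration. It also shows the "20 different characters" case from the EDIT: only the character class grows, not the rest of the pattern.

```python
import re

def quoted_pattern(delimiters):
    """Compile the question's pattern for any set of delimiter characters."""
    # re.escape keeps characters such as ] ^ - safe inside the class.
    char_class = "[" + re.escape(delimiters) + "]"
    # Capture the opening delimiter, then consume one character at a time
    # as long as the next character is not that same delimiter.
    return re.compile("(" + char_class + r")(?:(?!\1).)*\1")

def find_quoted(text, delimiters="'\""):
    """Return every delimited string in text, delimiters included."""
    return [m.group(0) for m in quoted_pattern(delimiters).finditer(text)]

# The two example strings from the question, side by side.
print(find_quoted("\"hello 'there' world\" 'hello \"there\" world'"))
```

Passing a longer delimiters string (for example one that also contains a backtick and a pipe) reuses the same pattern unchanged.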

+1  A: 

Sure:

'([^']*)'|"([^"]*)"

On a successful match, Perl's $+ variable will hold the contents of whichever alternative matched.

Sean
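
A rough Python equivalent of the alternation above, as a sketch rather than a translation of the Perl: Python has no $+ variable, so this assumes Match.lastindex (the index of the last capturing group that matched) as the closest stand-in. The names ALT and contents are hypothetical.

```python
import re

# Same alternation as above, spelled out in verbose mode.
ALT = re.compile(r"""
    '([^']*)'     # single-quoted: group 1 holds the contents
  | "([^"]*)"     # double-quoted: group 2 holds the contents
""", re.VERBOSE)

def contents(text):
    """Return the contents of the first quoted string in text, or None."""
    m = ALT.search(text)
    # Only one of the two groups takes part in a match, so lastindex
    # points at whichever alternative succeeded.
    return m.group(m.lastindex) if m else None

print(contents("say 'hello \"there\" world'"))
```
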
+7  A: 

This is actually much simpler than you may have realized. You don't really need the negative look-ahead. What you want to do is a non-greedy (or lazy) match like this:

(['"]).*?\1

The ? character after the .* is the important part. It tells the engine to consume as few characters as possible before matching the next part of the regex. So you match either kind of quote, then zero or more characters, up to the first occurrence of whichever quote character you started with. You can learn more about greedy matching vs. non-greedy here and here.

mattmc3
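
To see what the lazy quantifier changes, here is a small sketch in Python's re module (an assumption; the answer does not name a language) that runs the lazy pattern and its greedy counterpart over the same input. The names LAZY, GREEDY and matches are illustrative only.

```python
import re

LAZY = re.compile(r"""(['"]).*?\1""")    # the pattern from the answer
GREEDY = re.compile(r"""(['"]).*\1""")   # same pattern without the ?

def matches(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]

text = "'a' and 'b'"
print(matches(LAZY, text))    # stops at the first closing quote each time
print(matches(GREEDY, text))  # runs on to the last quote in the line
```

Note that in Python, . does not match a newline unless re.DOTALL is set, so a quoted string spanning several lines would need that flag.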
thank you! this is what I was looking for. totally forgot about lazy quantifiers. well now I feel stupid
Sean Nilan
No need to feel bad - regexes are powerful, but complicated. It's hard to keep it all in your head. That's what SO.com is for.
mattmc3
The regex can be slightly improved by removing the expensive match all `.` by using - `(['"])[^\1]*?\1`
Peter Ajtai
@Peter Ajtai, no it can't; backreferences aren't allowed in character classes. That class gives you any character but \001 aka chr(1).
ysth
@ysth - Whoops. I just realized that. Thanks for the clarification.
Peter Ajtai
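
The point about character classes can be checked directly. The sketch below uses Python's re module, where numeric escapes inside [...] are likewise treated as characters, so [^\1] means "anything except chr(1)"; other engines may behave differently. The names NOT_BACKREF and class_accepts are hypothetical.

```python
import re

# Inside [...], \1 is not a backreference; it is the octal escape for chr(1).
NOT_BACKREF = re.compile(r"[^\1]")

def class_accepts(ch):
    """True if the single character ch is matched by [^\\1]."""
    return NOT_BACKREF.fullmatch(ch) is not None

print(class_accepts("'"), class_accepts('"'), class_accepts("\x01"))
```

Both quote characters are accepted by the class, so it does nothing to stop the match at the closing quote.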
+1  A: 

In the general case, regexps are not really the answer. You might be interested in something like Text::ParseWords, which tokenizes text, accounting for nested quotes, backslashed quotes, backslashed spaces, and other oddities.

Ryan Thompson
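
As an illustration of the tokenizer approach, here is a sketch using Python's standard shlex module. This is a stand-in, not Text::ParseWords: shlex follows POSIX shell quoting rules, which may differ from the Perl module in the details. The function name tokenize is made up for the example.

```python
import shlex

def tokenize(text):
    """Split text into words, honouring quotes and backslash escapes."""
    # shlex.split defaults to POSIX mode: quotes are stripped, the other
    # quote character is kept literally, and backslash escapes a space.
    return shlex.split(text)

print(tokenize("say \"hello 'there' world\" 'hello \"there\" world'"))
print(tokenize(r"plain\ word"))
```
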