I would like to match a string literal that may contain escaped quotes. For instance, I'd like to search "this is a 'test with escaped\' values' ok" and have the backslash properly recognized as an escape character. I've tried solutions like the following:

import re
regexc = re.compile(r"\'(.*?)(?<!\\)\'")
match = regexc.search(r""" Example: 'Foo \' Bar'  End. """)
print(match.groups())
# I want ("Foo \' Bar") to be printed above

Looking at this, there's a simple problem: the escape character being used, "\", can't itself be escaped. I can't figure out how to handle that. I wanted a solution like the following, but negative lookbehind assertions must be fixed-width:

# ...
re.compile(r"\'(.*?)(?<!\\(\\\\)*)\'")
# ...
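
To make the two failure modes concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not from the question. It assumes Python's standard re module, which rejects a lookbehind whose width can vary, and it shows the first attempt running past a quote that follows an escaped backslash.

```python
import re

# Second attempt from the question: the lookbehind contains a repeated group,
# so its width is not fixed and re.compile is expected to reject it.
VARIABLE_WIDTH = r"\'(.*?)(?<!\\(\\\\)*)\'"

try:
    re.compile(VARIABLE_WIDTH)
    compiled = True
except re.error as exc:
    compiled = False
    print("re.compile failed:", exc)

# First attempt from the question: it compiles, but any quote preceded by a
# backslash is treated as escaped, even when that backslash is itself escaped.
first_attempt = re.compile(r"\'(.*?)(?<!\\)\'")
sample = r"'Foo \\' and 'Bar'"  # illustrative input: body ends in an escaped backslash
overshoot = first_attempt.search(sample).group(1)
print(overshoot)
```

The captured text runs on to the opening quote of the next literal instead of stopping at the quote after the doubled backslash.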

Any regex gurus able to tackle this problem? Thanks.

+1  A: 

If I understand what you're saying (and I'm not sure I do) you want to find the quoted string within your string ignoring escaped quotes. Is that right? If so, try this:

/(?<!\\)'((?:\\'|[^'])*)(?<!\\)'/

Basically:

  • Start with a single quote that isn't preceded by a backslash;
  • Match zero or more occurrences of: backslash then quote or any character other than a quote;
  • End in a quote;
  • Don't capture the inner parentheses (the ?: makes that group non-capturing); and
  • The closing quote can't be preceded by a backslash.
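
For Python readers, the same pattern can be spelled out with re.VERBOSE so each bullet above sits next to the piece of regex it describes. This is only a sketch: the helper name find_quoted and the sample inputs are assumptions for illustration.

```python
import re

# The pattern from this answer, annotated piece by piece.
CLETUS = re.compile(r"""
    (?<!\\)'        # opening quote that is not preceded by a backslash
    (               # group 1: the body of the string
      (?: \\'       #   a backslash followed by a quote
        | [^']      #   or any character other than a quote
      )*
    )
    (?<!\\)'        # closing quote that is not preceded by a backslash
""", re.VERBOSE)


def find_quoted(text):
    """Return the body of the first quoted string in text, or None."""
    match = CLETUS.search(text)
    return match.group(1) if match else None


print(find_quoted(r"this 'is a \' test'"))
print(find_quoted(r"'testing 123\'"))
```

Note that a body ending in an escaped backslash, such as r"'test\\' test", is not matched by this version, because the closing quote is still preceded by a backslash.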

Ok, I've tested this in Java (sorry that's more my schtick than Python but the principle is the same):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wrapper class added so the snippet compiles; the class name is illustrative.
public class QuoteTest {
    private final static String TESTS[] = {
            "'testing 123'",
            "'testing 123\\'",
            "'testing 123",
            "blah 'testing 123",
            "blah 'testing 123'",
            "blah 'testing 123' foo",
            "this 'is a \\' test'",
            "another \\' test 'testing \\' 123' \\' blah"
    };

    public static void main(String args[]) {
        // Unescaped opening quote, captured body, unescaped closing quote.
        Pattern p = Pattern.compile("(?<!\\\\)'((?:\\\\'|[^'])*)(?<!\\\\)'");
        for (String test : TESTS) {
            Matcher m = p.matcher(test);
            if (m.find()) {
                System.out.printf("%s => %s%n", test, m.group(1));
            } else {
                System.out.printf("%s doesn't match%n", test);
            }
        }
    }
}

results:

'testing 123' => testing 123
'testing 123\' doesn't match
'testing 123 doesn't match
blah 'testing 123 doesn't match
blah 'testing 123' => testing 123
blah 'testing 123' foo => testing 123
this 'is a \' test' => is a \' test
another \' test 'testing \' 123' \' blah => testing \' 123

which seems correct.

cletus
I found something close, except I forgot to check against the escaped initial quote... I don't know (?>! though. Did you mean (?<! or is it some construct I don't know?
PhiLho
I did, typo corrected.
cletus
Not bad, but it fails in cases where the first quote is preceded by an escaped backslash.
Evan Fosmark
The last test case has an initial escaped quote. Can you give me an example?
cletus
+2  A: 

I think this will work:

import re
regexc = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")

def check(test, base, target):
    match = regexc.search(base)
    assert match is not None, test+": regex didn't match for "+base
    assert match.group(1) == target, test+": "+target+" not found in "+base
    print("test %s passed" % test)

check("Empty","''","")
check("single escape1", r""" Example: 'Foo \' Bar'  End. """,r"Foo \' Bar")
check("single escape2", r"""'\''""",r"\'")
check("double escape",r""" Example2: 'Foo \\' End. """,r"Foo \\")
check("First quote escaped",r"not matched\''a'","a")
check("First quote escaped beginning",r"\''a'","a")

The regular expression r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'" is forward matching only the things that we want inside the string:

  1. Chars that aren't backslash or quote.
  2. Escaped quote
  3. Escaped backslash
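
Since the body is just an alternation over "ordinary character" and "backslash plus an escapable character", the set of escapable characters can be made a parameter. The following is a sketch only: build_pattern is a hypothetical helper, not part of the original answer, and it uses a non-capturing inner group, so group 1 is still the string body.

```python
import re


def build_pattern(escapable="'\\"):
    """Build a pattern in the style above for a given set of escapable characters.

    By default only the quote and the backslash may follow a backslash,
    which mirrors the three cases in the list.
    """
    char_class = "".join(re.escape(c) for c in escapable)
    return re.compile(
        r"(?:^|[^\\])'"                     # opening quote, not escaped
        r"((?:[^\\']|\\[" + char_class + r"])*)"  # ordinary char or allowed escape
        r"'"                                # closing quote
    )


default = build_pattern()
with_n = build_pattern("'\\n")  # additionally allow \n inside the string

print(default.search(r" Example: 'Foo \' Bar'  End. ").group(1))
print(default.search(r"'a \n b'"))          # \n is not an allowed escape here
print(with_n.search(r"'a \n b'").group(1))
```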

EDIT:

Added an extra regex at the front to check whether the first quote is escaped.

Douglas Leeder
-1 Doesn't work when the first quote encountered is escaped (ie \').
cletus
It only allows quotes and backslashes to be escaped.
MizardX
MizardX, that is all I was looking for. And this pattern appears to be extensible enough if I decide to add more escapable characters.
Evan Fosmark
I think if you start getting into escaping lots of chars, it's time to look at a proper parser. I wouldn't want to maintain a much more complicated regex.
Douglas Leeder
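
As a rough idea of what the non-regex route could look like, here is a minimal hand-written scanner. It is a sketch under assumptions, not a full parser: the function name read_quoted is made up, and it assumes that a backslash always protects the single character after it, both inside and outside the quoted string.

```python
def read_quoted(text, quote="'", escape="\\"):
    """Return the body of the first quoted string in text, or None.

    A backslash skips the character that follows it, so an escaped quote
    never opens or closes a string and an escaped backslash escapes nothing.
    """
    i, n = 0, len(text)
    body_start = None
    while i < n:
        ch = text[i]
        if ch == escape:
            i += 2                      # skip the escape and the char it protects
        elif ch == quote:
            if body_start is None:
                body_start = i + 1      # unescaped opening quote
            else:
                return text[body_start:i]  # unescaped closing quote
            i += 1
        else:
            i += 1
    return None                         # no string, or an unterminated one


print(read_quoted(r" Example: 'Foo \' Bar'  End. "))
print(read_quoted(r"test 'test\\' test"))
```

Adding more escape sequences then means changing what happens in the escape branch rather than growing a regex.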
A: 

Using cletus' expression with Python's re.findall():

re.findall(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'", s)

A test finding several matches in a string:

>>> re.findall(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'",
 r"\''foo bar gazonk' foo 'bar' gazonk 'foo \'bar\' gazonk' 'gazonk bar foo\'")
['foo bar gazonk', 'bar', "foo \\'bar\\' gazonk"]
>>>

Using cletus' TESTS array of strings:

["%s => %s" % (s, re.findall(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'", s)) for s in TESTS]

Works like a charm. (Test it yourself or take my word for it.)

PEZ
+1  A: 

Douglas Leeder's pattern ((?:^|[^\\])'(([^\\']|\\'|\\\\)*)') will fail to match "test 'test \x3F test' test" and "test \\'test' test". (String containing an escape other than quote and backslash, and string preceded by an escaped backslash.)

cletus' pattern ((?<!\\)'((?:\\'|[^'])*)(?<!\\)') will fail to match "test 'test\\' test". (String ending with an escaped backslash.)

My proposal for single-quoted strings is this:

(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'

For either single-quoted or double-quoted strings, you could use this:

(?<!\\)(?:\\\\)*("|')((?:\\.|(?!\1)[^\\])*)\1
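
A sketch of how both patterns might be used from Python. The helper names first_single and first_either and the double-quoted sample are assumptions for illustration. In the two-quote version the body is group 2, because group 1 holds the quote character.

```python
import re

# Single-quoted strings only: the body is group 1.
SINGLE = re.compile(r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'")

# Either quote style: group 1 is the quote character, group 2 is the body.
EITHER = re.compile(r"""(?<!\\)(?:\\\\)*("|')((?:\\.|(?!\1)[^\\])*)\1""")


def first_single(text):
    """Body of the first single-quoted string in text, or None."""
    match = SINGLE.search(text)
    return match.group(1) if match else None


def first_either(text):
    """Body of the first single- or double-quoted string in text, or None."""
    match = EITHER.search(text)
    return match.group(2) if match else None


print(first_single(r"test 'test \x3F test' test"))
print(first_single(r"test \\'test' test"))
print(first_single(r"test 'test\\' test"))
print(first_either(r'''say "it's \"fine\"" now'''))
```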

Test run using Python:

Douglas Leeder's test cases:
"''" matched successfully: ""
" Example: 'Foo \' Bar'  End. " matched successfully: "Foo \' Bar"
"'\''" matched successfully: "\'"
" Example2: 'Foo \\' End. " matched successfully: "Foo \\"
"not matched\''a'" matched successfully: "a"
"\''a'" matched successfully: "a"

cletus' test cases:
"'testing 123'" matched successfully: "testing 123"
"'testing 123\\'" matched successfully: "testing 123\\"
"'testing 123" didn't match, as expected.
"blah 'testing 123" didn't match, as expected.
"blah 'testing 123'" matched successfully: "testing 123"
"blah 'testing 123' foo" matched successfully: "testing 123"
"this 'is a \' test'" matched successfully: "is a \' test"
"another \' test 'testing \' 123' \' blah" matched successfully: "testing \' 123"

MizardX's test cases:
"test 'test \x3F test' test" matched successfully: "test \x3F test"
"test \\'test' test" matched successfully: "test"
"test 'test\\' test" matched successfully: "test\\"
MizardX
Does it still function when I have more than one escaped escape character? For instance, "Example 'foo \\\\' bar'" where it should get foo with two escape chars.
Evan Fosmark
Yes, it works with multiple escape chars, before both the initial quote, and the ending quote.
MizardX
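
A quick way to check the multiple-backslash behaviour is to vary the number of backslashes in front of the second quote. This is a sketch built on the example string from the comment above; body_with_backslashes is a hypothetical helper. With an even count the quote closes the string, and with an odd count the last backslash escapes the quote, so the body runs on to the final quote.

```python
import re

SINGLE = re.compile(r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'")


def body_with_backslashes(n):
    """Match "Example 'foo <n backslashes>' bar'" and return the captured body."""
    text = "Example 'foo " + "\\" * n + "' bar'"
    match = SINGLE.search(text)
    return match.group(1) if match else None


for n in range(6):
    print(n, repr(body_with_backslashes(n)))
```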