ansaurus

Question

How can I capture an escaped " but not an unescaped one?

Answer 1

+5 A:

You're making it way too complicated; there's no rule that says you have to do all your parsing in one monolithic regex. Since your string looks like a comma-delimited sequence, first parse it as such:

my @fields = split /(?<!\\),/, $string;   # use comma as a delimiter (except when escaped)

...And then parse your first field accordingly:

shift @fields unless $fields[0];     # pull off the potentially null first field
$fields[0] =~ s/^"//g;               # remove the leading "
$fields[0] =~ s/(?<!\\)"$//g;        # remove the trailing " that isn't preceded by a \

You could parse all your fields this way by wrapping the above code in a for loop or map().
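As a minimal sketch of that loop, assuming every field may be wrapped in double quotes: the helper name `split_fields` and the sample line are illustrative and not part of the question. The sketch checks for an empty first field explicitly, so a legitimate first field of `0` is not dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: split on unescaped commas, then strip the
# surrounding quotes from every field.
sub split_fields {
    my ($string) = @_;
    my @fields = split /(?<!\\),/, $string;
    shift @fields if @fields && $fields[0] eq '';   # drop an empty leading field
    for my $field (@fields) {                       # $field aliases the array element
        $field =~ s/^"//;                           # leading "
        $field =~ s/(?<!\\)"$//;                    # trailing " not preceded by a backslash
    }
    return @fields;
}

my @fields = split_fields(q{,"\"abc123","","a"});
print join('|', @fields), "\n";                     # prints \"abc123||a
```

The same caveat applies here as to the single-field version: a field ending in an escaped backslash keeps its closing quote.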

Note that this code does not account for occurrences such as \\, (the comma is a valid delimiter here, because the backslash before it is itself escaped, but the look-behind will wrongly treat it as an escaped comma). Therefore, it would be much better to use a proper parser for your format (whatever it is). You may want to take a look at Text::CSV.

Ether 2010-02-01 06:28:24

That will break if one of the fields contains a comma. It may be safe to assume that none does, but it may not be.

eaolson 2010-02-01 06:32:47

@eaolson: the OP doesn't specify whether that would be valid. As I noted, the complications of parsing escape characters necessitate a more sophisticated parser than simple regexes can provide. One would need to start at the beginning of the string and interpret each character individually. Consider a long sequence of backslashes followed by a semicolon - is it escaped or not? One would have to start counting at the beginning of the sequence to know.

Ether 2010-02-01 06:36:03
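The character-by-character scan described in the comment above could look roughly like this. It is only a sketch under stated assumptions: backslash is the only escape character, the helper name `first_quoted_field` is hypothetical, and only the first quoted field is extracted.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the first double-quoted
# field in $string, or nothing if no complete quoted field is found.
sub first_quoted_field {
    my ($string) = @_;
    my ($in_quotes, $escaped, $field) = (0, 0, '');
    for my $char (split //, $string) {
        if (!$in_quotes) {
            $in_quotes = 1 if $char eq '"';   # opening quote
        }
        elsif ($escaped) {
            $field .= $char;                  # escaped character, kept verbatim
            $escaped = 0;
        }
        elsif ($char eq '\\') {
            $field .= $char;                  # start of an escape sequence
            $escaped = 1;
        }
        elsif ($char eq '"') {
            return $field;                    # unescaped quote closes the field
        }
        else {
            $field .= $char;
        }
    }
    return;                                   # input ended before the closing quote
}

# Inside q{}, each \\ stands for one literal backslash, so this field
# ends in two backslashes (one escaped backslash) before the closing quote.
print first_quoted_field(q{"abc123\\\\","","a"}), "\n";
```

Because the escape state is carried forward from the start of the field, a run of backslashes of any length before a quote is handled without counting backwards.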

A CSV parser will take quotes into account.

Tor Valamo 2010-02-01 11:25:27

+1 for the gentle reminder that not all parsing needs to be done in one monolithic regex

mobrule 2010-02-01 14:22:15

when I said "semicolon" in my comment above I meant "double quote" :)

Ether 2010-02-01 15:59:03

Answer 2

A:

Your problem calls for the infamous zero-width negative look-behind assertion

...which lets you match a foo that doesn't follow a bar.

The doc is here: http://perldoc.perl.org/perlre.html#Extended-Patterns

and you want something like this in your regexp:

"(.+?)(?<!\\)"

This matches a double quote, then as few characters as possible (at least one), then another double quote that is not preceded by a backslash (the backslash has to be escaped by doubling it in the pattern). The first set of parentheses captures the text as you intend; the second set is the look-behind assertion and does not capture.

Edit: Meanwhile tested using http://www.internetofficer.com/seo-tool/regex-tester/ and it seems to work fine.

Edit: As outis points out, this expression will not correctly match a PORTION in which the final character before the closing quote is an escaped backslash. If you don't anticipate backslashes in your text you should be fine though.
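A short sketch of both behaviours, using sample lines modelled on the ones elsewhere in this thread (the variable names are illustrative):

```perl
use strict;
use warnings;

my $pattern = qr/"(.+?)(?<!\\)"/;

# An escaped quote inside the field is skipped over, as intended.
my ($ok) = q{"abc\"123","","a"} =~ $pattern;
print "$ok\n";     # prints abc\"123

# A field ending in an escaped backslash (inside q{}, \\ is one backslash):
# the real closing quote is preceded by a backslash, so the look-behind
# rejects it and the match runs on to the next acceptable quote.
my ($bad) = q{"abc123\\\\","","a"} =~ $pattern;
print "$bad\n";    # prints abc123\\",
```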

Carl Smotricz 2010-02-01 06:41:56

Answer 3

+1 A:

Don't forget to allow for escaped backslashes along with escaped quotes. Using REs to match balanced anything gets ugly fast:

/(?<=")((?:[^"\\]+|\\+[^"\\]|(?:\\\\)+|(?<!\\)\\(?:\\\\)*")*)(?=")/

Do yourself a favor and use a parser, as Ether suggests.

outis 2010-02-01 06:46:22

Escaped backslashes aren't that hard to deal with: `(?>[^"\\]+|\\.)*`. You could also write it as `(?>[^"\\]+|\\["\\])*`, but the dot doesn't cause any problems, and it takes care of any other backslash-escape sequence you might want to support.

Alan Moore 2010-02-01 12:23:28
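For illustration, the pattern from the comment above can be wrapped in quotes and a capture group and tried on two sample lines. The surrounding `"(` and `)"` and the variable names are additions made for this sketch, not part of the comment.

```perl
use strict;
use warnings;

# Either a run of ordinary characters, or a backslash plus whatever follows it.
my $quoted = qr/"((?>[^"\\]+|\\.)*)"/;

# Inside q{}, each \\ stands for one literal backslash.
for my $line (q{"abc\"123","","a"}, q{"abc123\\\\","","a"}) {
    my ($field) = $line =~ $quoted;
    print "$field\n";
}
# prints:
# abc\"123
# abc123\\
```

Because every backslash is consumed together with the character after it, the pattern never has to look backwards to decide whether a quote is escaped.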

Answer 4

A:

If your data is comma-delimited and does not have embedded commas, just split on "," and pick out the appropriate field:

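while (<>) {
    chomp;
    my @s = split /,/;          # naive split: assumes no embedded commas
    if ($s[0] eq "") {          # line starts with a comma, so the first field is empty
        print "$s[1]\n";
    } else {
        print "$s[0]\n";
    }
}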

output

$ perl perl.pl file
"\"abc123"
"abc123\" "
"\"abc123\""
"abc\"123\""
"abc123"

ghostdog74 2010-02-01 07:20:18

Answer 5

A:

If you need to consider escaped backslashes as mentioned by outis, you can use this:

m/"((\\\\|\\"|[^"])+)"/

(It seems I cannot leave a comment on outis's answer, but outis's solution does not work with this input:

"abc\\\"123"

will produce

abc\\\

)

Input:

,"\"abc123","","a",["some_string"]
,"abc123\" ","","a",["some_string"]
"\"abc123\"","","a",["some_string"]
"abc\"123\"","","a",["some_string"]
"abc123","","a",["some_string"]
"ab\\c123","","a",["some_string"]
"abc123\\","","a",["some_string"]
"abc123\\\"","","a",["some_string"]
"abc\\\"123\"","","a",["some_string"]
"abc123\\\\\"","","a",["some_string"]

Output:

\"abc123
abc123\" 
\"abc123\"
abc\"123\"
abc123
ab\\c123
abc123\\
abc123\\\"
abc\\\"123\"
abc123\\\\\"

xiechao 2010-02-01 11:23:16

Answer 6

+3 A:

Just use Text::CSV

Christoffer Hammarström 2010-02-01 11:40:14
