Suppose the portion that needs to be captured by regex is indicated by PORTION in the following string

,"PORTION","","a",["some_string"]  

Examples of PORTION are

  • \"abc123
  • abc123\"
  • \"abc123\"
  • abc\"123\"
  • abc123

so the strings actually look like

  • ,"\"abc123","","a",["some_string"]
  • ,"abc123\" ","","a",["some_string"]
  • "\"abc123\"","","a",["some_string"]
  • "abc\"123\"","","a",["some_string"]
  • "abc123","","a",["some_string"]

PORTION is surrounded by double quotes. Double quotes inside PORTION are escaped by backslash. My current pattern is

my $pattern = '(.?([\\"]|[^"][^,][^"])*)';

which produces the results for the above examples as follows

  • \"abc123","","a"
  • abc123
  • \"abc12
  • abc\"123\""
  • abc123"

The pattern is meant to match everything up to the "," sequence while still allowing \" to be captured, but it is not working as intended. How can I make it work?

+5  A: 

You're making it way too complicated; there's no rule that says you have to do all your parsing in one monolithic regex. Since your string looks like a comma-delimited sequence, first parse it as such:

my @fields = split /(?<!\\),/, $string;   # use comma as a delimiter (except when escaped)

...And then parse your first field accordingly:

shift @fields unless $fields[0];     # pull off the potentially null first field
$fields[0] =~ s/^"//g;               # remove the leading "
$fields[0] =~ s/(?<!\\)"$//g;        # remove the trailing " that isn't preceded by a \

You could parse all your fields this way by wrapping the above code in a for loop or map().
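As a minimal sketch of that idea, the split and the two substitutions can be wrapped in a map() so every field is cleaned the same way. The helper name parse_fields is made up for illustration, and the sketch still assumes no field contains an unescaped comma:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): split a line on
# unescaped commas, then strip the surrounding quotes from every field.
sub parse_fields {
    my ($string) = @_;
    my @fields = split /(?<!\\),/, $string;
    shift @fields if @fields && $fields[0] eq '';   # drop the empty leading field
    return map {
        my $f = $_;
        $f =~ s/^"//;            # remove the leading "
        $f =~ s/(?<!\\)"$//;     # remove the trailing " unless preceded by a \
        $f;
    } @fields;
}

my @parsed = parse_fields(',"\"abc123","","a",["some_string"]');
print "$parsed[0]\n";    # prints \"abc123
```

Fields that are not quoted, such as ["some_string"], pass through unchanged.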

Note that this code does not account for cases such as \\, (the comma is a valid delimiter here because the backslash itself is escaped, but the look-behind will wrongly treat the comma as escaped). Therefore, it would be much better to use a proper parser for your format (whatever it is). You may want to take a look at Text::CSV.

Ether
That will break if one of the fields contains a comma. It may be safe to assume that's true, but also may not.
eaolson
@eaolson: the OP doesn't specify whether that would be valid. As I noted, the complications of parsing escape characters necessitate a more sophisticated parser than simple regexes can provide. One would need to start at the beginning of the string and interpret each character individually. Consider a long sequence of backslashes followed by a semicolon - is it escaped or not? One would have to start counting at the beginning of the sequence to know.
Ether
Using a CSV parser will take into account quotes.
Tor Valamo
+1 for the gentle reminder that not all parsing needs to be done in one monolithic regex
mobrule
when I said "semicolon" in my comment above I meant "double quote" :)
Ether
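The character-by-character approach described in the comments above could be sketched roughly as follows. This is only an illustration: the helper name first_quoted_field is invented here, and it assumes a backslash escapes whatever single character follows it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the first quoted field,
# tracking the escape state one character at a time.
sub first_quoted_field {
    my ($string) = @_;
    my $start = index($string, '"');
    return if $start < 0;                        # no opening quote at all

    my ($field, $escaped) = ('', 0);
    for my $ch (split //, substr($string, $start + 1)) {
        if    ($escaped)      { $field .= $ch; $escaped = 0 }  # keep the escaped char verbatim
        elsif ($ch eq '\\')   { $field .= $ch; $escaped = 1 }  # the next char is escaped
        elsif ($ch eq '"')    { return $field }                # unescaped quote ends the field
        else                  { $field .= $ch }
    }
    return;                                      # no closing quote found
}

print first_quoted_field('"abc123\\\\\"","","a",["some_string"]'), "\n";
```

Because the scanner toggles its state on every backslash, a long run of backslashes before a quote is handled without any counting.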
A: 

Your problem calls for the infamous zero-width negative look-behind assertion

...which lets you match a foo that doesn't follow a bar.

The doc is here: http://perldoc.perl.org/perlre.html#Extended-Patterns

and you want something like this in your regexp:

"(.+?)(?<!\\)"

That matches a double quote, then as few characters as possible (at least one), then another double quote that is not preceded by a backslash (the backslash is doubled in the pattern to escape it). The first set of parentheses captures what you intend; the second is the look-behind assertion and does not capture.

Edit: Meanwhile tested using http://www.internetofficer.com/seo-tool/regex-tester/ and it seems to work fine.

Edit: As outis points out, this expression will not correctly match a PORTION in which the final character before the closing quote is an escaped backslash. If you don't anticipate backslashes in your text you should be fine though.
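A short sketch of how the pattern behaves on the question's samples, including the escaped-backslash case from the last edit. The variable names and the third sample line are only for illustration:

```perl
use strict;
use warnings;

my $re = qr/"(.+?)(?<!\\)"/;

my @samples = (
    ',"\"abc123","","a",["some_string"]',      # captures \"abc123
    '"abc\"123\"","","a",["some_string"]',     # captures abc\"123\"
    '"abc123\\\\","","a",["some_string"]',     # field ends in an escaped backslash
);

my @captures;
for my $s (@samples) {
    push @captures, $s =~ $re ? $1 : undef;
}
print "$_\n" for @captures;
```

For the third sample the real closing quote is rejected because a backslash precedes it, so the match runs on to the next quote and the capture becomes abc123\\", instead of abc123\\.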

Carl Smotricz
+1  A: 

Don't forget to allow for escaped backslashes along with escaped quotes. Using REs to match balanced anything gets ugly fast:

/(?<=")((?:[^"\\]+|\\+[^"\\]|(?:\\\\)+|(?<!\\)\\(?:\\\\)*")*)(?=")/

Do yourself a favor and use a parser, as Ether suggests.

outis
Escaped backslashes aren't that hard to deal with: `(?>[^"\\]+|\\.)*`. You could also write it as `(?>[^"\\]+|\\["\\])*`, but the dot doesn't cause any problems, and it takes care of any other backslash-escape sequence you might want to support.
Alan Moore
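As a rough sketch, the pattern from the comment above can be wrapped in a pair of quotes and a capture group and tried against a few sample lines. The surrounding quotes, the group, and the variable names are additions for illustration:

```perl
use strict;
use warnings;

# Opening quote, then runs of ordinary characters or backslash-escaped
# pairs, then the closing quote. Group 1 holds PORTION.
my $re = qr/"((?>[^"\\]+|\\.)*)"/;

my @samples = (
    ',"\"abc123","","a",["some_string"]',       # captures \"abc123
    '"abc123\\\\","","a",["some_string"]',      # captures abc123\\
    '"abc\\\\\"123","","a",["some_string"]',    # captures abc\\\"123
);

my @captures;
for my $s (@samples) {
    push @captures, $s =~ $re ? $1 : undef;
}
print "$_\n" for @captures;
```

Each backslash is consumed together with the character after it, so an escaped backslash directly before the closing quote no longer confuses the match.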
A: 

If your data is comma-delimited and does not have embedded commas, just split on "," and take the appropriate field:

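while (<>) {
    chomp;
    my @s = split /,/;          # naive split: assumes no commas inside fields
    if ($s[0] eq "") {          # line started with a comma, so PORTION is the second field
        print "$s[1]\n";
    } else {
        print "$s[0]\n";
    }
}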

output

$ perl perl.pl file
"\"abc123"
"abc123\" "
"\"abc123\""
"abc\"123\""
"abc123"
ghostdog74
A: 

If you need to consider escaped backslashes as mentioned by outis, you can use this:

m/"((\\\\|\\"|[^"])+)"/

(It seems I cannot leave a comment on outis' answer, but outis' solution does not work with this input:

"abc\\\"123"

will produce

abc\\\

)

Input:

,"\"abc123","","a",["some_string"]
,"abc123\" ","","a",["some_string"]
"\"abc123\"","","a",["some_string"]
"abc\"123\"","","a",["some_string"]
"abc123","","a",["some_string"]
"ab\\c123","","a",["some_string"]
"abc123\\","","a",["some_string"]
"abc123\\\"","","a",["some_string"]
"abc\\\"123\"","","a",["some_string"]
"abc123\\\\\"","","a",["some_string"]

Output:

\"abc123
abc123\" 
\"abc123\"
abc\"123\"
abc123
ab\\c123
abc123\\
abc123\\\"
abc\\\"123\"
abc123\\\\\"
xiechao
+3  A: 

Just use Text::CSV

Christoffer Hammarström