Given a string of pipe-separated values (call it $psv), I want to split on those pipes and populate an array. However, the string can also contain escaped pipes (\|) and escaped escapes (\\), both of which are to be treated as mere literals. I have a couple of solutions for this problem in mind:

  • Replace both escape sequences with placeholder strings not otherwise found in $psv, call split(/\|/, $psv), then restore the original characters in each element
  • Loop through $psv, character-by-character

And I think both of those would work. But for a maximum dopamine flood, I'd like to just do this with a single split() call and nothing else. So is there a regular expression for this?

+3  A: 

You don't need to use split for this task. An alternative is:

my $psv = "aaa|bbb||ccc|\\|\\|\\||\\\\\\\\\\\\";
print "$psv\n";

my @words = map { s/\\([\\|])/$1/g; $_; } ($psv =~ /(?:^|\|) ((?:\\[\\|] | [^|])*)/gx);
printf("%s\n", join(", ", @words));

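The regular expression may look scary, but it is easily explained. It matches each of the words separated by the pipes. Each match starts either at the beginning of the string or at a pipe separator. After that comes any number of items, each of which is either an escape sequence (a backslash followed by a backslash or a pipe) or any single character other than a pipe. The /x flag makes the whitespace inside the pattern insignificant, so the spaces are there only for readability.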

The regular expression inside the map just replaces the escape sequences with what they really mean.
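
To make the two steps easier to see, here is a minimal sketch that separates the extraction from the unescaping. The sub name split_psv and the sample input are made up for illustration; the patterns are the same ones used above, just without the /x whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch only).
sub split_psv {
    my ($psv) = @_;

    # Step 1: capture each raw field, escape sequences still in place.
    my @raw = $psv =~ /(?:^|\|)((?:\\[\\|]|[^|])*)/g;

    # Step 2: unescape a copy of each field (\| becomes |, \\ becomes \).
    return map { (my $f = $_) =~ s/\\([\\|])/$1/g; $f } @raw;
}

# Single-quoted, so the string is literally: a\|b|c\\d||e
my @fields = split_psv('a\|b|c\\\\d||e');
print join(', ', map { "[$_]" } @fields), "\n";
```

Copying each field before substituting avoids modifying the match results in place, which is a matter of taste rather than a correction.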

Roland Illig
I didn't see your solution, but we have similar regexes. Some credit to ikegami @ perlmonks.org for regex debugging.
vol7ron
+2  A: 

If Perl supported variable-width look-behind assertions, you might be able to do it with something like this:

split(/(?<!(?<!\\)(?:\\\\)*\\)\|/, $psv);

That should match a pipe character which is not preceded by (an odd number of backslashes not preceded by a backslash). But only fixed-width look-behind assertions are allowed, so that's not an option. It's possible that some regex guru could come up with something that would actually work for you, but personally I'd say a finite state machine (looping through $psv a character at a time) might be a better option.
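
A rough sketch of the character-at-a-time approach might look like the following. The sub name split_psv_fsm is invented here, and two behaviours are assumptions of this sketch rather than anything stated in the question: it unescapes \| and \\ while scanning, and it leaves a backslash in front of any other character untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: a two-state scanner (normal / just saw a backslash).
sub split_psv_fsm {
    my ($psv) = @_;
    my @fields  = ('');
    my $escaped = 0;

    for my $ch (split //, $psv) {
        if ($escaped) {
            # Assumption: only \| and \\ are escapes; anything else keeps its backslash.
            $fields[-1] .= ($ch eq '|' || $ch eq '\\') ? $ch : "\\$ch";
            $escaped = 0;
        }
        elsif ($ch eq '\\') { $escaped = 1 }            # start of an escape sequence
        elsif ($ch eq '|')  { push @fields, '' }        # unescaped pipe: start a new field
        else                { $fields[-1] .= $ch }      # ordinary character
    }

    # Assumption: a lone trailing backslash is kept literally.
    $fields[-1] .= '\\' if $escaped;
    return @fields;
}

# Single-quoted, so the string is literally: abc|d\|e|f\\|g
print join(', ', map { "[$_]" } split_psv_fsm('abc|d\|e|f\\\\|g')), "\n";
```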

Something else I suppose you could try is to just split the string on the pipe character, and then check each element of the resulting list to see if it ends with an odd number of backslashes. If it does, join it back to the next element of the list with | between them. Basically you'd be doing the split ignoring the escape sequences, then going back and accounting for the escapes afterwards.
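
The split-then-rejoin idea could be sketched roughly like this. The sub name split_and_rejoin is hypothetical, and the sketch leaves the escape sequences in the returned fields (it does not unescape them). The -1 limit is there because split otherwise drops trailing empty fields.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch only).
sub split_and_rejoin {
    my ($psv) = @_;

    # Naive split on every pipe; -1 keeps trailing empty fields.
    my @pieces = split /\|/, $psv, -1;

    my @fields;
    while (@pieces) {
        my $field = shift @pieces;

        # An odd number of trailing backslashes means the pipe that ended
        # this piece was escaped, so glue the next piece back on.
        while (@pieces && $field =~ /(\\+)\z/ && length($1) % 2) {
            $field .= '|' . shift @pieces;
        }
        push @fields, $field;
    }
    return @fields;
}

# Single-quoted, so the string is literally: a\|b|c\\|d||
print join(', ', map { "[$_]" } split_and_rejoin('a\|b|c\\\\|d||')), "\n";
```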

David Zaslavsky
I suppose "no, but here is what such a regex would look like given one additional feature" will have to do.
BipedalShark
+4  A: 

Is there a specific reason you require a pure regex solution (unless, of course, this question is more of a mental challenge than a practical problem)?

A proper way to handle X-separated data in real code is by using a proper parser - a very common one is Text::CSV_XS (don't let the name fool you - it can handle any separator characters, not just commas). It will handle escapes correctly, along with quoting.

DVK
+1  A: 

Sweeter Solution

This method does not use split, but does use a simple regex.


#!/usr/bin/perl -w

use strict;

   sub main{
      (my $psv = <DATA>) =~ s/\s+$//s;

      my @arr = $psv =~ /(?:^|\G\|)((?:[^\\|]|\\.)*)/sg;

      {
         local $" = ', ';      # $" is the separator used when an array is interpolated in a double-quoted string
         print "@arr \n";      # outputs: abc, def, g\|i, jkl, m\|o, pqr, s\\u, v\w, x\\, , z 
      }

   }

   main();


__DATA__
abc|def|g\|i|jkl|m\|o|pqr|s\\u|v\w|x\\||z
vol7ron