I have the following string:

StartProgram    1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30

I need a regular expression to split this line but ignore spaces in double quotes in Perl.

The following is what I tried but it does not work.

(".*?"|\S+)
+3  A: 

Update: It looks like the fields are actually tab separated, not space. If that is guaranteed, just split on \t.
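
If the fields really are tab separated, a minimal sketch of that split could look like the following. The sample line is rebuilt with join so the tabs are explicit (an assumption about the real file), and the sub name split_tabbed is only illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record on literal tab characters.
sub split_tabbed {
    my ($line) = @_;
    return split /\t/, $line;
}

# Build the sample line with explicit tabs (an assumption about the real file).
my $line = join "\t",
    'StartProgram', '1', '""C:\Program Files\ABC\ABC XYZ""',
    'CleanProgramTimeout', '1', '30';

# Spaces inside the path are untouched because only tabs separate fields.
print "<$_>\n" for split_tabbed($line);
```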

First, let's see why (".*?"|\S+) "does not work". Look at the ".*?" alternative: it matches zero or more characters enclosed in double quotes. The field that is giving you problems is ""C:\Program Files\ABC\ABC XYZ"". Note that a "" on its own matches ".*?", because "" consists of zero characters surrounded by double quotes. The opening "" of that field is therefore consumed as an empty quoted string, and the rest of the path falls through to \S+, which breaks at every space.
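
To see this concretely, the following sketch runs the pattern from the question against the sample line and prints each match. The sub name naive_fields is only illustrative.

```perl
use strict;
use warnings;

# Apply the pattern from the question and return every match.
sub naive_fields {
    my ($line) = @_;
    return $line =~ /(".*?"|\S+)/g;
}

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# The path comes back in pieces: <"">, <C:\Program>, <Files\ABC\ABC>, <XYZ"">
print "<$_>\n" for naive_fields($s);
```

The closing "" never gets a chance to match as a quoted string, because \S+ has already swallowed it together with XYZ.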

It is better to match as specifically as possible rather than splitting. So, if you have a configuration file with directives and a fixed format, form a regular expression match that is as close to the format you are trying to match as possible.

Move the quotation marks outside of the capturing parentheses if you don't want them.

#!/usr/bin/perl

use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

my @parts = $s =~ m{\A(\w+) ([0-9]) (""[^"]+"") (\w+) ([0-9]) ([0-9]{2})};

use Data::Dumper;
print Dumper \@parts;

Output:

$VAR1 = [
          'StartProgram',
          '1',
          '""C:\\Program Files\\ABC\\ABC XYZ""',
          'CleanProgramTimeout',
          '1',
          '30'
        ];
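
As a sketch of the remark about moving the quotation marks outside the capturing parentheses, the variant below captures only the path itself. It assumes the same single-space layout as the example above; the sub name parse_fixed is only illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: match the fixed format and return the six fields.
sub parse_fixed {
    my ($line) = @_;
    # The "" delimiters sit outside the third capture group.
    return $line =~ m{\A(\w+) ([0-9]) ""([^"]+)"" (\w+) ([0-9]) ([0-9]{2})};
}

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my @parts = parse_fixed($s);

# The path is now captured without the surrounding "".
print "path: $parts[2]\n";
```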

In that vein, here is a more involved script:

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

my @strings = split /\n/, <<'EO_TEXT';
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30
EO_TEXT

my $re = qr{
    (?<directive>StartProgram)\s+
    (?<instance>[0-9][0-9]?)\s+
    (?<path>"".+?""|\S+)\s+
    (?<timeout_directive>CleanProgramTimeout)\s+
    (?<timeout_instance>[0-9][0-9]?)\s+(?<timeout_seconds>[0-9]{2})
}x;

for (@strings) {
    if ( $_ =~ $re ) {
        print Dumper \%+;
    }
}

Output:

$VAR1 = {
          'timeout_directive' => 'CleanProgramTimeout',
          'timeout_seconds' => '30',
          'path' => '""C:\\Program Files\\ABC\\ABC XYZ""',
          'directive' => 'StartProgram',
          'timeout_instance' => '1',
          'instance' => '1'
        };
$VAR1 = {
          'timeout_directive' => 'CleanProgramTimeout',
          'timeout_seconds' => '30',
          'path' => 'c:\\opt\\perl',
          'directive' => 'StartProgram',
          'timeout_instance' => '1',
          'instance' => '1'
        };

Update: I cannot get Text::Balanced or Text::ParseWords to parse this correctly. I suspect the problem is the repeated quotation marks that delineate the substring that should not be split. The following code is my best (not very good) attempt at solving the generic problem by using split and then selective re-gathering of parts of the string.

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};

print Dumper parse_line($s);
print Dumper parse_line($t);

sub parse_line {
    my ($line) = @_;
    my @parts = split /(\s+)/, $line;
    my @real_parts;

    for (my $i = 0; $i < @parts; $i += 1) {
        unless ( $parts[$i] =~ /^""/ ) {
            push @real_parts, $parts[$i] if $parts[$i] =~ /\S/;
            next;
        }
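        my $part = '';
        # Re-join the pieces (including the captured whitespace) until the
        # closing "" is seen. Stop at the end of the input so that an
        # unterminated quote cannot loop forever.
        do {
            $part .= $parts[$i++];
        } until ( $part =~ /""$/ or $i >= @parts );
        push @real_parts, $part;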
    }
    return \@real_parts;
}
Sinan Ünür
Maybe the question isn't clear but your answer seems to be different from what was asked. I thought he wanted a way to find a regular expression which would split any line using spaces, but ignoring spaces between quotes. Your answer is a regex to parse one particular format.
Kinopiko
@Kinopiko - His answer is also "This way to do it is better and less buggy than trying to split on questionable delimiters. Consider trying it instead of how you're currently doing it, since it achieves more or less the same result."
Chris Lutz
The thing is that the question isn't necessarily about a questionable delimiter. Being able to parse an arbitrary line by spaces while ignoring spaces in a quoted string is useful, and this answer completely ignores the question, saying "You should parse by tabs instead". While that's useful in this specific case, it doesn't answer how to split a generic string by spaces while ignoring spaces within quoted strings.
Oesor
@Oesor: I was not able to come up with a satisfying, working way of dealing with the general problem. Is that not clear from my comment to Colin Fine's answer (which I upvoted)? Please post a better way of solving the OP's problem, and I will upvote it.
Sinan Ünür
A: 
 my $x = 'StartProgram 1    ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30';

 my @parts = $x =~ /("".*?""|[^\s]+?(?>\s|$))/g;
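
A slightly simpler variant, sketched here as an assumption rather than as part of the answer above, uses \S+ for the unquoted fields so that no trailing whitespace ends up in the captures. The sub name split_fields is only illustrative.

```perl
use strict;
use warnings;

# Either a field delimited by doubled quotes, or a run of non-space characters.
sub split_fields {
    my ($line) = @_;
    return $line =~ /("".*?""|\S+)/g;
}

my $x = 'StartProgram 1    ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30';

# Prints six fields; the quoted path stays in one piece.
print "<$_>\n" for split_fields($x);
```

This still assumes that a quoted field starts and ends with "" and contains no "" of its own.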
Kinopiko
`[^\s]+?(?>\s|$)` can be simplified to `\S+\b`
John Kugelman
Bzzt! You are right about \S but \b is not the same as (?>\s|$).
Kinopiko
I copied parts of Sinan Unur's answer to demonstrate a different way of doing it with a regex which doesn't depend on the exact format. I've also left a comment on his answer explaining that. Your answer was almost identical to mine, down to the form of the regex and the variable names, and it also contained the correction from John Kugelman. I don't see why you want to duplicate my answer like that.
Kinopiko
@Kinopiko Arguing over variable names now? My post uses `@parts`. Your post uses `@parts`. @FM's post used `@parts`. The only original part of your answer was the regex pattern. @FM edited the pattern and therefore posted an original answer. Relax.
Sinan Ünür
+6  A: 

Once upon a time I also tried to re-invent the wheel, and solve this myself.

Now I just use Text::ParseWords and let it do the job for me.
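
A minimal sketch of that approach, assuming whitespace-separated fields: quotewords() from Text::ParseWords takes a delimiter pattern, a keep flag, and the lines to split. With the keep flag false, quotes and backslashes are removed from the fields, so the second call below sets it to true to preserve the Windows path. Collapsing the doubled "" to a single " before parsing is an assumption made here to fit the line from the question; the module does not do that itself. The sub name split_doubled is only illustrative.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Ordinary case: regular double quotes; keep = 0 strips the quotes.
my @simple = quotewords('\s+', 0,
    'oldstruct/testsource/demo demo-1_025 -dummy "value with spaces"');
print "<$_>\n" for @simple;

# Hypothetical helper for the line in the question: collapse "" to " first
# (an assumption), then keep quotes and backslashes with keep = 1.
sub split_doubled {
    my ($line) = @_;
    (my $normalized = $line) =~ s/""/"/g;
    return quotewords('\s+', 1, $normalized);
}

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
print "<$_>\n" for split_doubled($s);
```

With keep set to true, the path field comes back still wrapped in one pair of double quotes, which can be stripped afterwards if needed.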

Colin Fine
A working example would be great because I have not had success getting 6 fields using `Text::Balanced` and `Text::ParseWords`. `quotewords('"', 1, $_)` gives me `'StartProgram 1 '`, `'"C:\\Program Files\\ABC\\ABC XYZ"'`, `'CleanProgramTimeout 1 30'`
Sinan Ünür
And `quotewords('\s+', 1, $_)` splits the filename along spaces and gives eight fields.
Sinan Ünür
From reading the documentation, all you have to do is substitute single quotes with '\"' and double quotes with '"' and quotewords() should work fine.
Oesor
Sorry, to make that more readable: From reading the documentation, all you have to do is substitute single quotes with `'\"'` and double quotes with `'"'` and quotewords() should work fine.
Oesor
@Oesor and @Colin Fine: Could you please post a working example?
Sinan Ünür
Any time the requirement says "except if it's [before, inside, after] ...", then it's a job for a parser. You can do just about this with regular expressions, but it would break as soon as something like `\"` should not end the quoted string.
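
To illustrate the point about escapes, here is a minimal hand-written scanner. It is sketched under the assumption that fields are separated by whitespace, quoted with regular double quotes, and that \" inside quotes means a literal quote. The sub name tokenize is hypothetical; this is not taken from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace, keep "..." together, and treat
# \" inside quotes as a literal quote. Other backslashes are left alone.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    my $current    = '';
    my $in_quotes  = 0;
    my $have_token = 0;    # true once the current token has started

    my @chars = split //, $line;
    for ( my $i = 0; $i < @chars; $i++ ) {
        my $c = $chars[$i];
        if ($in_quotes) {
            if ( $c eq '\\' && $i + 1 < @chars && $chars[ $i + 1 ] eq '"' ) {
                $current .= '"';    # escaped quote stays inside the field
                $i++;
            }
            elsif ( $c eq '"' ) { $in_quotes = 0; }
            else                { $current .= $c; }
        }
        elsif ( $c eq '"' ) { $in_quotes = 1; $have_token = 1; }
        elsif ( $c =~ /\s/ ) {
            if ($have_token) {
                push @tokens, $current;
                $current    = '';
                $have_token = 0;
            }
        }
        else { $current .= $c; $have_token = 1; }
    }
    push @tokens, $current if $have_token;
    return @tokens;
}

print "<$_>\n" for tokenize('run "say \"hi\" now" 30');
```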
James Anderson
A working example. Line cut from a real current module:

my ($path, @items) = Text::ParseWords::parse_line('\s+', 0, $line);

Operation (from the debugger). Before:

DB<3> x $line
0  'oldstruct/testsource/demo demo-1_025 -dummy "value with spaces"'

After:

DB<6> x $path, @items
0  'oldstruct/testsource/demo'
1  'demo-1_025'
2  '-dummy'
3  'value with spaces'
Colin Fine
A: 
my $str = 'StartProgram    1    ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout    1       30';

print "str:$str\n";

my @A = $str =~ /(".+"|\S+)/g;

foreach my $l (@A) {
        print "<$l>\n";
}

That gives me:

$ ./test.pl 
str:StartProgram    1    ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout    1       30
<StartProgram>
<1>
<""C:\Program Files\ABC\ABC XYZ"">
<CleanProgramTimeout>
<1>
<30>
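
One caveat worth illustrating, as an observation rather than part of the answer above: the greedy ".+" works here because the line has a single quoted field. With two quoted fields it swallows everything between the first and the last quote. The sketch below compares it with a bounded alternative for regular (not doubled) double quotes; the sample line and sub names are made up for illustration.

```perl
use strict;
use warnings;

# Greedy: ".+" runs from the first quote to the last quote on the line.
sub greedy_fields {
    my ($line) = @_;
    return $line =~ /(".+"|\S+)/g;
}

# Bounded: "[^"]*" stops at the next quote.
sub bounded_fields {
    my ($line) = @_;
    return $line =~ /("[^"]*"|\S+)/g;
}

my $line = 'a "b c" d "e f" g';
print "greedy:  <$_>\n" for greedy_fields($line);
print "bounded: <$_>\n" for bounded_fields($line);
```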
khearn