Update: It looks like the fields are actually tab-separated, not space-separated. If that is guaranteed, just split on \t.
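A minimal sketch of that approach, assuming every line really is tab-delimited. The sample line is a hypothetical tab-separated version of the input, built with join so the tabs are visible in the source:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: the same fields as in the question, joined by tabs.
my $line = join "\t",
    'StartProgram', '1', q{""C:\Program Files\ABC\ABC XYZ""},
    'CleanProgramTimeout', '1', '30';

# Splitting on a literal tab leaves the spaces inside the path alone.
my @fields = split /\t/, $line;

print "$_\n" for @fields;
```

The path comes back as a single field, doubled quotes included, because the spaces inside it are never treated as separators.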
First, let's see why (".*?"|\S+) "does not work". Specifically, look at ".*?". That means zero or more characters enclosed in double quotes. Well, the field that is giving you problems is ""C:\Program Files\ABC\ABC XYZ"". Note that each "" at the beginning and end of that field will match ".*?" because "" consists of zero characters surrounded by double quotes.
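To see the effect concretely, here is a small sketch that applies the pattern to the problem line. It assumes the pattern is used in a global match to collect tokens, which may not be exactly how the original code used it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# Assumption: the pattern is applied repeatedly with /g to collect tokens.
my @tokens = $s =~ /(".*?"|\S+)/g;

# The third token is the empty quoted string "", and the rest of the
# path is broken into pieces at its embedded spaces.
print "[$_]\n" for @tokens;
```

The opening "" is consumed as a complete (empty) quoted string, so the rest of the path falls through to \S+ and is cut at each space.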
It is better to match as specifically as possible rather than splitting. So, if you have a configuration file with directives and a fixed format, write a regular expression that is as close as possible to the format you are trying to match.
Move the quotation marks outside of the capturing parentheses if you don't want them.
#!/usr/bin/perl
use strict;
use warnings;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my @parts = $s =~ m{\A(\w+) ([0-9]) (""[^"]+"") (\w+) ([0-9]) ([0-9]{2})};
use Data::Dumper;
print Dumper \@parts;
Output:
$VAR1 = [
'StartProgram',
'1',
'""C:\\Program Files\\ABC\\ABC XYZ""',
'CleanProgramTimeout',
'1',
'30'
];
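If the quotation marks are not wanted in the result, they can sit outside the capturing parentheses, as mentioned above. A minimal variation on the same script, under the same assumptions about the line format:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# Same pattern, but the doubled quotes are matched outside the third group,
# so only the path itself is captured.
my @parts = $s =~ m{\A(\w+) ([0-9]) ""([^"]+)"" (\w+) ([0-9]) ([0-9]{2})};

print "$parts[2]\n";
```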
In that vein, here is a more involved script:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @strings = split /\n/, <<'EO_TEXT';
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30
EO_TEXT
my $re = qr{
    (?<directive>StartProgram)\s+
    (?<instance>[0-9][0-9]?)\s+
    (?<path>"".+?""|\S+)\s+
    (?<timeout_directive>CleanProgramTimeout)\s+
    (?<timeout_instance>[0-9][0-9]?)\s+(?<timeout_seconds>[0-9]{2})
}x;
for (@strings) {
    if ( $_ =~ $re ) {
        print Dumper \%+;
    }
}
Output:
$VAR1 = {
'timeout_directive' => 'CleanProgramTimeout',
'timeout_seconds' => '30',
'path' => '""C:\\Program Files\\ABC\\ABC XYZ""',
'directive' => 'StartProgram',
'timeout_instance' => '1',
'instance' => '1'
};
$VAR1 = {
'timeout_directive' => 'CleanProgramTimeout',
'timeout_seconds' => '30',
'path' => 'c:\\opt\\perl',
'directive' => 'StartProgram',
'timeout_instance' => '1',
'instance' => '1'
};
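In practice you would probably use the individual fields rather than dump them. The following is a sketch of one way to do that with the same pattern; the %field copy and the quote-stripping substitution are my own additions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $re = qr{
    (?<directive>StartProgram)\s+
    (?<instance>[0-9][0-9]?)\s+
    (?<path>"".+?""|\S+)\s+
    (?<timeout_directive>CleanProgramTimeout)\s+
    (?<timeout_instance>[0-9][0-9]?)\s+(?<timeout_seconds>[0-9]{2})
}x;

my @paths;
for my $line (
    q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30},
    q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30},
) {
    next unless $line =~ $re;

    # Copy the captures first: %+ is reset by the next successful match,
    # including the substitution below.
    my %field = %+;

    # Strip the doubled quotes, if present.
    (my $path = $field{path}) =~ s/\A""|""\z//g;

    push @paths, $path;
    print "$field{directive} $field{instance}: $path\n";
}
```

Copying %+ into a plain hash before doing anything else avoids losing the captures when a later match or substitution succeeds.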
Update: I cannot get Text::Balanced or Text::ParseWords to parse this correctly. I suspect the problem is the doubled quotation marks that delimit the substring that should not be split. The following code is my best (not very good) attempt at solving the generic problem by using split and then selectively re-gathering parts of the string.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};
print Dumper parse_line($s);
print Dumper parse_line($t);
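sub parse_line {
    my ($line) = @_;
    # Keep the whitespace separators so quoted fields can be re-joined verbatim.
    my @parts = split /(\s+)/, $line;
    my @real_parts;
    for (my $i = 0; $i < @parts; $i += 1) {
        # Ordinary field: keep it unless it is a whitespace separator.
        unless ( $parts[$i] =~ /^""/ ) {
            push @real_parts, $parts[$i] if $parts[$i] =~ /\S/;
            next;
        }
        # Quoted field: re-gather pieces until the closing "" is seen.
        # The bound on $i stops the loop if the closing "" is missing.
        my $part = $parts[$i];
        while ( $part !~ /""$/ and $i < $#parts ) {
            $part .= $parts[++$i];
        }
        push @real_parts, $part;
    }
    return \@real_parts;
}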