I'm looking for a way to split a string that contains text in the following format:

"abcd efgh 'ijklm no pqrs' tuv"

which will produce the following results:

['abcd', 'efgh', 'ijklm no pqrs', 'tuv']

In other words, it should split on whitespace unless the whitespace is inside a single-quoted string. I think it could be done with .NET regexes using lookaround operators or balancing groups. I'm not so sure about Perl.

+13  A: 

Use Text::ParseWords:

#!/usr/bin/perl

use strict; use warnings;
use Text::ParseWords;

my @words = parse_line('\s+', 0, "abcd efgh 'ijklm no pqrs' tuv");

use Data::Dumper;
print Dumper \@words;

Output:

C:\Temp> ff
$VAR1 = [
          'abcd',
          'efgh',
          'ijklm no pqrs',
          'tuv'
        ];

You can look at the source code for Text::ParseWords::parse_line to see the pattern used.
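
As a rough illustration of what the second argument to parse_line does, here is a minimal sketch that runs the same input with the keep flag off and on, and also through shellwords from the same module. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line shellwords);

my $line = "abcd efgh 'ijklm no pqrs' tuv";

# Second argument (keep) false: the quotes are removed from the fields.
my @stripped = parse_line( '\s+', 0, $line );

# Keep true: the quotes stay in the returned fields.
my @kept = parse_line( '\s+', 1, $line );

# shellwords() splits on whitespace and removes the quotes.
my @shell = shellwords($line);

print join( '|', @stripped ), "\n";   # abcd|efgh|ijklm no pqrs|tuv
print join( '|', @kept ),     "\n";   # abcd|efgh|'ijklm no pqrs'|tuv
print join( '|', @shell ),    "\n";   # abcd|efgh|ijklm no pqrs|tuv
```

Note that the module also treats double quotes and backslashes specially, so input containing those may come out differently than a plain single-quote splitter would produce.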

Sinan Ünür
I love how every "how do I do this?" question I have ever had about Perl has been quickly answered by "Use this module that does exactly what you want."
Jergason
Figures there is a package to do exactly what I need. I wasn't sure what I was looking for. You're a rock star, thanks!
Kivin
@Jergason blame it on the wonderful people who, when they *don't* find exactly what they need, and have to write it themselves, CPAN the result afterwards. :)
hobbs
And then blame the wonderful people who write CPAN modules that use every other possible CPAN module, no matter how tiny, so that you must pull in ten other mostly-useless modules.
Zan Lynx
@zan FWIW, `Text::ParseWords` is in the core. Also, modules or distributions with giant dependency lists are not that common.
Sinan Ünür
+2  A: 

So you've decided to use a regex? Now you have two problems.

Allow me to infer a little bit. You want an arbitrary number of fields, where a field is either a run of text containing no spaces, or a quoted string that is set off by spaces, begins and ends with a quote, and may contain spaces in between.

In other words, you want to do what a command line shell does. You really should just reuse something. Failing that, you should capture a field at a time, with a regex something like:

^ *('[^']*'|[^ ]+)(.*)

Append group 1 to your list (minus any surrounding quotes) and continue the loop with the contents of group 2, until nothing but whitespace remains.
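
A minimal Perl sketch of that loop might look like the following. The helper name split_fields is hypothetical, and the pattern tries the quoted alternative first so that a quoted field is not cut at its first space. It assumes quotes are always balanced.

```perl
use strict;
use warnings;

# Peel one field at a time off the front of the string.
# split_fields is a hypothetical helper name, not part of any module.
sub split_fields {
    my ($rest) = @_;
    my @fields;

    # Group 1 is the next field, group 2 is everything after it.
    while ( $rest =~ /^ *('[^']*'|[^ ]+)(.*)/s ) {
        my ( $field, $tail ) = ( $1, $2 );
        $field =~ s/^'(.*)'$/$1/s;    # drop the surrounding quotes, if any
        push @fields, $field;
        $rest = $tail;                # continue with the remainder
    }
    return @fields;
}

my @fields = split_fields("abcd efgh 'ijklm no pqrs' tuv");
print join( '|', @fields ), "\n";     # abcd|efgh|ijklm no pqrs|tuv
```

Each pass consumes at least one character, so the loop ends once only spaces (or nothing) are left.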

A single match with a fixed set of capture groups can't capture an arbitrarily large number of fields, which is why the loop is needed. You might be able to split on a regex instead (Python will do this, not sure about Perl), but since the pattern matches the fields rather than the spaces between them, I'm not sure that is even an option.
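
For what it's worth, Perl can apply one pattern repeatedly with the /g modifier and return every capture in list context, which sidesteps the fixed-number-of-groups limit. The sketch below is one way to do that, using a branch reset group (Perl 5.10 or later). The helper name match_fields is hypothetical, and it assumes an apostrophe only ever appears as a quote delimiter.

```perl
use strict;
use warnings;

# match_fields is a hypothetical helper name used for illustration.
sub match_fields {
    my ($text) = @_;

    # (?| ... ) is a branch reset group: both alternatives capture into
    # group 1, so list-context /g yields exactly one value per field.
    return $text =~ /(?|'([^']*)'|([^\s']+))/g;
}

my @fields = match_fields("abcd efgh 'ijklm no pqrs' tuv");
print join( '|', @fields ), "\n";    # abcd|efgh|ijklm no pqrs|tuv
```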

Mark Santesson
+3  A: 
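use strict; use warnings;

my $text = "abcd efgh 'ijklm no pqrs' tuv 'xwyz 1234 9999' 'blah'";
my @out;

# Splitting on the quote character alternates between unquoted text
# (even indices) and quoted text (odd indices).
my @parts = split /'/, $text;

for my $i ( 0 .. $#parts ) {
    if ( $i % 2 ) {
        push @out, $parts[$i];               # quoted: keep as one field
    }
    else {
        push @out, split ' ', $parts[$i];    # unquoted: split on whitespace
    }
}

use Data::Dumper;
print Dumper \@out;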