tags:

views:

3668

answers:

6

I'm having some issues with parsing CSV data with quotes. My main problem is with quotes within a field. In the following example lines 1 - 4 work correctly but 5,6 and 7 don't.

COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,

I'd like to avoid Text::CSV as it isn't installed on the target server. Realising that CSV's are are more complicated than they look I'm using a recipe from the Perl Cookbook.

sub parse_csv {
  my $text = shift; #record containg CSVs
  my @columns = ();
  push(@columns ,$+) while $text =~ m{
    # The first part groups the phrase inside quotes
    "([^\"\\]*(?:\\.[^\"\\]*)*)",?
      | ([^,]+),?
      | ,
    }gx;
  push(@columns ,undef) if substr($text, -1,1) eq ',';
  return @columns ; # list of vars that was comma separated.
}

Does anyone have a suggestion for improving the regex to handle the above cases?

A: 

Finding matching pairs using regexs is non-trivial and generally unsolvable task. There are plenty of examples in the Jeffrey Friedl's Mastering regular expressions book. I don't have it at hand now, but I remember that he used CSV for some examples, too.

Eugene Morozov
"unsolvable"? You can use regexes to find matching quotes easily! It's parenthesis that regexes can't do, not because they're matching, but because they're NESTED matching. You can't (generally) nest quotes. (You can have \", but that doesn't start a new string inside the old one, now does it?)
Chris Lutz
Thanks Eugene, funnily enough i'm pretty sure the Perl Cookbook example is taken from MRE :) I will double check though.
Mark Nold
/((?:[^\n,"]|"(?:[^"]|"")+")+),/g should be closer to what the OP wants, but is by my own admission still imperfect.
Chris Lutz
Yes, I believe quotes, as-is, inside a quoted area is invalid CSV, there's some escaping mechansim required otherwise working it out by guesswork is simply impossible.
Kent Fredric
+20  A: 

Please, Try Use CPAN

There's no reason you couldn't download a copy of Text::CSV, or any other non-XS based implementation of a CSV parser and install it in your local directory, or in a lib/ sub directory of your project so its installed along with your projects rollout.

If you can't store text files in your project, then I'm wondering how it is you are coding your project.

http://sial.org/howto/perl/life-with-cpan/non-root/

Should be a good guide on how to get these into a working state locally.

Not using CPAN is really a recipe for disaster.

Please consider this before trying to write your own CSV implementation.

Text::CSV is over hundred lines of code, including fixed bugs and edge cases, and re-writing this from scratch will just make you learn how awful CSV can be the hard way.

note: I learnt this the hard way. Took me a full day to get a working CSV parser in PHP before I discovered an inbuilt one had been added in a later version. It really is something awful.

Kent Fredric
@Kent, thanks... my main aversion to Text::CSV was the difficulty installing at the other.. ie: do they have a compiler (not all un*xs come with one) etc. But i rechecked (due to your first post) and there is a Pure Perl implementation. CSV_PP. Thanks.
Mark Nold
You might also want to look at [Text::xSV](http://p3rl.org/Text::xSV) , which is also Pure Perl
Kent Fredric
I'll have to check that out as Text:CSV_PP doesn't work with the fifth case, even with allow_loose_quotes and escape_char set. Thanks again.
Mark Nold
I've had to deal with junior programmers demanding every CPAN module under the sun be installed when a simple regexp would have sufficed. The Perl community advocates "there is more than one way to do it" so spray painting LARGE BOLD LETTERS as if there is ONLY ONE APPROACH does not help those who may actually want to discover an alternative method.
PP
A: 

You can (try to) use CPAN.pm to simply have your program install/update Text::CSV. As said before, you can even "install" it to a home or local directory, and add that directory to @INC (or, if you prefer not to use BEGIN blocks, you can use lib 'dir'; - it's probably better).

Chris Lutz
+6  A: 

You can parse CSV using Text::ParseWords which ships with Perl.

use Text::ParseWords;

while (<DATA>) {
    chomp;
    my @f = quotewords ',', 0, $_;
    say join ":" => @f;
}

__DATA__
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,

which parses your CSV correctly....

# => COLLOQ_TYPE:COLLOQ_NAME:COLLOQ_CODE:XDATA
# => S:BELT,FAN:003541547:
# => S:BELT V,FAN:000324244:
# => S:SHROUD SPRING SCREW:000868265:
# => S:D REL VALVE ASSY:000771881:
# => S:YBELT,V:000323030:
# => S:YBELT,'V':000322933:

The only issue I've had with Text::ParseWords is when nested quotes in data aren't escaped correctly. However this is badly built CSV data and would cause problems with most CSV parsers ;-)

So u may notice that

# S,"YBELT,"V"",000323030,

came out as (ie. quotes dropped around "V")

# S:YBELT,V:000323030:

however if its escaped like so

# S,"YBELT,\"V\"",000323030,

then quotes will be retained

# S:YBELT,"V":000323030:

Hope that helps

/I3az/

draegtun
FYI Text::ParseWords is included in all versions of Perl 5:perl -MModule::CoreList -l -e'print Module::CoreList->first_release_by_date("Text::ParseWords");' printd 5.000
mirod
@draegtun and @mirod thanks for the pointer. I'm testing this out now, thanks.
Mark Nold
A: 

Tested:


use Test::More tests => 2;

use strict;

sub splitCommaNotQuote {
    my ( $line ) = @_;

    my @fields = ();

    while ( $line =~ m/((\")([^\"]*)\"|[^,]*)(,|$)/g ) {
        if ( $2 ) {
            push( @fields, $3 );
        } else {
            push( @fields, $1 );
        }
        last if ( ! $4 );
    }

    return( @fields );
}

is_deeply(
    +[splitCommaNotQuote('S,"D" REL VALVE ASSY,000771881,')],
    +['S', '"D" REL VALVE ASSY', '000771881', ''],
    "Quote in value"
);
is_deeply(
    +[splitCommaNotQuote('S,"BELT V,FAN",000324244,')],
    +['S', 'BELT V,FAN', '000324244', ''],
    "Strip quotes from entire value"
);
PP