Normally I just use TStringList.CommaText, but this won't work when a given field spans multiple lines. Basically I need a CSV processor that conforms to RFC 4180. I'd rather not have to implement the RFC myself.

+1  A: 

Do you really need full RFC support? I can't count the number of times I've written a "CSV parser" in Perl or something similar: split on commas and be done. The only problem comes when you need to respect quotes. If you do, write a "quotesplit" routine that looks for quotes and ensures they're balanced. Unless this CSV processor is the meat and potatoes of some application, I'm not sure it'll really be a problem.

On the other hand, I really don't think fully implementing the RFC is that complex. It's a relatively short RFC compared to those for HTTP, SMTP, IMAP, ...

In Perl, a decent quotesplit() I wrote is:

sub quotesplit {
    my ($regex, $s, $maxsplits) = @_;
    my @split;
    my $quotes = "\"'";
    die("usage: quotesplit(qr/.../,'string...'), // instead of qr//?\n")
        if scalar(@_) < 2;

    my $lastpos = 0;    # end of the previous match; starting at 0 avoids undef warnings
    while (1) {
        my $pos = pos($s) // 0;    # start of the current field

        while ($s =~ m/($regex|(?<!\\)[$quotes])/g) {
            if ($1 =~ m/[$quotes]/) {
                $s =~ m/[^$quotes]*/g;
                $s =~ m/(?<!\\)[$quotes]/g;
            }
            else {
                push @split, substr($s,$pos,pos($s) - $pos - length($1));
                last;
            }
        }

        if (defined(pos($s)) and $lastpos > pos($s)) {
            # The scan moved backwards, e.g. after an unbalanced quote
            die sprintf("quotesplit() issue: lastpos %s > pos %s\n",
                $lastpos, pos($s)
            );
        }
        if ((defined($maxsplits) && scalar(@split) == ($maxsplits - 1))) {
            push @split, substr($s,pos($s));
            last;
        }
        elsif (not defined(pos($s))) {
            push @split, substr($s,$lastpos);
            last;
        }

        $lastpos = pos($s);
    }

    return @split;
}
xyld
Your "quotesplit" suggestion is what I went with (I'd just finished testing it when I read your post). Basically I'm ensuring that there is an even number of quotes on each line; if not, I process the next line as part of the same record.
Alister
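The even-quote check described in the comment above can be sketched as follows. This is a minimal Python illustration, not the code actually used in the thread: the function name `join_records` is hypothetical, and it assumes quotes inside fields are escaped by doubling (`""`), so that every complete record contains an even number of double quotes.

```python
def join_records(lines):
    """Yield logical CSV records from an iterable of physical lines.

    A record is treated as complete once it contains an even number of
    double quotes; otherwise the next physical line is appended to it.
    """
    pending = None
    for line in lines:
        line = line.rstrip("\r\n")
        # Re-insert the line break that belongs to the quoted field
        pending = line if pending is None else pending + "\n" + line
        if pending.count('"') % 2 == 0:
            yield pending
            pending = None
    if pending is not None:
        # Unbalanced quotes at end of input: hand back what was collected
        yield pending
```

Each yielded record can then be split into fields by whatever quote-aware splitter is already in place.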
@alister My solution doesn't require an "even number of quotes", but it could definitely be enhanced. If you can read Perl, it may be of use, but maybe just the idea will help. Good luck.
xyld
@xyld The even-quotes solution is working perfectly, although I do need to do a quick parse for escaped double quotes first, which in this case is \" rather than the correct "" (doubled double quotes) and could lead to parsing problems. The RFC indicates that each record is guaranteed to contain an even number of double quotes. However, due to the number of different implementations of CSV, I suspect that relying on this might be a bit presumptuous.
Alister
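The pre-pass mentioned in the comment above might look like the sketch below. It is an assumption-laden illustration in Python: the function name `normalize_escaped_quotes` is hypothetical, and it assumes a backslash appears in the data only as an escape for a double quote. Input where a literal backslash ends a quoted field (`\\"`) would need a real tokenizer instead.

```python
def normalize_escaped_quotes(text):
    r"""Rewrite backslash-escaped quotes (\") as doubled quotes ("").

    Assumes a backslash is only ever used to escape a double quote.
    """
    return text.replace('\\"', '""')
```

Without this step a field such as `"5\" disk"` contains three double quotes, so an even-quote check would wrongly pull the next line into the same record; after normalization it contains four.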
@alister Definitely. Most parsers simply ignore the last unmatched quote.
xyld
A: 

Did you try using Delimiter := ';' and DelimitedText instead of CommaText?

By the way, that RFC makes no sense at all... it's absurd to have a Request for Comments on CSV...

111
The problem is multiple lines per record.
Alister
@111: Read the Wikipedia entry on RFC: http://en.wikipedia.org/wiki/Request_for_Comments. The name has remained even though RFCs serve a different purpose today and it no longer fits. And it is of course far from absurd or senseless - there needs to be a canonical format description *somewhere*.
mghie
A: 

Here is my CSV parser (maybe not to the RFC, but it works fine). Keep calling it on a supplied string; each call returns the next CSV field. I don't believe it has any problems with multi-line fields.

function CSVFieldToStr(
           var AStr : string;
               ADelimChar : char = ',' ) : string;
{ Returns the next CSV field str from AStr, deleting it from AStr,
  with delimiter }
var
  bHasQuotes : boolean;

  function HandleQuotes( const AStr : string ) : string;
  begin
    Result := Trim(AStr);
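    If bHasQuotes then
      begin
      { Strip the enclosing quotes, then collapse doubled quotes.
        StringReplace and Trim come from SysUtils. }
      If (Length(Result) >= 2) and (Result[1] = '"')
         and (Result[Length(Result)] = '"') then
        Result := Copy( Result, 2, Length(Result) - 2 );
      Result := StringReplace( Result, '""', '"', [rfReplaceAll] );
      end;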
  end;

var
  bInQuote    : boolean;
  I           : integer;
  C           : char;
begin
  bInQuote   := False;
  bHasQuotes := False;
  For I := 1 to Length( AStr ) do
    begin
    C := AStr[I];
    If C = '"' then
      begin
      bHasQuotes := True;
      bInQuote := not bInQuote;
      end
     else
      If not bInQuote then
       If C = ADelimChar then
          begin
          Result := HandleQuotes( Copy( AStr, 1, I-1 ));
          AStr   := Trim(Copy( AStr, I+1, MaxInt ));
          Exit;
          end;
    end;
  Result := HandleQuotes(AStr);
  AStr := '';
end;
Brian Frost
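For comparison, the same quote-toggling scan can be sketched in Python. This is an illustrative variant rather than a translation of the routine above: the names `split_fields` and `_unquote` are hypothetical, it returns all fields of one logical record at once, and it does not trim whitespace. It assumes quotes inside fields are doubled and that the record has already been assembled from its physical lines.

```python
def _unquote(field):
    """Strip enclosing quotes and collapse doubled quotes, if quoted."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field


def split_fields(record, delim=","):
    """Split one logical CSV record into fields with a quote-toggling scan."""
    fields = []
    start = 0
    in_quote = False
    for i, ch in enumerate(record):
        if ch == '"':
            # A doubled quote toggles twice, so the state is unchanged
            in_quote = not in_quote
        elif ch == delim and not in_quote:
            fields.append(_unquote(record[start:i]))
            start = i + 1
    # Whatever follows the last delimiter is the final field (possibly empty)
    fields.append(_unquote(record[start:]))
    return fields
```

Returning a list rather than consuming the input string keeps a trailing empty field (as in `a,b,`), which a loop that stops when the remaining string is empty would drop.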