I am using Text::CSV to parse a CSV file. Some lines cannot be parsed because they contain bad characters.
The Text::CSV documentation says:

Allowable characters within a CSV field include 0x09 (tab) and the inclusive range of 0x20 (space) through 0x7E (tilde).
What is the easiest way to filter out any characters that are not allowed?

A: 
$subject =~ s/[^\x09\x20-\x7E]+//g;

will remove all those characters.
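As a minimal sketch of that substitution in context (the sample string below is made up for illustration), the negated character class keeps only tab and the range 0x20 through 0x7E and deletes everything else:

```perl
use strict;
use warnings;

# Hypothetical input: contains a Latin-1 e-acute (0xE9), a NUL and a bell character.
my $subject = "caf\xE9,\x00foo\tbar\x07,baz";

# Copy the string, then delete every character that is not a tab (0x09)
# or in the printable ASCII range 0x20-0x7E.
(my $clean = $subject) =~ s/[^\x09\x20-\x7E]+//g;

print "$clean\n";
```

Note that this also strips line endings (0x0A, 0x0D), so it is probably best applied to one line at a time rather than to the whole file contents.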

But this seems like a strange limitation on what's allowed in a CSV file. I haven't seen a CSV parser yet that couldn't handle, for example, umlauts and other non-ASCII characters. I don't know Perl, though.

Tim Pietzcker
+9  A: 

Instead of filtering out the "bad" characters, you probably want to use the binary flag to tell Text::CSV to stop enforcing its ASCII-only rule:

my $csv = Text::CSV->new ({ binary => 1 });
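A rough sketch of how that might look when parsing a single line; the sample line (Latin-1 bytes for "Müller" and "Zürich") is invented for illustration, and the error handling is only one possible choice:

```perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 lets fields contain characters outside tab and 0x20-0x7E.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

# Hypothetical input line with u-umlaut (0xFC) in two of its fields.
my $line = qq{1,"M\xFCller",Z\xFCrich};

$csv->parse($line) or die "Parse failed: " . $csv->error_diag();
my @fields = $csv->fields();

print scalar(@fields), " fields\n";
```

This keeps the original data intact instead of silently dropping characters, which is usually preferable to filtering.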

If you're trying to read a file that's in a non-ASCII character set (e.g. Latin-1 or UTF-8), you should look at the Text::CSV::Encoded module.

cjm
+1. See, I thought that this couldn't really be a limitation of Perl's csv parser.
Tim Pietzcker
It would be nice if that were mentioned in the docs.
weismat
binary is mentioned in the docs.
MkV