I'm having issues with Unicode characters in Perl. When I receive data from the web, I often get garbled sequences like √¢¬Ä¬ú or √¢¬Ç¬¨. The first one should be a left quotation mark (“) and the second the Euro symbol (€).

Now I can easily substitute in the correct values in Perl and print the corrected words to the screen, but when I try to output to a .CSV file, all the substitutions I have done are for nothing and I get garbage in my .CSV file. (The quotes work; I'm guessing that's because they are such common characters.) Also, Numéro comes out garbled. The examples are endless.

I wrote a small program to try to figure this issue out, but I am not sure what the problem is. I read on another Stack Overflow thread that you can import the .CSV in Excel and choose UTF-8 encoding, but this option does not pop up for me. I'm wondering if I can just encode it into whatever Excel's native character set is (UTF-16BE?), or if there is another solution. I have tried many variations on this short program, and let me say again that it's just for testing out Unicode problems, not part of a real program. Thanks.

use strict;
use warnings;
require Text::CSV_XS;
use Encode qw/encode decode/;

my $text = 'Numéro Numéro Numéro Orkos Capital SAS (√¢¬Ä¬úOrkos√¢¬Ä¬ù) 325M√¢¬Ç¬¨ in 40 companies headquartered';

print("$text\n\n\n");

$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;

print("$text\n\n\n");

my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV_XS->error_diag();

open my $OUTPUT, ">:encoding(utf8)", "unicode.csv" or die "unicode.csv: $!";

my @row = ($text);

$CSV->print($OUTPUT, \@row);
$OUTPUT->autoflush(1);

I've also tried these two lines to no avail:

$text = decode("Guess", $text);
$text = encode("UTF-16BE", $text);
+1  A: 

First, your strings are encoded in MacRoman. Interpreted as a byte sequence, the second one (the Euro example) is C3 A2 C2 82 C2 AC. This looks like UTF-8, and the decoded form is E2 82 AC. This again looks like UTF-8, and when you decode it you get € (U+20AC). So what you need to do is:

$step1 = decode("MacRoman", $text);
$step2 = decode("UTF-8", $step1);
$step3 = decode("UTF-8", $step2);

Don't ask me by what mysterious route this encoding was created in the first place. Your first character decodes as U+201C, which is indeed the LEFT DOUBLE QUOTATION MARK.

Note: If you are on a Mac, the first decoding step may be unnecessary since the encoding is only in the "presentation layer" (when you copied the Perl source into the HTML form and your browser did the encoding-translation for you) and not in the data itself.
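A minimal sketch of the chain described above, with every byte/character conversion spelled out (encode turns characters into bytes, decode turns bytes into characters). The garbled Euro sequence is written as code-point escapes so the result does not depend on how the file is saved; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The garbled Euro sign from the question, as code points:
# U+221A, U+00A2, U+00AC, U+00C7, U+00AC, U+00A8
my $garbled = "\x{221A}\x{A2}\x{AC}\x{C7}\x{AC}\x{A8}";

# Characters -> bytes: the text seen as MacRoman octets.
my $bytes1 = encode("MacRoman", $garbled);     # C3 A2 C2 82 C2 AC

# Bytes -> characters: first UTF-8 layer, giving U+00E2 U+0082 U+00AC.
my $chars1 = decode("UTF-8", $bytes1);

# Those code points are all below 0x100, so they map 1:1 to octets.
my $bytes2 = encode("ISO-8859-1", $chars1);    # E2 82 AC

# Bytes -> characters: second UTF-8 layer, giving U+20AC.
my $euro = decode("UTF-8", $bytes2);

printf "%s\n", join " ", map { sprintf "%02X", ord } split //, $bytes1;
printf "U+%04X\n", ord $euro;
```

Each decode here is fed bytes, never an already-decoded string.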

Roland Illig
When I try this I get the following error: Cannot decode string with wide characters at /Library/Perl/Updates/5.10.0/darwin-thread-multi-2level/Encode.pm line 174. What is meant by "Wide Characters"? Also, I am on a Mac.
Usually, when you `decode` something, you go from a byte-sequence to a char-sequence. The "Wide Characters" error message tells you that you already have a char-sequence. It's a safety-net that prevents you from doing things that you normally don't want.
Roland Illig
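A small sketch of that safety net, assuming the Encode module that ships with Perl. The first decode receives bytes and succeeds; the second receives a string that already contains a character above 0xFF, so it should die instead of returning a value. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three octets: the UTF-8 encoding of the Euro sign.
my $octets = "\xE2\x82\xAC";

# First decode: bytes -> one character, U+20AC.
my $chars = decode("UTF-8", $octets);

# Second decode: the input now holds a code point above 0xFF, which
# cannot be an octet, so Encode refuses. eval traps the error in $@.
my $again = eval { decode("UTF-8", $chars) };
my $error = $@;

printf "length after first decode: %d\n", length $chars;
print  "second decode failed: $error" unless defined $again;
```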
Perhaps it helps if you save your Perl program not in the MacRoman encoding but in UTF-8. Or do you do that already?
Roland Illig
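If the source file is saved as UTF-8, the `utf8` pragma is what tells Perl to read string literals as characters rather than raw bytes. A rough sketch of the difference, assuming this file itself is saved as UTF-8 (under that assumption the two lengths should be 3 and 1):

```perl
use strict;
use warnings;

# Without the pragma, a UTF-8 encoded literal is seen as raw octets.
my $as_bytes = do { no utf8; "€" };

# With the pragma, the same literal is a single character (U+20AC).
my $as_chars = do { use utf8; "€" };

printf "no utf8:  length %d\n", length $as_bytes;
printf "use utf8: length %d\n", length $as_chars;
```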
A: 

So I figured out the answer; the comment from Roland Illig helped me get there (thanks again!). Decoding more than once causes the wide characters error, so it should not be done.

The key here is decoding the UTF-8 text and then encoding it in MacRoman. To send the .CSV files to my Windows friends I have to save them as .XLSX first so that the encoding doesn't get all screwy again.

$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;

$text = decode("UTF-8", $text);

print("$text\n\n\n");

my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV_XS->error_diag();

open my $OUTPUT, ">:encoding(MacRoman)", "unicode.csv" or die "unicode.csv: $!";
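As a self-contained check of the decode-once-then-encode idea, here is a rough sketch that does in memory what the `:encoding(MacRoman)` layer does on output. The sample string is an assumption modelled on the test text in the question, written as escapes so it does not depend on the file's own encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# UTF-8 octets for "Numéro 325M€" (assumed sample input).
my $utf8_octets = "Num\xC3\xA9ro 325M\xE2\x82\xAC";

# Decode exactly once: octets -> characters.
my $chars = decode("UTF-8", $utf8_octets);

# Encode for the target: in MacRoman, é is 0x8E and € is 0xDB.
my $mac_octets = encode("MacRoman", $chars);

printf "%d characters, %d MacRoman octets\n",
    length $chars, length $mac_octets;
printf "%s\n", join " ", map { sprintf "%02X", ord } split //, $mac_octets;
```

Every character in this sample has a single-byte MacRoman code, so the character count and the octet count match.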