ansaurus

Question

Answer 1

+1 A:

First, your strings are encoded in MacRoman. When you interpret them as byte sequences, the second one yields C3 A2 C2 82 C2 AC. This looks like UTF-8, and decoding it gives E2 82 AC. That again looks like UTF-8, and decoding it once more gives €. So what you need to do is:

$step1 = decode("MacRoman", $text);
$step2 = decode("UTF-8", $step1);
$step3 = decode("UTF-8", $step2);

Don't ask me by what mysterious route this encoding came about in the first place. Your first character decodes as U+201C, which is indeed the LEFT DOUBLE QUOTATION MARK.

Note: If you are on a Mac, the first decoding step may be unnecessary since the encoding is only in the "presentation layer" (when you copied the Perl source into the HTML form and your browser did the encoding-translation for you) and not in the data itself.
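The byte arithmetic above can be checked with a small sketch. It assumes the data really is the six bytes quoted in the answer; the variable names are made up for illustration. When starting from the displayed characters rather than from raw bytes, the MacRoman step has to be an `encode` (characters to bytes), which is what the last lines show.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes of the doubly UTF-8-encoded euro sign, as quoted above.
my $bytes = "\xC3\xA2\xC2\x82\xC2\xAC";

# First pass: six bytes become the three characters U+00E2 U+0082 U+00AC.
my $once = decode("UTF-8", $bytes);

# Turn those characters back into the bytes E2 82 AC, then decode again.
my $twice = decode("UTF-8", encode("ISO-8859-1", $once));

printf "U+%04X\n", ord($twice);   # prints U+20AC

# If the text arrives as the displayed characters instead of raw bytes,
# encoding them as MacRoman recovers the same six bytes.
my $mojibake   = "\x{221A}\x{A2}\x{AC}\x{C7}\x{AC}\x{A8}";
my $from_chars = encode("MacRoman", $mojibake);

print $from_chars eq $bytes ? "same bytes\n" : "different bytes\n";
```

The explicit ISO-8859-1 step only makes the characters-to-bytes conversion visible; it is a choice made for this sketch, not something taken from the answer.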

Roland Illig 2010-07-24 21:03:52

When I try this I get the following error: Cannot decode string with wide characters at /Library/Perl/Updates/5.10.0/darwin-thread-multi-2level/Encode.pm line 174. What is meant by "Wide Characters"? Also, I am on a Mac.

2010-07-24 21:22:54

Usually, when you `decode` something, you go from a byte sequence to a character sequence. The "Wide Characters" error message tells you that you already have a character sequence. It's a safety net that stops you from doing something you normally don't want to do.
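A minimal sketch of how that error comes about, assuming a string that already holds a character above U+00FF (the variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# A character string holding U+201C, which cannot be stored in a single byte.
my $chars = "\x{201C}";

# decode() expects bytes, so this call dies; eval captures the message in $@.
my $result = eval { decode("UTF-8", $chars) };
my $error  = $@;

print defined $result ? "decoded\n" : "decode failed: $error";
```

A string whose characters are all below U+0100 does not trigger the error, because it can still be treated as bytes.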

Roland Illig 2010-07-24 22:01:20

It might help to save your Perl program in UTF-8 rather than in MacRoman. Or are you doing that already?

Roland Illig 2010-07-24 22:03:10

Answer 2

A:

So I figured out the answer; the comment from Roland Illig helped me get there (thanks again!). Decoding more than once causes the wide characters error, so it should not be done.

The key here is decoding the UTF-8 text and then encoding it in MacRoman. To send the .CSV files to my Windows friends I have to save them as .XLSX first so that the encoding doesn't get all screwy again.

$text =~ s/√¢¬Ä¬ú|√¢¬Ä¬ù/"/sig;
$text =~ s/√¢¬Ä¬ôs/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/√Ç¬†/ /sig;

$text = decode("UTF-8", $text);

print("$text\n\n\n");

my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV_XS->error_diag();

open my $OUTPUT, ">:encoding(MacRoman)", "unicode.csv" or die "unicode.csv: $!";
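The output step can be tried on its own with core modules only. This is a sketch under assumptions: the sample text and the temporary file are made up, and Text::CSV_XS is left out so that only the `:encoding(MacRoman)` layer is exercised.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical sample text: a euro sign between curly quotes, as characters.
my $text = "\x{201C}\x{20AC}\x{201D}";

# Temporary file standing in for unicode.csv.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# The :encoding layer converts characters to MacRoman bytes on output.
open my $out, ">:encoding(MacRoman)", $path or die "$path: $!";
print {$out} "$text\n";
close $out or die "$path: $!";

# Read the raw bytes back and decode them to confirm the round trip.
open my $in, "<:raw", $path or die "$path: $!";
my $bytes = do { local $/; <$in> };
close $in;

my $round_trip = decode("MacRoman", $bytes);
print $round_trip eq "$text\n" ? "round trip ok\n" : "round trip failed\n";
```

The point is that the string handed to `print` has to be a decoded character string; the layer then takes care of producing the MacRoman bytes.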

2010-07-25 18:17:29

Perl Text::CSV_XS Encoding Issues
