ansaurus

Question

How can I handle a file with multiple encodings in it?

Answer 1

+1 A:

If you are simply redirecting Perl's output, then Perl will have a difficult time producing a decent file.

You should try writing the file directly from Perl.

You should also check whether you really have an encoding problem or whether characters that simply don't belong in your file still end up there. Use vi or a hex editor or simply hexdump to do that.
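
If the file is too large to inspect by eye, a short script can narrow things down. The following is a minimal sketch, assuming the file can be read line by line as raw bytes; the helper name invalid_utf8_lines is made up for illustration. It reports the line numbers whose bytes are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the 1-based numbers of lines that are not valid UTF-8.
sub invalid_utf8_lines {
    my ($path) = @_;

    # Read raw bytes so that no decoding happens behind our back.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";

    my @bad;
    while (my $line = <$fh>) {
        # With FB_CROAK, decode() dies on malformed input; eval catches that.
        my $ok = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC); 1 };
        push @bad, $. unless $ok;
    }
    close $fh;
    return @bad;
}
```

Calling invalid_utf8_lines('mails.txt') (the file name is only an example) gives you a list of lines to look at in a hex editor, which is usually quicker than scanning the whole file.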

innaM 2008-12-15 14:50:40

If there were non-Unicode characters, why would that break the file?

Joshxtothe4 2008-12-15 15:09:10

I think they mean standard ASCII characters, but I believe ASCII codes are valid in UTF-8 anyway, are they not?

ShoeLace 2008-12-15 15:19:04

The shell should not interfere. It doesn't modify the text that is outputted.

Leon Timmermans 2008-12-15 15:25:48

You're right, Leon.

innaM 2008-12-15 15:36:29

Answer 2

+2 A:

You can use the IO layers facility. Open a file like this to specify the encoding:

open my $fh, '>:encoding(UTF-8)', $file or die "Cannot open $file: $!";

Or you can use binmode() to alter an already opened filehandle:

binmode(STDOUT, ':encoding(UTF-8)');

Of course, you can set encodings other than UTF-8, and there are plenty of other options, too. Just look at the documentation for open and binmode. Maybe IO::File is worth a look, too:

perldoc -f open
perldoc -f binmode
perldoc IO::File
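
To see what the encoding layer actually does, here is a small sketch that writes the same string through different layers and compares the resulting file sizes. The helper name bytes_written is invented for this example and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through the given encoding layer
# and return the size of the resulting file in bytes.
sub bytes_written {
    my ($path, $encoding, $text) = @_;

    open my $fh, ">:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    print {$fh} $text;
    close $fh or die "Cannot close $path: $!";

    return -s $path;
}
```

Writing the single character "\x{e9}" (e with acute accent) this way should produce a 2-byte file with 'UTF-8' and a 1-byte file with 'iso-8859-1'. The characters in your program are the same; only the layer decides which bytes end up on disk.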

Dave Vogt 2008-12-15 15:27:17

I have no idea what encoding to try. When I explicitly set the mode to UTF-8, it still fails to open, which, based on what I read, should not happen, as encoding verifies the data.

Joshxtothe4 2008-12-15 15:32:19

If you pass crap in you will most likely get crap out. Have you tried opening the file with something that is not as touchy as gedit to see where the problem might be?

innaM 2008-12-15 15:38:17

It is a file of 200 or so emails, so it is hard to go through it all. There were some Japanese characters in one of the mails, but removing them did not solve anything.

Joshxtothe4 2008-12-15 15:48:14

Doesn't Mail::Box::Manager provide information about the encoding of the specific messages?

Dave Vogt 2008-12-15 16:26:02

Dave: Yes it does, see my answer.

Leon Timmermans 2008-12-15 16:58:13

Answer 3

+3 A:

Different messages are probably in different encodings. gedit probably detects the file as UTF-8 but later finds that parts of it aren't UTF-8. Mixed files like this are a major PITA.

The best (perhaps only) solution is to check for the content type ($message->contentType) and convert everything to UTF-8.
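
The encoding of a message is normally announced in the charset parameter of its Content-Type header. As a rough sketch, assuming you have the raw header value as a string (how Mail::Box exposes it is something to check in its documentation), a hypothetical helper like charset_of could extract it. The fallback to us-ascii follows the MIME default for text without a charset parameter.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a raw Content-Type value.
# Returns $default (us-ascii unless given) when no charset parameter is present.
sub charset_of {
    my ($content_type, $default) = @_;
    $default = 'us-ascii' unless defined $default;

    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i) {
        return lc $1;
    }
    return $default;
}
```

For a value such as 'text/plain; charset="ISO-2022-JP"' this returns 'iso-2022-jp', which is a name Encode can resolve when the matching encoding module is installed.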

Leon Timmermans 2008-12-15 15:42:08

How would you convert to UTF-8 only if it was not UTF-8?

Joshxtothe4 2008-12-15 16:27:03

You should just use Encode::decode() to decode whatever encoding the message is using, and then output it to a filehandle that is opened as UTF-8.
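
A minimal sketch of that approach, assuming you already have each message body as raw bytes together with its charset name. The [charset, bytes] pairs and the helper name write_as_utf8 are made up for illustration. Note that UTF-8 input needs no special case: it is decoded like everything else and re-encoded by the output layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: each part is a [charset, raw bytes] pair,
# as if taken from an individual message.
sub write_as_utf8 {
    my ($path, @parts) = @_;

    open my $out, '>:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";

    for my $part (@parts) {
        my ($charset, $bytes) = @$part;
        # decode() turns bytes in the named encoding into Perl characters;
        # the :encoding(UTF-8) layer encodes them again on output.
        print {$out} decode($charset, $bytes);
    }

    close $out or die "Cannot close $path: $!";
    return;
}
```

For example, write_as_utf8('out.txt', ['iso-8859-1', "caf\xE9\n"], ['UTF-8', "caf\xC3\xA9\n"]) should leave two identical UTF-8 lines in out.txt.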

Leon Timmermans 2008-12-15 17:01:33
