Hello,

I have a small program that sorts email messages by date and writes them to a text file using $msg->decoded->string. The Perl program prints to stdout, and I redirect that to a .txt file. However, gedit is unable to open this text file because of a character set problem, and I would like to know how to restore or set a character set with Perl.

The program is now thus:

#!/usr/bin/perl
use warnings;
use strict;
use Mail::Box::Manager;

open (MYFILE, '>>data.txt')
or die "data.txt: Unable to open: $!\n";

my $file = shift || $ENV{MAIL};
my $mgr = Mail::Box::Manager->new(
    access          => 'r',
);

my $folder = $mgr->open( folder => $file )
or die "$file: Unable to open: $!\n";

for my $msg ( sort { $a->timestamp <=> $b->timestamp } $folder->messages)
{
    my $to          = join( ', ', map { $_->format } $msg->to );
    my $from        = join( ', ', map { $_->format } $msg->from );
    my $date        = localtime( $msg->timestamp );
    my $subject     = $msg->subject;
    my $body        = $msg->decoded->string;

    # Strip all quoted text
    $body =~ s/^>.*$//msg;

    print MYFILE <<"";
From: $from
To: $to
Date: $date
$body

}

However, I still have the same problem: I am unable to open the file with gedit, even though it opens in vi and similar editors. If there are non-Unicode characters in the file, would that break it?

+1  A: 

If you are simply redirecting Perl's output, then Perl will have a difficult time producing a decent file.

You should try writing the file directly from Perl.

You should also check whether you really have an encoding problem or whether characters that simply don't belong in your file still end up there. Use vi or a hex editor or simply hexdump to do that.
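One way to run that check without reading 200 emails by eye is to let Perl report which lines are not valid UTF-8. This is only a minimal sketch: the sub name invalid_utf8_lines is made up for illustration, and it assumes the file is meant to be UTF-8 throughout.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the line numbers in $path that do not
# decode as strict UTF-8.
sub invalid_utf8_lines {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "$path: Unable to open: $!\n";
    my @bad;
    while ( my $line = <$in> ) {
        # With FB_CROAK, decode() dies on malformed input;
        # LEAVE_SRC keeps $line itself untouched.
        my $ok = eval {
            Encode::decode( 'UTF-8', $line,
                Encode::FB_CROAK() | Encode::LEAVE_SRC() );
            1;
        };
        push @bad, $. unless $ok;
    }
    close $in;
    return @bad;
}

# Usage: perl check_utf8.pl data.txt
if (@ARGV) {
    print "line $_ is not valid UTF-8\n" for invalid_utf8_lines( $ARGV[0] );
}
```

If this prints nothing, the file is valid UTF-8 and the trouble is more likely stray control characters than the encoding itself.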

innaM
If there were non-Unicode characters, why would that break the file?
Joshxtothe4
I think they mean standard ASCII characters, but I believe ASCII codes are valid in UTF-8 anyway, are they not?
ShoeLace
The shell should not interfere. It doesn't modify the text that is outputted.
Leon Timmermans
You're right, Leon.
innaM
+2  A: 

You can use the IO layers facility. Open a file like this to specify the encoding:

open my $fh, '>:encoding(UTF-8)', $file;

Or you can use binmode() to alter an already opened filehandle:

binmode(STDOUT, ':encoding(UTF-8)');

Of course, you can set encodings other than UTF-8, and there are plenty of other options, too. Just look in the documentation for open and binmode. Maybe IO::File is worth a look, too:

perldoc -f open
perldoc -f binmode
perldoc IO::File
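
To see what the encoding layer actually does, here is a small sketch that writes a character string through it. The sub name write_utf8 is invented for this example; the program takes the output path as its first argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write a character string to $path through a
# UTF-8 encoding layer.
sub write_utf8 {
    my ( $path, $text ) = @_;
    open my $fh, '>:encoding(UTF-8)', $path
        or die "$path: Unable to open: $!\n";
    print {$fh} $text;
    close $fh or die "$path: Unable to close: $!\n";
}

# "\x{e9}" is the single character U+00E9; the layer writes it to
# disk as the two bytes C3 A9.
write_utf8( $ARGV[0], "caf\x{e9}\n" ) if @ARGV;
```

Note that the layer encodes characters. If the string still holds undecoded bytes from some other charset, each high byte is treated as a Latin-1 character and encoded, so the output is valid UTF-8 but the text is garbled.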
Dave Vogt
I have no idea what encoding to try. When I explicitly set the mode to UTF-8, the file still fails to open, which based on what I read should not happen, as the encoding layer verifies the data.
Joshxtothe4
If you pass crap in you will most likely get crap out. Have you tried opening the file with something that is not as touchy as gedit to see where the problem might be?
innaM
It is a file of 200 or so emails, so it is hard to go through it all. There were some Japanese characters in one of the mails, but removing them did not solve anything.
Joshxtothe4
Doesn't Mail::Box::Manager provide information about the encoding of the specific messages?
Dave Vogt
Dave: Yes it does, see my answer.
Leon Timmermans
+3  A: 

Different messages are probably in different encodings. gedit probably detects the file as UTF-8, but later finds out that parts of it aren't UTF-8. Mixed files like this are a major PITA.

The best (perhaps only) solution is to check for the content type ($message->contentType) and convert everything to UTF-8.

Leon Timmermans
How would you convert to UTF-8 only if it was not already UTF-8?
Joshxtothe4
You should just use Encode::decode() to decode whatever encoding the message is using, and then output it to a filehandle that is opened as UTF-8.
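A rough sketch of that approach follows. The sub name to_characters is invented for illustration, and the charset is assumed to have been read from each message's Content-Type header already; how to get it out of Mail::Box is not shown here. The ISO-8859-1 fallback for unknown charsets is also an assumption, chosen because any byte sequence decodes in it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode raw message bytes into Perl characters.
# $charset is assumed to come from the message's Content-Type header.
# Missing or unknown charsets fall back to ISO-8859-1 (an assumption).
sub to_characters {
    my ( $bytes, $charset ) = @_;
    my $name = ( defined $charset && Encode::find_encoding($charset) )
        ? $charset
        : 'ISO-8859-1';
    return Encode::decode( $name, $bytes );
}

# Everything printed from here on is encoded as UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';

# The same word arriving in two different charsets.
print to_characters( "caf\xE9\n",     'ISO-8859-1' );
print to_characters( "caf\xC3\xA9\n", 'UTF-8' );
```

There is no need to special-case messages that are already UTF-8: decoding them as UTF-8 and printing through the UTF-8 layer gives back the same bytes.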
Leon Timmermans