ansaurus

Question

How can I convert an input file to UTF-8 encoding in Perl?

Answer 1

+4 A:

I think I misunderstood your question. I think what you want to do is read a file in a non-UTF-8 encoding, then work with the data as UTF-8 in your program. That's much easier. Once you read the data with the right encoding, Perl represents it internally as UTF-8, so just do what you have to do.

When you write it back out, use whatever encoding you want to save it as. However, you don't have to put it back in a file to use it.
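As a rough sketch of that read-then-write flow (the scratch directory, file names, and two-character sample text are made up for illustration; gb2312 is the encoding mentioned in the comments), you decode on the way in and pick an encoding on the way out:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch files; substitute your own paths.
my $dir      = tempdir(CLEANUP => 1);
my $in_path  = "$dir/in.txt";
my $out_path = "$dir/out.txt";

# Sample input: two GB2312-encoded characters (bytes D6 D0 CE C4).
open my $seed, '>:raw', $in_path or die "Cannot write $in_path: $!";
print {$seed} "\xD6\xD0\xCE\xC4\n";
close $seed or die "Cannot close $in_path: $!";

# The :encoding layer decodes on read, so each line is a character string.
open my $in, '<:encoding(gb2312)', $in_path or die "Cannot read $in_path: $!";
chomp(my @lines = <$in>);
close $in;

# Work with the data as characters; length() counts characters here.
my $char_count = length $lines[0];

# Choose the output encoding when writing; UTF-8 in this sketch.
open my $out, '>:encoding(UTF-8)', $out_path or die "Cannot write $out_path: $!";
print {$out} "$_\n" for @lines;
close $out or die "Cannot close $out_path: $!";
```

The second open is only needed if you actually want a converted file on disk; the decoded @lines are usable as soon as they are read.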

old answer

The Perl I/O layers only read the data assuming it's already properly encoded. It's not going to convert encoding for you. By telling open to use utf8, you're telling it that it already is utf8.

You have to use the Encode module just as you've shown (unless you want to write your own I/O layer). You can convert bytes to UTF-8, or if you know the encoding, you can convert from one encoding to another. Since it looks like you already know the encoding, you might want the from_to() function.
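A minimal sketch of both routes with Encode, assuming the source bytes are GB2312 (the sample bytes and variable names are made up for illustration): decode() followed by encode() goes through a character string, while from_to() rewrites a byte buffer in place.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Sample data: two characters encoded as GB2312 (bytes D6 D0 CE C4).
my $gb_bytes = "\xD6\xD0\xCE\xC4";
my $buffer   = $gb_bytes;    # separate copy for the in-place conversion

# Route 1: bytes -> characters -> UTF-8 bytes.
my $chars      = decode('gb2312', $gb_bytes);
my $utf8_bytes = encode('UTF-8', $chars);

# Route 2: from_to() converts $buffer in place and returns the new
# length in bytes, or undef if the conversion fails.
my $new_length = from_to($buffer, 'gb2312', 'UTF-8');
defined $new_length or die "Conversion from gb2312 failed";

printf "%d characters, %d UTF-8 bytes\n", length($chars), $new_length;
```

Note that from_to() leaves you with encoded bytes, not a character string, so it suits the case where the data is only being passed along rather than processed as text.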

If you're just starting out with Perl and Unicode, go through Juerd's Perl Unicode Advice before you do anything.

brian d foy 2009-10-23 09:28:46

@brian, thanks for the guidance. I thought there would be some simple way to convert the input file to UTF-8 directly while opening it, but now it looks like things are not that simple. I'm thinking I can open the input file, write its content to another file encoded as UTF-8, and then read that second file. The code looks like: open my $filter, '<:encoding(gb2312)', 'c:/outfile.txt'; open my $filter_new, '+>:utf8', 'c:/f2.txt'; print $filter_new $_ while <$filter>; while (<$filter_new>) {...} But this is too much work.

Mike 2009-10-23 10:22:13

Your idea of too much work is skewed. Try doing it by hand and then come back and tell us how easy Perl makes it for you. Kids today don't know how good they have it. :)

brian d foy 2009-10-23 10:41:01

Mike's instincts are correct; you can stack layers to directly do the conversion he wants :)

ysth 2009-10-23 11:17:54

You can't stack layers, really. You still have to read it, and you still have to write it, if you want the file to end up in a different encoding.

brian d foy 2009-10-23 14:31:06

I'm pretty sure (it's a little clearer in the original part of the question, I think) that all he wants is to convert the data from the file, not the file itself. But yes, to do the latter, just reading isn't sufficient.

ysth 2009-10-23 16:12:49

@ysth, I guess I must have phrased my question wrong. Actually what I wanted was to convert the input file to UTF-8 and then do a readline operation. I already knew how to convert the data of the input file while doing a readline operation using the while loop. But thanks.

Mike 2009-10-24 01:59:55

@brian, well, yes, one way of looking at my question is: "are there some *better* ways to read a file in a non-UTF-8 encoding and then play with the data as UTF-8?" By "better ways", I mean not the line-by-line conversion method which I already learnt.

Mike 2009-10-24 13:20:22

Answer 2

+4 A:

The :encoding layer will return UTF-8, suitable for perl's use. That is, perl will recognize each character as a character, even if they are multiple bytes. Depending on what you are going to do next with the data, this may be adequate.

But if you are doing something with the data where perl will try to downgrade it from utf8, you either need to tell perl not to (for instance, doing a binmode(STDOUT, ":utf8") to tell perl that output to stdout should be utf8), or you need to have perl treat your utf8 as binary data (interpreting each byte separately, and knowing nothing about the utf8 characters.)

To do that, all you need is to apply an additional layer to your open:

open my $foo, "<:encoding(gb2312):bytes", ...;

Note that the output of the following will be the same:

perl -we'open my $foo, "<:encoding(gb2312):bytes", "foo"; $bar = <$foo>; print $bar'
perl -CO -we'open my $foo, "<:encoding(gb2312)", "foo"; $bar = <$foo>; print $bar'

but in the second case, perl knows that the data read is utf8 (so length($bar) reports the number of characters) and has to be told explicitly (by -CO) that STDOUT will accept utf8, while in the first, perl makes no assumptions about the data (so length($bar) reports the number of bytes) and just prints it out as is.
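To see the length() difference without the one-liners, here is a small sketch that reads the same file both ways (the scratch file and its two-character GB2312 content are made up for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/foo";    # hypothetical sample file

# Sample input: two GB2312-encoded characters (bytes D6 D0 CE C4).
open my $seed, '>:raw', $path or die "Cannot write $path: $!";
print {$seed} "\xD6\xD0\xCE\xC4\n";
close $seed or die "Cannot close $path: $!";

# Decoded read: perl treats the line as a character string.
open my $as_chars, '<:encoding(gb2312)', $path or die "Cannot read $path: $!";
chomp(my $chars = <$as_chars>);
close $as_chars;

# Same layer plus :bytes: perl sees only the UTF-8 byte sequence.
open my $as_bytes, '<:encoding(gb2312):bytes', $path or die "Cannot read $path: $!";
chomp(my $bytes = <$as_bytes>);
close $as_bytes;

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
```

Which form is appropriate depends on what happens next: regexes, length(), and substr() on text want the character string, while passing data through untouched is fine with the byte form.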

ysth 2009-10-23 11:16:07
