How do I convert a file to UTF-8 using Perl? And how do I check whether the converted file is actually in UTF-8?

A: 

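To do the conversion, take a look at Text::Iconv:

  use Text::Iconv;
  # "fromcode" and "tocode" are placeholders for encoding names, e.g. "windows-1252" and "UTF-8"
  my $converter = Text::Iconv->new("fromcode", "tocode");
  my $converted = $converter->convert("Text to convert");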
S.Mark
Thanks Mark... but I couldn't figure out exactly how to use these lines. I have a file in ANSI format which has to be saved in UTF-8 format. When I use the utf8 command, the resulting file is still in ANSI format (I checked it using Notepad). Any help?
xyz
@xyz, basically you need to read the contents of your files, convert them with Iconv or the built-in Encode module as daxim mentioned, and save them back. But if your files contain only ASCII characters (<= 0x7F), the output files will look the same, because ASCII is valid UTF-8 too.
S.Mark
@xyz: ANSI is not a format, but a name used by Microsoft for a collection of encodings. You most likely have the encoding "windows-1252", so try converting from "windows-1252" to "UTF-8".
Christoffer Hammarström
How do you know it's not UTF-8? Does it use any characters outside of ISO-646?
bmargulies
When I use Text::Iconv, I get this error message: 'Can't locate Text/Iconv.pm in @INC (@INC contains: C:/Perl/lib C:/Perl/site/lib.)'.. BEGIN failed--compilation aborted at uni1.pl line 1. Do I have to install anything else?
xyz
@xyz, could you try the built-in Encode module as @daxim mentioned? I am not sure how to install Text::Iconv on Windows.
S.Mark
A: 

Installing bindings to the iconv library such as Text::Iconv is not necessary because Perl already comes with a character encoding library of its own: Encode. Part of it is piconv, an iconv(1) workalike. Use it to batch convert files to UTF-8. ANSI is just a stupid name for the group of windows-125? encodings. You most likely have files encoded in windows-1252. Example:

piconv -f windows-1252 -t UTF-8 < input-file > output-file
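
The same conversion can also be done from inside a Perl script with PerlIO encoding layers, which use Encode under the hood. This is only a minimal sketch: the helper name convert_to_utf8 is made up for illustration, and the default source encoding of windows-1252 is an assumption that has to match your actual files.

```perl
use strict;
use warnings;

# Hypothetical helper: re-encode the file $in into $out as UTF-8.
# Assumption: $from names the real source encoding (windows-1252 by default).
sub convert_to_utf8 {
    my ($in, $out, $from) = @_;
    $from //= 'windows-1252';

    # The input layer decodes bytes to characters, the output layer encodes to UTF-8.
    open my $src, "<:encoding($from)", $in  or die "Cannot read $in: $!";
    open my $dst, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!";

    while (my $line = <$src>) {
        print {$dst} $line;
    }

    close $src;
    close $dst or die "Cannot close $out: $!";
    return;
}
```

It would be called as convert_to_utf8('input-file', 'output-file'). Writing to a separate output file rather than overwriting the input keeps the original around in case the assumed source encoding turns out to be wrong.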

If metadata are missing, heuristics have to be used to determine the encoding of a file's content. I have been recommending Encode::Detect.

daxim
+1  A: 

Hey,

That depends on where the string comes from. If it's a file that has been uploaded, I think this code will help. But if it's text from the web, or text that has already been converted to UTF-8 (because you're working in UTF-8), then you'll have a hard time figuring out the original encoding.

I usually use:

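use Encode::Guess;   # the module is named Encode::Guess

# guess_encoding returns an encoding object on success, or an error string on failure
my $enc = guess_encoding($string);
die "Cannot guess encoding: $enc" unless ref $enc;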

and then, with the code above, I do:

use Text::Iconv;
# guess_encoding returns an object, so pass its name to Text::Iconv
my $converter = Text::Iconv->new($enc->name, "utf-8");
my $converted = $converter->convert($string);
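
If Text::Iconv is not installed, the guess-then-convert idea can be sketched with core modules only. The helper name guess_and_convert is hypothetical, and the list of suspect encodings is an assumption the caller has to supply: by default Encode::Guess only considers ASCII, UTF-8 and BOM-marked UTF-16/32.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Hypothetical helper: guess the encoding of the raw bytes in $octets,
# optionally trying the extra encodings in @suspects, and return UTF-8 bytes.
sub guess_and_convert {
    my ($octets, @suspects) = @_;

    # On success this is an encoding object; on failure it is an error string.
    my $enc = guess_encoding($octets, @suspects);
    die "Cannot guess encoding: $enc\n" unless ref $enc;

    # Decode with the guessed encoding, then encode the characters as UTF-8.
    return encode('UTF-8', $enc->decode($octets));
}
```

A call might look like guess_and_convert($bytes, 'cp1252'). Be aware that guessing among single-byte encodings is inherently unreliable: many byte sequences are valid in several of them, in which case guess_encoding reports the result as ambiguous instead of returning an object.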

FYI, UTF-8 character tables can be found here:

http://www.fileformat.info/info/charset/UTF-8/list.htm?start=1024

http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024&number=1024&utf8=string-literal&unicodeinhtml=dec

Ricky
A: 

Using the Encode module, you can easily convert a string to a different encoding, e.g.:

use Encode;   # exports encode() and decode()
my $str = "A string in Perl internal format ....";
my $octets = encode("utf-8", $str, Encode::FB_CROAK);

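To check for UTF-8 you can use the function

Encode::is_utf8($str, Encode::FB_CROAK)

Note that this tests whether Perl's internal UTF8 flag is set on $str (and, with a true second argument, whether the string is well-formed). It does not tell you whether a file on disk is UTF-8 encoded.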
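
For checking the converted file itself, one option is to read its raw bytes and attempt a strict decode. The following is a minimal sketch; the helper names is_valid_utf8 and file_is_utf8 are made up for illustration. A successful decode only shows that the bytes are valid UTF-8, and a pure ASCII file passes as well.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the raw bytes in $octets form valid UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    # decode() may modify its argument when a CHECK value is given, so use a copy.
    my $copy = $octets;
    # FB_CROAK makes decode() die on malformed input; eval turns that into false.
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: slurp a file as raw bytes and validate them.
sub file_is_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot read $path: $!";
    my $octets = do { local $/; <$fh> } // '';
    close $fh;
    return is_valid_utf8($octets);
}
```

For example, file_is_utf8('output-file') should return 1 for a file produced by a correct conversion, and 0 for a windows-1252 file containing accented characters.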
ppant