Can you point me to a tool to convert Japanese characters to Unicode?
CPAN has Unicode::Japanese, which should be a helpful starting point. For more information, you can also look at the article on Character Encodings in Perl and the Perl documentation on Unicode.
See http://p3rl.org/UNI.
use Encode qw(decode encode);
my $bytes_in_sjis_encoding = "\x88\xea\x93\xf1\x8e\x4f";
my $unicode_string = decode('Shift_JIS', $bytes_in_sjis_encoding); # returns 一二三
my $bytes_in_utf8_encoding = encode('UTF-8', $unicode_string); # returns "\xe4\xb8\x80\xe4\xba\x8c\xe4\xb8\x89"
For batch conversion from the command line, use piconv:
piconv -f Shift_JIS -t UTF-8 < infile > outfile
First, you need to find out the encoding of the source text if you don't know it already.
The most common encodings for Japanese are:
euc-jp (often used on Unixes and on some web pages, with greater Kanji coverage than shift-jis)
shift-jis (Microsoft also added some extensions to shift-jis; the result is called cp932 and is often used in non-Unicode Windows programs)
iso-2022-jp is a distant third
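If the source encoding is unknown, the Encode::Guess module that ships with Encode can try a list of candidate encodings. The following is only a minimal sketch: the sample bytes and the list of suspects are assumptions chosen for illustration, and guessing is heuristic, so short or ambiguous input may not be identified.

```perl
use strict;
use warnings;
use Encode::Guess;

# Sample input: the Shift_JIS bytes for 一二三 (assumed here for illustration).
my $bytes = "\x88\xea\x93\xf1\x8e\x4f";

# Ask Encode::Guess to choose among the usual Japanese suspects.
my $guess = guess_encoding($bytes, qw(euc-jp shiftjis 7bit-jis));

# On failure or ambiguity, guess_encoding returns an error string, not an object.
ref $guess or die "Cannot guess encoding: $guess";

# The returned object is an encoding object, so it can decode the bytes directly.
my $text = $guess->decode($bytes);
printf "guessed %s, decoded %d characters\n", $guess->name, length($text);
```

When the guess is ambiguous, it is usually safer to ask whoever produced the data than to pick one of the candidates arbitrarily.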
A common encoding conversion library for many languages is iconv (see http://en.wikipedia.org/wiki/Iconv and http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm), which supports many other encodings as well as Japanese.
This question seems a bit vague to me, and I'm not sure what you're asking. Usually you would use something like this:
open my $file, "<:encoding(cp932)", "JapaneseFile.txt" or die "Cannot open JapaneseFile.txt: $!";
to open a file with Japanese characters. Then Perl will automatically convert it into its internal Unicode format.
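To make the effect of the decoding layer concrete, here is a small self-contained sketch. It assumes a temporary sample file (created with File::Temp purely for illustration) that holds the cp932 bytes for 一二三, then reads the file back through the layer and inspects the decoded string.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small sample file holding the cp932 bytes for 一二三 (illustrative data).
my ($tmp_fh, $path) = tempfile();
binmode $tmp_fh;
print {$tmp_fh} "\x88\xea\x93\xf1\x8e\x4f\n";
close $tmp_fh or die "Cannot close $path: $!";

# Read it back through a decoding layer: the line arrives as Unicode characters.
open my $in, '<:encoding(cp932)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Once the text is decoded, length() counts characters rather than bytes.
printf "%d characters, first code point U+%04X\n", length($line), ord($line);
# prints "3 characters, first code point U+4E00"

unlink $path;
```

The six bytes on disk become three characters in memory, which is the conversion into Perl's internal Unicode format described above.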