ansaurus

Question

How can I guess the encoding of a string in Perl?

Answer 1

A:

You might also want to look at the Perl built-in utf8::downgrade (documented in the utf8 module) or the Encode module.

Paul Tomblin 2009-12-28 17:58:51

Downvote because of bad advice; as per documentation the utf8 module's innards are hands-off. http://p3rl.org/UNI »Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.« Only the `Encode` module must be used because it handles arbitrary encodings.

daxim 2009-12-29 12:53:43

Answer 2

+1 A:

"Unicode-processing issues in Perl and how to cope with it" gives a pretty good treatment of how to handle Unicode in Perl:

Mark Carey 2009-12-28 18:01:11

Answer does not deal with the first part of the question.

daxim 2009-12-29 12:45:47

Answer 3

+2 A:

The Encode module gives you a way to try this. Decode the raw octets with the encoding you think they are in. If the octets are not valid in that encoding, decode dies and you catch that with an eval. Otherwise, you get back a decoded character string. For example:

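 use Encode;

 # 'UTF-8' is the strict encoding; 'utf8' is Perl's laxer variant.
 # The lone \xc5 is not a complete UTF-8 sequence, so this decode dies.
 my $a_with_ring =
   eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
     or die "Could not decode string: $@";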

The drawback is that the same octet sequence can be valid in several encodings, so a successful decode does not prove that your guess was right.
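A minimal sketch of that trial-and-error approach, showing the ambiguity as well. The helper name plausible_encodings and the candidate list are assumptions made for illustration; they are not part of Encode.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: return the names of all candidate encodings
# under which $octets decode without error.
sub plausible_encodings {
    my ($octets, @candidates) = @_;
    my @ok;
    for my $name (@candidates) {
        # LEAVE_SRC keeps decode() from modifying $octets in place.
        my $text = eval {
            decode($name, $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        push @ok, $name if defined $text;
    }
    return @ok;
}

# Assumed candidate list; use the encodings you actually expect.
my @candidates = ('UTF-8', 'iso-8859-1');

# "\xc3\xa5" is U+00E5 in UTF-8, but it is also two valid Latin-1 characters.
my @both = plausible_encodings("\xc3\xa5", @candidates);

# A lone "\xc5" is not valid UTF-8, so only Latin-1 survives.
my @latin1_only = plausible_encodings("k\xc5", @candidates);

print "@both\n";          # UTF-8 iso-8859-1
print "@latin1_only\n";   # iso-8859-1
```

Because every octet sequence is valid Latin-1, a single-byte candidate like that one never rules anything out; it is only useful as a last resort.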

I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)

You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.

brian d foy 2009-12-29 08:34:45

Answer 4

+3 A:

To find out which encoding unknown data uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that.

use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
              "\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
              "\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
              "\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030

use Encode;
my $string = decode($encoding_name, $unknown);
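Encode::Guess, the other module named above, ships with Perl and works from a list of suspect encodings rather than statistics. The following is a small sketch of how it might be used; the sample octets and the extra iso-8859-1 suspect are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

my $octets = "caf\xc3\xa9";    # "café" encoded as UTF-8

# With only the default suspects, exactly one encoding fits, so an
# encoding object comes back.
my $decoder = guess_encoding($octets);
die "Cannot guess: $decoder" unless ref $decoder;
my $text = $decoder->decode($octets);

# Adding Latin-1 as a suspect makes the guess ambiguous, because every
# octet sequence is valid Latin-1. In that case a plain string (an
# error message) is returned instead of an object.
my $ambiguous = guess_encoding($octets, 'iso-8859-1');
print ref($ambiguous) ? "unexpected object\n" : "ambiguous: $ambiguous\n";
```

Always check ref() on the return value before calling decode on it.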

I find encode('ascii', ...) a lame solution for getting rid of non-ASCII characters. Every such character is replaced with a question mark; this is too lossy to be useful.

# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.

If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.

use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing  Perl workshop.

However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either FB_PERLQQ or FB_XMLCREF. Unlike the bare PERLQQ and XMLCREF flags, these include LEAVE_SRC, so encode does not modify $string in place.

use utf8;
use Encode qw(encode FB_PERLQQ FB_XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string, FB_PERLQQ);  # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, FB_XMLCREF); # This year I went to &#x5317;&#x4eac; Perl workshop.
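Reversing the \x{...} form later could look roughly like this. It is a sketch, not something Encode provides: the substitution is hand-written, and it assumes the original text contained no literal \x{...} sequences of its own, since encode does not escape backslashes. FB_PERLQQ is Encode's constant that combines PERLQQ with LEAVE_SRC.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string  = "This year I went to \x{5317}\x{4eac} Perl workshop.";
my $escaped = encode('ascii', $string, Encode::FB_PERLQQ);

# Turn each \x{HEX} escape back into the character it stands for.
(my $restored = $escaped) =~ s/\\x\{([0-9A-Fa-f]+)\}/chr(hex($1))/ge;

print $restored eq $string ? "round trip ok\n" : "round trip failed\n";
```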

daxim 2009-12-29 12:43:52

The input I receive always uses the Latin character-set. The normalize function I use would then convert "Café" to "Cafe". This does not work in all cases though. Given this, would you still prefer to use the PERLQQ or XMLCREF method?

Maulin 2009-12-29 14:52:11

It does not matter what I prefer – it's your code and responsibility after all, and only you know all the circumstances. If indeed you are happy with Café → Cafe, then replace your custom function with `Text::Unidecode`. That does work in all cases.

daxim 2009-12-29 18:07:49

Thanks. I think I will try that.

Maulin 2009-12-29 18:34:59

Answer 5

A:

Dear Friend,

You can also use the following code to encrypt and decrypt a string:

sub ENCRYPT_DECRYPT {
    my $Str_Message     = $_[0];
    my $Len_Str_Message = length($Str_Message);

    my $Str_Encrypted_Message = "";
    for (my $Position = 0; $Position < $Len_Str_Message; $Position++) {
        # The key depends only on string length and position, so applying
        # the same XOR a second time restores the original byte.
        my $Key_To_Use = ($Len_Str_Message + $Position) + 1;
        $Key_To_Use = (255 + $Key_To_Use) % 255;
        my $Byte_To_Be_Encrypted      = substr($Str_Message, $Position, 1);
        my $Ascii_Num_Byte_To_Encrypt = ord($Byte_To_Be_Encrypted);
        my $Xored_Byte     = $Ascii_Num_Byte_To_Encrypt ^ $Key_To_Use;
        my $Encrypted_Byte = chr($Xored_Byte);
        $Str_Encrypted_Message .= $Encrypted_Byte;
    }
    return $Str_Encrypted_Message;
}

my $var = ENCRYPT_DECRYPT("hai");
print ENCRYPT_DECRYPT($var);

muruga 2010-03-02 10:59:53
