I've run into a really strange UTF-8 problem with Net::Cassandra::Easy (which is built upon Net::Cassandra): UTF-8 strings written to Cassandra are garbled upon retrieval.

The following code shows the problem:

use strict;
use utf8;
use warnings;
use Net::Cassandra::Easy;

binmode(STDOUT, ":utf8");

my $key = "some_key";
my $column = "some_column";
my $set_value = "\x{2603}"; # U+2603 is ☃ (SNOWMAN)
my $cassandra = Net::Cassandra::Easy->new(keyspace => "Keyspace1", server => "localhost");
$cassandra->connect();
$cassandra->mutate([$key], family => "Standard1", insertions => { $column => $set_value });
my $result = $cassandra->get([$key], family => "Standard1", standard => 1);
my $get_value = $result->{$key}->{"Standard1"}->{$column};
if ($set_value eq $get_value) {
    # this is the path I want.
    print "OK: $set_value == $get_value\n";
} else {
    # this is the path I get.
    print "ERR: $set_value != $get_value\n";
}

When running the code above, $set_value eq $get_value evaluates to false. What am I doing wrong?

+3  A: 

Add use Encode; to the beginning of your script, and pass the values you read back from Cassandra through Encode::decode_utf8. For example:

my $get_value = $result->{$key}->{"Standard1"}->{$column};
$get_value = Encode::decode_utf8($get_value);

Outputs:

OK: ☃ == ☃

When you set $set_value to "\x{2603}", Perl sees a code point above 255, stores the value as a character string, and sets the string's internal UTF-8 flag for you. To confirm this, print the return value of Encode::is_utf8($set_value).
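To make the difference between a character string and its UTF-8 bytes concrete, here is a small sketch that needs no Cassandra at all. The variable names are only illustrative; Encode::is_utf8 and Encode::encode_utf8 are the functions mentioned in this answer.

```perl
use strict;
use warnings;
use Encode ();

my $chars = "\x{2603}";                    # one character: U+2603 SNOWMAN
my $bytes = Encode::encode_utf8($chars);   # the same text as three UTF-8 bytes

# is_utf8 reports Perl's internal flag, not whether the data "looks like" UTF-8
my $chars_flag = Encode::is_utf8($chars) ? "on" : "off";
my $bytes_flag = Encode::is_utf8($bytes) ? "on" : "off";

printf "chars: length %d, UTF-8 flag %s\n", length($chars), $chars_flag;
printf "bytes: length %d, UTF-8 flag %s\n", length($bytes), $bytes_flag;
```

This prints length 1 with the flag on for the character string, and length 3 with the flag off for the encoded bytes.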

Unfortunately, once this string goes into Cassandra and comes back out, the encoding information is lost. Cassandra appears to be encoding-agnostic: what you get back is a plain sequence of bytes. Calling Encode::decode_utf8 tells Perl that the value is a UTF-8 byte sequence and that it should be converted into Perl's internal representation for Unicode characters. As jrockway points out, you should also call Encode::encode_utf8 on any strings before they are sent to Cassandra, so that what goes over the wire is an explicit UTF-8 byte sequence. In most cases Perl already knows the strings are UTF-8, for example if you read them from a file opened with the :utf8 encoding layer.
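The round trip can be simulated without a server. In this sketch the "store" is just a variable holding bytes, which is an assumption standing in for whatever Cassandra actually does; it only illustrates why the comparison fails until the value is decoded.

```perl
use strict;
use warnings;
use Encode ();

my $set_value = "\x{2603}";                       # character string

# Stand-in for the trip through Cassandra: only bytes survive.
my $raw_value = Encode::encode_utf8($set_value);  # bytes E2 98 83

# Comparing characters with undecoded bytes fails (1 character vs 3 bytes).
my $before = ($set_value eq $raw_value) ? "equal" : "different";

# Decoding turns the bytes back into the original character string.
my $get_value = Encode::decode_utf8($raw_value);
my $after = ($set_value eq $get_value) ? "equal" : "different";

print "before decode: $before\n";
print "after decode:  $after\n";
```

The first comparison reports "different" and the second reports "equal", which matches the ERR and OK paths in the question.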

If you use UTF-8 often, you might want to write a wrapper over Net::Cassandra::Easy to do this automatically.
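A wrapper along those lines might look like the sketch below. The package name UTF8Columns and its set_column/get_column methods are made up for illustration; the only calls assumed on the wrapped object are mutate and get, used exactly as in the question, and the result is assumed to have the same key/family/column layout the question reads from.

```perl
package UTF8Columns;   # hypothetical name, not part of Net::Cassandra::Easy
use strict;
use warnings;
use Encode ();

# Wrap an already connected client object (e.g. a Net::Cassandra::Easy instance).
sub new {
    my ($class, $client) = @_;
    my $self = { client => $client };
    return bless $self, $class;
}

# Encode the character string to UTF-8 bytes before writing.
sub set_column {
    my ($self, $key, $family, $column, $value) = @_;
    return $self->{client}->mutate(
        [$key],
        family     => $family,
        insertions => { $column => Encode::encode_utf8($value) },
    );
}

# Decode the returned bytes back into a character string after reading.
sub get_column {
    my ($self, $key, $family, $column) = @_;
    my $result = $self->{client}->get([$key], family => $family, standard => 1);
    my $raw    = $result->{$key}->{$family}->{$column};
    return defined $raw ? Encode::decode_utf8($raw) : undef;
}

package main;
```

With the $cassandra object from the question, usage would be roughly UTF8Columns->new($cassandra)->set_column($key, "Standard1", $column, $set_value), followed by get_column($key, "Standard1", $column) to read the value back as characters.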

Finally, you don't need use utf8; unless your Perl source code itself (string literals, variable names, comments etc.) contains UTF-8 characters. Perl can handle UTF-8 strings at run time whether you specify use utf8; or not.

rjh
Thanks for your answer, but I'm afraid that does not solve the problem, since \x{2603} is "☃" and not "â". The output I'm expecting is hence "OK: ☃ == ☃" and not "OK: â == â".
knorv
Oops, using PuTTY and forgot to set UTF-8 character set. I'll get back to you.
rjh
With UTF-8 the code above shows "OK: ☃ == ☃". Answer updated.
rjh
Ah, excellent! That did the trick! Thanks a lot for your answer!
knorv