I'm running Perl 5.10.0 and Postgres 8.4.3, and inserting strings into a database, which sits behind DBIx::Class.

These strings should be in UTF-8, and therefore my database is running in UTF-8. Unfortunately some of these strings are bad, containing malformed UTF-8, so when I run it I get an exception:

DBI Exception: DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding "UTF8": 0xb5

I thought that I could simply ignore the invalid ones and worry about the malformed UTF-8 later, so I wrote this code, which should flag the bad titles and replace them with a placeholder.

if(not utf8::valid($title)){
   $title="Invalid UTF-8";
}
$data->title($title);
$data->update();

However, Perl seems to think that the strings are valid, and the update still throws the exception.

How can I get Perl to detect the bad UTF-8?

+1  A: 

As the documentation for utf8::valid points out, it returns true if the string is marked as UTF-8 and it's valid UTF-8, or if the string isn't UTF-8 at all. Although it's impossible to tell without seeing the code in context and knowing what the data is, most likely what you want isn't the "valid utf8" check at all; probably you just need to do

$data->title( Encode::encode("UTF-8", $title) )
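To illustrate why the check in the question never fires, here is a minimal sketch, assuming the title arrives as raw bytes. The sample value is made up for illustration; only utf8::valid and Encode::is_utf8 are real calls.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: raw bytes ending in the lone byte 0xB5 from the
# error message. This is not well-formed UTF-8.
my $title = "caf\xb5";

# The string is held as plain bytes, so the UTF-8 flag is off.
my $flagged = Encode::is_utf8($title) ? 1 : 0;

# utf8::valid only checks that the internal state is consistent.
# A byte string counts as consistent, so this is true.
my $consistent = utf8::valid($title) ? 1 : 0;

print "UTF-8 flag: $flagged, utf8::valid: $consistent\n";
```

In other words, utf8::valid answers "is Perl's internal representation sane?", not "do these bytes form valid UTF-8?".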
hobbs
+3  A: 

First, ensure that the strings actually are recognised as UTF-8 by Perl:

use Encode;
Encode::is_utf8($string);

If not, you'll need to either pass them through Encode::encode("UTF-8", $string) or specify UTF-8 encoding when opening filehandles, e.g. open my $fh, '<:encoding(UTF-8)', $filename (see perlwiki for more information). Note the spelling: :encoding(UTF-8) is the strict encoding, while the lowercase :encoding(utf8) is Perl's laxer variant. With the strict layer, Perl checks the UTF-8 as it decodes the input.

For testing validity by hand, Test::utf8 contains a number of useful UTF-8 testing methods. Unfortunately they are designed for Perl tests, but you could rip out the code. is_valid_string and is_sane_utf8 are worth looking at.
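A hand-rolled check along these lines might look like the sketch below. It assumes the input is a byte string and uses a strict decode from Encode in place of the Test::utf8 internals; the helper name decode_or_flag and the sample values are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the decoded text, or a placeholder when
# the bytes are not well-formed UTF-8.
sub decode_or_flag {
    my ($bytes) = @_;

    # FB_CROAK makes decode die on malformed input; LEAVE_SRC stops it
    # from modifying $bytes in place.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text ? $text : 'Invalid UTF-8';
}

my $good = decode_or_flag("na\xc3\xafve");   # well-formed UTF-8 bytes
my $bad  = decode_or_flag("bad\xb5");        # stray continuation byte

print "bad title became: $bad\n";
```

The result for well-formed input is a character string, so it can be handed on to the database layer, which is then responsible for encoding it again.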

rjh
A: 

How are you getting your strings? Are you sure that Perl thinks that they are UTF-8 already? If they aren't decoded yet (that is, octets interpreted as some encoding), you need to do that yourself:

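    # FB_CROAK and LEAVE_SRC are not exported by default, so ask for them.
    use Encode qw(decode FB_CROAK LEAVE_SRC);

    # 'UTF-8' is the strict encoding; LEAVE_SRC keeps $byte_string intact.
    my $ustring =
      eval { decode( 'UTF-8', $byte_string, FB_CROAK | LEAVE_SRC ) };

    # Test definedness so that an empty or "0" string is not an error.
    die "Could not decode string: $@" unless defined $ustring;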

Better yet, if you know that your source of strings is already UTF-8, you need to read that source as UTF-8. Look at the code you have that gets the strings to see if you are doing that properly.
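As a rough sketch of reading the source as UTF-8, the example below decodes while reading through an :encoding(UTF-8) layer. It uses an in-memory filehandle with made-up sample bytes so that it is self-contained; with a real file you would pass the filename instead of the scalar reference.

```perl
use strict;
use warnings;

# Hypothetical sample data: the UTF-8 bytes for "naïve" plus a newline.
my $bytes = "na\xc3\xafve\n";

# The layer decodes the bytes into characters as they are read.
open my $fh, '<:encoding(UTF-8)', \$bytes
  or die "Cannot open in-memory handle: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# Six bytes went in, five characters came out.
my $chars = length $line;
print "characters read: $chars\n";
```

Decoding at the point of input means the rest of the program only ever sees character strings.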

brian d foy