I'm using Text::Capitalize to title-case some UTF-8 encoded names from a web page (downloaded using WWW::Mechanize), but I'm not getting the results I'm expecting.

For example, the name on the web page is "KAJELIJELI, Juvénal", but capitalize_title returns "Kajelijeli, JuvéNal" (notice the uppercase N).

I've tried use utf8; and changing the \w's in the $word_rule regex to [:word:], but neither changed the output of capitalize_title.

Does anyone know how I can make it work?

TIA

+3  A: 

You must have forgotten to set the binary mode for your input to utf8, because the module works fine.

Example:

#!perl
use warnings;
use strict;
use utf8;                            # this source file is saved as UTF-8
use Text::Capitalize;
my $test = "KAJELIJELI, Juvénal";
binmode STDOUT, ":encoding(UTF-8)";  # encode output as UTF-8
print capitalize_title($test), "\n";

prints

Kajelijeli, Juvénal
Kinopiko
+3  A: 

Just to note: use utf8 merely tells Perl that your source code is UTF-8 encoded, so it may contain Unicode (wide) characters. It doesn't do anything else. With any data you fetch from elsewhere, you have to be sure it's UTF-8 encoded and decode it as such, and you have to tell any output destinations that they should expect UTF-8.
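To make that concrete, here is a minimal sketch of why undecoded input produces the stray uppercase N. It does not use Text::Capitalize itself: naive_title_case is a hypothetical stand-in that assumes word matching based on a plain \w+, and the octet string is typed in by hand to mimic what an undecoded download might contain.

```perl
#!perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, for illustration only: a naive title-caser built on
# \w+. It is NOT the module's actual implementation.
sub naive_title_case {
    my ($text) = @_;
    $text =~ s/(\w+)/\u\L$1/g;   # lowercase each word, then uppercase its first character
    return $text;
}

# The name as raw UTF-8 octets ("é" is the two octets C3 A9), i.e. not yet decoded.
my $octets = "KAJELIJELI, Juv\xC3\xA9nal";

# Undecoded: the two octets of "é" are not word characters in a byte string,
# so "nal" is treated as the start of a new word.
my $from_octets = naive_title_case($octets);

# Decoded first: "é" is a single word character, so "Juvénal" is one word.
my $chars      = decode('UTF-8', $octets);
my $from_chars = naive_title_case($chars);

# Tell the output destination to expect UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print decode('UTF-8', $from_octets), "\n";   # Kajelijeli, JuvéNal
print $from_chars, "\n";                     # Kajelijeli, Juvénal
```

If the fetched text is decoded before it reaches the title-casing step, the second result is what you should see.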

When something goes wrong with your UTF-8 strings, there are many places where it could have gone wrong, so start checking front-to-back to ensure it's UTF-8 throughout the whole process. That might mean figuring out how to translate Latin-1 that you might get from a web page into UTF-8. The Encode and Encode::FixLatin modules are useful. Juerd's Perl Unicode Advice is very helpful too.
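As a rough sketch of the Latin-1 case, using only the core Encode module: try a strict UTF-8 decode first and fall back to Latin-1 if the octets are not valid UTF-8. The helper name decode_guess is made up for this example, and the fallback is only a heuristic, since any octet sequence at all is valid Latin-1.

```perl
#!perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: strict UTF-8 decode, falling back to Latin-1.
sub decode_guess {
    my ($octets) = @_;
    my $copy = $octets;   # decode() with a CHECK argument may modify its input
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return defined $chars ? $chars : decode('ISO-8859-1', $octets);
}

my $utf8_octets   = "Juv\xC3\xA9nal";   # "Juvénal" encoded as UTF-8
my $latin1_octets = "Juv\xE9nal";       # "Juvénal" encoded as Latin-1

my $from_utf8   = decode_guess($utf8_octets);
my $from_latin1 = decode_guess($latin1_octets);

# Both inputs now hold the same character string, so both encode to the
# same UTF-8 octets on the way out.
print encode('UTF-8', $from_utf8), "\n";
print encode('UTF-8', $from_latin1), "\n";
```

It is better to trust the charset the server declares when there is one; guessing like this is a last resort.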

My latest book, Effective Perl Programming, 2nd Edition, devotes an entire chapter to these issues. It wasn't an especially fun chapter to write because of all these problems, but once you get all the pieces straight it makes a lot more sense. However, its coming out in March isn't going to help you today. :(

brian d foy