I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.

Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something that would only occur on rare occasions.

This is what I have done so far (forgive me for only including "summary" code examples):

  1. Made sure files are always read and written in UTF-8:

    use open ':utf8';
    
  2. Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):

    s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg;    # Kept from existing code
    s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg;     # Added
    utf8::decode $_;
    
  3. Made sure text is printed as UTF-8:

    binmode STDOUT, ':utf8';
    
  4. Made sure browsers interpret my content as UTF-8:

    Content-Type: text/html; charset=UTF-8
    <meta http-equiv="content-type" content="text/html;charset=UTF-8">
    
  5. Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):

    accept-charset="UTF-8"
    
  6. Don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:

    use utf8;
    

Does this look reasonable, or am I missing something?

EDIT: I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
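
A minimal sketch of what such a one-time batch step could look like for a single file, assuming every existing data file really is ISO-8859-1. The helper name `latin1_file_to_utf8` is hypothetical and not part of the existing code; it reads through an ISO-8859-1 decoding layer, writes through a UTF-8 encoding layer to a temporary file in the same directory, then renames that file over the original.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: re-encode one data file from ISO-8859-1 to UTF-8.
sub latin1_file_to_utf8 {
    my ($path) = @_;

    # Decode the old bytes as ISO-8859-1 while reading.
    open my $in, '<:encoding(ISO-8859-1)', $path
        or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> } // '';
    close $in;

    # Encode as UTF-8 into a temporary file next to the original,
    # so a failure halfway through does not truncate the original.
    my ($out, $tmp) = tempfile(DIR => dirname($path));
    binmode $out, ':encoding(UTF-8)';
    print {$out} $text;
    close $out or die "Cannot write $tmp: $!";

    rename $tmp, $path or die "Cannot replace $path: $!";
    return;
}
```

Running this twice on the same file would double-encode it, so the batch should keep track of which files it has already converted.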

+10  A: 
  • The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirement of UTF-8 byte sequences, but for good security, you want to reject stuff that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).

  • For the same reason, always Encode::decode('UTF-8', …), not Encode::decode_utf8(…).

  • Make decoding fail hard with an exception, compare:

    perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
    perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
    
  • You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Item 2 is written correctly as:

    use Encode qw(decode);
    use URI::Escape::XS qw(decodeURIComponent);
    $_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
    
  • Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.

  • Add the utf8 pragma anyway in every module. It cannot hurt, but you will future-proof code maintenance in case someone adds those string literals. See also CodeLayout::RequireUseUTF8.

  • Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether this is intended/needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that you should have earlier had a decoding step. The documents from http://p3rl.org/UNI give the advice to immediately decode after receiving the data from the source. Go over the places where the code is reading/writing data and verify you have a decoding/encoding step, either explicitly (decode('UTF-8', …)) or implicitly through a layer (use open pragma, binmode, 3 argument form of open).

  • For debugging: If you are not sure which string is in a variable, and in which representation, at a certain time, you cannot just print it. Use the tools Devel::StringInfo and Devel::Peek instead.
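
The strict-decoding advice above can be sketched with nothing but the core Encode module. The helper name `decode_strict` is hypothetical; it wraps the `decode('UTF-8', …, Encode::FB_CROAK)` call in `eval` so that the caller gets `undef` for invalid input instead of an uncaught exception. How invalid input should then be handled (reject the request, log it, and so on) is left to the application.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strictly decode UTF-8 bytes into a character string.
# Returns undef if the bytes are not valid UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    # With a CHECK value, decode() may modify its input, so work on a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text;    # undef when decode() died
}

my $ok       = decode_strict("caf\xC3\xA9");   # valid UTF-8, four characters
my $overlong = decode_strict("\xC0\xBC");      # invalid (over-long) sequence

print defined $ok       ? "valid input decoded\n"    : "valid input rejected\n";
print defined $overlong ? "invalid input accepted\n" : "invalid input rejected\n";
```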

daxim
Thanks for your input. I may be a little slow here, but it would be great to know which number in the checklist (if any) *every* bullet point in your answer refers to.
W3Coder
Also, could you (or anyone else) please expand a bit on the security concern - does Perl Unicode represent a potential security hazard (for web sites) and how?
W3Coder
All languages that work natively with bytes rather than Unicode strings (Perl, PHP, Ruby) have this problem: unless you put specific checks in to stop it, they will allow through UTF-8 byte sequences that are ‘over-long’: that is, they would decode to a character that should be expressed using a shorter sequence. If you then do HTML-encoding on the bytes, you will miss a `<` character that has been encoded as 0xC0 0xBC instead of 0x3C.
bobince
These sequences are invalid, but some user agents may treat 0xC0 0xBC as a `<`, which can result in cross-site scripting. Modern desktop browsers don't; it was fixed in IE6SP1 and Opera (I think ~8), but there may be other less-known browsers that still get this wrong. For this reason you should filter strings for invalid UTF-8 sequences. You can remove other unwanted control characters at the same time.
bobince
daxim
Due to performance concerns, I am not sure I will follow all recommendations in this answer. However, it is definitely good input to the challenge at hand, so I will accept it. Thanks very much!
W3Coder
+6  A: 

You're always missing something. The problem is usually the unknown unknowns, though. :)

Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover, though, was everything you have to do to ensure your database server and web server do the right thing.

Some other things you'll need to do:

  • Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.

  • Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode substitution character (that thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �

  • Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).

  • Ensure that anything looking at @ARGV translates the items from the locale of the command line to UTF-8 (it's in the book).
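
For the @ARGV point, a minimal sketch using the core modules Encode and I18N::Langinfo. The helper names `decode_args` and `decode_argv_from_locale` are hypothetical. The sketch decodes the arguments into Perl character strings (rather than into UTF-8 bytes) and assumes that the codeset name reported for the locale is one Encode recognizes; `decode` dies with an "Unknown encoding" error otherwise.

```perl
use strict;
use warnings;
use Encode qw(decode);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: decode byte-string arguments from the given codeset
# into Perl character strings.
sub decode_args {
    my ($codeset, @args) = @_;
    return map { decode($codeset, $_) } @args;
}

# Hypothetical helper: decode @ARGV in place, using the codeset that the
# current locale reports (for example "UTF-8" or "ISO-8859-1").
sub decode_argv_from_locale {
    @ARGV = decode_args(langinfo(CODESET), @ARGV);
    return;
}
```

Calling `decode_argv_from_locale()` once at program start, before anything else reads @ARGV, keeps the decoding step right where the data enters the program.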

If you find anything else, please let us know by answering your own question with whatever we left out. ;)

brian d foy
Can I buy it online? ;)
W3Coder
I don't know if *you* can buy it online. It's at the major book sellers in lots of the English speaking world, but I don't know what's available to you. I do have a big stack of them that I can send almost (almost) anywhere in the world though.
brian d foy
Sorry for not being clear, would like to buy online and read from my computer (don't want a physical copy).
W3Coder
@W3Coder: *Effective Perl Programming* is available from Amazon in Kindle form. If you can read a Kindle version, you can buy a soft copy that way. Kindle software is available for Mac, PC, iPad, Android but not *ix...
drewk
See the [www.effectiveperlprogramming.com](http://www.effectiveperlprogramming.com) website for all of your electronic options. You can get a PDF, eBook, Kindle, as well as read it in Safari Books Online.
brian d foy