ansaurus

Question

Checklist for going the Unicode way with Perl

Answer 1

+10 A:

The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirement of UTF-8 byte sequences, but for good security, you want to reject stuff that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).
For the same reason, always Encode::decode('UTF-8', …), not Encode::decode_utf8(…).
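
A minimal sketch of the two points above. The helper name `slurp_utf8` and the file path passed to it are assumptions made for illustration. The first part prints the names Encode reports for the two spellings; the second reads a file through the strict layer.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# 'UTF-8' and 'utf8' resolve to different encoding objects in Encode:
# the first is the strict one, the second is Perl's lax variant.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
print "UTF-8 resolves to $strict_name\n";
print "utf8 resolves to $lax_name\n";

# slurp_utf8 is a hypothetical helper, not part of any module.
# It reads a whole file through the strict :encoding(UTF-8) layer.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    local $/;                 # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    return $text;
}
```

The same layer string works for output, for example `open my $out, '>:encoding(UTF-8)', $path`.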

Make decoding fail hard with an exception, compare:

perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
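
In a longer program the exception would usually be caught rather than left to kill the process. A sketch, assuming a hypothetical wrapper named `decode_or_undef`:

```perl
use strict;
use warnings;
use Encode qw(decode);

# decode_or_undef is a hypothetical wrapper for illustration.
# It returns the decoded character string, or undef if the octets
# are not valid UTF-8.
sub decode_or_undef {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $octets in place.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $text  = eval { decode('UTF-8', $octets, $check) };
    return $text;   # undef when the eval block died
}

for my $octets ("caf\xC3\xA9", "\xC0") {
    my $text = decode_or_undef($octets);
    print defined $text ? "decoded ok\n" : "rejected\n";
}
```

Whether to return undef, rethrow, or answer with an HTTP 400 is a design choice left to the caller.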

You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Item 2 of your checklist is written correctly as:
```
use Encode qw(decode);
use URI::Escape::XS qw(decodeURIComponent);
$_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
```
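
To illustrate what taking care of surrogate pairs means, here is a pure-Perl sketch. The helper `decode_percent_u` is hypothetical and is not how URI::Escape::XS is implemented; it only shows the arithmetic that joins a high and a low surrogate into one code point.

```perl
use strict;
use warnings;

# decode_percent_u is a hypothetical helper for illustration only.
sub decode_percent_u {
    my ($s) = @_;

    # A high surrogate followed by a low surrogate encodes one code point.
    $s =~ s{%u([Dd][89ABab][0-9A-Fa-f]{2})%u([Dd][C-Fc-f][0-9A-Fa-f]{2})}
           {chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))}ge;

    # Every remaining %uXXXX must be a non-surrogate code point.
    $s =~ s{%u([0-9A-Fa-f]{4})}{
        my $cp = hex $1;
        die "Lone surrogate in input: %u$1\n" if $cp >= 0xD800 && $cp <= 0xDFFF;
        chr $cp;
    }ge;

    return $s;
}

my $smiley = decode_percent_u('%uD83D%uDE00');
printf "U+%04X\n", ord $smiley;
```

A naive decoder that maps each %uXXXX to chr(hex) on its own would produce two lone surrogates here instead of one character.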
Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.
Add the utf8 pragma anyway in every module. It cannot hurt, and it future-proofs code maintenance in case someone later adds non-ASCII string literals. See also the Perl::Critic policy CodeLayout::RequireUseUTF8.
Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether the upgrade is intended or needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that a decoding step should have happened earlier. The documents at http://p3rl.org/UNI advise decoding data immediately after receiving it from the source. Go over the places where the code reads or writes data and verify that each has a decoding or encoding step, either explicit (decode('UTF-8', …)) or implicit through a layer (the open pragma, binmode, or the three-argument form of open).
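
A sketch of the decode-early, encode-late discipline with explicit calls. The sample octets stand in for data arriving from a socket or file and are an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# 1. Decode at the boundary: octets in, characters out.
my $octets_in = "na\xC3\xAFve";          # UTF-8 bytes for "naive" with a diaeresis
my $text      = decode('UTF-8', $octets_in, $check);

# 2. Work on characters only. uc() now sees one letter, not two bytes.
my $upper = uc $text;

# 3. Encode at the boundary: characters in, octets out.
my $octets_out = encode('UTF-8', $upper, $check);

# The implicit variant: let a layer do step 3 on every print.
binmode STDOUT, ':encoding(UTF-8)';
print "$upper\n";
```

Mixing the two styles on the same handle (printing already-encoded octets through an encoding layer) double-encodes the data, so pick one per handle.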
For debugging: if you are not sure which string is in a variable, and in which representation, at a certain time, you cannot just print it. Use the tools Devel::StringInfo and Devel::Peek instead.
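
A short sketch with Devel::Peek, which ships with Perl. The variable names are assumptions for illustration. `Dump` writes its report to STDERR.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode qw(decode);

my $octets = "\xC3\xA9";                 # two bytes as read from the outside
my $chars  = decode('UTF-8', $octets);   # one character after decoding

# Compare the FLAGS and PV lines of the two reports on STDERR.
Dump($octets);
Dump($chars);

printf "octets length: %d, chars length: %d\n",
    length $octets, length $chars;
```

Printing both variables to a terminal can look identical, which is why the dump is more trustworthy than print here.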

daxim 2010-09-17 15:38:28

Thanks for your input. I may be a little slow here, but it would be great to know which number in the checklist (if any) *every* bullet point in your answer refers to.

W3Coder 2010-09-17 18:01:52

Also, could you (or anyone else) please expand a bit on the security concern - does Perl Unicode represent a potential security hazard (for web sites) and how?

W3Coder 2010-09-17 18:19:53

All languages that work natively with bytes rather than Unicode strings (Perl, PHP, Ruby) have this problem: unless you put specific checks in to stop it, they will allow through UTF-8 byte sequences that are ‘over-long’: that is, they would decode to a character that should be expressed using a shorter sequence. If you then do HTML-encoding on the bytes, you will miss a `<` character that has been encoded as 0xC0 0xBC instead of 0x3C.

bobince 2010-09-17 19:07:30

These sequences are invalid, but some user agents may treat 0xC0 0xBC as a `<`, which can result in cross-site scripting. Modern desktop browsers don't; it was fixed in IE6SP1 and Opera (I think ~8), but there may be other less-known browsers that still get this wrong. For this reason you should filter strings for invalid UTF-8 sequences. You can remove other unwanted control characters at the same time.

bobince 2010-09-17 19:10:02
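
The filtering described in the two comments above could be sketched like this. The helper name `safe_html` and the choice of which control characters to keep are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# safe_html is a hypothetical helper: strict-decode, strip control
# characters, then HTML-escape. It returns a character string.
sub safe_html {
    my ($octets) = @_;

    # Strict decoding dies on invalid sequences such as over-long forms.
    my $text = decode('UTF-8', $octets,
                      Encode::FB_CROAK() | Encode::LEAVE_SRC());

    # Drop control characters, keeping tab, line feed and carriage return.
    $text =~ s/(?![\t\n\r])\p{Cc}//g;

    # Escape on characters, never on raw bytes.
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;

    return $text;
}

print safe_html('<b>"hi"</b>'), "\n";
print eval { safe_html("\xC0\xBC"); 1 } ? "accepted\n" : "rejected\n";
```

The result still has to be encoded, explicitly or through a layer, before it is sent to the client.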

daxim 2010-09-17 20:09:53

Due to performance concerns, I am not sure I will follow all recommendations in this answer. However, it is definitely good input to the challenge at hand, so I will accept it. Thanks very much!

W3Coder 2010-09-19 09:55:05

Answer 2

+6 A:

You're always missing something. The problem is usually the unknown unknowns, though. :)

Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover, though, was everything you have to do to ensure your database server and web server do the right thing.

Some other things you'll need to do:

Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.
Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode substitution character (that thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �
Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).
Ensure that anything looking at @ARGV translates the items from the locale of the command line to UTF-8 (it's in the book).
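
For the @ARGV point, a sketch using core modules only. The helper `decode_args` is hypothetical, and the result is a list of Perl character strings, which then get encoded as UTF-8 on output.

```perl
use strict;
use warnings;
use Encode qw(decode);
use I18N::Langinfo qw(langinfo CODESET);

# decode_args is a hypothetical helper: it turns command-line octets
# into character strings using the given encoding name.
sub decode_args {
    my ($codeset, @args) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return map { decode($codeset, $_, $check) } @args;
}

# Ask the locale which encoding the command line uses, then decode once,
# as early as possible.
my $codeset = langinfo(CODESET);
@ARGV = decode_args($codeset, @ARGV);
```

If every caller is known to use a UTF-8 locale, the `-CA` command-line switch or `PERL_UNICODE=A` is a shorter alternative.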

If you find anything else, please let us know by answering your own question with whatever we left out. ;)

brian d foy 2010-09-17 15:55:08

Can I buy it online? ;)

W3Coder 2010-09-17 18:04:47

I don't know if *you* can buy it online. It's at the major book sellers in lots of the English speaking world, but I don't know what's available to you. I do have a big stack of them that I can send almost (almost) anywhere in the world though.

brian d foy 2010-09-17 18:23:01

Sorry for not being clear, would like to buy online and read from my computer (don't want a physical copy).

W3Coder 2010-09-17 18:43:47

@W3Coder: *Effective Perl Programming* is available from Amazon in Kindle format. If you can read a Kindle version, you can buy a soft-copy version that way. Kindle software is available for Mac, PC, iPad and Android, but not *ix...

drewk 2010-09-17 22:26:18

See the [www.effectiveperlprogramming.com](http://www.effectiveperlprogramming.com) website for all of your electronic options. You can get a PDF, eBook, Kindle, as well as read it in Safari Books Online.

brian d foy 2010-09-18 07:10:31
