I previously only had vague awareness of character encoding issues, but answers to a question today got me thinking about it. The following provided more food for thought too:

perlunitut - Perl Unicode Tutorial

perlunifaq - Perl Unicode FAQ

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The only place I've seen declaring the character encoding of our source code (e.g. use utf8; for most of us) described as a "best practice" was in the answers to the previously mentioned question.

In addition, perlunitut mentions that we should use Encode qw{encode decode}; in our "standard heading" in Perl programs. Thus it seems that another "best practice" should be to decode all input and to encode all output.

What do you think?

+14  A: 

use utf8 actually has fairly little to do with it -- it only tells Perl that the source file itself is encoded as UTF-8. Almost no one uses Unicode identifiers, and a program can easily be encoding-aware without ever including UTF-8 string literals in the code.
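As a rough illustration of that point, here is a minimal sketch of a pure-ASCII source file that still handles a non-ASCII character correctly. The sample string and variable names are made up for the example; the escape \x{F6} stands in for a literal character, so no use utf8 is needed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# No "use utf8" here: the source file contains only ASCII.
# The non-ASCII character is written as an escape, not as a literal.
my $name = "Bj\x{F6}rk";    # \x{F6} is U+00F6 (o with diaeresis)

# length on a character string counts characters.
my $chars = length $name;

# After encoding to UTF-8, length counts bytes instead.
my $bytes = length encode('UTF-8', $name);

print "$chars characters, $bytes bytes\n";    # prints "5 characters, 6 bytes"
```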

But yes, the best wisdom that I know of for dealing with encodings is this:

  • Always know where your data is coming from and how it's formatted, and decode it as soon as possible (unless it's meant to be processed as bytes).
  • Always understand the data format you're writing to or what your client is expecting, and encode on output (unless your data is already bytes).
  • And when it comes to text, always work with character strings in the "interior" of your program.
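The three points above can be sketched roughly like this, assuming both the input and the output are UTF-8. The sample bytes and variable names are invented for illustration; in a real program the bytes would come from a file, socket, or similar.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from outside: "café" encoded as UTF-8
# (assumed encoding; the last character takes two bytes).
my $input_bytes = "caf\xC3\xA9";

# 1. Decode as soon as possible: bytes become a character string.
my $text = decode('UTF-8', $input_bytes);

# 2. Interior logic works on characters, so length and uc behave
#    as expected for text (4 characters, not 5 bytes).
my $char_count = length $text;
my $upper      = uc $text;

# 3. Encode on output: characters become bytes again.
my $output_bytes = encode('UTF-8', $upper);

print "$char_count\n";
print $output_bytes, "\n";
```

Keeping decode and encode at the edges like this means the code in the middle never has to know which encoding the outside world used.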

The very existence of a million different character sets and a million different encodings should be a detail of the interface as much as possible. There are some things you'll still have to keep in mind -- for example different collations for different languages -- but it's an ideal to strive for anyway, and following it as far as possible should greatly reduce the number of "encoding issues" in your code.

To answer your question more directly, yes -- if you're reading textual data from outside without decoding, or sending data anywhere without encoding, there's a very good chance that you're making a mistake, and that your code will break when someone else uses it in a locale different from yours.
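For file I/O, one common way to decode and encode at the boundary is to put an encoding layer on the filehandle instead of calling decode and encode by hand. This is an alternative not mentioned above, shown here as a minimal sketch that assumes UTF-8 and uses a temporary file and made-up sample text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A temporary file to stand in for "the outside world".
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# A character string: "naive" with U+00EF, a space, and U+263A.
my $text = "na\x{EF}ve \x{263A}";

# Output layer: characters are encoded to UTF-8 as they are written.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Input layer: bytes are decoded to characters as they are read.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $round_trip = do { local $/; <$in> };
close $in;

# On disk the text takes more bytes than it has characters.
my $byte_size = -s $path;

print length($round_trip), " characters, $byte_size bytes on disk\n";
```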

hobbs
Thanks for your answer. I'm wondering what you meant by '"interior" of your program'.
molecules
I mean the core logic of the program -- everything that actually does whatever your program or library does, as opposed to the parts that talk to the outside world.
hobbs
Thanks so much!
molecules