I'm trying to get a feel for how to manipulate characters and character sets in UNIX accurately given the existence of differing locales, and to do so without requiring special tools beyond the standard UNIX ones.

My research has shown me the problem of the German sharp-s character (ß), where one character changes into two when upper-cased, along with other problems. Using tr is apparently a very bad idea. The only alternative I see is this:

echo StUfF | perl -n -e 'print lc($_);'

but I'm not certain that will work, and it requires Perl - not a bad requirement necessarily, but a very big hammer...

What about awk and grep and sed and ...? That, more or less, is my question: how can I be sure that text will be lower-cased in every locale?

+2  A: 

Perl's lc/uc works fine for most languages, but it won't handle Turkish correctly; see this bug report of mine for details. If you don't need to worry about Turkish, Perl is good to go.

cartman
Well, the Turkish "i" is a common source of i18n/L10n-related problems.
Paweł Dyda
+1  A: 

You can't be sure that text will be correct in every locale. That's not possible: there are always some errors in how software libraries implement i18n-related stuff.

If you're not afraid of using C++ or Java, you may take a look at ICU, which implements a broad set of collation, normalization, and similar rules.

Paweł Dyda