Special Characters in non english languages

views:

answers:

Special Characters in non english languages

How can I generate a text file containing all character codes in a specific locale for example 1029 Czech. I basically want to generate a list of every character that exists in their alphabet?

+2 A:

I'd write a script in python + beautiful soup that grabs all of them from:

http://en.wikipedia.org/wiki/List%5Fof%5FUnicode%5Fcharacters

Santi 2009-09-25 07:36:11

+1 A:

If this is a Windows locale, just generate a file with all bytes from 32 to 255: The byte codes for the locales are the same; they just interpret each byte differently. See this page for links.

Example: The byte 0xa5 stands for "Ą" in the Czech codepage (1250) while it's "¥" in the German codepage (1252 a.k.a ISO-Latin-1).

[EDIT] Note that this only works for pre-Unicode locales where one byte maps to exactly one character. It doesn't work for any Asian locale which need two or more bytes per character.

Aaron Digulla 2009-09-25 07:49:37

This will probably not work as you expect in quite a few languages, most notably chinese, japanese, thai, korean etc.

Epcylon 2009-09-25 09:58:12

Correct, it won't work on Asian languages but he was wondering about something that isn't Unicode.

Aaron Digulla 2009-09-25 12:03:49

CLDR (Common Locale Data Repository, http://cldr.unicode.org/) contains that information.

But there is really no good reason to care, if you handle everything with Unicode. The European Union has now member countries that use characters across several of the "traditional" code pages: Western European, Eastern European, Turkish, Baltic, Greek, Cyrillic. Unicode is the only way.

Mihai Nita 2009-09-25 20:53:44

ansaurus

tags:

views:

answers:

Special Characters in non english languages

related questions