Okay, here's yet another character encoding question, demonstrating my ignorance of all things Unicode.
I am reading data out of Microsoft Excel `.xls` files and storing it in ESRI shapefiles (`.shp`). For versions of Excel later than 5.0, text in Excel files is stored as Unicode. However, Unicode (and specifically UTF-8) support for shapefiles is inconsistent, and thus I think I should not use it at all. Shapefiles do support old-school codepages, however.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
As I understand it, a Unicode string can include characters from multiple "codepages". I would assume, therefore, that I must somehow estimate the "best" codepage to use, and then convert all unsupported characters into their closest approximation in that codepage (or the dreaded `?`). Is this the usual approach?
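My current thinking, sketched in Python, is something like the following. The candidate codepage names and the per-character loss count are my own guesses at an approach, not anything I have seen recommended:

```python
# Guess the "best" codepage by counting, for each candidate, how many
# characters cannot be represented, then pick the one with the fewest losses.
CANDIDATES = ["cp1252", "cp1250", "cp1251", "cp1253", "cp1254"]  # Windows ANSI, EE, Russian, Greek, Turkish

def count_losses(text: str, codepage: str) -> int:
    """Number of characters in `text` that the codepage cannot represent."""
    losses = 0
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            losses += 1
    return losses

def best_codepage(text: str, candidates=CANDIDATES) -> str:
    """Candidate codepage that loses the fewest characters."""
    return min(candidates, key=lambda cp: count_losses(text, cp))

def to_codepage(text: str, codepage: str) -> bytes:
    # Unsupported characters degrade to '?' (the dreaded fallback).
    return text.encode(codepage, errors="replace")
```

For example, `best_codepage("привет")` would settle on `cp1251`, since the Cyrillic text encodes there without loss, while `cp1252` would lose every character.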
I can definitely use more than just the system codepage. Because `.shp` files use `.dbf` files to store their attribute data, at least all the codepages specified by the `.dbf` format should be supported (see the xBase format description). The supported codepages are: DOS USA, DOS Multilingual, Windows ANSI, Standard Macintosh, EE MS-DOS, Nordic MS-DOS, Russian MS-DOS, Icelandic MS-DOS, Kamenicky (Czech) MS-DOS, Mazovia (Polish) MS-DOS, Greek MS-DOS (437G), Turkish MS-DOS, Russian Macintosh, Eastern European Macintosh, Greek Macintosh, Windows EE, Russian Windows, Turkish Windows, Greek Windows.

In addition, some applications support the use of a `*.cpg` file which specifies additional codepages to use (although I understand support for UTF-8, and I suspect many other codepages, is limited).
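For what it's worth, here is my tentative mapping from those xBase codepage names to Python codec names. These are my own guesses from the standard-encodings table; Kamenicky (Czech) and Mazovia (Polish) MS-DOS don't appear to have stdlib codecs at all, so I have omitted them:

```python
import codecs

# Tentative mapping from the xBase codepage names to Python codec names.
# Kamenicky and Mazovia MS-DOS are omitted: no codec in the Python stdlib.
DBF_CODEPAGES = {
    "DOS USA": "cp437",
    "DOS Multilingual": "cp850",
    "Windows ANSI": "cp1252",
    "Standard Macintosh": "mac_roman",
    "EE MS-DOS": "cp852",
    "Nordic MS-DOS": "cp865",
    "Russian MS-DOS": "cp866",
    "Icelandic MS-DOS": "cp861",
    "Greek MS-DOS (437G)": "cp737",
    "Turkish MS-DOS": "cp857",
    "Russian Macintosh": "mac_cyrillic",
    "Eastern European Macintosh": "mac_latin2",
    "Greek Macintosh": "mac_greek",
    "Windows EE": "cp1250",
    "Russian Windows": "cp1251",
    "Turkish Windows": "cp1254",
    "Greek Windows": "cp1253",
}

# Sanity check: every codec name should resolve in this Python install.
for name in DBF_CODEPAGES.values():
    codecs.lookup(name)
```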
Because I am trying to develop a general-purpose tool, I can't assume anything about the content of the Unicode in the `.xls` files.