I have a file containing UTF-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16, e.g. in Japanese versions?)

I believe that I can read them using raw reads and then converting with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Is there a better way?

Update: the correct answer and the others below pointed me to libiconv. Here's the function I'm using to do the conversion. I currently have it inside a class that turns the conversion into a one-line piece of code.

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

// Function for converting wchar_t* to char*. (Really: UTF-16LE --> UTF-8)
// It allocates the space needed for dest; the caller is
// responsible for freeing the memory.
// Note: this assumes each wchar_t element holds one UTF-16 code unit
// (true on Windows or with -fshort-wchar); with Linux's 32-bit wchar_t,
// pass the raw UTF-16LE bytes to iconv directly instead.
static int iwcstombs_alloc(char **dest, const wchar_t *src)
{
  iconv_t cd;
  const char from[] = "UTF-16LE";
  const char to[] = "UTF-8";

  cd = iconv_open(to, from);
  if (cd == (iconv_t)-1)
  {
    printf("iconv_open(\"%s\", \"%s\") failed: %s\n",
           to, from, strerror(errno));
    return -1;
  }

  // Input length in bytes, including the terminating NUL.
  size_t len = sizeof(wchar_t) * (wcslen(src) + 1);

  // Allocate the worst case up front: a 2-byte UTF-16 code unit expands
  // to at most 3 UTF-8 bytes (a 4-byte surrogate pair to 4), so twice
  // the input size is always enough and no realloc loop is needed.
  size_t destLen = 2 * len;
  *dest = (char *)malloc(destLen);
  if (*dest == NULL)
  {
    iconv_close(cd);
    return -1;
  }

  // Convert
  size_t inBufBytesLeft = len;
  char *inBuf = (char *)src;
  size_t outBufBytesLeft = destLen;
  char *outBuf = *dest;

  // iconv() returns (size_t)-1 on error, so the result must be held
  // in a size_t, not an int.
  size_t rc = iconv(cd,
                    &inBuf,
                    &inBufBytesLeft,
                    &outBuf,
                    &outBufBytesLeft);
  if (rc == (size_t)-1)
  {
    printf("iconv() failed: %s\n", strerror(errno));
    iconv_close(cd);
    free(*dest);
    *dest = NULL;
    return -1;
  }

  iconv_close(cd);

  return 0;
} // iwcstombs_alloc()
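
As a usage sketch (wideString below stands for a hypothetical NUL-terminated UTF-16 buffer):

char *utf8 = NULL;
if (iwcstombs_alloc(&utf8, wideString) == 0)
{
  printf("%s\n", utf8);
  free(utf8);  // the caller owns the memory
}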
+1  A: 

I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in it such as en_US.UTF-8.
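
For example, a minimal sketch of the UTF-8 approach using the standard setlocale/wcstombs pair (this assumes the en_US.UTF-8 locale is installed; check locale -a):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  // Without this, the default "C" locale applies and wcstombs
  // fails on anything outside ASCII.
  if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
    return 1;

  const wchar_t *wide = L"caf\u00e9";
  char utf8[32];
  size_t n = wcstombs(utf8, wide, sizeof utf8);
  if (n == (size_t)-1)
    return 1;  // unconvertible character
  printf("%s (%zu bytes)\n", utf8, n);
  return 0;
}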

Adam Rosenfield
I didn't have much choice at first since my boss was the one who wrote the broken code. I've since helped him to see things differently and now we'll be using UTF-8 for all stored data.
Harvey
+2  A: 

(Does Windows always use UTF-16, e.g. in Japanese versions?)

Yes, NT's WCHAR is always UTF-16LE.

(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, FAT32 paths, and so on.)

However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be; UTF-32 (UCS-4) is used instead. So wcstombs_l is unlikely to be happy.
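
You can check this on any machine with a one-liner:

#include <stddef.h>
#include <stdio.h>

int main(void)
{
  // Typically prints 4 on Linux and Mac OS X (UTF-32) and 2 on Windows.
  printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
  return 0;
}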

The Right Thing would be to use a library like iconv to read it into whichever format you are using internally - presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like the surrogates wrong.
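
For example, glibc's iconv accepts the special encoding name "WCHAR_T" for the platform's own wide type, so the read-in can look like this (a sketch assuming glibc, with error handling omitted):

#include <iconv.h>
#include <wchar.h>

// Convert raw UTF-16LE bytes into native wchar_t (UTF-32 on Linux).
// Returns the number of input bytes left unconverted (0 on success).
size_t utf16le_to_wchar(char *in, size_t inBytes,
                        wchar_t *out, size_t outCount)
{
  iconv_t cd = iconv_open("WCHAR_T", "UTF-16LE");
  char *outBuf = (char *)out;
  size_t outBytes = outCount * sizeof(wchar_t);
  iconv(cd, &in, &inBytes, &outBuf, &outBytes);
  iconv_close(cd);
  return inBytes;
}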

Runing "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Indeed, Linux can't use UTF-16 as a locale's default encoding, thanks to all the embedded \0 bytes.

bobince
WCHAR in Windows seems to have a fixed size (you can do sizeof() on it). Doesn't that mean it only implements a subset of UTF-16, which is a variable-width encoding?
PolyThinker
It stores 16-bit values corresponding to UTF-16 code units; if you want characters outside the BMP you have to handle the surrogates manually; Windows won't help you. E.g. a string holding a single non-BMP character has .length == 2. This is the same situation as e.g. Java, or Python in narrow-Unicode mode.
bobince
After lots of experimenting, and using the knowledge from this answer, I went with libiconv. I've added the simple function I used to the question for others to use. It's not perfect, and I encourage others to fix any problems.
Harvey
+3  A: 

The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:

iconv -f utf16 -t utf8 file_in.txt -o file_out.txt

You can also use iconv(3) (see man 3 iconv) to convert strings in C. Most other languages have bindings to iconv as well.

Then you can use any UTF-8 locale, like en_US.UTF-8, which is usually the default on most Linux distros.
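
One caveat: strings dumped raw from WCHAR memory usually have no BOM, and a bare utf16 may then be treated as big-endian, so it is safer to spell out the byte order:

iconv -f UTF-16LE -t UTF-8 file_in.txt -o file_out.txt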

I did not know about this tool. This doesn't answer my question because I need to read/write the files programmatically, but knowing about it makes for easier test-case generation. Thanks.
Harvey
+2  A: 

You can read the file as binary, then do your own quick conversion: http://unicode.org/faq/utf_bom.html#utf16-3 But it is probably safer to use a library (like libiconv) which handles invalid sequences properly.
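
For illustration, here is a minimal sketch of that hand conversion, decoding UTF-16LE bytes into code points (it assumes well-formed input and a large-enough output buffer; rejecting unpaired surrogates is exactly the part a library does for you):

#include <stddef.h>
#include <stdint.h>

// Decode raw UTF-16LE bytes into Unicode code points.
// out must hold at least inBytes / 2 entries; returns the count written.
size_t utf16le_decode(const unsigned char *in, size_t inBytes, uint32_t *out)
{
  size_t n = 0;
  for (size_t i = 0; i + 1 < inBytes; )
  {
    uint32_t u = in[i] | (in[i + 1] << 8);  // one little-endian code unit
    i += 2;
    // High surrogate? Combine it with the following low surrogate.
    if (u >= 0xD800 && u <= 0xDBFF && i + 1 < inBytes)
    {
      uint32_t lo = in[i] | (in[i + 1] << 8);
      if (lo >= 0xDC00 && lo <= 0xDFFF)
      {
        u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        i += 2;
      }
    }
    out[n++] = u;
  }
  return n;
}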

Mihai Nita
Thanks for the hint. My boss was using those functions you pointed to, but we switched to libiconv since it makes handling different to/from encoding sets easy.
Harvey