I have a file containing UTF-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16, e.g. in Japanese versions?)

I believe that I can read them using raw reads and then converting with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Is there a better way?

Update: the correct answer and the others below pointed me to libiconv. Here's the function I'm using to do the conversion. I currently have it inside a class that turns the conversion into a one-line piece of code.

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

// Function for converting wchar_t* to char*. (Really: UTF-16LE --> UTF-8)
// It allocates the space needed for dest; the caller is
// responsible for freeing the memory.
// Note: this assumes each wchar_t element holds one UTF-16 code unit
// (true on Windows or with -fshort-wchar); with Linux's 32-bit wchar_t,
// pass the raw UTF-16LE bytes to iconv directly instead.
static int iwcstombs_alloc(char **dest, const wchar_t *src)
{
  iconv_t cd;
  const char from[] = "UTF-16LE";
  const char to[] = "UTF-8";

  cd = iconv_open(to, from);
  if (cd == (iconv_t)-1)
  {
    printf("iconv_open(\"%s\", \"%s\") failed: %s\n",
           to, from, strerror(errno));
    return -1;
  }

  // Input length in bytes, including the terminating NUL.
  size_t len = sizeof(wchar_t) * (wcslen(src) + 1);

  // Allocate the worst case up front: a 2-byte UTF-16 code unit expands
  // to at most 3 UTF-8 bytes (a 4-byte surrogate pair to 4), so twice
  // the input size is always enough and no realloc loop is needed.
  size_t destLen = 2 * len;
  *dest = (char *)malloc(destLen);
  if (*dest == NULL)
  {
    iconv_close(cd);
    return -1;
  }

  // Convert
  size_t inBufBytesLeft = len;
  char *inBuf = (char *)src;
  size_t outBufBytesLeft = destLen;
  char *outBuf = *dest;

  // iconv() returns (size_t)-1 on error, so the result must be held
  // in a size_t, not an int.
  size_t rc = iconv(cd,
                    &inBuf,
                    &inBufBytesLeft,
                    &outBuf,
                    &outBufBytesLeft);
  if (rc == (size_t)-1)
  {
    printf("iconv() failed: %s\n", strerror(errno));
    iconv_close(cd);
    free(*dest);
    *dest = NULL;
    return -1;
  }

  iconv_close(cd);

  return 0;
} // iwcstombs_alloc()
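
As a usage sketch (wideString below stands for a hypothetical NUL-terminated UTF-16 buffer):

char *utf8 = NULL;
if (iwcstombs_alloc(&utf8, wideString) == 0)
{
  printf("%s\n", utf8);
  free(utf8);  // the caller owns the memory
}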
+1  A: 

I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in it such as en_US.UTF-8.
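
For example, a minimal sketch of the UTF-8 approach using the standard setlocale/wcstombs pair (this assumes the en_US.UTF-8 locale is installed; check locale -a):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  // Without this, the default "C" locale applies and wcstombs
  // fails on anything outside ASCII.
  if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
    return 1;

  const wchar_t *wide = L"caf\u00e9";
  char utf8[32];
  size_t n = wcstombs(utf8, wide, sizeof utf8);
  if (n == (size_t)-1)
    return 1;  // unconvertible character
  printf("%s (%zu bytes)\n", utf8, n);
  return 0;
}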

Adam Rosenfield
I didn't have much choice at first since my boss was the one who wrote the broken code. I've since helped him to see things differently and now we'll be using UTF-8 for all stored data.
Harvey
+2  A: 

(Does Windows always use UTF-16, e.g. in Japanese versions?)

Yes, NT's WCHAR is always UTF-16LE.

(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, FAT32 paths, and so on.)

However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be; UTF-32 (UCS-4) is used instead. So wcstombs_l is unlikely to be happy.
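
You can check this on any machine with a one-liner:

#include <stddef.h>
#include <stdio.h>

int main(void)
{
  // Typically prints 4 on Linux and Mac OS X (UTF-32) and 2 on Windows.
  printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
  return 0;
}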

The Right Thing would be to use a library like iconv to read it into whichever format you are using internally - presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like the surrogates wrong.
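
For example, glibc's iconv accepts the special encoding name "WCHAR_T" for the platform's own wide type, so the read-in can look like this (a sketch assuming glibc, with error handling omitted):

#include <iconv.h>
#include <wchar.h>

// Convert raw UTF-16LE bytes into native wchar_t (UTF-32 on Linux).
// Returns the number of input bytes left unconverted (0 on success).
size_t utf16le_to_wchar(char *in, size_t inBytes,
                        wchar_t *out, size_t outCount)
{
  iconv_t cd = iconv_open("WCHAR_T", "UTF-16LE");
  char *outBuf = (char *)out;
  size_t outBytes = outCount * sizeof(wchar_t);
  iconv(cd, &in, &inBytes, &outBuf, &outBytes);
  iconv_close(cd);
  return inBytes;
}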

Runing "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Indeed, Linux can't use UTF-16 as a locale's default encoding, thanks to all the embedded \0 bytes.

bobince
WCHAR in Windows seems to have a fixed size (you can do sizeof() on it). Doesn't that mean it only implements a subset of UTF-16, which is a variable-width encoding?
PolyThinker
It stores 16-bit values corresponding to UTF-16 code units; if you want characters outside the BMP you have to handle the surrogates manually; Windows won't help you. E.g. a string holding a single non-BMP character has .length == 2. This is the same situation as e.g. Java, or Python in narrow-Unicode mode.
bobince
After lots of experimenting, and using the knowledge from this answer, I went with libiconv. I've added the simple function I used to the question for others to use. It's not perfect, and I encourage others to fix any problems.
Harvey
+3  A: 

The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:

iconv -f utf16 -t utf8 file_in.txt -o file_out.txt

You can also use iconv(3) (see man 3 iconv) to convert strings in C. Most other languages have bindings to iconv as well.

Then you can use any UTF-8 locale, like en_US.UTF-8, which is usually the default on most Linux distros.
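
One caveat: strings dumped raw from WCHAR memory usually have no BOM, and a bare utf16 may then be treated as big-endian, so it is safer to spell out the byte order:

iconv -f UTF-16LE -t UTF-8 file_in.txt -o file_out.txt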

I did not know about this tool. This doesn't answer my question because I need to read/write the files programmatically, but knowing about it makes for easier test-case generation. Thanks.
Harvey
+2  A: 

You can read the file as binary, then do your own quick conversion: http://unicode.org/faq/utf_bom.html#utf16-3 But it is probably safer to use a library (like libiconv) which handles invalid sequences properly.
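
For illustration, here is a minimal sketch of that hand conversion, decoding UTF-16LE bytes into code points (it assumes well-formed input and a large-enough output buffer; rejecting unpaired surrogates is exactly the part a library does for you):

#include <stddef.h>
#include <stdint.h>

// Decode raw UTF-16LE bytes into Unicode code points.
// out must hold at least inBytes / 2 entries; returns the count written.
size_t utf16le_decode(const unsigned char *in, size_t inBytes, uint32_t *out)
{
  size_t n = 0;
  for (size_t i = 0; i + 1 < inBytes; )
  {
    uint32_t u = in[i] | (in[i + 1] << 8);  // one little-endian code unit
    i += 2;
    // High surrogate? Combine it with the following low surrogate.
    if (u >= 0xD800 && u <= 0xDBFF && i + 1 < inBytes)
    {
      uint32_t lo = in[i] | (in[i + 1] << 8);
      if (lo >= 0xDC00 && lo <= 0xDFFF)
      {
        u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        i += 2;
      }
    }
    out[n++] = u;
  }
  return n;
}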

Mihai Nita
Thanks for the hint. My boss was using those functions you pointed to, but we switched to libiconv since it makes handling different to/from encoding sets easy.
Harvey