Welcome to unsafe land.

I'm doing P/Invoke to a legacy lib that gives me a 0-terminated C-style string in the form of an unmanaged byte buffer of unknown length. The buffer can be either ASCII or UTF-16, and the lib gives no indication whatsoever of which it is, other than the byte stream itself.

Right now I have a bad scheme, based on checking for single and double 0-bytes, to decide if I should create a managed String from Char* or SByte*. The scheme obviously breaks down for every Unicode code-point higher than U+00FF.

This is what I have:

  • The address of the unmanaged byte buffer.
  • The unmanaged byte buffer is of unknown length.
  • The unmanaged byte buffer is either a 0-terminated ASCII C-style string or a 0-terminated UTF-16 C-style string.

This is what I want:

  • Create a correct managed String from the unmanaged byte buffer, whether it's ASCII or UTF-16.

Is that problem generically solvable?

A: 

One way of adding a level of heuristics to the naïve encoding detection scheme that is based on checking for single and double 0-bytes:

  1. Assume that a marshalled "context" from the legacy lib consists of one or more strings.
  2. If one string in such a context is likely to be UTF-16, then all other strings in that context are also UTF-16.
  3. So, as soon as a UTF-16 string is found with "high enough" certainty, bias all other detections to be "probably UTF-16".
  4. If a "probably not UTF-16" string is found to be a "definitely not UTF-8" string, then it cannot be ASCII either, so set it as UTF-16.

That'll give a much higher rate of accurately created managed Strings.
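
A minimal sketch of those four steps, written in Python for brevity rather than in the C# the question would need. It makes several assumptions that are not in the answer: the UTF-16 is little-endian, each buffer is a byte snapshot of known length that includes its terminator (real unmanaged memory gives no such bound), and "high enough certainty" means more than half of the code units have a zero high byte. All function names are hypothetical.

```python
from typing import List, Optional


def utf16le_body(raw: bytes) -> Optional[bytes]:
    """Bytes before the first aligned 00 00 pair, or None if there is none."""
    for i in range(0, len(raw) - 1, 2):
        if raw[i] == 0 and raw[i + 1] == 0:
            return raw[:i]
    return None


def confident_utf16(raw: bytes) -> bool:
    """Assumed certainty rule: most code units have a zero high byte."""
    body = utf16le_body(raw)
    if not body:
        return False
    units = len(body) // 2
    return body[1::2].count(0) * 2 > units


def single_byte_body(raw: bytes) -> bytes:
    """Bytes before the first 0 byte, as a single-byte reader would see them."""
    return raw.split(b"\x00", 1)[0]


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_context(buffers: List[bytes]) -> List[str]:
    # Steps 2 and 3: one confident UTF-16 string biases the whole context.
    biased = any(confident_utf16(b) for b in buffers)
    out = []
    for raw in buffers:
        body16 = utf16le_body(raw)
        # Step 4: bytes that are not valid UTF-8 cannot be ASCII either.
        as_utf16 = biased or not is_valid_utf8(single_byte_body(raw))
        if as_utf16 and body16 is not None:
            out.append(body16.decode("utf-16-le", errors="replace"))
        else:
            # ASCII is a subset of UTF-8, so a UTF-8 decode covers it.
            out.append(single_byte_body(raw).decode("utf-8", errors="replace"))
    return out


def utf16z(text: str) -> bytes:
    """Helper for the demo: 0-terminated UTF-16LE bytes."""
    return text.encode("utf-16-le") + b"\x00\x00"


print(decode_context([utf16z("Hello"), utf16z("\u6c34")]))  # ['Hello', '水']
print(decode_context([utf16z("\u6c34")]))                   # ['4l']
```

On its own, the single character is misread as two ASCII characters; in a context that also contains "Hello", the bias corrects it. The sketch remains a guess and can still be wrong for contexts that contain no Latin-range text at all.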

Johann Gerell
+3  A: 

I don't think this can be solved 100%. If the buffer contains 6c 34 00 00, is that the Chinese character for water (U+6C34, read as big-endian UTF-16), or just an ASCII lowercase l and 4 ("l4")? But it should be possible to guess right "most of the time" depending on the specific strings.
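
The ambiguity can be reproduced directly. The following is only an illustration in Python, decoding the same four bytes both ways; the variable names are made up for the example.

```python
raw = bytes([0x6C, 0x34, 0x00, 0x00])

# Read as a 0-terminated single-byte string: stop at the first 0 byte.
as_ascii = raw.split(b"\x00", 1)[0].decode("ascii")

# Read as 0-terminated UTF-16BE: stop at the first 16-bit zero code unit.
units = [raw[i:i + 2] for i in range(0, len(raw), 2)]
end = units.index(b"\x00\x00")
as_utf16be = b"".join(units[:end]).decode("utf-16-be")

print(as_ascii)     # l4
print(as_utf16be)   # 水
```

Both readings are well-formed, so nothing in the bytes themselves says which one was intended.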

Is the UTF-16 little endian or (probably) big endian?

The largest risk is buffer overrun. For instance, if the buffer starts with a 00, is that a zero-length ASCII string, or should we try reading more of the buffer, interpreting it as UTF-16BE?

Michel de Ruiter
+2  A: 

Is that problem generically solvable?

No.

If you know the length of the string (and that it's even), you could identify UTF-16 by the presence of 00 bytes padding ISO-8859-1 characters. (Even a non-Latin alphabet language would still make heavy use of ASCII space and newline.)
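
A rough sketch of that known-length check, in Python. The function name and the 30% threshold are assumptions chosen for illustration, not values from this answer; text with few ASCII-range characters can fall below any fixed threshold.

```python
def looks_like_utf16(buf: bytes, threshold: float = 0.3) -> bool:
    """Guess UTF-16 for a buffer whose length is KNOWN and even.

    Counts 00 bytes in the even and odd positions separately, so that
    both byte orders are covered, and compares the larger count with
    the number of 16-bit code units.
    """
    if len(buf) == 0 or len(buf) % 2 != 0:
        return False
    even_zeros = buf[0::2].count(0)   # high bytes if big-endian
    odd_zeros = buf[1::2].count(0)    # high bytes if little-endian
    pairs = len(buf) // 2
    return max(even_zeros, odd_zeros) / pairs >= threshold


print(looks_like_utf16("Hi there".encode("utf-16-le")))  # True
print(looks_like_utf16("Hi there".encode("utf-16-be")))  # True
print(looks_like_utf16(b"Hi there"))                     # False
```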

But if you depend on null termination, that won't help you. If you look for 00 00, you can inadvertently match a 00 byte that just happens to be right after the null terminator. Worse, if an ASCII string isn't double null terminated, you'll run right past the end of the string.
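
Both failure modes can be simulated with a fixed byte array standing in for unmanaged memory. This is a sketch only: the neighbouring bytes are made up, and the function name is hypothetical.

```python
def find_double_zero(memory: bytes, aligned: bool) -> int:
    """Offset of the first 00 00 pair, or -1 if the snapshot has none.

    With aligned=True only even offsets are checked, as a UTF-16
    reader would do; with aligned=False every offset is checked.
    """
    step = 2 if aligned else 1
    for i in range(0, len(memory) - 1, step):
        if memory[i] == 0 and memory[i + 1] == 0:
            return i
    return -1


ascii_string = b"abc\x00"                 # the real, single-0-terminated string
neighbouring_bytes = b"\x00\x7f\x10\x20"  # made-up bytes that follow it in memory
memory = ascii_string + neighbouring_bytes

print(find_double_zero(memory, aligned=False))  # 3
print(find_double_zero(memory, aligned=True))   # -1
```

The unaligned search reports a 00 00 at offset 3, which is the real terminator plus an unrelated neighbouring byte. The aligned search finds nothing in the snapshot; against real memory it would keep reading past the end of the string.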

dan04