Is there an easy way to detect whether the text on the clipboard is in ISO 8859 or UTF-8?

Here is my current code:

    COleDataObject obj;

    if (obj.AttachClipboard())
    {
        if (obj.IsDataAvailable(CF_TEXT))
        {
            HGLOBAL hmem = obj.GetGlobalData(CF_TEXT);
            // GlobalSize can report more than the text length (allocation padding)
            const UINT size = (UINT) ::GlobalSize(hmem);
            CMemFile sf((BYTE*) ::GlobalLock(hmem), size);
            CString buffer;

            LPSTR str = buffer.GetBufferSetLength((int) size);
            sf.Read(str, size);
            buffer.ReleaseBuffer();   // trim to the first NUL terminator
            ::GlobalUnlock(hmem);

            // this is my string class
            s->SetEncoding(ENCODING_8BIT);
            s->SetString(buffer);
        }
    }
A: 

You could call obj.IsDataAvailable(CF_UNICODETEXT) to see whether a Unicode version of what's on the clipboard is available.

Adam Davis
I actually had code to detect this, but I had weird problems with it: sometimes I would get gibberish results. I think I know what the problem was now, though. I assumed that the string pointed to a UTF-8 string, but I think it can point to many formats and I need to call WideCharToMultiByte on it.
KPexEA
Yeah, a common problem with Unicode is that it has many representations, only one of which is UTF-8.
Adam Davis
+1  A: 

UTF-8 has a defined structure for non-ASCII bytes. You can scan for bytes >= 128, and if any are detected, check if they form a valid UTF-8 string.

The valid UTF-8 byte formats can be found on Wikipedia:

Unicode             Byte1           Byte2           Byte3           Byte4
U+000000-U+00007F   0xxxxxxx
U+000080-U+0007FF   110xxxxx        10xxxxxx
U+000800-U+00FFFF   1110xxxx        10xxxxxx        10xxxxxx
U+010000-U+10FFFF   11110xxx        10xxxxxx        10xxxxxx        10xxxxxx
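
As a rough illustration of the check described above, here is a minimal C++ sketch that walks a byte string and tests it against the table. The function name `is_valid_utf8` is made up for this example. It also rejects overlong encodings, surrogates and values above U+10FFFF, which the table alone does not spell out but which the UTF-8 definition excludes. A `true` result only means the bytes are well-formed UTF-8; it does not prove the text was not written in an 8-bit encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Pure ASCII input passes trivially.
bool is_valid_utf8(const std::string& s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n) {
        const unsigned char b = p[i];
        std::size_t extra = 0;      // continuation bytes expected
        unsigned long cp = 0;       // code point being assembled
        unsigned long min_cp = 0;   // smallest value allowed for this length

        if (b < 0x80) { ++i; continue; }                           // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min_cp = 0x10000; }
        else return false;          // stray continuation or invalid lead byte

        if (n - i <= extra) return false;                // truncated sequence

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;        // must be 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

For example, the two bytes C3 A9 (é in UTF-8) pass, while a lone E9 (é in ISO 8859-1) fails because it is a lead byte with no continuation bytes.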


old answer:

You don't have to -- all ASCII text is valid UTF-8, so you can just decode it as UTF-8 and it will work as expected.

To test if it contains non-ASCII characters, you can scan for bytes >= 128.

John Millikin
I guess I didn't really mean 7-bit ASCII; CF_TEXT returns chars from 0 to 255, so it is more like ISO/IEC 8859. I had a problem with this, as French accents with values typically between 130 and 162 take only 1 byte in 8859 but need 2 bytes when encoded in UTF-8.
KPexEA
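To illustrate the size difference mentioned in the comment above, here is a small C++ sketch that converts 8-bit text to UTF-8. It assumes the input is specifically ISO 8859-1, where each byte value equals its Unicode code point; other 8-bit code pages would need a lookup table instead. The function name `latin1_to_utf8` is made up for this example.

```cpp
#include <string>

// Hypothetical helper: converts ISO 8859-1 bytes to UTF-8.
// Assumes each input byte maps directly to U+0000..U+00FF.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);   // worst case: every byte becomes two
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range: identical single byte in UTF-8
            out.push_back(static_cast<char>(c));
        } else {
            // 0x80..0xFF: encoded as 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With this sketch the single byte E9 (é in ISO 8859-1) becomes the two bytes C3 A9.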
ASCII is by definition 7-bit. I'll edit your question to be a bit more clear.
John Millikin
Suppose you find the two bytes C3 A0. What do you have? They form a valid UTF-8 sequence for one code point, but they are also two valid characters in ISO 8859. This is not a reliable method.
Martin York
@Martin: There is no reliable method for this, guessing is the best that can be achieved.
John Millikin
@John: Guessing is the best method if you ONLY have text. You can ask the clipboard object for what it thinks.
Martin York
+1  A: 

I may be mistaken, but I think you cannot. If I open a UTF-8 file without a BOM in my editor, it is displayed by default as ISO-8859-1 (my locale). Apart from some strange-looking accented characters that are foreign to me, I have no strong visual hint that it is UTF-8 (unless the encoding is declared elsewhere, e.g. a charset declaration in HTML or XML): it is perfectly valid ANSI text.

John wrote "all ASCII text is valid UTF-8", but the reverse direction is what matters here: UTF-8 text can also be read as valid 8-bit ANSI text.

Windows XP and later use UTF-16 natively and have a clipboard format for it, but AFAIK the clipboard just ignores UTF-8, with no special treatment for it.
(Well, there is actually an API to convert UTF-8 to UTF-16, ANSI, etc.)

PhiLho
+3  A: 

Check out the definition of CF_LOCALE at this Microsoft page. It tells you the locale of the text in the clipboard. Better yet, if you use CF_UNICODETEXT instead, Windows will convert to UTF-16 for you.
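
Once the text is available as UTF-16, getting UTF-8 out of it is a deterministic conversion rather than a guess. On Windows, WideCharToMultiByte with CP_UTF8 does this. The portable C++ sketch below shows roughly what that conversion involves. The function name `utf16_to_utf8` and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this example, not something Windows guarantees.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes UTF-16 code units as UTF-8.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {              // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the pair into one supplementary code point
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;                             // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                                 // unpaired low surrogate
        }

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```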

Mark Ransom