ansaurus

Question

How can I tell whether text on the Windows clipboard is ISO 8859 or UTF-8 in C++?

Answer 1

A:

You could call obj.IsDataAvailable(CF_UNICODETEXT) to see whether a Unicode version of what's on the clipboard is available.

Adam Davis 2008-10-03 03:20:51

I actually had code to detect this, but I had weird problems with it: sometimes I would get gibberish results. I think I know what the problem was now, though. I assumed that the pointer referred to a UTF-8 string, but I think it can point to many formats, and I need to call WideCharToMultiByte on it.

KPexEA 2008-10-03 03:35:41

Yeah, a common problem with Unicode is that it has many representations, only one of which is UTF-8.

Adam Davis 2008-10-03 04:04:29
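
To illustrate the conversion discussed in the comments above: on Windows, WideCharToMultiByte with CP_UTF8 turns UTF-16 text (such as CF_UNICODETEXT data) into UTF-8. The following is a minimal portable sketch of the same transformation, not the Windows implementation. The function name utf16_to_utf8 is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption made here for robustness.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes a UTF-16 string as UTF-8.
// Assumption: an unpaired surrogate is replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For example, utf16_to_utf8(u"\u00E9") yields the two bytes C3 A9.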

Answer 2

+1 A:

UTF-8 has a defined structure for non-ASCII bytes. You can scan for bytes >= 128, and if any are detected, check if they form a valid UTF-8 string.

The valid UTF-8 byte formats can be found on Wikipedia:

Unicode             Byte1           Byte2           Byte3           Byte4
U+000000-U+00007F   0xxxxxxx
U+000080-U+0007FF   110xxxxx        10xxxxxx
U+000800-U+00FFFF   1110xxxx        10xxxxxx        10xxxxxx
U+010000-U+10FFFF   11110xxx        10xxxxxx        10xxxxxx        10xxxxxx
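
The table above can be turned into a validity check. This is a minimal sketch, not a definitive implementation: the names has_non_ascii and is_valid_utf8 are hypothetical, and beyond the bit patterns in the table it also applies the usual strictness rules (no overlong forms, no surrogates, nothing above U+10FFFF), which are an addition here. Passing this check only means the bytes could be UTF-8; it does not prove they were meant as UTF-8.

```cpp
#include <cstddef>
#include <string>

// True if any byte is >= 128, i.e. the text is not pure ASCII.
bool has_non_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return true;
    }
    return false;
}

// True if `s` is well-formed UTF-8 according to the byte patterns in the
// table, additionally rejecting overlong forms, surrogates, and code points
// above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t extra = 0;     // continuation bytes expected
        unsigned long cp = 0;      // code point being assembled
        unsigned long min_cp = 0;  // smallest code point allowed at this length
        if (b < 0x80) { ++i; continue; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min_cp = 0x10000; }
        else return false;         // stray continuation byte or invalid lead byte
        if (n - i <= extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not of the form 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += extra + 1;
    }
    return true;
}
```

A caller would typically test has_non_ascii first and run is_valid_utf8 only when it returns true, since pure ASCII decodes identically either way.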

old answer:

You don't have to -- all ASCII text is valid UTF-8, so you can just decode it as UTF-8 and it will work as expected.

To test if it contains non-ASCII characters, you can scan for bytes >= 128.

John Millikin 2008-10-03 03:21:50

I guess I didn't really mean 7-bit ASCII. CF_TEXT returns chars from 0 to 255, so it is more like ISO/IEC 8859. I had a problem with this because French accented characters, with values typically between 130 and 162, take only 1 byte in 8859 but need 2 bytes when encoded in UTF-8.

KPexEA 2008-10-03 03:29:56
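
The size difference described in the comment above can be seen in a small conversion routine. This sketch assumes the input is ISO 8859-1 specifically, where each byte value equals its Unicode code point; other 8-bit code pages need a lookup table instead. The name latin1_to_utf8 is hypothetical.

```cpp
#include <string>

// Hypothetical helper: converts ISO 8859-1 bytes to UTF-8.
// Assumption: the input really is ISO 8859-1, so byte value == code point.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));  // ASCII stays one byte
        } else {
            // U+0080..U+00FF becomes 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For example, the single byte E9 (é in ISO 8859-1) becomes the two bytes C3 A9.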

ASCII is by definition 7-bit. I'll edit your question to be a bit more clear.

John Millikin 2008-10-03 03:35:10

If you find the two bytes C3 A0, what do you have? It is a valid code point in UTF-8 (à), but it is also two valid characters in ISO 8859-1. This is not a reliable method.

Martin York 2008-10-03 04:07:00

@Martin: There is no reliable method for this; guessing is the best that can be achieved.

John Millikin 2008-10-03 04:49:17

@John: Guessing is the best method if you ONLY have text. You can ask the clipboard object what it thinks.

Martin York 2008-10-04 18:26:43

Answer 3

+1 A:

I may be mistaken, but I think you cannot: if I open a UTF-8 file without a BOM in my editor, it is displayed by default as ISO-8859-1 (my locale), and besides some strange use of (for me) foreign accented chars, I have no strong visual hint that it is UTF-8 (unless the encoding is declared elsewhere, e.g. a charset declaration in HTML or XML): it is perfectly valid ANSI text.

John wrote "all ASCII text is valid UTF-8", but the reverse also holds: any UTF-8 text is valid ANSI text, which is why validity alone cannot tell the two apart.

Windows XP and later natively use UTF-16 and have a clipboard format for it, but AFAIK they just ignore UTF-8, with no special treatment for it.
(Well, there is actually an API to convert UTF-8 to UTF-16, or to ANSI, etc.)

PhiLho 2008-10-03 05:31:36

Answer 4

+3 A:

Check out the definition of CF_LOCALE at this Microsoft page. It tells you the locale of the text in the clipboard. Better yet, if you use CF_UNICODETEXT instead, Windows will convert to UTF-16 for you.

Mark Ransom 2008-10-03 14:05:59
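
The two options in this answer might be sketched with the Win32 clipboard API as follows. The function names read_clipboard_utf16 and clipboard_ansi_code_page are hypothetical, error handling is minimal, and mapping the CF_LOCALE value to an ANSI code page through GetLocaleInfoW is one possible approach rather than something stated in the answer. On non-Windows platforms the functions are stubs that report no data, only so the file still compiles.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// Copies the clipboard's CF_UNICODETEXT (UTF-16) contents into `out`.
// Returns false if no text is available or the clipboard cannot be opened.
bool read_clipboard_utf16(std::wstring& out) {
    if (!IsClipboardFormatAvailable(CF_UNICODETEXT)) return false;
    if (!OpenClipboard(nullptr)) return false;
    bool ok = false;
    HANDLE h = GetClipboardData(CF_UNICODETEXT);
    if (h != nullptr) {
        const wchar_t* text = static_cast<const wchar_t*>(GlobalLock(h));
        if (text != nullptr) {
            out.assign(text);  // data is NUL-terminated UTF-16
            GlobalUnlock(h);
            ok = true;
        }
    }
    CloseClipboard();
    return ok;
}

// Looks up the ANSI code page for the locale stored under CF_LOCALE.
// Assumption: that code page is the one CF_TEXT data should be decoded with.
bool clipboard_ansi_code_page(unsigned int& code_page) {
    if (!IsClipboardFormatAvailable(CF_LOCALE)) return false;
    if (!OpenClipboard(nullptr)) return false;
    bool ok = false;
    HANDLE h = GetClipboardData(CF_LOCALE);
    if (h != nullptr) {
        const LCID* lcid = static_cast<const LCID*>(GlobalLock(h));
        if (lcid != nullptr) {
            DWORD cp = 0;
            // LOCALE_RETURN_NUMBER writes the value as a DWORD into the buffer.
            ok = GetLocaleInfoW(*lcid,
                                LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                                reinterpret_cast<LPWSTR>(&cp),
                                sizeof(cp) / sizeof(WCHAR)) != 0;
            if (ok) code_page = cp;
            GlobalUnlock(h);
        }
    }
    CloseClipboard();
    return ok;
}
#else
// Stubs for non-Windows builds: there is no Win32 clipboard to query.
bool read_clipboard_utf16(std::wstring&) { return false; }
bool clipboard_ansi_code_page(unsigned int&) { return false; }
#endif
```

Reading CF_UNICODETEXT avoids the guessing problem entirely, since the system supplies the text in a known encoding; the code page lookup is only needed if you insist on working with CF_TEXT.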
