I've searched posts here on Stack Overflow, and read JoelOnSoftware's post on encoding, and now have a basic grasp of encoding issues. But I'm running into a problem with some character encoding coming from the Windows clipboard.

The reproducible test is to use IE and select and copy the "Advertising Programs" text from the Google homepage.

I'm using the following C# code to pull this text off the clipboard (error checking removed):

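// "HTML Format" is a registered clipboard format, so look up its ID by name.
uint FormatId = GetRegisteredClipboardFormatId("HTML Format");
// The clipboard must already be open (OpenClipboard) at this point.
IntPtr hHtml = Win32.GetClipboardData(FormatId);
// GlobalSize reports the allocated size of the block, which can be larger than the data written to it.
uint DataSize = Win32.GlobalSize(hHtml);
byte[] HtmlData = new byte[DataSize];
// Lock the global memory handle to get a pointer, copy the bytes out, then unlock.
IntPtr pData = Win32.GlobalLock(hHtml);
Marshal.Copy(pData, HtmlData, 0, (int)DataSize);
Win32.GlobalUnlock(hHtml);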

The clipboard HTML data is supposed to be UTF-8 encoded, so I use the following to convert the data to a string:

string Content = Encoding.UTF8.GetString(HtmlData);

However, ignoring the surrounding HTML tags, what this results in is:

"AdvertisingÂ Programs"

Am I doing something wrong, misunderstanding something, or does the problem lie elsewhere?

Thanks for any help!

A: 

Check the HTML code. There is "&nbsp;" between "Advertising" and "Programs".

Try your code with the "Business Solutions" text and it will work.

Most probably you will need to replace the non-breaking space with a normal space.
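
A minimal sketch of that replacement, assuming the clipboard bytes have already been decoded into a .NET string. The class and method names (NbspCleaner, ReplaceNbsp) are hypothetical and only serve the illustration; the one fact relied on is that "&nbsp;" corresponds to the character U+00A0.

```csharp
using System;

public static class NbspCleaner
{
    // U+00A0 is the non-breaking space that "&nbsp;" stands for.
    private const char NonBreakingSpace = '\u00A0';

    // Hypothetical helper: swaps every non-breaking space for an ordinary space.
    public static string ReplaceNbsp(string text)
    {
        return text.Replace(NonBreakingSpace, ' ');
    }

    public static void Main()
    {
        string copied = "Advertising\u00A0Programs";
        Console.WriteLine(ReplaceNbsp(copied) == "Advertising Programs"); // True
    }
}
```

This only changes the character itself. If the text is still shown with the wrong encoding afterwards, other non-ASCII characters will continue to look garbled.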

Oleg
+4  A: 

You are displaying UTF-8 as Latin-1 or one of its variants (such as CP1252).

Google uses a non-breaking space (nbsp) in that phrase. In UTF-8 it is encoded as the bytes C2 A0, which, read as Latin-1, come out as "Â" followed by a non-breaking space.
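
To make the two readings of those bytes concrete, here is a small sketch that decodes C2 A0 once as UTF-8 and once as Latin-1. The class and method names (NbspDecodingDemo, DecodeUtf8, DecodeLatin1) are made up for the example; the Encoding calls are the standard .NET ones.

```csharp
using System;
using System.Text;

public static class NbspDecodingDemo
{
    // Decodes the bytes the way the clipboard data is meant to be read.
    public static string DecodeUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Decodes the same bytes as Latin-1, one character per byte.
    public static string DecodeLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    public static void Main()
    {
        byte[] nbspUtf8 = { 0xC2, 0xA0 };

        string correct = DecodeUtf8(nbspUtf8);    // a single U+00A0
        string mojibake = DecodeLatin1(nbspUtf8); // U+00C2 followed by U+00A0

        Console.WriteLine(correct.Length);                     // 1
        Console.WriteLine(mojibake.Length);                    // 2
        Console.WriteLine(((int)mojibake[0]).ToString("X4"));  // 00C2
    }
}
```

Printing the lengths and code points instead of the characters keeps the check independent of how the console or browser renders the text.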

EDIT: The code you showed here is OK. I think the problem occurs when you display the content. It looks like you are outputting UTF-8 while the display medium expects Latin-1.

If you are using the console to display the text, try this:

 Console.OutputEncoding = Encoding.GetEncoding("iso-8859-1");

This tells the console to emit Latin-1 instead of UTF-8.

If you display the text in a browser, make sure the web page is declared as UTF-8, for example:

   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
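
As a rough sketch of the browser case, the string can be wrapped in a page that declares UTF-8 and then encoded with the same encoding, so the declaration and the bytes agree. The names Utf8PageDemo, BuildPage and ContainsNbspBytes are hypothetical, and the page skeleton is only an assumption about what the output might look like.

```csharp
using System;
using System.Text;

public static class Utf8PageDemo
{
    // Hypothetical helper: wraps a fragment in a page that declares UTF-8
    // and returns the page encoded as UTF-8 bytes.
    public static byte[] BuildPage(string fragment)
    {
        string html =
            "<html><head>" +
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body>" + fragment + "</body></html>";
        return Encoding.UTF8.GetBytes(html);
    }

    // Returns true if the byte sequence C2 A0 (UTF-8 for U+00A0) occurs in the page.
    public static bool ContainsNbspBytes(byte[] page)
    {
        for (int i = 0; i + 1 < page.Length; i++)
        {
            if (page[i] == 0xC2 && page[i + 1] == 0xA0)
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        byte[] page = BuildPage("Advertising\u00A0Programs");
        Console.WriteLine(ContainsNbspBytes(page)); // True
    }
}
```

If the page is served over HTTP, the charset in the Content-Type response header should say UTF-8 as well, since browsers give it priority over the meta tag.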
ZZ Coder
Well, I guess that leads to two more questions:
1. If the Encoding class knows it's taking in UTF-8 and outputting a Unicode (UTF-16?) string, shouldn't it know how to translate C2 A0 in UTF-8 to the correct representation of &nbsp; in Unicode? I assume I'm misunderstanding the encoding issue on a basic level. Off to do more research...
2. I'm eventually encoding the string back into UTF-8 to render in a browser. I'm only converting to a .NET string for convenience in parsing. Is there a better way to parse the text in its native UTF-8 encoding?
See my edits.
ZZ Coder
Excellent! That did the trick - thanks a bunch for the pointer!