I've searched posts here on Stack Overflow and read the Joel on Software post on encoding, so I now have a basic grasp of encoding issues. However, I'm running into a problem with the character encoding of text coming from the Windows clipboard.
To reproduce it, open the Google homepage in IE, then select and copy the "Advertising Programs" text.
I'm using the following C# code to pull this text off the clipboard (error checking removed):
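// Look up the registered clipboard format named "HTML Format".
uint FormatId = GetRegisteredClipboardFormatId("HTML Format");
IntPtr hHtml = Win32.GetClipboardData(FormatId);
// GlobalSize reports the size of the allocation, which may be larger than the data written to it.
uint DataSize = Win32.GlobalSize(hHtml);
byte[] HtmlData = new byte[DataSize];
// Lock the global memory block, copy the raw bytes into managed memory, then unlock it.
IntPtr pData = Win32.GlobalLock(hHtml);
Marshal.Copy(pData, HtmlData, 0, (int)DataSize);
Win32.GlobalUnlock(hHtml);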
The clipboard HTML data is supposed to be UTF-8 encoded, so I use the following to convert the data to a string:
string Content = Encoding.UTF8.GetString(HtmlData);
However, ignoring the surrounding HTML tags, what this results in is:
"Advertising Programs"
Am I doing something wrong, misunderstanding something, or does the problem lie elsewhere?
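In case it helps narrow things down, here is a minimal sketch for dumping the raw clipboard bytes around the text in question. That should show whether the bytes themselves or the decoding step is at fault. The `ClipboardBytes` class and its `IndexOfAscii` and `ToHex` helpers are hypothetical names introduced for illustration; they are not part of the code above or of any library.

```csharp
using System;
using System.Text;

// Hypothetical diagnostic helpers (names are illustrative only).
static class ClipboardBytes
{
    // Returns the index of the first occurrence of an ASCII marker
    // in the raw byte array, or -1 if it does not occur.
    public static int IndexOfAscii(byte[] data, string marker)
    {
        byte[] needle = Encoding.ASCII.GetBytes(marker);
        for (int i = 0; i + needle.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && data[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }

    // Formats up to 'count' bytes starting at 'offset' as
    // space-separated two-digit uppercase hex values.
    public static string ToHex(byte[] data, int offset, int count)
    {
        int end = Math.Min(data.Length, offset + count);
        var sb = new StringBuilder();
        for (int i = offset; i < end; i++)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(data[i].ToString("X2"));
        }
        return sb.ToString();
    }
}
```

With the `HtmlData` array from above, something like `ClipboardBytes.ToHex(HtmlData, ClipboardBytes.IndexOfAscii(HtmlData, "Advertising"), 24)` would print the bytes starting at that word. An ordinary space between the two words would appear as `20`; any other byte sequence there would suggest the source data contains a different character.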
Thanks for any help!