ansaurus

Question

C# UTF-8 Encoding Problem

Answer 1

A:

Check the HTML code. There is a "&nbsp;" (non-breaking space) between "Advertising" and "Programs".

Try your code with "Business Solutions" text and it will work.

You will most likely need to replace the nbsp with a normal space.
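A minimal sketch of that replacement, assuming the page text is already in a .NET string. The class and method names here are made up for illustration. It handles both the decoded character (U+00A0) and a literal "&nbsp;" entity that has not been HTML-decoded yet.

```csharp
using System;

public static class NbspCleaner
{
    // U+00A0 is the character that the HTML entity &nbsp; decodes to.
    private const char NoBreakSpace = '\u00A0';

    // Hypothetical helper: swap non-breaking spaces for ordinary spaces.
    public static string ReplaceNbsp(string text)
    {
        return text
            .Replace("&nbsp;", " ")      // entity still present in raw HTML
            .Replace(NoBreakSpace, ' '); // character after HTML decoding
    }
}
```

Note that this only changes the text itself. If the real problem is the display encoding, the "Â" would still show up for any other non-ASCII character.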

Oleg 2009-09-26 20:47:57

Answer 2

+4 A:

You are displaying UTF-8 as Latin-1 or its variants (CP1252).

Google uses a non-breaking space (nbsp, U+00A0) in that sentence. Its UTF-8 encoding is the two bytes C2 A0, which read as "Â" followed by a non-breaking space when decoded as Latin-1.
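To see where the stray "Â" comes from, here is a small sketch that encodes a string as UTF-8 and then deliberately decodes the same bytes as Latin-1. The class and method names are hypothetical; only the `Encoding` calls are real .NET APIs.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encode text as UTF-8, then (wrongly) decode those bytes as Latin-1.
    // This mimics a display that expects Latin-1 but receives UTF-8.
    public static string Utf8ReadAsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Show the UTF-8 bytes of a string as hex, e.g. "C2-A0".
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

For "Advertising\u00A0Programs" the first method returns "Advertising\u00C2\u00A0Programs": the byte C2 becomes "Â" and the byte A0 becomes a non-breaking space again.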

EDIT: The code you showed here is OK. I think the problem occurs when you display the content. It looks like you are outputting UTF-8 while the display medium expects Latin-1.
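One way to check that the decoding step itself is fine is to print the code points of the decoded string instead of the string. A rough sketch, with made-up helper names:

```csharp
using System;
using System.Text;

public static class DecodeCheck
{
    // Decode raw UTF-8 bytes into a .NET (UTF-16) string.
    public static string DecodeUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Format each UTF-16 char as U+XXXX so invisible characters show up.
    public static string CodePoints(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

`DecodeCheck.CodePoints(DecodeCheck.DecodeUtf8(new byte[] { 0xC2, 0xA0 }))` gives "U+00A0", a single character. If you see "U+00C2 U+00A0" instead, the bytes were decoded with the wrong encoding before they ever reached the display.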

If you are using the console to display the text, try this:

 Console.OutputEncoding = Encoding.GetEncoding("iso-8859-1");

This tells the console to send out Latin-1 instead of UTF-8.

If you display the text in a browser, make sure the web page is marked as UTF-8, like this:

   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

ZZ Coder 2009-09-26 20:49:29

Well, I guess that leads to two more questions:

1. If the Encoding class knows it's taking in UTF-8 and outputting a Unicode (UTF-16?) string, shouldn't it know how to translate C2 A0 in UTF-8 to the correct representation of the nbsp in Unicode? I assume I'm misunderstanding the encoding issue on a basic level. Off to do more research...

2. I'm eventually encoding the string back into UTF-8 to render in a browser. I'm only converting to a .NET string for convenience in parsing. Is there a better way to parse the text in its native UTF-8 encoding?

2009-09-26 22:10:03

See my edits.

ZZ Coder 2009-09-26 23:55:18

Excellent! That did the trick - thanks a bunch for the pointer!

2009-09-27 05:14:30
