I am reading HTML in with HtmlAgilityPack, editing it, then outputting it to a StreamWriter. The HtmlAgilityPack Encoding is Latin1, and the StreamWriter is UnicodeEncoding.

I am losing some characters in the conversion, and I do not want to be.

I don't seem to be able to change the Encoding of a StreamWriter. What is the best way around this problem?

A: 

At a guess: write to a Stream (not a string). If you write to a string (including StringWriter/StringBuilder), you are implicitly using .NET's UTF-16 string.

If you just want to tweak the reported encoding (but use a string), then look at Jon's answer here.

Marc Gravell
man, that stinks. That means I basically can't pass text of this alternate encoding into a string. That makes it a true pain in the butt to move my data around...
Alex Baranosky
A: 

It is not clear at which end you're losing characters. In any case, a mere encoding mismatch isn't an issue by itself: you're still supposed to get the correct characters. If a Unicode StreamWriter writes out garbled characters, it means that it received garbage on input in the first place, which probably means that HtmlAgilityPack got the encoding for your page wrong. If it has an option for setting the encoding manually, you might want to do just that.

It may also be that you have an HTML page with a wrong encoding declaration in it. For example, it might be a UTF-8 file that contains a <meta> element declaring it as Latin-1. Where do you get the text from? Do you download it straight from the Web, or do you have it in a text file? If it's the latter, how do you create that file? If you did it manually via Notepad, or in code via StreamWriter, then you might have a UTF-8 file.
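
To make the wrong-declaration case concrete, here is a minimal sketch. It uses Python purely as a stand-in for the decoding step (it is not the .NET or HtmlAgilityPack API), and the sample text is made up for illustration. It shows what happens when UTF-8 bytes are decoded as Latin-1 because the declaration says so:

```python
# Illustration only: Python stands in for the decoder; the text is invented.
text = "\u201cquoted\u201d"            # a word wrapped in curly double quotes
utf8_bytes = text.encode("utf-8")      # what a UTF-8 file actually contains

wrong = utf8_bytes.decode("latin-1")   # trusting a false Latin-1 declaration
right = utf8_bytes.decode("utf-8")     # decoding with the real encoding

# Each curly quote is three bytes in UTF-8, so the wrong decode turns
# every quote into three unrelated characters.
print(len(text), len(wrong))
print(right == text)
```

The wrong decode never fails, because every byte is a valid Latin-1 character; it just silently produces different text.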

Pavel Minaev
I download the pages straight from the web, from a website I am working on (but didn't create). The pages say they are "ISO-8859-1". I don't know how to tell what the real encoding for these pages should be...
Alex Baranosky
I am pretty sure I am losing the characters when I use the HtmlAgilityPack to Save() to a StreamWriter. It isn't possible to save non-UTF-8 chars such as \u0093 and \u0094 to a StreamWriter, is it?
Alex Baranosky
Pavel, I think we misunderstood each other partly. ALL of the code doesn't come out as gibberish. Only two different characters do, the left double quotes and right double quotes.
Alex Baranosky
It's not an issue of writing chars. It's an issue of reading them. It has to read them from your input first, and keep in mind that .NET strings are already Unicode. So a UTF-8 StreamWriter can write any valid string, but the string may not have been read correctly in the first place. The correct Unicode characters for curved double quotes are \u201c and \u201d. That is what _HtmlAgilityPack_ should return to you when reading if the encoding for the input is specified correctly (since the point of an encoding is to translate input to Unicode).
Pavel Minaev
Ohhhh... I see. I tried switching the HTML encoding manually to UTF-8, and now when my software runs on it the returned HTML has little squares in place of the curly quotes. I thought UTF-8 would be fine...?
Alex Baranosky
A: 

If the web page is really Latin-1 (ISO-8859-1), it can't have any curly quotes in it; Latin-1 has no mappings for those characters. If you can see curly quotes when you open the page in your browser, they could be in the form of HTML entities (&ldquo; and &rdquo; or &#8220; and &#8221;). But I suspect the page's encoding is really windows-1252 despite what the headers and embedded declarations say.

windows-1252 is identical to Latin-1 except that it replaces the control characters in the \x80..\x9F range (decimal 128..159) with more useful (or at least prettier) printing characters. If HtmlAgilityPack is taking the page at its word and decoding it as ISO-8859-1, it will convert \x93 to the control character \u0093, which will look like garbage if you can get it to display at all. The browser, meanwhile, will convert it to \u201C, the Unicode code point for the Left Double Quotation Mark.
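
A small sketch of that difference, using Python only to show the byte-level behaviour (the sample bytes are invented, and this is not what HtmlAgilityPack runs internally):

```python
# The same two bytes, decoded under each encoding.
raw = b"\x93Hello\x94"

as_latin1 = raw.decode("latin-1")   # ISO-8859-1: bytes map straight to code points
as_cp1252 = raw.decode("cp1252")    # windows-1252: 0x80..0x9F are printing characters

print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])  # ['0x93', '0x94']
print([hex(ord(c)) for c in (as_cp1252[0], as_cp1252[-1])])  # ['0x201c', '0x201d']
```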

I'm not familiar with HtmlAgilityPack and I can't find any docs for it, but I would try to force it to use windows-1252. For example, you could create a windows-1252 (or "ANSI") StreamReader and have HAP use that.
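
As a rough sketch of the idea (again in Python rather than .NET, so none of this is HtmlAgilityPack's API): decode the raw bytes yourself, treating a declared ISO-8859-1 as windows-1252, and only then hand the text on and write it out as UTF-8. The helper name `decode_page` and its override table are hypothetical, made up for this example.

```python
import io

def decode_page(raw: bytes, declared: str) -> str:
    """Decode page bytes, treating a declared ISO-8859-1 as windows-1252."""
    # Hypothetical override table: an assumption for this sketch only.
    overrides = {"iso-8859-1": "cp1252", "latin-1": "cp1252", "latin1": "cp1252"}
    actual = overrides.get(declared.strip().lower(), declared)
    return raw.decode(actual)

raw = b"<p>\x93Hello\x94</p>"           # invented sample page bytes
html = decode_page(raw, "ISO-8859-1")   # curly quotes come out as U+201C / U+201D

# Write the text back out through a UTF-8 text writer over a byte stream.
out = io.BytesIO()
writer = io.TextIOWrapper(out, encoding="utf-8")
writer.write(html)
writer.flush()
print(out.getvalue())
```

One caveat with this particular sketch: Python's cp1252 codec leaves a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined and raises an error on them, so real input may need an `errors=` policy.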

Alan Moore
Good point, and very likely to be the correct answer.
Pavel Minaev