I've used MS Word automation to save a .doc file as .htm. Bullet characters in the .doc are saved correctly to the .htm, but when I read the .htm file into a string (so that I can store it in a database as text, not as a blob), the bullets turn into question marks or other characters, depending on the encoding used to load the file.

I'm using this to read the text:

string html = File.ReadAllText(myFileSpec);

I've also tried using StreamReader, but get the same results (maybe it's used internally by File.ReadAllText).

I've also tried specifying every type of Encoding in the second overload of File.ReadAllText:

string html = File.ReadAllText(originalFile, Encoding.ASCII);

I've tried all the encodings exposed as static properties of the Encoding class.

Any ideas?

A: 

Did you try opening the file in binary mode? If you open it in text mode, I think it will chop up the Unicode characters.

osp70
I have to open it as text, since it will be stored as text in the database.
Todd Price
A: 

Isn't the problem that Word's .doc to .html conversion turns the bullet points to question marks (and it hasn't got anything to do with File.ReadAllText or StreamReader etc)?

i.e. by the time it gets to File.ReadAllText it is already a question mark.

When I convert a simple Word list to HTML in Word 2003, I get:

 <ul style='margin-top:0cm' type=disc> 
     <li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'>
       <span lang=EN-GB style='mso-ansi-language:EN-GB'>Test 1</span>
     </li> 
     <li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'>
       <span lang=EN-GB style='mso-ansi-language:EN-GB'>Test 2</span>
     </li> 
 </ul>

It's ugly, but it doesn't contain anything that could become a question mark.

DrG
Good question, but I did verify that already. Our editors are actually inserting bullet characters into the document and not using a bulleted list. Found the answer though. See below.
Todd Price
A: 

What do these characters look like in the HTML file? What encoding does the file declare (in the "Content-Type" meta tag)? Ideally, these characters should be written as entities or as UTF-8 characters.
Answering these questions might lead you to the solution... :-)
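To make the "check the declared encoding" suggestion concrete, here is a rough sketch that looks for a charset declaration near the top of the file and decodes the bytes with it. The class and method names (DeclaredCharset, DetectEncoding, ReadHtml) are made up for illustration, the regular expression is a simplification rather than a full HTML parser, and the RegisterProvider line assumes .NET Core / .NET 5+ (on .NET Framework the code page encodings are built in, so that line can be dropped).

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
static class DeclaredCharset
{
    // Returns the encoding named in a "charset=..." declaration, or the fallback if none is found.
    // Encoding.GetEncoding throws ArgumentException if the declared name is unknown.
    public static Encoding DetectEncoding(string head, Encoding fallback)
    {
        // Assumption: needed on .NET Core / .NET 5+ so that names like "windows-1252" resolve.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Match m = Regex.Match(head, @"charset\s*=\s*[""']?([\w\-]+)", RegexOptions.IgnoreCase);
        return m.Success ? Encoding.GetEncoding(m.Groups[1].Value) : fallback;
    }

    public static string ReadHtml(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // The declaration itself is plain ASCII, so ASCII is enough to find it.
        string head = Encoding.ASCII.GetString(bytes, 0, Math.Min(bytes.Length, 4096));

        // Decode the whole file with the declared encoding (UTF-8 if nothing is declared).
        // Note: a leading byte order mark, if present, is not stripped by GetString.
        return DetectEncoding(head, Encoding.UTF8).GetString(bytes);
    }
}
```

Usage would be something like `string html = DeclaredCharset.ReadHtml(myFileSpec);`.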

PhiLho
+2  A: 

On my system (using US-English) Word saves *.htm files in the Windows-1252 codepage. If your system uses that codepage, then that is what you should read it in as.

string html = File.ReadAllText(originalFile, Encoding.GetEncoding(1252));

It is also possible that whatever you are using to view the results is creating the question marks, so be sure to check for that too.
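A small sketch of why the encoding choice matters here, assuming the bullet was saved as the single byte 0x95 (which is the bullet character in Windows-1252). The class and helper names are invented for the example, and the RegisterProvider call assumes .NET Core / .NET 5+; on .NET Framework code page 1252 is available without it.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, not part of any library.
static class BulletDecodingDemo
{
    public static Encoding Cp1252()
    {
        // Assumption: required on .NET Core / .NET 5+ before code page 1252 can be requested.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Writes the bytes to a temp file, reads them back with the given encoding,
    // and returns the first character's code point as four hex digits.
    public static string FirstCodePoint(byte[] fileBytes, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, fileBytes);
            string text = File.ReadAllText(path, encoding);
            return ((int)text[0]).ToString("X4");
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        byte[] raw = { 0x95, 0x20, 0x49, 0x74, 0x65, 0x6D }; // bullet, space, "Item" in Windows-1252

        Console.WriteLine(FirstCodePoint(raw, Encoding.ASCII)); // 003F, a literal question mark
        Console.WriteLine(FirstCodePoint(raw, Cp1252()));       // 2022, the bullet character
    }
}
```

If the string really does contain U+2022 and question marks still show up later, the loss is happening downstream (console, viewer, or database column type) rather than in the read.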

Jeffrey L Whitledge
A: 

OK, apparently I lied in my first statement. I thought I had tried every encoding, but I had not tried this:

data = File.ReadAllText(tempFile, Encoding.Default);

You'd think that the overload of this method where you DO NOT specify an encoding would work just fine, expecting the default encoding to be, well, Encoding.Default. However, unless the file starts with a byte order mark, it actually uses Encoding.UTF8. Hope this helps someone else.
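For anyone who wants to see the behavior of the no-encoding overload for themselves, here is a minimal sketch. The class and method names are made up. One caveat worth flagging: Encoding.Default is the system ANSI code page on .NET Framework but is UTF-8 on .NET Core / .NET 5+, so naming the code page explicitly is the more portable choice there.

```csharp
using System;
using System.IO;

// Hypothetical demo class, not part of any library.
static class DefaultOverloadDemo
{
    // Writes the bytes to a temp file and reads them back WITHOUT specifying an encoding.
    public static string WriteAndReadBack(byte[] bytes)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, bytes);
            return File.ReadAllText(path);
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        // A bullet saved as UTF-8 with a byte order mark: the BOM is detected and the text decodes correctly.
        byte[] utf8WithBom = { 0xEF, 0xBB, 0xBF, 0xE2, 0x80, 0xA2 };
        Console.WriteLine(WriteAndReadBack(utf8WithBom) == "\u2022"); // True

        // A bullet saved as Windows-1252 (byte 0x95, no BOM): it is decoded as UTF-8,
        // where 0x95 alone is invalid, so it becomes the replacement character U+FFFD.
        byte[] cp1252 = { 0x95 };
        Console.WriteLine(WriteAndReadBack(cp1252) == "\uFFFD"); // True
    }
}
```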

Todd Price
A: 

This solved my issue. Thanks Todd