ansaurus

Question

.NET: Convert .doc to .htm results in funky characters

Answer 1

A:

Did you try opening the file in binary mode? If you open it in text mode, I think it will chop up the Unicode characters.

osp70 2008-11-07 18:51:16

I have to open it as text, since it will be stored as text in the database.

Todd Price 2008-11-07 19:10:07

Answer 2

A:

Isn't the problem that Word's .doc to .html conversion turns the bullet points into question marks (so it has nothing to do with File.ReadAllText, StreamReader, etc.)?

That is, by the time the text gets to File.ReadAllText, it is already a question mark.

When I convert a simple Word list to HTML in Word 2003, I get:

 <ul style='margin-top:0cm' type=disc> 
     <li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'>
       <span lang=EN-GB style='mso-ansi-language:EN-GB'>Test 1</span>
     </li> 
     <li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list 36.0pt'>
       <span lang=EN-GB style='mso-ansi-language:EN-GB'>Test 2</span>
     </li> 
 </ul>

It's ugly, but it doesn't contain anything that could become a question mark.

DrG 2008-11-07 19:54:31

Good question, but I did verify that already. Our editors are actually inserting bullet characters into the document and not using a bulleted list. Found the answer though. See below.

Todd Price 2008-11-07 20:34:40

Answer 3

A:

What do these characters look like in the HTML file? What is the encoding declaration of this file (in the "Content-Type" meta tag)? Ideally, these characters should be transformed into entities or UTF-8 characters.
Answering these questions might lead you to the solution... :-)
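
One way to answer the second question programmatically is to look for the charset in the Content-Type meta tag before decoding the rest of the file. This is only a minimal sketch and not something from the thread: the names `CharsetSniffer` and `SniffCharset` are hypothetical, and it assumes the declaration sits within the first 4 KB of the file and is written in plain ASCII.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Hypothetical helper: returns the declared charset name, or null if none is found.
    public static string SniffCharset(byte[] bytes)
    {
        // Decode only the head of the file as ASCII. Non-ASCII bytes become '?',
        // which is harmless here because the meta tag itself is plain ASCII.
        int count = Math.Min(bytes.Length, 4096);
        string head = Encoding.ASCII.GetString(bytes, 0, count);

        // Matches charset=windows-1252, charset="utf-8", charset='utf-8', etc.
        Match m = Regex.Match(head, @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
                              RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}

static class SniffDemo
{
    static void Main()
    {
        // Sample head section; the meta tag is an assumed example of the kind Word may emit.
        string sample = "<html><head><meta http-equiv=Content-Type " +
                        "content=\"text/html; charset=windows-1252\"></head></html>";
        Console.WriteLine(CharsetSniffer.SniffCharset(Encoding.ASCII.GetBytes(sample)));
    }
}
```

If a charset name comes back, it can be passed to Encoding.GetEncoding to pick the decoder for the real read; if the result is null, the file declares nothing and the encoding has to be determined some other way.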

PhiLho 2008-11-07 19:59:56

Answer 4

+2 A:

On my system (using US-English) Word saves *.htm files in the Windows-1252 codepage. If your system uses that codepage, then that is what you should read it in as.

string html = File.ReadAllText(originalFile, Encoding.GetEncoding(1252));

It is also possible that whatever you are using to view the results is creating the question marks for you, though, so be sure to check for that too.
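
A small sketch of the Windows-1252 read, to make the effect visible. The class and method names (`Cp1252Reader`, `ReadWordHtml`) are hypothetical, and the sample bytes are made up for illustration. It assumes a modern .NET runtime (.NET Core or later), where the legacy code pages must be registered through CodePagesEncodingProvider before GetEncoding(1252) will succeed; on the .NET Framework that registration step should not be needed.

```csharp
using System;
using System.IO;
using System.Text;

static class Cp1252Reader
{
    // Hypothetical helper: reads a Word-generated .htm file as Windows-1252.
    public static string ReadWordHtml(string path)
    {
        // Needed on .NET Core / .NET 5+ only; makes code page 1252 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return File.ReadAllText(path, Encoding.GetEncoding(1252));
    }
}

static class Cp1252Demo
{
    static void Main()
    {
        // 0x95 is the bullet character in Windows-1252, followed by " Item".
        byte[] wordBytes = { 0x95, 0x20, 0x49, 0x74, 0x65, 0x6D };
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, wordBytes);

            string asUtf8 = File.ReadAllText(path);            // no BOM, so decoded as UTF-8
            string as1252 = Cp1252Reader.ReadWordHtml(path);   // decoded as Windows-1252

            // A lone 0x95 is invalid UTF-8 and decodes to U+FFFD (replacement character).
            Console.WriteLine(asUtf8[0] == '\uFFFD');
            // In Windows-1252 the same byte decodes to U+2022 (bullet).
            Console.WriteLine(as1252[0] == '\u2022');
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The replacement character U+FFFD is typically what ends up displayed as a question mark (or a question mark in a diamond) once the text is stored or rendered.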

Jeffrey L Whitledge 2008-11-07 20:37:34

Answer 5

A:

OK, apparently I lied in my first statement. I thought I had tried every encoding, but I had not tried this:

data = File.ReadAllText(tempFile, Encoding.Default);

You'd think that the overload of this method where you DO NOT specify an encoding would work just fine, expecting the default encoding to be, well, Encoding.Default. However, it actually uses Encoding.UTF8 by default. Hope this helps someone else.
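
To see which encoding Encoding.Default actually resolves to on a given machine, a quick check like the one below can help. It is a sketch, not part of the original fix, and the names `EncodingProbe` and `DescribeFirstChar` are hypothetical. One caveat worth checking: on the .NET Framework, Encoding.Default is the system's ANSI code page (Windows-1252 on a US-English install), whereas on .NET Core and later it is UTF-8, so there the Encoding.Default overload would behave the same as the parameterless one.

```csharp
using System;
using System.Text;

static class EncodingProbe
{
    // Hypothetical helper: decodes the bytes with the given encoding and
    // reports the first resulting character as a Unicode code point label.
    public static string DescribeFirstChar(Encoding encoding, byte[] bytes)
    {
        string s = encoding.GetString(bytes);
        return $"U+{(int)s[0]:X4}";
    }
}

static class ProbeDemo
{
    static void Main()
    {
        Encoding d = Encoding.Default;
        Console.WriteLine($"Encoding.Default is {d.WebName} (code page {d.CodePage})");

        // 0x95 is a bullet in Windows-1252 but an invalid lone byte in UTF-8.
        byte[] bullet = { 0x95 };
        Console.WriteLine("UTF-8:   " + EncodingProbe.DescribeFirstChar(Encoding.UTF8, bullet));
        Console.WriteLine("Default: " + EncodingProbe.DescribeFirstChar(d, bullet));
    }
}
```

If the "Default" line reports U+2022, the machine's default encoding maps the byte to a bullet and the Encoding.Default read should work there; if it reports U+FFFD, naming the code page explicitly is the safer choice.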

Todd Price 2008-11-07 21:06:57

Answer 6

A:

This solved my issue. Thanks Todd

2009-01-22 20:40:58
