I have hundreds of .doc files with text that I need put on web pages.

I realize I could convert every .doc file to .txt, then use a server side include to embed the contents of each page into a webpage. This would save a lot of time because I could simply have one .php?txt=... page which will display a different .txt include depending on the link the user pressed to get there. This works perfectly content-wise.
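For illustration, here is a minimal sketch of that "one page, many includes" idea, written in Python rather than PHP. The `texts` directory, the `render_page` name, and the rule for allowed file names are all assumptions made for the example, not part of my actual setup. The point it shows is that the `txt` parameter should be validated before it is turned into a file path.

```python
import html
import re
from pathlib import Path

# Hypothetical directory holding the converted .txt files.
TEXT_DIR = Path("texts")

# Accept only simple names (letters, digits, "_" and "-") so that a request
# such as ?txt=../../etc/passwd can never leave the text directory.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")

NOT_FOUND = "<p>Document not found.</p>"


def render_page(txt_param: str, text_dir: Path = TEXT_DIR) -> str:
    """Return an HTML fragment for the requested text file."""
    if not SAFE_NAME.fullmatch(txt_param):
        return NOT_FOUND
    path = text_dir / f"{txt_param}.txt"
    if not path.is_file():
        return NOT_FOUND
    lines = path.read_text(encoding="utf-8").splitlines()
    # Escape the text so characters like "<" and "&" display literally,
    # and wrap every non-empty line in its own paragraph.
    return "\n".join(f"<p>{html.escape(line)}</p>" for line in lines if line.strip())
```

A link such as `?txt=intro` would then map to `render_page("intro")`, which reads `texts/intro.txt`.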

However, all formatting is lost when the files are converted to .txt (titles should be in bold).
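If bold titles are the only formatting that matters, one possible workaround is to restore them at display time. The sketch below assumes (and this is purely an assumption) that the first non-empty line of every file is its title; `text_to_html` is a made-up helper name.

```python
import html


def text_to_html(text: str) -> str:
    """Render plain text as HTML, showing the first non-empty line in bold."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ""
    # Assumption: the first non-empty line is the document title.
    title, body = lines[0], lines[1:]
    parts = [f"<p><strong>{html.escape(title)}</strong></p>"]
    parts += [f"<p>{html.escape(line)}</p>" for line in body]
    return "\n".join(parts)
```

This would not help with documents that have several headings or other formatting; it only covers the simple title-plus-paragraphs case.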

When I convert these .doc files to .html using Microsoft Word, the ~20 line documents become bloated >300 line .htm files (probably because each paragraph is put into a text box).

Dreamweaver's "Clean up Word HTML" helped a bit but the code was still extremely bloated.

How would you suggest going about this?

Edit: I may have answered my own question by trying to embed Google Docs into my page.

A: 

You can try converting the Word documents to a DocBook intermediate format, then you can easily transform the DocBook with existing tools to (X)HTML.

fuzzy lollipop
A: 

MS Word is bloatware. Its own markup is bloated, and therefore any attempt to automatically convert it to HTML will inherit these problems. You end up with garbage like: <strong><strong></strong></strong> for no good reason.

Dreamweaver can clean it up a lot, but nothing short of strip/remarkup is going to get you clean results.

That's why most people use PDFs for this type of issue.
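For anyone who wants to try the strip/remarkup route anyway, here is a rough sketch using only Python's standard `html.parser`. The whitelist of tags and the `strip_word_html` name are my own assumptions for the example; adjust them to whatever your documents actually need.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: the only tags worth keeping for simple text documents.
KEEP = {"p", "b", "strong", "i", "em", "h1", "h2", "h3", "ul", "ol", "li", "br"}
# Elements whose entire contents are dropped (Word puts its CSS in <style>).
SKIP_CONTENT = {"style", "script", "head"}


class WordHtmlStripper(HTMLParser):
    """Keep whitelisted tags (without attributes) and text; drop everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes such as class/style are dropped

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def strip_word_html(markup: str) -> str:
    parser = WordHtmlStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Word's conditional comments and `<span>`/`<o:p>` wrappers fall away because they are simply not on the whitelist. Note that this sketch does not collapse empty or doubled tags like the `<strong><strong>` example, so a second pass would still be needed for those.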

Diodeus
I'm mainly concerned about mobile devices not being able to read PDF files.
bbb
A: 

My immediate reaction would be to convert the docs to PDFs. That will normally preserve formatting quite well, and users typically have their browsers set up to view PDFs one way or another (and the few who don't are undoubtedly accustomed to being unable to view a lot of documents on a lot of sites).

Jerry Coffin
A: 

All right, thanks everyone for your suggestions, but I wanted to make this page accessible to people without PDF viewers as well.

Google Docs allows you to bulk upload your text files (and converts them for you too).

You can then export them as an iframe to embed in any HTML document.

bbb
Apparently I can't accept my own answer.
bbb
You can only accept your own answer after a time limit, some 20 minutes or so.
Hello71
A: 

There is a program suite called wv (formerly mswordview). It includes a program called wvWare, which can transform Word documents to HTML.

Furthermore you can use the output from Word and send it through tidy. This corrects markup and usually can handle the mistakes made by Word.
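As a rough sketch of the tidy step, the Python snippet below pipes Word's HTML through the `tidy` executable. It assumes HTML Tidy is installed and on your PATH; the function name `tidy_word_html` and the particular choice of options are mine, so check them against your Tidy version.

```python
import shutil
import subprocess

# Options passed to HTML Tidy. "--word-2000 yes" asks Tidy to strip the extra
# markup that Word adds; "-asxhtml" requests XHTML output.
TIDY_ARGS = ["-quiet", "-asxhtml", "-utf8", "--word-2000", "yes"]


def tidy_word_html(markup: str, tidy_path: str = "tidy") -> str:
    """Pipe Word-generated HTML through tidy and return the cleaned markup."""
    if shutil.which(tidy_path) is None:
        raise FileNotFoundError(f"{tidy_path!r} is not installed or not on PATH")
    result = subprocess.run(
        [tidy_path, *TIDY_ARGS],
        input=markup,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # Tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout
```

Warnings are treated as success here because Word's output almost always triggers some.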

qbi