ansaurus

Question

How do you convert a Word Document into very simple html in Python?

Answer 1

+2 A:

A good solution is to upload the document to Google Docs and export the HTML version from there. (There must be an API for that?)

Google Docs does a lot of the cleanup for you; Beautiful Soup can be used afterwards to make any further changes, as appropriate. Beautiful Soup is the most powerful and elegant HTML parsing library on the planet.
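As a rough illustration of that kind of follow-up cleanup, here is a minimal sketch that reduces exported HTML to a handful of plain tags. It deliberately uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra installs; the whitelist and the names SimpleHtmlCleaner and simplify_html are hypothetical, not part of any library.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: the only tags kept in the "very simple" output.
ALLOWED_TAGS = {"p", "h1", "h2", "h3", "ul", "ol", "li",
                "b", "i", "em", "strong", "a", "br"}
# Tags whose contents are dropped entirely.
SKIPPED_TAGS = {"style", "script"}


class SimpleHtmlCleaner(HTMLParser):
    """Re-emits whitelisted tags without attributes (except href on links)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.parts.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth and tag != "br":
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text outside <style>/<script> is kept and re-escaped.
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))


def simplify_html(html):
    """Return html with non-whitelisted tags and all styling removed."""
    cleaner = SimpleHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)


dirty = ('<p class="MsoNormal" style="margin:0"><span lang="EN-US">'
         'Hello <b>world</b></span></p><style>p{color:red}</style>')
print(simplify_html(dirty))
```

Text inside tags that are not on the whitelist (such as span) is kept; only the tags themselves are dropped. With Beautiful Soup the same idea would be expressed by walking the parsed tree and unwrapping or deleting tags.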

This is a well-known workflow at news organizations.

Lakshman Prasad 2009-10-20 20:20:46

Answer 2

+2 A:

It depends on how much formatting and how many images you're dealing with. I do one of a couple of things:

Google Docs: Probably the closest you'll get to the original formatting and usable HTML.
Markdown: Abandon formatting. Paste it into a plain text editor, run it through Markdown and fix the rest by hand.

Chris Amico 2009-10-20 20:31:01

How do I get the HTML from Google Doc? Is it the Download as HTML option?

Thierry Lam 2009-10-20 20:44:11

+1: Word Doc files are *very* hard to work with. Many tools will convert them, including OpenOffice. Google Docs has a simple API since it's an HTTP web service.

S.Lott 2009-10-20 21:21:02

MS Word -> HTML is just plain evil. I had a client hand me a 95(!)-page Word document containing hundreds of 'places to see' and say, "It should be easy to enter this into the database." Arrggghh! I did it and billed him $100/hour for the privilege, but I think I undercharged given the amount of pain. The HTML was flat-out the worst I have ever had to work with.

Peter Rowell 2009-10-21 00:14:35

Answer 3

A:

There are many other approaches, depending on your specific circumstances, beyond the good ones already suggested -- see this SO question and its answers for a good survey!

Alex Martelli 2009-10-21 02:33:08

Answer 4

+1 A:

You can also use Abiword/wvWare to convert a Word document to XHTML and then parse it with BeautifulSoup/ElementTree/etc. to preprocess it if you need to. In my experience, Abiword does a pretty good job of converting Word files and produces relatively clean XHTML files.

I should mention that Abiword can be run on the command line, so it's easy to integrate it in an automated process.
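A minimal sketch of that kind of automated pipeline, under a few assumptions: that the abiword binary is on the PATH, that it accepts the --to and --to-name options as I remember them (check abiword --help on your version), and that its output is well-formed XHTML that ElementTree can parse. The helper names abiword_command, convert_doc and paragraph_texts are hypothetical.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def abiword_command(doc_path, out_path):
    """Build the command line (flags are assumed, verify with `abiword --help`)."""
    return ["abiword", "--to=html", "--to-name=%s" % out_path, str(doc_path)]


def convert_doc(doc_path):
    """Run AbiWord on doc_path and return the path of the generated file."""
    out_path = Path(doc_path).with_suffix(".html")
    subprocess.run(abiword_command(doc_path, out_path), check=True)
    return out_path


def paragraph_texts(xhtml):
    """Return the plain text of every <p> in a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    return ["".join(p.itertext()).strip() for p in root.iter(XHTML_NS + "p")]


# Parsing step on a small hand-written XHTML sample (not real AbiWord output).
sample = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
          '<p>One <b>two</b></p><p> Three </p></body></html>')
print(paragraph_texts(sample))
```

In a real run you would call convert_doc("report.doc"), read the resulting file, and pass its contents to paragraph_texts. If the output turns out not to be strict XML, a more forgiving parser such as BeautifulSoup is the safer choice.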

Etienne 2009-10-21 02:54:42

Answer 5

A:

My super-simple app WordOff has an API for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:

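import urllib.parse
import urllib.request

def decruft(html):
    # POST the Word-exported HTML to the WordOff API and return the cleaned HTML.
    data = urllib.parse.urlencode({'html': html}).encode('utf-8')
    req = urllib.request.Request('http://wordoff.org/api/clean', data)
    with urllib.request.urlopen(req) as response:
        # Assumes the API responds with UTF-8 text.
        return response.read().decode('utf-8')

# This method goes on your flatpages model class (FlatPage here).
def save(self, *args, **kwargs):
    if not self.pk:  # only de-cruft when content is first added
        self.content = decruft(self.content)
    super(FlatPage, self).save(*args, **kwargs)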

tomd 2009-10-21 22:50:05
