Every now and then I receive a Word document that I have to display as a web page. I'm currently using Django's flatpages to do this, pasting in the HTML that MS Word generates. That HTML is quite messy. Is there a better way, using Python, to generate very simple HTML?
A good solution is to upload the document to Google Docs and export the HTML version from there. (There must be an API for that?)
Google Docs does a lot of the clean-up for you; Beautiful Soup can then be used to make any further changes, as appropriate. I find it the most powerful and elegant HTML parsing library around.
This is a well-known workflow at news organizations.
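To give a rough idea of what the "further changes" step can look like, here is a minimal sketch. Beautiful Soup is the tool recommended above; this version uses only the standard library's html.parser so that it runs without installing anything. The whitelist and all names (KEEP_TAGS, WordHtmlCleaner, clean_word_html) are illustrative assumptions, not part of any library.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small whitelist of tags worth keeping; adjust to taste.
KEEP_TAGS = {"p", "h1", "h2", "h3", "ul", "ol", "li", "a",
             "b", "i", "em", "strong", "br", "table", "tr", "td", "th"}
VOID_TAGS = {"br"}                   # tags that never get a closing tag
KEEP_ATTRS = {"a": {"href"}}         # every other attribute is dropped
SKIP_CONTENT = {"style", "script"}   # dropped together with their text


class WordHtmlCleaner(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP_TAGS:
            return
        allowed = KEEP_ATTRS.get(tag, set())
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs if name in allowed
        )
        self.out.append("<{}{}>".format(tag, attr_text))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in KEEP_TAGS or tag in VOID_TAGS:
            return
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(html):
    """Strip Word-specific tags, attributes and style blocks from html."""
    parser = WordHtmlCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A whitelist is used rather than a blacklist because Word invents a lot of tags and attributes (o:p, class="MsoNormal", inline styles), and it is easier to list what you want than what you don't.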
It depends on how much formatting and how many images you're dealing with. I do one of a couple of things:
- Google Docs: Probably the closest you'll get to the original formatting and usable HTML.
- Markdown: Abandon formatting. Paste it into a plain text editor, run it through Markdown and fix the rest by hand.
There are many other approaches, depending on your specific circumstances, beyond the good ones already suggested -- see this SO question and its answers for a good survey!
You can also use Abiword/wvWare to convert the Word document to XHTML and then parse it with BeautifulSoup/ElementTree/etc. if you need to preprocess it. In my experience, Abiword does a pretty good job of converting Word files and produces relatively clean XHTML.
I should mention that Abiword can be run on the command line, so it's easy to integrate it in an automated process.
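As a rough sketch of that kind of automation, the conversion can be wrapped in a small Python helper. This assumes an abiword binary is on the PATH and that `abiword --to=html FILE` writes the result next to the input with an .html extension; check both against your installed version. The function names are made up for illustration.

```python
import os
import subprocess


def abiword_command(doc_path, binary="abiword"):
    """Build the argument list for converting doc_path to HTML."""
    # Assumption: --to=html selects AbiWord's (X)HTML exporter.
    return [binary, "--to=html", doc_path]


def expected_output_path(doc_path):
    """Guess where the converted file ends up."""
    # Assumption: same directory and base name, extension replaced by .html.
    return os.path.splitext(doc_path)[0] + ".html"


def convert_to_html(doc_path, binary="abiword"):
    """Run AbiWord on doc_path and return the path of the HTML output."""
    # check=True raises CalledProcessError if AbiWord exits with an error.
    subprocess.run(abiword_command(doc_path, binary), check=True)
    return expected_output_path(doc_path)
```

The command is built as a list rather than a shell string so that file names containing spaces or quotes are passed through safely.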
My super-simple app WordOff has an API for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:
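    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    def decruft(html):
        # POST the Word-exported HTML to the WordOff API; the body of the
        # response is the cleaned-up markup.
        data = urlencode({'html': html}).encode('utf-8')
        req = Request('http://wordoff.org/api/clean', data)
        response = urlopen(req)
        return response.read().decode('utf-8')

    # Defined on your flatpages model:
    def save(self, **kwargs):
        if not self.pk:  # only de-cruft when content is first added
            self.content = decruft(self.content)
        super(FlatPage, self).save(**kwargs)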
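One thing to keep in mind when calling a remote API from save() is that the request can fail. A possible guard, sketched here with a hypothetical helper name (decruft_or_keep), is to fall back to the original HTML so the page still gets saved:

```python
def decruft_or_keep(html, cleaner):
    """Return cleaner(html), or the untouched html if the cleaner fails."""
    try:
        cleaned = cleaner(html)
    except OSError:
        # In Python 3, urllib's URLError/HTTPError and socket timeouts are
        # all subclasses of OSError, so network failures end up here.
        return html
    # An empty response is treated as a failure too.
    return cleaned or html
```

In the save() override you would then write something like `self.content = decruft_or_keep(self.content, decruft)`.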