Hi all,

I was wondering if anyone knows of any good libraries for parsing .doc files (and similar formats, such as .odt) that extract the text while keeping as much formatting information as possible, so it can be displayed on a website.

Being able to do the same for PDFs would be a bonus, but that's a lower priority.

This is for a Rails project, if that helps at all.

Thanks in advance!

+2  A: 

Apache POI is a very popular Java library for reading Word and Excel documents. There's a Ruby POI binding that might be worth investigating, but it looks like you'll have to build it yourself. The API also doesn't seem very Ruby-like, since it's virtually a direct port of the Java code, and it seems to have been tested only against Ruby 1.8.2.
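On the .odt side of the question, POI is aimed at the Microsoft formats, so it may not help there. An .odt file is a ZIP archive whose body text lives in content.xml, so a rough first pass needs nothing beyond a ZIP reader and an XML parser. Below is a minimal sketch in Python, using only the standard library, to illustrate the idea; the function name odt_to_html is made up for this example, and it only keeps the heading/paragraph structure (inline styles such as bold or italics are flattened to plain text, and tabs or repeated spaces encoded as separate elements are dropped).

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# OpenDocument "text" namespace, as defined by the ODF specification.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
HEADING = f"{{{TEXT_NS}}}h"
PARAGRAPH = f"{{{TEXT_NS}}}p"
OUTLINE_LEVEL = f"{{{TEXT_NS}}}outline-level"


def odt_to_html(source):
    """Turn the headings and paragraphs of an .odt file into simple HTML.

    `source` is a file path or a binary file-like object.
    (Hypothetical helper written for this example, not part of any library.)
    """
    # An .odt file is a ZIP package; the document body is in content.xml.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    parts = []
    for elem in root.iter():  # walks the tree in document order
        if elem.tag == HEADING:
            # Clamp to h1..h6, since HTML has no deeper heading levels.
            level = min(int(elem.get(OUTLINE_LEVEL, "1")), 6)
            text = escape("".join(elem.itertext()))
            parts.append(f"<h{level}>{text}</h{level}>")
        elif elem.tag == PARAGRAPH:
            text = escape("".join(elem.itertext()))
            parts.append(f"<p>{text}</p>")
    return "\n".join(parts)
```

Calling odt_to_html("report.odt") (file name is just a placeholder) returns a string of h and p tags that a Rails view could render. Keeping more of the formatting would mean mapping the text:span style names to the style definitions stored elsewhere in the package, which this sketch doesn't attempt.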

Mark Rushakoff
Thanks very much for the link; I'll be looking into that. (+1)
Platinum Azure