Hi,

I want to read an MS Word document and identify headers, bold words, underlined words, and so on. Is there a way to do this programmatically? I would prefer a suggestion in Java, PHP, or Ruby if possible; failing that, if the file exposes metadata I could read, please let me know.

Thanks, Ram

+1  A: 

There is a Java API that can do that. I suggest looking at the Apache POI library.

Benoit Courtine
Apache Tika is a good project; I found that it does a lot of things.
ram
+1  A: 

This is related to http://stackoverflow.com/questions/203174/whats-a-good-java-api-for-creating-word-documents

There is a work-in-progress API for this in Apache POI, which its documentation describes as follows:

HWPF is the name of our port of the Microsoft Word 97(-2007) file format to pure Java. It also provides limited read-only support for the older Word 6 and Word 95 file formats.

The partner to HWPF for the new Word 2007 .docx format is XWPF. Whilst HWPF and XWPF provide similar features, there is not a common interface across the two of them at this time.

http://poi.apache.org/hwpf/quick-guide.html
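
On the metadata part of the question: for the .docx case, the formatting is also visible in the file itself. A .docx is a ZIP archive whose word/document.xml marks bold runs with <w:b/>, underlined runs with <w:u .../>, and heading paragraphs with a <w:pStyle> whose value is a style ID such as Heading1 (the exact ID depends on the document). Below is a rough sketch that uses only the JDK rather than POI. The class and method names (DocxFormattingScan, describeRuns, readDocumentXml) are made up for illustration. It only sees formatting set directly on a run, so bold or underline inherited from a style is not detected, and it does not handle the binary .doc format, for which HWPF is still the way to go.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxFormattingScan {
    // WordprocessingML main namespace used by word/document.xml.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Pulls the raw word/document.xml text out of a .docx (ZIP) file. */
    static String readDocumentXml(Path docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(docx))) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if (e.getName().equals("word/document.xml")) {
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        throw new IOException("word/document.xml not found in " + docx);
    }

    /** Returns one line per non-empty run: paragraph style, bold, underline, text. */
    static List<String> describeRuns(String documentXml) {
        List<String> out = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            InputStream in = new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8));
            NodeList paragraphs = factory.newDocumentBuilder().parse(in).getElementsByTagNameNS(W, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element p = (Element) paragraphs.item(i);
                String style = attr(first(p, "pStyle"), "val"); // e.g. "Heading1", or null
                NodeList runs = p.getElementsByTagNameNS(W, "r");
                for (int j = 0; j < runs.getLength(); j++) {
                    Element run = (Element) runs.item(j);
                    String text = textOf(run);
                    if (text.isEmpty()) continue;
                    Element props = first(run, "rPr");
                    boolean bold = isOn(first(props, "b"));
                    Element u = first(props, "u");
                    boolean underline = u != null && !"none".equals(attr(u, "val"));
                    out.add(String.format("style=%s bold=%b underline=%b text=%s",
                            style == null ? "-" : style, bold, underline, text));
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document XML", e);
        }
        return out;
    }

    // First descendant element with the given local name, or null.
    static Element first(Element parent, String name) {
        if (parent == null) return null;
        NodeList found = parent.getElementsByTagNameNS(W, name);
        return found.getLength() == 0 ? null : (Element) found.item(0);
    }

    // Value of the w:-namespaced attribute, or null if the element or attribute is absent.
    static String attr(Element e, String name) {
        return (e == null || !e.hasAttributeNS(W, name)) ? null : e.getAttributeNS(W, name);
    }

    // A toggle such as <w:b/> is on unless its w:val explicitly switches it off.
    static boolean isOn(Element toggle) {
        if (toggle == null) return false;
        String v = attr(toggle, "val");
        return v == null || !(v.equals("0") || v.equals("false") || v.equals("off"));
    }

    // Concatenates the text of all <w:t> elements inside a run.
    static String textOf(Element run) {
        StringBuilder sb = new StringBuilder();
        NodeList parts = run.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < parts.getLength(); i++) sb.append(parts.item(i).getTextContent());
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .docx file supplied by the caller.
        describeRuns(readDocumentXml(Path.of(args[0]))).forEach(System.out::println);
    }
}
```

Saved as DocxFormattingScan.java, this can be run on Java 11 or later with `java DocxFormattingScan.java some-file.docx`, where some-file.docx stands for your own document. As far as I know, POI's XWPF classes expose the same information through paragraph and run objects and also cope with more of the format's corner cases, so treat this only as a way to see what is in the file.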

Joset