Hi,

I would like to know how to parse Microsoft Word (.doc and .docx) documents and extract their text content. The code has to be plain C, compiled with gcc.

Are there any libraries that already do this job?

Extension: can I use the same procedure to extract text from Microsoft PowerPoint files as well?

+1  A: 

I don't know of any existing libraries, but Microsoft publishes the format specifications for free, under a promise not to sue you for using them.
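
Whichever parser you end up with, it helps to tell the two container formats apart first: legacy .doc (and .ppt) files are OLE2 compound files that start with the bytes D0 CF 11 E0 A1 B1 1A E1, while .docx (and .pptx) files are ZIP archives that start with "PK" 03 04. A minimal sketch in C; the names sniff_bytes and sniff_file are made up for illustration, and a match only identifies the container, not that the file really is a Word document:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of a file by its leading magic bytes. */
enum doc_kind { DOC_UNKNOWN, DOC_OLE2, DOC_ZIP };

/* Inspect the first bytes of a buffer and report the container type. */
enum doc_kind sniff_bytes(const unsigned char *buf, size_t len)
{
    static const unsigned char ole2[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    static const unsigned char zip[4] = { 'P', 'K', 0x03, 0x04 };

    if (len >= sizeof ole2 && memcmp(buf, ole2, sizeof ole2) == 0)
        return DOC_OLE2;   /* .doc / .ppt style container */
    if (len >= sizeof zip && memcmp(buf, zip, sizeof zip) == 0)
        return DOC_ZIP;    /* .docx / .pptx style container */
    return DOC_UNKNOWN;
}

/* Read the first 8 bytes of a file and classify them. */
enum doc_kind sniff_file(const char *path)
{
    unsigned char buf[8];
    size_t n;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return DOC_UNKNOWN;
    n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    return sniff_bytes(buf, n);
}
```

A DOC_OLE2 result means the binary-format specification applies; DOC_ZIP means the text lives in XML parts inside the archive.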

coppro
+1  A: 

Microsoft Word documents are an enormous beast - you definitely don't want to be writing this code yourself. Look into using an existing free Word library such as antiword or wvWare.
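
If you go the antiword route from C, one simple option is to run the antiword command-line tool as a child process and read the text it prints on standard output. A rough sketch, assuming a POSIX system with antiword installed on the PATH; capture_stdout and doc_to_text are hypothetical helper names, not part of antiword:

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with the given arguments and return everything it writes to
 * stdout as a malloc'd, NUL-terminated string. Returns NULL on any failure,
 * including a non-zero exit status. The caller frees the result. */
char *capture_stdout(char *const argv[])
{
    int fds[2];
    pid_t pid;
    size_t cap = 4096, len = 0;
    ssize_t n = 0;
    char *out;
    int status = 0;

    if (pipe(fds) != 0)
        return NULL;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return NULL;
    }
    if (pid == 0) {
        /* Child: send stdout into the pipe, then exec the program. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    /* Parent: read until EOF, growing the buffer as needed. */
    close(fds[1]);
    out = malloc(cap);
    while (out != NULL && (n = read(fds[0], out + len, cap - len - 1)) > 0) {
        len += (size_t)n;
        if (cap - len < 2) {
            char *tmp = realloc(out, cap * 2);
            if (tmp == NULL) {
                free(out);
                out = NULL;
            } else {
                out = tmp;
                cap *= 2;
            }
        }
    }
    close(fds[0]);
    waitpid(pid, &status, 0);

    if (out == NULL || n < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        free(out);
        return NULL;
    }
    out[len] = '\0';
    return out;
}

/* Assumes the usual invocation "antiword FILE", which prints plain text. */
char *doc_to_text(const char *path)
{
    char *argv[] = { "antiword", (char *)path, NULL };
    return capture_stdout(argv);
}
```

Calling doc_to_text("report.doc") should give you the text or NULL if antiword is missing or rejects the file. As far as I know antiword only handles the older binary .doc format, so .docx needs a different tool.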

Adam Rosenfield
It seems catdoc is a similar library. antiword is what I am actually looking for; looking forward to having a go at the enormous beast. Thanks for the info.
FL4SOF
+1  A: 

On Windows, let Word do the job and interface with its COM object; on Linux, antiword already does the job. Or you can automate OpenOffice.org on any platform through the UNO object model.

PW
A: 

You can get inspiration from open-source libraries like antiword / catdoc.

FL4SOF