I'm currently using a combination of OpenOffice macros and a pdf2text program to extract text, and I would like to find an easier, more efficient way of getting the text out of a PowerPoint file.

I've tried the Apache POI library without much luck: it threw numerous exceptions on the files I'm processing, and I don't particularly want to sift through the library's source code.

Is there an easy way to do this without using the aforementioned library?

A: 

If you have MS Office, save the PPT as RTF (Rich Text Format); the resulting file contains just the text from the presentation. You can then open it in any editor that understands RTF and save it as a plain text (TXT) file.

I expect this would work from OpenOffice too.

Since you are asking about an API, this may not be the way to go for you, but it might give you some ideas for getting there. For example, you could use multiple macros to do the conversion in stages.

Edit: I got curious and did a quick Google search.

This is what I found on one of the www.openoffice.org pages:

As people in this thread have pointed out, retrieving text from an OO document isn't hard, since it's just zipped XML that can be parsed with a Perl script. The problem is getting Microsoft PowerPoint documents into a zipped XML format in the first place.

I've found that File -> Wizards -> Document Converter does exactly that. Just tell it you want to convert PowerPoint documents, not templates, point it to your source directory and to where you want it to put the results, and you're away.

I then find that

unzip -p $file.sxi content.xml | perl -p -e "s/<[^>]*>/\n/g;s/ +//;s/\n\n/\n/g;" -w

works rather well for extracting the text.
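The same idea can be sketched in Python without shelling out to unzip and perl. This is only a minimal sketch, assuming the converted file is a zip archive with a content.xml member, as the quoted post describes. The function name extract_text and the file name in the usage line are made up for illustration.

```python
import re
import zipfile
from xml.etree import ElementTree


def extract_text(path, member="content.xml"):
    """Return the text found in `member` of the zip archive at `path`.

    Each run of text between two tags ends up on its own line, which is
    roughly what the perl one-liner does by turning tags into newlines.
    """
    with zipfile.ZipFile(path) as archive:
        root = ElementTree.fromstring(archive.read(member))
    # itertext() yields the character data between tags, in document order.
    chunks = (re.sub(r"\s+", " ", chunk).strip() for chunk in root.itertext())
    # Drop chunks that were empty or whitespace only.
    return "\n".join(chunk for chunk in chunks if chunk)
```

Usage would be something like print(extract_text("slides.sxi")), where slides.sxi is a hypothetical converted file. Because it uses a real XML parser instead of a regex, entities such as &amp; come out decoded, but text split across nested tags within one paragraph still lands on separate lines.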

Sorry, I don't have OpenOffice handy to try any of that out.

nik