I want to have PHP read an uploaded PowerPoint presentation and, at a minimum, extract the text from each slide. Grabbing more information, such as images and layouts, would be even better, but I would settle for just the text at this point.

I know that Google Apps does this in its presentation app, so I am guessing there is some way to parse the PowerPoint binary format, but I can't seem to find any information on how to do it.

Any ideas on what to try?

Thanks -

+1  A: 

Yes, of course it's possible.

Here's a start. I wouldn't say it's very well documented or formatted, but it's not that hard once you get started. Start by focusing only on the elements you need (slides, text, etc.).

A less detailed and simpler approach would be to open the .ppt file in a hex editor and look for the information you are interested in (you should be able to see the text within the binary data) and for what surrounds it. Based on those surrounding bytes, you could then write a parser that extracts the information.
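The hex-editor inspection can also be done from PHP itself. The following is only a minimal sketch: `dumpContext()` is a hypothetical helper name (not part of any library), and it assumes you already know a piece of text that appears on a slide. It lists the bytes immediately before and after each occurrence so that any recurring framing can be spotted.

```php
<?php
// Hypothetical helper: for every occurrence of $needle in $data, return a row
// with the offset plus the hex of the $context bytes before and after it.
function dumpContext(string $data, string $needle, int $context = 8): array
{
    if ($needle === '') {
        return []; // nothing to search for
    }
    $rows   = [];
    $offset = 0;
    while (($pos = strpos($data, $needle, $offset)) !== false) {
        $start  = max(0, $pos - $context);
        $before = substr($data, $start, $pos - $start);
        $after  = substr($data, $pos + strlen($needle), $context);
        $rows[] = sprintf('%08x  %s [%s] %s', $pos, bin2hex($before), $needle, bin2hex($after));
        $offset = $pos + strlen($needle);
    }
    return $rows;
}

// Example call (file name and search text are placeholders):
// echo implode("\n", dumpContext(file_get_contents('slides.ppt'), 'Known slide title'));
```

If a known string is not found at all, it may be stored with two bytes per character, in which case the needle would need converting first (for example with `mb_convert_encoding($needle, 'UTF-16LE', 'UTF-8')`, assuming the mbstring extension is available).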

Maiku Mori
+1  A: 

Depending on the version, you could look at the Zend Framework, since Zend_Search_Lucene is able to index PowerPoint 2007 files. Take a look at the corresponding class file; I think it's something like Zend_Search_Lucene_Document_Pptx.
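For a rough idea of what reading a PowerPoint 2007 file involves, here is a minimal sketch. It is not Zend's implementation. It assumes that a .pptx file is a ZIP archive whose slides live under `ppt/slides/slideN.xml` with the text runs in `<a:t>` elements, and that PHP's zip extension is installed. The function names `slideXmlToText()` and `pptxToText()` are made up for this example.

```php
<?php
// Pull the text runs (<a:t> elements) out of one slide's XML.
function slideXmlToText(string $xml): array
{
    preg_match_all('/<a:t>(.*?)<\/a:t>/s', $xml, $matches);
    return array_map(
        fn ($run) => html_entity_decode($run, ENT_QUOTES | ENT_XML1, 'UTF-8'),
        $matches[1]
    );
}

// Collect the text runs of every slide, keyed by the number in the file name.
function pptxToText(string $filename): array
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return []; // not a readable ZIP archive
    }
    $slides = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = (string) $zip->getNameIndex($i);
        if (preg_match('#^ppt/slides/slide(\d+)\.xml$#', $name, $m)) {
            $slides[(int) $m[1]] = slideXmlToText((string) $zip->getFromIndex($i));
        }
    }
    $zip->close();
    ksort($slides);
    return $slides;
}
```

Note that the number in the slide file name is not guaranteed to match the order in which slides are shown, so treat the keys as identifiers rather than positions.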

Mathias
A: 

I wanted to post my resolution to this.

Unfortunately, I was unable to get PHP to reliably read the binary data.

My solution was to write a small VB6 app that does the work by automating PowerPoint.

It's not what I was looking for, but it solves the issue for now.

That being said, the Zend option looks like it may be viable at some point, so I will watch that.

Thanks.

OneNerd
A: 

Here's a sample function I created from a similar one that extracts text from Word documents. I tested it with Microsoft PowerPoint files, but it won't decode OpenOffice Impress files saved as .ppt.

For .pptx files you might want to take a look at Zend Lucene.

function parsePPT($filename) {
    // This approach looks for the byte sequence chr(0x0f), one arbitrary byte,
    // then chr(0x00).chr(0x00).chr(0x00) to find text strings, which are
    // terminated by another NUL, chr(0x00).
    $data = @file_get_contents($filename);
    if ($data === false) {
        return '';
    }
    $outtext = '';

    foreach (explode(chr(0x0f), $data) as $thisline) {
        // After splitting on 0x0f, a match has the three NULs at offset 1.
        if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) !== 1) {
            continue;
        }
        $text_line = substr($thisline, 4);
        $end_pos   = strpos($text_line, chr(0x00));
        if ($end_pos === false) {
            continue; // no terminating NUL, so skip this chunk
        }
        $text_line = substr($text_line, 0, $end_pos);
        // Keep only ASCII letters, digits, whitespace and basic punctuation.
        $text_line = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $text_line);
        if (strlen($text_line) > 1) {
            $outtext .= $text_line."\n";
        }
    }
    return $outtext;
}

Jorge Ortiz