Like the title says. Reason I ask is that we're converting PDFs to formatted ASCII text (using pdftotext) and only want to display the ones that look reasonably sane.

PPT files tend to have text over images, diagonal text and other things that don't translate to ASCII very well, so we'd like to filter them out if we can.

A: 

It might put its name in the creator or producer info, but I don't have a copy to check this theory with.
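For what it's worth, here is a rough sketch of how that check could look, assuming the `pdfinfo` tool that ships alongside `pdftotext` is installed. The helper names (`run_pdfinfo`, `parse_pdfinfo`, `mentions_powerpoint`) are made up for illustration, and the fields hold arbitrary text, so treat a match as a hint rather than proof.

```python
import subprocess
from typing import Dict


def run_pdfinfo(path: str) -> str:
    """Return the raw text that `pdfinfo` prints for one file.

    Assumes the pdfinfo binary is on PATH; raises if it exits with an error.
    """
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Turn 'Key:   value' lines into a dict, splitting on the first colon."""
    info: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            info[key.strip()] = value.strip()
    return info


def mentions_powerpoint(info: Dict[str, str]) -> bool:
    """True if the Creator or Producer field names PowerPoint (case-insensitive)."""
    fields = (info.get("Creator", ""), info.get("Producer", ""))
    return any("powerpoint" in field.lower() for field in fields)
```

Usage would be something like `mentions_powerpoint(parse_pdfinfo(run_pdfinfo("slides.pdf")))`, where `slides.pdf` is a placeholder file name.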

Azeem.Butt
That field can contain arbitrary text. It's programmatically unreliable.
Jason D
+1  A: 

Your reasoning is very arbitrary - there are surely plenty of PPT files without the features you describe, and plenty of PDF files with them, that were generated from another source.

In theory a better method would just be to detect when these "unwanted" situations occur. However, even though the PDF format is partly open (only for reading, apparently, so it's not truly an open format), extracting complex data like that would be incredibly difficult.

DisgruntledGoat
PDF is absolutely not a closed format.
Azeem.Butt
I quite agree, but the variable I left unstated in the above question is how much effort we want to expend on PDF analysis (answer: not much)
AndrewR
@NSD: I thought Adobe owns it and doesn't publish the format. Maybe that's just Flash?
DisgruntledGoat
@DG The PDF Reference was freely downloadable from Adobe (and as part of Acrobat SDK). Ironically, now that it is an open format it only seems possible to download it from ISO if you pay for it... http://www.adobe.com/devnet/pdf/pdf_reference.html
danio
@NSD tell that to Adobe. They have threatened legal action numerous times in response to Microsoft adding PDF output capabilities to Office.
Josh Einstein
Reading PDFs is free according to the license. Writing them isn't...Hence the saber rattling re: MS Office writing PDFs.
Jason D
Thanks for the down votes everyone :s I corrected my answer since apparently PDFs are not a closed format, even though you have to pay to write them...
DisgruntledGoat
+3  A: 

Short answer:

No, I don't think so.

Long answer:

No, I don't think so, because there are many ways to convert a PowerPoint file to PDF, for example Adobe Acrobat, PDFCreator and many others. It's up to each converter to embed specific information in the PDF file. Even if you find a way to detect a PowerPoint-sourced PDF from one converter, the same method may not work for another.

Even longer answer:

No, I don't think so, because of the reasons described in the "long answer". And I don't think detecting the source of the PDF is the best approach to the problem you are trying to solve. PowerPoint is not the only tool that produces overlapping text and images. I think it's much better to detect the actual layout of the PDF file. If images and text overlap, then you do some filtering or pre-processing to cater for that.

lyxera
A: 

In general, it is not an easy task to programmatically determine (reliably) where a file came from or how it was generated based on its contents. After all, a file is just a collection of bits.

Unless you have a lot of resources to expend building the heuristics to determine whether a file looks "reasonably sane" according to your needs, I would consider this a task for human beings.

Andy West
Even so, a human will not be able to tell if it was OOo, PowerPoint, a LaTeX presentation exported as PDF, a PostScript presentation output as PDF, or a presentation created in QuarkXPress (or a similar DTP tool). All the person will easily be able to say is "Does this look like a presentation, or a document meant for printing?"
Jason D
+4  A: 

The creating application of a PDF is listed in its XMP metadata. You can see this quite easily in Acrobat 9 (and I believe earlier): go to File > Properties, click Additional Metadata..., then go to Advanced and it's listed under both XMP Core Properties and PDF Properties:

xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint

I'm guessing you want to find this programmatically, so you'll need to find a library to read this metadata that works with your language. Here is a list of some XMP tools.
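If pulling in a full XMP library is more effort than you want, a cruder option is to scan the raw file for the `xmp:CreatorTool` property. This is only a sketch under some assumptions: the XMP packet is stored uncompressed (it often is, but nothing guarantees it), the usual `xmp:` prefix is used, and `find_creator_tool` is a name I made up.

```python
import re
from typing import Optional

# Matches both the element form <xmp:CreatorTool>...</xmp:CreatorTool>
# and the attribute form xmp:CreatorTool="..." used in rdf:Description.
_CREATOR_TOOL_RE = re.compile(
    rb"<xmp:CreatorTool>(.*?)</xmp:CreatorTool>"
    rb'|xmp:CreatorTool="(.*?)"',
    re.DOTALL,
)


def find_creator_tool(pdf_bytes: bytes) -> Optional[str]:
    """Return the first xmp:CreatorTool value found in the raw bytes, or None."""
    match = _CREATOR_TOOL_RE.search(pdf_bytes)
    if match is None:
        return None
    # Exactly one of the two groups is set, depending on which form matched.
    raw = match.group(1) if match.group(1) is not None else match.group(2)
    return raw.decode("utf-8", errors="replace").strip()
```

You would call it as `find_creator_tool(open(path, "rb").read())` and then check whether the result contains "PowerPoint". A `None` result only means the property was not found in plain sight, not that the file has no metadata.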

carillonator
I suspect this will only work if the file was created by PowerPoint. If it was printed by PowerPoint into Adobe PDF Creator or another PDF printer driver, wouldn't these fields likely be something else?
Jason R. Coombs
I tried it with the Adobe Acrobat PDF printer driver and with the Mac's built-in Save to PDF (in the print dialog) and it retained PowerPoint as the creator.
carillonator
@carill: Yet, technically, it wasn't created with PowerPoint... It was created with a printer driver from PowerPoint. And if I exported the PPT to an EMF, then printed that, it would place the name of the app printing the EMF... It's a simple heuristic, but not one that guarantees the source was in fact PowerPoint...
Jason D
I've also noticed that it seems to be quite common for the "Title" metadata field to start with "Microsoft Powerpoint"
AndrewR
+1  A: 

All PDFs can have this problem regardless of their source. Most desktop publishing suites are capable of outputting PDF and are often sold boasting their high quality and flashier PDF presentations ...

A "saner" method would be to use a PDF parser such as iTextSharp or PDFNet. Using the library of your choice, find all image rectangles and all text rectangles, SORT THE RECTANGLES, and then see if there is substantial overlap of text and image rects, ignoring image-to-image overlaps. If so, reject the page and/or document.

That won't be perfect, but at least it's going to catch many PDFs that aren't sane, regardless of source. Other heuristics to add would include color analysis. (i.e. are the colors in the overlapping region sufficiently different to allow "sane" results?)
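To make the overlap test concrete, here is a minimal sketch in Python. It assumes you have already pulled the text and image bounding boxes out of a page with whatever parser you picked; the `Rect` layout, the function names and the 0.2 threshold are all my own placeholders. It compares every text rect with every image rect instead of sorting first, which is fine for a page's worth of boxes; sorting only matters once that pairwise pass gets too slow.

```python
from typing import Sequence, Tuple

# (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1, in any consistent page unit.
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate rectangles count as zero."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def intersection_area(a: Rect, b: Rect) -> float:
    """Area shared by two axis-aligned rectangles (0.0 if they don't overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def text_over_image_fraction(
    text_rects: Sequence[Rect], image_rects: Sequence[Rect]
) -> float:
    """Fraction of total text area that lies on top of an image.

    For each text rect only its largest single overlap is counted, so
    images that overlap each other are not double counted.
    """
    total_text = sum(area(t) for t in text_rects)
    if total_text == 0:
        return 0.0
    covered = sum(
        max((intersection_area(t, img) for img in image_rects), default=0.0)
        for t in text_rects
    )
    return covered / total_text


def page_looks_sane(
    text_rects: Sequence[Rect],
    image_rects: Sequence[Rect],
    threshold: float = 0.2,  # arbitrary cut-off; tune on your own files
) -> bool:
    """Accept the page if at most `threshold` of the text area sits over images."""
    return text_over_image_fraction(text_rects, image_rects) <= threshold
```

Diagonal text would need a separate check, since bounding boxes alone don't tell you the text's rotation.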

Best of luck to you

Jason D
A: 

Some PPT-to-PDF converters preserve the creator in comments at the beginning of the PDF.

vitaly.v.ch
A: 

I think that PDFs generated by most applications look much the same. They may have some metadata that you can read from the file...

alexy13