ansaurus

Question

Is there a reliable way to determine if a PDF was generated from a PowerPoint file?

Answer 1

A:

It might put its name in the creator or producer info, but I don't have a copy to check this theory with.

Azeem.Butt 2009-10-25 23:21:27

That field can contain arbitrary text. It's programmatically unreliable.

Jason D 2009-12-01 07:10:50

Answer 2

+1 A:

Your reasoning is very arbitrary - there are surely plenty of PPT files without the features you describe, and plenty of PDF files with them, that were generated from another source.

In theory a better method would just be to detect when these "unwanted" situations occur. However, even though the PDF format is partly open (only for reading, apparently, so it's not truly an open format), extracting complex data like that would be incredibly difficult.

DisgruntledGoat 2009-10-25 23:30:19

PDF is absolutely not a closed format.

Azeem.Butt 2009-10-25 23:32:11

I quite agree, but the variable I left unstated in the above question is how much effort we want to expend on PDF analysis (answer: not much)

AndrewR 2009-10-25 23:32:46

@NSD: I thought Adobe owns it and doesn't publish the format. Maybe that's just Flash?

DisgruntledGoat 2009-10-26 00:31:30

@DG The PDF Reference was freely downloadable from Adobe (and as part of Acrobat SDK). Ironically, now that it is an open format it only seems possible to download it from ISO if you pay for it... http://www.adobe.com/devnet/pdf/pdf_reference.html

danio 2009-10-26 10:54:06

@NSD tell that to Adobe. They have threatened legal action numerous times in response to Microsoft adding PDF output capabilities to Office.

Josh Einstein 2009-11-29 04:15:39

Reading PDFs is free according to the license. Writing them isn't... hence the saber rattling re: MS Office writing PDFs.

Jason D 2009-12-01 07:14:33

Thanks for the down votes everyone :s I corrected my answer since apparently PDFs are not a closed format, even though you have to pay to write them...

DisgruntledGoat 2009-12-01 14:02:38

Answer 3

+3 A:

Short answer:

No, I don't think so.

Long answer:

No, I don't think so, because there are many ways to convert a PowerPoint file to PDF, for example Adobe Acrobat, PDFCreator, and many others. It's up to each converter to embed specific information in the PDF file. Even if you find a way to detect a PowerPoint-sourced PDF from one converter, the same method may not work for another.

Even longer answer:

No, I don't think so, for the reasons described in the "long answer". And I don't think detecting the source of the PDF is the best approach to the problem you are trying to solve. PowerPoint is not the only application that produces overlapping text and images. I think it's much better to detect the actual layout of the PDF file. If text is overlaid on an image, you can do some filtering or pre-processing to cater for that.

lyxera 2009-11-26 05:06:41

Answer 4

A:

In general, it is not an easy task to programmatically determine (reliably) where a file came from or how it was generated based on its contents. After all, a file is just a collection of bits.

Unless you have a lot of resources to expend building the heuristics to determine whether a file looks "reasonably sane" according to your needs, I would consider this a task for human beings.

Andy West 2009-11-29 04:01:44

Even so, a human will not be able to tell whether it came from OOo, PowerPoint, a LaTeX presentation exported as PDF, a PostScript presentation output as PDF, or a presentation created in QuarkXPress (or a similar DTP tool). All the person will easily be able to say is "Does this look like a presentation, or a document meant for printing?"

Jason D 2009-12-01 07:09:43

Answer 5

+4 A:

The creating application of a PDF is listed in its XMP metadata. You can see this quite easily in Acrobat 9 (and I believe earlier): go to File > Properties, click Additional Metadata..., then go to Advanced and it's listed under both XMP Core Properties and PDF Properties:

xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint

I'm guessing you want to find this programmatically, so you'll need to find a library to read this metadata that works with your language. Here is a list of some XMP tools.
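
As a rough illustration of reading that field without a full XMP library, here is a minimal Python sketch that scans the raw bytes of a PDF for `xmp:CreatorTool`. It assumes the XMP packet is stored uncompressed and unencrypted, which is common but not guaranteed; the function names are made up for this example. As the comments on this thread point out, the value is free text, so treat the result as a hint rather than proof.

```python
import re

# XMP can serialize a property as an element or as an attribute, so try both.
_ELEMENT = re.compile(
    rb"<xmp:CreatorTool>\s*(?P<v>.*?)\s*</xmp:CreatorTool>", re.DOTALL
)
_ATTRIBUTE = re.compile(
    rb'xmp:CreatorTool\s*=\s*(?P<q>["\'])(?P<v>.*?)(?P=q)', re.DOTALL
)


def xmp_creator_tool(pdf_bytes):
    """Return the xmp:CreatorTool value found in raw PDF bytes, or None.

    Assumption: the metadata stream is not compressed or encrypted.
    """
    for pattern in (_ELEMENT, _ATTRIBUTE):
        match = pattern.search(pdf_bytes)
        if match:
            return match.group("v").decode("utf-8", errors="replace")
    return None


def looks_like_powerpoint(pdf_bytes):
    """Heuristic only: True if the creator tool mentions PowerPoint."""
    tool = xmp_creator_tool(pdf_bytes)
    return tool is not None and "powerpoint" in tool.lower()
```

To use it on a file, read the file in binary mode and pass the bytes, e.g. `looks_like_powerpoint(open(path, "rb").read())`. A `None` result only means the field was not found in plain form, not that the file lacks metadata.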

carillonator 2009-11-29 04:36:24

I suspect this will only work if the file was created by PowerPoint. If it was printed from PowerPoint to Adobe PDF Creator or another PDF printer driver, wouldn't these fields likely be something else?

Jason R. Coombs 2009-11-30 03:33:57

I tried it with the Adobe Acrobat PDF printer driver and with the Mac's built-in Save to PDF (in the print dialog) and it retained PowerPoint as the creator.

carillonator 2009-11-30 05:05:54

@carill: Yet, technically, it wasn't created with PowerPoint. It was created with a printer driver from PowerPoint. And if I exported the PPT to an EMF, then printed that, it would place the name of the app printing the EMF. It's a simple heuristic, but not one that guarantees the source was in fact PowerPoint.

Jason D 2009-12-01 07:13:09

I've also noticed that it seems to be quite common for the "Title" metadata field to start with "Microsoft PowerPoint".

AndrewR 2009-12-02 05:10:55

Answer 6

+1 A:

All PDFs can have this problem regardless of their source. Most desktop publishing suites are capable of outputting PDF and are often sold boasting their high quality and flashier PDF presentations ...

A "saner" method would be to use a PDF parser such as iTextSharp or PDFNet. Using the library of your choice, find all image rectangles and all text rectangles, SORT THE RECTANGLES, and then see if there is substantial overlap of text and image rects -- ignoring image-to-image overlaps. If so, reject the page and/or document.

That won't be perfect, but at least it's going to catch many PDFs that aren't sane, regardless of source. Other heuristics to add would include color analysis. (i.e. are the colors in the overlapping region sufficiently different to allow "sane" results?)
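
The overlap check described above might look roughly like this in Python. This is only a sketch: it assumes a PDF library has already given you the text and image bounding boxes for a page, and `Rect`, `page_has_text_over_image`, and the 0.5 threshold are illustrative choices, not anything from a particular library. It also compares every text rect against every image rect instead of sorting them first, which is simpler and adequate for pages with a modest number of rectangles.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding box with (x0, y0) as the lower-left corner."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self):
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection_area(a, b):
    """Area shared by two rectangles, or 0.0 if they do not overlap."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def page_has_text_over_image(text_rects, image_rects, threshold=0.5):
    """True if any text rect has at least `threshold` of its area on an image.

    Image-to-image overlaps are ignored because only text/image pairs
    are ever compared.
    """
    for text in text_rects:
        if text.area == 0:
            continue
        for image in image_rects:
            if intersection_area(text, image) / text.area >= threshold:
                return True
    return False
```

A caller would run this per page and reject the page (or the whole document) when it returns `True`. The threshold is the knob to tune: a lower value catches captions that merely touch an image, a higher one only flags text sitting squarely on top of it.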

Best of luck to you

Jason D 2009-12-01 07:21:51

Answer 7

A:

Some PPT-to-PDF converters preserve the creator in comments at the beginning of the PDF.
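
To inspect those comments, a small Python sketch like the one below could collect the `%` comment lines at the very start of the file. It assumes the converter writes its note as ordinary PDF comment lines before the first object; `leading_comments` is a hypothetical helper name, and what a given converter writes there (if anything) varies.

```python
def leading_comments(pdf_bytes, max_lines=20):
    """Return the '%' comment lines that open a PDF, without the '%'.

    Stops at the first non-blank line that is not a comment. Only the
    first few kilobytes are examined, since the header sits at the start.
    """
    comments = []
    for raw in pdf_bytes[:4096].splitlines()[:max_lines]:
        line = raw.strip()
        if not line:
            continue
        if not line.startswith(b"%"):
            break
        # latin-1 maps every byte to a character, so binary marker
        # lines decode without raising.
        comments.append(line[1:].decode("latin-1").strip())
    return comments
```

The first entry is normally the version header (for example `PDF-1.4`), often followed by a line of high-bit characters marking the file as binary; anything after that is converter-specific and can be searched for "PowerPoint" or "ppt".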

vitaly.v.ch 2009-12-01 11:00:49

Answer 8

A:

I think that PDFs generated from most applications seem to be the same. It may have some metadata that you can read from the file...

alexy13 2009-12-01 21:59:10
