I want to measure the similarity of two PDF files, but I don't want to do a detailed content comparison. Is there a solution that relies only on their external structure? Is it possible? Thanks!

A: 

You can tell if two files are different by running a hash on them (like md5) but that won't tell you the degree of similarity between them.
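As a minimal sketch of that hash check: the function names `file_digest` and `same_bytes` are made up for illustration, and SHA-256 is used as the default rather than md5 (pass `"md5"` if you prefer). It only answers "identical or not", nothing more.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 16):
    """Return the hex digest of a file, reading it in chunks to keep memory low."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_bytes(path_a, path_b):
    """True if both files hash to the same digest (byte-identical, barring a collision)."""
    return file_digest(path_a) == file_digest(path_b)
```

Two PDFs that render identically can still hash differently, for example when only the creation date in the metadata changed.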

There are binary diff programs that can tell you where two binary files differ, with reasonable results, but many binary files, especially document containers, can show a lot of binary difference when there are only minor internal content differences.

I'm not familiar with the details of the PDF format. Maybe somebody else knows of a built-in mechanism that might help.

Arnold Spence
+2  A: 

That sounds potentially tough, but here is some low-hanging fruit from the PDF metadata, in increasing order of difficulty.

  1. Document metadata such as eBook-title and Title
  2. Number of pages in the document (counting /Page directives)
  3. Compare the metadata for each page, such as MediaBox, CropBox, BleedBox, TrimBox
  4. Look for embedded content like images and document-specific fonts and see if they are a perfect match.
  5. Pull out the plain text and compare the words: word counts, most common words, etc. For Western languages, you could just run the PDF through a string-finder like strings on Linux. Or you can go into the file and look for (blah blah blah) Tj, which is how most text is stored in PDF content.
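Here is a rough sketch of items 1, 2, 3 and 5 using nothing but regular expressions over the raw bytes. Everything named here (`pdf_features`, `word_overlap`, the patterns) is hypothetical, and it assumes the objects and page content are not compressed; on a typical compressed PDF it will find little or nothing until the streams are inflated.

```python
import re
from collections import Counter

# /Type /Page, but not /Type /Pages (the page-tree node)
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")
BOX_RE = re.compile(rb"/(MediaBox|CropBox|BleedBox|TrimBox)\s*\[([^\]]*)\]")
# Literal strings; escaped characters are allowed, nested parentheses are not handled
TITLE_RE = re.compile(rb"/Title\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)
TEXT_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)

def pdf_features(data):
    """Collect cheap structural features from raw (uncompressed) PDF bytes."""
    title = TITLE_RE.search(data)
    boxes = Counter(
        (name.decode("ascii"), " ".join(values.decode("ascii", "replace").split()))
        for name, values in BOX_RE.findall(data)
    )
    words = Counter(
        word.lower()
        for text in TEXT_RE.findall(data)
        for word in text.decode("latin-1").split()
    )
    return {
        "title": title.group(1).decode("latin-1") if title else None,
        "pages": len(PAGE_RE.findall(data)),
        "boxes": boxes,
        "words": words,
    }

def word_overlap(words_a, words_b):
    """Multiset overlap of two word Counters, from 0.0 (disjoint) to 1.0 (identical)."""
    union = sum((words_a | words_b).values())
    return sum((words_a & words_b).values()) / union if union else 0.0
```

You would read each file with `open(path, "rb").read()`, call `pdf_features` on both, then compare the page counts, box sets and word overlap however you like.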

Finally, you may be able to cheat by converting them to a raster format with GhostScript or another library and then comparing the images. If you convert to a low resolution, such as 100px, the rough details might look similar.
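A sketch of the raster idea, assuming the Ghostscript command-line tool is installed as `gs` and using its `pgmraw` device so the output can be parsed without an imaging library. The function names and the 10 dpi default are my own choices for illustration; only the first page is rendered, and pages of different sizes are simply scored as dissimilar.

```python
import subprocess

def render_first_page(pdf_path, dpi=10):
    """Render page 1 to a grayscale binary PGM via the Ghostscript CLI (assumed on PATH as `gs`)."""
    cmd = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER", "-sDEVICE=pgmraw",
           f"-r{dpi}", "-dFirstPage=1", "-dLastPage=1", "-sOutputFile=-", pdf_path]
    return subprocess.run(cmd, check=True, capture_output=True).stdout

def parse_pgm(data):
    """Parse a binary PGM (P5, maxval <= 255) into (width, height, pixel bytes)."""
    fields, pos = [], 0
    while len(fields) < 4:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":          # a comment runs to the end of the line
            pos = data.index(b"\n", pos) + 1
            continue
        end = pos
        while end < len(data) and not data[end:end + 1].isspace():
            end += 1
        fields.append(data[pos:end])
        pos = end
    magic, width, height, maxval = fields[0], int(fields[1]), int(fields[2]), int(fields[3])
    if magic != b"P5" or maxval > 255:
        raise ValueError("expected an 8-bit binary PGM")
    pos += 1                                    # one whitespace byte precedes the pixel data
    return width, height, data[pos:pos + width * height]

def raster_similarity(pgm_a, pgm_b):
    """1.0 for identical images, lower as the mean absolute pixel difference grows."""
    wa, ha, pa = parse_pgm(pgm_a)
    wb, hb, pb = parse_pgm(pgm_b)
    if (wa, ha) != (wb, hb):
        return 0.0                              # different page sizes: treated as dissimilar here
    if not pa:
        return 1.0
    diff = sum(abs(x - y) for x, y in zip(pa, pb))
    return 1.0 - diff / (255 * len(pa))
```

Usage would be something like `raster_similarity(render_first_page("a.pdf"), render_first_page("b.pdf"))`. The lower the resolution, the more forgiving the comparison becomes.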

If you've never worked directly with PDF, it's not scary! It's just a text file (after you decompress it) which you can more-or-less parse line-by-line. I discuss PDF more in the HTML document to PDF answer.

jhs
A: 

A PDF is not just a text file. It's a binary format that stores a tree of objects. With compressed object streams, objects can also be stored compressed inside other binary objects, so you cannot see them in the raw file.
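To illustrate the point about compressed data, here is a hedged sketch that tries to inflate every stream with zlib, which is what the common FlateDecode filter uses. The name `inflate_streams` is hypothetical, and the approach is deliberately naive: it ignores the /Length and /Filter entries, skips anything that is not plain zlib data, and does nothing about encryption or other filters. A real PDF library handles all of that properly.

```python
import re
import zlib

# Everything between the "stream" and "endstream" keywords of an object
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(data):
    """Return the decompressed bytes of every stream that zlib can inflate."""
    inflated = []
    for raw in STREAM_RE.findall(data):
        try:
            # decompressobj tolerates the end-of-line bytes left after the zlib data
            inflated.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            continue  # not Flate-compressed (image data, other filters, plain text)
    return inflated
```

Text operators and object dictionaries that are invisible in the raw file often show up in the inflated output.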

If you want to do low-level text manipulation, you really need to use a decent tool. Acrobat 9.0 has a menu option to browse the internal PDF structure, or you can use something like iText.