I watched the traffic when Google displays PDF attachments from Gmail in a new window. The content is served as one PNG image per PDF page, and the text can still be selected. What does Google use on the server side to generate a PNG file for a particular page of a PDF file? And how does text selection on a PNG file work? Any ideas?

+1  A: 

Hey,

If you have the text, you can of course turn it into whatever you want.

More specifically, you should check out this link: pdf to png using php

For that you will need ImageMagick.

Edit: another interesting link.

Edit: I found this at Google, and it looks interesting. You could use the Google Document List Data API, and there is a blog post about it: Google API Now Lets You Get Documents in Many Formats

Of course, to be sure what Google uses, you would need an answer from them. :)

Good luck!

mhd
Hi, thanks for your answer. The links are definitely interesting. I have these big PDFs (~50 MB) as input to my process, and they need to be served to clients on slow connections. However, the clients may only need a few pages in order to make a decision, so we were thinking of serving only a snapshot of a PDF, just like Google does. We need some kind of enterprise product that could help us do that, preferably in Java. This isn't exactly that, but it is helpful. Some more leads I have are http://www.jpedal.org/ and iText.
varun
A: 

You may also want to investigate using Lucene to index those big PDF files and serve the relevant pages to your users.

See http://www.jguru.com/faq/view.jsp?EID=1074237 for more ideas.

Journeyman Programmer
+1  A: 

Google uses a non-open-sourced PDF converter app developed in-house. So you're better off looking into the links posted by other answers, since you can't get your hands on the Google version. Sorry!

Kai
+4  A: 

By default, attachments are viewed securely using https://docs.google.com/gview. However, it turns out you are allowed to request files over plain HTTP, which makes it a little easier to figure out what is going on using Wireshark.

As you indicated, it was already clear that the PDF is converted on the server side to PNG images (ImageMagick is indeed a reasonable solution for this purpose). The obvious reason is to preserve the exact layout while still letting the user view the file without a PDF viewer.
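For anyone who wants to try the ImageMagick route, here is a minimal sketch of rendering a single page to PNG from Python. It is not what Google runs (nobody here knows that); it only assumes ImageMagick and Ghostscript are installed. The function names, file names and the default of 150 DPI are made up for illustration.

```python
import subprocess


def build_convert_command(pdf_path, page_index, png_path, dpi=150):
    """Build an ImageMagick command line that renders one PDF page to PNG.

    "file.pdf[N]" selects the zero-based page N, and -density sets the
    resolution used when rasterizing the page. On ImageMagick 7 the
    executable is usually "magick" rather than the legacy "convert".
    """
    return [
        "convert",
        "-density", str(dpi),
        f"{pdf_path}[{page_index}]",
        png_path,
    ]


def render_page(pdf_path, page_index, png_path, dpi=150):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(
        build_convert_command(pdf_path, page_index, png_path, dpi),
        check=True,
    )
```

Calling `render_page("big.pdf", 2, "page-3.png")` would write only the third page, so a large PDF never has to be sent to the client in full.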

However, from looking at the traffic I found out that the entire PDF is also converted to a custom XML format when calling /gview?a=gt&docid=&chan=&thid= (this is done as soon as you request the document). As I couldn't use Wireshark to copy the XML I resorted to the Firefox extension Live HTTP Headers. Here's an excerpt:

<pdf2xml>
    <meta name="Author" content="Bruce van der Kooij"/>
    <meta name="Creator" content="Writer"/>
    <meta name="Producer" content="OpenOffice.org 3.0"/>
    <meta name="CreationDate" content="20090218171300+01'00'"/>
    <page t="0" l="0" w="595" h="842">
        <text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
        <text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
    </page>
</pdf2xml>

I'm not quite sure yet what all the attributes on the text element stand for (with the exception of w and h), but they are obviously the coordinates of the text and possibly its length. As the JavaScript Google uses is minified (or possibly obfuscated, though that is unlikely), figuring out precisely how the client-side selection function works is not that easy. Most likely it uses this XML file to work out which text the user is selecting and then copies that to the user's clipboard.
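To make the guess about selection concrete, here is a rough sketch of how a client could map a point on the page image back to text using this XML. This is a guess at the idea, not Google's code. It assumes l and t mean left and top, that all four values are in the same units as the page's w and h, and that the point has already been scaled from image pixels to those units. The names `load_lines` and `text_at` are invented.

```python
import xml.etree.ElementTree as ET

# The excerpt from above, trimmed to the page and text elements.
SAMPLE = """<pdf2xml>
    <page t="0" l="0" w="595" h="842">
        <text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
        <text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
    </page>
</pdf2xml>"""


def load_lines(xml_text):
    """Return a list of (box, text) pairs, one per <text> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.iter("text"):
        # Assumption: l/t are the left/top corner, w/h the box size.
        box = {key: int(node.get(key)) for key in ("l", "t", "w", "h")}
        lines.append((box, node.text))
    return lines


def text_at(lines, x, y):
    """Return the text whose bounding box contains (x, y), or None."""
    for box, content in lines:
        inside_x = box["l"] <= x < box["l"] + box["w"]
        inside_y = box["t"] <= y < box["t"] + box["h"]
        if inside_x and inside_y:
            return content
    return None
```

A real viewer would do this for both ends of a mouse drag and collect everything in between, but the lookup itself needs nothing more than the bounding boxes.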

Note that there is an open source (GPL licensed) tool called pdf2xml which has similar but not quite the same output. Here's the example from their homepage:

<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="3">
  <title>My Title</title>
  <page width="780" height="1152">
    <font size="10" face="MHCJMH+FuturaT-Bold" color="#FF0000">
      <text x="324" y="37" width="132" height="10">Friday, September 27, 2002</text>
      <img x="324" y="232" width="277" height="340" src="text_pic0001.png"/>
      <link x="324" y="232" width="277" height="340" dest_page="2" dest_x="141" dest_y="187"/>
    </font>
    <font size="12" face="AGaramond-Regular" italic="true" bold="true">
      <text x="509" y="68" width="121" height="12">This is a test PDF file</text>
      <link x="509" y="68" width="121" height="12" href="www.mobipocket.com"/>
    </font>
  </page>
</pdf2xml>

I hope this information is useful in some way. However, as one of the other posters mentioned, the only way to be sure what Google does is to ask them. It's a shame Google doesn't have an official IRC channel, but they do have a forum for Google Docs support questions.

Good luck.

Bruce van der Kooij
I guess t and l stand for top and left. Google also doesn't need the font data, as the font is already rendered into the PNG. So pdf2xml is probably the generator, with the XML parsed afterwards and some data removed.
Ivan Vučica
A: 

To see what a PDF was created with, right-click on it and open Document Properties (in Adobe Reader). The tool is listed as the "PDF Producer". I think Google uses both Prince and iText (not in combination) for creating PDFs. Google has made some major modifications to these toolkits to create the end product.

jle
A: 

Well, this might just be the pdf2xml tool that Google is using. They only abbreviated the full words (width, height, etc.) and added the p attribute, which turns out to contain the coordinates of the words inside the line. Just played with it and found out :) I'm going to use this pdf2xml from Google :P Upload, let them convert, then use the XML to transform to... EPUB? :P
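Here is a small sketch of reading the p attribute. It assumes the numbers come in (left, width) pairs, one pair per whitespace-separated word, which is an interpretation that fits the excerpt posted above (for example "188,213" matches l="188" w="213" on the single-word line) but is not documented anywhere. The function name `word_boxes` is made up.

```python
def word_boxes(text, p):
    """Pair each word of a line with its assumed (left, width) from p."""
    numbers = [int(n) for n in p.split(",")]
    # Assumption: even positions are left offsets, odd positions are widths.
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    words = text.split()
    if len(pairs) != len(words):
        raise ValueError("p does not hold one (left, width) pair per word")
    return [(word, left, width) for word, (left, width) in zip(words, pairs)]


boxes = word_boxes(
    "Nederland Open in Verbinding (NOiV)",
    "85,117,209,61,277,21,305,124,436,75",
)
for word, left, width in boxes:
    print(word, left, width)
```

With per-word boxes like these, a selection could snap to whole words instead of whole lines.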

Jeroen