I am trying to read a PDF document in a J2EE application.

For a web application I have to store PDF documents on disk. To make searching easy, I want to build a reverse index of the text inside each document, provided the document contains OCR text.

With the PDFBox library it is possible to create a PDDocument object, which contains an entire PDF file. However, to save memory and improve overall performance, I'd rather handle the document as a stream and read one page at a time into a buffer.

I wonder if it is possible to read a file stream containing a PDF page by page, or even one line at a time.

A: 

I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.

tkotitan
+1  A: 

For a given generic PDF document you have no way of knowing where one page ends and the next one starts, at least using PDFBox.

If your concern is resource usage, I suggest you parse the PDF document into a COSDocument and extract the parsed objects from it using getObjects(), which gives you a java.util.List. This should be easy to fit into whatever scarce resources you have.

Note that you can easily convert your parsed PDF documents into Lucene indexes through the PDFBox API.

Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox is able to make an in-memory representation of quite large PDF documents without much effort.

For parsing the PDF document from an InputStream, look at the COSDocument class.

For writing Lucene indexes, look at the LucenePDFDocument class.

For in-memory representations of PDF documents, look at PDDocument (FDFDocument is the equivalent for FDF form data).
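
If Lucene is more than you need, the reverse index mentioned in the question can be sketched with plain collections. This is only a minimal sketch: the PageIndex class below is hypothetical (it is not part of PDFBox), and it assumes the text of each page has already been extracted, for example with PDFBox's PDFTextStripper limited to one page via setStartPage and setEndPage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not part of PDFBox): maps each term to the
// 1-based page numbers on which it occurs.
final class PageIndex {

    // pageTexts.get(0) is assumed to hold the extracted text of page 1, and so on.
    static Map<String, SortedSet<Integer>> build(List<String> pageTexts) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNumber = i + 1;
            String lower = pageTexts.get(i).toLowerCase(Locale.ROOT);
            // Split on any run of characters that are neither letters nor digits.
            for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) {
                    continue; // a leading separator yields an empty first token
                }
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<>();
        pages.add("Invoice for PDF storage");
        pages.add("Storage terms and conditions");

        Map<String, SortedSet<Integer>> index = build(pages);
        System.out.println(index.get("storage")); // prints [1, 2]
        System.out.println(index.get("invoice")); // prints [1]
    }
}
```

Because only one page of text has to be in memory at a time while the index is built, the extraction step can be run page by page even if the parsed document itself stays loaded.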

Steen
A: 

I need to do something similar. I have a PDF on my server which I don't want the user to download in its entirety. The user may only need to see a few pages of this PDF to make a business decision. When the user clicks the "view" button, I want them to see an interface with the PDF thumbnails. Then, when they click on a particular thumbnail, or click next or previous, only that page should be fetched from the server and shown. Can anybody think of a way to do this?

varun
A: 

Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.

Here is an example copied from the link above which shows how to draw a PDF page into an image:

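    import java.awt.Image;
    import java.awt.Rectangle;
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import com.sun.pdfview.PDFFile;
    import com.sun.pdfview.PDFPage;

    File file = new File("test.pdf");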
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    FileChannel channel = raf.getChannel();
    ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    PDFFile pdffile = new PDFFile(buf);

    // draw the first page to an image
    PDFPage page = pdffile.getPage(0);

    // get the width and height for the doc at the default zoom
    Rectangle rect = new Rectangle(0,0,
            (int)page.getBBox().getWidth(),
            (int)page.getBBox().getHeight());

    // generate the image
    Image img = page.getImage(
            rect.width, rect.height, //width & height
            rect, // clip rect
            null, // null for the ImageObserver
            true, // fill background with white
            true  // block until drawing is done
            );
kepler