ansaurus

Question

Read a PDF upload stream one page at a time with Java

Answer 1

A:

I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.
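A PDF has no literal page-break byte, so the closest thing a byte-by-byte scan can look for is the `/Type /Page` entry that each page dictionary carries. The following is only a naive sketch of that idea, assuming the page objects are stored uncompressed; files that pack objects into compressed object streams (PDF 1.5 and later) would be undercounted, and the marker could in principle also show up inside other data. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

final class PageMarkerScanner {

    /**
     * Reads the stream one byte at a time and counts name tokens "/Page"
     * that directly follow the name token "/Type" (so "/Pages" is ignored).
     */
    static int countPageMarkers(InputStream in) throws IOException {
        int count = 0;
        boolean afterType = false;   // true when the last token seen was /Type
        StringBuilder name = null;   // non-null while we are inside a /Name token
        int b;
        do {
            b = in.read();
            boolean ws = b == ' ' || b == '\n' || b == '\r'
                    || b == '\t' || b == '\f' || b == 0;
            boolean delim = b == -1 || ws || "/<>[](){}%".indexOf(b) >= 0;

            if (name != null && delim) {
                // The current name token ends at this byte.
                String n = name.toString();
                if (afterType && n.equals("/Page")) {
                    count++;
                }
                afterType = n.equals("/Type");
                name = null;
            }

            if (b == '/') {
                name = new StringBuilder("/");   // a new name token starts
            } else if (name != null) {
                // Keep at most 8 characters: enough to tell /Page from /Pages.
                if (name.length() < 8) {
                    name.append((char) b);
                }
            } else if (!ws) {
                afterType = false;   // some other token came between the names
            }
        } while (b != -1);
        return count;
    }
}
```

Calling `PageMarkerScanner.countPageMarkers(uploadStream)` consumes the stream once and never holds more than a few bytes in memory, but it only tells you how many page dictionaries it saw, not where each page's content lives.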

tkotitan 2009-02-25 14:55:36

Answer 2

+1 A:

For a given generic PDF document you have no way of knowing where one page ends and another one starts, using PDFBox at least.

If your concern is the use of resources, I suggest you parse the PDF document into a COSDocument and extract the parsed objects from it using getObjects(), which will give you a java.util.List. This should be easy to fit into whatever scarce resources you have.

Note that you can easily convert your parsed PDF documents into Lucene indexes through the PDFBox API.

Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox is able to make an in-memory representation of quite large PDF documents without much effort.

For parsing the PDF document from an InputStream, look at the COSDocument class.

For writing Lucene indexes, look at the LucenePDFDocument class.

For in-memory representations of COSDocuments, look at FDFDocument.

Steen 2009-03-02 20:39:15

Answer 3

A:

I need to try something similar. I have a PDF on my server which I don't want the user to download in full. The user may need to see only a few pages of this PDF to make a business decision. When the user clicks the "view" button, I want them to see an interface with the PDF thumbnails. Then, when they click a particular thumbnail, or click next or previous, only that page should be fetched from the server and shown. Can anybody think of a way to do this?

varun 2009-04-19 10:08:06

Answer 4

A:

Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.

Here is an example copied from the link above which shows how to draw a PDF page into an image:

    import java.awt.Image;
    import java.awt.Rectangle;
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import com.sun.pdfview.PDFFile;
    import com.sun.pdfview.PDFPage;

    // memory-map the file so PDFFile can read it through a ByteBuffer
    File file = new File("test.pdf");
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    FileChannel channel = raf.getChannel();
    ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    PDFFile pdffile = new PDFFile(buf);

    // draw the first page to an image
    PDFPage page = pdffile.getPage(0);

    // get the width and height for the doc at the default zoom
    Rectangle rect = new Rectangle(0, 0,
            (int) page.getBBox().getWidth(),
            (int) page.getBBox().getHeight());

    // generate the image
    Image img = page.getImage(
            rect.width, rect.height, // width & height
            rect, // clip rect
            null, // null for the ImageObserver
            true, // fill background with white
            true  // block until drawing is done
            );
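The example above reads from a file on disk, while the question is about an upload stream. A PDF keeps its cross-reference table at the end of the file, so a parser generally needs the whole document before it can locate any page. One way to bridge the two, sketched here under the assumption that the upload arrives as a plain `InputStream`, is to spool it to a temporary file and memory-map that file to get the kind of `ByteBuffer` used in the example. The class and method names are hypothetical, and only standard Java NIO calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class UploadSpooler {

    /**
     * Copies the upload to a temp file and returns a read-only mapping of it,
     * so the document is paged in by the OS instead of held on the Java heap.
     */
    static ByteBuffer spoolToBuffer(InputStream upload) throws IOException {
        Path tmp = Files.createTempFile("upload-", ".pdf");
        tmp.toFile().deleteOnExit();   // best-effort cleanup when the JVM exits
        Files.copy(upload, tmp, StandardCopyOption.REPLACE_EXISTING);

        // The mapping stays valid after the channel is closed.
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

The returned buffer could then be passed to `new PDFFile(buf)` in place of the one mapped from `test.pdf`. The trade-off is disk space for heap space: the whole upload is still received, but pages are rendered only when requested.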

kepler 2010-08-19 14:12:23
