Hello,

I want to parse PDF files hosted on websites.

Can anyone explain how to extract all the words (word by word) from a PDF file using Java?

The code below extracts the content of a PDF file and writes it to another PDF file. I want the program to write it to a text file instead.

import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;

public class pdf {

    private static String INPUTFILE = "http://www.britishcouncil.org/learning-infosheets-medicine.pdf" ;
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException,
            IOException {

        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document,
                new FileOutputStream(OUTPUTFILE));
        document.open();

        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();
        PdfImportedPage page;

        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }

        document.close();
    }
}

Thanks in advance

+1  A: 

Take a look at this:

How to Read PDF File in Java (uses the Apache PDFBox library)
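
Once the text is out of the PDF (for example as the String returned by PDFBox's PDFTextStripper.getText, or page by page with iText's PdfTextExtractor.getTextFromPage), the "word by word" part and the text file output need only the standard library. Here is a minimal sketch of that second half. The class name WordDump, its method names, the words.txt default and the sample string are made up for illustration, and treating every run of non-letter, non-digit characters as a separator is an assumption that may not suit hyphenated words or apostrophes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns already-extracted PDF text into a word list
// and writes it to a plain text file, one word per line.
class WordDump {

    // Splits on any run of characters that are neither letters nor digits
    // (an assumed definition of "word"; adjust the regex as needed).
    static List<String> splitIntoWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // a leading separator yields one empty token
                words.add(token);
            }
        }
        return words;
    }

    // Writes the words as UTF-8, one per line, replacing any existing file.
    static void writeWords(List<String> words, Path target) throws IOException {
        Files.write(target, words, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder input; in practice this would be the extractor's output.
        String extracted = "Hello, PDF world!\nTwo  lines.";
        Path target = Paths.get(args.length > 0 ? args[0] : "words.txt");
        writeWords(splitIntoWords(extracted), target);
    }
}
```

In the question's program, the loop over pages would collect text instead of importing pages as images, and the collected string would be passed to splitIntoWords.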

Leniel Macaferi
PDFBox is great.
Adrian Petrescu