Hi,

Do you know of a Java library that can extract the text of a PDF document as a string while preserving all empty lines and whitespace exactly as they appear in the original document?

I am currently using the PDFTextStripper class from the PDFBox-0.7.3 library. Its getText() method returns the document as a string, but it also removes all empty lines, tabs, and runs of spaces between the text. Line breaks are preserved, so I can recognize the structure of the document, but I need to keep the other whitespace as well. This is the default behaviour of getText(), and there seems to be no way to make it preserve the whitespace (I could not find any method in the API for this purpose).

Thank you for your help.

A: 

You might want to take a look at iText. The PdfReader class looks useful.

matt b
+1  A: 

Are you sure there are line feeds, tabs, and space characters in the document? Many of the PDFs I've encountered use positioning for spacing and indentation. So rather than including line feeds and tabs, the text object is simply placed further down the page and offset. In that case PDFBox isn't removing anything from the text; the spaces were never there.

If you haven't looked at the PDF source yet, that could be helpful. If it's compressed you can use Multivalent Uncompress to make it readable. The PDF specification describes the text-positioning operators in section 9.4.2.
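To make the positioning point concrete, here is a rough Java sketch that walks a heavily simplified content stream. It assumes the stream contains only whitespace-separated `tx ty Td` and `(string) Tj` operations between `BT` and `ET`, and that the strings contain no spaces or parentheses; real PDFs are far messier. The class and method names (`TdOffsetSketch`, `place`) are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

final class TdOffsetSketch {

    /** A shown string plus the line-start position in effect when it was shown. */
    static final class PlacedText {
        final String text;
        final double x;
        final double y;

        PlacedText(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Walks a simplified stream of "tx ty Td" and "(string) Tj" operations.
     * Each Td offset is relative to the start of the current line, so the
     * offsets accumulate. No space characters are ever emitted.
     */
    static List<PlacedText> place(String stream) {
        List<PlacedText> out = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        double x = 0;
        double y = 0;
        for (String token : stream.trim().split("\\s+")) {
            if (token.equals("BT") || token.equals("ET")) {
                // A text object starts (or ends): positions begin again at the origin.
                x = 0;
                y = 0;
                operands.clear();
            } else if (token.equals("Td")) {
                x += Double.parseDouble(operands.get(0));
                y += Double.parseDouble(operands.get(1));
                operands.clear();
            } else if (token.equals("Tj")) {
                String s = operands.get(0);
                // Strip the surrounding parentheses of the string operand.
                out.add(new PlacedText(s.substring(1, s.length() - 1), x, y));
                operands.clear();
            } else {
                operands.add(token);
            }
        }
        return out;
    }
}
```

For a stream such as `BT 72 700 Td (Name:) Tj 200 0 Td (Smith) Tj ET`, the two strings end up 200 units apart on the same line, yet neither string contains a space, so there is nothing for a text extractor to "preserve".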

John Lemberger
+1  A: 

I had the same problem and solved it by extending the PDFTextStripper class and adding coordinates in front of every line (it was not easy, though). For your problem you could attach coordinates to every word, e.g. by returning not Strings but a List of your own objects (a class holding the word, x and y). You would then be able to reconstruct tabs and multiple spaces from the coordinates afterwards.
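A minimal sketch of the reconstruction step, assuming you already have the words of one line with their coordinates (how you get them out of the stripper is not shown here). The `Word` class, the `layoutLine` method and the fixed `charWidth` are hypothetical names and simplifications of mine, not PDFBox API; a fixed character width only approximates proportional fonts.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SpacingSketch {

    /** One extracted word with the position of its left edge. */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Rebuilds a single line of text. Each word is placed at the column
     * round(x / charWidth), padding with spaces as needed. The words are
     * assumed to share (roughly) the same y, so y is not consulted here.
     */
    static String layoutLine(List<Word> words, double charWidth) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Word w) -> w.x));

        StringBuilder line = new StringBuilder();
        for (Word w : sorted) {
            int column = (int) Math.round(w.x / charWidth);
            // Keep at least one space between words that would otherwise collide.
            if (line.length() > 0 && column <= line.length()) {
                column = line.length() + 1;
            }
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(w.text);
        }
        return line.toString();
    }
}
```

With an assumed character width of 6 units, a word "Total" at x = 0 and a word "42" at x = 60 come out as "Total" followed by five spaces and then "42". Grouping words into lines by their y coordinate, and inserting empty lines for large vertical gaps, would follow the same idea.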

Greetz, GHad

GHad
A: 

You can also use JPedal for text extraction. It may well be that there are no spaces in the text; remember, PDF is a display format.

mark stephens
+1  A: 

You might want to try our PDFTextStream library. We try very hard to maximize the fidelity of the text extracted by PDFTextStream relative to its displayed presentation, so spacing is maintained as much as possible. There are also a couple of optional extraction modes (different implementations of the OutputHandler interface, actually) that let you control how the extracted text is formatted, which in turn affects the spacing.

Chas Emerick