ansaurus

Question

Java: how to write out PDF to a text file?

Answer 1

A:

PDF is a binary format, so you cannot read it as a plain text file. You will need a third-party library to read the PDF contents.
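To illustrate the point, here is a minimal sketch using only the JDK. It checks whether a file begins with the `%PDF-` signature that a well-formed PDF starts with. The class and method names (`PdfSignature`, `looksLikePdf`) are hypothetical and chosen only for this example. Passing the check only means the file is probably a PDF. The page text usually sits in compressed streams, so a PDF library is still needed to get at it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of any PDF library.
class PdfSignature {

    // The ASCII signature that a well-formed PDF file starts with.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the first bytes of the file match the PDF signature.
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int n = 0;
            // read() may return fewer bytes than requested, so loop until
            // the buffer is full or the stream ends.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            return n == MAGIC.length && Arrays.equals(head, MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfSignature <file>");
            return;
        }
        System.out.println(looksLikePdf(Paths.get(args[0])));
    }
}
```

If this returns true, hand the file to a PDF library rather than a plain `Reader`.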

Chandra Mohan 2009-11-10 09:51:09

You're right, thanks for your response. I've tried iText, but the resulting content is missing some letters that have diacritics. Any ideas? Thanks.

Stephan 2009-11-10 09:58:37

Answer 2

A:

iText is an API for creating PDFs from scratch, but in order to read and edit an existing file you can look at the following link: http://www.lowagie.com/iText/

i2ijeya 2009-11-10 10:03:15

I've tried iText, but the resulting content is missing some letters that have diacritics. Any ideas? Thanks.

Stephan 2009-11-10 10:06:17

Answer 3

+1 A:

You can try JavaPDF. It has an API for this job: call the method extractTextFromPage(int pageIndex) on the PDFReader class.

Joset 2009-11-10 10:09:51

Answer 4

+1 A:

iText may not be reading all the letters correctly because of the encoding used for the font. You could declare the font like this:

BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.EMBEDDED);

where BaseFont.CP1252 is the encoding used. Be aware that not every font supports every encoding.

Bobby 2009-11-10 10:18:41

Thank you very much, your suggestion solved part of the problem :D Cheers

Stephan 2009-11-10 10:48:01

Answer 5

A:

You have to use a specialised package. Two that I have used are pdftotext (http://en.wikipedia.org/wiki/Pdftotext) and PDFBox (http://incubator.apache.org/pdfbox/). Even with a package you cannot always guarantee success, because some PDF-writing tools are of poor quality and generate poorly formed PDFs.

peter.murray.rust 2009-11-10 10:33:27

Thank you for your suggestion, I will try it out.

Stephan 2009-11-10 10:51:50

Answer 6

+1 A:

Using the iText helper class PdfTextExtractor should work fine. Just check that you're using the right encoding when writing the file to disk:

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "ISO-8859-1");
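As a rough sketch of why the output encoding matters for diacritics, the JDK-only example below checks whether a charset can represent a string before writing it. The class name `ExtractedTextWriter`, its methods, and the sample text are assumptions made for illustration and do not come from iText. ISO-8859-1 only covers code points up to U+00FF, so letters such as "ș" (U+0219) or "ă" (U+0103) cannot be stored in it and are written as "?", whereas UTF-8 can encode any Unicode character.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper for writing already-extracted text to disk.
class ExtractedTextWriter {

    // True if every character of the text can be represented in the charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Writes the text to the file using an explicit charset instead of the
    // platform default.
    static void write(File file, String text, Charset charset) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file), charset)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample text (an assumption): contains U+0219, U+0103 and U+00E9.
        String text = "\u0219coal\u0103 caf\u00e9";

        System.out.println("ISO-8859-1 can encode: "
                + canEncode(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 can encode: "
                + canEncode(text, StandardCharsets.UTF_8));

        File file = File.createTempFile("extracted", ".txt");
        write(file, text, StandardCharsets.UTF_8);

        // Read the bytes back with the same charset to confirm the round trip.
        String back = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        System.out.println("Round trip intact: " + back.equals(text));
    }
}
```

If the extracted string already lacks the accented letters before it is written, the loss happened during extraction and changing the output charset will not recover them.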

FRotthowe 2009-11-10 11:55:38

Thank you for your response. It made my work a lot easier, but it still didn't resolve my problem with some diacritics.

Stephan 2009-11-10 13:37:52

Answer 7

A:

Our PDFTextStream library provides comprehensive support for diacriticals, as well as all character sets defined in the Unicode standard (including Chinese, Japanese, and Korean characters, in both horizontal and vertical writing modes). You might find that it extracts those diacriticals properly where other tools do not.

There are circumstances where a character, when extracted to text, will not appear to be the same as when it is displayed by a PDF reader like Acrobat -- this is most often the case when the text in question is rendered using an image-based font (which obviously doesn't convert directly to text, and would require an OCR process in order to derive the proper accented character(s)).

Chas Emerick 2009-12-07 14:05:52
