When I open a PDF file and write its content to a text file, the text comes out garbled. I think the cause is the encoding. As I understand it, the JVM sets the default character set to Cp1252 because I'm running on Windows XP. I tried changing the default character set with System.setProperty("file.encoding", "ISO-8859-1"); but it had no effect.

  • I've tried to use iText, but the extracted text is missing some letters that have diacritics.

Any ideas?

A: 

PDF is a binary format, so you cannot read it as a plain text file. You will need a third-party library to extract the PDF contents.
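
To illustrate the point, here is a minimal sketch in plain Java (no PDF library) that only checks whether a file begins with the %PDF- marker that opens a PDF file. The class name PdfHeaderCheck and the method looksLikePdf are made up for this example. What follows the header is a set of binary objects, typically with compressed streams, so reading it through a Reader will not give you the visible text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class PdfHeaderCheck {

    // Returns true if the given bytes start with the "%PDF-" marker.
    static boolean looksLikePdf(byte[] head) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfHeaderCheck <file>");
            return;
        }
        // Read at most the first five bytes of the file as raw bytes.
        byte[] head = new byte[5];
        int filled = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n;
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) > 0) {
                filled += n;
            }
        }
        boolean pdf = looksLikePdf(Arrays.copyOf(head, filled));
        System.out.println(pdf ? "PDF header found" : "no PDF header");
    }
}
```

This only identifies the file type. Extracting the text still requires a library that understands the PDF object structure and font encodings.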

Chandra Mohan
You're right, thanks for your response. I've tried to use iText, but the extracted text is missing some letters that have diacritics. Any ideas? Thanks.
Stephan
A: 

iText is an API for creating PDFs from scratch. To read and edit an existing file, have a look at http://www.lowagie.com/iText/

i2ijeya
I've tried to use iText, but the extracted text is missing some letters that have diacritics. Any ideas? Thanks.
Stephan
+1  A: 

You can try JavaPDF. It has an API for you to do the job. You can invoke the method extractTextFromPage(int pageIndex) from the PDFReader class.

Joset
+1  A: 

iText may be failing to read all the letters correctly because of the encoding used for the font. You could declare the font like this:

BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.EMBEDDED);

where BaseFont.CP1252 is the encoding used. Be advised that some fonts do not have support for all types of encodings.

Bobby
Thank you very much, your suggestion solved part of the problem :D Cheers
Stephan
A: 

You have to use a specialised package. Two that I have used are pdftotext (http://en.wikipedia.org/wiki/Pdftotext) and PDFBox (http://incubator.apache.org/pdfbox/). Even with a package you cannot always guarantee success, as some PDF-writing tools are of poor quality and generate poor PDF.

peter.murray.rust
Thank you for your suggestion, I will try it out.
Stephan
+1  A: 

Using the iText helper class PdfTextExtractor should work fine. Just check that you're using the right encoding when writing the file to disk:

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "ISO-8859-1");
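
One thing worth checking, as a sketch rather than a diagnosis: ISO-8859-1 only covers code points up to U+00FF, so letters outside that range are replaced when written with it. The example below uses only the standard library. The class name EncodingCheck, its methods, and the sample characters are assumptions for illustration, since the thread does not say which letters go missing. It tests whether a string fits a charset and writes it with an explicitly chosen one.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class EncodingCheck {

    // True if every character of the text can be represented in the charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Writes the text with an explicit charset instead of the JVM default.
    static void write(Path target, String text, Charset charset) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(target), charset)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample: e with acute (U+00E9), s and t with comma below (U+0219, U+021B).
        String sample = "\u00e9 \u0219 \u021b";

        System.out.println("ISO-8859-1 can encode sample: "
                + canEncode(sample, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 can encode sample: "
                + canEncode(sample, StandardCharsets.UTF_8));

        // Write as UTF-8 and read the bytes back with the same charset.
        Path target = Files.createTempFile("extracted", ".txt");
        write(target, sample, StandardCharsets.UTF_8);
        String back = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println("round trip intact: " + back.equals(sample));
        Files.delete(target);
    }
}
```

If the check fails for your text, writing the file as UTF-8 (and opening it as UTF-8 in the editor) avoids the loss at the output stage. It does not help if the characters were already wrong when they came out of the extractor.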
FRotthowe
Thank you for your response. It made my work a lot easier, but it still didn't resolve my problem with some diacritics.
Stephan
A: 

Our PDFTextStream library provides comprehensive support for diacriticals, as well as all character sets defined in the Unicode standard (including Chinese, Japanese, and Korean characters, in both horizontal and vertical writing modes). You might find that it extracts those diacriticals properly where other tools do not.

There are circumstances where a character, when extracted to text, will not appear to be the same as when it is displayed by a PDF reader like Acrobat -- this is most often the case when the text in question is rendered using an image-based font (which obviously doesn't convert directly to text, and would require an OCR process in order to derive the proper accented character(s)).

Chas Emerick