I am using the open-source pdftotext tool to convert PDF files to text files. How can I save the text files in UTF-8 so that all the accented characters are retained? I am using the command below, which extracts the content to a text file, but the accented characters do not appear in the output.

pdftotext -enc UTF-8 book1.pdf book1.txt

Please help me to resolve this issue.

Thanks in advance,

A: 

You can get a list of available encodings using the command:

pdftotext -listenc

and pick the right one using the -enc argument. Mine here seems to do UTF-8 by default, i.e. your "UTF-8" is superfluous.

pdftotext -enc UTF-8 your.pdf

You may want to check your locale (LC_ALL, LANG, ...).
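
If it helps, the locale check can be sketched in a few lines of Python. This is only an illustration: the function name locale_report is made up here, and the three variables it reads are the usual POSIX ones, which are often unset on Windows.

```python
import locale
import os


def locale_report(env=os.environ):
    """Collect the locale-related settings that commonly influence text encoding.

    locale_report is a hypothetical helper name used only for this sketch.
    """
    # Environment variables consulted by most Unix tools; None means "not set".
    report = {name: env.get(name) for name in ("LC_ALL", "LC_CTYPE", "LANG")}
    # The encoding Python itself would pick for text files under this locale.
    report["preferred_encoding"] = locale.getpreferredencoding(False)
    return report


if __name__ == "__main__":
    for key, value in locale_report().items():
        print(f"{key}: {value}")
```

A value such as hu_HU.UTF-8 or de_DE.UTF-8 suggests a UTF-8 locale; unset variables or a legacy code page point to a different default.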

EDIT: I downloaded the following PDF: http://www.i18nguy.com/unicode/unicodeexample.pdf

and converted it on a Windows 7 PC (German) with Xpdf 3.02pl5 using the command:

pdftotext.exe -enc UTF-8 unicodeexample.pdf

The text file is definitely UTF-8 encoded, as all characters are displayed correctly. What are you using the text file for? If you're displaying it through a web application, your content encoding might simply be wrong, while the text file has been converted as you wanted it to.

Double-check using either a browser (force the encoding in Firefox to ISO-8859-1 and then to UTF-8, and compare the results) or a hex editor.
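
As an alternative to a hex editor, the same check can be done with a short Python script. This is a minimal sketch, not part of pdftotext: describe_encoding and check_file are names invented for this example, and book1.txt is the output file name from the question.

```python
from pathlib import Path


def describe_encoding(data: bytes) -> str:
    """Classify raw bytes as ASCII-only, valid UTF-8, or not UTF-8."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Typical for ISO-8859-1 / Windows-1252 output containing accents.
        return "not valid UTF-8"
    if all(ord(ch) < 128 for ch in text):
        # Decodes fine, but no accented characters survived the extraction.
        return "ASCII only"
    return "valid UTF-8 with non-ASCII characters"


def check_file(path: str) -> str:
    """Read a file as raw bytes and classify its encoding."""
    return describe_encoding(Path(path).read_bytes())


if __name__ == "__main__":
    # book1.txt is assumed to be the file produced by pdftotext.
    print(check_file("book1.txt"))
```

If the result is "valid UTF-8 with non-ASCII characters" but the text still looks wrong, the problem is more likely in the viewer or in the PDF itself than in the conversion.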

icanhasserver
Thanks for your reply. I am not able to get the list of encodings using pdftotext -listenc. I am using the same command you specified, but it still does not work for me. Could you please send me your email address so that I can forward you the PDF to test? Thanks again.
Amar
I am using pdftotext version 3.02.
Amar
What platform are you running this on? Some kind of Unix/Linux or Windows? Judging by the version number, it looks like you're using the somewhat outdated (original) Xpdf version. Most Linux distributions have switched to Poppler in the meantime. Mine says: "pdftotext version 0.14.4" and comes from Poppler (version released in 2010).
icanhasserver
I am using Windows 7 and I have downloaded the latest version for Windows ("Xpdf 3.02pl5 was released 2010-oct-21").
Amar
See my edit above. I have no problem converting to UTF-8 using the version you mentioned.
icanhasserver
I am using the same command you gave here and just saving the output to a text file. It is not displayed in a web browser. Could you please send me your email address so that I can send you my sample PDF file?
Amar
Send it to the following address: temp12474 AT icanhasserver DOT com, but your problem doesn't come from the PDF file itself. The one I provided above is much better for diagnosing, as it contains a large number of different codepoints.
icanhasserver
I tried the above PDF file and it works fine for me. But the Hungarian PDF that I am trying to convert does not work. I have forwarded you the sample PDF. Please try it and let me know. Thanks.
Amar
A: 

Things are getting a little bit messy, so I'm adding another answer.

I took the PDF apart and my best guess would be a "problem" with the font used:

  • open the PDF file in Acrobat Reader
  • select all the text on the page
  • copy it and paste it into a Unicode-aware text editor (there's no "hidden" OCR, so you're copying actual data)

You'll see that the codepoints you end up with aren't the ones you're seeing in the PDF reader. Whatever the font is, it may have a mapping different from the one defined in the Unicode standard. As such, your content is "wrong" and there's not much you can do about it.
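
To see which codepoints the pasted text actually contains, something like the following Python sketch may help. The function name codepoints and the sample string are illustrative only; paste your own text in place of the sample.

```python
import unicodedata


def codepoints(text: str) -> list:
    """Return one 'U+XXXX NAME' entry per character in text."""
    entries = []
    for ch in text:
        # Private-use and unassigned codepoints have no name; use a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        entries.append(f"U+{ord(ch):04X} {name}")
    return entries


if __name__ == "__main__":
    # Replace the sample with text copied from the PDF reader.
    for entry in codepoints("árvíz"):
        print(entry)
```

Accented Hungarian letters should show up with names such as LATIN SMALL LETTER O WITH DOUBLE ACUTE. If they appear as unrelated symbols or as unnamed private-use codepoints instead, that supports the font-mapping explanation.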

icanhasserver