ansaurus

Question

How to save text file in UTF-8 format using pdftotext

Answer 1

A:

You can get a list of available encodings using the command:

pdftotext -listenc

and pick the right one using the -enc argument. Mine here seems to output UTF-8 by default, i.e. specifying "UTF-8" explicitly is superfluous:

pdftotext -enc UTF-8 your.pdf

You may want to check your locale (LC_ALL, LANG, ...).
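A minimal sketch of such a locale check, in Python. The helper name locale_summary is hypothetical, and LC_CTYPE is added here as an assumption (the answer only names LC_ALL and LANG). Whether a given pdftotext build actually honours these variables is not confirmed by this thread; on Windows they are often unset.

```python
import locale
import os


def locale_summary(environ=None):
    """Collect locale settings that may influence a tool's default output encoding."""
    env = os.environ if environ is None else environ
    # LC_ALL and LANG come from the answer; LC_CTYPE is an added assumption.
    names = ("LC_ALL", "LC_CTYPE", "LANG")
    # Unset variables are reported as None.
    summary = {name: env.get(name) for name in names}
    # The encoding Python itself would pick for text I/O on this system.
    summary["preferred_encoding"] = locale.getpreferredencoding(False)
    return summary


if __name__ == "__main__":
    for key, value in locale_summary().items():
        print(f"{key}: {value}")
```

If none of the variables is set and the preferred encoding is not UTF-8, passing -enc UTF-8 explicitly is the safer choice.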

EDIT: I downloaded the following PDF: http://www.i18nguy.com/unicode/unicodeexample.pdf

and converted it on a Windows 7 PC (German) with XPDF 3.02pl5 using the command:

pdftotext.exe -enc UTF-8 unicodeexample.pdf

The text file is definitely UTF-8 encoded, as all characters are displayed correctly. What are you using the text file for? If you're displaying it through a web application, your content encoding might simply be wrong, while the text file has been converted as you wanted it to.

Double-check using either a browser (force the encoding in Firefox to ISO-8859-1 and then to UTF-8, and compare how the text renders) or a hex editor.
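If no hex editor is at hand, the same check can be approximated with a few lines of Python. This is only a sketch: the function names are hypothetical, and a successful decode shows that the bytes are well-formed UTF-8, not that every character is the one you expected (pure ASCII passes too).

```python
def is_valid_utf8(data):
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def hex_preview(data, limit=32):
    """Show the first bytes as hex, similar to the top row of a hex editor."""
    return " ".join(f"{b:02x}" for b in data[:limit])


def check_file(path):
    """Read a file in binary mode and report (is valid UTF-8, hex preview)."""
    with open(path, "rb") as handle:
        raw = handle.read()
    return is_valid_utf8(raw), hex_preview(raw)
```

For example, check_file("unicodeexample.txt") would inspect the output of the command above, assuming pdftotext wrote its text file next to the PDF under that name. Text saved in a single-byte encoding such as ISO-8859-1 or ISO-8859-2 usually fails the check as soon as it contains an accented letter.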

icanhasserver 2010-10-28 05:17:06

Thanks for your reply. I am not able to get the list of encodings using pdftotext -listenc. I am also using the same command that you specified, but it still does not work for me. Could you please send me your mail address so that I can forward you the PDF to test? Thanks again.

Amar 2010-10-28 05:42:57

I am using pdftotext of version 3.02

Amar 2010-10-28 05:44:11

What platform are you running this on? Some kind of Unix/Linux or Windows? Judging by the version number, it looks like you're using the somewhat outdated (original) XPDF version. Most Linux distributions have switched to Poppler in the meantime. Mine says: "pdftotext version 0.14.4" and comes from Poppler (version released in 2010).

icanhasserver 2010-10-28 06:30:14

I am using Windows 7 and I have downloaded the latest version for Windows ("Xpdf 3.02pl5 was released 2010-oct-21").

Amar 2010-10-28 06:42:23

See my edit above. I have no problem converting to UTF-8 using the version you mentioned.

icanhasserver 2010-10-28 07:18:35

I am using the same command that you have given here and just saving the output to a text file. It is not being displayed in a web browser. Could you please send me your mail address so that I can send my sample PDF file?

Amar 2010-10-28 07:53:36

Send it to the following address: temp12474 AT icanhasserver DOT com, but your problem doesn't come from the PDF file itself. The one I provided above is way better for diagnosing, as it contains a large number of different codepoints.

icanhasserver 2010-10-28 08:38:52

I tried the PDF file above and it works fine for me, but the Hungarian PDF that I am trying to convert does not. I have forwarded you the sample PDF. Please try it and let me know. Thanks.

Amar 2010-10-28 09:03:42

Answer 2

A:

Things are getting a little bit messy, so I'm adding another answer.

I took the PDF apart and my best guess would be a "problem" with the font used:

open the PDF file in Acrobat Reader
select all the text on the page
copy it and paste it into a Unicode-aware text editor (there's no "hidden" OCR, so you're copying actual data)

You'll see that the codepoints you end up with aren't the ones you're seeing in the PDF reader. Whatever the font is, it may have a mapping different from the one defined in the Unicode standard. As such, your content is "wrong" and there's not much you can do about it.
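To see exactly which codepoints were copied, the pasted text can be listed character by character. The sketch below is in Python; describe_codepoints is a hypothetical helper, and the sample characters in the usage note are purely illustrative, not taken from the PDF in question.

```python
import unicodedata


def describe_codepoints(text):
    """Return one 'U+XXXX NAME' line per character of the given text."""
    lines = []
    for ch in text:
        # Control characters and unassigned codepoints have no Unicode name.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {name}")
    return lines
```

For instance, a correctly mapped Hungarian "ő" is reported as U+0151 LATIN SMALL LETTER O WITH DOUBLE ACUTE. If the glyph on screen looks like "ő" but the listing shows a different codepoint, the font's mapping is the likely culprit, and pdftotext will faithfully write that other codepoint as UTF-8.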

icanhasserver 2010-10-28 09:43:06
