ansaurus

Question

UnicodeEncodeError when reading pdf with pyPdf

Answer 1

I tried it myself and got the same result. Ignore my comment, I hadn't seen that you're writing the output to a file as well. This is the problem:

f.write(convertPdf2String(sys.argv[1]))

convertPdf2String returns a Unicode string, but in Python 2 file.write writes byte strings, so the call to f.write implicitly encodes the Unicode string with the ASCII codec. Because the PDF contains non-ASCII characters, that implicit conversion fails with a UnicodeEncodeError. Encode the string explicitly instead, for example:

f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
# or
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
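
To make the difference between the two options concrete, here is a minimal sketch. The sample string is a made-up stand-in for text extracted from the PDF, not the actual output of pyPdf. It shows that a strict ASCII encode fails, while the two explicit encodings succeed and produce different bytes.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample standing in for text extracted from the PDF.
text = u"\u0939\u093f\u0928\u094d\u0926\u0940 text"

# A strict ASCII encode is what the implicit conversion amounts to; it fails.
try:
    text.encode("ascii")
    strict_ascii_failed = False
except UnicodeEncodeError:
    strict_ascii_failed = True

# Option 1: UTF-8 keeps the real characters as multi-byte sequences.
utf8_bytes = text.encode("utf-8")

# Option 2: stay ASCII-only, replacing each non-ASCII character
# with a numeric XML character reference such as &#2361;
xml_bytes = text.encode("ascii", "xmlcharrefreplace")

print(strict_ascii_failed)
print(repr(utf8_bytes))
print(repr(xml_bytes))
```

The UTF-8 variant is probably what you want for a plain text file opened in an editor. The xmlcharrefreplace variant only displays as the original characters in something that interprets the references, such as an HTML or XML viewer.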

EDIT:

The working source code, only one line changed.

# Execute with "Hindi_Book.pdf" in the same directory
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse runs of whitespace (including non-breaking spaces) into single spaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store the result in a TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()

# or print contents to the standard out stream
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
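
As an alternative to calling .encode() at every write, the file can be opened with an explicit encoding so that Unicode strings are written directly. This is a sketch, not part of the tested script above: save_text and the sample string are hypothetical names introduced for illustration, and the temporary path replaces 'a.txt' only so the example is self-contained. io.open is available in Python 2.6+ and behaves like the built-in open in Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def save_text(path, text):
    # Hypothetical helper: the file object encodes to UTF-8 on write,
    # so no explicit .encode() call is needed.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Made-up sample standing in for the result of convertPdf2String().
sample = u"\u0939\u093f\u0928\u094d\u0926\u0940"

out_path = os.path.join(tempfile.mkdtemp(), "a.txt")
save_text(out_path, sample)

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == sample)
```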

AndiDog 2010-10-04 16:18:20

@AndiDog: I had tried both initially and could not get them working. My initial goal was just to read the PDF contents from the command line, and I don't want to use xpdf for this.

Hulk 2010-10-04 19:03:54

@Hulk: I have tested what I've written in my answer, on the very same PDF file. Are you saying it doesn't work for you?

AndiDog 2010-10-04 20:13:51

@AndiDog: It's still the same error. I tried both statements.

Hulk 2010-10-05 05:18:08

@Hulk: I can't believe that. Are you getting the exception at the very same position? Please post what happens exactly.

AndiDog 2010-10-05 08:47:48

@AndiDog: When I run the script I get the following error with both statements. Could you post your code so I can make out the difference? Thanks for the help.

Hulk 2010-10-05 09:03:39

@Hulk: There you go. It's really only the one line.

AndiDog 2010-10-05 09:09:48

@AndiDog: I got it working, but the output is still in binary form; I can't get it exactly as the language characters. Let me know if you can read the language characters exactly as in the PDF. Thanks for all the help.

Hulk 2010-10-05 09:20:44

@AndiDog: Did you get the exact text as in the PDF? Maybe this is a font issue.

Hulk 2010-10-05 10:04:15

@Hulk: No, most of the output is ASCII characters. Maybe pyPdf translates Hindi to an alphabetical representation? Print `repr(convertPdf2String("Hindi_Book.pdf"))` and you'll see that it contains only a few non-ASCII characters. This seems to be a problem with pyPdf. I don't have any experience with PDF libraries, so I can't tell what the problem is.
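
A quick way to check this without reading the whole repr output is to list the distinct non-ASCII characters in the string. This is only a sketch: non_ascii_chars is a hypothetical helper, and the sample string is made up rather than taken from the actual PDF.

```python
# -*- coding: utf-8 -*-
def non_ascii_chars(text):
    # Hypothetical helper: distinct characters outside the ASCII range,
    # sorted by code point.
    return sorted(set(ch for ch in text if ord(ch) > 127))

# Made-up sample standing in for the result of convertPdf2String().
sample = u"Hindi \u0939 book\xa0title"

found = non_ascii_chars(sample)
print(repr(sample))
print(len(found))
```

If the list is very short for a book that is mostly Hindi, the extraction step is returning something other than the original script.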

AndiDog 2010-10-05 10:14:32

@AndiDog: Thanks for all the help. Much appreciated.

Hulk 2010-10-05 10:36:59
