I posted a related question earlier: http://stackoverflow.com/questions/3854963/pypdf-python-tool . Please don't mark this one as a duplicate, because it is about the error shown below.

  import sys
  import pyPdf

  def convertPdf2String(path):
      content = ""
      # load PDF file
      pdf = pyPdf.PdfFileReader(file(path, "rb"))
      # iterate pages
      for i in range(0, pdf.getNumPages()):
          # extract the text from each page
          content += pdf.getPage(i).extractText() + " \n"
      # collapse whitespaces
      content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
      return content

  # convert contents of a PDF file and store result to TXT file
  f = open('a.txt','w+')
  f.write(convertPdf2String(sys.argv[1]))
  f.close()

  # or print contents to the standard out stream
  print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

For the first PDF file I get this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). For this PDF, http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf , I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)

How can I resolve this?

+1  A: 

I tried it myself and got the same result. Ignore my comment, I hadn't seen that you're writing the output to a file as well. This is the problem:

f.write(convertPdf2String(sys.argv[1]))

convertPdf2String returns a Unicode string, but file.write can only write bytes, so the call to f.write implicitly tries to encode the Unicode string with the default ASCII codec. Since the PDF contains non-ASCII characters, that fails. So it should be something like

f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
# or
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
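A minimal sketch of the same idea, not part of the tested script above: io.open (in the standard library since Python 2.6) accepts an encoding argument and encodes Unicode text on write, so no explicit .encode() call is needed. The names write_text, read_text and sample are hypothetical; sample only stands in for the string that convertPdf2String returns.

```python
import io
import os
import tempfile

def write_text(path, text):
    # io.open encodes the Unicode text as UTF-8 while writing
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

def read_text(path):
    # decode the UTF-8 bytes back into a Unicode string
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()

# hypothetical stand-in for the extracted PDF text (assumption, not real output)
sample = u"fa\xe7ade \u0939\u093f\u0928\u094d\u0926\u0940"

path = os.path.join(tempfile.mkdtemp(), "a.txt")
write_text(path, sample)
round_trip = read_text(path)

# ASCII-only alternative: non-ASCII characters become XML character references
ascii_only = sample.encode("ascii", "xmlcharrefreplace")
```

With UTF-8 the file keeps the original characters; with "xmlcharrefreplace" each non-ASCII character is written as a numeric reference such as &#231;, which is lossless but not readable as plain text.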

EDIT:

The working source code, only one line changed.

# Execute with "Hindi_Book.pdf" in the same directory
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()

# or print contents to the standard out stream
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
AndiDog
@AndiDog: I had tried both initially and could not get them working. My initial goal was just to read the PDF contents from the command line, and I don't want to do this using xpdf.
Hulk
@Hulk: I have tested what I've written in my answer, on the very same PDF file. Are you saying it doesn't work for you?
AndiDog
@AndiDog: It's still the same error. I tried both statements.
Hulk
@Hulk: I can't believe that. Are you getting the exception at the very same position? Please post what happens exactly.
AndiDog
@AndiDog: When I run the script I get the same error with both statements. Could you post your code so I can work out the difference? Thanks for the help.
Hulk
@Hulk: There you go. It's really only the one line.
AndiDog
@AndiDog: I got it working, but the output is still in binary form; I can't get the actual characters of the language. Let me know if you can read the characters exactly as they appear in the PDF. Thanks for all the help.
Hulk
@AndiDog: Did you get the exact text as in the PDF? Maybe this is a font issue.
Hulk
@Hulk: No, most of the output is ASCII characters. Maybe pyPdf translates Hindi to an alphabetical representation? Print `repr(convertPdf2String("Hindi_Book.pdf"))` and you'll see that it contains only a few non-ASCII characters. Seems to be a problem with pyPdf. I don't have any experience with PDF libraries, so I can't tell what the problem is.
AndiDog
@AndiDog: Thanks for all the help. Much appreciated.
Hulk