ansaurus

Question

Given a unicode error I don't understand.

Answer 1

+1 A:

Here's the code that answered that question. But now it only writes the last file.

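import pyPdf
import os
import csv
import codecs      # used by UnicodeWriter for the incremental encoder
import cStringIO   # used by UnicodeWriter as the in-memory queue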

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


PDFWriter = csv.writer(open('/home/nick/TAM_work/text/text.doc', 'a'), delimiter=' ', quotechar='|', quoting=csv.QUOTE_ALL)

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

for word in os.listdir("/home/nick/TAM_work/TAM_pdfs"):
    print getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word)

PDFWriter.writerow ([getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word).encode("ascii", "ignore")])

Matt 2010-01-07 04:37:55

Fixed again by bringing the PDFWriter.writerow([getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word).encode("ascii", "ignore")]) call into the for loop.
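A minimal sketch of that fix, written for Python 3 rather than the Python 2 of the script above. The function name write_one_row_per_file and the extract_text parameter are hypothetical; extract_text stands in for getPDFContent so the sketch does not depend on pyPdf. The point is only that the writerow call is indented under the for loop, so each file produces its own row.

```python
import csv
import os


def write_one_row_per_file(pdf_dir, out_path, extract_text):
    """Append one CSV row per file in pdf_dir; return the number of rows written."""
    rows_written = 0
    # "a" appends, as in the original script; newline="" is what the csv module expects.
    with open(out_path, "a", newline="") as out:
        writer = csv.writer(out, delimiter=" ", quotechar="|", quoting=csv.QUOTE_ALL)
        for name in sorted(os.listdir(pdf_dir)):
            text = extract_text(os.path.join(pdf_dir, name))
            # Drop non-ASCII characters, mirroring .encode("ascii", "ignore") above.
            ascii_text = text.encode("ascii", "ignore").decode("ascii")
            # writerow is inside the loop, so every file gets a row, not just the last.
            writer.writerow([ascii_text])
            rows_written += 1
    return rows_written
```

With the call outside the loop, only the value of the loop variable left over from the final iteration is written, which is why only the last file showed up.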

Matt 2010-01-07 04:46:33

Answer 2

A:

As I understand it, you put a large number in a small variable and it throws an exception.

I can point you to a C# tool that works very well with Unicode; you can find it at http://unicode.codeplex.com

In your case I recommend changing this line:

 for i in range(0, pdf.getNumPages()):

If pdf.getNumPages() is above 128, just check for that.

Nasser Hadjloo 2010-01-07 12:31:02

-1 The OP's exception was a UnicodeEncodeError which only vaguely could be characterised as "large number in small variable" and is definitely nothing to do with the number of pages in the PDF file. As for your undocumented "tool", you'd have to convince a Python user that it provided something on top of Python's standard unicode facilities -- but please don't take these remarks as an invitation to further spamming, quite the contrary.
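For illustration, here is a small sketch of what a UnicodeEncodeError looks like and how the "ignore" error handler used in Answer 1 avoids it. The sample string is made up; it contains "é" and the non-breaking space (u"\xa0") that getPDFContent replaces. The error comes from characters that have no ASCII representation, not from the size of any number.

```python
text = u"caf\xe9\xa0menu"  # hypothetical extracted text with two non-ASCII characters

error = None
try:
    # Strict ASCII encoding fails on the first character outside the ASCII range.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# "ignore" silently drops anything ASCII cannot represent (lossy).
ascii_only = text.encode("ascii", "ignore")

# UTF-8 can represent every character, so nothing is lost.
utf8_bytes = text.encode("utf-8")
```

Dropping characters with "ignore" is the simplest workaround; encoding to UTF-8 instead keeps the original text intact if the output file is allowed to contain non-ASCII bytes.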

John Machin 2010-01-07 13:54:19
