views: 159
answers: 3
Here is the code I have so far (it is working and extracting text as it should):

import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate pages
    for i in range(pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/nick/TAM_work/TAM_pdfs/2006-1.pdf").encode("ascii", "ignore")

I now need to add a for loop to get it to run on all PDFs in /TAM_pdfs, save the text as a CSV, and (if possible) add something to count the pictures. Any help would be greatly appreciated. Thanks for looking.

Matt

+4  A: 

Take a look at os.walk()
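For example, a minimal sketch that gathers every PDF under a directory tree (the root path just mirrors the one in the question):

import os

pdf_paths = []
for dirpath, dirnames, filenames in os.walk("/home/nick/TAM_work/TAM_pdfs"):
    # filenames holds the plain file names found in dirpath
    for name in filenames:
        if name.lower().endswith(".pdf"):
            pdf_paths.append(os.path.join(dirpath, name))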

Stuart
A: 

The glob module can help you find all files in a single directory that match a wildcard pattern.
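For instance, a minimal sketch (the pattern is an assumption based on the directory in the question):

import glob

for path in glob.glob("/home/nick/TAM_work/TAM_pdfs/*.pdf"):
    print path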

Ignacio Vazquez-Abrams
A: 

for loop to get it to run on all PDFs in a directory: look at the glob module

save the text as a CSV: look at the csv module

count the pictures: look at the pyPdf module :-) (a rough sketch tying all three together follows)
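A rough sketch, reusing the getPDFContent() from the question; the CSV layout, the output file name, and the image-counting heuristic (counting /Image XObjects in each page's resource dictionary) are my own assumptions, not tested code:

import csv
import glob
import pyPdf

def countImages(path):
    # Heuristic: count the image XObjects referenced by each page
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    count = 0
    for i in range(pdf.getNumPages()):
        resources = pdf.getPage(i)["/Resources"]
        if "/XObject" in resources:
            xobjects = resources["/XObject"].getObject()
            for key in xobjects:
                if xobjects[key]["/Subtype"] == "/Image":
                    count += 1
    return count

out = open("/home/nick/TAM_work/tam_texts.csv", "wb")  # hypothetical output path
writer = csv.writer(out)
writer.writerow(["filename", "text", "images"])
for path in glob.glob("/home/nick/TAM_work/TAM_pdfs/*.pdf"):
    text = getPDFContent(path).encode("ascii", "ignore")
    writer.writerow([path, text, countImages(path)])
out.close()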

Two comments on this statement:

content = " ".join(content.replace(u"\xa0", " ").strip().split())

(1) It is not necessary to replace the NBSP (U+00A0) with a SPACE, because NBSP is (naturally) considered to be whitespace by unicode.split()
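A quick check at the interactive prompt:

>>> u"foo\xa0bar".split()
[u'foo', u'bar']
>>>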

(2) Using strip() is redundant, because split() with no arguments already discards leading and trailing whitespace:

>>> u"  foo  bar  ".split()
[u'foo', u'bar']
>>>
John Machin