views: 159
answers: 3
Here is the code I have so far (it is working and extracting text as it should):

import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate pages
    for i in range(pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/nick/TAM_work/TAM_pdfs/2006-1.pdf").encode("ascii", "ignore")

I now need to add a for loop to get it to run on all PDFs in /TAM_pdfs, save the text as a CSV, and (if possible) add something to count the pictures. Any help would be greatly appreciated. Thanks for looking.

Matt

+4  A: 

Take a look at os.walk()
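For example, a minimal sketch that gathers every PDF under a directory tree (the root path just mirrors the one in the question):

import os

pdf_paths = []
for dirpath, dirnames, filenames in os.walk("/home/nick/TAM_work/TAM_pdfs"):
    # filenames holds the plain file names found in dirpath
    for name in filenames:
        if name.lower().endswith(".pdf"):
            pdf_paths.append(os.path.join(dirpath, name))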

Stuart
A: 

The glob module can help you find all files in a single directory that match a wildcard pattern.
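For instance, a minimal sketch (the pattern is an assumption based on the directory in the question):

import glob

for path in glob.glob("/home/nick/TAM_work/TAM_pdfs/*.pdf"):
    print path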

Ignacio Vazquez-Abrams
A: 

for loop to get it to run on all PDFs in a directory: look at the glob module

save the text as a CSV: look at the csv module

count the pictures: look at the pyPdf module :-) (a rough sketch tying all three together follows)
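A rough sketch, reusing the getPDFContent() from the question; the CSV layout, the output file name, and the image-counting heuristic (counting /Image XObjects in each page's resource dictionary) are my own assumptions, not tested code:

import csv
import glob
import pyPdf

def countImages(path):
    # Heuristic: count the image XObjects referenced by each page
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    count = 0
    for i in range(pdf.getNumPages()):
        resources = pdf.getPage(i)["/Resources"]
        if "/XObject" in resources:
            xobjects = resources["/XObject"].getObject()
            for key in xobjects:
                if xobjects[key]["/Subtype"] == "/Image":
                    count += 1
    return count

out = open("/home/nick/TAM_work/tam_texts.csv", "wb")  # hypothetical output path
writer = csv.writer(out)
writer.writerow(["filename", "text", "images"])
for path in glob.glob("/home/nick/TAM_work/TAM_pdfs/*.pdf"):
    text = getPDFContent(path).encode("ascii", "ignore")
    writer.writerow([path, text, countImages(path)])
out.close()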

Two comments on this statement:

content = " ".join(content.replace(u"\xa0", " ").strip().split())

(1) It is not necessary to replace the NBSP (U+00A0) with a SPACE, because NBSP is (naturally) considered to be whitespace by unicode.split()
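A quick check at the interactive prompt:

>>> u"foo\xa0bar".split()
[u'foo', u'bar']
>>>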

(2) Using strip() is redundant, because split() with no arguments already discards leading and trailing whitespace:

>>> u"  foo  bar  ".split()
[u'foo', u'bar']
>>>
John Machin