ansaurus

Question

How to extract PDF fields from a filled out form in Python?

Answer 1


You should be able to do it with pdfminer, but it will require some delving into pdfminer's internals and some knowledge of the PDF format (forms, of course, but also PDF's internal structures such as "dictionaries" and "indirect objects").

This example might help you on your way. I think it will work only in simple cases, with no nested fields, etc.

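import sys
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdftypes import resolve1

filename = sys.argv[1]

# Keep the file open while the fields are read.
with open(filename, 'rb') as fp:
    # Wire the parser and the document together.
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize()

    # AcroForm, Fields and each field may be indirect objects; resolve1 dereferences them.
    fields = resolve1(resolve1(doc.catalog['AcroForm'])['Fields'])
    for i in fields:
        field = resolve1(i)
        # T holds the field's name, V its value.
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))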

EDIT: forgot to mention: if you need to provide a password, pass it to doc.initialize()

Steven 2010-10-21 08:48:22

That did the trick, thank you. I saw the web demo and figured I could check whether what I wanted was in there and, if not, skip it. Turns out not only can it do exactly what I want, it can even handle the signature fields that PdfBox can't.

Jagerkin 2010-10-22 02:25:14
