pdf-scraping

Python module for converting PDF to text

Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState which uses pypdf, but the generated text had no spaces between words and was of no use. ...

What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?

Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to. Something that works with C# or classic ASP (VBScript) would be ideal, and I also need to be able to separate the pages from the PDF. This question had some interesting stuff, especially pdftotext, but I'd like to avoid calling to an external...

Moving data from one master PDF to other individual PDFs with different layouts

I have 8-10 different company applications that have to be filled out. About 85-90% of the information is common (however, it is not located in the same spot on each application form). I want to create a master application with the common fields and the application-specific fields in the master application. I want to have a person fill...

Optical character recognition of PDFs of parliamentary debates

Hi, for a contract job, I need to digitize a lot of old, scanned-image-only plenary debate protocol PDFs from the Federal Parliament of Germany. The problem is that most of these files have a two-column format. I would love to read your answers to the following questions: How can I split the two columns before feeding them into...

How can I convert PDF to HTML?

What good libraries are there, in any common language, for converting PDF to HTML? ...