pdftotext

pdftotext - Error: Illegal entry in bfchar block in ToUnicode CMap

I am running pdftotext on a bunch of PDFs, and some of them throw this error: "Error: Illegal entry in bfchar block in ToUnicode CMap". I took a look at the output files, and they seem to look OK, so I'm not sure if it's a significant error, but I am concerned. Does anyone know what this error is, what causes it, and how much damage there is...

subprocess isn't outputting anything.

I'm trying to use Python to run pdftotext, but for some reason my code isn't working. If I run the code below, I expect the content variable to contain the contents of the PDF, but the result I get is just an empty string. Does anybody know what I'm missing? def getPDFContent(path): path = "/path/to/a valid/pdffile.pdf"...

How to extract text using Zend_Pdf from pdf page

Can anyone help with extracting text from a page in a PDF? <?php $pdf = Zend_Pdf::load('example.pdf'); $page = $pdf->page[0]; I would assume a page method for this would exist, but I could not find anything that lets me extract the contents. Example: $page->getContents(); $page->toString(); $page->extractText(); ...Help!!!! This is driving me cr...

pdftotext can't find any of the files to convert when called within a python script

I have a Python script which keeps crashing on: subprocess.call(["pdftotext", pdf_filename]) The error is: OSError: [Errno 2] No such file or directory. The absolute path to the file (which I am storing in a log file as I debug) is fine; on the command line, if I type pdftotext <pdf_filename_goes_here> it works for any of the...
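A minimal sketch of one way to narrow down an `[Errno 2]` from `subprocess.call`, assuming the missing item could be either the `pdftotext` executable (not on the `PATH` seen by the script) or the PDF itself. The excerpt is truncated, so this is a guess at the cause rather than a confirmed diagnosis. The helper names `diagnose_pdftotext_call` and `convert` are hypothetical.

```python
import os
import shutil
import subprocess


def diagnose_pdftotext_call(pdf_filename, executable="pdftotext"):
    """Return a list of problems that would make the subprocess call fail."""
    problems = []
    # shutil.which searches the PATH of the *current* process, which may
    # differ from the PATH of an interactive shell.
    if shutil.which(executable) is None:
        problems.append(f"executable not found on PATH: {executable}")
    if not os.path.isfile(pdf_filename):
        problems.append(f"input file not found: {pdf_filename}")
    return problems


def convert(pdf_filename, executable="pdftotext"):
    """Run pdftotext only after both the executable and the input are located."""
    problems = diagnose_pdftotext_call(pdf_filename, executable)
    if problems:
        raise FileNotFoundError("; ".join(problems))
    # Arguments are passed as a list, so spaces in the filename need no quoting.
    return subprocess.call([executable, pdf_filename])
```

If the first check fails, passing the full path to the executable instead of the bare name `pdftotext` may be enough.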

Ruby PDF:Toolkit using pdftotext

Hi, I'm converting PDF files in my Ruby project. I'm using the pdf toolkit gem for this. The documentation shows how you can use pdftotext: pdftotext(file,outfile = nil,&block) In my project I am converting a PDF file without any arguments and can just do this: PDF::Toolkit.pdftotext("file.pdf", "file.txt") If I run it from...

CLI pdf viewer for linux

Hey, for quite a while now I have been looking for a PDF viewer for the command line. As I like to work without X on Linux, and often work on a remote machine, I would like to have a tool to read PDFs. There are quite a lot of really good graphical programs (evince, okular, acroread, ...) to do the job, so I figured there should be at least o...

Process the data of an image, like a PDF or something else, using PDFCreator

Hey all. Maybe you guys can help me with my project. I'm using PDFCreator as a virtual printer to print some images to a file. The file can be a PDF or any type of image, but I need to extract data from it. Can it be done? I'm using C#. ...

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database. Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manne...

Using subprocess.Popen in Python with an os.tmpfile while passing in optional parameters

Hi, I am writing a Python program on Linux, and part of it runs the pdftotext executable to convert a PDF to text. The code I am currently using is given below. pdfData = currentPDF.read() tf = os.tmpfile() tf.write(pdfData) tf.seek(0) out, err = subprocess.Popen(["pdftotext", "-", "-"], stdin = tf, stdout=subprocess.PIPE ).communi...

How to save text file in UTF-8 format using pdftotext

I am using the open-source pdftotext tool to convert PDFs to text files. How can I save the text files in UTF-8 format so that I retain all the accented characters? I am using the command below, which extracts the content to a text file, but I am not able to see any accented characters. pdftotext -enc UTF-8 book1.pdf boo...
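A minimal sketch of driving the same `-enc UTF-8` conversion from Python and then reading the result back explicitly as UTF-8, which may help tell apart a conversion problem from a viewer that decodes the file with the wrong encoding. This assumes `pdftotext` is installed and on the `PATH`; the function names and the file names passed to them are hypothetical.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, encoding="UTF-8"):
    """Build the argument list for pdftotext with an explicit output encoding."""
    return ["pdftotext", "-enc", encoding, pdf_path, txt_path]


def pdf_to_utf8_text(pdf_path, txt_path):
    """Convert the PDF, then decode the output file as UTF-8."""
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as fh:
        return fh.read()
```

If the accented characters appear correctly in the string returned here, the text file itself is probably fine and the program used to view it is the more likely culprit.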