pypdf

extracting stream from pdf in python

How can I extract part of this stream (the one named BLABLABLA) from the PDF file that contains it? <</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 /Resources<</ColorSpace<</CS0 563 0 R>>/ExtGState<</GS0 568 0 R>>/Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>/ProcSet[/PDF/Text/ImageC]/P...

pyPdf for IndirectObject extraction

Following this example, I can list all the elements in a PDF file:
import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf"))
list(pdf.pages)  # Process all the objects.
print pdf.resolvedObjects
Now I need to extract a non-standard object from the PDF file. My object is the one named MYOBJECT and it is a string. The piece printed by...

reading/writing xmp metadata on pdf files through pypdf

I can read XMP metadata through pyPdf with this code:
a = pyPdf.PdfFileReader(open(self.fileName))
b = a.getXmpMetadata()
c = b.pdf_keywords
But is this the best way? What if I don't use the pdf_keywords property? And is there any way to set this metadata with pyPdf? ...

Fast PDF splitter library

pyPdf is a great library for splitting and merging PDF files. I'm using it to split PDF documents into one-page documents. pyPdf is pure Python and spends quite a lot of time in the _sweepIndirectReferences() method of the PdfFileWriter object when saving the extracted page. I need something with better performance. I've tried using multi-threading...

python and pyPdf - how to extract text from the pages so that there are spaces between lines

Currently, if I make a page object from a PDF page with pyPdf and call extractText(), the lines are concatenated together. For example, if line 1 of the page says "hello" and line 2 says "world", the text returned from extractText() is "helloworld" instead of "hello world". Does anyone know how to fix this, or have su...

split a pdf based on outline

I would like to use pyPdf to split a PDF file based on its outline, where each destination in the outline refers to a different page within the PDF. Example outline:
main --> points to page 1
sect1 --> points to page 1
sect2 --> points to page 15
sect3 --> points to page 22
It is easy within pyPdf to iterate ove...
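One way to think about the split is purely in terms of page numbers: each outline entry runs from its own start page up to the page before the next distinct start page. The sketch below illustrates only that arithmetic; it does not touch pyPdf at all. The function name `outline_page_ranges`, the dictionary input, and the total page count of 30 are assumptions made for illustration, since the excerpt does not state how long the document is.

```python
from typing import Dict, List, Tuple


def outline_page_ranges(
    starts: Dict[str, int], total_pages: int
) -> List[Tuple[str, int, int]]:
    """Return (title, first_page, last_page) per outline entry, 1-based and inclusive.

    `starts` maps an outline title to the page its destination points to.
    This helper is hypothetical; it is not part of pyPdf.
    """
    for title, page in starts.items():
        if not 1 <= page <= total_pages:
            raise ValueError(f"{title!r} points outside the document: page {page}")

    # Distinct start pages in ascending order act as section boundaries.
    boundaries = sorted(set(starts.values()))

    # A section ends on the page just before the next boundary,
    # and the last section runs to the end of the document.
    last_page_for = {}
    for i, first in enumerate(boundaries):
        if i + 1 < len(boundaries):
            last_page_for[first] = boundaries[i + 1] - 1
        else:
            last_page_for[first] = total_pages

    return [(title, first, last_page_for[first]) for title, first in starts.items()]


# The outline from the question, with an assumed total of 30 pages.
outline = {"main": 1, "sect1": 1, "sect2": 15, "sect3": 22}
for title, first, last in outline_page_ranges(outline, total_pages=30):
    print(title, first, last)
```

Entries that share a start page (here `main` and `sect1`) receive the same range. Whether those should become one output file or two is a design decision the sketch leaves open.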

What program to write pdf including other pdf on Linux from Python?

On an Ubuntu server, I want to create PDFs which include other static PDFs. I have tried using ReportLab with pyPdf. Ideally I would use ReportLab to do the whole thing, but importing the PDFs requires their PageCatcher, which has a large recurring fee. So I use pyPdf to merge a page created with ReportLab and my other PDFs. Th...

Generating & Merging PDF Files in Python

I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc). From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just a...

Aligning two PDFs for a merge using Cairo and pyPDF

I need to programmatically add additional graphical elements onto an existing, static PDF book cover. Right now I use pycairo to draw onto a transparent PDFSurface, then merge it into the existing static PDF using pyPdf. This way, the PDFSurface works as an overlay. However, the transparent PDF is exactly the same size as the static PDF...

Are PDF box coordinates relative or absolute?

I want to programmatically edit a PDF using pyPDF. Currently, I'm struggling with interpreting the various PDF boxes' (TrimBox, MediaBox etc.) dimensions. Each box has four dimensions stored as a four-tuple, e.g.: TrimBox: 56.69 56.69 1040.31 751.18 According to the PDF specification, these are supposed to describe a r...
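In the PDF specification, a rectangle is given by two diagonally opposite corners, conventionally lower-left x, lower-left y, upper-right x, upper-right y, and all page boxes are expressed in the same default user space. So the last two numbers are coordinates, not a width and a height. The minimal sketch below reads the four-tuple that way using plain Python; the `Box` class and `parse_box` helper are hypothetical names introduced here and are not part of pyPdf.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """A PDF rectangle: lower-left and upper-right corners in user-space units."""

    llx: float
    lly: float
    urx: float
    ury: float

    @property
    def width(self) -> float:
        return self.urx - self.llx

    @property
    def height(self) -> float:
        return self.ury - self.lly


def parse_box(values: Sequence[float]) -> Box:
    """Build a Box from four numbers, whichever pair of opposite corners is given."""
    if len(values) != 4:
        raise ValueError("a PDF box needs exactly four numbers")
    x1, y1, x2, y2 = (float(v) for v in values)
    # Normalise so the first corner is always the lower-left one.
    return Box(min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


# The TrimBox from the question.
trim_box = parse_box([56.69, 56.69, 1040.31, 751.18])
print(round(trim_box.width, 2), round(trim_box.height, 2))
```

Read this way, the TrimBox above is inset 56.69 units from the origin on both axes, and its size comes from subtracting the corners.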

Change metadata of pdf file with pypdf.

Hello! I'd like to create or modify the title of a PDF document using pypdf. It seems that the title is read-only. Is there a way to get read/write access to this metadata? If so, a piece of code would be appreciated. Thanks ...

Dynamically generated PDF files working in most readers except Adobe Reader

I'm trying to dynamically generate PDFs from user input, where I basically print the user input and overlay it on an existing PDF that I did not create. It works, with one major exception. Adobe Reader doesn't read it properly, on Windows or on Linux. QuickOffice on my phone doesn't read it either. So I thought I'd trace the path of me ...

How to open a generated PDF file in browser?

I have written a Pdf merger which merges an original file with a watermark. What I want to do now is to open 'document-output.pdf' file in the browser by a Django view. I already checked Django's related articles, but since my approach is relatively different, I don't directly create the PDF object, using the response object as its "fil...

pypdf python tool

Using the pypdf Python module, how can I read the following PDF file? http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf
# -*- coding: utf-8 -*-
from pyPdf import PdfFileWriter, PdfFileReader
import pyPdf
def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pag...

what causes "insufficient data for image" in a pdf

I have a program in Python (using pyPDF) that merges a bunch of different PDF documents. Sometimes the resulting PDF is fine, except for some blank pages in the middle. When I view these documents with Acrobat Reader, I get an error message saying "insufficient data for image". When I view the documents with Foxit Reader, I get some ...