Hi, I work at a newspaper and we are looking for a way to make archive material available. At the moment our pages come in PDF format, so we need a way to export text and images from the PDFs so that they can be added to a database. We've had a look at the News studio plugin for Adobe Acrobat from Iceni Technology, but I'm just wondering if anyone knows of other options for exporting PDF data. Thanks.
There is pdftotext (part of xpdf). It will extract text from PDF files (if it is stored as text in the PDF, and not as an image). You could probably use that.
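For a whole archive you would probably want to run it in a loop. A minimal sketch, assuming pdftotext is on your PATH; the function name and the directory names are made up for illustration. The -layout option asks pdftotext to keep the physical page layout, which may or may not help with multi-column newspaper pages, so try it both with and without.

```shell
# Convert every PDF in a source directory to a .txt file in an output directory.
# extract_text and both directory arguments are hypothetical names.
extract_text() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for pdf in "$src_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs
        [ -e "$pdf" ] || continue
        base=$(basename "$pdf" .pdf)
        pdftotext -layout "$pdf" "$out_dir/$base.txt"
    done
}

# Example call (paths are placeholders):
#   extract_text ./pages ./text
```

Pages that are stored as scanned images will come out as empty or near-empty text files, which is a quick way to spot the ones that need OCR instead.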
However, be advised that any solution to extract text from PDF will be limited, as PDFs are really for display only. At the very least, you will not have metadata like article date, author etc.; also, if part of the text is in an image, you might lose that.
The better approach is probably to extract the raw data from the system which generates the PDFs, and archive that in a suitable format. Maybe more work, but better results.
If your PDFs already contain the text, then your job will be much easier: pdftotext will give you the text, and pdftohtml will give you both text and image output (see the Ubuntu package xpdf-utils).
On the other hand, if the text in your pdf is image-based then you'll have to look at OCR options. Fortunately, there are some good open source offerings. I have had some success using a combination of ImageMagick and Tesseract:
- First, convert PDFs to TIFF with ImageMagick (Tesseract won't OCR PDFs)
- OCR the TIFF using Tesseract (you can also try gocr, also available in the Ubuntu repos)
The key was to make sure the TIFFs were of high enough quality. These ImageMagick settings worked well for me:
convert -depth 8 -density 500 -colorspace GRAY -resize 1600 input.pdf output.tif
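The two steps can be chained roughly like this. It is only a sketch: the function name is made up, and I am assuming convert and tesseract are both installed (ImageMagick also needs Ghostscript to read PDFs). The %03d in the output name makes ImageMagick write one numbered TIFF per page, and tesseract adds .txt to the output base name you give it.

```shell
# OCR one PDF: rasterise each page to TIFF, then run Tesseract on every page.
# ocr_pdf is a hypothetical helper name.
ocr_pdf() {
    pdf=$1
    base=${pdf%.pdf}
    # Same ImageMagick settings as above, one TIFF per page (base-000.tif, ...)
    convert -depth 8 -density 500 -colorspace GRAY -resize 1600 "$pdf" "$base-%03d.tif"
    for tif in "$base"-*.tif; do
        [ -e "$tif" ] || continue
        # Writes the recognised text to e.g. base-000.txt
        tesseract "$tif" "${tif%.tif}"
    done
}

# Example call (file name is a placeholder):
#   ocr_pdf frontpage.pdf
```

At density 500 the TIFFs get large, so for a big archive you may want to delete each one once its .txt file has been written.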
If you need to extract metadata from a pdf as well (Title, Location, Subject, Author, etc.) then pdftk is a useful tool.
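For example, pdftk's dump_data operation prints the document info as InfoKey / InfoValue line pairs, which are easy to flatten into something you can load into a database. A rough sketch, assuming pdftk is installed; the function name and the tab-separated output format are my own choices, not anything pdftk prescribes.

```shell
# Print each PDF info field as "key<TAB>value", one per line.
# pdf_info is a hypothetical helper name.
pdf_info() {
    pdftk "$1" dump_data | awk '
        # Remember the most recent key (text after "InfoKey: ")
        /^InfoKey: /   { key = substr($0, 10) }
        # Emit the pair when its value line arrives (text after "InfoValue: ")
        /^InfoValue: / { print key "\t" substr($0, 12) }
    '
}

# Example call (file name is a placeholder):
#   pdf_info frontpage.pdf
```

You will only get whatever fields the system that produced the PDF actually filled in, so check a few files before relying on it.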