Haskell: parsing PDF

+4 Q:

I need to read a PDF, apply some transformations (generate TOC bookmarks), and write it back out.

I found HPDF (http://hackage.haskell.org/package/HPDF), but its description only mentions generating PDFs, not parsing them (although I could have missed it).

I chose Haskell purely for (self-)educational purposes.

Here's a Haskell binding to parts of xpdf: http://hackage.haskell.org/package/pdf2line

ja 2010-03-05 20:47:39

+1 A:

There are a few tools for PDF manipulation, though they seem biased toward generation rather than parsing:

http://johnmacfarlane.net/pandoc/

Pandoc is a great cross-markup library, but doesn't support PDF parsing (it does support PDF generation from a variety of formats).

There's also:

http://hackage.haskell.org/package/HsHaruPDF
http://hackage.haskell.org/package/pdf2line -- tool for extracting text from PDF
http://hackage.haskell.org/package/HPDF -- another PDF generation library

I'm not sure we have a good parsing tool yet.

Don Stewart 2010-03-05 21:14:18

+1 A:

Also as a learning exercise, I started a PDF parsing library in Haskell, but it's incomplete and has been languishing a bit from lack of attention. I'd be happy to share it with you, and would love feedback, improvements, etc. It's not currently hosted on Hackage, but if you're interested in working with an incomplete implementation, let me know and I'll ask some colleagues for advice on getting it up there.

Dylan McNamee 2010-03-05 21:44:39

I am far too junior for such a quest. But thanks anyway, I'll keep this in mind for the future.

artemave 2010-03-05 22:51:46

I'd be happy to work with you on it. Its current state is that it takes a PDF file and produces an AST-like representation, which can be manipulated. I've also got an AST pretty-printer that produces a valid PDF file.

Dylan McNamee 2010-03-06 23:41:50
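
For anyone wondering what an "AST-like representation" plus a pretty-printer might look like, here is a minimal sketch in Haskell. It is not the library described above: the names PdfObject, render and outlineItem are made up for illustration, and it covers only a few PDF object types (no parser, no streams, no cross-reference table, no escaping of unusual characters in names). It does show the kind of dictionary a TOC bookmark (an outline item) boils down to.

```haskell
-- Hypothetical AST for a small subset of PDF objects (illustrative only).
data PdfObject
  = PdfNull
  | PdfBool Bool
  | PdfInt Integer
  | PdfName String                 -- written as /Name
  | PdfString String               -- written as (literal string)
  | PdfArray [PdfObject]           -- written as [a b c]
  | PdfDict [(String, PdfObject)]  -- written as << /Key value ... >>
  | PdfRef Int Int                 -- indirect reference: "obj gen R"
  deriving (Eq, Show)

-- Pretty-print an object using PDF syntax.
-- Simplification: names are assumed to contain only regular characters.
render :: PdfObject -> String
render PdfNull         = "null"
render (PdfBool True)  = "true"
render (PdfBool False) = "false"
render (PdfInt n)      = show n
render (PdfName n)     = '/' : n
render (PdfString s)   = "(" ++ concatMap escape s ++ ")"
  where
    -- Parentheses and backslashes must be escaped in literal strings.
    escape c
      | c `elem` "()\\" = ['\\', c]
      | otherwise       = [c]
render (PdfArray xs)   = "[" ++ unwords (map render xs) ++ "]"
render (PdfDict kvs)   =
  "<< " ++ unwords [ '/' : k ++ " " ++ render v | (k, v) <- kvs ] ++ " >>"
render (PdfRef n g)    = show n ++ " " ++ show g ++ " R"

-- Build an outline item (bookmark) dictionary: a title, a reference to the
-- parent outline node, and a destination that fits the target page in view.
outlineItem :: String -> PdfObject -> PdfObject -> PdfObject
outlineItem title parent page =
  PdfDict
    [ ("Title",  PdfString title)
    , ("Parent", parent)
    , ("Dest",   PdfArray [page, PdfName "Fit"])
    ]
```

In GHCi, putStrLn (render (outlineItem "Intro (1)" (PdfRef 10 0) (PdfRef 3 0))) prints << /Title (Intro \(1\)) /Parent 10 0 R /Dest [3 0 R /Fit] >>. A real tool would still have to parse the existing file, link the items to each other and to the document catalog, and rewrite the cross-reference table.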

Also, I can't seem to comment on the "waah, the PDF ISO spec is expensive" post, but I found the free documents at http://www.adobe.com/devnet/pdf/ to be sufficient for my PDF parsing needs.

Dylan McNamee 2010-03-06 23:43:15
