Hi. I'm looking for a way to extract semantic structural information (such as titles, headings, paragraphs, or lists) from a PDF, because I want purely structural data.

Ultimately, I want to create pure XHTML from the PDF, containing only structural information, with no design or layout.

I know that a PDF can be created without any structural information. I'm not considering those PDFs, only well-structured ones.

I'm new to PDF, so I don't know whether it offers a regular semantic structure. If it does, I assume its libraries expose it. So I want to know whether the PDF spec includes this information and, if it does, the best way to get it.

+1  A: 

I would highly recommend reading through the PDF spec:

http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf

A PDF doesn't inherently have a "semantic structure" like you might find in an HTML file; it's much more complicated.

The file format is largely based on a COS Object Tree, which is essentially a set of objects referencing each other in various manners, but not in any particular order (with some exceptions).

Some of these objects contain what you are likely after (document tags, etc.). Moreover, these objects can be encoded in various ways.
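To give a feel for what "objects referencing each other" means, here is a toy sketch in Python. It is not a PDF parser: the `objects` table, the `("ref", n)` tuples standing in for indirect references such as `3 0 R`, and the `resolve`/`lookup` helpers are all made up for illustration. Only the key names (`Catalog`, `StructTreeRoot`, `StructElem`, `S`, `K`) follow the spec.

```python
# Toy model of a PDF object table: object number -> object.
# ("ref", n) is a hypothetical stand-in for an indirect reference "n 0 R".
objects = {
    1: {"Type": "Catalog", "Pages": ("ref", 2), "StructTreeRoot": ("ref", 3)},
    2: {"Type": "Pages", "Kids": [], "Count": 0},
    3: {"Type": "StructTreeRoot", "K": ("ref", 4)},
    4: {"Type": "StructElem", "S": "Document", "K": []},
}


def resolve(value):
    """Follow indirect references until a direct object is reached."""
    while isinstance(value, tuple) and len(value) == 2 and value[0] == "ref":
        value = objects[value[1]]
    return value


def lookup(start, *keys):
    """Walk a chain of dictionary keys, resolving references at each step."""
    current = resolve(start)
    for key in keys:
        current = resolve(current[key])
    return current


# From the catalog, hop to the structure tree root, then to its first element.
catalog = objects[1]
root_elem = lookup(catalog, "StructTreeRoot", "K")
```

The objects can sit anywhere in the file, so a library has to build a table like this first and then chase references from the catalog. If the catalog has no `StructTreeRoot` entry, there is no tag tree to find.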

Very complicated.

I would recommend looking at some of the well-developed PDF libraries out there, like iText:

http://itextpdf.com/

userx
A: 

What do you mean by 'well-structured'?

If the PDFs contain marked content, you can get an almost perfect extraction of the semantic data. Otherwise that data simply does not exist, though it might be 'guessed' in some cases.
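As a rough sketch of the marked-content case: assume some PDF library has already loaded the structure tree into plain Python dicts, with the structure type under `"S"` and the kids under `"K"`, and has already resolved the marked-content text into strings. Turning that into bare XHTML is then a short recursive walk. The `TAG_MAP` mapping, the `to_xhtml` function and the sample `tree` below are my own assumptions for illustration, not any library's API. The structure type names (`H1`, `P`, `L`, `LI`, `LBody`) are standard ones from the spec.

```python
from html import escape

# Assumed mapping from PDF standard structure types to XHTML tags.
TAG_MAP = {
    "Document": "body",
    "Sect": "div",
    "H1": "h1",
    "H2": "h2",
    "H3": "h3",
    "P": "p",
    "L": "ul",
    "LI": "li",
}


def to_xhtml(node):
    """Render a structure element (dict) or a text leaf (str) as XHTML."""
    if isinstance(node, str):
        return escape(node)
    inner = "".join(to_xhtml(kid) for kid in node.get("K", []))
    tag = TAG_MAP.get(node.get("S"))
    # Unmapped structure types are unwrapped: keep the content, drop the tag.
    return f"<{tag}>{inner}</{tag}>" if tag else inner


# Hypothetical tree, shaped like what a tagged PDF might yield.
tree = {"S": "Document", "K": [
    {"S": "H1", "K": ["Report"]},
    {"S": "P", "K": ["Costs & benefits."]},
    {"S": "L", "K": [
        {"S": "LI", "K": [{"S": "LBody", "K": ["First item"]}]},
    ]},
]}

xhtml = to_xhtml(tree)
```

The hard part in practice is the step this sketch skips: real structure elements point at marked-content IDs inside page content streams rather than holding text directly, so the library has to match those up before anything like this can run.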

mark stephens