It is basically all in the title: I need to take a bunch of large PDFs and convert them to XHTML 1.0 Strict. Close is good enough; I can clean it up afterwards. Thanks.

+2  A: 

This is a complex request, because whether it can be done at all depends on the PDF itself and how it was created. As a first attempt, I would try Adobe's own online PDF-to-HTML converter

http://www.adobe.com/products/acrobat/access_onlinetools.html

and then try to fix up the HTML after the fact with something like tidy

http://tidy.sourceforge.net/
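
As a rough sketch of that clean-up step, the Python below shells out to the `tidy` command-line tool and asks it to rewrite an HTML file as XHTML with a Strict doctype. It assumes `tidy` is installed and on the PATH; the function names and file names are made up for illustration.

```python
import subprocess


def build_tidy_command(src, dst):
    """Build the tidy invocation that rewrites src as XHTML into dst."""
    return [
        "tidy",
        "-quiet",               # suppress non-essential output
        "-asxhtml",             # emit well-formed XHTML
        "--doctype", "strict",  # request the Strict DTD
        "-output", dst,         # write the result here instead of stdout
        src,
    ]


def tidy_to_xhtml(src, dst):
    """Run tidy; return True if it finished without errors."""
    result = subprocess.run(build_tidy_command(src, dst))
    # tidy exits with 0 for success, 1 for warnings, 2 for errors
    return result.returncode in (0, 1)
```

Calling `tidy_to_xhtml("converted.html", "converted.xhtml")` on each converted file would give a first pass; anything tidy cannot repair still has to be fixed by hand.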

If the PDFs were created by scanning in images, there may be no text associated with them at all. In that case the best you can do is either cut the pages apart and turn them into JPG images, or run some sort of OCR software on the PDF itself.
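
One way to find out which case you are in before converting anything is to try extracting text from the first few pages and see whether anything comes back. The sketch below assumes the `pdftotext` tool from Poppler/Xpdf is installed; the function names and the 20-character threshold are arbitrary choices for illustration, not a tested heuristic.

```python
import subprocess


def looks_scanned(extracted_text, min_chars=20):
    """Guess that a PDF is image-only if almost no text was extracted."""
    visible = "".join(extracted_text.split())  # drop all whitespace
    return len(visible) < min_chars


def extract_text(pdf_path, last_page=3):
    """Return text from the first few pages using pdftotext (assumed installed)."""
    result = subprocess.run(
        # -l sets the last page to convert; "-" sends the text to stdout
        ["pdftotext", "-l", str(last_page), pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

If `looks_scanned(extract_text("report.pdf"))` comes back true, the file most likely needs OCR or the image approach rather than a text conversion.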

I warn you that even if the PDFs were created by hand and thus have text information in them, the conversion is likely to introduce a lot of mistakes that will have to be fixed by hand. I work on a product that basically does this for corporate annual reports and the like, and we ultimately settled on cutting the pages up into JPG/GIF images and building the HTML around those, as the other processes we tried introduced too many errors and fixing them all was too labor-intensive.
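
To give an idea of what the image approach amounts to, here is a minimal sketch that wraps a list of already-exported page images in an XHTML 1.0 Strict document. It is an illustration only, not the code from our product; the function name and the image file names are hypothetical, and producing the images from the PDF is left to whatever tool you prefer.

```python
from xml.sax.saxutils import escape, quoteattr

XHTML_STRICT_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)


def build_image_page(title, image_files):
    """Return an XHTML 1.0 Strict document showing one image per PDF page."""
    lines = [
        XHTML_STRICT_DOCTYPE,
        '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">',
        "<head>",
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
        "<title>%s</title>" % escape(title),
        "</head>",
        "<body>",
    ]
    for number, name in enumerate(image_files, start=1):
        # Strict requires img to sit inside a block element and carry alt text
        lines.append(
            "<div><img src=%s alt=%s /></div>"
            % (quoteattr(name), quoteattr("Page %d" % number))
        )
    lines += ["</body>", "</html>"]
    return "\n".join(lines)
```

For example, `build_image_page("Annual Report", ["page-001.jpg", "page-002.jpg"])` returns the markup as a string that can be written straight to a file. The obvious trade-off is that the text is no longer searchable or selectable.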

TJ