I'm looking for a PDF library that will let me extract the text from a PDF document. I've looked at PyPDF, which extracts the text very nicely. The problem is that if the document contains tables, the table text is extracted in-line with the rest of the document text. This produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together).
I'm looking for something a bit more advanced: I'd like to extract the text from a PDF document while excluding any tables and special formatting. Is there a library that does this, or am I forced to post-process the output text to remove these sections?
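To make the post-processing option concrete, here is a rough sketch of the kind of filter I have in mind. It assumes the text has already been extracted into a plain string, and it treats any line made up mostly of digits as table residue. The function names and the 0.5 threshold are placeholders I made up for illustration, not part of PyPDF or any other library, and the heuristic would probably misfire on tables that contain mostly words.

```python
def digit_ratio(line: str) -> float:
    """Return the fraction of non-whitespace characters that are digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def drop_table_like_lines(text: str, max_digit_ratio: float = 0.5) -> str:
    """Keep only the lines whose digit ratio is at or below the threshold.

    The threshold is an arbitrary guess (an assumption), to be tuned per document.
    """
    kept = [
        line
        for line in text.splitlines()
        if digit_ratio(line) <= max_digit_ratio
    ]
    return "\n".join(kept)


if __name__ == "__main__":
    # Stand-in for text already extracted from a PDF page.
    sample = "Revenue grew in 2020.\n12 345 678 90\nSee the next section."
    print(drop_table_like_lines(sample))
```

This works line by line, so it only helps when each table row comes out as its own line of extracted text.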
Any ideas & thoughts would be greatly appreciated. Thank you.