views:

894

answers:

3

I have approx. 40k+ HTML documents from which I need to extract information. I have tried to do so using PHP + Tidy (because most files are not well-formed) + DOMDocument + XPath, but it is extremely slow. I have been advised to use regexps, but the HTML files are not marked up semantically (table-based layout, with meaningless tags/classes used everywhere) and I don't know where I should start...

Just being curious: is using regexps (in PHP or Python) faster than using an XPath library? And are Python's XPath libraries generally faster than PHP's counterparts?

+2  A: 

You might give Beautiful Soup in Python a try. It's a pretty great parser for generating a usable DOM out of garbage HTML. That, combined with some regex skills, might get you what you need. Happy hunting!
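A minimal sketch of that approach, assuming the modern bs4 package (older releases were simply called BeautifulSoup); the file name and the "price" class are hypothetical placeholders:

    from bs4 import BeautifulSoup

    # Beautiful Soup tolerates unclosed and misnested tags and still
    # builds a usable tree out of the document.
    with open("document.html") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Hypothetical extraction: grab text from table cells by class.
    for cell in soup.find_all("td", class_="price"):
        print(cell.get_text(strip=True))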

Most comparable operations in Python are faster than in PHP, in my subjective experience. That is partly because Python is compiled to bytecode rather than reinterpreted from source on every run, and partly because Python has been optimized for efficiency by its contributors over the years...

Still, for 40k+ documents, find a nice fast machine ;-)

Gabriel Hurley
thanks for the answer :D the production machine should be about twice as fast as my dev PC :D but it is running way too slow on my machine :/ will give Beautiful Soup a try soon
Jeffrey04
lxml, mentioned elsewhere, has a BeautifulSoup-like API as well.
Ned Deily
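For reference, a minimal lxml sketch in the same spirit; lxml.html also copes with broken markup, and the file name and XPath expression here are hypothetical placeholders:

    from lxml import html

    # lxml.html is lenient about malformed input, much like Beautiful Soup,
    # and the resulting tree supports full XPath queries.
    with open("document.html") as f:
        tree = html.fromstring(f.read())

    # Hypothetical query: pull every link target out of the table layout.
    for href in tree.xpath("//table//a/@href"):
        print(href)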
A: 

As the previous post mentions, Python is generally faster than PHP thanks to byte-code compilation (those .pyc files). And a lot of DOM/SAX parsers use a fair bit of regexp internally anyway; those who told you to use regexps need to be told that they are not a magic bullet. For 40k+ documents I would recommend parallelizing the task, either with the multiprocessing module (new in Python 2.6) or with the classic Parallel Python library.
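A minimal sketch of that parallelization using the standard-library multiprocessing module; extract_data and the docs/*.html glob are hypothetical, and any per-file parser could be dropped in:

    import glob
    from multiprocessing import Pool

    from lxml import html

    def extract_data(path):
        # Hypothetical per-file parser: return all link targets.
        with open(path) as f:
            tree = html.fromstring(f.read())
        return path, tree.xpath("//a/@href")

    if __name__ == "__main__":
        files = glob.glob("docs/*.html")
        # Pool() starts one worker process per CPU core by default.
        with Pool() as pool:
            for path, links in pool.imap_unordered(extract_data, files):
                print(path, len(links))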

whatnick
+3  A: 
Peter Hoffmann