I have some data I need to extract from a collection of HTML files. I am not sure whether the data resides in a div element, a table element, or a combined element (where the div tag is an element of a table); I have seen all three cases. My files are large (as big as 2 MB) and I have tens of thousands of them. So far I have looked at the td elements in the tables and at the lonely div elements. It seems to me that most of the time goes to souping the file, upwards of 30 seconds.

I played around with creating a regular expression to find the data I am looking for and then looking for the next close tag (table, tr, td, or div) to determine what type of structure my text is contained in, finding the matching open tag, snipping that section, and then wrapping it all in open and close HTML tags. The text I am after sits in a structure like:

    stuff
    <div>
    stuff
    myText
    stuff
    </div>

So I create a string that looks like:

    s = '<div>stuffmyTextstuff</div>'

I then wrap the string:

    def stringWrapper(s):
        # wrap the snipped fragment in a minimal document for the parser
        newString = '<HTML>' + s + '</HTML>'
        return newString

And then use BeautifulSoup:

    littleSoup = BeautifulSoup(stringWrapper(s))

I can then access the power of BeautifulSoup to do what I want with littleSoup.

This runs much faster than the alternative, which is to first test all the cell contents of all the tables until I find my text and, if I can't find it there, to test all the div contents.
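
Roughly, the snipping step looks like this (a simplified sketch; the regex and the backwards scan are illustrative and ignore nested tags):

    import re

    # close tags that can end the structure containing the target text
    CLOSE_TAG = re.compile(r'</(table|tr|td|div)>', re.IGNORECASE)

    def snip(raw, target):
        # locate the target text in the raw file contents
        hit = raw.find(target)
        if hit == -1:
            return None
        # the next close tag after the hit tells me which structure I am in
        close = CLOSE_TAG.search(raw, hit)
        if close is None:
            return None
        tag = close.group(1).lower()
        # nearest matching open tag before the hit (nesting is ignored)
        start = raw.rfind('<' + tag, 0, hit)
        if start == -1:
            return None
        return raw[start:close.end()]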

Am I missing something here?

+3  A: 

Have you tried lxml? BeautifulSoup is good but not super-fast, and I believe lxml can offer the same quality but often better performance.
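
A minimal sketch of the same lookup with lxml (the file name is a placeholder):

    from lxml import html

    # parse one file and find every element whose text contains the target
    tree = html.parse('page.html')
    for el in tree.getroot().xpath('//*[contains(text(), "myText")]'):
        print(el.text_content())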

Alex Martelli
+3  A: 

BeautifulSoup uses regular expressions internally (its tolerance for broken markup is what separates it from stricter XML parsers), so you'll likely find yourself just repeating what it already does. If you want a faster option, use try/except to attempt an lxml or etree parse first, then fall back to BeautifulSoup and/or tidylib to parse the broken HTML if the strict parser fails.
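
A rough sketch of that fallback (the BeautifulSoup 3 import matches the question's usage; the caller has to cope with the two result types):

    from lxml import etree
    from BeautifulSoup import BeautifulSoup

    def parse_page(text):
        try:
            # fast, strict parse first
            return etree.fromstring(text)
        except etree.XMLSyntaxError:
            # tolerant parse for tag soup
            return BeautifulSoup(text)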

It seems that for what you are doing you really want to be using XPath or XSLT to find and retrieve your data; lxml can do both.
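
For instance, one XPath expression can return the nearest enclosing div or td around your text, whichever it happens to be (a sketch; the file name is a placeholder):

    from lxml import html

    tree = html.parse('page.html')
    # nearest div or td ancestor of any text node containing the target
    containers = tree.xpath(
        '//text()[contains(., "myText")]/ancestor::*[self::div or self::td][1]')
    for el in containers:
        print(el.tag)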

Finally, given the size of your files, you should probably parse from a path or file handle so the source can be read incrementally rather than held in memory for the parse.
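
With lxml that might look like the following sketch, which streams the document and discards elements once they have been inspected:

    from lxml import etree

    # iterparse reads the file incrementally; 'page.html' is a placeholder
    for event, elem in etree.iterparse('page.html', html=True):
        text = elem.text or ''
        if elem.tag in ('div', 'td') and 'myText' in text:
            print(elem.tag)
        elem.clear()   # free elements we have already inspected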

SpliFF
+1  A: 

I don't quite understand what you are trying to do. But I do know that you don't need to enclose your div string in <html> tags; BS will parse it just fine.
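
For example (using the BeautifulSoup 3 import, to match the question):

    from BeautifulSoup import BeautifulSoup

    # a bare fragment parses fine without any <html> wrapper
    soup = BeautifulSoup('<div>stuffmyTextstuff</div>')
    print(soup.div.string)   # -> stuffmyTextstuff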

Unknown
+1  A: 

I've found that even if lxml is faster than BeautifulSoup, for documents of that size it's usually best to reduce the input to a few kB via a regex (or direct stripping) and load that into BS, as you are doing now.

Vinko Vrsalovic