I have some data I need to extract from a collection of HTML files. I am not sure whether the data resides in a div element, a table element, or a combined element (where the div tag is an element of a table); I have seen all three cases. My files are large (as big as 2 MB) and I have tens of thousands of them. So far I have looked at the td elements in the tables and at the lone div elements. It seems to me that the longest step is souping the file, upwards of 30 seconds. I have played around with creating a regular expression to find the data I am looking for, then looking for the next closing tag (table, tr, td, or div) to determine what type of structure my text is contained in, finding the matching opening tag, snipping that section, and wrapping it all in opening and closing HTML tags (a rough sketch follows the example below). So, given a file that contains:
stuff
<div>
stuff
mytext
stuff
</div>
So I create a string that looks like:
s = '<div>stuffmytextstuff</div>'
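Here is a rough sketch of that snipping step (the function name and the fixed tag list are illustrative, and it assumes the enclosing tag is not nested inside another tag of the same name):

import re

def snipEnclosing(html, pattern):
    match = re.search(pattern, html)
    if match is None:
        return None
    # Find the nearest closing tag after the match to learn which
    # structure (table, tr, td, or div) contains the text.
    closePos, tagName = -1, None
    for candidate in ('table', 'tr', 'td', 'div'):
        pos = html.find('</' + candidate + '>', match.end())
        if pos != -1 and (closePos == -1 or pos < closePos):
            closePos, tagName = pos, candidate
    if tagName is None:
        return None
    # Find the matching opening tag before the match and snip the section.
    openPos = html.rfind('<' + tagName, 0, match.start())
    if openPos == -1:
        return None
    return html[openPos:closePos + len('</' + tagName + '>')]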
I then wrap the string:
def stringWrapper(s):
    newString = '<HTML>' + s + '</HTML>'
    return newString
And then use BeautifulSoup on the result:
littleSoup = BeautifulSoup(stringWrapper(s))
I can then use the full power of BeautifulSoup on littleSoup to do what I want.
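For example, to pull the text back out of littleSoup (the tag choice here is illustrative; it would be td in the table case):

target = littleSoup.find('div')
if target is not None:
    print(target.string)    # 'stuffmytextstuff'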
This runs much faster than the alternative, which is to test all the cell contents of all the tables until I find my text and, if I can't find it there, to test all the div contents.
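Roughly, that alternative looks like this (findByWalking and the substring test are just illustrative stand-ins for my actual code):

def findByWalking(fullSoup, target):
    # First test every cell of every table...
    for cell in fullSoup.findAll('td'):
        if target in str(cell):
            return cell
    # ...then fall back to testing the div contents.
    for div in fullSoup.findAll('div'):
        if target in str(div):
            return div
    return None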
Am I missing something here?