First off the html row looks like this:
<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>
I would show the real html but I am sorry to say don't know how to block it. feels shame
Using BeautifulSoup (Python) or any other recommended Screen Scraping/Parsing method I would like to output about 1200 .htm files in the same directory into a CSV format. This will eventually go into an SQL database. Each directory represents a year and I plan to do at least 5 years.
I have been goofing around with glob
as the best way to do this from some advice. This is what I have so far and am stuck.
import glob
from BeautifulSoup import BeautifulSoup
for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
#these files go from pl020001.htm to pl021230.htm sequentially
soup = BeautifulSoup(open(filename["r"]))
for row in soup.findAll("tr", attrs={ "class" : "evenColor" })
I realize this is ugly but it's my first time attempting anything like this. This one problem has taken me months to get to this point after realizing that I don't have to manually go through thousands of files copy and pasting into excel. I have also realized that I can kick my computer repeatedly out of frustration and it still works (not recommended). I am getting close and I need to know what to do next to make those CSV files. Please help or my monitor finally gets hammer punched.