views:

538

answers:

3

First off the html row looks like this:

<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>

I would show the real html but I am sorry to say don't know how to block it. feels shame

Using BeautifulSoup (Python) or any other recommended Screen Scraping/Parsing method I would like to output about 1200 .htm files in the same directory into a CSV format. This will eventually go into an SQL database. Each directory represents a year and I plan to do at least 5 years.

I have been goofing around with glob as the best way to do this from some advice. This is what I have so far and am stuck.

import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
#these files go from pl020001.htm to pl021230.htm sequentially
    soup = BeautifulSoup(open(filename["r"]))
    for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

I realize this is ugly but it's my first time attempting anything like this. This one problem has taken me months to get to this point after realizing that I don't have to manually go through thousands of files copy and pasting into excel. I have also realized that I can kick my computer repeatedly out of frustration and it still works (not recommended). I am getting close and I need to know what to do next to make those CSV files. Please help or my monitor finally gets hammer punched.

+1  A: 

That looks fine, and BeautifulSoup is useful for this (although I personally tend to use lxml). You should be able to take that data you get, and make a csv file out of is using the csv module without any obvious problems...

I think you need to actually tell us what the problem is. "It still doesn't work" is not a problem descripton.

Lennart Regebro
+3  A: 

You don't really explain why you are stuck - what's not working exactly?

The following line may well be your problem:

soup = BeautifulSoup(open(filename["r"]))

It looks to me like this should be:

soup = BeautifulSoup(open(filename, "r"))

The following line:

for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

looks like it will only pick out even rows (assuming your even rows have the class 'evenColor' and odd rows have 'oddColor'). Assuming you want all rows with a class of either evenColor or oddColor, you can use a regular expression to match the class value:

for row in soup.findAll("tr", attrs={ "class" : re.compile(r"evenColor|oddColor") })
Judy2K
@It looks to me like this should be:soup = BeautifulSoup(open(filename, "r"))--thanks I changed it
northnodewolf
+2  A: 

You need to import the csv module by adding import csv to the top of your file.

Then you'll need something to create a csv file outside your loop of the rows, like so:

writer = csv.writer(open("%s.csv" % filename, "wb"))

Then you need to actually pull the data out of the html row in your loop, similar to

values = (td.fetchText() for td in row)
writer.writerow(values)
Hank Gay
Yes yes. This is what I am talking about. Thanks. I also realized that maybe I need to import re for regex? We are using '*' and '%' so we need to import re, maybe?For the second part you say I need to pull the data out of the html row... but if the rows have been written from html to csv what's the point. I am probably not wrapping my head around something but even if there are 1230 .csv files that's good enough for me right now. Here's a link to the files I am working with:http://www.nhl.com/scores/htmlreports/20082009/PL020808.HTM
northnodewolf