views: 577
answers: 2

I am trying to scrape rows from over 1200 .htm files on my hard drive. On my computer they are here: 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. The files are numbered sequentially from *20001.HTM to *21230.HTM. My plan is to eventually toss my data into MySQL or SQLite via a spreadsheet app, or straight in if I can get a clean .csv file out of this process.

This is my first attempt at code (Python) and at scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say I am a newb and have hit some roadblocks.

How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///'-style URL, or is there another way to point it at /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or just send all 1230?

I just need the rows that start with <tr class="evenColor"> and end with </tr>. Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL", but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to account for this? There are 3 tables per .htm file.
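(For reference, a minimal sketch of that filter, assuming BeautifulSoup 3; the file path is the one given above, and the keyword tuple is just the three words mentioned:)

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(open('/home/phi/Data/NHL/pl07-08/PL020001.HTM').read())
for row in soup.findAll('tr', attrs={'class': 'evenColor'}):
    # findAll(text=True) gathers text from nested tags too, so a
    # bolded "GOAL" still matches without any special handling
    text = ''.join(row.findAll(text=True))
    if any(word in text for word in ('SHOT', 'MISS', 'GOAL')):
        print row  # the whole row, every column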

Also, I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape, so I can ID them in their own column in the final database. I don't even know where to begin with that. This is what I have so far:

#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser

mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)
##this confuses me and seems redundant
pl = open("input_file.html", "r")
chances = open("chancesforsql.csv", "w")

table = soup.find("table", border=0)
for row in table.findAll("tr", attrs={"class": "evenColor"}):
    #should I do this instead of before?
    outfile = open("shooting.csv", "w")

##how do I end it?

Should I be using IDLE or something like it, or just the Terminal in Ubuntu 9.04?

+1  A: 

You won't need mechanize. Since I don't know the exact HTML content, I'd first try to see what matches, like this:

import glob
from BeautifulSoup import BeautifulSoup

# glob returns matches in arbitrary order; sorted() walks the files in sequence
for filename in sorted(glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM')):
    soup = BeautifulSoup(open(filename, "r").read()) # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={ "class" : "evenColor" }):
        print a_tr

Then pick the stuff you want and print it to stdout with commas (redirecting it with > to a file), or write the CSV via Python's csv module.
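If you go the Python route, a minimal sketch using the standard csv module might look like this (the output filename is borrowed from the question; the cell extraction is an assumption about the row structure):

import csv
import glob
import os
from BeautifulSoup import BeautifulSoup

# 'wb' because the csv module expects a binary-mode file on Python 2
writer = csv.writer(open('chancesforsql.csv', 'wb'))

for filename in sorted(glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM')):
    soup = BeautifulSoup(open(filename, 'r').read())
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        # one value per <td>; findAll(text=True) flattens nested tags like <b>
        cells = [''.join(td.findAll(text=True)) for td in a_tr.findAll('td')]
        # prepend the source filename so each row can be ID'd in the database
        writer.writerow([os.path.basename(filename)] + cells)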

The MYYN
So here is the row's HTML content:

<tr class="evenColor"><td class="goal + bborder" align="center">56</td><td class="goal + bborder" align="center">1</td><td class="goal + bborder" align="center">PP</td><td class="goal + bborder" align="center"></td><td class="goal + bborder">GOAL</td><td class="goal + bborder"></td><td class="bold + bborder + rborder"></td><td class="bold + bborder"></td></tr>

Again, I would like the whole row, including the filename (pl020001.htm) as a column in the scraped row if possible.
northnodewolf
'write the csv via python': I don't know how to do this but would like to know. Do I need 'import csv'?
northnodewolf
@northnodewolf: Post a new question with the new facts about the HTML structure and the CSV you'd like to create from the HTML table.
S.Lott
A: 

MYYN's answer looks like a great start to me. One thing I'd point out that I've had luck with is:

import glob

for file_name in glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM'):
    # read the file and then parse with BeautifulSoup

I've found both the os and glob imports to be really useful for running through files in a directory.

Also, once you're using a for loop this way, you have file_name available, which you can modify to build the output filename, so that the output filenames match the input filenames.
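For example (the naming scheme here is just one possibility):

import glob
import os

for file_name in glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM'):
    # 'PL020001.HTM' -> 'pl020001.csv'
    base = os.path.splitext(os.path.basename(file_name))[0]
    outfile = open(base.lower() + '.csv', 'w')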

npdoty
looks better, thanks.
The MYYN