ansaurus

Question

Answer 1

+1 A:

You won't need mechanize. Since I don't know exactly what the HTML looks like, I'd first check what matches, like this:

import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/Data/*.htm'):
    soup = BeautifulSoup(open(filename, "r").read()) # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={ "class" : "evenColor" }):
        print a_tr

Then pick out the cells you want and write them to stdout separated by commas (redirecting the output to a file with >). Or write the CSV directly from Python.
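As a rough illustration of the "write the CSV from Python" option, here is a minimal sketch using the standard csv module. It is written for Python 3 and assumes the cell texts have already been pulled out of each matching row; the helper name write_rows and the sample values are made up for the example.

```python
import csv
import io

def write_rows(out, rows):
    """Write each row (a sequence of cell strings) as one CSV line to `out`."""
    writer = csv.writer(out)
    for cells in rows:
        writer.writerow(cells)

# Hypothetical cell texts standing in for the <td> contents of one table row.
rows = [["56", "1", "PP", "", "GOAL"]]

# An in-memory buffer keeps the sketch self-contained; a real file works the same way.
buf = io.StringIO()
write_rows(buf, rows)
print(buf.getvalue())
```

To write to disk instead, pass a file opened with open("out.csv", "w", newline="") in place of the buffer. The csv module takes care of quoting cells that themselves contain commas, which plain string joining would not.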

The MYYN 2009-05-28 21:45:59

So here is the row's HTML content:

<tr class="evenColor"><td class="goal + bborder" align="center">56</td><td class="goal + bborder" align="center">1</td><td class="goal + bborder" align="center">PP</td><td class="goal + bborder" align="center"></td><td class="goal + bborder">GOAL</td><td class="goal + bborder"></td><td class="bold + bborder + rborder"></td><td class="bold + bborder"></td></tr>

Again, I would like the whole row, including the filename (pl020001.htm) as a column in the scraped row, if possible.

northnodewolf 2009-05-29 02:14:04

'write the csv via python': I don't know how to do this but would like to know. Do I need 'import csv'?

northnodewolf 2009-05-29 02:15:43

@northnodewolf: Post a new question with the new facts about the HTML structure and the CSV you'd like to create from the HTML table.

S.Lott 2009-05-29 02:48:58

Answer 2

A:

MYYN's answer looks like a great start to me. One thing I'd point out that I've had luck with is:

import glob

for file_name in glob.glob('/home/phi/Data/*.htm'):
    # read the file and then parse with BeautifulSoup
    pass

I've found both the os and glob imports to be really useful for running through files in a directory.

Also, once you're using a for loop in this way, you have the file_name which you can modify for use in the output file, so that the output filenames will match the input filenames.
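A small sketch of that idea, using only os.path (Python 3). The helper name output_name and the choice of a .csv extension are assumptions for illustration; the input path reuses the directory and filename mentioned in this thread.

```python
import os

def output_name(file_name, new_ext=".csv"):
    """Return the input path with its extension replaced by new_ext."""
    base, _old_ext = os.path.splitext(file_name)
    return base + new_ext

in_path = "/home/phi/Data/pl020001.htm"

# Output path that matches the input file, next to it in the same directory.
out_path = output_name(in_path)

# Bare filename, e.g. for use as an extra column in each scraped row.
short_name = os.path.basename(in_path)

print(out_path)
print(short_name)
```

Inside the glob loop, file_name would take the place of in_path, so each input file gets its own matching output file.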

npdoty 2009-05-28 22:11:11

looks better, thanks.

The MYYN 2009-05-29 01:06:05


Scraping Multiple html files to CSV
