views: 965
answers: 7
I am trying to scrape an HTML table and save its data in a database. What strategies/solutions have you found to be helpful in approaching this problem?

I'm most comfortable with Java and PHP but really a solution in any language would be helpful.

EDIT: For more detail, the UTA (Salt Lake City's bus system) provides bus schedules on its website. Each schedule appears in a table that has stations in the header and times of departure in the rows. I would like to go through the schedules and save the information from each table in a form that I can then query.

Here's the starting point for the schedules
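
To make that layout concrete, here is a rough sketch of pulling stations and departure times out of such a table, in Python with BeautifulSoup; the URL, the choice of the first table on the page, and the assumption that the first row names the stations are placeholders, not the real UTA markup.

from BeautifulSoup import BeautifulSoup
import urllib

# Hypothetical schedule page; substitute a real UTA schedule URL.
url = 'http://www.rideuta.com/some_schedule_page'
soup = BeautifulSoup(urllib.urlopen(url).read())

# Assume the schedule is the first table and its first row names the stations.
table = soup.findAll('table')[0]
rows = table.findAll('tr')
stations = [''.join(c.findAll(text=True)).strip() for c in rows[0].findAll(['th', 'td'])]

# The remaining rows hold one departure time per station.
departures = []
for row in rows[1:]:
    times = [''.join(c.findAll(text=True)).strip() for c in row.findAll('td')]
    departures.extend(zip(stations, times))

# departures is now a list of (station, time) pairs, ready to insert into a database.
print departures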

+1  A: 

There is a nice book about this topic: Spidering Hacks by Kevin Hemenway and Tara Calishain.

Matej
+3  A: 

It all depends on how well-formed the HTML you want to scrape is. If it's valid XHTML, you can simply run some XPath queries on it to get whatever you want.

An example of XPath in PHP: http://blogoscoped.com/archive/2004_06_23_index.html#108802750834787821

A helper class to scrape a table into an array: http://www.tgreer.com/class_http_php.html
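
By way of illustration (and not taken from the pages linked above), here is roughly what the XPath approach looks like in Python with lxml; the URL and the //table[1]//tr expression are placeholders that would have to match the real markup.

# Sketch of the XPath idea; URL and expressions are assumptions.
from lxml import html
import urllib

page = urllib.urlopen('http://example.com/schedule.html').read()
doc = html.fromstring(page)

# Walk every row of the first table and print the text of its cells.
for row in doc.xpath('//table[1]//tr'):
    cells = [cell.text_content().strip() for cell in row.xpath('./td | ./th')]
    print cells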

David Cumps
+1  A: 

I have tried screen scraping before, but I found it to be very brittle, especially with dynamically generated code. I found a third-party DOM parser and used it to navigate the source code with regex-like matching patterns in order to find the data I needed.

I suggest trying to find out whether the owners of the site have a published API (often web services) for retrieving data from their system. If not, then good luck to you.

Gilligan
Unfortunately there is no API on the site; otherwise that would be ideal.
Dan Cramer
+1  A: 

I've found that scripting languages are generally better suited for doing such tasks. I personally prefer Python, but PHP will work as well. Chopping, mincing and parsing strings in Java is just too much work.

Petey
+1  A: 

This would be by far the easiest with Perl, and the following CPAN modules:

CPAN is the main distribution mechanism for Perl modules; you can install a module by running the following shell command, for example:

# cpan HTML::Parser

If you're on Windows, things will be more interesting, but you can still do it: http://www.perlmonks.org/?node_id=583586

pianohacker
A: 

pianohacker overlooked the HTML::TableExtract module, which was designed for exactly this sort of thing. You'd still need LWP to retrieve the table.

cjm
+1  A: 

If what you want is a CSV table, then you can do it like this, using Python.

For example, imagine you want to scrape forex quotes in CSV form from some site like fxoanda.

Then...

from BeautifulSoup import BeautifulSoup
import urllib
import os

# Date range and currency pair for the OANDA historical-rates query.
date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

# Fetch the page and pull out the <pre> block that holds the CSV data.
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = data.replace('[<pre>', '')
data = data.replace('</pre>]', '')

# Write the CSV text to disk (edit the location to suit your machine).
file_location = '/Users/location_edit_this'
file_name = os.path.join(file_location, 'usd_aus.csv')
csv_file = open(file_name, 'w')
csv_file.write(data)
csv_file.close()

Once you have it in this form, you can convert the data to any form you like.
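
And since the original question was about getting the data into a database, a small follow-on sketch: the database path, table name, and the assumption of two columns per CSV row are mine, not part of the answer above, but something like this will load the resulting file into SQLite so it can be queried.

import csv
import sqlite3

# Assumed paths and schema; adjust to match the CSV actually produced above.
conn = sqlite3.connect('/Users/location_edit_this/fx_rates.db')
conn.execute('CREATE TABLE IF NOT EXISTS rates (quote_date TEXT, rate REAL)')

reader = csv.reader(open('/Users/location_edit_this/usd_aus.csv', 'rb'))
for row in reader:
    if len(row) >= 2:  # skip blank or malformed lines
        conn.execute('INSERT INTO rates VALUES (?, ?)', (row[0], row[1]))

conn.commit()
conn.close()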

Thorvaldur