I would like to write a program that will find bus stop times and update my personal webpage accordingly.

If I were to do this manually I would

  1. Visit www.calgarytransit.com
  2. Enter a stop number (e.g., 9510)
  3. Click the button "next bus"

The results may look like the following:

10:16p Route 154
10:46p Route 154
11:32p Route 154

Once I've grabbed the times and routes, I will update my webpage accordingly.

I have no idea where to start. I know diddly squat about web programming but can write some C and Python. What are some topics/libraries I could look into?

+2  A: 

Since you write in C, you may want to check out cURL; in particular, take a look at libcurl. It's great.

Anthony Cuozzo
+11  A: 

Beautiful Soup is a Python library designed for parsing web pages. Between it and urllib2 (urllib.request in Python 3) you should be able to figure out what you need. :)
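
As a minimal sketch of the idea: fetch the page with urllib, then hand the HTML to Beautiful Soup and pull out the cells you want. The snippet below parses an inline sample instead of hitting the live site, and the table layout shown is an assumption — inspect the real results page to see its actual structure. (The example uses the modern bs4 package name.)

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML you'd download with urllib2 / urllib.request;
# the real Calgary Transit page will be laid out differently.
html = """
<table>
  <tr><td>10:16p</td><td>Route 154</td></tr>
  <tr><td>10:46p</td><td>Route 154</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Grab the first cell (the departure time) of each table row.
times = [row.td.get_text() for row in soup.find_all("tr")]
print(times)  # → ['10:16p', '10:46p']
```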

Jeremy Banks
Also, be sure to get the latest version, 3.1.0.1 (released yesterday, Jan. 6, 2009); it fixes a major regression introduced in the previous version, which caused the parser to raise an exception when faced with boolean attributes like <td nowrap> in wild, untamed HTML land.
Prairiedogg
+1 for Beautiful Soup.
Dana Robinson
-1: urllib. Should be urllib2.
S.Lott
@S.Lott: Doh, you're right. Corrected, and I also added a link to urllib.request for Py3k.
Jeremy Banks
I had problems deploying a parser developed with BeautifulSoup-3.0.x to an Ubuntu 10.04 system - it would always choke while parsing input. It turns out Lucid Lynx ships with BeautifulSoup-3.1.0.1, which isn't exactly a good release (http://www.crummy.com/software/BeautifulSoup/3.1-problems.html). I solved it by shipping my version of BS (3.0.x) with my program.
Luke404
+2  A: 

What you're asking about is called "web scraping." I'm sure if you google around you'll find some stuff, but the core notion is that you want to open a connection to the website, slurp in the HTML, parse it and identify the chunks you want.
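
The core notion above can be sketched with nothing but the standard library: subclass `html.parser.HTMLParser` and collect the chunks you care about as the parser walks the HTML. The sample markup is a made-up stand-in for the real page.

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell in the page."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells.append(data.strip())

# In practice you'd slurp this HTML from the website over a socket/urllib.
html = "<table><tr><td>10:16p</td><td>Route 154</td></tr></table>"
parser = CellExtractor()
parser.feed(html)
print(parser.cells)  # → ['10:16p', 'Route 154']
```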

The Python Wiki has a good deal of material on this.

Charlie Martin
+1  A: 

That site doesn't offer an API for getting the data you need, so you'll need to parse the actual HTML page returned by, for example, a cURL request.

Luca Matteis
A: 

As long as the layout of the web page you're trying to 'scrape' doesn't change regularly, you should be able to parse the HTML with any modern-day programming language.

Jobo
+1  A: 

This is called Web scraping, and it even has its own Wikipedia article where you can find more information.

Also, you might find more details in this SO discussion.

splintor
+1  A: 

You can use Perl to help you complete your task.

use strict;
use warnings;
use LWP;

# Create a user agent (the "browser") and fetch a page.
my $browser  = LWP::UserAgent->new;
my $response = $browser->get("http://google.com");

die "Request failed: ", $response->status_line
    unless $response->is_success;

print $response->content;

Your response object can tell you whether the request succeeded, as well as return the content of the page. You can also use this same library to POST to a page.

Here is the documentation: http://search.cpan.org/~gaas/libwww-perl-5.822/lib/LWP/UserAgent.pm

J.J.
LWP is also handy.
Dana Robinson
+2  A: 

You can use the mechanize library, which is available for Python: http://wwwsearch.sourceforge.net/mechanize/

cheeming