Google's finance API is incomplete -- many of the figures on a page such as:

http://www.google.com/finance?fstype=ii&q=NYSE:GE

are not available via the API.

I need this data to rank companies on Canadian stock exchanges according to the formula of Greenblatt, available via google search for "greenblatt index scans".

My question: what is the most intelligent, clean, and efficient way to access and process the data on these web pages? Is the tedious approach (scraping each page) really necessary in this case, and if so, what is the best way of going about it? I'm currently learning Python for projects related to this one.

Thanks!

A: 

Scraping web pages always sucks, but I would recommend converting them to XML (via tidy or some other HTML -> XML program) and then using XPath to walk the nodes that you are interested in.
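As a rough sketch of the XPath step, assuming the page has already been converted to well-formed XML: the sample markup, the `financials` class name, and the `extract_figures` helper below are all made up for illustration and are not Google's actual page structure. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`. Real tidy output usually carries the XHTML namespace, which would have to be included in the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied markup standing in for a real page.
TIDIED = """<html><body>
<table class="financials">
<tr><td>Total Revenue</td><td>1,234.50</td></tr>
<tr><td>Operating Income</td><td>321.00</td></tr>
</table>
</body></html>"""


def extract_figures(xhtml):
    """Return {row label: value} for rows of the assumed 'financials' table."""
    root = ET.fromstring(xhtml)
    figures = {}
    # ElementTree supports simple predicates such as [@class='...'].
    for row in root.findall(".//table[@class='financials']/tr"):
        cells = row.findall("td")
        if len(cells) >= 2:
            label = cells[0].text.strip()
            # Strip thousands separators before converting to a number.
            figures[label] = float(cells[1].text.replace(",", ""))
    return figures


print(extract_figures(TIDIED))
```

For more complete XPath support, a third-party library such as lxml would be the usual choice.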

Paul Tarjan
+2  A: 

BeautifulSoup would be the preferred method of HTML parsing with Python.

Have you looked into options besides Google (e.g. Yahoo Finance API)?

Eli
Thanks, I will look into BeautifulSoup. You are right that the Yahoo Finance API is more complete. Unfortunately, Yahoo doesn't have the necessary data when it comes to Canadian stocks.
Marco
+2  A: 

You could try asking Google to provide the missing APIs. Otherwise, you're stuck with screen scraping, which is never fun, prone to breaking without notice, and possibly in violation of Google's terms of service.

But, if you still want to write a screen scraper, it's hard to beat a combination of mechanize and BeautifulSoup. BeautifulSoup is an HTML parser and mechanize is a Python-based web browser that will let you log in, store cookies, and generally navigate around like any other web browser.
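A minimal sketch of the parsing half, assuming the `bs4` package (the current BeautifulSoup release) is installed. The sample HTML, the `fs-table` id, and the `parse_statement` helper are hypothetical and only stand in for whatever the real page contains. Fetching the page (with mechanize or anything else) is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded financial-statement page.
SAMPLE_HTML = """
<table id="fs-table">
  <tr><th>In Millions of USD</th><th>2009</th><th>2008</th></tr>
  <tr><td>Revenue</td><td>1,234.50</td><td>1,100.00</td></tr>
  <tr><td>Net Income</td><td>-56.70</td><td>210.25</td></tr>
</table>
"""


def parse_statement(html):
    """Return {line item: [values per period]} from the assumed table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="fs-table")  # assumed id, not Google's
    rows = {}
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < 2:
            continue  # header rows contain only <th> cells
        # Strip thousands separators before converting to numbers.
        rows[cells[0]] = [float(c.replace(",", "")) for c in cells[1:]]
    return rows


print(parse_statement(SAMPLE_HTML))
```

Because a scraper like this depends on the page layout, it is worth checking that the table was actually found before trusting the numbers.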

Ryan Bright
A: 

Have you looked at YQL?

I assume when you talk about Greenblatt's formula, you mean the one that's in his Little Blue Book? Did he test it with Canadian stocks, or is that what you'd like to do?

Nosredna
I will look at YQL, thanks. You have the right Greenblatt formula. There is a detailed way of doing the calculation, available here: http://members.cox.net/econisvoodoo/piotroski/ and it has not been applied to Canadian stocks. Certainly Greenblatt has not tested it for any other stock markets as far as I can tell. I'd like to test it and use it for Canadian stocks. The Perl script provided on that page seems to work fine but uses the Yahoo API, and therefore I can't get Canadian stock information (Yahoo info is incomplete when it comes to TSE/TSX stocks).
Marco