I'm new to Python and am experimenting with a very basic web crawler. So far I have a simple function that loads a page showing the high scores for an online game. I can get the HTML source of the page, but I need to extract specific numbers from it. For instance, the URL of the page looks like this:

http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13

where 'bigdrizzle13' is the unique part of the link. The numbers on that page need to be extracted and returned. Essentially, I want a program where I only have to type in 'bigdrizzle13' and it outputs those numbers.

+2  A: 

You can use Beautiful Soup to parse the HTML.

tom10
So is that just a module that needs importing?
Alex
You might have to download it first, if it's not on your computer already. But once you do, yes, it's just a module that you import.
David Zaslavsky
@Alex, it's a 3rd party module, meaning it's not automatically included in your Python installation. Follow the BeautifulSoup link above to download.
Triptych
+6  A: 

As another poster mentioned, BeautifulSoup is a wonderful tool for this job.

Here's the entire, ostentatiously-commented program. It could use a lot of error tolerance, but as long as you enter a valid username, it will pull all the scores from the corresponding web page.

I tried to comment as well as I could. If you're fresh to BeautifulSoup, I highly recommend working through my example with the BeautifulSoup documentation handy.

The whole program...

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import sys

URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]

# Grab page html, create BeautifulSoup object
html = urlopen(URL).read()
soup = BeautifulSoup(html)

# Grab the <table id="mini_player"> element
scores = soup.find('table', {'id':'mini_player'})

# Get a list of all the <tr>s in the table, skip the header row
rows = scores.findAll('tr')[1:]

# Helper function to return concatenation of all character data in an element
def parse_string(el):
   text = ''.join(el.findAll(text=True))
   return text.strip()

for row in rows:

   # Get all the text from the <td>s
   data = map(parse_string, row.findAll('td'))

   # Skip the first td, which is an image
   data = data[1:]

   # Do something with the data...
   print data

And here's a test run.

> test.py bigdrizzle13
[u'Overall', u'87,417', u'1,784', u'78,772,017']
[u'Attack', u'140,903', u'88', u'4,509,031']
[u'Defence', u'123,057', u'85', u'3,449,751']
[u'Strength', u'325,883', u'84', u'3,057,628']
[u'Hitpoints', u'245,982', u'85', u'3,571,420']
[u'Ranged', u'583,645', u'71', u'856,428']
[u'Prayer', u'227,853', u'62', u'357,847']
[u'Magic', u'368,201', u'75', u'1,264,042']
[u'Cooking', u'34,754', u'99', u'13,192,745']
[u'Woodcutting', u'50,080', u'93', u'7,751,265']
[u'Fletching', u'53,269', u'99', u'13,051,939']
[u'Fishing', u'5,195', u'99', u'14,512,569']
[u'Firemaking', u'46,398', u'88', u'4,677,933']
[u'Crafting', u'328,268', u'62', u'343,143']
[u'Smithing', u'39,898', u'77', u'1,561,493']
[u'Mining', u'31,584', u'85', u'3,331,051']
[u'Herblore', u'247,149', u'52', u'135,215']
[u'Agility', u'225,869', u'60', u'276,753']
[u'Thieving', u'292,638', u'56', u'193,037']
[u'Slayer', u'113,245', u'73', u'998,607']
[u'Farming', u'204,608', u'51', u'115,507']
[u'Runecraft', u'38,369', u'71', u'880,789']
[u'Hunter', u'384,920', u'53', u'139,030']
[u'Construction', u'232,379', u'52', u'125,708']
[u'Summoning', u'87,236', u'64', u'419,086']
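
Note that every cell comes back as a string with thousands separators, so a little more work is needed if you want actual numbers. Here is a minimal sketch of that step. The helper name to_int and the column names rank, level and xp are my own labels for illustration (the program above does not name the columns), so adjust them to whatever the page headers say.

```python
def to_int(text):
    """Convert a string such as '78,772,017' to the integer 78772017."""
    return int(text.replace(',', ''))

# One row, as produced by the loop above
row = ['Overall', '87,417', '1,784', '78,772,017']

skill = row[0]
# Assumed column order: rank, level, experience
rank, level, xp = [to_int(cell) for cell in row[1:]]

print("%s %d %d %d" % (skill, rank, level, xp))
```

Inside the loop, you would apply the same conversion to data instead of the hard-coded row.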

Voila :)

Triptych
Bear with me here. So the html of those scores is given in a table? And soup.find('table'... can draw out certain values in that table? And how do you know to look for 'mini_player' / what is 'mini_player'? Thanks in advance.
Alex
Very impressive. I do not understand most of the code yet. Working at it though.
Alex
mini_player is the ID attribute of the HTML table element that holds the scores you wanted from that page. I knew to look for "mini_player" because I looked at the HTML source of the page, and that was the id of the table that held the scores.
Triptych
And soup.find() has a mode that can draw certain rows or columns from tables? And what is 'tr'?
Alex
Yes, Alex, find() and findAll can select one or all children of a given element matching particular criteria. In this case, the root element is "table", and the children are first rows ("tr" = "table row") then cells ("td" = "table data"). Nice work, Triptych.
Matthew Flaschen
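To make the table / tr / td nesting concrete, here is a rough sketch that walks the same structure without BeautifulSoup, using only the standard library's html.parser module (Python 3). It is not how find() and findAll() work internally; the TableRows class and the SAMPLE markup are made up for illustration, and the real page is certainly more complicated than this.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>, inside one table."""

    def __init__(self, table_id):
        HTMLParser.__init__(self)
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table' and dict(attrs).get('id') == self.table_id:
            self.in_table = True
        elif self.in_table and tag == 'tr':
            self.rows.append([])          # a new table row starts
        elif self.in_table and tag == 'td':
            self.in_cell = True
            self.rows[-1].append('')      # a new cell in the current row

    def handle_endtag(self, tag):
        if tag == 'table':
            self.in_table = False
        elif tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()

# Hypothetical, simplified markup standing in for the real page
SAMPLE = """
<table id="mini_player">
  <tr><th>Skill</th><th>Level</th></tr>
  <tr><td>Attack</td><td>88</td></tr>
  <tr><td>Defence</td><td>85</td></tr>
</table>
"""

parser = TableRows('mini_player')
parser.feed(SAMPLE)

# The header row contains only <th> cells, so it ends up empty; drop it
rows = [row for row in parser.rows if row]
print(rows)
```

BeautifulSoup does this bookkeeping for you, which is why the answer above needs only a find() and a findAll() call.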
Thanks for all the help, Triptych and Matt. It's amazing how hard it is to learn this from reading tutorials, yet how easy it is when I just ask questions on here! THANKS!!!
Alex