I'm trying to scrape information from http://www.nfl.com/scores (in particular, find out when a game is over so my computer can stop recording it). I can download HTML easily enough, and it makes this claim about compliance with standards:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

But

  1. An attempt to parse it with Expat produces the error "not well-formed (invalid token)".

  2. The W3C's online validation service reports 399 Errors and 121 warnings.

  3. I tried to run HTML Tidy (just called tidy) on my Linux system with the -xml option, but tidy reports 56 warnings and 117 errors and is unable to recover a good XML file. The errors look like this:

    line 409 column 122 - Warning: unescaped & or unknown entity "&role"
    ...
    line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq"
    ...
    line 1208 column 65 - Error: unexpected </td> in <br>
    line 1209 column 57 - Error: unexpected </tr> in <br>
    line 1210 column 49 - Error: unexpected </table> in <br>
    

    But when I check the input, the "unknown entities" appear to be part of a properly quoted URL, so I don't know if a double quote is missing somewhere or what.

I know that there is something out there that can parse this stuff because both Firefox and w3m display something reasonable. What tool will fix the noncompliant HTML so that I can parse it with Expat?

+4  A: 

They're using some kind of JavaScript on the score boxes, so you're going to have to play more clever tricks (line breaks mine):

/* box of awesome */
// iscurrentweek ? true;
(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));

However, to answer your question, BeautifulSoup parses it (seemingly) fine:

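from urllib2 import urlopen  # Python 2; in Python 3 use urllib.request.urlopen
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; version 4 is imported from bs4

# read() with no argument returns the whole response body
fp = urlopen("http://www.nfl.com/scores")
data = fp.read()
fp.close()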

soup = BeautifulSoup(data)
print soup.contents[2].contents[1].contents[1]

Outputs:

<title>NFL Scores: 2009 - Week 12</title>
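To see why Expat gives up where a tolerant parser does not, here is a minimal sketch in Python 3 using only the standard library. The markup is a made-up fragment that imitates the bare `&role` ampersand from the tidy warnings (it is not taken from the NFL page), and `html.parser` merely stands in for BeautifulSoup as an example of a forgiving parser.

```python
from html.parser import HTMLParser
from xml.parsers import expat

# Made-up fragment: the href contains a bare "&", like the "&role" tidy complained about.
BAD = '<p><a href="/team?id=NYJ&role=home">Jets</a></p>'
# The same fragment with the ampersand escaped, as XML requires.
GOOD = BAD.replace("&", "&amp;")


def expat_error(markup):
    """Return Expat's error message for markup, or None if it is well-formed."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(markup, True)
    except expat.ExpatError as err:
        return str(err)
    return None


class LinkCollector(HTMLParser):
    """Tolerant parser: collects href values without requiring well-formed XML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed(BAD)
collector.close()

print(expat_error(BAD))   # an error message: the bare "&" is rejected
print(expat_error(GOOD))  # None: the escaped version is well-formed
print(collector.hrefs)    # the tolerant parser still recovers the link
```

So the URL can be properly quoted and still break an XML parser: inside XML, every literal `&` has to be written as `&amp;`, even within an attribute value.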

It might be easier to scrape Yahoo's NFL scoreboard, in my opinion... in fact, I'm off to try it.


EDIT: Used your question as an excuse to get around to learning BeautifulSoup. Alex Martelli has been singing its praises, so I figured it worth a try -- man, am I impressed.

Anyway, I was able to cook up a rudimentary score scraper from the Yahoo! scoreboard, like so:

def main():
    # YAHOO_SCOREBOARD is assumed to hold the scoreboard page's HTML, fetched as above
    soup = BeautifulSoup(YAHOO_SCOREBOARD)
    on_first_team = True
    scores = []
    hold = None

    # Iterate the tr that contains a team's box score
    for item in soup(name="tr", attrs={"align": "center", "class": "ysptblclbg5"}):
        # Easy
        team = item.b.a.string

        # Get the box scores since we're industrious
        boxscore = []
        for quarter in item(name="td", attrs={"class": "yspscores"}):
            boxscore.append(int(quarter.string))

        # Final score
        sub = item(name="span", attrs={"class": "yspscores"})[0]
        if sub.b:
            # Winning score
            final = int(sub.b.string)
        else:
            data = sub.string.replace("&nbsp;", "")
            if ":" in data:
                # Catch TV: XXX and 0:00pm ET
                final = None
            else:
                try:
                    final = int(data)
                except ValueError:
                    # Anything that is not a plain number means no final score yet
                    final = None

        if on_first_team:
            hold = { team : (boxscore, final) }
            on_first_team = False
        else:
            hold[team] = (boxscore, final)
            scores.append(hold)
            on_first_team = True

    for game in scores:
        print "--- Game ---"
        for team in game:
            print team, game[team]

I would tweak this on Sunday to see how it operates, as it's really rough. Here's what it outputs as of right now:

--- Game ---
Green Bay ([0, 13, 14, 7], 34)
Detroit ([7, 0, 0, 5], 12)
--- Game ---
Oakland ([0, 0, 7, 0], 7)
Dallas ([3, 14, 0, 7], 24)

Look at that, I snagged box scores too... for a game that hasn't happened yet, we get:

--- Game ---
Washington ([], None)
Philadelphia ([], None)
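The on_first_team/hold bookkeeping above is just pairing consecutive rows into games. A minimal sketch of that step on its own, in Python 3 and with the scraping left out, might look like this. The helper name `pair_rows` is hypothetical, and the rows are copied from the sample output above.

```python
def pair_rows(rows):
    """Group a flat list of (team, boxscore, final) rows into two-team games."""
    it = iter(rows)
    # zip(it, it) pulls two consecutive items from the same iterator per step,
    # so rows 0 and 1 form the first game, rows 2 and 3 the second, and so on.
    return [
        {first[0]: first[1:], second[0]: second[1:]}
        for first, second in zip(it, it)
    ]


rows = [
    ("Green Bay", [0, 13, 14, 7], 34),
    ("Detroit", [7, 0, 0, 5], 12),
    ("Washington", [], None),
    ("Philadelphia", [], None),
]

games = pair_rows(rows)
for game in games:
    print("--- Game ---")
    for team, result in game.items():
        print(team, result)
```

Note that zip silently drops a trailing unpaired row, so an odd number of rows would lose the last team rather than raise an error.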

Anyway, a peg for you to jump from. Good luck.

Jed Smith
BeautifulSoup looks awesome! +1
Norman Ramsey
Oh, yes, the soup. It's also good.
bmargulies
I checked it out and BeautifulSoup cleans up the HTML almost completely, but the XML it spits out still contains 5 errors. (This is with output using the `prettify` method.) I'm a little reluctant to get too deep into the soup since the rest of my infrastructure is in Lua, so I'm probably going to try the XML feed first. But this is still a great thing to know about.
Norman Ramsey
+3  A: 

There's a Flash-based auto-updating scoreboard thing at the top of nfl.com. Some monitoring of its network traffic finds:

http://www.nfl.com/liveupdate/scorestrip/ss.xml

That will probably be a bit easier to parse than the HTML scoreboard.
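A minimal sketch of how such a feed could be checked for finished games, in Python 3 with the standard library. Everything about the document shape here is an assumption: the `g` element, the `eid` and `q` attributes, and the "F"/"FO" state codes are guesses that should be checked against the real feed, and the sample below is made up (the first event ID is invented; the second is borrowed from the JavaScript snippet quoted in the other answer).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed's element and attribute names may differ.
SAMPLE = """<ss>
  <gms>
    <g eid="2009112600" q="F" h="DET" hs="12" v="GB" vs="34"/>
    <g eid="2009112905" q="P" h="NYJ" hs="0" v="CAR" vs="0"/>
  </gms>
</ss>"""

# Assumed state codes meaning "final" and "final, overtime".
FINAL_STATES = {"F", "FO"}


def finished_games(xml_text):
    """Return the event IDs of all games whose state is one of FINAL_STATES."""
    root = ET.fromstring(xml_text)
    return [g.get("eid") for g in root.iter("g") if g.get("q") in FINAL_STATES]


print(finished_games(SAMPLE))
```

A recorder could poll the feed every minute or so and stop once the game it cares about shows up in that list.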

rtucker
Very clever, and easier to parse indeed. I wonder if that'll lead to an arms race, though.
Jed Smith
This does not really answer the question as posed, but it was so helpful for my real problem that I have marked it as the accepted answer. Thanks!!!!!!
Norman Ramsey
+2  A: 

Look into TagSoup. If you want to end up with a DOM tree or a SAX stream in Java, it's the ticket. If you just want to extract specific information, Beautiful Soup is a Beautiful Thing.

bmargulies
Looks really useful, although the Debian package won't run. Grrrr. +1, thanks.
Norman Ramsey
As a rule of thumb, I don't ever use Java via Debian.
bmargulies