ansaurus

Question

How to fix noncompliant HTML so Expat will parse it (htmltidy not working)

Answer 1

+4 A:

They're using some kind of JavaScript to build the score boxes, so you're going to have to play cleverer tricks (line breaks mine):

/* box of awesome */
// iscurrentweek ? true;
(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));
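One such trick, as a rough sketch: since the game data sits in the constructor call itself, a regular expression can pull it out of the script text without parsing the HTML at all. This is Python 3 with only the standard library. The function name `extract_games` and the `"args"` key are made up for illustration, and the meaning of the two leading string arguments is not stated anywhere in the page, so they are kept as an opaque pair.

```python
import re

# Sample taken from the snippet above, joined back onto one line.
SCRIPT = (
    "(new nfl.scores.Game('2009112905','54635',{state:'pre',"
    "container:'scorebox-2009112905',wrapper:'sb-wrapper-2009112905',"
    "template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',"
    "awayabbr:'CAR'}));"
)

# Matches the constructor call and captures its two leading string arguments
# plus everything between the braces of the options object.
GAME_RE = re.compile(
    r"new\s+nfl\.scores\.Game\(\s*'([^']*)'\s*,\s*'([^']*)'\s*,\s*\{(.*?)\}\s*\)",
    re.DOTALL,
)

# Matches simple key:'value' pairs; non-string values such as the template
# expression are skipped.
PAIR_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def extract_games(script_text):
    """Return one dict per nfl.scores.Game(...) call found in script_text."""
    games = []
    for first_arg, second_arg, options in GAME_RE.findall(script_text):
        game = dict(PAIR_RE.findall(options))
        game["args"] = (first_arg, second_arg)  # meaning unknown, kept as-is
        games.append(game)
    return games


games = extract_games(SCRIPT)
```

This only holds as long as the site keeps emitting the call in this exact shape, so treat it as fragile.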

However, to answer your question, BeautifulSoup parses it (seemingly) fine:

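from urllib2 import urlopen              # Python 2 standard library
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

fp = urlopen("http://www.nfl.com/scores")
try:
    data = fp.read()  # read() with no size argument reads to end of stream
finally:
    fp.close()

soup = BeautifulSoup(data)
# Brittle positional lookup; it depends on the page's exact layout
print soup.contents[2].contents[1].contents[1]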

Outputs:

<title>NFL Scores: 2009 - Week 12</title>
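For comparison, here is a small sketch that finds the title by tag name instead of by position, using Python 3's standard-library `html.parser`, which is also tolerant of unclosed and stray tags. It is not BeautifulSoup and not what the code above does; the class name `TitleGrabber`, the helper `find_title`, and the broken sample markup are all invented for the example.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def find_title(html):
    """Return the text of the first <title>, or None if there is none."""
    parser = TitleGrabber()
    parser.feed(html)
    parser.close()
    return parser.title


# Deliberately malformed sample: unclosed <head> and <p>, stray </b>.
BROKEN = (
    "<html><head><title>NFL Scores: 2009 - Week 12</title>"
    "<body><p>unclosed<br><td>stray</b>"
)
title = find_title(BROKEN)
```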

Might be easier to scrape Yahoo's NFL scoreboard, in my opinion...in fact, off to try it.

EDIT: Used your question as an excuse to get around to learning BeautifulSoup. Alex Martelli has been singing its praises, so I figured it was worth a try -- man, am I impressed.

Anyway, I was able to cook up a rudimentary score scraper from the Yahoo! scoreboard, like so:

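from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x, Python 2

# YAHOO_SCOREBOARD is not defined in this snippet; it is assumed to hold the
# HTML of the Yahoo! NFL scoreboard page, fetched beforehand.

def main():
    soup = BeautifulSoup(YAHOO_SCOREBOARD)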
    on_first_team = True
    scores = []
    hold = None

    # Iterate the tr that contains a team's box score
    for item in soup(name="tr", attrs={"align": "center", "class": "ysptblclbg5"}):
        # Easy
        team = item.b.a.string

        # Get the box scores since we're industrious
        boxscore = []
        for quarter in item(name="td", attrs={"class": "yspscores"}):
            boxscore.append(int(quarter.string))

        # Final score
        sub = item(name="span", attrs={"class": "yspscores"})[0]
        if sub.b:
            # Winning score
            final = int(sub.b.string)
        else:
            data = sub.string.replace("&nbsp;", "")
            if ":" in data:
                # Catch TV: XXX and 0:00pm ET
                final = None
            else:
                try:
                    final = int(data)
                except ValueError:
                    # Anything that is not a plain number
                    final = None

        if on_first_team:
            hold = { team : (boxscore, final) }
            on_first_team = False
        else:
            hold[team] = (boxscore, final)
            scores.append(hold)
            on_first_team = True

    for game in scores:
        print "--- Game ---"
        for team in game:
            print team, game[team]

I would tweak this on Sunday to see how it operates, as it's really rough. Here's what it outputs as of right now:

--- Game ---
Green Bay ([0, 13, 14, 7], 34)
Detroit ([7, 0, 0, 5], 12)
--- Game ---
Oakland ([0, 0, 7, 0], 7)
Dallas ([3, 14, 0, 7], 24)

Look at that, I snagged box scores too... for a game that hasn't happened yet, we get:

--- Game ---
Washington ([], None)
Philadelphia ([], None)
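The `on_first_team` toggle in the scraper is really just pairing up consecutive rows. As a sketch of that one step in isolation (Python 3, plain data, no scraping), something like the following would do the same grouping and also complain about an odd number of rows. The name `pair_rows` and the tuple layout are assumptions for the example; the sample rows are copied from the output above.

```python
def pair_rows(rows):
    """Group consecutive (team, boxscore, final) rows into two-team games."""
    if len(rows) % 2:
        raise ValueError("expected an even number of team rows")
    games = []
    for i in range(0, len(rows), 2):
        first, second = rows[i], rows[i + 1]
        # Same shape as the scraper's dicts: team -> (boxscore, final)
        games.append({first[0]: first[1:], second[0]: second[1:]})
    return games


rows = [
    ("Green Bay", [0, 13, 14, 7], 34),
    ("Detroit", [7, 0, 0, 5], 12),
    ("Washington", [], None),
    ("Philadelphia", [], None),
]
games = pair_rows(rows)
```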

Anyway, that's a jumping-off point for you. Good luck.

Jed Smith 2009-11-29 05:48:54

BeautifulSoup looks awesome! +1

Norman Ramsey 2009-11-30 00:09:44

Oh, yes, the soup. It's also good.

bmargulies 2009-11-30 02:08:13

I checked it out and BeautifulSoup cleans up the HTML almost completely, but the XML it spits out still contains 5 errors. (This is with output using the `prettify` method.) I'm a little reluctant to get too deep into the soup since the rest of my infrastructure is in Lua, so I'm probably going to try the XML feed first. But this is still a great thing to know about.

Norman Ramsey 2009-11-30 06:53:23
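For anyone wanting to check cleaned-up output against Expat before handing it to another tool, here is a minimal sketch using Python 3's `xml.parsers.expat` binding. Expat stops at the first well-formedness error, so this reports only one problem per run; the function name `first_xml_error` is made up for the example.

```python
import xml.parsers.expat


def first_xml_error(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        message = xml.parsers.expat.ErrorString(exc.code)
        return exc.lineno, exc.offset, message
    return None


ok = first_xml_error("<a><b/></a>")
mismatch = first_xml_error("<a>\n<b>\n</a>")
entity = first_xml_error("<p>&nbsp;</p>")
```

HTML-only entities such as `&nbsp;` are a common leftover: Expat knows nothing about them unless a DTD declares them, so they show up as undefined-entity errors even when the tags are balanced.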

Answer 2

+3 A:

There's a Flash-based auto-updating scoreboard thing at the top of nfl.com. Some monitoring of its network traffic finds:

http://www.nfl.com/liveupdate/scorestrip/ss.xml

That will probably be a bit easier to parse than the HTML scoreboard.
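A rough sketch of what parsing such a feed with Expat could look like, here through Python 3's `xml.parsers.expat`. The schema of ss.xml is not shown in this thread, so the sample document, the element name `game`, and every attribute name below are invented purely for illustration; check the real feed before relying on any of them.

```python
import xml.parsers.expat

# Hypothetical sample: NOT the real ss.xml layout. Element and attribute
# names are assumptions made up for this example.
SAMPLE = """<scores>
  <game home="NYJ" away="CAR" state="pre"/>
  <game home="AAA" away="BBB" state="pre"/>
</scores>
"""


def collect_elements(xml_text, element_name):
    """Return the attribute dicts of every element_name element."""
    found = []

    def start_element(name, attrs):
        if name == element_name:
            found.append(dict(attrs))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)  # raises ExpatError if not well-formed
    return found


feed_games = collect_elements(SAMPLE, "game")
```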

rtucker 2009-11-29 15:11:02

Very clever, and easier to parse indeed. I wonder if that'll lead to an arms race, though.

Jed Smith 2009-11-30 00:59:33

This does not really answer the question as posed, but it was so helpful for my real problem that I have marked it as the accepted answer. Thanks!!!!!!

Norman Ramsey 2009-12-19 00:30:59

Answer 3

+2 A:

Look into TagSoup. If you want to end up with a DOM tree or a SAX stream in Java, it's the ticket. If you just want to extract specific information, Beautiful Soup is a Beautiful Thing.

bmargulies 2009-11-29 20:11:09

Looks really useful, although the Debian package won't run. Grrrr. +1, thanks.

Norman Ramsey 2009-11-30 00:23:28

As a rule of thumb, I don't ever use Java via Debian.

bmargulies 2009-11-30 02:07:33
