views:

406

answers:

2

I have two machines each, to the best of my knowledge, running python 2.5 and BeautifulSoup 3.1.0.1.

I'm trying to scrape http://utahcritseries.com/RawResults.aspx, using:

from BeautifulSoup import BeautifulSoup
import urllib2

base_url = "http://www.utahcritseries.com/RawResults.aspx"

data=urllib2.urlopen(base_url)
soup=BeautifulSoup(data)
i = 0
table=soup.find("table",id='ctl00_ContentPlaceHolder1_gridEvents')
#table=soup.table
print "begin table"
for row in table.findAll('tr')[1:10]:
    i=i + 1
    col = row.findAll('td')
    date = col[0].string
    event = col[1].a.string
    confirmed = col[2].string
    print '%s - %s' % (date, event)
print "end table"
print "%s rows processed" % i

On my windows machine,I get the correct result, which is a list of dates and event names. On my mac, I don't. instead, I get

3/2/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
3/23/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
4/2/2002 - Rocky Mtn Raceway Criterium
None - Saltair Time Trial
4/9/2002 - Rocky Mtn Raceway Criterium
None - DMV Criterium
4/16/2002 - Rocky Mtn Raceway Criterium

What I'm noticing is that when I

print row

on my windows machine, the tr data looks exactly the same as the source html. Note the style tag on the second table row. Here's the first two rows:

<tr>
<td>
 3/2/2002
</td>
<td>
 <a href="Event.aspx?id=226">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=226">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

<tr style="color:#333333;background-color:#EFEFEF;">
<td>
 3/16/2002
</td>
<td>
 <a href="Event.aspx?id=227">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=227">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

On my mac when I print the first two rows, the style information is removed from the tr tag and it's moved into each td field. I don't understand why this is happening. I'm getting None for every other date value, because BeautifulSoup is putting a font tag around every other date. Here's the mac's output:

<tr>
<td>
 3/2/2002
</td>
<td>
 <a href="Event.aspx?id=226">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=226">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

<tr bgcolor="#EFEFEF">
<td>
 <font color="#333333">
  3/16/2002
 </font>
</td>
<td>
 <font color="#333333">
  <a href="Event.aspx?id=227">
   Rocky Mtn Raceway Criterium
  </a>
 </font>
</td>
<td>
 <font color="#333333">
  Confirmed
 </font>
</td>
<td>
 <font color="#333333">
  <a href="Event.aspx?id=227">
   Points
  </a>
 </font>
</td>
<td>
 <font color="#333333">
  <a disabled="disabled">
   Results
  </a>
 </font>
</td>
</tr>

My script is displaying the correct result under windows-what do I need to do in order to get my Mac to work correctly?

+2  A: 

There are documented problems with version 3.1 of BeautifulSoup.

You might want to double check that is the version you in fact are using, and if so downgrade.

Paolo Bergantino
+1  A: 

I suspect the problem is in the urlib2 request, not BeautifulSoup:

It might help if you show us the same section of the raw data as returned by this command on both machines:

urllib2.urlopen(base_url)

This page looks like it might help: http://bytes.com/groups/python/635923-building-browser-like-get-request

The simplest solution is probably just to detect which environment the script is running in and change the parsing logic accordingly.

>>> import os
>>> os.uname() 
('Darwin', 'skom.local', '9.6.0', 'Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386', 'i386')

Or get microsoft to use web standards :)

Also, didn't you use mechanize to fetch the pages? If so, the problem may be there.

Pete Skomoroch