views: 1733
answers: 7

I am trying to parse an HTML page with BeautifulSoup, but it appears that BeautifulSoup doesn't like that page's HTML at all. When I run the code below, prettify() returns only the script block of the page (see below). Does anybody have an idea why this happens?

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = "".join(urllib2.urlopen(url).readlines())
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()

This is the output produced by BeautifulSoup:

-- BeautifulSoup ---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script language="JavaScript">
 <!--
     function highlight(img) {
       document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
     }

     function unhighlight(img) {
       document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
     }
//-->
</script>

Thanks!

UPDATE: I am using the following version, which appears to be the latest.

__author__ = "Leonard Richardson ([email protected])"
__version__ = "3.1.0.1"
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
__license__ = "New-style BSD"
+2  A: 

BeautifulSoup isn't magic: if the incoming HTML is too horrible then it isn't going to work.

In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance it contains markup like:

SCRIPT type=""javascript""

(Notice the double quoting.)

The BeautifulSoup docs contain a section on what you can do if BeautifulSoup can't parse your markup. You'll need to investigate those alternatives.
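
For instance, a rough sketch of one such workaround, assuming the doubled-quote markup above is the main offender: pre-clean the HTML with a regular expression before handing it to BeautifulSoup.

import re
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = urllib2.urlopen(url).read()
# Collapse attribute values like type=""javascript"" into type="javascript".
html = re.sub(r'=""([^"]*)""', r'="\1"', html)
soup = BeautifulSoup(html)
print soup.prettify()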

Justus
A: 

I tested this script on BeautifulSoup version '3.0.7a' and it returns what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1' but give it a try.

Łukasz
+3  A: 

Try with version 3.0.7a as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so they had to change the parser from SGMLParser to HTMLParser, which seems more vulnerable to bad HTML.

From the changelog for BeautifulSoup 3.1:

"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"

miles82
Some more info about this here: http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
FeatureCreep
A: 
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()

In my case, executing the above statements returns the entire HTML page.

aatifh
Please give a reason before voting anyone down; that would be more ethical. Oh, and if you didn't understand my answer, then may God help you.
aatifh
A: 

I had problems parsing the following code too:

<script>
        function show_ads() {
          document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'&gt;&lt;/scr"+"ipt&gt;&lt;/div&gt;");
        }
</script>

HTMLParseError: bad end tag: u'</scr"+"ipt>', at line 26, column 127

Sam

+1  A: 

Try lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup, so it might work better for you. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
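
A minimal sketch of what that can look like, assuming lxml is installed (the URL is the one from the question):

import urllib2
import lxml.html

url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = urllib2.urlopen(url).read()
doc = lxml.html.fromstring(html)  # lxml's own parser copes with the broken markup
print lxml.html.tostring(doc, pretty_print=True)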

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere else where anything that isn't pure Python isn't allowed.

Wahnfrieden
A: 

Samj: If I get things like HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>", I just remove the culprit from the markup before I serve it to BeautifulSoup and all is dandy:

import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>", "")  # strip the fragment that trips up HTMLParser
soup = BeautifulSoup(html)
Frank Malina