ansaurus

Question

Answer 1

+2 A:

BeautifulSoup isn't magic: if the incoming HTML is too horrible then it isn't going to work.

In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance it contains markup like:

<SCRIPT type=""javascript"">

(Notice the double quoting.)

The BeautifulSoup docs contain a section on what you can do if BeautifulSoup can't parse your markup. You'll need to investigate those alternatives.
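One possible workaround for the double quoting specifically is to clean the markup up before handing it to the parser. The sketch below is only an illustration: the helper name collapse_doubled_quotes is made up, and it assumes the doubled quotes only ever wrap a single-word attribute value such as ""javascript"".

```python
import re

# Matches an attribute value wrapped in doubled quotes, e.g. type=""javascript"".
# Assumption: the value is one word (no spaces, quotes, '=' or angle brackets),
# so legitimately empty attributes such as alt="" are left alone.
DOUBLED_QUOTES = re.compile(r'=""([^"<>\s=]+)""')


def collapse_doubled_quotes(markup):
    """Return markup with ""value"" rewritten as "value" (hypothetical helper)."""
    return DOUBLED_QUOTES.sub(r'="\1"', markup)


print(collapse_doubled_quotes('<SCRIPT type=""javascript"">'))
```

The cleaned string can then be passed to BeautifulSoup as usual. Other kinds of breakage on the page would still need their own fixes.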

Justus 2009-03-02 04:09:28

Answer 2

A:

I tested this script on BeautifulSoup version '3.0.7a' and it returns what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1' but give it a try.

Łukasz 2009-03-02 08:31:44

Answer 3

+3 A:

Try version 3.0.7a, as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so the parser had to change from SGMLParser to HTMLParser, which seems to cope less well with bad HTML.

From the changelog for BeautifulSoup 3.1:

"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"
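If you are not sure which series you have installed, a rough check on the version string can tell you. This is only a sketch: the helper names are made up, and it assumes the version strings look like the ones quoted in this thread ('3.0.7a', '3.1.0.1'), for example as read from the module's __version__ attribute.

```python
import re


def version_tuple(version):
    """Turn a version string such as '3.0.7a' into a tuple of ints, here (3, 0, 7)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def is_htmlparser_series(version):
    """True for 3.1.x, the series the changelog describes as HTMLParser-based."""
    return version_tuple(version)[:2] == (3, 1)


for candidate in ("3.0.7a", "3.1.0.1"):
    print(candidate, is_htmlparser_series(candidate))
```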

miles82 2009-03-02 09:16:27

Some more info about this here: http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

FeatureCreep 2009-11-21 19:13:21

Answer 4

A:

>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()

In my case, executing the statements above returns the entire HTML page.

aatifh 2009-03-06 07:31:58

Please give a reason before voting anyone down; that would be the ethical thing to do. And if you didn't understand my answer, then may God help you.

aatifh 2009-03-09 07:02:35

Answer 5

A:

I had problems parsing the following code too:

<script>
        function show_ads() {
          document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
        }
</script>

HTMLParseError: bad end tag: u'', at line 26, column 127

Sam 2009-04-20 11:39:53

Answer 6

+1 A:

Try lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup, so it might work better for you. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
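For a rough idea of what that looks like, here is a minimal sketch using lxml.html.fromstring on some deliberately sloppy markup. It assumes lxml is installed (it is a third-party package); the function name visible_text and the sample markup are made up for illustration.

```python
try:
    import lxml.html  # third-party package, assumed to be installed
except ImportError:
    lxml = None  # lets the module load even where lxml is unavailable


def visible_text(markup):
    """Parse possibly broken HTML with lxml and return its text content."""
    if lxml is None:
        raise RuntimeError("lxml is not installed")
    doc = lxml.html.fromstring(markup)
    return doc.text_content()


if lxml is not None:
    # Unclosed <p> and <b> tags; lxml recovers instead of raising.
    print(visible_text("<div><p>first <b>bold<p>second</div>"))
```

How the recovered tree is shaped depends on the underlying parser, so inspect the result before relying on a particular structure.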

Wahnfrieden 2009-08-03 15:39:32

Answer 7

A:

Samj: If I get things like HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>" I just remove the culprit from markup before I serve it to BeautifulSoup and all is dandy:

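import urllib2
from BeautifulSoup import BeautifulSoup
# url is the page to fetch; drop the split closing tag that trips up the parser.
html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>","")
soup = BeautifulSoup(html)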
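If you don't need the scripts at all, a more general variant of the same idea is to strip every script block before parsing. This is a sketch of that swapped-in approach rather than the replace() call above; the helper name strip_scripts is made up, and a regex like this is a blunt tool that assumes each block ends with a literal closing script tag.

```python
import re

# Non-greedy match from an opening <script ...> tag to the first literal
# </script>. A closing tag that is split up inside a JavaScript string, such
# as </scr"+"ipt>, is not a literal </script>, so it does not end the match.
SCRIPT_BLOCK = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(markup):
    """Return markup with all <script>...</script> blocks removed (hypothetical helper)."""
    return SCRIPT_BLOCK.sub("", markup)


sample = '<p>a</p><script>document.write("<sc"+"ript></scr"+"ipt>");</script><p>b</p>'
print(strip_scripts(sample))
```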

Frank Malina 2010-07-13 20:00:35

Issues with BeautifulSoup parsing
