I thought BeautifulSoup would be able to handle malformed documents, but when I fed it the source of a page, it printed the following traceback:


Traceback (most recent call last):
  File "mx.py", line 7, in 
    s = BeautifulSoup(content)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1230, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1263, in _feed
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"", at line 258, column 34

Shouldn't it be able to handle this sort of thing? If it can, how do I do it? If not, is there a module that can handle malformed documents?

EDIT: here's an update. I saved the page locally using Firefox and tried to create a soup object from the contents of the file. That's where BeautifulSoup fails. If I create a soup object directly from the website, it works. Here's the document that causes trouble for soup.

A: 

In my experience, BeautifulSoup isn't that fault-tolerant. I had to use it once for a small script and ran into these same problems. I think using a regular expression to strip out the tags helped a bit, but I eventually just gave up and moved the script over to Ruby and Nokogiri.

Nokogiri can handle malformed documents?
Geo
Yeah, it'll take just about whatever you throw at it.
Use lxml; it's far superior to BeautifulSoup and much faster. It handles broken HTML.
Wahnfrieden
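
A minimal sketch of the lxml route that last comment suggests (lxml.html repairs broken markup as it builds the tree, so malformed pages parse without raising):

import lxml.html

# Deliberately malformed markup: unclosed <p> and <div> tags.
broken = "<html><body><p>unclosed paragraph<div>stray</body>"
doc = lxml.html.fromstring(broken)
print(doc.text_content())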
+4  A: 

Worked fine for me using BeautifulSoup version 3.0.7. The latest is 3.1.0, but there's a note on the BeautifulSoup home page to try 3.0.7a if you're having trouble. I think I ran into a problem similar to yours some time ago, and reverting fixed it; I'd try that.

If you want to stick with your current version, I suggest removing the large <script> block at the top, since that is where the error occurs, and since you cannot parse that section with BeautifulSoup anyway.
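
For illustration, a crude sketch of that approach, assuming content holds the page source as in the question's snippet (the regex is simple-minded, but enough to drop the block before parsing):

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

def strip_scripts(html):
    # (?is): case-insensitive, and '.' matches newlines so the
    # pattern can span the multi-line <script> block at the top.
    return re.sub(r"(?is)<script.*?</script>", "", html)

soup = BeautifulSoup(strip_scripts(content))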

Triptych
Seconding. The problem is with the regular expression on that line you're talking about. It should've been written with "&lt;" instead of "<" and "&gt;" instead of ">" (or, better yet, put into its own .js file), but it wasn't. It seems that HTMLParser can't handle it whereas the SGMLParser in BeautifulSoup 3.0.7 could.
Hao Lian
+1  A: 

The problem appears to be the
contents = contents.replace(/</g, '&lt;');
on line 258, plus the similar
contents = contents.replace(/>/g, '&gt;');
on the next line.

I'd just use re.sub to clobber all occurrences of r"replace(/[<>]/" with something innocuous before feeding it to BeautifulSoup ... moving away from BeautifulSoup would be like throwing out the baby with the bathwater IMHO.
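
A sketch of that idea, assuming content holds the page source as in the question; note the "(" has to be escaped in the actual pattern, and "x" is just an arbitrary innocuous stand-in:

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

# Rewrite replace(/</ and replace(/>/ so HTMLParser never sees the
# bare '<' that it mistakes for the start of an end tag.
cleaned = re.sub(r"replace\(/[<>]/", "replace(/x/", content)
soup = BeautifulSoup(cleaned)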

John Machin