ansaurus

Question

BeautifulSoup 3.1 parser breaks far too easily

Answer 1

+2 A:

Your problem must be something else; it works fine for me:

In [1]: import BeautifulSoup

In [2]: c = """<HTML>
   ...:     <HEAD>
   ...:         <TITLE>Title</TITLE>
   ...:         <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
   ...:     </HEAD>
   ...:     <BODY>
   ...:         ...
   ...:         ...
   ...:     </BODY>
   ...: </HTML>
   ...: """

In [3]: print BeautifulSoup.BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
  <http-equiv>
  </http-equiv>
 </head>
 <body>
  ...
        ...
 </body>
</html>


In [4]:

This is Python 2.5.2 with BeautifulSoup 3.0.7a — maybe it's different in older/newer versions? This is exactly the kind of soup BeautifulSoup handles so beautifully, so I doubt it's been changed at some point… Is there something else to the structure that you haven't mentioned in the problem?

ShreevatsaR 2009-01-19 23:40:08

I've got Python 2.5.1 and BeautifulSoup 3.1.0.1. The original broken structure was different, but the problem also occurs with the simplified structure in the question. I have just run the code in your example and get the same problem as before: nothing after </title>. Now I'm really confused!

Mat 2009-01-19 23:49:26

One possibility is that BeautifulSoup broke something when updating... did you try with the text copied exactly from your question here?

ShreevatsaR 2009-01-19 23:51:54

http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html BeautifulSoup 3.1 is based on HTMLParser rather than SGMLParser (as the latter is gone in Python 3.0), which *might* be the problem here. That's sad...

ShreevatsaR 2009-01-19 23:54:32
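
One way to check how an HTMLParser-style tokenizer copes with the malformed tag is to feed it the snippet and record which start tags it reports. The sketch below assumes Python 3 and its standard-library html.parser module, which is not the same code as the Python 2 HTMLParser that BeautifulSoup 3.1 sits on, so it may well behave differently from what is described in this thread. The TagRecorder class is a hypothetical helper written for this illustration.

```python
from html.parser import HTMLParser  # Python 3 stdlib; an assumption, the thread uses Python 2.5


class TagRecorder(HTMLParser):
    """Hypothetical helper: collects the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from the parser.
        self.start_tags.append(tag)


# The simplified markup from the question, including the malformed tag.
c = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>"""

recorder = TagRecorder()
recorder.feed(c)
recorder.close()

# If "body" shows up here, the tokenizer got past the malformed tag.
print(recorder.start_tags)
```

If the list stops at the title tag, the tokenizer is the part that gives up on the malformed tag; if body is present, the breakage lies elsewhere.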

Yes, I tried with the exact text in the question, and I've just copied and pasted it again to be sure. Sounds like a pain that the parser has changed. Perhaps I should drop in a quick regular expression to zap the borked HTML. It's not like I'm going to come across anything similar elsewhere.

Mat 2009-01-20 00:11:47
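
The regular-expression workaround mentioned above could look roughly like this. It is a minimal sketch, assuming the only breakage is a "tag" whose name is immediately followed by an equals sign, as in the question's example. The pattern and the strip_malformed_tags function are made up for this illustration and are not part of BeautifulSoup.

```python
import re

# Matches a pseudo-tag whose name runs straight into '=' with no space,
# e.g. <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">. Ordinary tags such as
# <TITLE> or <a href="x"> do not match, because their name is followed
# by '>' or whitespace rather than '='.
MALFORMED_TAG = re.compile(r'<[A-Za-z][\w-]*=[^>]*>')


def strip_malformed_tags(html):
    """Hypothetical helper: remove the malformed pseudo-tags before parsing."""
    return MALFORMED_TAG.sub('', html)


c = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>"""

cleaned = strip_malformed_tags(c)
print(cleaned)
```

The cleaned string can then be handed to whichever parser is in use. This only targets the one pattern seen here; other kinds of broken markup would need their own handling.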

Confirming ShreevatsaR's results with BeautifulSoup 3.0.7a...

John Fouhy 2009-01-20 01:23:52

Answer 2

+6 A:

The page "Having problems with Beautiful Soup 3.1.0?" recommends using html5lib's parser as one of the workarounds.

#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders

parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>"""

soup = parser.parse(c)
print soup.prettify()

Output:

<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  <http-equiv="pragma" content="NO-CACHE">
   ...
        ...
  </http-equiv="pragma">
 </body>
</html>

The output shows that html5lib hasn't fixed the problem in this case though.

J.F. Sebastian 2009-03-12 13:20:25

Answer 3

+2 A:

Try lxml (and its html module). Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere else that only allows pure-Python code.

Wahnfrieden 2009-08-03 15:40:41
