I thought BeautifulSoup would be able to handle malformed documents, but when I fed it the source of a page, it printed the following traceback:


Traceback (most recent call last):
  File "mx.py", line 7, in 
    s = BeautifulSoup(content)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1230, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1263, in _feed
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"", at line 258, column 34

Shouldn't it be able to handle this sort of thing? If it can, how do I do it? If not, is there a module that can handle malformed documents?

EDIT: here's an update. I saved the page locally using Firefox and tried to create a soup object from the contents of the file. That's where BeautifulSoup fails. If I create a soup object directly from the website, it works. Here's the document that causes trouble for soup.

A: 

In my experience, BeautifulSoup isn't that fault-tolerant. I had to use it once for a small script and ran into these same problems. I think using a regular expression to strip out the tags helped a bit, but I eventually just gave up and moved the script over to Ruby and Nokogiri.

Nokogiri can handle malformed documents?
Geo
Yeah, it'll take just about whatever you throw at it.
Use lxml; it's far superior to BeautifulSoup and much faster. It handles broken HTML.
Wahnfrieden
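
A minimal sketch of the lxml route that last comment suggests (lxml.html repairs broken markup as it builds the tree, so malformed pages parse without raising):

import lxml.html

# Deliberately malformed markup: unclosed <p> and <div> tags.
broken = "<html><body><p>unclosed paragraph<div>stray</body>"
doc = lxml.html.fromstring(broken)
print(doc.text_content())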
+4  A: 

Worked fine for me using BeautifulSoup version 3.0.7. The latest is 3.1.0, but there's a note on the BeautifulSoup home page to try 3.0.7a if you're having trouble. I think I ran into a problem similar to yours some time ago, and reverting fixed it; I'd try that.

If you want to stick with your current version, I suggest removing the large <script> block at the top, since that is where the error occurs, and since you cannot parse that section with BeautifulSoup anyway.
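
For illustration, a crude sketch of that approach, assuming content holds the page source as in the question's snippet (the regex is simple-minded, but enough to drop the block before parsing):

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

def strip_scripts(html):
    # (?is): case-insensitive, and '.' matches newlines so the
    # pattern can span the multi-line <script> block at the top.
    return re.sub(r"(?is)<script.*?</script>", "", html)

soup = BeautifulSoup(strip_scripts(content))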

Triptych
Seconding. The problem is with the regular expression on that line you're talking about. It should've been written with "&lt;" instead of "<" and "&gt;" instead of ">" (or, better yet, put into its own .js file), but it wasn't. It seems that HTMLParser can't handle it whereas the SGMLParser in BeautifulSoup 3.0.7 could.
Hao Lian
+1  A: 

The problem appears to be the
contents = contents.replace(/</g, '&lt;');
on line 258, plus the similar
contents = contents.replace(/>/g, '&gt;');
on the next line.

I'd just use re.sub to clobber all occurrences of r"replace(/[<>]/" with something innocuous before feeding it to BeautifulSoup ... moving away from BeautifulSoup would be like throwing out the baby with the bathwater IMHO.
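
A sketch of that idea, assuming content holds the page source as in the question; note the "(" has to be escaped in the actual pattern, and "x" is just an arbitrary innocuous stand-in:

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

# Rewrite replace(/</ and replace(/>/ so HTMLParser never sees the
# bare '<' that it mistakes for the start of an end tag.
cleaned = re.sub(r"replace\(/[<>]/", "replace(/x/", content)
soup = BeautifulSoup(cleaned)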

John Machin