I was having trouble parsing some dodgy HTML with BeautifulSoup. Turns out that the HTMLParser used in newer versions is less tolerant than the SGMLParser used previously.
Does BeautifulSoup have some kind of debug mode? I'm trying to figure out how to stop it borking on some nasty HTML I'm loading from a crabby website:
<HTML>
<HEAD>
<TITLE>Title</TITLE>
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
</HEAD>
<BODY>
...
...
</BODY>
</HTML>
BeautifulSoup gives up after the <HTTP-EQUIV...>
tag
In [1]: print BeautifulSoup(c).prettify()
<html>
<head>
<title>
Title
</title>
</head>
</html>
The problem is clearly the HTTP-EQUIV tag, which is really a very malformed <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
tag. Evidently, I need to specify this as self-closing, but no matter what I specify I can't fix it:
In [2]: print BeautifulSoup(c,selfClosingTags=['http-equiv',
'http-equiv="pragma"']).prettify()
<html>
<head>
<title>
Title
</title>
</head>
</html>
Is there a verbose debug mode in which BeautifulSoup will tell me what it is doing, so I can figure out what it is treating as the tag name in this case?