I have been trying to get BeautifulSoup (3.1.0.1) to parse an HTML page that has a lot of JavaScript that generates HTML inside script tags. One example fragment looks like this:

<html><head><body><div>
<script type='text/javascript'>

if(ii > 0) {
html += '<span id="hoverMenuPosSepId" class="hoverMenuPosSep">|</span>'
}
html += 
'<div class="hoverMenuPos" id="hoverMenuPosId" onMouseOver=\"menuOver_3821();\" ' +
'onMouseOut=\"menuOut_3821();\">';
if (children[ii].uri == location.pathname) {
html += '<a class="hiHover" href="' +  children[ii].uri + '" ' + onClick + '>';
} else {
html += '<a class="hover" href="' +  children[ii].uri + '" ' + onClick + '>';
}
html += children[ii].name + '</a></div>';
}
}          
hp = document.getElementById("hoverpopup_3821");
hp.style.top = (parseInt(hoveritem.offsetTop) + parseInt(hoveritem.offsetHeight)) + "px";
hp.style.visibility = "Visible";
hp.innerHTML = html;
}
return false;
}
function menuOut_3821() {
timeOn_3821 =  setTimeout("showSelected_3821()",  1000)             
}
var timeOn_3821 = null;
function menuOver_3821() {
clearTimeout(timeOn_3821)
}   
function showSelected_3821() {
showChildrenMenu_3821( 
document.getElementById("flatMenuItemAnchor" + selectedPageId), selectedPageId);
}
</script>
</body>
</html>

BeautifulSoup doesn't seem to be able to deal with this and complains about a "malformed start tag" around the onMouseOver=\"menuOver_3821();\". It seems to be trying to parse the markup that the JavaScript generates inside the script block.

Any ideas on how to make BeautifulSoup ignore the contents of script tags?

I have seen other suggestions to use lxml, but I can't, since this has to run on Google App Engine.

A: 

Did you try replacing the angle brackets < and > with &lt; and &gt; in all the HTML that is inside the JavaScript?
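
If the page source can't be edited by hand, the same idea could be applied as a preprocessing step before the markup is handed to BeautifulSoup. Below is a minimal sketch using only the standard re module; the names SCRIPT_RE and escape_script_bodies are made up for illustration, and the regex assumes script blocks are not nested and that the string "</script>" does not appear inside the JavaScript itself. Note that it also escapes comparison operators such as ii > 0, so the result is only good for parsing, not for running the script.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Captures the opening tag, the body, and the closing tag of each script block.
# DOTALL lets the body span several lines; the match is non-greedy so each
# block is handled separately.
SCRIPT_RE = re.compile(r'(<script\b[^>]*>)(.*?)(</script\s*>)',
                       re.IGNORECASE | re.DOTALL)

def escape_script_bodies(markup):
    """Escape < and > inside script bodies, leaving the script tags intact."""
    def _escape(match):
        open_tag, body, close_tag = match.groups()
        body = body.replace('<', '&lt;').replace('>', '&gt;')
        return open_tag + body + close_tag
    return SCRIPT_RE.sub(_escape, markup)

page = "<div><script type='text/javascript'>html += '<a href=\"x\">';</script></div>"
cleaned = escape_script_bodies(page)
# cleaned can now be passed to the parser in place of the raw page
```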

Jim Garrison
A: 

I've faced this kind of problem before, and what I normally do is replace every occurrence of <script with <!-- and </script> with -->. That way, all the <script></script> tags are commented out.

blwy10
Good point, but it would not work if the script block itself contains <!-- -->.
alphageek
That is quite true. I suppose you could add a sanity check: when a <script is encountered, set a flag and remove every --> until the matching </script> tag is encountered, or something like that.
blwy10
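
The comment-out approach with the sanity check from the comments could be sketched roughly as follows. This is only an illustration under a few assumptions: the names SCRIPT_RE and comment_out_scripts are hypothetical, a regex stands in for the "flag" (it finds each script block and cleans its body in one pass), and the string "</script>" is assumed not to occur inside the JavaScript.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Matches a whole script block and captures only its body.
SCRIPT_RE = re.compile(r'<script\b[^>]*>(.*?)</script\s*>',
                       re.IGNORECASE | re.DOTALL)

def comment_out_scripts(markup):
    """Turn each script block into an HTML comment.

    Any --> inside the script body is removed first, so the comment
    cannot be closed early by the script's own content.
    """
    def _to_comment(match):
        body = match.group(1).replace('-->', '')
        return '<!--' + body + '-->'
    return SCRIPT_RE.sub(_to_comment, markup)

page = "<p>a</p><script>var s = '<!-- x -->'; document.write('<b>');</script><p>b</p>"
cleaned = comment_out_scripts(page)
# cleaned can now be passed to the parser in place of the raw page
```

Unlike a plain string replace of <script with <!--, this drops the attributes of the opening tag as well, so nothing of the tag is left dangling at the start of the comment.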
+1  A: 

Reverting to BeautifulSoup 3.0.7a solved this issue and many other HTML oddities that 3.1.0.1 choked on.

alphageek
A: 

That would work, but the point of BeautifulSoup is parsing whatever tag soup you throw at it, even if it's horribly ill-formed.

ddaa