I am trying to scrape a web page, but I am getting some strange errors. Here is the code:
html = urllib2.urlopen("http://sis.rpi.edu/reg/zs201101.htm").read() # 1
html = re.sub("(<script)(.+\n)+(.+)(</script>)","", html) # 2
print type(html) # 3 (Returns: <type 'str'>)
soup = BeautifulSoup(html) # 4
With line 2 commented out, BeautifulSoup tries to parse 'html' but raises this error: "HTMLParser.HTMLParseError: bad end tag: u'< /sc"+"ript>', at line 15, column 75". To get around this, I tried to remove the script tags altogether.
However, after removing the script tags with the regular expression on line 2 (which successfully strips the JavaScript and returns a string), I get this error: "TypeError: expected string or buffer", as if 'html' were not a string (which it is, according to line 3).
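For reference, the script-removal step can be looked at in isolation. The following is only a minimal sketch, not the exact code above: the `strip_scripts` helper and the sample HTML are made up for illustration, and it swaps my original pattern for a non-greedy one with `re.DOTALL`. It only shows that the regex step itself returns a plain string; it does not explain the TypeError.

```python
import re

# Hypothetical helper (not part of the original code): remove every
# <script>...</script> block. re.DOTALL lets "." match newlines, and the
# non-greedy ".*?" stops at the first closing tag instead of the last one.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.DOTALL | re.IGNORECASE)

def strip_scripts(html):
    return SCRIPT_RE.sub("", html)

# Made-up sample that mimics the split end tag from the error message.
sample = (
    "<html><head>\n"
    "<script type=\"text/javascript\">\n"
    "document.write(\"<scr\"+\"ipt>x< /sc\"+\"ript>\");\n"
    "</script>\n"
    "</head><body><p>Course list</p></body></html>"
)

cleaned = strip_scripts(sample)
print(type(cleaned))  # the result of re.sub is still a string
print(cleaned)
```

Running this on the real page would mean passing the result of `urllib2.urlopen(...).read()` to `strip_scripts` in place of `sample`.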
If anybody knows what is going on here, your help would be appreciated. Thanks!