So I am trying to scrape a web page but am getting some funky errors.

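import re
import urllib2
from BeautifulSoup import BeautifulSoup  # assumption: BeautifulSoup 3.x import path

html = urllib2.urlopen("http://sis.rpi.edu/reg/zs201101.htm").read() # 1
html = re.sub("(<script)(.+\n)+(.+)(</script>)","", html) # 2
print type(html) # 3 (Returns: <type 'str'>)
soup = BeautifulSoup(html) # 4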

With the line marked # 2 commented out, BeautifulSoup tries to parse 'html' but raises this error: "HTMLParser.HTMLParseError: bad end tag: u'< /sc"+"ript>', at line 15, column 75". To get around this, I tried removing the script tags altogether.

However, after removing the script tags with the regular expression (which does strip the JavaScript and returns a string), I get this error: "TypeError: expected string or buffer", as if 'html' were not a string (which it is).

If anybody knows what is going on here, your help would be appreciated. Thanks!

A: 

You can get rid of the script tags without a regex:

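import urllib2

html = urllib2.urlopen("http://sis.rpi.edu/reg/zs201101.htm").read()
# Split on the closing tag so each chunk holds at most one "<script" opener
for sc in html.split("</script>"):
    if "<script" in sc:
        # keep only the markup that comes before the script
        sc = sc.split("<script")[0]
    print sc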
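
The loop above only prints the cleaned pieces. If you want a single string to hand to the parser, a minimal sketch of the same split approach could look like the following. The function name strip_scripts and the sample markup are made up for illustration; the match is case-sensitive, and it will cut in the wrong place if the JavaScript itself contains the literal text "</script>".

```python
def strip_scripts(html):
    """Drop everything from each "<script" up to the next "</script>".

    Hypothetical helper, not part of any library: it is the same
    split-based idea as the loop above, collected into one string.
    """
    kept = []
    for chunk in html.split("</script>"):
        # split(...)[0] returns the whole chunk when "<script" is absent,
        # and only the markup before the script when it is present.
        kept.append(chunk.split("<script")[0])
    return "".join(kept)


# Made-up sample page with a multi-line script block
sample = ('<html><script type="text/javascript">var a = 1;\n'
          'var b = 2;</script><body>Hi</body></html>')
cleaned = strip_scripts(sample)
print(cleaned)  # <html><body>Hi</body></html>
```

The result is still a plain string, so it can be passed on in place of the original 'html'.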
ghostdog74