beautifulsoup

Find a specific tag with BeautifulSoup

I can traverse generic tags easily with BS, but I don't know how to find specific tags. For example, how can I find all occurrences of <div style="width=300px;">? Is this possible with BS? ...
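One possible approach, sketched under assumptions: it uses the newer `bs4` package rather than the BeautifulSoup 3 import seen elsewhere on this page, and the sample markup is made up for illustration. Passing the attribute as a keyword argument to `find_all` should match tags whose `style` value equals that string exactly.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the style value comes from the question.
html = """
<div style="width=300px;">first</div>
<div style="width=200px;">other</div>
<div style="width=300px;">second</div>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments filter on attribute values; a plain string must match exactly.
matches = soup.find_all("div", style="width=300px;")
texts = [div.get_text() for div in matches]
print(texts)
```

An exact string match is strict about whitespace and ordering inside the attribute, so a compiled regular expression may be a better filter when the style value varies.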

Extract all <script> tags in an HTML page and append to the bottom of the document

Could someone tell me how I can extract and remove all the <script> tags in an HTML document and add them to the end of the document, right before the </body></html>? I'd like to avoid using lxml if possible. Thanks. ...
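A minimal sketch of how this might be done, assuming the `bs4` package with Python's built-in `html.parser` (so lxml is not involved) and a made-up sample document: each script tag is detached with `extract()` and re-attached at the end of `<body>` with `append()`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with one script in <head> and one in <body>.
html = (
    "<html><head><script>var a = 1;</script></head>"
    "<body><script>var b = 2;</script><p>Hello</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list up front, so moving tags inside the loop is safe.
for script in soup.find_all("script"):
    # extract() removes the tag from the tree and returns it;
    # append() then places it as the last child of <body>.
    soup.body.append(script.extract())

# Names of the direct children of <body> after the move.
body_children = [child.name for child in soup.body.find_all(recursive=False)]
print(soup)
```

The scripts keep their original relative order, which matters when later scripts depend on earlier ones.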

BeautifulSoup not taking in a string?

So I am trying to scrape a web page but am getting some funky errors.

    html = urllib2.urlopen("http://sis.rpi.edu/reg/zs201101.htm").read() # 1
    html = re.sub("(<script)(.+\n)+(.+)(</script>)","", html) # 2
    print type(html) # 3 (Returns: <type 'str'>)
    soup = BeautifulSoup(html) # 4

With line 2 commented out, it tries to parse 'html' wit...

Beautiful Soup: Get the Contents of Sub-Nodes

Hello, I have the following Python code:

    def scrapeSite(urlToCheck):
        html = urllib2.urlopen(urlToCheck).read()
        from BeautifulSoup import BeautifulSoup
        soup = BeautifulSoup(html)
        tdtags = soup.findAll('td', { "class" : "c" })
        for t in tdtags:
            print t.encode('latin1')

This returns the following HTML code:...

Using SoupStrainer to parse selectively

I'm trying to parse a list of video game titles from a shopping site. However, the item list is all stored inside a tag. This section of the documentation supposedly explains how to parse only part of the document, but I can't work it out. My code: from BeautifulSoup import BeautifulSoup import urllib import re url = "Some Shopping ...
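A rough sketch of selective parsing, assuming the `bs4` package and its `parse_only` argument. The excerpt does not say which tag holds the item list, so the `<ul id="items">` container and the sample markup here are purely hypothetical.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: a navigation block plus the list we actually care about.
html = (
    '<div id="nav"><a href="/">Home</a></div>'
    '<ul id="items"><li>Game A</li><li>Game B</li></ul>'
)

# The strainer describes which part of the document should be parsed at all.
only_items = SoupStrainer("ul", id="items")

# Everything outside the matching tag is skipped during parsing.
soup = BeautifulSoup(html, "html.parser", parse_only=only_items)

titles = [li.get_text() for li in soup.find_all("li")]
print(titles)
```

Because the rest of the page never enters the tree, this can save time and memory on large documents compared with parsing everything and searching afterwards.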

Parsing a document with BeautifulSoup while not-parsing the contents of <code> tags

I'm writing a blog app with Django. I want to enable comment writers to use some tags (like <strong>, <a>, et cetera) but disable all others. In addition, I want to let them put code in <code> tags, and have pygments parse them. For example, someone might write this comment: I like this article, but the third code example <em>could have...

BeautifulSoup is too slow. Can lxml do this?

I've got the following BeautifulSoup code, a bit simplified.

    soup = BeautifulSoup(html)
    for item in soup.findAll('div',id=compile('^result_')):
        q = item.find('a',{'class':'title'})
        if q: ...
        q = item.find('div',{'class':['one','two']})
        if q: ...

I profiled it, and it's quite slow. I want to try lxml instead but it seem...

[edited] How to deal with utf-8 encoded String and BeautifulSoup?

How can I replace HTML entities in Unicode strings with proper Unicode? u'&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden, das Verh&Yuml;llen und Veredeln' to u'"HAUS-Kleider" - Über das Bekleiden und Entkleiden, das Verhüllen und Veredeln' Edit: Actually the entities are wrong. At it seems like BeautifulSoup f...e...
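As a hedged illustration of the entity-decoding step, the sketch below uses `html.unescape` from the Python 3 standard library rather than BeautifulSoup, which is a swapped-in technique and not necessarily what the asker ended up using. The input is a shortened version of the string from the question.

```python
import html

# Shortened sample taken from the question text.
raw = "&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden"

# unescape() converts named and numeric character references to Unicode.
decoded = html.unescape(raw)
print(decoded)
```

This only decodes what the entities actually say, so an incorrect entity in the source (such as the `&Yuml;` the asker mentions) will still decode to the wrong character.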