questions about beautifulsoup | ansaurus

beautifulsoup

Python beautiful soup arguments

Hi I have this code that fetches some text from a page using BeautifulSoup soup= BeautifulSoup(html) body = soup.find('div' , {'id':'body'}) print body I would like to make this as a reusable function that takes in some htmltext and the tags to match it like the following def parse(html, atrs): soup= BeautifulSoup(html) body = s...

Using BeautifulSoup's findAll to search html element's innerText to get same result as searching attributes?

For instance if I am searching by an element's attribute like id: soup.findAll('span',{'id':re.compile("^score_")}) I get back a list of the whole span element that matches (which I like). But if I try to search by the innerText of the html element like this: soup.findAll('a',text = re.compile("discuss|comment")) I get back only ...

Is it possible for BeautifulSoup to work in a case-insensitive manner?

I am trying to extract Meta Description for fetched webpages. But here I am facing the problem of case sensitivity of BeautifulSoup. As some of the pages have <meta name="Description and some have <meta name="description. My problem is very much similar to that of Question on Stackoverflow The only difference is that I can't use lxm...

beautifulsoup and mechanize to get ajax call result

hi im building a scraper using python 2.5 and beautifulsoup but im stuble upon a problem ... part of the web page is generating after user click on some button, whitch start an ajax request by calling specific javacsript function using proper parameters is there a way to simulate user interaction and get this result? i come across a mec...

Extracting an attribute value with beautifulsoup

I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code: import urllib f = urllib.urlopen("http://58.68.130.147") s = f.read() f.close() from BeautifulSoup import BeautifulStoneSoup soup = BeautifulStoneSoup(s) inputTag = soup.findAll(attrs={"name" : "stainfo"})...

beautifulsoup: find the n-th element's sibling

I have a complex html DOM tree of the following nature: <table> ... <tr> <td> ... </td> <td> <table> <tr> <td>  <table> ... ...

Extracting value in Beautifulsoup

I have the following code: f = open(path, 'r') html = f.read() # no parameters => reads to eof and returns string soup = BeautifulSoup(html) schoolname = soup.findAll(attrs={'id':'ctl00_ContentPlaceHolder1_SchoolProfileUserControl_SchoolHeaderLabel'}) print schoolname which gives: [<span id="ctl00_ContentPlaceHolder1_SchoolProfileUs...

Beautiful Soup Unicode encode error

I am trying the following code with a particular HTML file from BeautifulSoup import BeautifulSoup import re import codecs import sys f = open('test1.html') html = f.read() soup = BeautifulSoup(html) body = soup.body.contents para = soup.findAll('p') print str(para).encode('utf-8') I get the following error: UnicodeEncodeError: 'asci...

Using urllib and BeautifulSoup to retrieve info from web with Python

I can get the html page using urllib, and use BeautifulSoup to parse the html page, and it looks like that I have to generate file to be read from BeautifulSoup. import urllib sock = urllib.urlopen("http://SOMEWHERE") htmlSource = sock.read() sock.close() ...

what is the return value of BeautifulSoup.find ?

I run to get some value as score. score = soup.find('div', attrs={'class' : 'summarycount'}) I run 'print score' to get as follows. <div class=\"summarycount\">524</div> I need to extract the number part. I used re module but failed. m = re.search("[^\d]+(\d+)", score) TypeError: expected string or buffer function search in re...

How do I make BeautifulSoup parse the contents of textarea tags as HTML?

Before 3.0.5, BeautifulSoup used to treat the contents of <textarea> as HTML. It now treats it as text. The document I am parsing has HTML inside the textarea tags, and I am trying to process it. I've tried: for textarea in soup.findAll('textarea'): contents = BeautifulSoup.BeautifulSoup(textarea.contents) textarea....

Optimizing BeautifulSoup (Python) code

I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me with this? I am using BeautifulSoup for parsing and than save into a DB. If I comment out the save statement, it still takes a long time, so there is no problem with the dat...

BeautifulSoup, but for CSS?

BeautifulSoup parses HTML and offers various ways to manipulate and search within HTML. Is there something similar for CSS? Specifically, I'd like to know if a given HTML text is rendered as bold. Either it has an ancestor that is the <strong> or the <bold> tag (which can be done with BeautifulSoup), or it has an ancestor (or itself) th...

In Python BeautifulSoup How to move tags

I have a partially converted XML document in soup coming from HTML. After some replacement and editing in the soup, the body is essentially - <Text...></Text> # This replaces <a href..> tags but automatically creates the </Text> <p class=norm ...</p> <p class=norm ...</p> <Text...></Text> <p class=norm ...</p> and so forth. I nee...

Python GUI Scraper hanging issues.

I wrote a scraper using python a while back, and it worked fine in the command line. I have made a GUI for the application now, but I am having trouble with one issue. When I attempt to update text inside the gui (e.g. 'fetching URL 12/50'), I am unable seeing as the function within the scraper is grabbing 100+ links. Also when going ...

screen-scraping

How to get these values with BeautifulSoup?

Hello everybody, I have this html table: <table> <tr> <td class="datax">a</td> <td class="datax">b</td> <td class="datax">c</td> <td class="datax">d</td> </tr> <tr> <td class="datax">e</td> <td class="datax">f</td> <td class="datax">g</td> <td class="datax">h</...

Matching id's in BeautifulSoup

Hello, I'm using BeautifulSoup - python module. I have to find any reference to the div's with id like: 'post-#'. For example: <div id="post-45">...</div> <div id="post-334">...</div> How can I filter this? html = '<div id="post-45">...</div> <div id="post-334">...</div>' soupHandler = BeautifulSoup(html) print soupHandler.findAll('d...

grabbing a substring while scraping with Python2.6

Hey can someone help with the following? I'm trying to scrape a site that has the following information.. I need to pull just the number after the </strong> tag.. [<li><strong>ISBN-13:</strong> 9780375853401</li>, <li><strong>Pub. Date: </strong> 05/11/2010</li>] [<li><strong>UPC:</strong> 490355000372</li>, <li><strong>Catalog No:</st...

Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Is there a way to get around the following? httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt Is the only way around this to contact the site-owner (barnesandnoble.com).. i'm building a site that would bring them more sales, not sure why they would deny access at a certain depth. I'm using mechanize and Beautif...

screen-scraping

http-status-code-403

How to unescape special characters from BeautifulSoup output?

Hi, I am facing issues with the special characters like and which represent the degree Fahrenheit sign and the registered sign, when i print the string the contains the special characters, it gives output like this: Preheat oven to 350° F Welcome to Lorem Ipsum Inc® Is there a way I can output the exact characters and n...

character-encoding

special-characters

1
...
5
6
7
8
9
...
12