Hi
I have this code that fetches some text from a page using BeautifulSoup
soup = BeautifulSoup(html)
body = soup.find('div', {'id': 'body'})
print body
I would like to turn this into a reusable function that takes some HTML text and the tags to match, like the following:
def parse(html, atrs):
    soup = BeautifulSoup(html)
    body = s...
For instance if I am searching by an element's attribute like id:
soup.findAll('span',{'id':re.compile("^score_")})
I get back a list of the whole span elements that match (which I like).
But if I try to search by the innerText of the html element like this:
soup.findAll('a',text = re.compile("discuss|comment"))
I get back only ...
I am trying to extract the meta description from fetched web pages, but I am running into BeautifulSoup's case sensitivity.
Some of the pages have <meta name="Description" and some have <meta name="description".
My problem is very similar to that of a question on Stack Overflow.
The only difference is that I can't use lxm...
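One way to sidestep the case difference is to lower-case the value of the name attribute before comparing. The sketch below uses the standard library's html.parser rather than BeautifulSoup or lxml, so it runs without extra packages. The names MetaDescriptionParser and meta_description and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remembers the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names, but not values.
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        # Compare the attribute value case-insensitively.
        if name.lower() == "description":
            self.description = attributes.get("content")


def meta_description(html):
    """Return the meta description of html, or None if there is none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description


upper = meta_description('<head><meta name="Description" content="A page"></head>')
lower = meta_description('<head><meta name="description" content="A page"></head>')
```

With BeautifulSoup itself, passing a case-insensitive compiled pattern such as re.compile('^description$', re.I) as the value to match for name should have a similar effect.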
Hi, I'm building a scraper using Python 2.5 and BeautifulSoup,
but I've stumbled upon a problem: part of the web page is generated
after the user clicks a button, which starts an AJAX request by calling a specific JavaScript function with the proper parameters.
Is there a way to simulate user interaction and get this result? I came across a mec...
I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})...
I have a complex html DOM tree of the following nature:
<table>
...
<tr>
<td>
...
</td>
<td>
<table>
<tr>
<td>
<!-- inner most table -->
<table>
...
...
I have the following code:
f = open(path, 'r')
html = f.read() # no parameters => reads to eof and returns string
soup = BeautifulSoup(html)
schoolname = soup.findAll(attrs={'id':'ctl00_ContentPlaceHolder1_SchoolProfileUserControl_SchoolHeaderLabel'})
print schoolname
which gives:
[<span id="ctl00_ContentPlaceHolder1_SchoolProfileUs...
I am trying the following code with a particular HTML file
from BeautifulSoup import BeautifulSoup
import re
import codecs
import sys
f = open('test1.html')
html = f.read()
soup = BeautifulSoup(html)
body = soup.body.contents
para = soup.findAll('p')
print str(para).encode('utf-8')
I get the following error:
UnicodeEncodeError: 'asci...
I can fetch an HTML page with urllib and parse it with BeautifulSoup, but it looks like I have to write the page to a file for BeautifulSoup to read.
import urllib
sock = urllib.urlopen("http://SOMEWHERE")
htmlSource = sock.read()
sock.close() ...
I run the following to get a score value:
score = soup.find('div', attrs={'class' : 'summarycount'})
Running 'print score' gives the following:
<div class="summarycount">524</div>
I need to extract the number part. I tried the re module, but it failed:
m = re.search("[^\d]+(\d+)", score)
TypeError: expected string or buffer
function search in re...
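The TypeError suggests that score is a Tag object rather than a string, and re.search only accepts strings. Below is a minimal sketch, assuming that str(score) yields the markup printed above. The helper name extract_number is hypothetical, and a plain string stands in for str(score) so the snippet runs without BeautifulSoup.

```python
import re

# Stand-in for str(score): the markup printed in the question.
score_html = '<div class="summarycount">524</div>'


def extract_number(markup):
    """Return the digits found between a tag's '>' and the next '<', or None."""
    match = re.search(r">\s*(\d+)\s*<", markup)
    return int(match.group(1)) if match else None


number = extract_number(score_html)
```

Anchoring the pattern between '>' and '<' keeps digits inside attribute values from matching by accident.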
Before 3.0.5, BeautifulSoup used to treat the contents of <textarea> as HTML. It now treats it as text. The document I am parsing has HTML inside the textarea tags, and I am trying to process it.
I've tried:
for textarea in soup.findAll('textarea'):
contents = BeautifulSoup.BeautifulSoup(textarea.contents)
textarea....
I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used.
Can anyone help me with this?
I am using BeautifulSoup for parsing and then saving into a DB. If I comment out the save statement, it still takes a long time, so there is no problem with the dat...
BeautifulSoup parses HTML and offers various ways to manipulate and search within HTML. Is there something similar for CSS?
Specifically, I'd like to know if a given HTML text is rendered as bold. Either it has an ancestor that is a <strong> or <b> tag (which can be done with BeautifulSoup), or it has an ancestor (or itself) th...
I have a partially converted XML document in soup coming from HTML. After some replacement and editing in the soup, the body is essentially -
<Text...></Text> # This replaces <a href..> tags but automatically creates the </Text>
<p class=norm ...</p>
<p class=norm ...</p>
<Text...></Text>
<p class=norm ...</p> and so forth.
I nee...
I wrote a scraper using Python a while back, and it worked fine on the command line. I have made a GUI for the application now, but I am having trouble with one issue. When I attempt to update text inside the GUI (e.g. 'fetching URL 12/50'), I am unable to, since the function within the scraper is busy grabbing 100+ links. Also when going ...
Hello everybody,
I have this html table:
<table>
<tr>
<td class="datax">a</td>
<td class="datax">b</td>
<td class="datax">c</td>
<td class="datax">d</td>
</tr>
<tr>
<td class="datax">e</td>
<td class="datax">f</td>
<td class="datax">g</td>
<td class="datax">h</...
Hello, I'm using the BeautifulSoup Python module. I need to find every div with an id like 'post-#'.
For example:
<div id="post-45">...</div>
<div id="post-334">...</div>
How can I filter this?
html = '<div id="post-45">...</div> <div id="post-334">...</div>'
soupHandler = BeautifulSoup(html)
print soupHandler.findAll('d...
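A filter of this kind usually comes down to a regular expression such as ^post-\d+$ applied to the id attribute. The sketch below shows the matching with the standard library's html.parser instead of BeautifulSoup, so it runs without extra packages. The class name PostDivCollector and the extra sidebar div are invented for the example.

```python
import re
from html.parser import HTMLParser

# Matches ids made of "post-" followed only by digits.
POST_ID = re.compile(r"^post-\d+$")


class PostDivCollector(HTMLParser):
    """Collects the id of every <div> whose id matches POST_ID."""

    def __init__(self):
        super().__init__()
        self.post_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        element_id = dict(attrs).get("id")
        if element_id and POST_ID.match(element_id):
            self.post_ids.append(element_id)


html = ('<div id="post-45">...</div> <div id="post-334">...</div> '
        '<div id="sidebar">...</div>')
collector = PostDivCollector()
collector.feed(html)
post_ids = collector.post_ids
```

With BeautifulSoup, the same compiled pattern can be passed as the value to match for id in findAll, in the style of the {'id': re.compile("^score_")} lookup quoted earlier.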
Hey, can someone help with the following?
I'm trying to scrape a site that has the following information. I need to pull just the number after the </strong> tag.
[<li><strong>ISBN-13:</strong> 9780375853401</li>, <li><strong>Pub. Date: </strong> 05/11/2010</li>]
[<li><strong>UPC:</strong> 490355000372</li>, <li><strong>Catalog No:</st...
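If each list item is first converted to a string, the label and the value after </strong> can be pulled out with one regular expression. This is a rough sketch under that assumption. The names FIELD and parse_fields are hypothetical, and the sample strings are copied from the output above.

```python
import re

# Sample markup from the question; in practice each entry would be the
# string form of one <li> element.
items = [
    "<li><strong>ISBN-13:</strong> 9780375853401</li>",
    "<li><strong>Pub. Date: </strong> 05/11/2010</li>",
]

# Group 1: the label inside <strong>, without its trailing colon.
# Group 2: the text between </strong> and </li>, stripped of whitespace.
FIELD = re.compile(r"<strong>\s*(.*?):?\s*</strong>\s*(.*?)\s*</li>")


def parse_fields(markup_items):
    """Map each label to the value that follows its </strong> tag."""
    fields = {}
    for markup in markup_items:
        match = FIELD.search(markup)
        if match:
            label, value = match.groups()
            fields[label] = value
    return fields


fields = parse_fields(items)
isbn = fields.get("ISBN-13")
```

Keeping the values as strings avoids losing leading zeros in codes such as a UPC.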
Is there a way to get around the following?
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales, so I'm not sure why they would deny access at a certain depth.
I'm using mechanize and Beautif...
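The wording of the error suggests it comes from the client library's own robots.txt check rather than from the server. As a hedged illustration of how such a rule is evaluated, the standard library's urllib.robotparser can be fed rules directly. The rules, user-agent string, and URLs below are invented for the example and are not the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse the lines directly, no network access needed

# Paths under /private/ are disallowed for every user agent.
blocked = parser.can_fetch("my-scraper", "http://example.com/private/page")
# Any other path falls through to the default, which is allowed.
allowed = parser.can_fetch("my-scraper", "http://example.com/public/page")
```

Checking the real robots.txt this way shows which paths are off limits before deciding whether to contact the site owner.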
Hi, I am facing issues with the special characters like and which represent the degree Fahrenheit sign and the registered sign,
when I print the string that contains the special characters, it gives output like this:
Preheat oven to 350° F
Welcome to Lorem Ipsum Inc®
Is there a way I can output the exact characters and n...