questions about beautifulsoup | ansaurus

beautifulsoup

Select specific child elements with BeautifulSoup

I'm reading up on BeautifulSoup to screen-scrape some pretty heavy HTML pages. Going through the BeautifulSoup documentation, I can't seem to find an easy way to select child elements. Given the HTML: <div id="top"> <div>Content</div> <div> <div>Content I Want</div> </div> </div> I want an easy way to get the "Content I ...

Pamie and python-win32 question

Hello. I'm currently writing a web-scraping script and chose PAMIE for it. I'm new to Python and to programming, so I have no idea whether PAMIE is really helpful for writing a script that works with python-win32. I ran into two problems while writing the script. First, I want my script to work w...

How do you eliminate all HTML tags/XML tags with BeautifulSoup?

It doesn't say it anywhere in the documentation, it only shows how to parse the tags. ...
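One possible approach to the tag-stripping question above, sketched with the standard library's `html.parser` rather than BeautifulSoup itself (a swapped-in technique, so that the example runs without third-party packages). The names `TextExtractor` and `strip_tags` are hypothetical helpers invented for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


text = strip_tags("<p>Hello <b>world</b>!</p>")
print(text)
```

Note that this sketch also keeps the text inside script and style elements, which may or may not be what you want.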

BeautifulSoup with Jython

I just tried to run BeautifulSoup (3.1.0.1) with Jython (2.5.1) and I was amazed to see how much slower it was than CPython. Parsing a page (http://www.fixprotocol.org/specifications/fields/5000-5999) with CPython took just under a second (0.844 second to be exact). With Jython it took 564 seconds - almost 700 times as much. Can anyone ...

BeautifulSoup - extracting attribute values

If Beautiful Soup gives me an anchor tag like this: <a class="blah blah" id="blah blah" href="link.html"></a> How would I retrieve the value of the href attribute? ...

Using md5 on BeautifulSoup result

I'm trying to use the MD5 algorithm on web pages to avoid seeing duplicates. Is there an easy way to convert the result from BeautifulSoup into a string that is digestible by MD5? Many thanks ...
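A minimal sketch of the duplicate-detection idea in the question above, assuming the parsed page has already been turned back into a string (for example with `str(soup)`). Plain strings stand in for that output here, and `page_fingerprint` is a hypothetical helper name. MD5 operates on bytes, so the text is encoded first.

```python
import hashlib


def page_fingerprint(markup):
    """Return the hex MD5 digest of a page's markup (encoded as UTF-8)."""
    return hashlib.md5(markup.encode("utf-8")).hexdigest()


# Stand-ins for the string form of three parsed pages.
pages = [
    ("a", "<html><body><p>Same page</p></body></html>"),
    ("b", "<html><body><p>Same page</p></body></html>"),
    ("c", "<html><body><p>Different page</p></body></html>"),
]

seen = set()
duplicates = []
for name, markup in pages:
    digest = page_fingerprint(markup)
    if digest in seen:
        duplicates.append(name)  # same bytes as an earlier page
    else:
        seen.add(digest)

print(duplicates)
```

This only catches byte-for-byte identical markup. Pages that differ in whitespace, timestamps, or session tokens will hash differently.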

Matching tags in BeautifulSoup

I'm trying to count the number of tags in the 'soup' from a BeautifulSoup result. I'd like to use a regular expression but am having trouble. The code I've tried is as follows: reg_exp_tag = re.compile("<[^>*>") tags = re.findall(reg_exp_tag, soup(cast as a string)) but re will not allow reg_exp_tag, giving an unexpected end of regular...
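In the pattern quoted in the question above, the character class `[^>*>` is never closed with `]`, which is why `re.compile` rejects it. The sketch below closes the class and counts matches in a small hard-coded string standing in for the string form of the soup. The sample markup and variable names are assumptions made for illustration.

```python
import re

# The pattern from the question fails to compile: "[" is never closed.
broken_pattern_error = None
try:
    re.compile("<[^>*>")
except re.error as exc:
    broken_pattern_error = str(exc)

# Closing the character class gives "<", then any run of non-">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

markup = "<div><p>One</p><p>Two</p></div>"  # stand-in for str(soup)
tags = TAG_PATTERN.findall(markup)
tag_count = len(tags)

print(tag_count)
```

The count includes closing tags as well as opening ones, and a regular expression like this can be fooled by a `>` inside an attribute value or a comment, so treat the result as approximate.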

BeautifulSoup is omitting body of page

BeautifulSoup newbie here; I need help. Here is the code sample: from mechanize import Browser from BeautifulSoup import BeautifulSoup mec = Browser() #url1 = "http://www.wines.com/catalog/index.php?cPath=21" url2 = "http://www.wines.com/catalog/product_info.php?products_id=4866" page = mec.open(url2) html = page.read() soup = BeautifulSou...

screen-scraping

Making BeautifulSoup ignore contents inside script tags

I have been trying to get BeautifulSoup (3.1.0.1) to parse an HTML page that has a lot of JavaScript that generates HTML inside script tags. One example fragment looks like this: <html><head><body><div> <script type='text/javascript'> if(ii > 0) { html += '<span id="hoverMenuPosSepId" class="hoverMenuPosSep">|</span>' } html += '<div class=...

BeautifulSoup.findAll() in Perl

I need to pull out all of the "NodeGroup" elements out of an XML file: <Database> <Get> <Data> <NodeGroups> <NodeGroup> <AssociateNode ConnID="6748763_2" /> <AssociateNode ConnID="6748763_1" /> <Data DataType="Capacity">2</Data> <Name>Alpha</Name> </NodeGroup> <...

BeautifulSoup - easy way to obtain HTML-free contents.

I'm using this code to find all interesting links in a page: soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+')) And it does its job pretty well. Unfortunately inside that a tag there are a lot of nested tags, like font, b and different things... I'd like to get just the text content, without any other html tag. Example of l...

html-content-extraction

Remove a tag using BeautifulSoup but keep its contents

Currently I have code that does something like this: soup = BeautifulSoup(value) for tag in soup.findAll(True): if tag.name not in VALID_TAGS: tag.extract() soup.renderContents() Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside ...

Python and BeautifulSoup, not finding 'a'

Hey, Here's a piece of HTML code (from delicious): <h4> <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers & Anti-Bot Protection</a> <span class="saverem"> <em class="bookmark-actions"> <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&tit...

How to design small web forms in html page

When I design my web form, it looks very small compared to my web page, because the form has only two fields (two text boxes and two labels). How can I design it so that it looks good? ...

How can I translate this XPath expression to BeautifulSoup?

In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with their documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression to a BeautifulSoup expression? hxs.select('//td[@class="altRow"][2]/a/@href...

BeautifulSoup get value in table

I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "owner Name(s)". What I have works, but it is really ugly and, I am sure, not the best approach, so I am looking for a better way. Here is what I have: soup = BeautifulSoup(url_opener.open(url)) x = soup('table', text = re.compile("Owner Name"))...

What are these errors and how do I handle them?

I am using this simple code for l in bios: OpenThisLink = url + l response = urllib2.urlopen(OpenThisLink) to open about 200 urls and search them with regex (and BeautifulSoup), but after a dozen or so I get these errors and IDLE quits. What do they mean? How can I handle them? Thank you. Traceback (most recent call last): ...

Need help with Python/BeautifulSoup

Can I combine these two blocks into one: Edit: Any other method than combining loops like Yacoby did in the answer. for tag in soup.findAll(['script', 'form']): tag.extract() for tag in soup.findAll(id="footer"): tag.extract() Also, can I combine multiple blocks into one: for tag in soup.findAll(id="footer"): tag.extract() for ...

Split a comma-separated list containing links with BeautifulSoup

I've got a comma-separated list in a table cell in an HTML document, but some of the items in the list are linked: <table> <tr> <td>Names</td> <td>Fred, John, Barry, <a href="http://www.example.com/">Roger</a>, James</td> </tr> </table> I've been using Beautiful Soup to parse the HTML, and I can get to the table, but ...

How to parse through script tag using Python and BeautifulSoup

Hi, I am trying to extract the attributes of a frame tag that is inside a document.write call on a page like this: <script language="javascript"> . . . document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>'); if (anchor != "") { document.write(...
