beautifulsoup

help on getting image src from a table cell using BeautifulSoup

I have an HTML page that has a form, and a table inside the form that has rows of products. I am now looping through the table rows, and in each loop I grab all the table cells. for tr in t.findAll('tr'): td = tr.findAll('td') Now I want to grab the image src URL from the first td. The HTML looks like: <t...

Using BeautifulSoup, how to guard against elements not being found?

I am looping through table rows in a table, but the first 1 or 2 rows don't have the elements I am looking for (they are for table column headers etc.). So after, say, the 3rd table row, there are elements in the table cells (td) that have what I am looking for, e.g. td[0].a.img['src'] But calling this fails since the first few rows...

beautifulsoup, Find th with text 'price', then get price from next th

My html looks like: <td> <table ..> <tr> <th ..>price</th> <th>$99.99</th> </tr> </table> </td> So I am in the current table cell, how would I get the 99.99 value? I have so far: td[3].findChild('th') But I need to do: Find th with text 'price', then get next th tag's string value. ...

name and value lookup collection, loaded from a dropdown list using beautifulsoup

On my HTML page I have a dropdown list: <select name="somelist"> <option value="234234234239393">Some Text</option> </select> So to get this list I am doing: ddl = soup.findAll('select', name="somelist") if(ddl): ??? Now I need help with this collection/dictionary. I want to be able to look up by both 'Some Text' and 2342342...
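
One way to get lookups in both directions is to keep two plain dictionaries, one per direction. This is only a minimal sketch: it assumes the (value, text) pairs have already been read out of the <option> tags, and the names `options`, `text_by_value` and `value_by_text` are hypothetical.

```python
# Hypothetical (value, text) pairs, as they might be read from the <option> tags.
options = [("234234234239393", "Some Text")]

# One dictionary per lookup direction.
text_by_value = {value: text for value, text in options}
value_by_text = {text: value for value, text in options}

print(text_by_value["234234234239393"])
print(value_by_text["Some Text"])
```

Two dictionaries keep each lookup a single key access, at the cost of storing every pair twice. If two options share the same text, the later one wins in `value_by_text`.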

Extracting a tag value in BeautifulSoup when unable to match by position or attributes.

I'm using BS to scrape a web page and I'm a little stuck with a small problem. Here's a snippet of HTML from the page. <span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br> </span> Once I've got the soup, how can I find this tag and get the artist name, i.e. M.I.A.? I cannot match the tag with the ...

Python regex to find any link that contains the text 'abc123'

I am using Beautiful Soup to find all href tags. links = myhtml.findAll('a', href=re.compile('????')) I need to find all links that have 'abc123' in the href text. I need help with the regex; see ???? in my code snippet. ...
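
As a rough illustration of the regex part only, an unanchored pattern matches 'abc123' anywhere in a string when it is applied with search(). The sketch below uses the standard `re` module on a hypothetical list of href strings. Whether Beautiful Soup applies a compiled pattern with search semantics is an assumption worth checking against the installed version.

```python
import re

# Unanchored pattern: search() succeeds if 'abc123' occurs anywhere in the string.
pattern = re.compile(r"abc123")

# Hypothetical href values, for illustration only.
hrefs = ["/items/abc123/view", "http://example.com/?id=xabc123y", "/about"]

matching = [h for h in hrefs if pattern.search(h)]
print(matching)
```

If the pattern were applied with match() instead, it would need a leading `.*` to allow text before 'abc123'.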

Having some beautifulsoup htmlparse errors, how to revert to a different version?

I am using BeautifulSoup, and I am getting some HTMLParser errors with start tags etc. I read on crummy's site that one suggestion is to go back to an older version (3.08). I am using Ubuntu, where I did: sudo apt-get install python-beautifulsoup to install it. How can I check what version I have now? How can I force a specific ver...

Using beautifulSoup, trying to get all table rows that have a string in them.

I need to get all table rows on a page that contain a specific string 'abc123123' in them. The string is inside a TD, but I need the entire TR if it contains 'abc123123' anywhere inside. I tried this: userrows = s.findAll('tr', contents = re.compile('abc123123')) I'm not sure if contents is the right property. My HTML looks som...

Problem using replaceWith to replace HTML tags with BeautifulSoup on Python

I am using BeautifulSoup in Python and am having trouble replacing some tags. I am finding <div> tags and checking for children. If those children do not have children (are a text node of NODE_TYPE = 3), I am copying them to be a <p>. from BeautifulSoup import Tag, BeautifulSoup class bar: self.soup = BeautifulSoup(self.input) foo()...

Problem with encode decode. Python. Django. BeautifulSoup

In this code: soup=BeautifulSoup(program.Description.encode('utf-8')) name=soup.find('div',{'class':'head'}) print name.string.decode('utf-8') an error happens when I try to print or save to the database. It doesn't matter what I do: print name.string.encode('utf-8') or just print name.string Traceback (most recent ca...

Want all links that have 2 attributes, how do you pass 2 attributes?

I know how to pass 1 attribute, but how do I pass 2? e.g. somerows = soup.findAll('a', target="blank") what if I want all links that have target="blank" and class="blah" ? ...

Help retrieving product code from HTML using Beautiful Soup

A webpage has a product code I need to retrieve, and it is in the following HTML section: <table...> <tr> <td> <font size="2">Product Code#</font> <br> <font size="1">2342343</font> </td> </tr> </table> So I guess the best way to do this would be first to reference the HTML element with the text value 'Product Code#', and then re...

Using BeautifulSoup, Can I quickly traverse to a specific parent element?

Say I reference an element inside of a table in an HTML page like this: someEl = soup.findAll(text = "some text") I know for sure this element is embedded inside a table. Is there a way to find the parent table without having to call .parent so many times? <table...> .. .. <tr>....<td><center><font..><b>some text</b></font></center><...

python beautifulsoup adding extra end tags

I'm using BeautifulSoup to parse a website: request = urllib2.Request(url) response = urllib2.urlopen(request) soup = BeautifulSoup.BeautifulSoup(response) I am using it to traverse a table. The problem I am running into is that BS is adding an extra end tag for the table into the HTML which doesn't exist, which I verified with...

How can I strip comment tags from HTML using BeautifulSoup?

I have been playing with BeautifulSoup, which is great. My end goal is to try and just get the text from a page. I am just trying to get the text from the body, with a special case to get the title and/or alt attributes from <a> or <img> tags. So far I have this EDITED & UPDATED CURRENT CODE: soup = BeautifulSoup(page) comments = soup...

How can I get all the attributes of an HTML tag?

How can I get all the attributes of an HTML tag? listinp = soup('input') for input in listinp: # get all attr on this tag in dict ...

Get a list of the absolute paths of all the images in a page using BeautifulSoup

Could someone show me how to get a list of absolute paths for all the images in a webpage using BeautifulSoup? It's simple to get all the images. I'm doing this: page_images = [image["src"] for image in soup.findAll("img")] ...but I'm having difficulties getting the absolute paths. Any help? Thank you. ...
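
A common approach to turning relative src values into absolute URLs is to join each one against the URL of the page it came from. The sketch below shows this with the standard library's `urljoin`. The `base_url` and the `srcs` list are hypothetical stand-ins for the page address and the values collected in `page_images`.

```python
from urllib.parse import urljoin  # Python 3; in Python 2 this lives in the urlparse module

base_url = "http://example.com/gallery/index.html"  # hypothetical page URL
srcs = ["img/a.png", "/static/b.png", "http://cdn.example.com/c.png"]  # hypothetical src values

# Relative paths are resolved against base_url; already-absolute URLs pass through unchanged.
absolute = [urljoin(base_url, src) for src in srcs]
print(absolute)
```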

Removing Tags from HTML Parsed with BeautifulSoup

I'm new to python and I'm using BeautifulSoup to parse a website and then extract data. I have the following code: for line in raw_data: #raw_data is the parsed html separated into smaller blocks d = {} d['name'] = line.find('div', {'class':'torrentname'}).find('a') print d['name'] <a href="/ubuntu-9-10-desktop-i386-t314421...

BeautifulSoup chokes on paths with back slashes

I wrote a script to automate the process of creating an image gallery. I used os.path.join() for creating paths to new image directories. I only realized after creating all the galleries that using os.path.join() was not such a good idea, as it creates paths with \ (on Windows), which causes problems with Firefox (it doesn't seem to unders...
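
For paths that end up in HTML rather than on the file system, one option is to build them with forward slashes from the start, or to convert the ones already written. A minimal sketch follows. The helper name `to_url_path` and the example path components are hypothetical.

```python
import posixpath

def to_url_path(path):
    """Replace Windows-style backslashes with forward slashes."""
    return path.replace("\\", "/")

# posixpath.join always uses '/', whatever the operating system.
print(posixpath.join("gallery", "holiday", "img01.jpg"))

# Converting a path that was already built with backslashes.
print(to_url_path("gallery\\holiday\\img01.jpg"))
```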

How to serialize beautifulsoup access-paths?

I have code which does something like this: item.previous.parent.parent.aTag['href'] Now I would like to be able to add filters fast, so hardcoding is no longer an option. How can I access the same tags with a path coded in a string? Of course I could invent some format like [('getattr', 'previous'), ('getattr', 'parent'), ..., ('ge...
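
One possible string format is a dotted attribute path that is split and followed with getattr. The sketch below is an illustration under assumptions, not a Beautiful Soup feature: `resolve` is a hypothetical helper, and plain stand-in objects are used instead of parsed tags so the example runs on its own.

```python
from functools import reduce
from types import SimpleNamespace

def resolve(obj, path):
    """Follow a dotted attribute path such as 'previous.parent.parent'."""
    return reduce(getattr, path.split("."), obj)

# Stand-in objects instead of real parsed tags, for illustration only.
grandparent = SimpleNamespace(name="table")
parent = SimpleNamespace(parent=grandparent)
item = SimpleNamespace(previous=SimpleNamespace(parent=parent))

print(resolve(item, "previous.parent.parent").name)
```

This covers attribute steps only. A final item lookup such as ['href'] would need to be applied separately to the resolved object.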