Hello, I'm using Beautiful Soup (in Python). I have a hidden input element like this:
<input type="hidden" name="form_build_id" id="form-531f740522f8c290ead9b88f3da026d2" value="form-531f740522f8c290ead9b88f3da026d2" />
I need its id and value.
Here is my code:
mainPageData = cookieOpener.open('http://page.com').read()
soupHandler = BeautifulSo...
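A minimal sketch of one way to pull those attributes out, assuming the modern bs4 import (the old from BeautifulSoup import BeautifulSoup behaves the same here) and using the input tag from the question as a stand-in for the fetched page:

```python
from bs4 import BeautifulSoup

# stand-in for mainPageData fetched from the site
html = ('<input type="hidden" name="form_build_id" '
        'id="form-531f740522f8c290ead9b88f3da026d2" '
        'value="form-531f740522f8c290ead9b88f3da026d2" />')

soupHandler = BeautifulSoup(html, "html.parser")
field = soupHandler.find("input", {"name": "form_build_id"})
form_id = field["id"]        # tag attributes are read like a dict
form_value = field["value"]
```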
Starting from an HTML input like this:
<p>
<a href="http://www.foo.com">this if foo</a>
<a href="http://www.bar.com">this if bar</a>
</p>
using BeautifulSoup, I would like to change this HTML into:
<p>
<a href="http://www.foo.com">this if foo[1]</a>
<a href="http://www.bar.com">this if bar[2]</a>
</p>
saving parsed links ...
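One way to do both at once, a sketch with bs4 (assigning to a tag's .string replaces its text in place):

```python
from bs4 import BeautifulSoup

html = ('<p>'
        '<a href="http://www.foo.com">this if foo</a>'
        '<a href="http://www.bar.com">this if bar</a>'
        '</p>')

soup = BeautifulSoup(html, "html.parser")
links = []
for n, a in enumerate(soup.find_all("a"), start=1):
    links.append(a["href"])                  # save the parsed link
    a.string = "%s[%d]" % (a.get_text(), n)  # append the counter to the text

result = str(soup)
```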
Starting from an HTML input like this:
<p>
<a href="http://www.foo.com" rel="nofollow">this is foo</a>
<a href="http://www.bar.com" rel="nofollow">this is bar</a>
</p>
Is it possible to modify the <a> node values ("this is foo" and "this is bar") by adding the suffix "PARSED" to the value, without recreating the whole link?
The result need t...
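Yes: assigning to a tag's .string swaps only the text node, leaving href, rel and the rest of the tag untouched. A sketch with bs4 (whether the suffix gets a leading space is an assumption here):

```python
from bs4 import BeautifulSoup

html = ('<p>'
        '<a href="http://www.foo.com" rel="nofollow">this is foo</a>'
        '<a href="http://www.bar.com" rel="nofollow">this is bar</a>'
        '</p>')

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a"):
    # .string replaces only the text; attributes stay untouched
    a.string = a.get_text() + " PARSED"

out = str(soup)
```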
Starting from an HTML input like this:
<p>
<a href="http://www.foo.com">this if foo</a>
<a href="http://www.bar.com">this if bar</a>
</p>
using BeautifulSoup, I would like to change this HTML into:
<p>
<a href="http://www.foo.com">this if foo</a><b>OK</b>
<a href="http://www.bar.com">this if bar</a><b>OK</b>
</p>
Is it po...
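It is: new_tag() plus insert_after() can splice a new element in without touching the links themselves (bs4 names; BeautifulSoup 3 has Tag(soup, 'b') and a positional insert instead). A sketch:

```python
from bs4 import BeautifulSoup

html = ('<p>'
        '<a href="http://www.foo.com">this if foo</a>'
        '<a href="http://www.bar.com">this if bar</a>'
        '</p>')

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a"):
    ok = soup.new_tag("b")
    ok.string = "OK"
    a.insert_after(ok)   # place <b>OK</b> right after each link

out = str(soup)
```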
Hi,
My local airport disgracefully blocks users without IE, and looks awful. I want to write a Python script that would get the contents of the Arrivals and Departures pages every few minutes, and show them in a more readable manner.
My tools of choice are mechanize for cheating the site to believe I use IE, and BeautifulSoup for parsi...
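The IE part boils down to sending an IE-style User-Agent header; mechanize does this via br.addheaders. For illustration, the same trick with plain urllib (the URL and the exact user-agent string are placeholders):

```python
import urllib.request

# an IE-looking user-agent string; any plausible value should do
IE_UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"

# hypothetical arrivals page URL
req = urllib.request.Request("http://airport.example.com/arrivals",
                             headers={"User-Agent": IE_UA})
# html = urllib.request.urlopen(req).read()  # then parse with BeautifulSoup
```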
I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are nested tags, but I don't care; I just want to get the inner text.
For example, for:
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
How can I extract:
Red
Blue
Yellow
Light green
Neither .string nor...
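get_text() (or joining findAll(text=True) in BeautifulSoup 3) flattens a tag's whole subtree to its text, which gives exactly this list. A sketch with bs4:

```python
from bs4 import BeautifulSoup

html = ('<p>Red</p>'
        '<p><i>Blue</i></p>'
        '<p>Yellow</p>'
        '<p>Light <b>green</b></p>')

soup = BeautifulSoup(html, "html.parser")
texts = [p.get_text() for p in soup.find_all("p")]
# texts == ['Red', 'Blue', 'Yellow', 'Light green']
```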
Hi,
I am writing a small site decorator to make my local airport site work with standard HTML.
On my local computer, I use Python's mechanize and BeautifulSoup packages to scrape and parse the site contents, and everything seems to work just fine. I have installed these packages via apt-get.
On my shared hosting site (at DreamHost) I ...
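A common workaround on shared hosts where system-wide installs aren't possible: copy the pure-Python package sources into a directory under your home and put it on sys.path before importing them (the directory name below is hypothetical):

```python
import os
import sys

# hypothetical directory holding copied BeautifulSoup/mechanize sources
vendor_dir = os.path.expanduser("~/python-packages")
if vendor_dir not in sys.path:
    sys.path.insert(0, vendor_dir)  # searched before system locations
```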
I'm trying to translate an online HTML page into text.
I have a problem with this structure:
<div align="justify"><b>Available in
<a href="http://www.example.com.be/book.php?number=1">
French</a> and
<a href="http://www.example.com.be/book.php?number=5">
English</a>.
</div>
Here is its representation as a python string:
'<d...
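For what it's worth, flattening such a div to plain text is usually get_text() plus whitespace normalization, since the newlines inside the markup survive into the text nodes. A sketch with bs4, using a compacted version of the snippet:

```python
from bs4 import BeautifulSoup

html = ('<div align="justify"><b>Available in</b> '
        '<a href="http://www.example.com.be/book.php?number=1">French</a> and '
        '<a href="http://www.example.com.be/book.php?number=5">English</a>.'
        '</div>')

soup = BeautifulSoup(html, "html.parser")
# collapse runs of whitespace/newlines left over from the markup
text = " ".join(soup.div.get_text().split())
# text == 'Available in French and English.'
```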
Hey guys, does BeautifulSoup strip CSS and JavaScript content? After using
content3 = ''.join(BeautifulSoup(content).findAll(text=True))
I still have them lingering around.
...
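It does not: findAll(text=True) returns the text nodes inside <script> and <style> too. Removing those tags first is the usual fix; a sketch with bs4 (extract() exists in BeautifulSoup 3 as well):

```python
from bs4 import BeautifulSoup

html = ('<html><head><style>p {color: red}</style>'
        '<script>alert("hi");</script></head>'
        '<body><p>visible text</p></body></html>')

soup = BeautifulSoup(html, "html.parser")
# drop the script/style subtrees before collecting text
for tag in soup.find_all(["script", "style"]):
    tag.extract()
content = soup.get_text()
```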
I want to get correctly delimited text out of BeautifulSoup, turning tags into whitespace if necessary. The problem is that newlines are collapsed and tags like <br/> are not rendered as whitespace.
<div class="companyInfo">
<p class="identInfo">
<acronym title="Standard Industrial Code">
SIC
</acronym>
...
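bs4's get_text() accepts a separator that is emitted at every tag boundary, which renders <br/> and friends as whitespace instead of nothing. A sketch (the company name is made up; only the <acronym> part comes from the question):

```python
from bs4 import BeautifulSoup

html = ('<div class="companyInfo">'
        '<p class="identInfo">'
        '<acronym title="Standard Industrial Code">SIC</acronym>'
        '<br/>Some Company</p></div>')

soup = BeautifulSoup(html, "html.parser")
# insert the separator between adjacent text pieces at tag boundaries
text = soup.get_text(separator=" ")
```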
Hi,
Can anyone tell me how I can get the table in an HTML page which has the most rows? I'm using BeautifulSoup.
There is one little problem though. Sometimes, there seems to be one table nested inside another.
<table>
<tr>
<td>
<table>
<tr>
<td></td>
<t...
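One way to make the row count ignore nested tables: attribute each <tr> only to its nearest enclosing <table>. A sketch with bs4 (the id attributes are added just to label the tables):

```python
from bs4 import BeautifulSoup

html = ('<table id="outer"><tr><td>'
        '<table id="inner"><tr><td>a</td></tr><tr><td>b</td></tr></table>'
        '</td></tr></table>')

soup = BeautifulSoup(html, "html.parser")

def own_row_count(table):
    # count only rows whose nearest enclosing <table> is this one,
    # so rows of nested tables are not attributed to the outer table
    return sum(1 for tr in table.find_all("tr")
               if tr.find_parent("table") is table)

biggest = max(soup.find_all("table"), key=own_row_count)
```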
So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few similar programs on here, but nothing quite like what I need. The one that I found most similar is right here (http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-imag...
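For the download step itself, a hedged sketch of the usual shape (Python 3 urllib; the URL in the comments is made up, and the actual network call is left to urlretrieve):

```python
import os
import urllib.parse
import urllib.request

def comic_filename(url):
    # derive a local file name from the last path segment of the URL
    return os.path.basename(urllib.parse.urlparse(url).path)

def download_comic(url, folder):
    # fetch one image into folder; untested sketch of the network call
    os.makedirs(folder, exist_ok=True)
    dest = os.path.join(folder, comic_filename(url))
    urllib.request.urlretrieve(url, dest)
    return dest
```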
I have an XML document which reads like this:
<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>
My question is: how do I access them using a library like BeautifulSoup in Python?
xmlDom.web["Web"].Total does not work.
...
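That attribute-style access is not BeautifulSoup's API. With the lenient html.parser backend, the namespace prefix simply becomes part of the (lowercased) tag name, so you can search for it literally. A sketch:

```python
from bs4 import BeautifulSoup

xml = ('<xml><web:Web>'
       '<web:Total>4000</web:Total>'
       '<web:Offset>0</web:Offset>'
       '</web:Web></xml>')

# html.parser treats "web:total" as a plain (lowercased) tag name
soup = BeautifulSoup(xml, "html.parser")
total = soup.find("web:total").get_text()
offset = soup.find("web:offset").get_text()
```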
Could someone tell me what's a better way to clean up bad HTML so BeautifulSoup can handle it: should one use the massage methods of BeautifulSoup, or clean it up using regular expressions?
Thanks.
...
Hi,
I'm trying to parse an XML file with BeautifulSoup. In all the tutorials on the net, the content of the XML is given inline, like
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
but I want to give only the XML file's path. In mechanize one can use the get_data() method, but it only works for HTML files. A...
Here's an example:
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
If each animal was in a separate element I could just iterate over the elements. That would be g...
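Since the animals and their attributes are siblings, one pass that tracks the most recent "animal" paragraph groups them. A sketch with bs4:

```python
from bs4 import BeautifulSoup

html = ("<p class='animal'>cats</p>"
        "<p class='attribute'>they meow</p>"
        "<p class='attribute'>they have fur</p>"
        "<p class='animal'>turtles</p>"
        "<p class='attribute'>they don't make noises</p>"
        "<p class='attribute'>they have shells</p>")

soup = BeautifulSoup(html, "html.parser")
groups = {}
current = None
for p in soup.find_all("p"):
    if "animal" in p.get("class", []):   # bs4 returns class as a list
        current = p.get_text()
        groups[current] = []
    elif current is not None:
        groups[current].append(p.get_text())
```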
Dear all,
I am parsing an HTML form with Beautiful Soup. Basically I've around 60 input fields, mostly radio buttons and checkboxes. So far this works with the following code:
from BeautifulSoup import BeautifulSoup
x = open('myfile.html','r').read()
out = open('outfile.csv','w')
soup = BeautifulSoup(x)
values = soup.findAll('input',...
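Continuing the same idea with the csv module (the form fields below are made-up stand-ins for the real file), each input's name/value pair becomes one CSV row:

```python
import csv
from bs4 import BeautifulSoup

# hypothetical stand-in for open('myfile.html').read()
html = ('<form>'
        '<input type="radio" name="color" value="red" checked>'
        '<input type="checkbox" name="opts" value="a">'
        '</form>')

soup = BeautifulSoup(html, "html.parser")
rows = [(i.get("name", ""), i.get("value", ""))
        for i in soup.find_all("input")]

with open("outfile.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)
```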
Hey again all,
I have the following script so far:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2
br = Browser()
br.open("http://www.foo.com")
html = br.response().read()
soup = BeautifulSoup(html)
items = soup.findAll(id="info")
and it runs perfectly, and results in the following ...
I am currently using BeautifulSoup to scrape some websites, but I have a problem with some specific characters; the code inside UnicodeDammit seems to indicate these (again) are some Microsoft-invented ones.
I'm using the newest version of BeautifulSoup (3.0.8.1), as I am still using Python 2.5.
The following code illustrates my proble...
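For the record, bs4's UnicodeDammit can transliterate those Windows-1252 smart quotes directly (BeautifulSoup 3 spells the keyword smartQuotesTo). A sketch, essentially the example from the bs4 documentation:

```python
from bs4 import UnicodeDammit

# \x93, \x94 and \x92 are Microsoft smart quotes in windows-1252
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
fixed = dammit.unicode_markup
# fixed == '<p>I just "love" Microsoft Word\'s smart quotes</p>'
```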
Dear Python Experts,
I have written the following trial code to retrieve the title of legislative acts from the European Parliament.
import urllib2
from BeautifulSoup import BeautifulSoup
search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"
for number in xran...
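The %.4d in the URL zero-pads the report number; once a page is fetched, the title is reachable through the parsed tree. A sketch with the fetch stubbed out (the stand-in page below is made up):

```python
from bs4 import BeautifulSoup

search_url = ("http://www.europarl.europa.eu/sides/getDoc.do"
              "?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN")
url = search_url % 123   # "%.4d" zero-pads: ...reference=A7-2010-0123...

# stand-in for urllib2.urlopen(url).read() in the Python 2 original
page = "<html><head><title>REPORT on something</title></head></html>"
title = BeautifulSoup(page, "html.parser").title.string
```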