What's the nearest equivalent of Beautiful Soup for Ruby?
I love the Beautiful Soup scraping library in Python. It just works. Is there a close equivalent in Ruby? ...
Does anyone know an easy way in Python to convert a string with HTML entity codes (e.g. &lt; &amp;) to a normal string (e.g. < &)? cgi.escape() will escape strings (poorly), but there is no unescape(). ...
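A minimal sketch for the entity question above, assuming Python 3.4 or later (the question itself predates it): the standard library's html.unescape() does the reverse of escaping. The sample string is made up for illustration.

```python
import html

# Sample input is illustrative only; it is not taken from the question.
encoded = "&lt; &amp; &gt;"

# html.unescape() converts named and numeric character references to text.
decoded = html.unescape(encoded)
print(decoded)
```

The matching html.escape() goes the other way, so the pair can round-trip plain text.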
I am trying to pull a list of resource/database names and IDs from a listing of resources that my school library has subscriptions to. There are pages listing the different resources, and I can use urllib2 to get the pages, but when I pass the page to BeautifulSoup, it truncates its tree just before the end of the entry for the first r...
I need to do a fairly extensive project involving web scraping and am considering using Hpricot or Beautiful Soup (i.e. Ruby or Python). Has anyone come across a tutorial that they thought was particularly good on this subject that would help me start the project off on the right foot? ...
For URLs that show file trees, such as PyPI packages, is there a small solid module to walk the URL tree and list it like ls -lR? I gather (correct me) that there's no standard encoding of file attributes, link types, size, date ... in html <A attributes so building a solid URLtree module on shifting sands is tough. But surely this wheel...
Here is a snippet of an HTML file I'm exploring with Beautiful Soup. <td width="50%"> <strong class="sans"><a href="http:/website">Site</a></strong> <br /> I would like to get the <a href> for any line which has the <strong class="sans"> and which is inside a <td width="50%">. Is it possible to query a HTML file for those multipl...
from BeautifulSoup import BeautifulStoneSoup xml_data = """ <doc> <test>test</test> <foo:bar>Hello world!</foo:bar> </doc> """ soup = BeautifulStoneSoup(xml_data) print soup.prettify() make = soup.find('foo:bar') print make # prints <foo:bar>Hello world!</foo:bar> make.contents = ['Top of the world Ma!'] print make # prints <foo:b...
Hi, I have made some adaptations to the script from this answer. and I am having problems with unicode. Some of the questions end up being written poorly. Some answers and responses end up looking like: Yeah.. I know.. I’m a simpleton.. So what’s a Singleton? (2) How can I make the ’ to be translated to the right cha...
I have two machines each, to the best of my knowledge, running python 2.5 and BeautifulSoup 3.1.0.1. I'm trying to scrape http://utahcritseries.com/RawResults.aspx, using: from BeautifulSoup import BeautifulSoup import urllib2 base_url = "http://www.utahcritseries.com/RawResults.aspx" data=urllib2.urlopen(base_url) soup=BeautifulSo...
So say I'm using BeautifulSoup to parse pages and my code figures out that there are at least 7 pages to a query. The pagination looks like 1 2 3 4 5 6 7 Next If I paginate all the way to 7, sometimes there are more than 7 pages, so that if I am on page 7, the pagination looks like 1 2 3 7 8 9 10 Next So now, I know there are...
I'd like to do a very simple replacement using Beautiful Soup. Let's say I want to visit all A tags in a page and append "?foo" to their href. Can someone post or link to an example of how to do something simple like that? ...
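One possible sketch of the replacement asked about above. It assumes the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import used elsewhere on this page, and the sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the third <a> has no href and is left alone.
html_doc = '<p><a href="/one">One</a> <a href="/two">Two</a> <a name="top">No href</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# href=True matches only <a> tags that actually carry an href attribute.
for link in soup.find_all("a", href=True):
    link["href"] = link["href"] + "?foo"

result = str(soup)
print(result)
```

Blindly appending "?foo" would be wrong for links that already have a query string; handling that case is left out of this sketch.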
How do I iterate over the HTML attributes of a Beautiful Soup element? Like, given: <foo bar="asdf" blah="123">xyz</foo> I want "bar" and "blah". ...
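A short sketch for the attribute question above, assuming the bs4 package on Python 3, where a tag's attributes are exposed as the dictionary tag.attrs (BeautifulSoup 3 stored them differently).

```python
from bs4 import BeautifulSoup

# Markup taken from the question.
soup = BeautifulSoup('<foo bar="asdf" blah="123">xyz</foo>', "html.parser")
tag = soup.find("foo")

# tag.attrs is a dict mapping attribute names to values.
names = list(tag.attrs)
for name, value in tag.attrs.items():
    print(name, value)
```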
Let's say I wanted to remove vowels from HTML: <a href="foo">Hello there!</a>Hi! becomes <a href="foo">Hll thr!</a>H! I figure this is a job for Beautiful Soup. How can I select the text in between tags and operate on it like this? ...
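A sketch of one way to approach the vowel example above, again assuming bs4 on Python 3 (a reasonably recent release, since it uses the string=True filter): collect every text node, then swap each one for a rewritten copy, leaving tags and attributes untouched.

```python
import re
from bs4 import BeautifulSoup

# Markup taken from the question.
soup = BeautifulSoup('<a href="foo">Hello there!</a>Hi!', "html.parser")

# find_all(string=True) returns the text nodes as a list, so replacing
# nodes inside the loop does not disturb the iteration.
for text_node in soup.find_all(string=True):
    text_node.replace_with(re.sub(r"[aeiouAEIOU]", "", text_node))

result = str(soup)
print(result)
```

Because only text nodes are rewritten, the href value "foo" keeps its vowels.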
This is a soup from a WordPress post detail page: content = soup.body.find('div', id=re.compile('post')) title = content.h2.extract() item['title'] = unicode(title.string) item['content'] = u''.join(map(unicode, content.contents)) I want to omit the enclosing div tag when assigning item['content']. Is there any way to render all the c...
I'm working on some screen scraping software and have run into an issue with Beautiful Soup. I'm using python 2.4.3 and Beautiful Soup 3.0.7a. I need to remove an <hr> tag, but it can have many different attributes, so a simple replace() call won't cut it. Given the following html: <h1>foo</h1> <h2><hr/>bar</h2> And the following co...
I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11} <h2> this is cool #12345678901 </h2> So, the previous would match by using: soup('h2',text=re.compile(r' #\S{11}')) And the results would be something like: [u'blahblah #223409823523', u'thisisinteresting #293845023984'] I'm able to...
I have some html files that I want to convert to text. I have played around with BeautifulSoup and made some progress on understanding how to use the instructions and can submit html and get back text. However, my files have a lot of text that is formatted using table structures. For example I might have a paragraph of text that res...
I have some data I need to extract from a collection of html files. I am not sure if the data resides in a div element, a table element or a combined element (where the div tag is an element of a table). I have seen all three cases. My files are large, as big as 2 MB, and I have tens of thousands of them. So far I have looked at the td ...
I want to pass the results of utidy to Beautiful Soup, ala: page = urllib2.urlopen(url) options = dict(output_xhtml=1,add_xml_decl=0,indent=1,tidy_mark=0) cleaned_html = tidy.parseString(page.read(), **options) soup = BeautifulSoup(cleaned_html) When run, the following error results: Traceback (most recent call last): File "soup.py...
Previously I asked this question and got back this BeautifulSoup example code, which after some consultation locally, I decided to go with. >>> from BeautifulSoup import BeautifulStoneSoup >>> html = """ ... <config> ... <links> ... <link name="Link1" id="1"> ... <encapsulation> ... <mode>ipsec</mode> ... </encapsulation> ... </link...