beautifulsoup

What's the nearest equivalent of Beautiful Soup for Ruby?

I love the Beautiful Soup scraping library in Python. It just works. Is there a close equivalent in Ruby? ...

HTML Entity Codes to Text

Does anyone know an easy way in Python to convert a string with HTML entity codes (e.g. &lt; &amp;) to a normal string (e.g. < &)? cgi.escape() will escape strings (poorly), but there is no unescape(). ...
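On Python 3 the standard library can do this directly. A minimal sketch, assuming Python 3.4 or later (where `html.unescape` is available); the sample string is made up for illustration:

```python
import html

# Illustrative input only; any string containing entity codes works the same way.
escaped = "&lt;b&gt; AT&amp;T &lt;/b&gt;"

# html.unescape converts named and numeric character references to characters.
plain = html.unescape(escaped)
print(plain)
```

Numeric references such as `&#8217;` are handled by the same call.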

Why is Beautiful Soup truncating this page?

I am trying to pull a list of resource/database names and IDs from a listing of resources that my school library has subscriptions to. There are pages listing the different resources, and I can use urllib2 to get the pages, but when I pass the page to BeautifulSoup, it truncates its tree just before the end of the entry for the first r...

Looking for a recommendation of a good tutorial on best practices for a web scraping project?

I need to do a fairly extensive project involving web scraping and am considering using Hpricot or Beautiful Soup (i.e. Ruby or Python). Has anyone come across a tutorial that they thought was particularly good on this subject that would help me start the project off on the right foot? ...

URL tree walker in Python?

For URLs that show file trees, such as Pypi packages, is there a small solid module to walk the URL tree and list it like ls -lR? I gather (correct me) that there's no standard encoding of file attributes, link types, size, date ... in html <A attributes so building a solid URLtree module on shifting sands is tough. But surely this wheel...

Complex Beautiful Soup query

Here is a snippet of an HTML file I'm exploring with Beautiful Soup. <td width="50%"> <strong class="sans"><a href="http:/website">Site</a></strong> <br /> I would like to get the <a href> for any line which has the <strong class="sans"> and which is inside a <td width="50%">. Is it possible to query an HTML file for those multipl...

Changing element value with BeautifulSoup returns empty element.

from BeautifulSoup import BeautifulStoneSoup xml_data = """ <doc> <test>test</test> <foo:bar>Hello world!</foo:bar> </doc> """ soup = BeautifulStoneSoup(xml_data) print soup.prettify() make = soup.find('foo:bar') print make # prints <foo:bar>Hello world!</foo:bar> make.contents = ['Top of the world Ma!'] print make # prints <foo:b...

How to convert html entities into symbols?

Hi, I have made some adaptations to the script from this answer, and I am having problems with Unicode. Some of the questions end up being written poorly. Some answers and responses end up looking like: Yeah.. I know.. I&#8217;m a simpleton.. So what&#8217;s a Singleton? (2) How can I make the &#8217; to be translated to the right cha...

python- is beautifulsoup misreporting my html?

I have two machines each, to the best of my knowledge, running python 2.5 and BeautifulSoup 3.1.0.1. I'm trying to scrape http://utahcritseries.com/RawResults.aspx, using: from BeautifulSoup import BeautifulSoup import urllib2 base_url = "http://www.utahcritseries.com/RawResults.aspx" data=urllib2.urlopen(base_url) soup=BeautifulSo...

Dynamically change range in Python?

So say I'm using BeautifulSoup to parse pages and my code figures out that there are at least 7 pages to a query. The pagination looks like 1 2 3 4 5 6 7 Next If I paginate all the way to 7, sometimes there are more than 7 pages, so that if I am on page 7, the pagination looks like 1 2 3 7 8 9 10 Next So now, I know there are...
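One way to handle an upper bound that grows during iteration is a `while` loop rather than a fixed `range`. This is only a sketch: `last_visible_page` is a hypothetical callable standing in for whatever scraping code reads the pagination bar, and `fake_last_visible_page` is made-up test data.

```python
def crawl_all_pages(last_visible_page, first_page=1):
    """Visit pages until no higher page number is advertised.

    last_visible_page is a hypothetical callable: given a page number, it
    returns the highest page number shown in that page's pagination bar.
    """
    visited = []
    page = first_page
    known_last = first_page
    while page <= known_last:
        visited.append(page)
        # The upper bound can grow as later pages reveal more links.
        known_last = max(known_last, last_visible_page(page))
        page += 1
    return visited


# Stand-in for real scraping: pages 1-6 show links up to 7,
# page 7 and later show links up to 10.
def fake_last_visible_page(page):
    return 7 if page < 7 else 10


print(crawl_all_pages(fake_last_visible_page))
```

The loop condition is re-evaluated on every pass, so the crawl extends itself whenever a page reveals more pages.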

Where can I find some "hello world"-simple Beautiful Soup examples?

I'd like to do a very simple replacement using Beautiful Soup. Let's say I want to visit all A tags in a page and append "?foo" to their href. Can someone post or link to an example of how to do something simple like that? ...

How do I iterate over the HTML attributes of a Beautiful Soup element?

How do I iterate over the HTML attributes of a Beautiful Soup element? Like, given: <foo bar="asdf" blah="123">xyz</foo> I want "bar" and "blah". ...
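As a rough illustration of the goal, here is a sketch that uses the standard library's `html.parser` in place of Beautiful Soup (a deliberate swap, so it runs on Python 3 without extra packages). The class name `AttributeLister` is made up for this example.

```python
from html.parser import HTMLParser


class AttributeLister(HTMLParser):
    """Collect (tag, attribute name, attribute value) for every start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        for name, value in attrs:
            self.found.append((tag, name, value))


lister = AttributeLister()
lister.feed('<foo bar="asdf" blah="123">xyz</foo>')
lister.close()

names = [name for _tag, name, _value in lister.found]
print(names)
```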

Using Beautiful Soup, how do I iterate over all embedded text?

Let's say I wanted to remove vowels from HTML: <a href="foo">Hello there!</a>Hi! becomes <a href="foo">Hll thr!</a>H! I figure this is a job for Beautiful Soup. How can I select the text in between tags and operate on it like this? ...
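The transformation described above can be sketched with the standard library's `html.parser` instead of Beautiful Soup (a swapped-in technique, chosen so the example runs on Python 3 without extra packages). `VowelStripper` and `strip_vowels` are hypothetical names, and the sketch only re-emits tags and text: comments and doctypes are dropped, and script or style content is not treated specially.

```python
import html
import re
from html.parser import HTMLParser

VOWELS = re.compile(r"[aeiouAEIOU]")


class VowelStripper(HTMLParser):
    """Re-emit markup unchanged while removing vowels from text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Reuse the original tag text so attributes are left untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are emitted once, as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again after editing.
        self.out.append(html.escape(VOWELS.sub("", data), quote=False))


def strip_vowels(markup):
    parser = VowelStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return "".join(parser.out)


print(strip_vowels('<a href="foo">Hello there!</a>Hi!'))
```

Attribute values such as `href="foo"` are preserved because only text passed to `handle_data` is edited.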

How to render contents of a tag in unicode in BeautifulSoup?

This is a soup from a WordPress post detail page: content = soup.body.find('div', id=re.compile('post')) title = content.h2.extract() item['title'] = unicode(title.string) item['content'] = u''.join(map(unicode, content.contents)) I want to omit the enclosing div tag when assigning item['content']. Is there any way to render all the c...

Error with Beautiful Soup's extract()

I'm working on some screen scraping software and have run into an issue with Beautiful Soup. I'm using python 2.4.3 and Beautiful Soup 3.0.7a. I need to remove an <hr> tag, but it can have many different attributes, so a simple replace() call won't cut it. Given the following html: <h1>foo</h1> <h2><hr/>bar</h2> And the following co...

Using BeautifulSoup to find an HTML tag that contains certain text

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11} <h2> this is cool #12345678901 </h2> So, the previous would match by using: soup('h2',text=re.compile(r' #\S{11}')) And the results would be something like: [u'blahblah #223409823523', u'thisisinteresting #293845023984'] I'm able to...

Can I segment a document in BeautifulSoup before converting it to text based on my analysis of the document?

I have some html files that I want to convert to text. I have played around with BeautifulSoup and made some progress on understanding how to use the instructions and can submit html and get back text. However, my files have a lot of text that is formatted using table structures. For example I might have a paragraph of text that res...

Should I implement the mixed use of BeautifulSoup and REGEXs or rely solely on BS

I have some data I need to extract from a collection of html files. I am not sure if the data resides in a div element, a table element or a combined element (where the div tag is an element of a table). I have seen all three cases. My files are large (as big as 2 MB) and I have tens of thousands of them. So far I have looked at the td ...

Beautiful Soup and uTidy

I want to pass the results of utidy to Beautiful Soup, à la: page = urllib2.urlopen(url) options = dict(output_xhtml=1,add_xml_decl=0,indent=1,tidy_mark=0) cleaned_html = tidy.parseString(page.read(), **options) soup = BeautifulSoup(cleaned_html) When run, the following error results: Traceback (most recent call last): File "soup.py...

Dynamically specifying tags while using replaceWith in Beautiful Soup

Previously I asked this question and got back this BeautifulSoup example code, which after some consultation locally, I decided to go with. >>> from BeautifulSoup import BeautifulStoneSoup >>> html = """ ... <config> ... <links> ... <link name="Link1" id="1"> ... <encapsulation> ... <mode>ipsec</mode> ... </encapsulation> ... </link...