questions about beautifulsoup | ansaurus

beautifulsoup

Adding New Element to Text Substring

Say I have the following string: "I am the most foo h4ck3r ever!!" I'm trying to write a makeSpecial(foo) function where the foo substring would be wrapped in a new span element, resulting in: "I am the most <span class="special">foo</span> h4ck3r ever!!" BeautifulSoup seemed like the way to go, but I haven't been able to make it ...
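A minimal sketch of one possible direction, using only the Python standard library rather than BeautifulSoup: when the input is a plain-text string with no existing markup, a regular-expression substitution can wrap each occurrence of the substring. The name `make_special` and its `css_class` parameter are illustrative assumptions, not part of the question.

```python
import re
from html import escape


def make_special(text, word, css_class="special"):
    """Wrap every occurrence of `word` in `text` in a <span> element.

    Assumption: `text` is plain text with no existing tags, so it is
    HTML-escaped before the span markup is added.
    """
    safe_text = escape(text, quote=False)
    safe_word = escape(word, quote=False)
    opening = '<span class="{}">'.format(escape(css_class, quote=True))
    pattern = re.compile(re.escape(safe_word))
    # Use a function as the replacement so backslashes are never interpreted.
    return pattern.sub(lambda m: opening + m.group(0) + "</span>", safe_text)


result = make_special("I am the most foo h4ck3r ever!!", "foo")
print(result)
```

This approach would not be safe on input that already contains HTML, since the pattern could match inside tag names or attribute values; that case needs a real parser.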

BeautifulSoup's Python 3 compatibility

Does BeautifulSoup work with Python 3? If not, how soon will there be a port? Will there be a port at all? Google doesn't turn up anything for me (maybe it's because I'm looking for the wrong thing?) ...

Scrape a dynamic website

What is the best method to scrape a dynamic website where most of the content is generated by what appear to be Ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new. --Edit-- For more detail: I'm trying to scrape the CNN primary database. There is a wealth of infor...

screen-scraping

How do I find all cells with a particular attribute in BeautifulSoup?

Hi, I am trying to develop a script to pull some data from a large number of HTML tables. One problem is that the number of rows that contain the information to create the column headings is indeterminate. I have discovered that the last row of the set of header rows has the attribute border-bottom for each cell with a value. Thus I de...

How can you use BeautifulSoup to get colindex numbers?

I had a problem a week or so ago. Since I think the solution was cool, I am sharing it here while I wait for an answer to the question I posted earlier. I need to know the relative position of the column headings in a table so I know how to match each column heading up with the data in the rows below. I found some of my tables ha...

How do you get the text from an HTML 'datacell' using BeautifulSoup?

I have been trying to strip out some data from HTML files. I have the logic coded to get the right cells. Now I am struggling to get the actual contents of the 'cell'. Here is my HTML snippet: headerRows[0][10].contents [<font size="+0"><font face="serif" size="1"><b>Apples Produced</b><font size="3"> </font></font></font>] ...

Is there a more Pythonic way to merge two HTML header rows with colspans?

I am using BeautifulSoup in Python to parse some HTML. One of the problems I am dealing with is that the colspans differ across header rows. (In my jargon, header rows are the rows that need to be combined to get the column headings.) That is, one column may span a number of columns above or below it and the...

Decomposing HTML to link text and target

Given an HTML link like <a href="urltxt" class="someclass" close="true">texttxt</a> how can I isolate the URL and the text? Update: I'm using Beautiful Soup and am unable to figure out how to do that. I did: soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url)) links = soup.findAll('a') for link in links: print "link co...
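For illustration, here is a hedged sketch of separating a link's target from its text using the standard library's `html.parser` instead of Beautiful Soup. The class name `LinkCollector` and its attributes are hypothetical. With Beautiful Soup itself, the usual route would be `link['href']` for the target and `link.string` for the text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._inside = False  # True while an <a> element is open
        self._href = None     # href of the currently open <a>, if any
        self._text = []       # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append((self._href, "".join(self._text)))
            self._inside = False


collector = LinkCollector()
collector.feed('<a href="urltxt" class="someclass" close="true">texttxt</a>')
collector.close()
print(collector.links)
```

The sketch does not handle nested `<a>` elements, which are invalid HTML anyway.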

What does this Python message mean?

ho-fe3fdd00-12:~ Sam$ easy_install BeautifulSoup Traceback (most recent call last): File "/usr/bin/easy_install", line 8, in <module> load_entry_point('setuptools==0.6c7', 'console_scripts', 'easy_install')() File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/setuptools/command/easy_install.py", line...

What is the best way to handle a bad link given to BeautifulSoup?

I'm working on something that pulls in URLs from delicious and then uses those URLs to discover associated feeds. However, some of the bookmarks in delicious are not HTML links and cause BS to barf. Basically, I want to throw away a link if BS fetches it and it does not look like HTML. Right now, this is what I'm getting. trillian:D...

BeautifulSoup 3.1 parser breaks far too easily

I was having trouble parsing some dodgy HTML with BeautifulSoup. Turns out that the HTMLParser used in newer versions is less tolerant than the SGMLParser used previously. Does BeautifulSoup have some kind of debug mode? I'm trying to figure out how to stop it borking on some nasty HTML I'm loading from a crabby website: <HTML> <...

BeautifulSoup - modifying all links in a piece of HTML?

Hello, I need to be able to modify every single link in an HTML document. I know that I need to use the SoupStrainer but I'm not 100% positive on how to implement it. If someone could direct me to a good resource or provide a code example, it'd be very much appreciated. Thanks. ...

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2

I'm using BeautifulSoup 3.1.0.1 with Python 2.5.2 and trying to parse a web page in French. However, as soon as I call findAll, I get the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1146: ordinal not in range(128) Below is the code I am currently running: import urllib2 from BeautifulSoup i...

How to embed a p tag inside some text using BeautifulSoup?

I want to embed a <p> tag wherever there is a \r\n\r\n. u"Finally Sri Lanka showed up, prevented their first 5-0 series whitewash, and stopped India at nine ODI wins in a row. \r\n\r\nFor 62 balls Yuvraj Singh played a dream knock, keeping India in the game despite wickets falling around him. \r\n\r\nPerhaps the toss played a big par...
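One way this could be approached without a parser, assuming the input is plain text: split on the `\r\n\r\n` separator and wrap each non-empty chunk in a `<p>` element. The function name `paragraphs_to_html` and the sample text are illustrative assumptions.

```python
from html import escape


def paragraphs_to_html(text, separator="\r\n\r\n"):
    """Wrap each separator-delimited chunk of plain text in a <p> element."""
    chunks = (chunk.strip() for chunk in text.split(separator))
    # Escape each chunk so stray '<' or '&' characters cannot break the markup.
    return "".join(
        "<p>{}</p>".format(escape(chunk, quote=False)) for chunk in chunks if chunk
    )


sample = "First paragraph. \r\n\r\nSecond paragraph. \r\n\r\nThird."
html_out = paragraphs_to_html(sample)
print(html_out)
```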

How do I extract my required data from an HTML file?

This is the HTML I have: p_tags = '''<p class="foo-body"> <font class="test-proof">Full name</font> Foobar<br /> <font class="test-proof">Born</font> July 7, 1923, foo, bar<br /> <font class="test-proof">Current age</font> 27 years 226 days<br /> <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan...

Preventing BeautifulSoup from converting my XML tags to lowercase

I am using BeautifulStoneSoup to parse an XML document and change some attributes. I noticed that it automatically converts all XML tags to lowercase. For example, my source file has <DocData> elements, which BeautifulSoup converts to <docdata>. This appears to be causing problems since the program I am feeding my modified XML document t...
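As a point of comparison, and not a fix for BeautifulStoneSoup itself, the standard library's `xml.etree.ElementTree` keeps tag names in their original case. The sketch below changes an attribute and serializes the result; the sample document and the `id` attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: mixed-case tag names, as in the <DocData> example.
source = '<Root><DocData id="1">text</DocData></Root>'

root = ET.fromstring(source)
for element in root.iter("DocData"):
    element.set("id", "2")  # modify an attribute in place

# encoding="unicode" returns a str rather than bytes.
output = ET.tostring(root, encoding="unicode")
print(output)
```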

Parsing an HTML file with selectorgadget.com

How can I use Beautiful Soup and SelectorGadget to scrape a website? For example, I have a website (a Newegg product) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS). By this I mean: Intel, Desktop, ......, 2.4GHz, 1066Mhz, ...... , 3 years limited. After using SelectorGadget I ...

html-content-extraction

Issues with BeautifulSoup parsing

I am trying to parse an HTML page with BeautifulSoup, but it appears that BeautifulSoup doesn't like the HTML or that page at all. When I run the code below, the prettify() method returns only the script block of the page (see below). Does anybody have an idea why this happens? import urllib2 from BeautifulSoup import BeautifulSoup ur...

Decoding HTML Entities With Python

The following Python code uses BeautifulStoneSoup to fetch the LibraryThing API information for Tolkien's "The Children of Húrin". import urllib2 from BeautifulSoup import BeautifulStoneSoup URL = ("http://www.librarything.com/services/rest/1.0/" "?method=librarything.ck.getwork&id=1907912" "&apikey=2a2e596b887...
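The excerpt is cut off before the actual problem is stated, so the following is only a general sketch of entity decoding in Python 3 using the standard library's `html.unescape`. The sample string is invented for illustration and is not the LibraryThing response.

```python
from html import unescape

# Hypothetical text containing a named entity, a numeric entity, and &amp;.
raw = "The Children of H&uacute;rin &amp; J.R.R. Tolkien&#8217;s notes"
decoded = unescape(raw)
print(decoded)
```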

BeautifulSoup gives me unicode+html symbols, rather than straight up unicode. Is this a bug or misunderstanding?

I'm using BeautifulSoup to scrape a website. The website's page renders fine in my browser: Oxfam International’s report entitled “Offside! http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271 In particular, the single and double quotes look fine. They look like HTML symbols rather than ASCII, though strangely wh...
