beautifulsoup

Scraping a table using BeautifulSoup

Dear Python Experts, I have a question which I suspect is fairly straightforward. I have the following type of page from which I want to collect the information in the last table (if you scroll all the way down it is the one in the box labelled "Procedure"): http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&re...

How to use BeautifulSoup to extract from within a HTML paragraph?

Hello, I'm using BeautifulSoup to do some screen-scraping. My problem is this: I need to extract specific things out of a paragraph. An example: <p><b><a href="/name/abe">ABE</a></b> &nbsp; <font class="masc">m</font> &nbsp; <font class="info"><a href="/nmc/eng.php" class="usg">English</a>, <a href="/nmc/jew.php" class="usg">Hebrew</a>...

Alternatives to my slow method of using BeautifulSoup and Python to parse Amazon API XML?

As the title says, I'm using the BS module in Python to parse XML pages that I access from the Amazon API (I create the signed URL, load it with urllib2, and then parse with BS). It takes about 4 seconds to do two pages, but there has to be a faster way. Would PHP be faster? What's making it slow, the BS parsing or the urllib2 loading? ...

What encoding does the unicode function in BeautifulSoup convert from?

When I use the unicode function in BeautifulSoup - what encoding does it convert to Unicode from? Does it automatically use the soup.originalEncoding? from BeautifulSoup import BeautifulSoup doc = "<html><h1>Heading</h1><p>Text" soup = BeautifulSoup(doc) print unicode(soup) Thanks ...

BeautifulSoup doesn't give me Unicode

I'm using Beautiful Soup to scrape data. The BS documentation states that BS should always return Unicode, but I can't seem to get Unicode. Here's a code snippet: import urllib2 from libs.BeautifulSoup import BeautifulSoup # Fetch and parse the data url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern' dat...

malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

I just installed python, mplayer, beautifulsoup and sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seem straightforward, but am encountering some issues. I'm not that familiar with Python, so this may be out of my league. I was able to get everything installed, but then running sipie gives this: /usr/bin/Si...

BeautifulSoup get innerhtml data

I am trying to read data from a website. I can see the value I need, but it does not appear in the downloaded HTML (fetched with urllib2). The value is created by a JS file and embedded into the page as the innerHTML for that id. How can it be extracted? Unlike a browser, the raw source does not run JS. ...

Hi, all. I have a question about BeautifulSoup.

Now, I use the call "allcity = dom.body.findAll(attrs={'id' : re.compile("\d{1,2}")})", which returns a list that looks like [<a onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" href="http://www.ylyd.com/showurl.asp...

A href catching

Hello, I'm using BeautifulSoup for parsing some HTML. Here is the content: <tr> <th>Your provider:</th> <td> <img src="/isp_logos/la-la-la.ico" alt=""/> <a href="/isp/SomeProvider"> Provider name </a> &nbsp; <a href="http://*/isp-comparer/?isp=000000"> </a> </td> </tr> I have to get the SomeProvider text from the link. ...

Python Beautiful soup tag for table td

Python Beautiful soup tag for table td <td class="result" valign="top" colspan="3"> At the moment, the following does not work: for header in soup('table', 'td .result'): Getting error: HTMLParser.HTMLParseError: malformed start tag ...

Unable to get correct link in BeautifulSoup

I'm trying to parse a bit of HTML and I'd like to extract the link that matches a particular pattern. I'm using the find method with a regular expression but it doesn't get me the correct link. Here's my snippet. Could someone tell me what I'm doing wrong? from BeautifulSoup import BeautifulSoup import re html = """ <div class="entry">...

ANSI, ASCII, Unicode and encoding confusion with Python

Hi! I was happily using BeautifulSoup, and I'm also using a text file as input parameters for my Python script. I then came across the famous "UnicodeEncodeError" error. I've been reading questions here at SO but I'm still confused. What does ASCII have to do with all of this? What encoding do I use in my text editor (Notepad++)? ANSI? ...
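
A minimal sketch of where ASCII enters the picture. It assumes Python 3, whereas the excerpt looks like Python 2, where ASCII was the implicit default codec. The sample string is made up for illustration and is not from the question.

```python
# Illustrative text containing one non-ASCII character (an assumption, not from the question).
text = "caf\u00e9"


def encode_or_error(value, encoding):
    """Return the encoded bytes, or the exception name if the codec cannot represent the text."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


ascii_result = encode_or_error(text, "ascii")  # ASCII has no code for the accented e
utf8_result = encode_or_error(text, "utf-8")   # UTF-8 can represent any Unicode text

print(ascii_result)
print(utf8_result)
```

The usual pattern is to decode bytes into Unicode text when reading, and to encode with an explicit codec such as UTF-8 when writing, so that no implicit ASCII step is involved.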

post to page to login using beautiful soup

I'm using Python and BeautifulSoup (new to both!), and I want to log in to a supplier's website. Their form looks like this (simplified): <form name=loginform action=/index.html method="post"> <input name=user> <input name=pass"> </form> Is there a way to keep track of cookies? ...
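
BeautifulSoup only parses HTML, so the POST and the cookie handling have to come from an HTTP library. Below is a rough sketch with the Python 3 standard library (in the Python 2 of the excerpt the modules were named urllib2 and cookielib). The URL and credentials are hypothetical; only the field names "user" and "pass" come from the form above. Nothing is sent over the network here.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical URL; only the field names "user" and "pass" come from the form in the question.
LOGIN_URL = "http://example.com/index.html"


def build_login(username, password):
    """Return an opener that remembers cookies, its cookie jar, and a POST request for the form."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    body = urllib.parse.urlencode({"user": username, "pass": password}).encode("ascii")
    request = urllib.request.Request(LOGIN_URL, data=body)  # data is present, so the method is POST
    return opener, jar, request


opener, jar, request = build_login("alice", "secret")
# opener.open(request) would submit the form; later opener.open(...) calls
# would send back whatever cookies the server stored in jar.
print(request.get_method(), request.data)
```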

with beautifulsoup, how to reference the first table after a given form

I want to drill down into my HTML. Specifically, I want to get the first HTML table that is AFTER a form that looks like: <form method="POST" action="/parts.html"> .. <table ...> ... </table> .. </form> So this table has a <tr> for each product. My ultimate goal here is to loop through each table row, and then I need to extract the ...

Getting BeautifulSoup to catch tags in a non-case-sensitive way

I want to catch some tags with BeautifulSoup: Some <p> tags, the <title> tag, some <meta> tags. But I want to catch them regardless of their case; I know that some sites do meta like this: <META> and I want to be able to catch that. I noticed that BeautifulSoup is case-sensitive by default. How do I catch these tags in a non-case-sensit...
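
One generic way to match tags regardless of case is to normalise tag names to lower case before comparing. The sketch below shows this with the standard library's html.parser as a stand-in, not with BeautifulSoup itself: it passes tag names to handle_starttag already lower-cased. The sample markup and the TagCollector and start_tags names are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases the tag name before calling this method.
        self.tags.append(tag)


def start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


sample = '<META name="a"><P>hi</P><Meta name="b">'
found = start_tags(sample)
meta_count = sum(1 for name in found if name == "meta")
print(found, meta_count)
```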

BeautifulSoup and ASP.NET/C#

Has anyone integrated BeautifulSoup with ASP.NET/C# (possibly using IronPython or otherwise)? Is there a BeautifulSoup alternative or a port that works nicely with ASP.NET/C#? The intent is to use the library to extract readable text from any random URL. Thanks ...

BeautifulSoup(html) not working, saying can't call module?

import urllib2 import urllib from BeautifulSoup import BeautifulSoup # html from BeautifulSoup import BeautifulStoneSoup # xml import BeautifulSoup # everything import re f = o.open( 'http://www.google.com', p) html = f.read() f.close() soup = BeautifulSoup(html) Getting an error saying the line with soup ...
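
The imports in the excerpt first bind the name BeautifulSoup to the class and then, with "import BeautifulSoup", rebind it to the module, which would explain a "module is not callable" error. The same pattern can be reproduced with the standard library; datetime is used here purely as a stand-in that has a module and a class with the same name.

```python
from datetime import datetime  # the name "datetime" now refers to the class
import datetime                # the same name is rebound to the module


def call_name():
    """Try to call whatever 'datetime' currently refers to and report the outcome."""
    try:
        return repr(datetime(2020, 1, 1))
    except TypeError as exc:
        return str(exc)


shadowed = call_name()                     # calling the module raises TypeError
qualified = datetime.datetime(2020, 1, 1)  # reaching the class through the module works

print(shadowed)
print(qualified.year)
```

Dropping the later "import BeautifulSoup" line, or calling BeautifulSoup.BeautifulSoup(html), would be the analogous fix.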

Using beautifulsoup, how do I reference table rows in an HTML page

I have an HTML page that looks like: <html> .. <form post="/products.hmlt" ..> .. <table ...> <tr>...</tr> <tr> <td>part info</td> .. </tr> </table> .. </form> .. </html> I tried: form = soup.findAll('form') table = form.findAll('table') # table inside form But I get an e...

form -> table -> tr using successive findAll calls

OK, so I can reference my table correctly in an HTML page like this: form = soup.findAll('form')[1] table = form.findAll('table', width="79%") # returns 1 table, doing a print shows table with rows tr = table.findAll('tr') I get an error: ResultSet object has no attribute findAll. Why doesn't this work? I used the output of form.f...
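
In BeautifulSoup, findAll returns a ResultSet (a list of tags) even when only one tag matches, while find returns a single tag, so the error comes from calling findAll on the list itself. The sketch below shows the same list-versus-element distinction with the standard library's ElementTree as a stand-in, not with BeautifulSoup. The markup is a simplified, made-up version of the page in the question.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup modelled on the question's form -> table -> tr layout.
markup = (
    "<html><form><table>"
    "<tr><td>part info</td></tr><tr><td>more</td></tr>"
    "</table></form></html>"
)
root = ET.fromstring(markup)

tables = root.findall(".//table")  # a list of matches, even when there is only one
try:
    tables.findall("tr")           # a list has no findall method
    error = ""
except AttributeError as exc:
    error = str(exc)

rows = tables[0].findall("tr")        # index into the list first, then search that element
first_cell = rows[0].find("td").text  # find returns a single element rather than a list

print(error)
print(len(rows), first_cell)
```

With BeautifulSoup the equivalent would be to index the result, as in form.findAll('table', width="79%")[0], or to call form.find(...) when a single table is expected.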

Is there a faster alternative to BeautifulSoup?

I have a piece of code that basically extracts text from a page. It uses BeautifulSoup to first remove script, style and noscript tags and then find all the text in the page and return it. I don't want to do anything fancy, just get all the text in a page. However, it turns out that BeautifulSoup is rather slow, as it takes an appreciab...
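
For comparison, the same extraction (all text except the contents of script, style and noscript) can be sketched with the standard library's html.parser. No speed claim is made here; timing it against the BeautifulSoup version on real pages would be needed. The TextExtractor and extract_text names are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside script, style or noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p{}</style><script>var a=1;</script></head>"
    "<body><p>Hello</p><noscript>no js</noscript><p>world</p></body></html>"
)
result = extract_text(page)
print(result)
```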