I am trying to get a web page using the following sample code:

from urllib import urlopen
print urlopen("http://www.php.net/manual/en/function.gettext.php").read()

Now I can get the whole web page in a variable. I want to get the part of the page that contains something like this:

<div class="methodsynopsis dc-description">
   <span class="type">string</span><span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
   )</div>

I want to extract the words "string", "gettext" and "$message" so that I can generate a file for use in another application.

+1  A: 

When extracting information from HTML, it isn't recommended to just hack some regexes together. The right way to do it is to use a proper HTML parsing module. Python has several good modules for this purpose - in particular I recommend BeautifulSoup.

Don't be put off by the name - it's a serious module used by a lot of people with great success. The documentation page has a lot of examples that should help you get started with your particular needs.
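To make the parser-based idea concrete, here is a minimal sketch. Note that it does not use BeautifulSoup: it swaps in the standard library's html.parser so it runs without a separate install, and it is written for Python 3 rather than the Python 2 used in the question. The names SNIPPET, SynopsisParser and extract_synopsis are made up for this illustration, and the markup is simply the fragment pasted in the question, not the live page.

```python
from html.parser import HTMLParser

# Fragment copied from the question; in practice this would be the downloaded page text.
SNIPPET = '''<div class="methodsynopsis dc-description">
   <span class="type">string</span><span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
   )</div>'''


class SynopsisParser(HTMLParser):
    """Collects text that sits inside tags carrying one of the wanted classes."""

    WANTED = {"type", "methodname", "parameter"}

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags as (tag name, wanted class or None)
        self.found = []   # (class name, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        wanted = next((c for c in classes if c in self.WANTED), None)
        self.stack.append((tag, wanted))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; this also discards
        # any unclosed tags (such as <br>) that were opened after it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the innermost enclosing tag with a wanted class.
        for _, wanted in reversed(self.stack):
            if wanted is not None:
                self.found.append((wanted, text))
                break


def extract_synopsis(html):
    """Return (class, text) pairs for the wanted classes found in html."""
    parser = SynopsisParser()
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    for cls, text in extract_synopsis(SNIPPET):
        print(cls, text)
```

The stray "(" and ")" in the fragment are skipped because they are not inside any tag with a wanted class. On a full page you would probably want to restrict the search to the methodsynopsis div first, which is the kind of thing BeautifulSoup makes much shorter.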

Eli Bendersky
+2  A: 

Why don't you try using BeautifulSoup?

Example code:

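from BeautifulSoup import BeautifulSoup

# htmldoc holds the HTML text of the page, e.g. the result of urlopen(...).read()
soup = BeautifulSoup(htmldoc)
# "class" is a reserved word in Python, so pass the attribute filter as a dict
allSpans = soup.findAll('span', {'class': 'type'})
for element in allSpans:
    print element.string  # text inside the tag; replace with your own processing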
pyfunc