Adapted from Toby Segaran's Programming Collective Intelligence (page 60):
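def gettextonly(soup):
    # .string is None unless the node is text or wraps a single text child
    v = soup.string
    if v is None:
        # Recurse into the children and join their text with newlines
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = gettextonly(t)
            resulttext += subtext + '\n'
        return resulttext
    else:
        return v.strip()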
Example usage:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = ['<html><head><title>Page title</title></head>',
...        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
...        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
...        '</html>']
>>> ''.join(doc)
'<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.<p id="secondpara" align="blah">This is paragraph <b>two</b>.</html>'
>>> soup = BeautifulSoup(''.join(doc))
>>> gettextonly(soup)
u'Page title\n\nThis is paragraph\none\n.\n\nThis is paragraph\ntwo\n.\n\n\n\n'
Note that the result is a single string, with text from inside different tags separated by newline (\n) characters.
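If BeautifulSoup 3 is not available (it only runs on Python 2), the same idea can be sketched with Python 3's standard library. This is a swapped-in technique, not the book's code: the names TextOnlyParser and gettextonly_stdlib are made up for illustration, and the output will not match the BeautifulSoup version exactly, because whitespace-only text is skipped rather than turned into blank lines.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical helper: collects the stripped text of each text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        text = data.strip()
        if text:
            self.chunks.append(text)


def gettextonly_stdlib(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.chunks)


doc = ''.join([
    '<html><head><title>Page title</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
    '</html>',
])
print(gettextonly_stdlib(doc))
```

As with gettextonly, text from inside different tags ends up separated by newline characters, so the result can be passed to a word splitter in the same way.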
If you would like to extract the words of the text as a list, you can use the following function, also adapted from Toby Segaran's Programming Collective Intelligence (page 61):
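import re

def separatewords(text):
    # Split on runs of non-word characters; '\\W+' cannot match the empty
    # string, so the result is the same on Python 2 and Python 3
    splitter = re.compile('\\W+')
    return [s.lower() for s in splitter.split(text) if s != '']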
Example usage:
>>> separatewords(gettextonly(soup))
[u'page', u'title', u'this', u'is', u'paragraph', u'one', u'this', u'is',
u'paragraph', u'two']
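One thing to watch if you run the word splitter on Python 3.7 or later: re.split there also splits on empty matches, so a pattern that can match the empty string, such as \W*, breaks the text between individual characters, while \W+ yields whole words. A minimal sketch of the difference, where separatewords_py3 is a hypothetical name used only for this illustration:

```python
import re


def separatewords_py3(text):
    """Hypothetical variant: split on runs of one or more non-word characters."""
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if s != '']


sample = 'Page title\n\nThis is paragraph\none\n.\n'

# Whole words, lowercased, with empty strings filtered out
print(separatewords_py3(sample))

# For comparison: a pattern that can match the empty string.
# On Python 3.7+ this also splits between adjacent word characters.
print(re.split(r'\W*', 'ab cd'))
```

Python 2 ignores empty matches when splitting, which is why both patterns behave the same there.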