ansaurus

Question

Using Beautiful Soup, how do I iterate over all embedded text?

Answer 1

+4 A:

Suppose the variable test_html holds the following HTML:

<html>
<head><title>Test title</title></head>
<body>
<p>Some paragraph</p>
Useless Text
<a href="http://stackoverflow.com">Some link</a>not a link
<a href="http://python.org">Another link</a>
</body></html>

Loop over every text node with findAll(text=True) and replace each one in place:

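# Written for Beautiful Soup 3 on Python 2
from BeautifulSoup import BeautifulSoup

test_html = load_html_from_above()  # placeholder for the HTML string shown above
soup = BeautifulSoup(test_html)

# findAll(text=True) returns every text node in the document
for t in soup.findAll(text=True):
    text = unicode(t)
    for vowel in u'aeiou':
        text = text.replace(vowel, u'')  # example edit: drop lowercase vowels
    t.replaceWith(text)  # put the modified text back into the tree

print soup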

That prints:

<html>
<head><title>Tst ttl</title></head>
<body>
<p>Sm prgrph</p>
Uslss Txt
<a href="http://stackoverflow.com">Sm lnk</a>nt  lnk
<a href="http://python.org">Anthr lnk</a>
</body></html>

Note that the tags and attributes are untouched.
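
The code above targets Beautiful Soup 3 on Python 2. As a rough sketch of the same idea on Python 3 with the bs4 package (assuming it is installed), the loop might look like this. The helper name strip_vowels, the TEST_HTML constant, and the choice of the built-in "html.parser" backend are assumptions made for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Same sample document as in the answer, inlined so the sketch is self-contained.
TEST_HTML = """<html>
<head><title>Test title</title></head>
<body>
<p>Some paragraph</p>
Useless Text
<a href="http://stackoverflow.com">Some link</a>not a link
<a href="http://python.org">Another link</a>
</body></html>"""


def strip_vowels(html):
    """Hypothetical helper: remove lowercase vowels from every text node."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(string=True) collects the text nodes into a list first,
    # so replacing nodes inside the loop does not disturb the iteration.
    for node in soup.find_all(string=True):
        text = str(node)
        for vowel in "aeiou":
            text = text.replace(vowel, "")
        node.replace_with(text)
    return str(soup)


if __name__ == "__main__":
    print(strip_vowels(TEST_HTML))
```

In bs4, comments are string nodes too, so a document that contains them may need an isinstance check against bs4.Comment before replacing.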

nosklo 2009-05-06 20:18:58
