ansaurus

Question

Answer 1

+3 A:

If you're not really excited at the idea of parsing the HTML code yourself, there are two good options: BeautifulSoup and lxml.

You'll probably find that lxml runs faster than BeautifulSoup, but in my experience BeautifulSoup was very easy to learn and use, and it handled the typical crappy HTML found in the wild well enough that I haven't needed anything else.

YMMV.
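
For what it's worth, here is a minimal sketch of the BeautifulSoup route. It assumes the newer bs4 package rather than the 2010-era module, and the HTML snippet and the class name `stuff` are made up purely for illustration. `get_text()` collects every string found under the matched div.

```python
from bs4 import BeautifulSoup

# Illustrative input only: one div we care about, plus a paragraph outside it.
html = (
    '<body><div class="stuff">Div lead <p>Para 1</p><span>tail</span></div>'
    '<p>outside</p></body>'
)

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# Find the div by class, then gather all text nested inside it.
div = soup.find("div", attrs={"class": "stuff"})
text = div.get_text(" ", strip=True)  # strip each string, join with a space

print(text)
```

The paragraph outside the div is left out, because `get_text()` only walks the subtree of the tag it is called on.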

bgporter 2010-10-26 15:28:57

Man, do I love BeautifulSoup.

JudoWill 2010-10-26 18:43:08

Answer 2

+3 A:

Using lxml:

import lxml.html as lh
content='''\
<body>
<div>AAAA
  <div>BBBB
     <div>CCCC
     </div>DDDD
  </div>EEEE
</div>FFFF
</body>
'''
doc=lh.document_fromstring(content)
div=doc.xpath('./body/div')[0]
print(div.text_content())
# AAAA
#   BBBB
#      CCCC
#      DDDD
#   EEEE

div=doc.xpath('./body/div/div')[0]
print(div.text_content())
# BBBB
#      CCCC
#      DDDD
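
Note that `text_content()` pulls in the text of every nested element as well. If only the text sitting directly inside the div is wanted, one possible approach is the XPath `text()` node test. This is a sketch under that assumption, not part of the original answer, and the names `outer` and `own_text` are just illustrative.

```python
import lxml.html as lh

content = '''\
<body>
<div>AAAA
  <div>BBBB
     <div>CCCC
     </div>DDDD
  </div>EEEE
</div>FFFF
</body>
'''
doc = lh.document_fromstring(content)
outer = doc.xpath('./body/div')[0]

# ./text() selects only the text nodes that are direct children of the div,
# skipping anything that belongs to nested elements.
own_text = [t.strip() for t in outer.xpath('./text()') if t.strip()]

print(own_text)
```

FFFF does not show up because it follows the closing tag of the outer div, which makes it a child of body rather than of the div.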

unutbu 2010-10-26 15:29:37

Answer 3

A:

Personally I prefer lxml in general, but there are times when its HTML handling is a bit off. Here's a BeautifulSoup recipe if it helps.

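# BeautifulSoup 3 import; with the newer bs4 package, use:
#   from bs4 import BeautifulSoup, NavigableString
from BeautifulSoup import BeautifulSoup, NavigableString

def printText(tags):
    """Recursively join the text of every string node under `tags`."""
    s = []
    for tag in tags:
        if isinstance(tag, NavigableString):
            s.append(tag)
        else:
            s.append(printText(tag))
    return "".join(s)

# Deliberately sloppy HTML: unclosed <p> and <blockquote> tags
html = "<html><p>Para 1<div class='stuff'>Div Lead<p>Para 2<blockquote>Quote 1</div><blockquote>Quote 2"
soup = BeautifulSoup(html)

v = soup.find('div', attrs={'class': 'stuff'})

# text_content() is an lxml method, not a BeautifulSoup one; use the helper
print(printText(v))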

koblas 2010-10-26 15:40:03

Selecting only text within a div tag
