I'm working on a web parser using urllib. I need to save only the lines that lie within a certain div tag. For instance, if I'm saving all text in the div "body", all text between that div's opening and closing tags should be returned. It's fine if there are other divs nested inside it, but as soon as I hit the closing tag of the parent div, it should stop. Any ideas?

My Idea

  1. Search for the div you're looking for.

  2. Record the position.

  3. Keep track of any divs after that point: +1 for each new div, -1 for each end div.

  4. When the count is back to 0, you're at the end of your parent div. Save the location.

  5. Then save the data from the beginning position to the end position?
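
The counting idea above can be sketched with the standard library's html.parser. This is only a rough illustration, not a robust solution: it assumes the target div is identified by its id attribute, it collects the text as it goes instead of recording start and end positions, and it relies on the div tags being properly balanced. The names DivTextExtractor, div_text and target_id are made up for this example.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id   # hypothetical: match the div by its id
        self.depth = 0               # 0 means we are outside the target div
        self.done = False            # True once the target div has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1          # nested div: +1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1           # found the parent div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1          # end div: -1
            if self.depth == 0:
                self.done = True     # back to 0: the parent div has closed

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return all text found inside the div with the given id."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = '<div id="body">AAAA<div>BBBB</div>CCCC</div>DDDD<div>EEEE</div>'
print(div_text(sample, "body"))
# AAAABBBBCCCC
```

On badly nested real-world HTML the counter can drift, which is the main reason to reach for a full parser instead.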

+3  A: 

If you're not really excited at the idea of parsing the HTML code yourself, there are two good options:

Beautiful Soup

lxml

You'll probably find that lxml runs faster than BeautifulSoup, but in my uses, Beautiful Soup was very easy to learn and use, and handled typical crappy HTML as found in the wild well enough that I don't have need for anything else.

YMMV.

bgporter
man do I love Beautifulsoup
JudoWill
+3  A: 

Using lxml:

import lxml.html as lh
content='''\
<body>
<div>AAAA
  <div>BBBB
     <div>CCCC
     </div>DDDD
  </div>EEEE
</div>FFFF
</body>
'''
doc=lh.document_fromstring(content)
div=doc.xpath('./body/div')[0]
print(div.text_content())
# AAAA
#   BBBB
#      CCCC
#      DDDD
#   EEEE

div=doc.xpath('./body/div/div')[0]
print(div.text_content())
# BBBB
#      CCCC
#      DDDD
unutbu
A: 

Personally I prefer lxml in general, but there are times when its HTML handling is a bit off. Here's a BeautifulSoup recipe, if it helps.

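# BeautifulSoup 4 lives in the bs4 package; the old "BeautifulSoup" module is version 3
from bs4 import BeautifulSoup, NavigableString

def printText(tags):
    """Recursively join every text node found under the given tag."""
    s = []
    for tag in tags:
        if isinstance(tag, NavigableString):
            s.append(tag)
        else:
            s.append(printText(tag))
    return "".join(s)

html = "<html><p>Para 1<div class='stuff'>Div Lead<p>Para 2<blockquote>Quote 1</div><blockquote>Quote 2"
soup = BeautifulSoup(html, "html.parser")

v = soup.find('div', attrs={'class': 'stuff'})

# BeautifulSoup tags have no text_content (that is lxml's API), so use the helper above
print(printText(v))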
koblas