ansaurus

Question

Python parsing: lxml to get just part of a tag's text

Answer 1

A:

Have a look at BeautifulSoup. I've just started using it, so I'm no expert. Off the top of my head:

import BeautifulSoup

text = '''<p><span class="Title">Name</span>Dave Davies</p>
          <p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

soup = BeautifulSoup.BeautifulSoup(text)

paras = soup.findAll('p')

for para in paras:
    spantext = para.span.text
    othertext = para.span.nextSibling
    print spantext, othertext

[Out]: Name Dave Davies
       Address 123 Greyfriars Road, London

Alex Bliskovsky 2010-07-21 18:12:29

Thanks for this. I also like BeautifulSoup, but I believe it's no longer being maintained, so I'm switching to lxml/pyquery.

AP257 2010-07-21 18:45:57

Answer 2

A:

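Each Element can have a text attribute (the text between its opening tag and its first child) and a tail attribute (the text that follows its closing tag, up to the next element). In the link, search for the word "tail":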

import lxml.etree

content='''\
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''


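# HTMLParser accepts the bare fragment and wraps it in <html><body>.
root = lxml.etree.fromstring(content, parser=lxml.etree.HTMLParser())
# './/span' matches every <span> at any depth below the root.
for elt in root.findall('.//span'):
    # elt.text is the text inside the span; elt.tail is the text after </span>.
    print(elt.text, elt.tail)

# Output on Python 3:
# Name Dave Davies
# Address 123 Greyfriars Road, London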
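
The text/tail split can also be tried without lxml. The following is a minimal sketch using the standard library's xml.etree.ElementTree, which exposes the same two attributes. It is a swapped-in parser, not the lxml call above: ElementTree is a strict XML parser, so the fragment is wrapped in a <div> root here, and the helper name labelled_values is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same fragment as above, wrapped in a <div> so it is well-formed XML.
# The wrapper is an assumption of this sketch: ElementTree needs a single root.
content = '''\
<div>
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>
</div>'''


def labelled_values(xml_text):
    """Map each span's own text to the text that follows its closing tag."""
    root = ET.fromstring(xml_text)
    result = {}
    for span in root.iter('span'):
        # .text and .tail are None when there is no text, so fall back to ''.
        label = (span.text or '').strip()
        value = (span.tail or '').strip()
        result[label] = value
    return result


values = labelled_values(content)
print(values['Name'])
print(values['Address'])
```

Keying the result on the span text makes it easy to look up one field by its label, which is what the question asks for.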

unutbu 2010-07-21 18:16:45

Perfect - thank you!

AP257 2010-07-21 18:45:32

Answer 3

A:

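Another method, using XPath: select the span with the wanted label, step up to its parent p, and take that p's direct text nodes (file is the path or file object holding the HTML):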

>>> from lxml import html
>>> doc = html.parse( file )
>>> doc.xpath( '//span[@class="Title"][text()="Name"]/../self::p/text()' )
['Dave Davies']
>>> doc.xpath( '//span[@class="Title"][text()="Address"]/../self::p/text()' )
['123 Greyfriars Road, London']

Steven D. Majewski 2010-07-21 18:37:12
