I'm using BeautifulSoup to scrape a web page and I'm a little stuck on a small problem. Here's a snippet of HTML from the page.

<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>

Once I've got the soup, how can I find this tag and get the artist name, i.e. M.I.A.? I cannot match the tag by its style attribute, as that style is used in a dozen places on the page. I also don't know the exact location of the span tag: it changes position from page to page, so I can't match by position. The artist name changes, but the title span structure is always the same.

I would only like to extract the artist name (the M.I.A. bit).

Thanks

A: 

BeautifulSoup is kind of dead, since SGMLParser is deprecated. I suggest you use the better lxml library; it even has XPath support!

from lxml import html

text = '''
<span style="font-family: arial;">
    <span style="font-weight: bold;">Artist:</span>M.I.A.<br>
</span>
'''

doc = html.fromstring(text)
# Join the parent's text nodes and strip the whitespace left over from the markup
print(''.join(doc.xpath("//span/span[text()='Artist:']/../text()")).strip())

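This XPath expression means "find the span tag that sits inside another span tag and whose text is exactly 'Artist:', then grab the text nodes that are direct children of its parent span". Because the label lives in the inner span, the only meaningful text left in the parent is the artist name, so it prints M.I.A., as one would expect.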
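Since the span moves around from page to page, the same idea can be wrapped in a small helper. The sketch below is a variation on the expression above, not the same query: it takes only the first text node that follows the label span (via `following-sibling::text()[1]`) and uses `normalize-space()` to tolerate stray whitespace around the label. The function name `extract_artist` and the sample markup are made up for illustration.

```python
from lxml import html


def extract_artist(page_source):
    """Return the text right after the 'Artist:' label span, or None if absent.

    Hypothetical helper: the name and the None-on-miss behaviour are
    assumptions for this sketch, not part of lxml.
    """
    doc = html.fromstring(page_source)
    # Locate the label span by its text, then take the first text node
    # that follows it inside the same parent.
    matches = doc.xpath(
        "//span[normalize-space(text())='Artist:']"
        "/following-sibling::text()[1]"
    )
    if not matches:
        return None
    return matches[0].strip()


# Sample markup: a decoy span with the same style, plus the target span.
sample = '''
<div>
  <span style="font-family: arial;">Some unrelated text</span>
  <span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
  </span>
</div>
'''

print(extract_artist(sample))
```

Matching on the label text rather than on the style attribute or on position is what keeps this working when the span appears elsewhere on the page.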

nosklo