Hi there.

I am using lxml.html to parse some HTML and extract links. However, when it hits a link that contains an image, the link text comes back blank. What I'd really like is to detect that the link wraps an image and then return the image's alt text instead.

So it looks like this...

from lxml.html import parse, fromstring

doc = fromstring('<a href="Link One">Anchor Link One</a><br /><a href="Link Two"><img src="Image Link Two" alt="Alt Image" /></a><br /><a href="Link Three">Anchor Link Three</a><br />')
for link in doc.cssselect('a'):
    print '%s: %s' % (link.text_content(), link.get('href'))

Result:

Anchor Link One: Link One
: Link Two
Anchor Link Three: Link Three

So I tried using .html_content() to get the raw HTML and then check whether it was an image.

How can I detect whether a link wraps an image, and/or pull out the HTML inside it?

+1  A: 

Just modify your css selector:

for img in doc.cssselect('a img'):
    print img.get('alt')

You can also use an XPATH expression:

for img in doc.xpath('a//img'):
    print img.get('alt')
mikerobi
Does that also pickup if there is no img?
Wizzard
No. Based on your question, it seemed all you wanted was the alt text: no image, no alt text.
mikerobi
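To make the point in the comments above concrete, here is a minimal sketch showing that a selector targeting images inside links returns only the links that actually contain an image. It assumes Python 3 (so `print` is a function, unlike the snippets in this thread) and uses `.//a//img` rather than `a//img`, so it does not depend on the links being direct children of the parsed root.

```python
from lxml.html import fromstring

# The question's sample markup, with a well-formed second <a> tag.
html = (
    '<a href="Link One">Anchor Link One</a><br />'
    '<a href="Link Two"><img src="Image Link Two" alt="Alt Image" /></a><br />'
    '<a href="Link Three">Anchor Link Three</a><br />'
)
doc = fromstring(html)

# './/a//img' matches only <img> elements nested inside a link,
# so the two text-only links produce no result at all.
alts = [img.get('alt') for img in doc.xpath('.//a//img')]
print(alts)  # ['Alt Image']
```

Three links go in and one alt text comes out, so this approach fits only if the text-only links can be ignored.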
+1  A: 
for link in doc.xpath('a'):
    img = link.find('img')
    if img is not None:
        print '%s: %s' % (img.get('alt'), link.get('href'))
    else:
        print '%s: %s' % (link.text_content(), link.get('href'))
dusan
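A possible variation on the answer above, sketched for Python 3: prefer the link's own text and fall back to the image's alt text only when the text is empty. The helper name `link_label` is hypothetical, not part of lxml. It uses `.//img` instead of `find('img')` on the assumption that the image might be wrapped in another element (for example a `<span>`) rather than being a direct child of the link.

```python
from lxml.html import fromstring


def link_label(link):
    """Return the link's visible text, or the alt text of its first image."""
    text = link.text_content().strip()
    if text:
        return text
    # No visible text: look for an image anywhere inside the link.
    imgs = link.xpath('.//img')
    if imgs:
        # get() with a default avoids returning None when alt is missing.
        return imgs[0].get('alt', '')
    return ''


# The question's sample markup, with a well-formed second <a> tag.
doc = fromstring(
    '<a href="Link One">Anchor Link One</a><br />'
    '<a href="Link Two"><img src="Image Link Two" alt="Alt Image" /></a><br />'
    '<a href="Link Three">Anchor Link Three</a><br />'
)

labels = [(link_label(a), a.get('href')) for a in doc.xpath('.//a')]
for label, href in labels:
    print('%s: %s' % (label, href))
# Anchor Link One: Link One
# Alt Image: Link Two
# Anchor Link Three: Link Three
```

Checking the text first means a link that contains both text and an image keeps its text. Swap the order of the two checks if the alt text should win instead.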