I'm working on a small web spider in Python using the lxml module. I have a segment of code that runs an XPath query on the document and places all the links from 'a href' tags into a list. What I'd like to do is check each link as it is added to the list and unescape it if needed. I understand how to use the urllib.unquote() function, but the problem is that it throws an exception, which I believe happens because not every link passed to it needs unescaping. Can anyone point me in the right direction? Here's the code I have so far:

import urllib
import urllib2
from lxml.html import parse, tostring

class Crawler():

    def __init__(self, url):
        self.url = url
        self.links = []
    def crawl(self):

        doc = parse("http://" + self.url).getroot()
        doc.make_links_absolute(self.url, resolve_base_href=True)
        for tag in doc.xpath("//a"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.links.append(fixed)
        print(self.links)
A: 
url.find('%') > -1

or wrap urllib.unquote in a try..except clause.

intuited
The lack of a % does not actually cause unquote() to raise an exception.
kindall
I think `'%' in url` would be slightly more Pythonic.
MatrixFrog
@MatrixFrog: Yes, it would. Good point.
intuited
+1  A: 

You could do something like this, although I don't have a URL that causes an exception, so this is just a hypothesis at this point.

from urllib import unquote

#get url from your parse tree.
url_unq = unquote(url or '')
if not url_unq:
    url_unq = url

See if this works. It would be great if you could give an actual example of a URL that causes the exception. What exception is it? Could you post the stack trace?

In the worst case, you could always wrap that block in a try-except and go about your business.
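
As a rough sketch of that try-except fallback: the helper name safe_unquote is made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 (urllib.unquote, as in the question) and Python 3 (urllib.parse.unquote). The exception raised for a None argument appears to differ between versions, so both candidates are caught here.

```python
try:
    from urllib import unquote          # Python 2, as used in the question
except ImportError:
    from urllib.parse import unquote    # Python 3 location of the same function


def safe_unquote(href):
    """Return the unquoted href, or None if it cannot be unquoted.

    safe_unquote is a hypothetical helper, not part of urllib.
    """
    try:
        return unquote(href)
    except (TypeError, AttributeError):
        # A non-string argument such as None ends up here; which of the two
        # exceptions is raised depends on the Python version.
        return None
```

A string without any '%' passes through unchanged, so there is no need to test for one first.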

MovieYoda
A: 

unquote doesn't throw exceptions because of URLs that don't need escaping. You haven't shown us the exception, but I'll guess that the problem is that old isn't a string. It's probably None, because you have an <a> tag with no href attribute.

Check the value of old before you try to use it.
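
A minimal sketch of that check, assuming the values come from tag.get('href') as in the question, which returns None when the attribute is missing. The function name collect_links is hypothetical, and the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urllib import unquote          # Python 2, as used in the question
except ImportError:
    from urllib.parse import unquote    # Python 3 location of the same function


def collect_links(hrefs):
    """Unquote every href, skipping entries that are None.

    collect_links is a hypothetical helper; hrefs stands in for the values
    that tag.get('href') yields inside the crawl loop.
    """
    links = []
    for href in hrefs:
        if href is None:
            # An <a> tag without an href attribute: nothing to unquote.
            continue
        links.append(unquote(href))
    return links
```

In the crawl loop this amounts to an `if old is None: continue` before the call to unquote.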

Ned Batchelder
This was it. I took a second look at the stack trace, and it was referencing a 'None' object. I made the changes to the XPath query as noted in the comments above, and it's working great now. Thanks.
Stev0