I'm working on a small web spider in Python using the lxml module. I have a segment of code that runs an XPath query over the document and places all the links from 'a href' tags into a list. What I'd like to do is check each link as it is added to the list and, if necessary, unescape it. I understand I should use the urllib.unquote() function, but the problem I'm experiencing is that the call throws an exception, which I believe is because not every link passed to it needs unescaping. Can anyone point me in the right direction? Here's the code I have so far:
import urllib
import urllib2
from lxml.html import parse, tostring

class Crawler():
    def __init__(self, url):
        self.url = url
        self.links = []

    def crawl(self):
        doc = parse("http://" + self.url).getroot()
        doc.make_links_absolute(self.url, resolve_base_href=True)
        for tag in doc.xpath("//a"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.links.append(fixed)
        print(self.links)
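To make the problem easier to reproduce outside the spider, here is a minimal standalone sketch of just the unquoting step. The safe_unquote helper and the two-way import are my own additions (not part of the crawler above), and my guess about the failure mode is that an <a> tag with no href attribute makes tag.get('href') return None, which unquote() cannot handle:

    # Runs on both Python 2 and 3; the crawler above uses the Python 2 name.
    try:
        from urllib import unquote          # Python 2
    except ImportError:
        from urllib.parse import unquote    # Python 3

    # unquote() is harmless on a URL with no percent-escapes:
    print(unquote("http://example.com/page"))   # unchanged
    # and it decodes escapes when they are present:
    print(unquote("http://example.com/a%20b"))  # "http://example.com/a b"

    # But it raises if given None, e.g. from an <a> tag with no href,
    # so a guard like this (hypothetical helper name) avoids the exception:
    def safe_unquote(href):
        return unquote(href) if href else href

    print(safe_unquote(None))  # None, no exception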