views:

62

answers:

3

I wrote a simple Python script to download a web page for offline viewing. The problem is that the relative links are broken. So the offline file "c:\temp\webpage.html" has a href="index.aspx" but when opened in a browser it resolves to "file:///C:/temp/index.aspx" instead of "http://myorginalwebsite.com/index.aspx".

So I imagine that I would have to modify my script to fix each of the relative links so that it points to the original website. Is there an easier way? If not, anyone have some sample Python code that can do this? I'm a Python newbie so any pointers will be appreciated.

Thanks.

A: 

So you want to check all links that start with http:// but any that don't you want to append http://myoriginalwebsite.com to the front of the string, then test for connection?

Sounds easy enough. Or is it the python code proper you're having issues with?

drachenstern
Caveat: if the link starts with `?`, you need to prepend the path and file also; and if it doesn't start with `/`, prepend path.
Piskvor
@piskvor ~ good point. I was really looking for problem clarification. I took it he wanted to scrape static files looking for links, not modify existing files. C'est la vie I suppose. Too much to ask some questioners about their problem domain some days...
drachenstern
+4  A: 

If you just want your relative links to refer to the website, just add a base tag in the head:

<base href="http://myoriginalwebsite.com/" />
jhurshman
Okay, now, how do you add this to the page in the context of the question?
Aaron Gallagher
The base tag just needs to be added to the head tag of the local file. Actually, for pretty much all browsers, it doesn't have to be in the head tag, it can just be the first element in the page.I don't know python, so I don't know specifically how that would be done, but basically, just prepend the base tag to the downloaded HTML.
jhurshman
This does exactly what I need. I have to figure out how to add that to the file. Can't be too hard...
Sajee
I'm sure there's a better way but I simply used Python replace() to change the HTML file so that <HEAD> becomes <HEAD> <base href="http://myoriginalwebsite.com/" />That works nicely.
Sajee
+1  A: 

lxml makes this braindead simple!

>>> import lxml.html, urllib
>>> url = 'http://www.google.com/'
>>> e = lxml.html.parse(urllib.urlopen(url))
>>> e.xpath('//a/@href')[-4:]
['/intl/en/ads/', '/services/', '/intl/en/about.html', '/intl/en/privacy.html']
>>> e.getroot().make_links_absolute()
>>> e.xpath('//a/@href')[-4:]
['http://www.google.com/intl/en/ads/', 'http://www.google.com/services/', 'http://www.google.com/intl/en/about.html', 'http://www.google.com/intl/en/privacy.html']

From there you can write the DOM out to disk as a file.

Aaron Gallagher