views:

66

answers:

2

Does anyone know of a library for fixing "broken" urls. When I try to open a url such as

http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff

urllib2.urlopen chokes and gives me an HTTPError traceback. Does anyone know of a library that can fix these sorts of things?

+2  A: 

What about something like...:

import re
import urlparse

urls = '''
http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff
'''.split()

def main():
  for u in urls:
    pieces = list(urlparse.urlparse(u))
    pieces[2] = re.sub(r'^[./]*', '/', pieces[2])
    pieces[-1] = ''
    print urlparse.urlunparse(pieces)

main()

it does emit, as you desire:

http://www.domain.com/page.html
http://www.domain.com/page.html
http://www.domain.com/page.html

and would appear to roughly match your needs, if I understood them correctly.

Alex Martelli
http://www.domain.com/ohmygod///page.htmlhttp://www.domain.com/nono/...page.html will not work
Right, I'm only fixing breakages at the start of the path, as per the only examples the OP gave. You can fix a few more breakages by `path.split('/')`, ignoring empty pieces and removing stray leading dots. But there are a higher-order infinity of possible broken URLs, unless SOME specs are given it's impossible to KNOW what to fix!-)
Alex Martelli
A: 

See also this question.

Zak Johnson