I am using Python to extract the filename from a link using rfind like below:

url = "http://www.google.com/test.php"

print url[url.rfind("/") +1 : ]

This works OK for links without a / at the end and returns "test.php". I have encountered links with a / at the end, like "http://www.google.com/test.php/". I am having trouble getting the page name when there is a "/" at the end. Can anyone help?

Cheers

A: 

You can remove the slash at the end of your string before processing it:

if url[-1] == '/':
    url = url[:-1]
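
A small variation, offered as a sketch rather than the exact code above: url[-1] raises IndexError on an empty string, while str.endswith does not. The helper name strip_one_trailing_slash is made up for illustration, and print() is used as a function, which assumes Python 3. Like the snippet above, it removes only a single slash.

```python
def strip_one_trailing_slash(url):
    """Drop one trailing "/" if present (hypothetical helper name)."""
    # endswith() is safe on an empty string, where url[-1] would raise IndexError.
    if url.endswith("/"):
        return url[:-1]
    return url


url = strip_one_trailing_slash("http://www.google.com/test.php/")
print(url[url.rfind("/") + 1:])  # test.php
```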
unbeknown
A: 

You could use

url = url.rstrip("/")  # strip first, so the slice below does not keep the trailing "/"
print url[url.rfind("/") +1 : ]
Tim Pietzcker
+1  A: 

Filenames with a slash at the end are technically still path definitions and indicate that the index file is to be read. If you actually have one that ends in test.php/, I would consider that an error. In any case, you can strip the / from the end before running your code as follows:

url = url.rstrip('/')
Steve Moyer
Deestan
Actually it will ... they both resolve to the same path and are redirected to http://www.reddit.com/r/gaming/. As was pointed out elsewhere, query strings are a completely different problem (which the OP didn't ask about)
Steve Moyer
A: 

There is a library called urlparse that will parse the URL for you, but it still doesn't remove the / at the end, so one of the above will be the best option.
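
To illustrate the point, here is a minimal sketch. It assumes Python 3, where the urlparse module lives at urllib.parse; on Python 2 the import would be "from urlparse import urlparse". The parsed path keeps its trailing slash, so it still has to be stripped by hand.

```python
from urllib.parse import urlparse  # assumption: Python 3 location of urlparse

parsed = urlparse("http://www.google.com/test.php/")
print(parsed.netloc)  # www.google.com
print(parsed.path)    # /test.php/

# The trailing slash survives parsing, so strip it before taking the last segment.
page = parsed.path.rstrip("/").rsplit("/", 1)[-1]
print(page)           # test.php
```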

Andrew Cox
+8  A: 

Just removing the slash at the end won't work, because you can have a URL that looks like this:

http://www.google.com/test.php?filepath=tests/hey.xml

...in which case you'll get back "hey.xml". Instead of manually checking for this, you can use urlparse to get rid of the parameters, then do the check other people suggested:

from urlparse import urlparse
url = "http://www.google.com/test.php?something=heyharr/sir/a.txt"
f = urlparse(url)[2].rstrip("/")
print f[f.rfind("/")+1:]
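
A sketch of the same comparison in Python 3 (an assumption; the code above targets Python 2), where the function is imported from urllib.parse. It shows the naive slice picking up part of the query string and the parsed path avoiding that. Index 2 of the result and its .path attribute are the same field.

```python
from urllib.parse import urlparse  # assumption: Python 3 home of urlparse

url = "http://www.google.com/test.php?something=heyharr/sir/a.txt"

# Naive approach: the last "/" is inside the query string.
naive = url[url.rfind("/") + 1:]
print(naive)  # a.txt

# Parsed approach: only the path component is searched.
parts = urlparse(url)
f = parts.path.rstrip("/")
name = f[f.rfind("/") + 1:]
print(name)  # test.php
```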
Claudiu
A: 

Just for fun, you can use a Regexp:

import re
print re.search('/([^/]+)/?$', url).group(1)
gimel
Python isn't Perl, you don't always need to be reaching for regexps! For simple processing the builtin string methods are likely to be more readable and faster. (In this case on my machine regexps were 60% slower, 160% if not pre-compiled. Not that it probably matters on such simple code, but still)
bobince
I know. I also support the urlparse suggestion. Since no one brought up regexps, I thought I'd mention the possibility.
gimel
+4  A: 

Use [r]strip to remove trailing slashes:

url.rstrip('/').rsplit('/', 1)[-1]

If a wider range of URLs is possible, including URLs with ?queries, #anchors or without a path, do it properly with urlparse:

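import urlparse  # Python 2 module; in Python 3 it is urllib.parse
def page_name(url):  # function name is illustrative
    path = urlparse.urlparse(url).path
    return path.rstrip('/').rsplit('/', 1)[-1] or '(root path)'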
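
As a rough sketch of how this approach behaves on the cases mentioned (queries, anchors, no path), assuming Python 3 and its urllib.parse module. The function name last_path_segment, the default parameter and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlparse  # assumption: Python 3


def last_path_segment(url, default="(root path)"):
    """Return the final path segment of url, ignoring query and fragment.

    Hypothetical helper; falls back to `default` when the path is empty or "/".
    """
    path = urlparse(url).path
    return path.rstrip("/").rsplit("/", 1)[-1] or default


for u in (
    "http://example.com/test.php/",
    "http://example.com/dir/page.html?x=a/b#sec/1",
    "http://example.com",
):
    print(u, "->", last_path_segment(u))
```

Slashes inside the query or fragment are never seen by rsplit, because only the path component is passed to it.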
bobince
+1 for urlparse plus rstrip solution.
S.Lott
A: 
filter(None, url.split('/'))[-1]

(But urlparse is probably more readable, even if more verbose.)
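
Note that on Python 3 filter() returns an iterator, so indexing it with [-1] raises TypeError. A possible equivalent, assuming Python 3, uses a list comprehension instead. Like the one-liner above, it does not account for query strings.

```python
url = "http://www.google.com/test.php/"

# Keep only the non-empty pieces, then take the last one.
segments = [part for part in url.split("/") if part]
print(segments[-1])  # test.php
```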

fivebells