views:

167

answers:

4

This seems like a common task, alter some elements of an array, but my solution didn't feel very pythonic. Is there a better way to build urls with list comprehension?

links = re.findall(r"(?:https?://|www\.|https?://www\.)[\S]+", text)
if len(links) == 0:
    return text
urls = []
for link in links:
    if link[0:4] == "www.":
        link = "http://" + link
    urls.append(link)

Maybe something like

links = re.findall(r"(?:https?://|www\.|https?://www\.)[\S]+", text)
if len(links) == 0:
    return text
urls = map(lambda x : something(x), links)
+1  A: 
["http://"+link if link[0:4]=='www.' else link for link in links]

or

[link[0:4]=='www.' and "http://"+link or link for link in links]


Notes: ("http://"+link if link[0:4]=='www.' else link) - this is ternary operator like ?: in C

(link[0:4]=='www.' and "http://"+link or link) - this has the same meaning.


On another subject: I would test for http://, not for www. Domains don't have to start with www. For instance, http://stackoverflow.com.

yu_sha
In Python2.5+ you can use a more pythonic ternary operator `x = a if True else b`
Lakshman Prasad
My regex will not grab stackoverflow.com but will grab http://stackoverflow.com perfectly fine (and won't do the prefix).
Paul Tarjan
==In Python2.5+ you can use a more pythonic ternary operator x = a if True else b==See my first line. That's what it does
yu_sha
+4  A: 

If you want to go with list comprehensions, use:

urls = ['http://' + link if link.startswith('www.') else link for link in links]

But I actually think that the more verbose way of looping through the links that you used is easier to read. "Shorter" does not always equal "better" or "more readable".

Pär Wieslander
+1: people are confused about what is pythonic - it means "easy to write, read and understand in python" or "idiomatic python"
nosklo
+1  A: 

You'll probably be better off using built-in Python functionality for dealing with urls. Assuming you stay with your current regex, I think you could rewrite this as:

from urlparse import urlsplit, urlunsplit

links = re.findall("(?:https?://|www\.|https?://www\.)[\S]+", text)
urls = [urlunsplit(urlsplit(link, 'http')) for link links]

This should come out to the same thing as what you're currently doing. Also keep in mind that finding URLs using a regex is somewhat risky, ie this will return www.google.com! with the exclamation mark.

gab
This won't quite work, unfortunately. `urlunsplit(urlsplit('www.google.com'))` will return `'http:///www.google.com'` with three slashes after `http:`.
Pär Wieslander
Right you are; I had used urlparse to automatically add in the protocol string, but considering it doesn't work to do this then just doing string matching is the way to go. Thanks for pointing out my error!
gab
A: 

Alternatively:

def addHttp(url):
    if url[0:4] == "www.":
        url = "http://" + url
    return url

urls = map(addHttp, links)

this is longer than using list comprehensions and the ternary operator, but IMHO it is more readable since the function name describes what it is doing, so the code is self-documenting. It is also easier to refactor e.g. if you decide to follow yu_sha's advice and not test explicitly for "www".

Dave Kirby
if url[0:4] == "www." is considered harmful, you should check if url.startswith("www.") instead
Kimvais