Hello, I would like to remove URLs from a string and replace them with the titles of the pages they point to.

For example:

mystring = "Ah I like this site: http://www.stackoverflow.com. Also I must say I like http://www.digg.com"

sanitize(mystring) # it becomes "Ah I like this site: Stack Overflow. Also I must say I like Digg - The Latest News Headlines, Videos and Images"

To replace a URL with its title, I have written this snippet:

# Python 2, BeautifulSoup 3
import urllib
import BeautifulSoup

# get_title: string -> string
def get_title(url):
    """Returns the title of the page at the input URL."""
    output = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    return output.title.string

I now need to apply this function to a string so that it finds each URL and replaces it with the title returned by get_title.

+1  A: 

You can probably solve this using regular expressions and substitution (re.sub accepts a function, which is passed the Match object for each occurrence and returns the string to replace it with):

url = re.compile("http:\/\/(.*?)/")
text = url.sub(get_title, text)

The difficult part is writing a regexp that matches a URL, no more and no less.

wump
1. `get_title()` should accept a MatchObject (not just a string). 2. Django uses something like r'https?://[^ \t\n\r]+' to linkify text
J.F. Sebastian
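
The two points in the comment above can be combined into a small sketch: a wrapper callback receives the Match object, pulls out the URL text, and passes it to a title function. This is only an illustration written for Python 3. The names `replace_urls`, `FAKE_TITLES`, and `fake_get_title` are hypothetical stand-ins so the example runs without network access, and stripping trailing punctuation is an added assumption, since the quoted pattern stops only at whitespace and would otherwise swallow the period after a URL that ends a sentence.

```python
import re

# Pattern quoted in the comment above; it stops at whitespace only.
URL_RE = re.compile(r"https?://[^ \t\n\r]+")


def replace_urls(text, get_title):
    """Replace each URL found in text with get_title(url)."""
    def _sub(match):
        raw = match.group(0)
        # Assumption: trailing sentence punctuation is not part of the URL.
        url = raw.rstrip(".,;:!?")
        trailing = raw[len(url):]
        return get_title(url) + trailing

    return URL_RE.sub(_sub, text)


# Hypothetical stand-in for the real get_title, so the sketch runs offline.
FAKE_TITLES = {
    "http://www.stackoverflow.com": "Stack Overflow",
    "http://www.digg.com": "Digg",
}


def fake_get_title(url):
    # Fall back to the URL itself when no title is known.
    return FAKE_TITLES.get(url, url)
```

Calling `replace_urls(mystring, get_title)` with the question's own `get_title` would fetch each page instead of using the lookup table.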
+2  A: 

Here is a question with information for validating a url in Python: http://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python

The urlparse module is probably your best bet. You will still have to decide what constitutes a valid URL in the context of your application.

To check the string for URLs, iterate over each word in the string, test whether it is a valid URL, and replace each valid URL with its title.

Example code (you will need to write valid_url):

def sanitize(mystring):
  for word in mystring.split(" "):
    if valid_url(word):
      mystring = mystring.replace(word, get_title(word))
  return mystring
tdedecko
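
The `valid_url` helper left unwritten in the answer above could be sketched as follows. This is one possible definition, not the only one: it assumes "valid" means an http or https scheme plus a non-empty host, and it uses `urllib.parse`, which is where the urlparse module lives in Python 3.

```python
from urllib.parse import urlparse  # Python 3 location of the urlparse module


def valid_url(word):
    """Assumed definition: an http(s) URL with a non-empty host part."""
    try:
        parts = urlparse(word)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs,
        # such as an unbalanced IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Note that a word such as "http://www.stackoverflow.com." (with the sentence's trailing period attached) still passes this check, so the period would be sent to get_title as part of the URL unless it is stripped first.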