tags:

views:

142

answers:

5
Dear Sam Adams

Thanks for your submission to ourdirectory.com
URL: http://myurlok.us
Please click below link to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076

Once we receive your comfirmation, your site will be included for process!
regards,

http://www.ourdirectory.com

Thank you!

Should be obvious which URL I need to extract.

A: 

Not easy. One suggestion (taken from the RegexBuddy library):

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])

will match URLs (without mailto:, if you want that, say so), even if they are enclosed in parentheses. Will also match URLs without http:// or ftp:// etc. if they start with www. or ftp..

A simpler version:

\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]

It all depends on what your needs are/what your input looks like.

Tim Pietzcker
I don't think you need to get so fancy. I think he wants to parse very specific emails from a very specific source, so I imagine he could parse for the exact string: "http://www.ourdirectory.com/confirm.aspx?id=" followed by digits and end-of-line.
Brian Schroth
Probably yes. Although there is another URL (myurlok.us) in there. Who knows what else might turn up - he wasn't very specific.
Tim Pietzcker
A: 

regex:

"http://www.ourdirectory.com/confirm.aspx\?id=[0-9]+$"

or without regex, parse the email line by line and test if the string contains "http://www.ourdirectory.com/confirm.aspx?id=" and if it does, that's your URL.

Of course, if your input is actually the HTML source instead of the text you posted, this all goes out the window.

Brian Schroth
+1  A: 

If it's HTML email with hyperlinks you could use the HTMLParse library as a shortcut.

import HTMLParser
class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

someHtmlContainingLinks = ""
linkParser = parseLinks()
linkParser.feed(someHtmlContainingLinks)
Matt Garrison
his sample document doesn't look like HTML to me :-)
Suppressingfire
Everyone else provided non-HTML solutions, and the OP's question history indicates he's pulling it from gmail, which does support HTML. Given the vagueness of the question, I think this is a valid response.
Matt Garrison
+1  A: 

@OP, if your email is always standard,

f=open("emailfile")
for line in f:
    if "confirm your submission" in line:
        print f.next().strip()        
f.close()
A: 

This solution works only if the source is not HTML.

def extractURL(self,fileName):

    wordsInLine = []
    tempWord = []
    urlList = []

    #open up the file containing the email
    file=open(fileName)
    for line in file:
        #create a list that contains is each word in each line
        wordsInLine = line.split(' ')
        #For each word try to split it with :
        for word in wordsLine:
            tempWord = word.split(":")
            #Check to see if the word is a URL
            if len(tempWord) == 2:
                if tempWord[0] == "http":
                    urlList.append(word)
    file.close()

    return urlList
apocolyp4