ansaurus

Question

Answer 1

A:

Not easy. One suggestion (taken from the RegexBuddy library):

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])

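This matches URLs even when they are enclosed in parentheses, and it also matches URLs without http:// or ftp:// if they start with www. or ftp. (it does not match mailto: links; if you want that, say so). The character classes list only uppercase letters, so use it with case-insensitive matching.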
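A minimal sketch of how the pattern above might be used from Python. The character classes only list A-Z, so `re.IGNORECASE` is passed here. The names `URL_PATTERN` and `find_urls` and the sample text are made up for illustration.

```python
import re

# Pattern copied from the answer above. It only lists uppercase letters,
# so it is compiled with re.IGNORECASE.
URL_PATTERN = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that the pattern matches."""
    # The pattern has no capturing groups, so findall() returns whole matches.
    return URL_PATTERN.findall(text)


# Hypothetical sample text, not the OP's email.
sample = "See http://example.com/page?id=1, or (www.example.org) for details."
print(find_urls(sample))
# ['http://example.com/page?id=1', 'www.example.org']
```

The trailing comma and the surrounding parentheses are left out because the last character class excludes punctuation such as `,` and `.`, while parentheses are only accepted as a balanced pair inside the URL.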

A simpler version:

\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]

It all depends on what your needs are/what your input looks like.
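For the simpler version, a sketch along the same lines, assuming you only care about the first http or https link in the text. `SIMPLE_URL` and `first_url` are illustrative names, and the sample sentence is invented.

```python
import re

# Simpler pattern from the answer above, again with case-insensitive matching.
SIMPLE_URL = re.compile(
    r"\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)


def first_url(text):
    """Return the first http(s) URL found in text, or None if there is none."""
    match = SIMPLE_URL.search(text)
    return match.group(0) if match else None


print(first_url("Confirm here: https://example.com/confirm.aspx?id=42."))
# https://example.com/confirm.aspx?id=42
```

The sentence-ending period is not part of the result because the final character class does not include `.`.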

Tim Pietzcker 2009-11-24 19:33:33

I don't think you need to get so fancy. I think he wants to parse very specific emails from a very specific source, so I imagine he could parse for the exact string: "http://www.ourdirectory.com/confirm.aspx?id=" followed by digits and end-of-line.

Brian Schroth 2009-11-24 19:35:24

Probably yes. Although there is another URL (myurlok.us) in there. Who knows what else might turn up - he wasn't very specific.

Tim Pietzcker 2009-11-24 19:37:21

Answer 2

A:

regex:

"http://www\.ourdirectory\.com/confirm\.aspx\?id=[0-9]+$"

Or, without a regex, go through the email line by line and test whether each line contains "http://www.ourdirectory.com/confirm.aspx?id="; if it does, that line is your URL.
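Both variants could be sketched roughly like this, assuming the email body is already available as a string. The function names and the sample body are hypothetical; only the URL prefix comes from the answer.

```python
import re

# Prefix taken from the answer above.
PREFIX = "http://www.ourdirectory.com/confirm.aspx?id="

# re.escape() escapes the dots and the question mark in the prefix.
# re.MULTILINE lets $ match at the end of every line, not only the last one.
CONFIRM_RE = re.compile(re.escape(PREFIX) + r"[0-9]+$", re.MULTILINE)


def confirm_url_regex(body):
    """Regex variant: return the confirmation URL, or None if absent."""
    match = CONFIRM_RE.search(body)
    return match.group(0) if match else None


def confirm_url_plain(body):
    """Non-regex variant: scan line by line for the known prefix."""
    for line in body.splitlines():
        start = line.find(PREFIX)
        if start != -1:
            return line[start:].strip()
    return None


# Made-up email body for illustration.
body = (
    "Hello,\n"
    "please confirm your submission:\n"
    "http://www.ourdirectory.com/confirm.aspx?id=12345\n"
    "Thanks\n"
)
print(confirm_url_regex(body))
print(confirm_url_plain(body))
```

If the message uses \r\n line endings, `$` will not match before the \r, so the body may need its line endings normalized before the regex variant is applied.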

Of course, if your input is actually the HTML source instead of the text you posted, this all goes out the window.

Brian Schroth 2009-11-24 19:39:44

Answer 3

+1 A:

If it's an HTML email with hyperlinks, you could use the HTMLParser module (html.parser in Python 3) as a shortcut.

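try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class parseLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the hyperlinks we want
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)
                    print(self.get_starttag_text())

someHtmlContainingLinks = ""  # put the HTML body of the email here
linkParser = parseLinks()
linkParser.feed(someHtmlContainingLinks)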
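If you would rather get the links back as a list than print them, a variant might look like the sketch below. It assumes Python 3 (`html.parser`); `LinkCollector`, `extract_links` and the sample HTML are names and data invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical HTML body, not the OP's email.
sample_html = (
    '<p>Please <a href="http://www.ourdirectory.com/confirm.aspx?id=1">'
    "confirm</a>.</p>"
)
print(extract_links(sample_html))
```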

Matt Garrison 2009-11-24 19:42:22

His sample document doesn't look like HTML to me :-)

Suppressingfire 2009-11-25 00:54:27

Everyone else provided non-HTML solutions, and the OP's question history indicates he's pulling it from Gmail, which does support HTML. Given the vagueness of the question, I think this is a valid response.

Matt Garrison 2009-11-30 14:45:31

Answer 4

+1 A:

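@OP, if your email always has the same layout, print the line that follows the marker text:

with open("emailfile") as f:
    for line in f:
        if "confirm your submission" in line:
            # the URL is on the line right after the marker
            print(next(f).strip())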
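The same idea as a small function that returns the line instead of printing it, and that does not fail when the marker happens to be on the last line. This is only a sketch: `url_after_marker` is a made-up name, and the sample text stands in for the real email file.

```python
import io


def url_after_marker(lines, marker="confirm your submission"):
    """Return the stripped line that follows the first line containing marker."""
    lines = iter(lines)
    for line in lines:
        if marker in line:
            # next() with a default avoids StopIteration on a trailing marker
            following = next(lines, None)
            return following.strip() if following is not None else None
    return None


# io.StringIO stands in for an opened email file here.
email = io.StringIO(
    "Hello,\n"
    "Please confirm your submission by visiting:\n"
    "http://www.ourdirectory.com/confirm.aspx?id=12345\n"
)
print(url_after_marker(email))
```

Any iterable of lines works, so an open file object can be passed in directly.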

2009-11-25 00:51:10

Answer 5

A:

This solution works only if the source is not HTML.

def extractURL(self, fileName):
    urlList = []

    # Open the file containing the email
    with open(fileName) as emailFile:
        for line in emailFile:
            # Split the line into whitespace-separated words
            wordsInLine = line.split()
            for word in wordsInLine:
                # Split each word on ":" to isolate a possible scheme
                tempWord = word.split(":")
                # Treat the word as a URL if it has the form "http:<rest>"
                if len(tempWord) == 2 and tempWord[0] == "http":
                    urlList.append(word)

    return urlList
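Splitting on ":" misses https links and URLs that contain a port number. One possible variation, assuming Python 3, lets `urllib.parse.urlparse` check the scheme instead. `extract_urls`, the set of stripped punctuation characters and the sample sentence are assumptions made for this sketch.

```python
from urllib.parse import urlparse


def extract_urls(text, schemes=("http", "https")):
    """Return whitespace-separated words in text that look like URLs."""
    urls = []
    for word in text.split():
        # Drop punctuation that commonly surrounds a URL in running text.
        candidate = word.strip("<>()[]\"',.;")
        try:
            parsed = urlparse(candidate)
        except ValueError:
            # urlparse rejects some malformed inputs, e.g. bad IPv6 brackets
            continue
        if parsed.scheme in schemes and parsed.netloc:
            urls.append(candidate)
    return urls


# Invented sample text for illustration.
print(extract_urls("Visit http://example.com:8080/a?b=1, then <https://example.org>."))
# ['http://example.com:8080/a?b=1', 'https://example.org']
```

Like the original, this only works on plain text, and stripping trailing punctuation will also trim a URL that legitimately ends in one of those characters.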

apocolyp4 2009-11-25 09:12:00


Extract URLs out of email in Python
