I am trying to use the following regular expression to extract a domain name from some text, but it produces nothing. What's wrong with it? I don't know whether a "fix my code" question like this is appropriate here; maybe I should read more, but I just want to save some time. Thanks.

pat_url = re.compile(r'''

            (?:https?://)*

            (?:[\w]+[\-\w]+[.])*

            (?P<domain>[\w\-]*[\w.](com|net)([.](cn|jp|us))*[/]*)

            ''')

print re.findall(pat_url,"http://www.google.com/abcde")

I want the output to be google.com

+4  A: 

Don't use a regex for this. Use the urlparse module from the standard library instead. It's far more straightforward and easier to read and maintain.

http://docs.python.org/library/urlparse.html
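
A minimal sketch of that approach, under a couple of assumptions: it targets Python 3, where the module lives at urllib.parse (in Python 2 it is imported as urlparse), and the helper name host_of is made up for illustration.

```python
# Python 3 import; under Python 2 this would be "from urlparse import urlparse".
from urllib.parse import urlparse


def host_of(url):
    """Return the network location of a URL (hypothetical helper name)."""
    return urlparse(url).netloc


print(host_of("http://www.google.com/abcde"))  # www.google.com
```

Note that netloc is the full host, including any "www." prefix, so reducing it to "google.com" would still need an extra step.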

Amber
Thanks Dav, but urlparse's netloc returns "www.google.com". And what if I want to extract URLs from text like <a href="http://www.google.com/adfaskl">?
urlparse.scheme + urlparse.netloc + urlparse.path should give you the expected result.
Ivo Wetzel
+3  A: 

There are three problems. First, you're missing the re.VERBOSE flag in the call to re.compile(). Second, you should use the methods on the returned pattern object rather than passing it to re.findall(). Third, you're using a regular expression where an appropriate parser already exists in the stdlib.
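
A rough sketch of the first two points. The pattern here is a simplified variant, not the asker's exact one, and it is only an assumption about what the intended match looks like. It passes re.VERBOSE so that whitespace and comments inside the pattern are ignored, and it calls search() on the compiled object and reads the named group.

```python
import re

# Assumption: a simplified variant of the pattern in the question.
# re.VERBOSE makes the regex engine ignore the layout whitespace and comments.
pat_url = re.compile(r'''
    (?:https?://)?        # optional scheme
    (?:[\w-]+\.)*?        # optional subdomains, matched lazily
    (?P<domain>           # the part we want to capture
        [\w-]+\.(?:com|net)
        (?:\.(?:cn|jp|us))?
    )
    ''', re.VERBOSE)

# Use the method on the compiled pattern and read the named group.
match = pat_url.search("http://www.google.com/abcde")
print(match.group("domain"))  # google.com
```

Without re.VERBOSE, the newlines and indentation inside the triple-quoted string are treated as literal characters to match, which is why the original pattern found nothing.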

Ignacio Vazquez-Abrams
Oh, re.VERBOSE works. Thanks.
A: 

I don't believe that this is actually about "regression", is it? It's about regular expressions, which is a totally different thing. Perhaps someone should fix the tagging.

But the keys are like... RIGHT NEXT to each other :P Also, the answer box is not the place for comments on the question.
Ignacio Vazquez-Abrams
+1  A: 

This is the only correct way to parse a URL with a regex:

It's in C++, but you'll find it trivial to convert to Python by removing the extra backslashes. There is also an enum for the capture groups.

Also see RFC 3986 (Appendix B), the original source of the regexp.

static const char* const url_regex[] = {
    /* RE_URL */
    "^(([^:/?#]+):)?(//([^/?#]*)|///)?([^?#]*)(\\?[^#]*)?(#.*)?",
};

enum {
    URL = 0,
    SCHEME_CLN = 1,
    SCHEME  = 2,
    DSLASH_AUTH = 3,
    AUTHORITY = 4,
    PATH    = 5,
    QUERY   = 6,
    FRAGMENT = 7
};
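
For reference, one possible Python translation of the C++ pattern above. The constant names simply mirror the enum; using a raw string means the doubled backslash becomes a single one.

```python
import re

# Same pattern as the C++ string; a raw string removes the need for "\\?".
URL_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*)|///)?([^?#]*)(\?[^#]*)?(#.*)?')

# Capture group indices, mirroring the enum above.
SCHEME, AUTHORITY, PATH, QUERY, FRAGMENT = 2, 4, 5, 6, 7

m = URL_RE.match("http://www.google.com/abcde?q=1#top")
print(m.group(SCHEME))     # http
print(m.group(AUTHORITY))  # www.google.com
print(m.group(PATH))       # /abcde
print(m.group(QUERY))      # ?q=1
print(m.group(FRAGMENT))   # #top
```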
piotr