ansaurus

Question

python regular expression for domain names

Answer 1

+4 A:

Don't use a regex for this. Use the urlparse module from the standard library instead. It's far more straightforward and easier to read and maintain.

http://docs.python.org/library/urlparse.html
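
As a minimal sketch of what that looks like: in Python 3 the same functionality lives in urllib.parse (in Python 2 it is the urlparse module linked above). The helper name domain_of is made up for illustration.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def domain_of(url):
    """Return the host part of a URL (lowercased, without port or credentials)."""
    return urlparse(url).hostname


# The parsed result also exposes the individual components.
parts = urlparse("http://www.google.com/adfaskl")
print(parts.scheme, parts.netloc, parts.path)
```

Note that netloc keeps any port or user info that appears in the URL, while hostname strips them.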

Amber 2010-04-13 04:38:24

Thanks Dav, but urlparse's netloc returns "www.google.com"? And I want to extract URLs from text like <a href = "http://www.google.com/adfaskl">.

2010-04-13 04:48:10

urlparse.scheme + "://" + urlparse.netloc + urlparse.path should give you the expected result.
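
For the case of URLs buried in markup like the anchor tag above, one possible approach (a sketch, not the only way) is to let the stdlib HTML parser find the href values and then rebuild each URL from its parsed pieces. The names LinkCollector, extract_links and strip_query are hypothetical, and the imports are the Python 3 locations.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in anchor tags of an HTML snippet."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def strip_query(url):
    """Rebuild a URL from scheme, netloc and path, dropping query and fragment."""
    parts = urlparse(url)
    return parts.scheme + "://" + parts.netloc + parts.path
```

Usage: extract_links('<a href = "http://www.google.com/adfaskl">x</a>') yields the URL string, which can then be passed to strip_query or urlparse.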

Ivo Wetzel 2010-04-13 20:09:34

Answer 2

+3 A:

The first is that you're missing the re.VERBOSE flag in the call to re.compile(). The second is that you should use the methods on the returned object. The third is that you're using a regular expression where an appropriate parser already exists in the stdlib.
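
To illustrate the first two points, here is a sketch of a pattern compiled with re.VERBOSE and used through the methods of the compiled object. The pattern itself is a simplified, hypothetical one (the asker's original regex is not shown here) and is not a complete hostname validator.

```python
import re

# Hypothetical pattern for illustration only: dot-separated labels ending in
# an alphabetic top-level domain, optionally preceded by http:// or https://.
# Without re.VERBOSE the whitespace and comments below would be treated as
# literal parts of the pattern.
DOMAIN_RE = re.compile(r"""
    (?:https?://)?            # optional scheme
    (?P<domain>
        (?:[a-z0-9-]+\.)+     # one or more labels, each followed by a dot
        [a-z]{2,}             # top-level domain
    )
""", re.VERBOSE | re.IGNORECASE)


def find_domains(text):
    """Return every domain-like substring, using the compiled object's finditer()."""
    return [m.group("domain") for m in DOMAIN_RE.finditer(text)]
```

Calling DOMAIN_RE.finditer(), DOMAIN_RE.search() or DOMAIN_RE.match() directly avoids passing the compiled pattern back through the module-level re functions.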

Ignacio Vazquez-Abrams 2010-04-13 04:42:09

oh.. the re.VERBOSE WORKS. Thanks

2010-04-13 05:06:01

Answer 3

A:

I don't believe that this is actually about "regression", is it? It's about regular expressions, which is a totally different thing. Perhaps someone should fix the tagging.

2010-04-13 04:50:52

But the keys are like... RIGHT NEXT to each other :P Also, the answer box is not the place for comments on the question.

Ignacio Vazquez-Abrams 2010-04-13 04:55:59

Answer 4

+1 A:

This is the only correct way to parse a URL with a regex.

It's in C++, but you'll find it trivial to convert to Python by removing the extra backslashes. The enum names the capture groups.

Also see RFC 3986, the original source of the regexp:

static const char* const url_regex[] = {
    /* RE_URL */
    "^(([^:/?#]+):)?(//([^/?#]*)|///)?([^?#]*)(\\?[^#]*)?(#.*)?",
};

enum {
    URL = 0,
    SCHEME_CLN = 1,
    SCHEME  = 2,
    DSLASH_AUTH = 3,
    AUTHORITY = 4,
    PATH    = 5,
    QUERY   = 6,
    FRAGMENT = 7
};
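
A rough Python conversion of the above might look like this. The pattern and group indices are taken from the C++ snippet; the names URL_RE and split_url are made up for the sketch. With a raw string, the doubled backslash in the C++ literal becomes a single one.

```python
import re

# Same pattern as the C++ string literal, written as a Python raw string.
URL_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*)|///)?([^?#]*)(\?[^#]*)?(#.*)?")

# Capture group indices, mirroring the C++ enum.
URL, SCHEME_CLN, SCHEME, DSLASH_AUTH, AUTHORITY, PATH, QUERY, FRAGMENT = range(8)


def split_url(url):
    """Split a URL into components; groups that did not match come back as None."""
    m = URL_RE.match(url)
    return {
        "scheme": m.group(SCHEME),
        "authority": m.group(AUTHORITY),
        "path": m.group(PATH),
        "query": m.group(QUERY),
        "fragment": m.group(FRAGMENT),
    }
```

In this variant the query and fragment groups keep their leading ? and # characters, so split_url("http://www.google.com/adfaskl")["authority"] gives the domain, while a query would come back as something like "?q=1".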

piotr 2010-04-13 05:16:25
