Possible Duplicate:
The Hostname Regex

I'm trying to use pcrepp (PCRE) to extract the hostname from a URL. PCRE regular expressions use the same syntax as Perl 5 regular expressions.

for example:

url = "http://www.pandora.com/#/volume/73";
// the match will be "http://www.pandora.com/".

I can't find the correct syntax of the regex for this example.

  • Needs to work for any URL: amazon.com/sds/ should return amazon.com, and abebooks.co.uk/isbn="62345627457245"/blabla/ should return abebooks.co.uk.
  • I don't need to check whether the URL is valid, just to get the hostname.
+2  A: 

Something like this:

^(?:[a-z]+://)?[^/]+/?
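
As written, the pattern above matches the optional protocol, the host, and one trailing slash together. To pull out only the hostname, one option is to put a capture group around `[^/]+`. The following is a minimal sketch of that adaptation, not the answerer's own code. It assumes `std::regex` (ECMAScript grammar) as a stand-in for pcrepp, since this simple pattern should behave the same in both, and `extract_host` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the hostname part of `url`, or an empty
// string if the pattern does not match. std::regex is used here as a
// stand-in for pcrepp (an assumption, not what the question uses).
std::string extract_host(const std::string& url) {
    // Optional "scheme://" prefix, then capture everything up to the first '/'.
    static const std::regex pattern("^(?:[a-z]+://)?([^/]+)");
    std::smatch match;
    if (std::regex_search(url, match, pattern)) {
        return match[1].str();  // group 1 holds the host without the slash
    }
    return std::string();
}
```

With this adaptation, a URL without a protocol such as amazon.com/sds/ goes through the same code path, because the `scheme://` prefix is optional.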
Mark Byers
What if I have "amazon.com/abc"? I can't rely on the protocol slashes for that kind of address.
shaimagz
If there is no protocol, look for the first slash. With the regular expression this is `^([a-z]+://)?[^/]+/`. You probably should mention this detail in your question rather than hiding it in a comment to an answer. It's a very important detail.
Mark Byers
+1  A: 

Here is one possibility:

^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$

And another:

^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$

These and other URL related regular expressions can be found here: Regular Expression Library
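
Both patterns above are anchored with `^` and `$`, so they appear to validate a whole string rather than extract a hostname from a longer URL. The sketch below illustrates that behaviour. It assumes `std::regex` as a stand-in for pcrepp, and the helper names `matches_known_tld_host` and `matches_http_url` are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is a bare hostname ending in one of
// the TLDs listed in the first pattern.
bool matches_known_tld_host(const std::string& s) {
    static const std::regex pattern(
        R"(^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$)");
    return std::regex_match(s, pattern);
}

// Hypothetical helper: true if `s` is an http:// URL whose host ends in
// a 2 or 3 letter TLD, per the second pattern.
bool matches_http_url(const std::string& s) {
    static const std::regex pattern(
        R"(^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$)");
    return std::regex_match(s, pattern);
}
```

Under these assumptions, the first pattern rejects abebooks.co.uk (the TLD is not in its list) and the second rejects amazon.com/sds/ (no `http://` prefix), so neither covers the question's examples without changes.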

magnifico
+2  A: 

See Regexp::Common::URI::http which uses sub-patterns defined in Regexp::Common::URI::RFC2396. Examining the source code of those modules should give you a good idea how to put together a decent pattern.

Sinan Ünür
A: 
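#include <string>

std::string regex1, regex2, finalRegex;

// scheme, optional user:password@, host (group 6), port, path
regex1 = "^((\\w+):\\/\\/\\/?)?((\\w+):?(\\w+)?@)?([^\\/\\?:]+):?(\\d+)?(\\/?[^\\?#;\\|]+)?([;\\|])?([^\\?#]+)?\\??";

// query string and fragment
regex2 = "([^#]+)?#?(\\w*)";

// concatenation
finalRegex = regex1 + regex2;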

The hostname ends up in the sixth capture group. This was answered in another question I asked: Details.
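
For illustration, here is a minimal sketch of reading the sixth capture group from the concatenated pattern. It assumes `std::regex` (ECMAScript grammar) instead of pcrepp, which is a substitution made only to keep the example self-contained, and `host_from_full_pattern` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies the concatenated pattern and returns
// capture group 6 (the host), or an empty string if there is no match.
// std::regex is a stand-in for pcrepp here (an assumption).
std::string host_from_full_pattern(const std::string& url) {
    // Same two pieces as above, written as raw string literals so the
    // backslashes do not need to be doubled.
    static const std::string regex1 =
        R"(^((\w+):\/\/\/?)?((\w+):?(\w+)?@)?([^\/\?:]+):?(\d+)?(\/?[^\?#;\|]+)?([;\|])?([^\?#]+)?\??)";
    static const std::string regex2 = R"(([^#]+)?#?(\w*))";
    static const std::regex pattern(regex1 + regex2);

    std::smatch match;
    if (std::regex_search(url, match, pattern) && match[6].matched) {
        return match[6].str();
    }
    return std::string();
}
```

Because the protocol, user info, and port groups are all optional, the same group index should apply whether or not the URL starts with `http://`.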

shaimagz