I need a regular expression in Java that I can use to retrieve the domain.tld part from any URL. For example, https://foo.com/bar, http://www.foo.com#bar, and http://bar.foo.com should all return foo.com.

I wrote this regex, but it matches the whole URL:

Pattern.compile("[.]?.*[.x][a-z]{2,3}");

I'm not sure I'm matching the "." character correctly. I tried "\." but I get an error from NetBeans.

Update:

The TLD is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.

+4  A: 

This is harder than you might imagine. Your example https://foo.com/bar, has a comma in it, which is a valid URL character. Here is a great post about some of the troubles:

http://www.codinghorror.com/blog/archives/001181.html

https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])

The pattern above is a good starting point.

Some listings from "Mastering Regular Expressions" on this topic:

http://regex.info/listing.cgi?ed=3&p=207

@sjobe

>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)

Sorry, the example is in Python rather than Java because it's briefer. Java requires some extra escaping of the regex.
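For comparison, here is a rough Java sketch of the same session. The class name UrlHostDemo and the helper capture are made up for illustration. Java string literals normally need every regex backslash doubled, but this particular pattern contains no backslashes, so it can be pasted unchanged. The helper uses lookingAt(), which anchors the match at the start of the input only, much like Python's re.match(), and returns null instead of raising an error when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlHostDemo {
    // Same pattern as in the Python session above. It has no backslashes,
    // so the Java string literal needs no additional escaping.
    static final Pattern URL = Pattern.compile(
            "https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])");

    // Hypothetical helper: returns capture group 1 if the input starts
    // with a URL, or null if it does not.
    static String capture(String input) {
        Matcher m = URL.matcher(input);
        // lookingAt() anchors at the start only, like Python's re.match().
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(capture("http://news.google.com/"));
        System.out.println(capture("http://google.com"));
        System.out.println(capture("not a url"));
    }
}
```

Note that group 1 still includes the path, so this only finds the URL. It does not isolate domain.tld by itself.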

jsamsa
I don't think he meant the comma to be part of the URL; he was just separating items in a list.
RC1140
That's my point, it's ambiguous. How should the regex determine if the comma is part of the URL or not?
jsamsa
Doesn't matter anyway, as he's interested in the "domain.tld" part of an http URL. There's no comma in that part.
MSalters
MSalters, good point.
jsamsa
I tried that regular expression [added a ')' at the end]https?://([-A-Za-z0-9+
sjobe
+2  A: 

You're going to need to get a list of all possible TLDs and ccTLDs and then match against them. You have to do this, or else you'll never be able to distinguish between subdomain.dom.com and hello.co.uk.

So, get yourself such a list. I recommend inverting it so you store, for example, uk.co. Then you can extract the domain from a URL by getting everything between // and / or end of line. Split at . and work backwards, matching the TLD and then 1 additional level to get the domain.
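The steps above could be sketched in Java roughly as follows. Everything named here (SuffixListDemo, REVERSED_SUFFIXES, registeredDomain) is hypothetical, and the suffix set is a deliberately tiny sample, not a real TLD list. As an extra assumption beyond the description, the host is also cut at '?', '#' and ':' so that URLs such as http://www.foo.com#bar work. User info (user:pass@host) and IPv6 literals are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SuffixListDemo {
    // Hypothetical sample only, stored inverted ("uk.co" stands for co.uk).
    // A real list would be far longer.
    static final Set<String> REVERSED_SUFFIXES = new HashSet<>(Arrays.asList(
            "com", "org", "net", "uk", "uk.co", "uk.org", "uk.ac", "au.com", "au.org"));

    // Returns the known suffix plus one extra label, or null if the host
    // has no known suffix or nothing in front of it.
    static String registeredDomain(String url) {
        // Host = everything after "//" up to '/', '?', '#', ':' or the end.
        int start = url.indexOf("//");
        start = (start < 0) ? 0 : start + 2;
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#' || c == ':') {
                end = i;
                break;
            }
        }
        String[] labels = url.substring(start, end).toLowerCase().split("\\.");

        // Walk the labels backwards, building the inverted key
        // ("uk", then "uk.co", ...) and remembering the longest known suffix.
        StringBuilder key = new StringBuilder();
        int suffixLabels = 0;
        for (int i = labels.length - 1; i >= 0; i--) {
            if (key.length() > 0) {
                key.append('.');
            }
            key.append(labels[i]);
            if (REVERSED_SUFFIXES.contains(key.toString())) {
                suffixLabels = labels.length - i;
            }
        }
        if (suffixLabels == 0 || suffixLabels >= labels.length) {
            return null;
        }
        // Keep the suffix plus one additional level.
        int first = labels.length - suffixLabels - 1;
        return String.join(".", Arrays.copyOfRange(labels, first, labels.length));
    }

    public static void main(String[] args) {
        System.out.println(registeredDomain("http://www.foo.co.uk/bar"));
        System.out.println(registeredDomain("http://bar.foo.com"));
    }
}
```

Taking the longest matching suffix is what lets www.foo.co.uk resolve to foo.co.uk even though uk alone is also in the sample list.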

Adam Pope
+1  A: 

If the string contains a valid URL then you could use a regex like (Perl quoting):

/^
(?:\w+:\/\/)?
[^:?#\/\s]*?

(
[^.\s]+
\.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
)

(?:[:?#\/]|$)
/xi;

Results:

url: https://foo.com/bar
matched: foo.com
url: http://www.foo.com#bar
matched: foo.com
url: http://bar.foo.com
matched: foo.com
url: ftp://foo.com
matched: foo.com
url: ftp://www.foo.co.uk?bar
matched: foo.co.uk
url: ftp://www.foo.co.uk:8080/bar
matched: foo.co.uk

For Java it would be quoted something like:

"^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|___etc___))(?:[:?#/]|$)"

Of course, you'll need to replace the ___etc___ part with the remaining suffixes you want to support.
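A minimal sketch of how that Java string might be used, assuming the ___etc___ placeholder is simply dropped so that only the five two-part suffixes listed above are recognised. The class name DomainRegexDemo and the helper domainOf are invented for this example. Pattern.CASE_INSENSITIVE stands in for the Perl /i flag; the /x flag is not needed because the Java string has no layout whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainRegexDemo {
    // The quoted pattern from above, minus the ___etc___ placeholder.
    static final Pattern DOMAIN = Pattern.compile(
            "^(?:\\w+://)?[^:?#/\\s]*?"
            + "([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au))"
            + "(?:[:?#/]|$)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns capture group 1, or null if no match.
    static String domainOf(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] tests = {
            "https://foo.com/bar",
            "http://www.foo.com#bar",
            "http://bar.foo.com",
            "ftp://foo.com",
            "ftp://www.foo.co.uk?bar",
            "ftp://www.foo.co.uk:8080/bar"
        };
        for (String url : tests) {
            System.out.println("url: " + url);
            System.out.println("matched: " + domainOf(url));
        }
    }
}
```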

Example Perl script:

use strict;

my @test = qw(
    https://foo.com/bar
    http://www.foo.com#bar
    http://bar.foo.com
    ftp://foo.com
    ftp://www.foo.co.uk?bar
    ftp://www.foo.co.uk:8080/bar
);

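for (@test) {
    print "url: $_\n";

    # Only print $1 when the match succeeds; otherwise $1 would still
    # hold the capture from a previous iteration.
    if (/^
        (?:\w+:\/\/)?
        [^:?#\/\s]*?

        (
        [^.\s]+
        \.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
        )

        (?:[:?#\/]|$)
        /xi) {
        print "matched: $1\n";
    }
}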
Qtax
I'm getting an "Illegal escape character" error in Java
sjobe
I forgot to double escape the first \w at the beginning of the string; it should be "\\w". If you see any other single backslashes, escape them too.
Qtax
A: 

I would use the java.net.URI class to extract the host name, and then use a regex to extract the last two parts of the host name.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunIt {

    public static void main(String[] args) throws URISyntaxException {
        Pattern p = Pattern.compile(".*?([^.]+\\.[^.]+)");

        String[] urls = new String[] {
                "https://foo.com/bar",
                "http://www.foo.com#bar",
                "http://bar.foo.com"
        };

        for (String url : urls) {
            URI uri = new URI(url);
            // e.g. uri.getHost() returns "www.foo.com"; it is null when the URI has no host
            String host = uri.getHost();
            if (host == null) {
                continue;
            }
            Matcher m = p.matcher(host);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
    }
}

Prints:

foo.com
foo.com
foo.com
idrosid
That's actually what I ended up doing.
sjobe
And what about domain names like foobar.co.uk?
Gumbo
+1 @Gumbo: this only works with one-part extensions.
Spark