ansaurus

Question

Answer 1

A:

^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$

Joe Garrett 2010-07-07 22:23:30

Read the whole question, this is not a solution.

You 2010-07-07 22:26:31

As I mentioned in my question, I need a Regular Expression capable of finding stuff like `.co.uk` or `.edu.hk`, not just the regular TLDs.

Tom 2010-07-07 22:27:30

"You", beat me to it.

Tom 2010-07-07 22:27:47

I read the whole question: regular expression matching based upon ccSLDs (replace the currently matched suffixes with the ones you'd like, or extrapolate). I answered with a solid regular expression pattern that will solve the problem if you provide the full list of suffixes you want. I would create a datastore to house the suffixes and read through it to build my regular expression list. As I said, this is a solution that could be used; you may not like it, but it would work.

Joe Garrett 2010-07-07 22:29:17

-1 -- It's not an answer to this question, and extending it to become one would make for an unpleasant and probably inefficient solution. Moreover, you've provided no explanation of how to make it applicable.

Benson 2010-07-08 21:25:31

Answer 2

+5 A:

It sounds like you are looking for the information available through the Public Suffix List project.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code to use the public suffix list, or find an existing library that already does so.
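As a rough illustration of what such code might look like in Python, here is a minimal sketch. It assumes the list has already been loaded into a set; `SAMPLE_SUFFIXES` is a tiny hand-picked stand-in rather than the real list, the function name `registrable_domain` is hypothetical, and the list's wildcard and exception rules are ignored.

```python
# Tiny stand-in for the real Public Suffix List (assumption: illustration only).
SAMPLE_SUFFIXES = {"com", "org", "uk", "co.uk", "hk", "edu.hk"}


def registrable_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the public suffix plus one label, or None if there is none.

    Wildcard (*.) and exception (!) rules from the real list are not handled.
    """
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so "co.uk" is found before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            # Keep exactly one label in front of the matched suffix.
            return ".".join(labels[i - 1:])
    return None
```

With this sample set, `registrable_domain("www.amazon.co.uk")` yields `"amazon.co.uk"`, which is the kind of distinction a single regex cannot make without knowing the list.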

Greg Hewgill 2010-07-07 22:23:45

Interesting and probably very useful list.

You 2010-07-07 22:29:13

Thanks, Greg. That's absolutely the right answer. There are libraries to do Public Suffix List processing in several languages at http://www.dkim-reputation.org/regdom-libs/

Anirvan 2010-07-07 23:17:29

@Anirvan, do you know an equivalent for Python? The library you posted is only available in C, PHP, and Perl.

Tom 2010-07-07 23:26:26

Answer 3

+2 A:

I would probably solve this by getting a complete list of TLDs and using it to create the regex. For example (in Ruby, sorry, not a Pythonista yet):

tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
regex = /^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(#{tld_alternation})$/i

I don't think it's possible to properly differentiate between a real two-part TLD and a subdomain without knowing the actual list of TLDs (i.e., you could always construct a subdomain that looks like a TLD if you knew how the regex worked).
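Since the question asks about Python, a rough equivalent of the Ruby snippet might look like the sketch below. The `tlds` list is a short illustrative sample, not a complete list, and the variable names are made up for this example. `re.escape` takes care of the dots, and sorting longest-first keeps the alternation predictable.

```python
import re

# Illustrative sample only; a real list would be much longer.
tlds = ["com", "co.uk", "eu", "org"]

# Escape each suffix and put longer suffixes first in the alternation.
tld_alternation = "|".join(
    re.escape(t) for t in sorted(tlds, key=len, reverse=True)
)

# One label (no dots allowed inside it), then a dot, then a known suffix.
domain_regex = re.compile(
    r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?\.(?:" + tld_alternation + r")$",
    re.IGNORECASE,
)
```

Because the label part cannot contain a dot, `www.google.com` is rejected while `google.com` and `amazon.co.uk` are accepted.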

fd 2010-07-07 22:32:23

Answer 4

+2 A:

Based on your comment above, I'm going to reinterpret the question -- rather than making a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names to only include first class domains, e.g. google.com, amazon.co.uk.

First, we'll need a list of TLDs. As Greg mentioned, the public suffix list is a great place to start. Let's assume you've parsed the list into a Python list called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.

suffixes = parse_suffix_list("suffix_list.txt")
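`parse_suffix_list` is not a library function, so it has to be written. A minimal sketch follows, assuming the file uses the Public Suffix List's text format (one rule per line, `//` for comments). For simplicity it skips wildcard (`*.`) and exception (`!`) rules, which a complete implementation would need to handle. The helper `parse_suffix_lines` is a hypothetical name introduced here so the parsing can be tried without a file.

```python
def parse_suffix_lines(lines):
    """Turn an iterable of suffix-list lines into a list of plain suffixes."""
    suffixes = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and // comments.
        if not line or line.startswith("//"):
            continue
        # Only the text up to the first whitespace is the rule.
        rule = line.split()[0]
        # Simplification: ignore wildcard and exception rules.
        if rule.startswith(("*.", "!")):
            continue
        suffixes.append(rule.lower())
    return suffixes


def parse_suffix_list(path):
    """Read a suffix list file from disk and parse it."""
    with open(path, encoding="utf-8") as f:
        return parse_suffix_lines(f)
```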

Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:

def is_domain(d):
    for suffix in suffixes:
        # Require the dot too, so "xco.uk" does not match the suffix "co.uk"
        if d.endswith('.' + suffix):
            # Get the base domain name without the suffix and its leading dot
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain.
            if base_name and '.' not in base_name:
                return True
    # If we get here, no matches were found
    return False
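To apply this as a filter over a list of names, one possible variant is sketched below. It is an alternative to the loop above, not the same code: it assumes the suffixes are held in a set, so each name needs one lookup instead of a scan over every suffix. The names `is_first_level_domain` and `sample_suffixes` are hypothetical, and the suffix set is a tiny sample.

```python
def is_first_level_domain(name, suffix_set):
    """True if name is exactly one label followed by a known suffix."""
    name = name.lower()
    # A name that is itself a suffix (e.g. "co.uk") is not a domain.
    if name in suffix_set:
        return False
    # Split off the first label; whatever remains must be a known suffix.
    label, dot, rest = name.partition(".")
    return bool(label) and bool(dot) and rest in suffix_set


# Illustrative data only.
sample_suffixes = {"com", "uk", "co.uk"}
names = ["google.com", "amazon.co.uk", "www.google.com", "co.uk"]
first_level = [n for n in names if is_first_level_domain(n, sample_suffixes)]
```

Here `first_level` ends up containing only `google.com` and `amazon.co.uk`.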

Benson 2010-07-08 21:41:39

Thanks! I can find my way from here.

Tom 2010-07-08 22:30:10

No problem! Glad I could help.

Benson 2010-07-09 06:48:24

Regex to match Domain.CCTLD
