Hello, I'm trying to extract the domain name from a string in C#. You don't necessarily have to use a RegEx but we should be able to extract yourdomain.com from all of the following:

yourdomain.com
www.yourdomain.com
http://www.yourdomain.com
http://www.yourdomain.com/
store.yourdomain.com
http://store.yourdomain.com
whatever.yourdomain.com
*.yourdomain.com

Also, any TLD is acceptable, so the same should work when .com above is replaced with .net, .org, .co.uk, etc.

A: 

I would recommend trying this yourself, using Regulator and a regex cheat sheet.

http://sourceforge.net/projects/regulator/

http://regexlib.com/CheatSheet.aspx

You can also find some good info on regular expressions at Coding Horror.

James_Dude
I've been trying this myself for about an hour now. Thanks for the links though.
Kyle West
A: 

A regex doesn't really fit your requirement of "any TLD", since the set of TLDs is large, varied in format, and continually in flux. If you limited your scope to:

(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]+$))

You would catch .anything and .co.anything (with case-insensitive matching, since the pattern only lists A-Z), which I imagine covers most realistic cases...
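For what it's worth, here is a minimal sketch of how a pattern like this might be applied in C#. The class and method names are made up for illustration, the .co.anything branch is written with `[A-Z]+` so that two-letter endings such as .co.uk match, and it assumes case-insensitive matching and that a trailing slash is stripped first.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DomainRegexSketch
{
    // IgnoreCase is needed because the pattern only lists A-Z.
    private static readonly Regex DomainPattern = new Regex(
        @"(?<domain>[^\.]+\.([A-Z]+$|co\.[A-Z]+$))",
        RegexOptions.IgnoreCase);

    // Returns the matched domain, or null when the input does not fit the pattern.
    public static string ExtractDomain(string input)
    {
        // The pattern is anchored at the end, so drop whitespace and any trailing slash.
        string trimmed = input.Trim().TrimEnd('/');
        Match match = DomainPattern.Match(trimmed);
        return match.Success ? match.Groups["domain"].Value : null;
    }
}
```

With this sketch, "http://www.yourdomain.com/" and "*.yourdomain.com" both give "yourdomain.com", and "store.yourdomain.co.uk" gives "yourdomain.co.uk". Anything with a path after the host returns null, and a name like "warwick.ac.uk" comes back as "ac.uk", which is the limitation of only special-casing .co.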

Rex M
I don't really need ANY TLD, I do need the most popular ones though.
Kyle West
This regex covers every popular TLD (.anything and .co.anything). It does assume you'll strip off the ending slash if there is one.
Rex M
There are many, many popular SLDs that do not involve .co.
bobince
Depends on your definition of popular and what parts of the world your site generally serves
Rex M
A: 

Have a look at this other answer. It was for PHP but you'll easily get the regex out of the 4-5 lines of PHP and you can benefit from the discussion that followed (see Alnitak's answer).

lpfavreau
+11  A: 
  1. If no scheme present (no colon in string), prepend "http://" to make it a valid URL.

  2. Pass string to Uri constructor.

  3. Access the Uri's Host property.
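
The three steps above could be sketched roughly like this. The class and method names are hypothetical, and stripping a leading "*." is an extra assumption added here because a wildcard label may not parse as a valid host; `Uri.TryCreate` is used instead of the constructor so that bad input yields null rather than an exception.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class HostNameSketch
{
    // Returns the host part of the input, or null if it cannot be parsed as a URL.
    public static string GetHost(string input)
    {
        string candidate = input.Trim();

        // Assumption: treat "*.yourdomain.com" as "yourdomain.com" before parsing.
        if (candidate.StartsWith("*.", StringComparison.Ordinal))
            candidate = candidate.Substring(2);

        // Step 1: no colon means no scheme, so prepend one.
        if (candidate.IndexOf(':') < 0)
            candidate = "http://" + candidate;

        // Step 2: let Uri do the parsing.
        Uri uri;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out uri))
            return null;

        // Step 3: read the Host property.
        return uri.Host;
    }
}
```

For example, `HostNameSketch.GetHost("http://store.yourdomain.com/")` gives "store.yourdomain.com". Note that the colon check is only a heuristic: something like "yourdomain.com:8080" contains a colon but no scheme, so it would need extra handling.
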

Now you have the hostname. What exactly you consider the ‘domain name’ of a given hostname is a debatable point. I'm guessing you don't simply mean everything after the first dot.

It's not possible to distinguish hostnames like ‘whatever.yourdomain.com’ from domains-in-an-SLD like ‘warwick.ac.uk’ from just the strings. Indeed, there is even a bit of grey area about what is and isn't a public SLD, given the efforts of some registrars to carve out their own niches.

A common approach is to maintain a big list of SLDs and other suffixes used by unrelated entities. This is what web browsers do to stop unwanted public cookie sharing. Once you've found a public suffix, you could add the one nearest prefix in the host name split by dots to get the highest-level entity responsible for the given hostname, if that's what you want. Suffix lists are hell to maintain, but you can piggy-back on someone else's efforts.
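
As a rough illustration of the suffix-list idea, the sketch below finds the longest known suffix of a hostname and keeps one more label in front of it. Everything here is an assumption for demonstration: the names are made up, and the six-entry suffix set is a tiny hand-picked sample, not a real list. A real suffix list also has wildcard and exception rules, which this ignores.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class RegistrableDomainSketch
{
    // Illustrative sample only; a real application would load a maintained list.
    private static readonly HashSet<string> SampleSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "net", "org", "uk", "co.uk", "ac.uk"
        };

    // Returns the public suffix plus the one label before it, or null if
    // no suffix is known or the host is itself a suffix.
    public static string GetRegistrableDomain(string host)
    {
        string[] labels = host.TrimEnd('.').Split('.');

        // Try the longest candidate suffix first, then shorter ones.
        for (int i = 0; i < labels.Length; i++)
        {
            string suffix = string.Join(".", labels, i, labels.Length - i);
            if (!SampleSuffixes.Contains(suffix))
                continue;

            // The whole host is a suffix, so there is no label to add in front.
            if (i == 0)
                return null;

            return string.Join(".", labels, i - 1, labels.Length - i + 1);
        }
        return null;
    }
}
```

With this sample set, "store.yourdomain.co.uk" gives "yourdomain.co.uk" and "warwick.ac.uk" stays "warwick.ac.uk", which is the distinction the plain string can't give you without the list.
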

Alternatively, if your app has the time and a network connection, it could go looking for information on the hostname. For example, it could run a whois query for the hostname and then for each parent in turn until it got a result; that would be the domain name of the lowest-level entity responsible for the given hostname.

Or, if all that's too much work, you could try just chopping off any leading ‘www.’ present!

bobince
+1 for using existing facilities
Ates Goral
There is an open-source C# library that uses publicsuffix.org to parse domains, here: http://code.google.com/p/domainname-parser/
Dan Esparza