I had a search and found lots of similar regex examples, but not quite what I need.

I want to be able to pass in the following urls and return the results:

  • www.google.com returns google.com

  • sub.domains.are.cool.google.com returns google.com

  • doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com

  • sub.domain.google.com/no/thanks returns google.com

Hope that makes sense :) Thanks in advance! -James

A: 

I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...

([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b

EDIT:

To clarify, it's looking for:

one or more alpha-numeric characters or dashes, followed by a literal dot

and then one of three things...

  1. three or more alpha characters (e.g. com/net/mil/coop)
  2. two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
  3. two alpha characters (e.g. us/uk/to)

and at the end of that, a word boundary (\b), meaning the end of the string, a space, or any other non-word character (in regex, word characters are typically alphanumerics and the underscore).

As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
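For anyone wondering how this might be wired up, here's a minimal JavaScript sketch. It assumes the input is a bare host with an optional path and no scheme, as in the question's examples. The function name baseDomain is made up for illustration. One thing to watch: the pattern as written is unanchored, so on www.google.com the first match is actually www.google. The sketch therefore drops the path and swaps the trailing \b for $ so the match has to sit at the end of the host. It also writes every character class as [A-Za-z].

```javascript
// Hypothetical helper: returns the last "name.tld" chunk of a host, or null.
// Assumes no scheme (http://) in the input, only host plus optional path.
function baseDomain(input) {
  // Drop anything after the first slash, e.g. "/no/thanks".
  var host = input.split('/')[0];

  // Same pattern as above, but anchored with $ instead of \b so the
  // match must end where the host ends.
  var pattern = /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))$/;

  var m = host.match(pattern);
  return m ? m[1] : null;
}

console.log(baseDomain('www.google.com'));
console.log(baseDomain('sub.domain.google.com/no/thanks'));
```

This still guesses at the suffix shape, so endings that don't fit one of the three alternatives (com.au, for instance) will come out wrong.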

theraccoonbear
Could you explain what it does, please? My understanding of regex is minimal. Also, how would it be implemented?
sparkyfied
90% is generous. Basically, there IS no simple way to do this. The domain name system is way too convoluted and allows a lot of variation.
hallvors
Given that the examples provided are "normalish" looking domains, I think you can probably hit a substantial chunk, but sure, maybe not 90%. As I said though (and really to the point) it's unlikely you'll get 100% for all of your test cases.
theraccoonbear
A: 

You can't do this with a regular expression because you don't know how many blocks are in the suffix.

For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.

If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.

You will actually need to look up the suffix from a list like http://publicsuffix.org/
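As a rough illustration of the lookup idea, here's a sketch in JavaScript. The SAMPLE_SUFFIXES object is a tiny hand-picked stand-in, not the real list, and registrableDomain is a made-up name. The real list also has wildcard and exception rules, which this ignores completely.

```javascript
// Hypothetical sample only; the real public suffix list is far longer and
// includes wildcard and exception rules that are not handled here.
var SAMPLE_SUFFIXES = { 'com': true, 'uk': true, 'co.uk': true };

function registrableDomain(host) {
  var labels = host.toLowerCase().split('.');

  // Walk from the longest candidate suffix to the shortest,
  // so "co.uk" is found before "uk".
  for (var i = 1; i < labels.length; i++) {
    var suffix = labels.slice(i).join('.');
    if (SAMPLE_SUFFIXES.hasOwnProperty(suffix)) {
      // Keep exactly one label in front of the matched suffix.
      return labels.slice(i - 1).join('.');
    }
  }
  return null; // suffix not in the sample list, or nothing in front of it
}

console.log(registrableDomain('subdomain.google.com'));
console.log(registrableDomain('subdomain.google.co.uk'));
```

The point is that the number of blocks to keep comes from the list, not from the shape of the string.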

Tatham Oddie
A: 

Don't use regex, use the .split() method and work from there.

var s = domain.split('.');

If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:

return s.slice(-2).join('.');

It'll make your eyes bleed less than any regex solution.
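Putting those two snippets together, a sketch for a narrow use case might look like the following. TWO_PART_TLDS is a hypothetical short list you'd maintain yourself, and lastSegments is just an illustrative name. It assumes the input has no scheme, only a host and maybe a path.

```javascript
// Hypothetical list of two-part endings relevant to your own use case.
var TWO_PART_TLDS = ['co.uk', 'org.uk', 'com.au'];

function lastSegments(input) {
  var host = input.split('/')[0];          // drop any path
  var s = host.split('.');
  var lastTwo = s.slice(-2).join('.');

  // Keep 3 segments when the host ends in a known two-part TLD, else 2.
  var count = TWO_PART_TLDS.indexOf(lastTwo) !== -1 ? 3 : 2;
  return s.slice(-count).join('.');
}

console.log(lastSegments('sub.domain.google.com/no/thanks'));
console.log(lastSegments('www.google.co.uk'));
```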

stormsweeper