Need a function like:

function isGoogleURL(url) { ... }

that returns true iff URL belongs to Google. No false positives; no false negatives.

Luckily there's this as a reference:

.google.com .google.ad .google.ae .google.com.af .google.com.ag .google.com.ai .google.am .google.it.ao .google.com.ar .google.as .google.at .google.com.au .google.az .google.ba .google.com.bd .google.be .google.bg .google.com.bh .google.bi .google.com.bn .google.com.bo .google.com.br .google.bs .google.co.bw .google.com.by .google.com.bz .google.ca .google.cd .google.cg .google.ch .google.ci .google.co.ck .google.cl .google.cn .google.com.co .google.co.cr .google.com.cu .google.cz .google.de .google.dj .google.dk .google.dm .google.com.do .google.dz .google.com.ec .google.ee .google.com.eg .google.es .google.com.et .google.fi .google.com.fj .google.fm .google.fr .google.ge .google.gg .google.com.gh .google.com.gi .google.gl .google.gm .google.gp .google.gr .google.com.gt .google.gy .google.com.hk .google.hn .google.hr .google.ht .google.hu .google.co.id .google.ie .google.co.il .google.im .google.co.in .google.is .google.it .google.je .google.com.jm .google.jo .google.co.jp .google.co.ke .google.com.kh .google.ki .google.kg .google.co.kr .google.kz .google.la .google.li .google.lk .google.co.ls .google.lt .google.lu .google.lv .google.com.ly .google.co.ma .google.md .google.mn .google.ms .google.com.mt .google.mu .google.mv .google.mw .google.com.mx .google.com.my .google.co.mz .google.com.na .google.com.nf .google.com.ng .google.com.ni .google.nl .google.no .google.com.np .google.nr .google.nu .google.co.nz .google.com.om .google.com.pa .google.com.pe .google.com.ph .google.com.pk .google.pl .google.pn .google.com.pr .google.pt .google.com.py .google.com.qa .google.ro .google.ru .google.rw .google.com.sa .google.com.sb .google.sc .google.se .google.com.sg .google.sh .google.si .google.sk .google.sn .google.sm .google.st .google.com.sv .google.co.th .google.com.tj .google.tk .google.tl .google.tm .google.to .google.com.tr .google.tt .google.com.tw .google.co.tz .google.com.ua .google.co.ug .google.co.uk .google.com.uy .google.co.uz .google.com.vc .google.co.ve 
.google.vg .google.co.vi .google.com.vn .google.vu .google.ws .google.rs .google.co.za .google.co.zm .google.co.zw .google.cat

Any ideas how to do this elegantly?

Some Clarifications:

  • I need this for a Greasemonkey script I wrote that currently only works for google.com (and should work for all the other TLDs as well). Here is the script (it modifies Google Reader to work better on wide screens).
  • It should work on URLs that belong to the above domains (not blogger.com, etc.).
+1  A: 

Do you count other Google properties as "belonging to Google"? FeedBurner, Blogger etc?

Can I ask what the purpose of this is? There may be a better way of doing what you want... and if it's reasonable I can ask internally for you.

Jon Skeet
If so, I imagine you could ping the domain and see if it returns Google's IPs?
Karan
I strongly suspect there are a lot of IP addresses (or ranges) to match, and they're likely to vary significantly over time.
Jon Skeet
A: 

I wouldn't do this client-side.
The list of Google domains doesn't change very frequently, so you could store the list server-side and then dynamically generate the .js that checks against it.

friol
The fact that it changes infrequently would seem to be an argument _against_ generating the .js dynamically from server side... no?
Assaf Lavie
A: 

Without a regex to individually match each and every TLD, there isn't really an 'elegant way of doing it'.

Echilon
A: 

A regular expression may be what you need. An example is:

<script>
var elem = document.getElementById("a");
var regex = new RegExp("(http://)?(www\\.)?google\\.com");

elem.innerHTML = regex.test(elem.innerHTML);
</script>

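This reads the content of the span element with id "a" and replaces it with "true" if it contains google.com, and "false" otherwise. Note that it doesn't cover the other domains (although the regex could easily be modified to do so). Also, because the pattern is unanchored and both prefixes are optional, it is true for any text that contains "google.com": "pages.google.com" matches, but so does "notgoogle.com".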
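As a rough sketch of how the regex might be tightened, the variant below anchors the pattern and allows any number of subdomain labels in front of google.com. The names GOOGLE_COM_PATTERN and isGoogleDotCom are made up for this example, and it assumes plain http/https URLs without credentials or an explicit port.

```javascript
// Hypothetical variant: google.com itself or any subdomain of it, nothing else.
// Assumes plain http/https URLs without credentials or an explicit port.
var GOOGLE_COM_PATTERN = /^(?:https?:\/\/)?(?:[a-z0-9-]+\.)*google\.com(?:\/|$)/i;

function isGoogleDotCom(url) {
    return GOOGLE_COM_PATTERN.test(url);
}

console.log(isGoogleDotCom('http://pages.google.com/'));         // true
console.log(isGoogleDotCom('http://www.google.com.evil.test/')); // false
```

The trailing `(?:\/|$)` is what stops a longer hostname from passing just because it starts with a Google-looking prefix.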

Also, your URLs all have a "." before them (".google.com" instead of "google.com"). Is there a reason for this, or is it just a mistake?

luiscubal
+1  A: 

If you don't need the test to be 100% accurate, this simple regex would do for all the domains you posted above:

"(http://)?([\w]+)?\.google\.([\w]{2,3})"

Just testing for the presence of ".google." would suffice in most cases, although it could be fooled by a URL that contains a "google" label somewhere else (not so easy though, nor quickly done).

Or just wait for google to buy their own google TLD.
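To make the limits of the simple pattern concrete, here is a small check that runs it against a few sample URLs. The sample URLs and the variable names are invented for this illustration; the regex is the one quoted above, written as a literal.

```javascript
// The pattern quoted above, written as a regex literal (it is unanchored).
var loosePattern = /(http:\/\/)?([\w]+)?\.google\.([\w]{2,3})/;

// Made-up sample URLs for illustration.
var samples = [
    'http://www.google.com/',               // a real Google host
    'http://www.google.some_mal_site.com/', // "google" used as a subdomain label
    'http://example.test/?ref=.google.com', // ".google." only in the query string
    'http://google.com/'                    // no dot in front of "google"
];

var looseResults = samples.map(function (url) {
    return loosePattern.test(url);
});

console.log(looseResults); // [true, true, true, false]
```

So the pattern has both kinds of error: it accepts the second and third samples and rejects the bare google.com host.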

Berzemus
I'm pretty sure this regex would allow sites like www.google.some_mal_site.com to match, and I don't want that.
Assaf Lavie
A: 

You could use a regular expression like....

^https?://[-A-Za-z0-9\.]+(\.google\.com|\.google\.ad|\.google\.ae|\.google\.com\.af|\.google\.com\.ag|\.google\.com\.ai|\.google\.am|\.google\.it\.ao|\.google\.com\.ar|\.google\.as|\.google\.at|\.google\.com\.au|\.google\.az|\.google\.ba|\.google\.com\.bd|\.google\.be|\.google\.bg|\.google\.com\.bh|\.google\.bi|\.google\.com\.bn|\.google\.com\.bo|\.google\.com\.br|\.google\.bs|\.google\.co\.bw|\.google\.com\.by|\.google\.com\.bz|\.google\.ca|\.google\.cd|\.google\.cg|\.google\.ch|\.google\.ci|\.google\.co\.ck|\.google\.cl|\.google\.cn|\.google\.com\.co|\.google\.co\.cr|\.google\.com\.cu|\.google\.cz|\.google\.de|\.google\.dj|\.google\.dk|\.google\.dm|\.google\.com\.do|\.google\.dz|\.google\.com\.ec|\.google\.ee|\.google\.com\.eg|\.google\.es|\.google\.com\.et|\.google\.fi|\.google\.com\.fj|\.google\.fm|\.google\.fr|\.google\.ge|\.google\.gg|\.google\.com\.gh|\.google\.com\.gi|\.google\.gl|\.google\.gm|\.google\.gp|\.google\.gr|\.google\.com\.gt|\.google\.gy|\.google\.com\.hk|\.google\.hn|\.google\.hr|\.google\.ht|\.google\.hu|\.google\.co\.id|\.google\.ie|\.google\.co\.il|\.google\.im|\.google\.co\.in|\.google\.is|\.google\.it|\.google\.je|\.google\.com\.jm|\.google\.jo|\.google\.co\.jp|\.google\.co\.ke|\.google\.com\.kh|\.google\.ki|\.google\.kg|\.google\.co\.kr|\.google\.kz|\.google\.la|\.google\.li|\.google\.lk|\.google\.co\.ls|\.google\.lt|\.google\.lu|\.google\.lv|\.google\.com\.ly|\.google\.co\.ma|\.google\.md|\.google\.mn|\.google\.ms|\.google\.com\.mt|\.google\.mu|\.google\.mv|\.google\.mw|\.google\.com\.mx|\.google\.com\.my|\.google\.co\.mz|\.google\.com\.na|\.google\.com\.nf|\.google\.com\.ng|\.google\.com\.ni|\.google\.nl|\.google\.no|\.google\.com\.np|\.google\.nr|\.google\.nu|\.google\.co\.nz|\.google\.com\.om|\.google\.com\.pa|\.google\.com\.pe|\.google\.com\.ph|\.google\.com\.pk|\.google\.pl|\.google\.pn|\.google\.com\.pr|\.google\.pt|\.google\.com\.py|\.google\.com\.qa|\.google\.ro|\.google\.ru|\.google\.rw|\.google\.com\.sa|\.google\.com\.sb|\.google\.sc|\.google\.se|\.google\.com\.sg|\.google\.sh|\.google\.si|\.google\.sk|\.google\.sn|\.google\.sm|\.google\.st|\.google\.com\.sv|\.google\.co\.th|\.google\.com\.tj|\.google\.tk|\.google\.tl|\.google\.tm|\.google\.to|\.google\.com\.tr|\.google\.tt|\.google\.com\.tw|\.google\.co\.tz|\.google\.com\.ua|\.google\.co\.ug|\.google\.co\.uk|\.google\.com\.uy|\.google\.co\.uz|\.google\.com\.vc|\.google\.co\.ve|\.google\.vg|\.google\.co\.vi|\.google\.com\.vn|\.google\.vu|\.google\.ws|\.google\.rs|\.google\.co\.za|\.google\.co\.zm|\.google\.co\.zw|\.google\.cat)

and I'd imagine generating that in JavaScript (or whatever language you choose) from an array or some other data set would be relatively easy.
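A minimal sketch of generating such a pattern from an array could look like the following. It is not the exact regex above: it drops the leading dot so the bare domain also passes, and it anchors the end of the host. DOMAIN_SUFFIXES holds only a short subset of the list, and escapeRegExp and buildGoogleRegex are hypothetical helper names.

```javascript
// A short subset of the list from the question, for illustration only.
var DOMAIN_SUFFIXES = ['.google.com', '.google.co.uk', '.google.de', '.google.com.au'];

// Escape every regex metacharacter so list entries are matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds one anchored regex from the suffix list.
function buildGoogleRegex(suffixes) {
    var alternatives = suffixes.map(function (suffix) {
        return escapeRegExp(suffix.replace(/^\./, '')); // drop the leading dot
    }).join('|');
    // scheme, optional subdomain labels, one listed domain, then the end of the host
    return new RegExp('^https?://(?:[-a-z0-9]+\\.)*(?:' + alternatives + ')(?:[/:?#]|$)', 'i');
}

var googleRegex = buildGoogleRegex(DOMAIN_SUFFIXES);
console.log(googleRegex.test('http://www.google.co.uk/search'));        // true
console.log(googleRegex.test('http://www.google.com.other.site.com/')); // false
```

Requiring `/`, `:`, `?`, `#` or the end of the string right after the listed domain is the part that keeps a longer hostname from matching.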

theraccoonbear
That would match 'www.google.com.other.site.com', no?
Assaf Lavie
+1  A: 

I agree that you probably shouldn't do this... However, if you are going to do it (and you aren't content with the previously offered solutions that just check for a google-like pattern) then this is how I would approach it:

var GOOGLE_DOMAINS = ([
    '.google.com',
    '.google.ad',
    '.google.ae',
    '.google.com.af',
    '.google.com.ag',
    '.google.com.ai',
    '.google.am',
    '.google.it.ao',
    '.google.com.ar',
    '.google.as',
    '.google.at',
    '.google.com.au',
    '.google.az',
    '.google.ba',
    '.google.com.bd'
]).join('\n');

function isGoogleUrl(url) {
    // get the first ".google.<something>" run from the url
    var match = /\.google\.[^\/\\]+/i.exec(url);
    if (!match) return false;
    var domain = match[0];

    // create a regex to check to see if the domain is supported
    var re = new RegExp('^' + domain.replace(/\./g, '\\.') + '$', 'mi');
    return re.test(GOOGLE_DOMAINS);
}

This creates a regex based on the domain of your URL and uses it to test the list of domains.

Note: The GOOGLE_DOMAINS variable is just a string that holds the contents returned from the url you posted. There is no way for you to retrieve that string via AJAX or iframe because you cannot make such a request across domains. You'll have to hard code it or make a request server-side to retrieve that list.
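Once the list is hard-coded, a lookup by hostname suffix is one possible alternative to building a regex on every call. The sketch below assumes a runtime that provides the standard URL class. GOOGLE_SUFFIXES (a small subset of the list), isListedGoogleHost and isGoogleUrlByLookup are hypothetical names, not part of the answer above.

```javascript
// Small subset of the list, stored without the leading dot (illustration only).
var GOOGLE_SUFFIXES = { 'google.com': true, 'google.ba': true, 'google.com.au': true };

// True if the hostname, or any parent domain of it, is in the list.
function isListedGoogleHost(hostname) {
    var labels = hostname.toLowerCase().split('.');
    // Try every suffix: a.b.google.com, b.google.com, google.com, com
    for (var i = 0; i < labels.length; i++) {
        var candidate = labels.slice(i).join('.');
        if (Object.prototype.hasOwnProperty.call(GOOGLE_SUFFIXES, candidate)) {
            return true;
        }
    }
    return false;
}

function isGoogleUrlByLookup(url) {
    var parsed;
    try {
        parsed = new URL(url); // throws on strings that are not absolute URLs
    } catch (e) {
        return false;
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return false;
    return isListedGoogleHost(parsed.hostname);
}

console.log(isGoogleUrlByLookup('http://www.google.ba/the/page.html'));            // true
console.log(isGoogleUrlByLookup('http://www.malware.cn/www.google.ba/page.html')); // false
```

Because only the parsed hostname is compared, anything in the path or query string cannot influence the result.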

Prestaul
Are you sure this regex captures only the domain name and won't capture a ".google." appearing further along the URL?
Assaf Lavie
It will capture only the first dot-google-dot-something in the url. If there is not one then the function returns false, and if there is then it checks the domain list to see if the "google-ish" domain is in the list.
Prestaul
This needs to be tweaked, because it accepts isGoogleUrl('http://www.malware.cn/www.google.ba/page.html') and does not accept isGoogleUrl('http://google.com/').
Wimmel
+5  A: 

Here is an updated version of Prestaul's answer which solves the two problems I mentioned in the comment there.

var GOOGLE_DOMAINS = ([
    '.google.com',
    '.google.ad',
    '.google.ae',
    '.google.com.af',
    '.google.com.ag',
    '.google.com.ai',
    '.google.am',
    '.google.it.ao',
    '.google.com.ar',
    '.google.as',
    '.google.at',
    '.google.com.au',
    '.google.az',
    '.google.ba',
    '.google.com.bd'
]).join('\n');

function isGoogleUrl(url) {
    // get the 2nd level domain from the url ("google" must start a host label)
    var domain = /^https?:\/\/(?:[^\/]*\.)?(google\.[^\/\\]+)(?:\/|$)/i.exec(url);
    if(!domain) return false;

    domain = '.'+domain[1];
    // create a regex to check to see if the domain is supported
    var re = new RegExp('^' + domain.replace(/\./g, '\\.') + '$', 'mi');
    return re.test(GOOGLE_DOMAINS);
}

alert(isGoogleUrl('http://www.google.ba/the/page.html')); // true
alert(isGoogleUrl('http://some_mal_site.com/http://www.google.ba/')); // false
alert(isGoogleUrl('https://google.com.au/')); // true
alert(isGoogleUrl('http://www.google.com.some_mal_site.com/')); // false
alert(isGoogleUrl('http://yahoo.com/')); // false
Wimmel
+1  A: 

All the domains end in either "google.xx", "google.co.xx", or "google.com.xx" except "google.it.ao", "google.com", and "google.cat", so if you just look at the domain, this regular expression should work for most cases (it's not perfect, but it accepts all the listed domains, and rejects most other valid domains that happen to include "google"):

/^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com|cat)$/i

As a function you could do something like this:

function isGoogleUrl(url) {
    url = url.replace(/^https?:\/\//i, ''); // Strip "http://" or "https://" from the beginning
    url = url.replace(/\/.*/, ''); // Strip off the path
    return /^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com|cat)$/i.test(url);
}

You could simplify it if you use window.location.hostname:

function isGoogleUrl() {
    return /^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com|cat)$/i.test(window.location.hostname);
}

The only way this should allow a false positive is if there is a "google.(some other TLD)". For example, "google.tv" is not on the list (it redirects to google.com), but it would pass.

Edit: Like Wimmel pointed out, it also accepts invalid domains like "google.com.fr" which are not listed. It will basically accept any "google.whatever" domain name.
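The trade-off is easy to see by running the pattern over a handful of hostnames. In this sketch the hostnames and the names hostPattern and verdicts are invented for illustration, and the pattern includes `cat` as an explicit alternative so that .google.cat from the list is accepted.

```javascript
// Pattern from the answer; the `cat` alternative is included so that
// .google.cat from the list is accepted as well.
var hostPattern = /^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com|cat)$/i;

var hostnames = [
    'www.google.co.uk',                 // listed
    'google.com',                       // listed
    'google.it.ao',                     // listed
    'google.com.fr',                    // not listed, still accepted
    'google.tv',                        // not listed, still accepted
    'www.google.com.some_mal_site.com'  // rejected
];

var verdicts = {};
hostnames.forEach(function (host) {
    verdicts[host] = hostPattern.test(host);
});

console.log(verdicts);
```

Everything except the last hostname passes, which is the "any google.whatever" behaviour described in the edit above.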

Matthew Crumley
It looks to me like this allows google.com.fr, whereas only google.fr is valid.
Wimmel
It does. It's definitely not perfect. It's just the closest thing I could think of without listing each valid domain. Maybe I should further clarify that in the answer.
Matthew Crumley