tags:

views:

484

answers:

5

Hi All,

I am wondering if there is a parser or library in java for extracting the second level domain (SLD) in an URL - or failing that an algo or regex for doing the same. For example:

URI uri = new URI("http://www.mydomain.ltd.uk/blah/some/page.html");

String host = uri.getHost();

System.out.println(host);

which prints:

mydomain.ltd.uk

Now what I'd like to do is robustly identify the SLD ("ltd.uk") component. Any ideas?

Edit: I'm ideally looking for a general solution, so I'd match ".uk" in "police.uk", ".co.uk" in "bbc.co.uk" and ".com" in "amazon.com".

Thanks

A: 

If you want the second-level domain, you can split the string on "." and take the last two parts. Of course, this assumes you always have a second-level domain that is not specific to the site (since it sounds like that's what you want).

danben
sadly, this is a general problem - so I need to match .com, .co.uk, .uk (e.g as in police.uk) etc
Richard
In that case, you could try to build a a hash from this list - https://wiki.mozilla.org/TLD_List and then check the last two parts for a second level domain match, otherwise take the last part as the TLD.
danben
yes i don't think there's a programmatic solution to this beyond building a big list to match against
Richard
A: 

I don't have an answer for your specific case — and Jonathan's comment points out that you should probably refactor your question.

Still, I suggest taking a look at the Reference class of the Restlet project. It has a ton of useful methods. And since Restlet is Open Source, you wouldn't have to use the entire library — you could download the source and add just that one class to your project.

Avi Flax
A: 
Pattern getSLD = Pattern.compile("(?:^|\\.)([^.]+\\.[^.]+)$");
Jonathan Feinberg
+5  A: 

Don't know your purpose but Second-Level Domain may not mean much to you. You probably need to find public suffix and the domain right below it is what you are looking for.

Apache Http Component (HttpClient 4) comes with classes to handle this,

org.apache.http.impl.cookie.PublicSuffixFilter
org.apache.http.impl.cookie.PublicSuffixListParser

You need to download the public suffix list from here,

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/effective%5Ftld%5Fnames.dat?raw=1

ZZ Coder
yup that'll do nicely cheers
Richard
I found that list earlier but didn't bother posting it because it doesn't contain ltd.uk. Will this still do what you want?
danben
@danben: yeah I'll have to add the ltd.uk (which i was using only by way of example). Basically this solution comes down to using a list, and this seems reasonably comprehensive. I marked as correct this answer for including the parser.
Richard
Yeah, I didn't know about public suffixes either. +1 from me.
danben
A: 
  1. the mentioned list + reading the wikipedia updates gives a 98% correct TLD list
  2. going yourself through http://www.iana.org/domains/root/db/ and click each nic and see the latest news gives you the other 2% (like .com.aq and .gov.an)
  3. unfortunately large "free webspace" providers are another thing to take into account e.g. the countless *.blogspot.com domains, if you download the alexa top 100.000 (free csv file) you can at least get a good overview of the most used of these that should get you for a certain percentage covered for these domains (e.g. when comparing alexa rating with stumbleupon pageviews with delicious bookmarks) (alexa sometimes only takes the topdomain while delicious really md5's every url, so 1 alexa --> multiple delicious md5 hashes
  4. apart from that sometimes in the case of twitter, that what goes after the / is also of importance if you are looking for uniqueness to rate something.

Here is a list of the Alexa top 40.000 when the real TLD's are filtered out to give you a feeling: (which means Alexa does NOT count the rating together for the domain for the following) :

bp.blogspot.com---espn.go.com---files.wordpress.com---abcnews.go.com---disney.go.com---troktiko.blogspot.com---en.wordpress.com---api.ning.com---abc.go.com---220.181.38.82---213.174.154.20---abclocal.go.com---feedproxy.google.com/~r---forums.wordpress.com---googleblog.blogspot.com---1.cnm999.com/user/10008---213.174.143.196---92.42.51.201---googlewebmastercentral.blogspot.com---myespn.go.com---213.174.143.197---61.132.221.146---support.wordpress.com---dashboard.wordpress.com---sethgodin.typepad.com---paygo.17zhifu.com/user/10005---go2.wordpress.com---1.1.1.1---movies.go.com---home.comcast.net---googlesystem.blogspot.com---abcfamily.go.com---home.spaces.live.com---196.1.237.210---kaixin001.com/~record---xhamster.com/user/video---gold-oil-commodity.blogspot.com---journeyplanner.tfl.gov.uk/user/XSLT_TRIP_REQUEST2---206.108.48.238---blog.wordpress.com---67.220.92.21---183.101.80.130---211.94.190.80---youtube-global.blogspot.com---uta-net.com/user/phplib---cinema3satu.blogspot.com---119.147.41.16---sites.google.com/site/sites---kk.iij4u.or.jp/~dyo---220.181.6.19---toontown.go.com---signup.wordpress.com---thesartorialist.blogspot.com---analytics.blogspot.com---ss.iij4u.or.jp/~ceh2---67.220.92.23---gmailblog.blogspot.com---183.99.121.86---vgorode.ru/user/create---61.132.216.243---217.175.53.72---labnol.blogspot.com---adsense.blogspot.com---subscribe.wordpress.com---fimotro.blogspot.com---creators.ning.com---sarkari-naukri.blogspot.com---search.wordpress.com---orange-hiyoko.blogspot.com---cashewmaniakpop.wordpress.com---pixiehollow.go.com---adwords.blogspot.com---202.53.226.102---lorelle.wordpress.com---homestead.com/~site---multiply.com/user/signout---221.231.148.249---183.101.80.77---windowsliveintro.spaces.live.com---124.228.254.234---streaming-web.blogspot.com---id.tianya.cn/user/message---familyfun.go.com---tro-ma-ktiko.blogspot.com---about.ning.com---paygo.17zhifu.com/user/10020---tututina.blogspot.com---toolserver.org/~geohack---superjob.ru/user/resume---ejobs.ro/user/locuri-de-munca---gnula.blogspot.com---alles.or.jp/~uir---chiark.greenend.org.uk/~sgtatham---woork.blogspot.com---88.208.32.218---webstreamingmania.blogspot.com---spaces.live.com---youtube.com/user/RayWilliamJohnson---cloob.com/user/login---asstr.org/~Kristen---getclicky.com/user/login---guesshermuff.blogspot.com---211.98.70.195---222.73.105.196---pp.iij4u.or.jp/~taakii---unsoloclic.blogspot.com---photoshopdisasters.blogspot.com---218.83.161.253---217.16.18.163---217.16.18.207---217.16.28.104---222.73.105.210---youtube.com/user/OldSpice---hubpages.com/user/new---pelisdvdripdd.blogspot.com---95.143.193.60---es.wordpress.com---217.16.18.206---61.147.116.146---damncoolpics.blogspot.com---family.go.com---81.176.235.162---gutteruncensorednewsr.blogspot.com---terselubung.blogspot.com---faisalardhy.blogspot.com---67.220.92.14---goodreads.com/user/show---116.228.55.34---profile.typepad.com---kaixin001.com/~truth---linkbuildersassociated.ning.com---nicotto.jp/user/mypage---ritemail.blogspot.com---hyperboleandahalf.blogspot.com---carscoop.blogspot.com---tubemogul.com/user/dash---press-gr.blogspot.com---81.176.235.164---soapnet.go.com---208.98.30.69---trelokouneli.blogspot.com---help.ning.com---id.tianya.cn/user/register---slovari.yandex.ru/~%D0%BA%D0%BD%D0%B8%D0%B3%D0%B8---printable-coupons.blogspot.com---unic77.blogspot.com---globaleconomicanalysis.blogspot.com---183.101.80.68---221.194.33.60---doujin-games88.blogspot.com---magaseek.com/user/SearchProducts---files.posterous.com---wwwnew.splinder.com---kolom-tutorial.blogspot.com---strobist.blogspot.com---67.21.91.73---needanarticle.com/user/activity---forum.moe.gov.om/~moeoman---milasdaydreams.blogspot.com---88.208.17.189---67.220.92.22---115.238.100.211---nonews-news.blogspot.com---testosterona.blog.br---nn.iij4u.or.jp/~has---cs.tut.fi/~jkorpela---youtube.com/user/oldspice---67.159.53.25---taxalia.blogspot.com---208.98.30.70---filmesporno.blog.br---alles-schallundrauch.blogspot.com---vatera.hu/user/account---78.140.136.182---us.my.alibaba.com/user/join---stores.homestead.com---pes2008editing.blogspot.com---ocn.ne.jp/~matrix---adweek.blogs.com---115.238.55.94---markjaquith.wordpress.com---k3.dion.ne.jp/~dreamlov---38.99.186.222---film.tv.it---android-developers.blogspot.com---217.218.110.147---kadokado.com/user/login---bollyvideolinks4u.blogspot.com---sookyeong.wordpress.com---87.101.230.11---livecodes.blogspot.com---67.220.91.19---homepage2.nifty.com/bustered---pp.iij4u.or.jp/~manga100---110.173.49.202---erogamescape.dyndns.org/~ap2---cs.berkeley.edu/~lorch---cakewrecks.blogspot.com---59.106.117.185---119.75.213.61---id.wordpress.com---de.wordpress.com---telefilmdblink.blogspot.com---61.139.105.138---multiply.com/user/join---programseo.blogspot.com---collectivebias.ning.com---bablorub.blogspot.com---thinkexist.com/user/personalAccount---us.my.alibaba.com/user/sign---66.70.56.90---getsarkari-naukri.blogspot.com---59.106.117.183---productreviewplace.ning.com---support.weebly.com---kaixin001.com/~lucky---football-russia.blogspot.com---magaseek.com/user/ItemDetail---polprav.blogspot.com---atlasshrugs2000.typepad.com---jpn-manga.blogspot.com---88.208.32.219---google-latlong.blogspot.com---59.106.117.188---erogamescape.ddo.jp/~ap2---218.87.32.245---watchhorrormovies.blogspot.com---sarotiko.blogspot.com---googlewebmastercentral-de.blogspot.com---colmeia.blog.br---us.my.alibaba.com/user/webatm---220.170.79.109---darkville.blogspot.com---youtube.com/user/PiMPDailyDose---disneymovierewards.go.com---fukuoka.lg.jp---61.147.115.16---iisc.ernet.in---youtube.com/user/HuskyStarcraft---202.108.212.211---homepage3.nifty.com/otakarando---94.77.215.37---pitchit.ning.com---59.106.117.186---thestar.blogs.com---1.254.254.254---piratesonline.go.com---animedblink.blogspot.com---137.32.44.152---eurus.dti.ne.jp/~yoneyama---state.la.us---lastminute.is.it---bangpai.taobao.com/user/groups---csse.monash.edu.au/~jwb---jquery-howto.blogspot.com---sakura.ne.jp/~moesino---users.skynet.be/mgueury---saitama.lg.jp---portaldasfinancas.gov.pt---bnonline.fi.cr---135.125.60.11---zhuhai.gd.cn---kuna.net.kw---59.175.213.77---58.218.199.7---multiply.com/user/signin---youtube.com/user/HDstarcraft---blinklist.com/user/join---us.my.alibaba.com/user/company---jptwitterhelp.blogspot.com---67.220.92.017---88.208.17.51---youtube.com/user/GoogleWebmasterHelp---208.53.156.229---filmdblink.blogspot.com---blinklist.com/user/signup---3arbtop.blogspot.com---attivissimo.blogspot.com---onlinemovie12.blogspot.com---98.126.189.86---mytvsource.blogspot.com---blinklist.com/user/login---googlejapan.blogspot.com---76.73.65.166---gutteruncensorednewsb.blogspot.com---issuu.com/user/upload---86.51.174.18---88.208.17.120---profile.china.alibaba.com/user/admin---jntuworldportal.blogspot.com---sz.js.cn---disneymovieclub.go.com---a1.com.mk---dd.iij4u.or.jp/~madonna---rr.iij4u.or.jp/~plasma---mlmlaunchformula.ning.com---112.78.7.151---blogdelatele.blogspot.com---googlemobile.blogspot.com---78.109.199.240---wsu.edu/~brians---internapoli-city.blogspot.com---hh.iij4u.or.jp/~dmt---kaixin001.com/~house---61.155.11.14---youtube.com/user/SHAYTARDS---turbobit.net/user/files---qjy168.com/user/do---hubpages.com/user/finished---upload2.dyndns.org---f32.aaa.livedoor.jp/~azusa---naruto-spoilers.blogspot.com---205.209.140.195---193.227.20.21---adsenseforfeeds.blogspot.com---group.ameba.jp/user/groups---

edelwater