Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:

ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...

Any suggestions for accomplishing this efficiently and effectively?

Edit: I'd like to write this in PHP.

A: 

You would have to run a dictionary engine against the domain entry to find valid words, and then run that dictionary engine against the result to ensure the result consists of valid words.

Examples of how this would be done?
Kevin
You need to obtain a dictionary list. Then you need to convert that list into a format that is friendly to you, whether that be an array of a certain syntax, a CSV list, or whatever. Then you need to write a program that evaluates your domain entry against that list.
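For example, a minimal PHP sketch of those steps (the file path and names are illustrative):

```php
<?php
// Load a word list (one word per line, e.g. /usr/share/dict/words)
// into a hash set so membership tests are O(1).
$dictionary = array();
foreach (file('/usr/share/dict/words', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $word) {
    $dictionary[strtolower($word)] = true;
}

// A candidate substring is a valid word iff it is in the set.
var_dump(isset($dictionary['cheese'])); // bool(true) with a typical word list
```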
+1  A: 

If you have a list of valid words, you can loop through your domain string and try to cut off a valid word each time with a backtracking algorithm. If you manage to consume the whole string with valid words, you are finished. Be aware that the time complexity of this is not optimal :)
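A minimal PHP sketch of this backtracking idea, assuming a `$dictionary` hash set of valid words like the one loaded in the answer above (names are illustrative):

```php
<?php
// Recursively peel a valid word off the front of $name; backtrack when
// the remainder cannot be segmented. Returns an array of words, or null.
function splitDomain($name, array $dictionary) {
    if ($name === '') {
        return array(); // consumed the whole string: success
    }
    // Try the longest prefix first so "experts" is preferred over "expert s...".
    for ($len = strlen($name); $len >= 1; $len--) {
        $head = substr($name, 0, $len);
        if (isset($dictionary[$head])) {
            $rest = splitDomain((string) substr($name, $len), $dictionary);
            if ($rest !== null) {
                return array_merge(array($head), $rest);
            }
            // Otherwise backtrack and try a shorter prefix.
        }
    }
    return null; // no prefix works here: caller must backtrack
}

// e.g. implode(' ', splitDomain('ilikecheese', $dictionary))
// gives "i like cheese" with a dictionary containing those words.
```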

Zed
Not to say disastrous.
Dykam
True, but this could have been implemented in an hour, and he would already be five days ahead in splitting the domain names of the world :)
Zed
+2  A: 

choosespain.com
kidsexpress.com
childrenswear.com
dicksonweb.com

Have fun (and a good lawyer) if you are going to try to parse the URL with a dictionary.

You might do better if you can find the same characters, separated by whitespace, on their website.

Other possibilities: extract data from the SSL certificate; query the top-level domain (TLD) name server; or use one of the "whois" tools or services (just google "whois").
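For the whois route, a one-line sketch assuming the standard command-line `whois` client is installed:

```php
<?php
// Fetch the raw whois record for a domain via the system whois client.
$record = shell_exec('whois ' . escapeshellarg('sanfranciscohotels.com'));
echo $record;
```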

Dipstick
The domains I'm analyzing usually don't have a site. I'm not sure what you mean by using DNS/whois to find the keywords used in the domains.
Kevin
don't forget penisland.com ;)
Charles Ma
@Kevin. Traffic on the internet isn't routed using names but IP addresses, e.g. 213.171.218.121. A domain name server translates the name to the IP address; the TLD servers are the name servers for the top-level domains. In order to get a domain name, a company has to register it, and there are ways to get hold of some of that registration information. Obviously this isn't true for domains on a private network, so it might not be applicable to what you are trying to do.
Dipstick
I understand that, I just don't understand how any of that helps achieve my goal of extracting the keywords used in a domain name.
Kevin
+3  A: 

Might want to check out this SO question.

Zed
This is the closest I've seen to a solution :) Bounty is teetering in your direction.
Kevin
Dude, that should so be MY bounty :-) I guess I'll run the Perl script and let you know what it extracts...
SquareCog
+3  A: 

You need to develop a heuristic that will get likely matches out of the domain. The way I would do it is to first find a large corpus of text; for example, you could download Wikipedia.

Next take your corpus, and combine every two adjacent words. For example, if your sentence is:

quick brown fox jumps over the lazy dog

You'll create a list:

quickbrown
brownfox
foxjumps
jumpsover
overthe
thelazy
lazydog

Each of these would have a count of one. As you parse your corpus, you'll keep track of the frequency of every two-word pair. Additionally, for each pair, you'll need to store what the original two words were.

Sort this list by frequency, and then attempt to find matches in your domain based on these words.

Lastly, do a domain check for the top two-word phrases which aren't registered!
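A rough PHP sketch of the pair-counting step above, assuming `$corpus` already holds the corpus text (names are illustrative):

```php
<?php
// Count how often each adjacent word pair occurs, keyed by the pair
// joined with no separator, and remember how each joined pair splits.
$counts = array(); // "sanfrancisco" => frequency
$splits = array(); // "sanfrancisco" => array("san", "francisco")

$words = preg_split('/[^a-z]+/', strtolower($corpus), -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($words) - 1; $i++) {
    $joined = $words[$i] . $words[$i + 1];
    $counts[$joined] = isset($counts[$joined]) ? $counts[$joined] + 1 : 1;
    $splits[$joined] = array($words[$i], $words[$i + 1]);
}
arsort($counts); // highest-frequency joined pairs first

// A domain that appears as a key, e.g. $splits['sanfrancisco'],
// can be split directly into its two original words.
```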

I think sites like DomainTools take a list of the highest-ranking words. They then try to parse those words out first. Depending on the purpose, you may want to consider using MTurk to do the job: different people will parse the same words differently, and might not do so in proportion to how common the words are.

brianegge
I'm not sure MTurk would be the right tool for the job as I'll be processing thousands of domains per day. However, I do like the method you suggested!
Kevin
The Viterbi algorithm is probably much better than the one I suggested. WRT MTurk, it depends on how much value your word splitting adds. You could have an MTurk HIT split 10 names at $0.015 apiece; that's about $15 per 1,000 names. If the 'expert sex change' site had tested their domain a bit first, they might have started with the hyphen in the middle.
brianegge
+6  A: 

OK, I ran the script I wrote for this SO question, with a couple of minor changes -- using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.

For my corpus I downloaded a bunch of files from Project Gutenberg -- no real method to this, just grabbed all English-language files from etext00, etext01, and etext02.
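The original is a Perl script; a PHP sketch of the log-probability scoring idea, using the `$DICT`/`$TOTAL` names from the snippet quoted in the comments below:

```php
<?php
// Score a candidate segmentation by summing log-probabilities instead of
// multiplying raw probabilities, which underflows for long word sequences.
// $DICT maps word => corpus count; $TOTAL is the total corpus word count.
function scoreSegmentation(array $words, array $DICT, $TOTAL) {
    $score = 0.0;
    foreach ($words as $w) {
        if (!isset($DICT[$w])) {
            return -INF; // unseen word: rule this segmentation out
        }
        $score += log($DICT[$w] / $TOTAL); // each term is negative
    }
    return $score; // closer to zero means a more probable split
}
```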

Below are the results; I saved the top three for each combination.

expertsexchange: 97 possibilities
 -  experts exchange -23.71
 -  expert sex change -31.46
 -  experts ex change -33.86

penisland: 11 possibilities
 -  pen island -20.54
 -  penis land -22.64
 -  pen is land -25.06

choosespain: 28 possibilities
 -  choose spain -21.17
 -  chooses pain -23.06
 -  choose spa in -29.41

kidsexpress: 15 possibilities
 -  kids express -23.56
 -  kid sex press -32.65
 -  kids ex press -34.98

childrenswear: 34 possibilities
 -  children swear -19.85
 -  childrens wear -25.26
 -  child ren swear -32.70

dicksonweb: 8 possibilities
 -  dickson web -27.09
 -  dick son web -30.51
 -  dicks on web -33.63
SquareCog
Looks like the only one that didn't pass was "childrens wear," but I wasn't expecting this automation to be 100% accurate. And I agree, you do deserve the bounty :) Would you mind posting the source with your changes?
Kevin
It's basically the same as on my blog -- just replace the multiplication in find_word_seq_score with `$score += log($DICT->{$_}/$TOTAL);`
SquareCog
Oh and technically "childrens" is not a word, so no surprise there :-).
SquareCog