I am working towards building an index of URLs. The objective is to build and store a data structure whose key is a domain URL (e.g. www.nytimes.com) and whose value is a set of features associated with that URL. I am looking for your suggestions for this set of features. For example, I would like to store www.nytimes.com as follows:

[www.nytimes.com: [lang:en, alexa_rank:96, content_type:news, spam_probability:0.0001, etc.]]
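A minimal sketch of that structure in Python, assuming a plain in-memory dict (the feature names are just the illustrative ones from the example above, not a fixed schema):

```python
# Index: domain -> feature map. Feature names/values are illustrative.
url_index = {
    "www.nytimes.com": {
        "lang": "en",
        "alexa_rank": 96,
        "content_type": "news",
        "spam_probability": 0.0001,
    },
}

def add_feature(index, domain, feature, value):
    """Attach (or overwrite) one feature on a domain's record."""
    index.setdefault(domain, {})[feature] = value

# Enrich an existing record; a new domain would get a fresh record.
add_feature(url_index, "www.nytimes.com", "num_pages", 1_000_000)
```

For clustering later, each record can then be flattened into a feature vector per domain.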

Why am I building this? Well, the ultimate goal is to do some interesting things with this index; for example, I may do clustering on it and find interesting groups, etc. I have a whole lot of text, generated by a whole lot of URLs over a whole lot of time :) So data is not a problem.

Any suggestions are very welcome.

A: 

Make it work first with what you've already suggested. Then start adding features suggested by everybody else.

ideas are worth nothing unless executed.

-- http://www.codinghorror.com/blog/2010/01/cultivate-teams-not-ideas.html

John K
I have a working version of what I just mentioned (except spam probability and content type). I have created a map-reduce job which does this for me. Sorry, I forgot to mention that :) Now I need to enrich the set of features.
shrijeet
A: 

Hi,

My first answer, so please bear with me...

I would maybe start here: Google's white papers on IR.

Then also search Google for other white papers on IR.

Also a few things to add to your index:

  1. Subdomains associated with the domain
  2. IP addresses associated with the domain
  3. Average page speed
  4. Links pointing at the domain - e.g. link:nytimes.com on Yahoo search
  5. Number of pages on the domain - site:nytimes.com on Google
  6. Traffic numbers on compete.com or Google Trends
  7. Whois info - e.g. age of domain, length of time registered, etc.

Some other places to research: http://www.majesticseo.com/, http://www.opensearch.org/Home and http://www.seomoz.org - they all have their own indexes.

I'm sure there's plenty more, but hopefully the IR stuff will get the cogs whirring :)

Ke
Thanks for answering; it gave me some insight into the problems lying ahead. One of them is subdomain-to-domain mapping. My initial experiments highlighted this problem. I am looking for approaches to solve this issue (mapping subdomain --> domain); if you have any ideas, please share.
shrijeet
Here is what I mean: mjimenez0.gizmodo.com 99 <--, ichsagpop.wordpress.com 99, misterdna.gizmodo.com 94 <--, wwww.gizmodo.com 93 <--, us.gizmodo.com 91 <--, blogs.sun.com 91, redkitten.gizmodo.com 90 <--
shrijeet
I guess there are a multitude of ways to go about this. You will probably want to be able to view info for both the subdomain and the domain. Programmatically, you will need to identify the domain within the subdomain. This is usually straightforward - the domain sits just before the TLD (though multi-part TLDs like .co.uk complicate things) - so in your favourite language you can strip out/identify the domain. TLD extensions you could also strip out for analysis. How you store this info is up to you, but you'll probably want to see 1) info on just the domain, 2) info on the aggregate of all subdomains and the domain, 3) info on each subdomain, and perhaps lastly look at TLD info.
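The stripping step above can be sketched as follows. This is a toy heuristic, not a full solution: a real implementation should consult the Public Suffix List (e.g. via the `tldextract` package); the tiny suffix set below is an illustrative sample only.

```python
from collections import defaultdict

# Illustrative sample of multi-part public suffixes; a real system
# would load the full Public Suffix List instead.
MULTI_PART_SUFFIXES = {"co.uk", "com.au", "co.jp"}

def registrable_domain(hostname: str) -> str:
    """Map a hostname to its parent domain, e.g. us.gizmodo.com -> gizmodo.com."""
    labels = hostname.lower().rstrip(".").split(".")
    # Known multi-part suffix: keep three labels (news.bbc.co.uk -> bbc.co.uk).
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_PART_SUFFIXES:
        return ".".join(labels[-3:])
    # Otherwise keep the last two labels.
    return ".".join(labels[-2:])

# Group the hostnames from the comment above under their parent domain:
hosts = ["mjimenez0.gizmodo.com", "us.gizmodo.com", "blogs.sun.com"]
groups = defaultdict(list)
for h in hosts:
    groups[registrable_domain(h)].append(h)
```

With the grouping in hand, features can be stored per subdomain and aggregated up to the domain record, covering views 1)-3) above.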
Ke