I am working on building an index of URLs. The objective is to build and store a data structure whose keys are domains (e.g. www.nytimes.com) and whose values are sets of features associated with each domain. I am looking for suggestions for this set of features. For example, I would like to store www.nytimes.com as follows:
[www.nytimes.com: [lang:en, alexa_rank:96, content_type:news, spam_probability:0.0001, etc.]]
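To make the layout concrete, here is a minimal Python sketch of the kind of structure I have in mind. The names `DomainFeatures`, `normalize_domain`, and `add_domain` are hypothetical, made up for illustration, and the four fields are only the ones from the example above; the real feature set is exactly what I am asking about.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass
class DomainFeatures:
    """Features stored per domain (hypothetical; only the fields from the example)."""
    lang: Optional[str] = None
    alexa_rank: Optional[int] = None
    content_type: Optional[str] = None
    spam_probability: Optional[float] = None


def normalize_domain(domain: str) -> str:
    """Trim whitespace and lowercase so the same domain always maps to one key."""
    return domain.strip().lower()


def add_domain(index: Dict[str, DomainFeatures], domain: str,
               features: DomainFeatures) -> None:
    """Insert or overwrite the feature record for a domain."""
    index[normalize_domain(domain)] = features


# The index itself: domain -> features.
index: Dict[str, DomainFeatures] = {}

add_domain(
    index,
    "www.nytimes.com",
    DomainFeatures(lang="en", alexa_rank=96, content_type="news",
                   spam_probability=0.0001),
)

# asdict() gives a plain dict, convenient for serializing or feeding a clusterer.
print(asdict(index["www.nytimes.com"]))
```

Features that are unknown for a domain simply stay `None`, so new features can be added later without touching existing records.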
Why am I building this? The ultimate goal is to do some interesting things with this index; for example, I may run clustering on it to find interesting groups. I have a large amount of text that was generated by a large number of URLs over a long period of time :) so data is not a problem.
Suggestions of any kind are very welcome.