This is for http://cssfingerprint.com

I have a largish database (~100M rows) of websites. This includes main domains (both 2LDs and 3LDs) and particular URLs scraped from those domains (whether hosted on the domain itself [like most blogs] or only linked from it [like Digg]), each with a reference to its host domain.
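To be concrete, each row looks roughly like this (a minimal sketch for illustration only; the field names are made up, not my actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Site:
    """One row: either a main domain (2LD/3LD) or a particular scraped URL."""
    url: str                       # e.g. "http://example.com/some/post"
    host_domain: str               # the domain it is hosted on / was scraped from
    alexa_rank: Optional[int]      # 1..1_000_000, or None if unranked
    quantcast_rank: Optional[int]  # 1..1_000_000, or None
    bloglines_rank: Optional[int]  # 1..1_000, or None
    technorati_rank: Optional[int] # 1..100, or None
    pagerank: Optional[int]        # Google PageRank 0..10, or None
    visit_count: int = 0           # aggregated visits by previous users
```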

I also scrape the Alexa top million, Bloglines top 1000, Google PageRank, Technorati top 100, and Quantcast top million rankings. Many domains have no ranking at all, though, or only a partial set; and nearly all sub-domain URLs have no ranking other than Google's 0-10 PageRank (some don't even have that).
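Part of the problem is that these signals are on very different scales (a 1..1M rank vs. a 0..10 PageRank) and most are missing for most rows. Here is a minimal sketch of the kind of normalization I mean, turning each available signal into a comparable [0, 1] score over the row sketch above; the log-rank transform and the weights are just placeholder assumptions, not tuned values:

```python
import math
from typing import Optional

def rank_score(rank: Optional[int], list_size: int) -> Optional[float]:
    """Map a 1-based rank in a top-N list to (0, 1], higher = more popular.
    The log transform makes rank 1 vs 10 matter more than 500,000 vs 500,010."""
    if rank is None:
        return None
    return 1.0 - math.log(rank) / math.log(list_size + 1)

def pagerank_score(pr: Optional[int]) -> Optional[float]:
    """Map Google's 0-10 PageRank onto [0, 1]."""
    return None if pr is None else pr / 10.0

def combined_score(site) -> float:
    """Weighted average over whichever signals are present for this row.
    The weights are guesses."""
    signals = [
        (rank_score(site.alexa_rank, 1_000_000), 1.0),
        (rank_score(site.quantcast_rank, 1_000_000), 1.0),
        (rank_score(site.bloglines_rank, 1_000), 0.5),
        (rank_score(site.technorati_rank, 100), 0.5),
        (pagerank_score(site.pagerank), 1.0),
    ]
    present = [(s, w) for s, w in signals if s is not None]
    if not present:
        return 0.0  # no ranking evidence at all for this row
    return sum(s * w for s, w in present) / sum(w for _, w in present)
```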

I can add any new scrapings necessary, assuming it doesn't require a massive amount of spidering.

I also have a fair amount of information about what sites previous users have visited.

What I need is an algorithm that orders these URLs by how likely a visitor is to have visited them, without any knowledge of the current visitor. (It can, however, use aggregated information about previous users.)
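To illustrate what I mean by using aggregated previous-user data without knowing anything about the current visitor: something like a smoothed visit fraction, shrunk toward the rank-derived score for rows few or no previous users have hit. This is only a sketch of the shape of answer I'm imagining (the pseudo-count and the use of `combined_score` from above as a prior mean are arbitrary assumptions):

```python
def a_priori_visit_probability(site, total_user_sessions: int,
                               pseudo_count: float = 5.0) -> float:
    """Beta-binomial-style smoothing: the observed visit fraction among previous
    users, shrunk toward the rank-derived score when there is little evidence.

    total_user_sessions: number of previous users whose histories were aggregated.
    pseudo_count: how many 'virtual users' the prior is worth (a guess)."""
    prior = combined_score(site)   # from the sketch above, in [0, 1]
    observed = site.visit_count    # previous users who visited this URL
    return (observed + pseudo_count * prior) / (total_user_sessions + pseudo_count)

# Ordering the whole table would then just be a sort by this score:
# sites.sort(key=lambda s: a_priori_visit_probability(s, total_user_sessions=n),
#            reverse=True)
```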

This question is just about the relatively fixed (or at least aggregated) a priori ranking; there's another question that deals with getting a dynamic ranking.

Given that I have limited resources (both computational and financial), what's the best way for me to rank these sites in order of a priori probability of their having been visited?