This is for http://cssfingerprint.com

I have a system (see the About page on the site for details) where:

  • I need to output a ranked list, with confidences, of categories that match a particular feature vector
  • each binary feature vector is a list of site IDs, each paired with whether this session detected a hit for it
  • feature vectors are, for a given categorization, somewhat noisy (sites will decay out of history, and people will visit sites they don't normally visit)
  • categories are a large, non-closed set (user IDs)
  • my total feature space is approximately 50 million items (URLs)
  • for any given test, I can only query approx. 0.2% of that space
  • I can only make the decision of what to query, based on results so far, ~10-30 times, and must do so in <~100ms (though I can take much longer to do post-processing, relevant aggregation, etc)
  • getting the AI's probability ranking of categories based on results so far is mildly expensive; ideally the decision will depend mostly on a few cheap sql queries
  • I have training data that can say authoritatively that two feature vectors belong to the same category, but never that they belong to different ones (people sometimes forget their codes and use new ones, thereby creating a new user ID)

I need an algorithm to determine which features (sites) are most likely to have a high ROI to query (i.e. to better discriminate between plausible-so-far categories [users], and to increase certainty that it's any given one).
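
One way to make "ROI" concrete, as a sketch only: score each candidate site by its expected information gain, i.e. how much querying it is expected to reduce the entropy of the current distribution over users. Everything named here is hypothetical: `posterior` stands in for the probability ranking of users given results so far, and `hit_rate` for a per-user estimate of how likely that site is to show up as a hit. Neither is something the system is known to expose in this form.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(posterior, hit_rate):
    """Expected entropy reduction over users from querying one site.

    posterior: dict user_id -> P(user | results so far)      (hypothetical input)
    hit_rate:  dict user_id -> P(site is a hit | that user)  (hypothetical estimate)
    """
    users = list(posterior)
    p_hit = sum(posterior[u] * hit_rate[u] for u in users)
    h_before = entropy(posterior[u] for u in users)

    h_after = 0.0
    outcomes = (
        (p_hit, lambda u: hit_rate[u]),            # the query comes back as a hit
        (1.0 - p_hit, lambda u: 1.0 - hit_rate[u]),  # the query comes back as a miss
    )
    for outcome_prob, likelihood in outcomes:
        if outcome_prob > 0:
            # Bayes update of the user distribution for this outcome.
            updated = [posterior[u] * likelihood(u) / outcome_prob for u in users]
            h_after += outcome_prob * entropy(updated)
    return h_before - h_after


def rank_sites(posterior, site_hit_rates, k):
    """Return the k site IDs with the highest expected information gain."""
    return sorted(
        site_hit_rates,
        key=lambda s: expected_information_gain(posterior, site_hit_rates[s]),
        reverse=True,
    )[:k]
```

A site that every plausible user is equally likely to have visited scores zero, while a site that splits the plausible users evenly scores highest. In practice the sums would have to be restricted to the few plausible-so-far users and precomputed or approximated to fit the time budget.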

This needs to balance exploitation (testing based on prior test data) against exploration (testing sites that haven't been tested enough to know how they perform).
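
For illustration, the trade-off can be sketched with a UCB1-style score borrowed from the multi-armed bandit literature: each site's observed usefulness plus a bonus that shrinks the more often the site has been tested. This is an assumption about how the problem could be framed, not a known solution to it; `mean_gain`, `times_tested`, and the weight `c` are hypothetical names.

```python
import math


def ucb_score(mean_gain, times_tested, total_tests, c=1.0):
    """UCB1-style score: observed usefulness plus an exploration bonus.

    mean_gain:    average usefulness observed for this site so far (hypothetical)
    times_tested: how many times this site has been queried
    total_tests:  total number of queries across all sites
    c:            weight of the exploration bonus (hypothetical tuning knob)
    """
    if times_tested == 0:
        return math.inf  # never-tested sites are always tried first
    return mean_gain + c * math.sqrt(2.0 * math.log(total_tests) / times_tested)


def pick_sites(stats, k, c=1.0):
    """stats: dict site_id -> (mean_gain, times_tested). Returns the top-k site IDs."""
    total = sum(n for _, n in stats.values())
    return sorted(
        stats,
        key=lambda s: ucb_score(stats[s][0], stats[s][1], total, c),
        reverse=True,
    )[:k]
```

With 50 million URLs, giving every untested site an infinite score is clearly impractical, so the untested case would need a finite prior, for example one derived from an a priori ranking.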

There's another question that deals with a priori ranking; this one is specifically about a posteriori ranking based on results gathered so far.

Right now, I have little enough data that I can just always test everything that anyone else has ever gotten a hit for, but eventually that won't be the case, at which point this problem will need to be solved.

I imagine that this is a fairly standard problem in AI - having a cheap heuristic for what expensive queries to make - but it wasn't covered in my AI class, so I don't actually know whether there's a standard answer. So, relevant reading that's not too math-heavy would be helpful, as well as suggestions for particular algorithms.

What's a good way to approach this problem?