Top-k problem - finding the BEST k elements (3 or 1000) in a DB

There is a fundamental problem with relational DBs: to find the top k elements, you have to process ALL rows in the table, which makes them useless on big data.
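
To make concrete what "process ALL rows" means, here is the naive approach in Python (the rows and the scoring function are just illustrative placeholders):

    import heapq

    def naive_top_k(rows, score, k):
        # Naive top-k: every single row must be scored, i.e. a full table scan.
        return heapq.nlargest(k, rows, key=score)

    # A million rows, best 3 wanted; yet all 1,000,000 rows get touched.
    rows = [{"a": i % 97, "b": (i * 31) % 101} for i in range(1_000_000)]
    best3 = naive_top_k(rows, lambda r: 2 * r["a"] + r["b"], k=3)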

I'm building an application (for university research; not really my invention, I'm implementing and trying to improve the original idea) that finds the top k elements efficiently by visiting only 3-5% of the stored data, which makes it really fast.

There are even user preferences: for a given domain you can specify, per attribute, a value function that says what the best value is for the user, and an aggregation function that says which attributes matter most.


For example, a DB of cars with attributes (price, mileage, age of car, ccm, fuel/mile, type of car, ...). The user's weighting might be 10*price + 5*fuel/mile + 4*mileage + 1*age of car; (s)he doesn't care about the type of car or the other attributes. This is the aggregation specification.

Then for each attribute (price, mileage, ...) there can be a totally different "value function" that specifies the best value for the user. For example, price: the lower, the better; the value falls as the price rises, reaching 0 at $50k (the user doesn't want a car more expensive than $50k). Mileage: another function based on his/her criteria, and so on...
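
A minimal sketch in Python of what such a specification could look like (the exact function shapes, cutoffs, and attribute names are made up for illustration):

    # Hypothetical per-attribute value functions; each is piecewise linear.
    def price_value(price):        # lower is better, worthless from $50k up
        return max(0.0, 1.0 - price / 50_000.0)

    def mileage_value(miles):      # illustrative: value runs out at 200k miles
        return max(0.0, 1.0 - miles / 200_000.0)

    # Aggregation specification: the 10/5/4/1 weights from the example above.
    def score(car):
        return (10 * price_value(car["price"])
                + 5 * max(0.0, 1.0 - car["fuel_per_mile"] / 0.2)
                + 4 * mileage_value(car["mileage"])
                + 1 * max(0.0, 1.0 - car["age"] / 20.0))

    car = {"price": 12_000, "fuel_per_mile": 0.05, "mileage": 60_000, "age": 6}
    print(score(car))   # higher score = better car for this user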


You can see that there is quite a lot of freedom in specifying your preferences, and according to them the best k elements in the DB will be found quickly.
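
To illustrate how a top-k search can stop early, here is a sketch of the classic threshold algorithm of Fagin et al., which handles exactly this kind of monotone weighted aggregation. I stress this is only an illustration of the general idea, not necessarily my exact method:

    import heapq

    def threshold_top_k(table, attrs, weights, k):
        # table: dict id -> {attr: value in [0, 1]}, values already passed
        # through the user's value functions, so higher is always better.
        lists = {a: sorted(table, key=lambda i: table[i][a], reverse=True)
                 for a in attrs}
        full = lambda i: sum(weights[a] * table[i][a] for a in attrs)

        best, seen = [], set()                 # best: min-heap of (score, id)
        for depth in range(len(table)):
            for a in attrs:                    # sorted access, round-robin
                i = lists[a][depth]
                if i not in seen:
                    seen.add(i)
                    heapq.heappush(best, (full(i), i))   # random access
                    if len(best) > k:
                        heapq.heappop(best)
            # Threshold: the best score any still-unseen object could have.
            t = sum(weights[a] * table[lists[a][depth]][a] for a in attrs)
            if len(best) == k and best[0][0] >= t:
                break                          # top k is final; stop early
        return sorted(best, reverse=True)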

I've spent many sleepless nights thinking about real-life usability. Who could benefit from such a query DB? But I failed to come up with anything and am stuck in a purely academic, write-only stance. :-( I hope there is some real use for it, but I don't see any....

.... do YOU have any idea how to use this in real life, on a real problem, etc.?


I'd love to hear from you.

+2  A: 

Have a database of people's CVs and establish hiring criteria for different jobs, allowing for a dynamic display of the top k candidates.

Also, given how fast your solution is, you could exploit it for rendering near real-time graphs of highly dynamic data, like stock market quotes, or even for applications in molecular or DNA-related studies.

New idea: perhaps your research might have applications in clustering, where you would use it to implement fast k-nearest-neighbor clustering by complex criteria without having to scan the whole data set each time. This would lead to faster clustering of larger data sets with respect to more complex criteria when picking the k-NN for each data node.

luvieere
+1: This is a great idea! :)
DinGODzilla
+1  A: 

There are unlimited possible real-world scenarios. Getting the top-n values is needed all the time.

But I highly doubt that it's possible to get the top-n objects without having an index. An index can only be built if the properties to be searched are known before the search. And if that's the case, a simple index in a relational database can provide the same functionality.
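
A quick SQLite sketch of that point (table and values made up): a fixed, known-in-advance criterion is served straight from a B-tree index, while an ad hoc per-user score forces a full scan:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY,"
                 " price REAL, mileage REAL)")
    conn.executemany("INSERT INTO cars (price, mileage) VALUES (?, ?)",
                     [(9500, 120e3), (31000, 20e3), (12000, 90e3),
                      (48000, 5e3), (7800, 180e3)])
    conn.execute("CREATE INDEX idx_price ON cars(price)")

    # Criterion known ahead of time: the index is walked in order, no scan.
    print(conn.execute("SELECT id FROM cars"
                       " ORDER BY price LIMIT 3").fetchall())

    # Ad hoc per-user score: no index matches the expression, full scan.
    print(conn.execute("SELECT id FROM cars"
                       " ORDER BY 10*price + 4*mileage LIMIT 3").fetchall())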

Georg
My solution doesn't use a relational database. I have to import the data from a relational DB into my own DB (implemented via BerkeleyDB as multidimensional B+ trees). In a relational DB there is no way to do this; you basically have to process all elements to find the best k.
DinGODzilla
There is even a way to query with multi-user preferences. The more specific the query, the faster the answer is found. The only restriction is that the functions have to be continuous and piecewise linear.
DinGODzilla
A: 

It's used in financial organizations all the time: you need to see the most profitable / least profitable assets, etc.

Yacoder
Yes, but these highly specific characteristics can be computed ahead of time, because they probably won't change much (or at all), so a better approach could be used. Our solution is for an unlimited (and multi-user) variety of top-k queries.
DinGODzilla
If it's FOREX trading or portfolio management, where real-time prices arrive every N seconds, not so much, I would say. Different traders / portfolio managers can have their own user preferences as well.
Yacoder