views: 89

answers: 3

I have a large number of fixed-size records. Each record has many fields; ID and Value are among them. I am wondering what kind of data structure would be best so that I can

  1. locate a record by ID (unique) very fast,

  2. list the 100 records with the largest values.

A max-heap seems to work, but it's far from perfect; do you have a smarter solution?

Thank you.

A: 

A max-heap would match the second requirement, but a hash map or a balanced search tree would be better for the first one. Make the choice based on the frequency of these operations: how often would you need to locate a single item by ID, and how often would you need to retrieve the top 100 items?

Pseudo code:

add(Item t)
{
    // Add the same object instance to both data structures
    heap.add(t);
    hash.add(t);
}
remove(int id)
{
    heap.removeItemWithId(id); // linear scan -- this is going to be slow
    hash.remove(id);
}
getTopN(int n)
{
    return heap.topNItems(n);
}
getItemById(int id)
{
    return hash.getItemById(id);
}
updateValue(int id, String value)
{
    Item t = hash.getItemById(id);
    // t is the same object referred to by both the heap and the hash
    t.value = value;
    // both see the new value, but the heap must re-sift t
    // (e.g. remove and re-add it) to restore heap order
}
Amarghosh
> How often

That's the problem. I will have roughly 50 requests per second, and which kind of request dominates varies over time. So: allocate memory for each record, have one ID -> record pointer hash map/balanced search tree to serve the first requirement, and a value -> record pointer max-heap to satisfy the second requirement?
If modifications are not too frequent, I believe using a hash for ID-based retrievals and a heap for the sorted list wouldn't be a bad idea -- can't say for sure without testing, though.
Amarghosh
what makes it tricky is that the values are being updated all the time ...
If the hash and heap are both referring to the same object instance, you can modify values using an id-based retrieval from the hash and the changes made will be automatically reflected in the heap's references too.
Amarghosh
but won't that affect the value-based max-heap?
not if both heap and hash are referring to the same instance of the data object. see the updated pseudo code
Amarghosh
There is an added overhead in terms of the speed of add/remove operations. Memory overhead would be minimal, as only one copy of the data object exists per item (both heap and hash point to the same item in memory).
Amarghosh
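The pseudo-code above can be sketched as runnable Java, assuming a `HashMap` for the ID index and a `PriorityQueue` for the max-heap (the `Store`/`Item` names and the `int` value type are illustrative, not from the answer). Note that `PriorityQueue.remove(Object)` is a linear scan -- the slow removal the comment warns about -- and that updating a value needs a remove/re-add to keep heap order valid:

```java
import java.util.*;

class Item {
    final int id;
    int value;
    Item(int id, int value) { this.id = id; this.value = value; }
}

class Store {
    private final Map<Integer, Item> byId = new HashMap<>();
    // PriorityQueue is a min-heap by default; reverse the comparator for a max-heap.
    private final PriorityQueue<Item> byValue =
        new PriorityQueue<>(Comparator.comparingInt((Item t) -> t.value).reversed());

    void add(Item t) {                    // O(log n)
        byId.put(t.id, t);
        byValue.add(t);
    }

    void remove(int id) {                 // O(n): heap removal is a linear scan
        Item t = byId.remove(id);
        if (t != null) byValue.remove(t);
    }

    Item getById(int id) {                // O(1) expected
        return byId.get(id);
    }

    // Mutating t.value in place would silently break the heap invariant,
    // so remove and re-insert the same instance to restore heap order.
    void updateValue(int id, int value) { // O(n) because of the heap removal
        Item t = byId.get(id);
        if (t == null) return;
        byValue.remove(t);
        t.value = value;
        byValue.add(t);
    }

    List<Item> topN(int n) {              // polls a copy, so the heap is untouched
        PriorityQueue<Item> copy = new PriorityQueue<>(byValue);
        List<Item> out = new ArrayList<>();
        while (out.size() < n && !copy.isEmpty()) out.add(copy.poll());
        return out;
    }
}
```

`remove` and `topN` use identity equality on the shared `Item` instances, which is exactly why both structures must hold the same object rather than copies.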
A: 

I know you want a pseudo-code algorithm, but in Java, for example, I would use a TreeSet and add all the records as ID/value pairs.

The tree will keep them sorted by value, so querying the first 100 will give you the top 100. Retrieving by ID will be straightforward.

I think the underlying algorithm is called a binary tree or balanced tree (Java's TreeSet is backed by a red-black tree), not sure.

medopal
Interesting, maybe it's worth taking a look at TreeSet's code.
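A minimal sketch of the TreeSet idea, assuming records are ordered by value descending with an ID tie-break (the comparator must break ties, because a TreeSet treats comparator-equal elements as duplicates and would drop records with equal values); the class and method names here are illustrative, not from the answer:

```java
import java.util.*;

class ValueSortedRecords {
    static final class Rec {
        final int id, value;
        Rec(int id, int value) { this.id = id; this.value = value; }
    }

    // Order by value descending; the id tie-break keeps records with
    // equal values from being collapsed into one element.
    private final NavigableSet<Rec> byValue = new TreeSet<>(
        Comparator.comparingInt((Rec r) -> r.value).reversed()
                  .thenComparingInt(r -> r.id));

    void add(Rec r) { byValue.add(r); }

    // The set iterates highest value first, so the top n is just a prefix walk.
    List<Rec> top(int n) {
        List<Rec> out = new ArrayList<>(n);
        for (Rec r : byValue) {
            if (out.size() == n) break;
            out.add(r);
        }
        return out;
    }
}
```

One caveat: looking a record up *by ID* in this value-ordered set is O(n), so in practice you would pair it with a hash map from ID to record, much as the other answer suggests.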
+1  A: 

A hybrid data structure will most likely be best. For efficient lookup by ID, a hash table is the obvious choice. To support top-100 iteration, a max-heap or a binary tree is a good fit. When inserting and deleting, you just do the operation on both structures. If the 100 for the iteration case is fixed, iteration happens often, and insertions/deletions aren't heavily skewed toward the top 100, just keep the top 100 as a sorted array with overflow into a max-heap. That won't change the big-O complexity of the structure, but it will give a really good constant-factor speed-up for the iteration case.

Ants Aasma
If the IDs are unique and dense, an array is better than a hash table; if unique and sparse, a B-tree is often used.
Svante
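The fixed-top-N variant can be sketched as follows, assuming N is known up front: the current top N live in a small sorted set and everything else overflows into a max-heap, so iterating the top N is a plain ordered walk with no heap copying. `TopNTracker` is an illustrative name; deletions and promotions from the overflow heap back into the top set are omitted for brevity.

```java
import java.util.*;

class TopNTracker {
    private final int n;
    private final TreeSet<int[]> top;            // {value, id}, ascending by value
    private final PriorityQueue<int[]> overflow; // max-heap on value

    TopNTracker(int n) {
        this.n = n;
        Comparator<int[]> byValueThenId =
            Comparator.<int[]>comparingInt(a -> a[0]).thenComparingInt(a -> a[1]);
        top = new TreeSet<>(byValueThenId);
        overflow = new PriorityQueue<>(byValueThenId.reversed());
    }

    void add(int id, int value) {
        top.add(new int[]{value, id});
        if (top.size() > n) {
            // The smallest of the current top-N falls out into the overflow heap.
            overflow.add(top.pollFirst());
        }
    }

    // IDs of the top-N records, highest value first -- no heap traversal needed.
    List<Integer> topIds() {
        List<Integer> ids = new ArrayList<>();
        for (Iterator<int[]> it = top.descendingIterator(); it.hasNext(); )
            ids.add(it.next()[1]);
        return ids;
    }
}
```

The point of the split is that reads of the top N never touch the heap; only inserts that displace a top-N element pay the extra O(log n) heap push.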