Hi!

I'm looking for an efficient algorithm for storing and fetching data by key. I've already read about Litwin's linear dynamic hashing and other methods, but I still wonder: is there some way to get (search, calculate) a key in a VERY large binary file (consider more than 100 GB)?

I'm just curious whether there is ANY algorithm that works without performance degradation :)

As was asked, here is an example of the key:value format:

key: SHA-1 hash (20 bytes of raw data; indexing and retrieval are done by this key)

value: bytes[100], always.

A: 

Well, as said earlier, 100 GB doesn't say much. Perhaps a sample of what the key:values look like would give better clues. But having said that, you could take a look at some excellent open-source implementations in this area.

  1. HBase - an open source implementation of Google's BigTable.
  2. Cassandra - Facebook's take on this topic.

Both of these are essentially key-value stores. All the other features (multi-node operation, distribution, fault tolerance, etc.) are icing on the cake.

MovieYoda
thanks, but I'm looking for implementations of the algorithm, not a production-ready system. I'm just curious whether someone has done this. For my DB engine plain key-value isn't enough, but the way nodes are stored and retrieved is somewhere close to hash and/or B-tree algorithms.
excanoe
A: 

what about 1) first building a B+ tree from the key values (that solves the storage problem) and then 2) using a bitmap built from the keys to speed up the searching?
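One common reading of the "bitmap" step is a Bloom filter: consult it before touching the on-disk tree, and a negative answer skips the disk seek entirely, since a Bloom filter can give false positives but never false negatives. A sketch (the sizing and hash derivation are arbitrary choices, not from the thread):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash positions per key,
    derived here from salted SHA-1. May report false positives; never
    reports a false negative."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Derive k independent-ish bit positions by salting the key.
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False -> key is definitely absent; True -> probably present.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

In the two-step scheme above, a `might_contain(key) == False` result means the B+ tree never has to be searched at all.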

yadab
and what if I have 10 billion nodes (keys)? I'm a bit suspicious about the B(+) tree's performance.
excanoe
@excanoe : hmm. But a B+ tree is still fine for you, because the bitmap will help you save space if you use them together.
yadab
@yadab thanks a lot, but I'm not familiar with this bitmap technique (maybe you mean a Bloom filter?), so I don't want to bother you :) Also, a B+ tree is kind of difficult for me to understand without some simple visualization.
excanoe
also, as far as I can see, there is no algorithm that can store and retrieve keys without a performance drop. Or am I wrong?
excanoe
@excanoe : yeah, you are correct about the time and storage trade-off: addressing both at the same time is difficult, and there is some upper bound. But using a multi-level filter/lookup can help reduce the storage problem, at the cost of some theoretical time overhead: index := search_bitmap(key); then you keep several B+ trees, and "index" identifies which tree to search. Maybe it is not the most efficient way of doing it, but it helped me.
yadab
@yadab: hash algorithms can be ugly in some special cases, if we feed in specific data that lands in particular buckets and makes them overflow dramatically. Can this scenario be reproduced with B+ trees, or are they by their nature always balanced? Thanks!
excanoe
@excanoe : balanced by nature; even if you have duplicate keys, you can still handle that. You do not have the hash-chaining problem.
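The overflow scenario from the question above is easy to force against a weak hash function, which is why the worry applies to hashing but not to a self-balancing tree (and with uniformly distributed SHA-1 keys the adversarial case is not reachable in practice anyway). A toy illustration of the skew, with made-up inputs:

```python
import hashlib

NBUCKETS = 64

def bucket_counts(keys):
    """Count how many integer keys land in each bucket under h(k) = k mod NBUCKETS."""
    counts = [0] * NBUCKETS
    for k in keys:
        counts[k % NBUCKETS] += 1
    return counts

# Adversarial input: every key is a multiple of NBUCKETS, so all 1000
# keys chain off bucket 0 and lookups degrade to a linear scan.
adversarial = [i * NBUCKETS for i in range(1000)]

# Well-behaved input: integers derived from SHA-1 spread out evenly,
# the same way raw SHA-1 keys would.
uniform = [int.from_bytes(hashlib.sha1(b"%d" % i).digest()[:8], "big")
           for i in range(1000)]
```

A balanced tree, by contrast, keeps O(log n) lookups regardless of which keys are inserted in which order.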
yadab