Just to validate my claim that a regular TreeMap should suffice for any bucket with as many as 1,000,000 entries, here's a very simple test case that gives some numbers (attention: this is not meant as a microbenchmark, it's just to get a feeling for the magnitude of this problem).
I've used randomly generated UUIDs to mimic keys (If you'd replace dashes with slashes, you would even get kind of a directory structure). Afterwards, I've put them into a regular java.util.TreeMap
and finally queried them with map.subMap(fromKey, toKey)
.
public static void main(String[] args) {
TreeMap<String, Object> map = new TreeMap<String, Object>();
int count = 1000000;
ArrayList<String> uuids;
{
System.out.print("generating ... ");
long start = System.currentTimeMillis();
uuids = new ArrayList<String>(count);
for (int i = 0; i < count; i++) {
uuids.add(UUID.randomUUID().toString());
}
System.out.println((System.currentTimeMillis() - start) + "ms");
}
{
System.out.print("inserting .... ");
long start = System.currentTimeMillis();
Object o = new Object();
for (int i = 0; i < count; i++) {
map.put(uuids.get(i), o);
}
System.out.println((System.currentTimeMillis() - start) + "ms");
}
{
System.out.print("querying ..... ");
String from = "be400000-0000-0000-0000-000000000000";
String to = "be4fffff-ffff-ffff-ffff-ffffffffffff";
long start = System.currentTimeMillis();
long matches = 0;
for (int i = 0; i < count; i++) {
Map<String, Object> result = map.subMap(from, to);
matches += result.size();
}
System.out.println((System.currentTimeMillis() - start) + "ms (" + matches/count
+ " matches)");
}
}
and here is some sample output from my machine (1,000,000 keys, 1,000,000 range queries):
generating ... 6562ms
inserting .... 2933ms
querying ..... 5344ms (229 matches)
Inserting 1 key took an average of 0.003 ms (certainly more towards the end though) while querying a sub range with 229 matches took 0.005 ms per query. That's some pretty sane performance, isn't it?
After increasing the number to 10,000,000 keys and queries, the numbers are as follows:
generating ... 59562ms
inserting .... 47099ms
querying ..... 444119ms (2430 matches)
Inserting 1 key took an average of 0.005 ms while querying a sub range with 2430 matches took 0.044 ms per query. Even though querying got 10 times slower (at the end, it's iterating through all matches which always is O(n)) the performance still isn't too bad either.
As S3 is a cloud service, I'd assume that it's pretty much limited by networking anyway. Hence there's no urgent need for an extremely fancy data structure to get the required performance. Still, some features are missing from my test case, most notably concurrency and persistence. Nevertheless, I think I've shown that a regular tree structure is sufficient for this use case. If you want to do something fancy, experiment with subtree read-write locking and maybe a replacement for .subMap(fromKey, toKey);