views:

539

answers:

7

I'm developing a 'TreeDict' class in Python. This is a basically a dict that allows you to retrieve its key-value pairs in sorted order, just like the Treemap collection class in Java.

I've implemented some functionality based on the way unique indexes in relational databases can be used, e.g. functions to let you retrieve values corresponding to a range of keys, keys greater than, less than or equal to a particular value in sorted order, strings or tuples that have a specific prefix in sorted order, etc.

Unfortunately, I can't think of any real life problem that will require a class like this. I suspect that the reason we don't have sorted dicts in Python is that in practice they aren't required often enough to be worth it, but I want to be proved wrong.

Can you think of any specific applications of a 'TreeDict'? Any real life problem that would be best solved by this data structure? I just want to know for sure whether this is worth it.

+1  A: 

I often use Dict<DateTime, someClassOrValue> when working with industrial process data-- Valve open/close, machinery start/stop, etc.

Having the keys sorted is especially useful when I need to compare time intervals between start/stop or open/close events in a decent amount of time.

However, since I've been able to use linq in C# I've found that it's often easier to just work with IEnumerables and use the IQueryable extension methods to get the information I need.

nvuono
+2  A: 

The reason for keeping the elements in sorted order is for faster retrieval. Say I wanted all of the values in the dictionary in a sorted range. This is much faster with a TreeDict then with the regular hashmap. It basically allows you to keep everything in the dictionary in sorted order. I know in the application I'm currently working on uses a class like this to basically query the data structure.

Polaris878
A: 

They can make various algorithms easier to implement.

samoz
+1  A: 

Almost all "GROUP BY" reporting requires a sorted dictionary.

summary = sortedDefaultDict()
for row in somePileOfData:
    summary[row.group_by] += row.balance
for k in sorted(summary.keys()):
    print k, summary[k]

This is done so often in data warehousing applications, that it's difficult to express how central this is.

If the sorted function call does no work, it saves a ton of time in the long run.

S.Lott
OTOH, aren't balanced trees used instead of dicts in 'group by' applications?
Eli Bendersky
Inside the RDBMS, they use file-system-oriented structures because there's so little available memory. Some DB's do a query with an implied ORDER-BY and then the groups will be adjacent rows: no dict is required.
S.Lott
Seems as if a hash-based defaultdict will execute the group by query just fine in this case. The part that actually requires the sorting is the "ORDER BY" aspect of the query. It's more 'GROUP BY ORDER BY'.
Seun Osewa
@Seun Osewa: That's why the sorted() is applied to the keys of the resulting dictionary. If that dictionary is ordered, the `sorted()` function will do nothing -- a big speed improvement.
S.Lott
Hmm, turns out that sorted() on a sorted list is not exactly a no-op, but it's 50% as fast as creating a copy of a 10 million element list and more than 70 times faster than sorting a shuffled list of the same size. You're not exactly right, but not completely wrong.
Seun Osewa
+1  A: 

It's useful when you need to go through a Dictionary in order of the keys; which comes up on occasion. I've actually found its infinitely more common in certain programming contests then anything else (think ACM, etc).

The most useful feature of a TreeMap is when you want to quickly find the min or max key; using a sorted dictionary this is often a single method call; and algorithmically can be done in O(log(n)) time, as opposed to iterating over each key looking for a min/max if the collection is unsorted. Basically, a much friendlier interface.

One of the more common times I run into it is when objects are identified by a specific name, and you want to print out the objects ordered according to the name; say a mapping from directory name to number of files in a directory.

One other place I've used it is in an excel spreadsheet wrapper; mapping from row number to row object. This lets you quickly find the last row index, without looping through each row.

Also, it's useful when you can easily define a comparison relation on keys, but not necessarily a hashing function, as needed for HashMaps. The best (though weak) example I can think of is case insensitive string keys.

CoderTao
+2  A: 

I've seen several answers pointing to the "walk in ordered sequence" feature, which is indeed important, but none highlighting the other big feature, which is "find first entry with a key >= this". This has many uses even when there's no real need to "walk" from there.

For example (this came up in a recent SO answer), say you want to generate pseudo-random values with given relative frequencies -- i.e, you're given, say, a dict d:

{'wolf': 42, 'sheep': 15, 'dog': 23, 'goat': 15, 'cat': 5}

and need a way to generate 'wolf' with a probability of 42 out of 100 (since 100 is the total of the relative frequencies given), 'sheep' 15 out of 100, and so on; and the number of distinct values can be quite large, as can the relative frequencies.

Then, store the given values (in whatever order) as the values in a tree map, with the corresponding keys being the "total cumulative frequency" up to that point. I.e.:

def preprocess(d):
    tot = 0
    for v in d:
        tot += d[v]
        treemap.insert(key=tot, value=v)
    return tot, treemap

Now, generating a value can be pretty fast (O(log(len(d)))), as follows:

def generate(tot, treemap, r=random):
    n = r.randrange(tot)
    return treemap.firstGTkey(n).value

where firstGTKey is a method that returns the first entry (with .key and .value attributes, in this hypothetical example) with a key > the given argument. I've used this approach with large files stored as B-Trees, for example (using e.g. bsddb.bt_open and the set_location method).

Alex Martelli
Insightful and original answer, thanks.
Seun Osewa
@seun, you're welcome! Other (more normal;-) cases in which getting the first key>=x is useful include "real-time suggestions" when the users starts typing something on an input field, javascript sends that prefix to the server, and the server needs to respond quickly with "known terms" matching that prefix...
Alex Martelli
+1  A: 

Have you seen that: http://code.activestate.com/recipes/576998/ ?

zuo

zuo