views:

106

answers:

3

I'm looking for a data structure for string(UTF-8) indices that is highly optimized for range queries and space usage. Thanks!

Elaboration: I have list of arbitrary length utf-8 strings that i need to index. I will be use only range queries.

Example: I have strings - apple, ape, black, cool, dark.

Query will be something like this - "get from 2 to 3 element in desc order" or "get strings that start by 'ap'"

+3  A: 

Since you mentioned "relatively static", a simple sorted array would do everything you want and is highly optimized both in terms of space and time.

"get from 2 to 3 element in desc order" is simply a lookup of the corresponding array indices.

"get strings that start by 'ap'" can be done with a binary search. The search will stop at or just before the first string that starts with 'ap', and from there on, you just scan through until you find all such strings.

casablanca
+1  A: 

Did you check Tries?

The structure should fit what you need - both the range and the 'start with' should be easy, plus memory consumption is also good.

Unreason
A: 

I do like casablanca's answer: since your data are relatively static, a sorted list should be ok.

In case you want something more easily updatable, you can work with a blist.sortedlist.

You could even work with ZODB's BTrees, which already include much of the functionality you want (i.e range searches).

ΤΖΩΤΖΙΟΥ