If you have a Python set (no ordering) there is no way to select all relevant items without at least looking at all items -- so it's impossible for any solution to be "sub linear" (wrt the number of items) as you require.
If you have a list, instead of a set, then it can be ordered -- but that cannot be achieved in linear time in the general case (O(N log N)
is provably the best you can do for a general-case sorting -- and building sorted indices would be similar -- unless there are constraints that let you use "bucket-sort-like" approaches). You can spread around the time it takes to keep indices over all insertions in the data structure -- but that won't reduce the total time needed to build such indices, just, as I said, spread them around.
Hash-based (not sorted) indices can be faster for your special case (where you only need to select by equality, not by "less than" &c) -- but the time to construct such indices is linear in the number of items (obviously you can't construct such an index without at least looking once at each item -- sublinear time requires some magic that lets you completely ignore certain items, and that can't happen without supporting "structure" (such as sortedness) which in turn requires time to achieve (though it can be achieved "incrementally" ahead of time, such an approach doesn't reduce the total time required).
So, taken to the letter, your requirements appear overconstrained: neither Python, nor any other language, nor any database engine, etc, can possibly achieve them -- if interpreted literally exactly as you state them. If "incremental work done ahead of time" doesn't count (as breaching your demands of linearity and sublinearity), if you take about expected/typical rather than worst-case behavior (and your items have friendly probability distributions), etc, then it might be possible to come close to achieving your very demanding requests.
For example, consider maintaining for each of the vectors' D dimensions a dictionary mapping the value an item has in that dimension, to a set of indices of such items; then, selecting the items that meet the D-1 requirements of equality for every dimension but the i
th one can be done by set intersections. Does this meet your requirements? Not by taking the latter strictly to the letter, as I've explained above -- but maybe, depending on how much each requirement can perhaps be taken in a more "relaxed" sense.
BTW, I don't understand what a "group by" implies here since all the vectors in each group would be identical (since you said all dimensions but one are specified by equality), so it may be that you've skipped a COUNT(*)
in your SQL-equivalent -- i.e., you need a count of how many such vectors have a given value in the i-th dimension. In that case, it would be achievable by the above approach.
Edit: as the OP has clarified somewhat in comments and an edit to his Q I can propose an approach in more details:
import collections
class Searchable(object):
def __init__(self, toindex=None):
self.toindex = toindex
self.data = []
self.indices = None
def makeindices(self):
if self.indices is not None:
return
self.indices = dict((i, collections.defaultdict(set))
for i in self.toindex)
def add(self, record):
if self.toindex is None:
self.toindex = range(len(record))
self.makeindices()
where = len(self.data)
self.data.append(record)
for i in self.toindex:
self.indices[i][record[i]].add(where)
def get(self, indices_and_values, indices_to_get):
ok = set(range(len(self.data)))
for i, v in indices_and_values:
ok.intersection_update(self.indices[i][v])
result = set()
for rec in (self.data[i] for i in ok):
t = tuple(rec[i] for i in indices_to_get)
result.add(t)
return result
def main():
c = Searchable()
for r in ((1,2,3), (1,2,4), (1,5,4)):
c.add(r)
print c.get([(0,1),(1,2)], [2])
main()
This prints
set([(3,), (4,)])
and of course could be easily specialized to return results in other formats, accept indices (to index and/or to return) in different ways, etc. I believe it meets the requirements as edited / clarified since the extra storage is, for each indexed dimension/value, a set of the indices at which said value occurs on that dimension, and the search time is one set intersection per indexed dimension plus a loop on the number of items to be returned.