I have a set of unique vectors (10k's worth), and I need to, for any chosen column, extract the set of values seen in that column in rows where all the other columns have given values.

I'm hoping for a solution that is sub-linear (wrt the item count) in time and at most linear (wrt the total size of all the items) in space, preferably sub-linear extra space over just storing the items.

Can I get that or better?

BTW: it's going to be accessed from Python and needs to be simple to program, or be part of an existing commonly used library.


edit: the costs are for the lookup, and do not include the time to build the structures. All the data that will ever be indexed is available before the first query is made.


It seems I'm doing a bad job of describing what I'm looking for, so here is a solution that gets close:

class Index:
  def __init__(self, stuff):  # don't care about this O() time
    self.all = set(stuff)  # items are assumed to be hashable tuples
    self.index = {}
    for item in stuff:
      for i, v in enumerate(item):
        self.index.setdefault(i, set()).add(v)

  def Get(self, col, have):  # this O() matters
    ret = []
    t = list(have)  # make a mutable copy
    for v in self.index[col]:
      t[col] = v
      if tuple(t) in self.all:
        ret.append(v)
    return ret

The problem is that this gives really bad (O(n)) worst-case perf.

+11  A: 

Since you are asking for a SQL-like query, how about using a relational database? SQLite is part of the standard library, and can be used either on-disk or fully in memory.
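
For example, a minimal in-memory sketch (the table and column names here are made up):

import sqlite3

conn = sqlite3.connect(':memory:')  # fully in-memory database
conn.execute('CREATE TABLE vectors (c0 INTEGER, c1 INTEGER, c2 INTEGER)')
conn.executemany('INSERT INTO vectors VALUES (?, ?, ?)',
                 [(1, 2, 3), (1, 2, 4), (1, 5, 4)])
# one composite index per lookup pattern keeps queries from scanning the table
conn.execute('CREATE INDEX idx_c0_c1 ON vectors (c0, c1)')
print conn.execute('SELECT DISTINCT c2 FROM vectors WHERE c0 = ? AND c1 = ?',
                   (1, 2)).fetchall()  # -> [(3,), (4,)]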

Ned Batchelder
Good idea, but given that I'm dealing with such a constrained and regular case it seems like overkill. Also I'd need to create a whole pile of indexes to make it fast.
BCS
BCS: There are many things you can do with a hammer, but that does not mean hitting a nail with it is overkill.
THC4k
+1  A: 

Suppose you have a 'tuple' class with fields x, y, and z, and you have a bunch of such tuples stored in an iterable named myTuples. Then:

A) Pre-population:

dct = {}
for tpl in myTuples:
    tmp = (tpl.y, tpl.z)
    if tmp in dct:
        dct[tmp].append(tpl.x)
    else: 
        dct[tmp] = [tpl.x]

B) Query:

def findAll(y,z):
    tmp = (y,z)
    if tmp not in dct: return ()
    return [(x,y,z) for x in dct[tmp]]

I am sure that there is a way to optimize the code for readability (see the sketch below) and save a few cycles, etc. But essentially you want to pre-populate a dict, using a 2-tuple as a key. If I did not see a request for sub-linear then I would not have thought of this :)

A) The pre-population is linear, sorry. B) Query time is proportional to the number of items returned -- sub-linear in the total most of the time, except for weird edge cases.
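
For instance, part A collapses to a few lines with collections.defaultdict (a sketch with the same behavior):

import collections

dct = collections.defaultdict(list)
for tpl in myTuples:
    dct[(tpl.y, tpl.z)].append(tpl.x)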

Hamish Grubijan
That would work, but I'd need 8 such dicts, with 7-tuples for each key. At that point, I might as well pre-compute the answer for all cases and get O(1) lookup.
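Roughly like this (a sketch of that precompute-everything idea, assuming the items are tuples; the names precompute and tables are made up):

from collections import defaultdict

def precompute(stuff):  # build time doesn't matter, per the question's edit
    # tables[col] maps "the rest of the row" to the set of values seen in col
    tables = defaultdict(lambda: defaultdict(set))
    for item in stuff:
        for col in range(len(item)):
            rest = item[:col] + item[col+1:]
            tables[col][rest].add(item[col])
    return tables

# lookup: tables[col][rest_of_row] is exactly the answer set, found in O(1)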
BCS
+5  A: 

If you have a Python set (no ordering) there is no way to select all relevant items without at least looking at all items -- so it's impossible for any solution to be "sub linear" (wrt the number of items) as you require.

If you have a list, instead of a set, then it can be ordered -- but that ordering cannot be achieved in linear time in the general case (O(N log N) is provably the best you can do for general-case sorting, and building sorted indices would be similar, unless there are constraints that let you use "bucket-sort-like" approaches). You can spread the time it takes to maintain indices over all insertions in the data structure -- but that won't reduce the total time needed to build such indices, just, as I said, spread it around.

Hash-based (not sorted) indices can be faster for your special case (where you only need to select by equality, not by "less than" &c) -- but the time to construct such indices is linear in the number of items: obviously you can't construct such an index without at least looking once at each item. Sublinear time requires some magic that lets you completely ignore certain items, and that can't happen without supporting "structure" (such as sortedness), which in turn requires time to achieve (and though it can be achieved "incrementally" ahead of time, such an approach doesn't reduce the total time required).

So, taken to the letter, your requirements appear overconstrained: neither Python, nor any other language, nor any database engine, etc, can possibly achieve them -- if interpreted literally, exactly as you state them. If "incremental work done ahead of time" doesn't count (as breaching your demands of linearity and sublinearity), if you talk about expected/typical rather than worst-case behavior (and your items have friendly probability distributions), etc, then it might be possible to come close to achieving your very demanding requests.

For example, consider maintaining for each of the vectors' D dimensions a dictionary mapping the value an item has in that dimension, to a set of indices of such items; then, selecting the items that meet the D-1 requirements of equality for every dimension but the ith one can be done by set intersections. Does this meet your requirements? Not by taking the latter strictly to the letter, as I've explained above -- but maybe, depending on how much each requirement can perhaps be taken in a more "relaxed" sense.

BTW, I don't understand what a "group by" implies here since all the vectors in each group would be identical (since you said all dimensions but one are specified by equality), so it may be that you've skipped a COUNT(*) in your SQL-equivalent -- i.e., you need a count of how many such vectors have a given value in the i-th dimension. In that case, it would be achievable by the above approach.

Edit: as the OP has clarified somewhat in comments and an edit to his Q, I can propose an approach in more detail:

import collections

class Searchable(object):

  def __init__(self, toindex=None):
    self.toindex = toindex
    self.data = []
    self.indices = None

  def makeindices(self):
    if self.indices is not None:
      return
    # one mapping per indexed dimension: value -> set of row positions
    self.indices = dict((i, collections.defaultdict(set))
                        for i in self.toindex)

  def add(self, record):
    if self.toindex is None:
      self.toindex = range(len(record))
    self.makeindices()
    where = len(self.data)
    self.data.append(record)
    for i in self.toindex:
      self.indices[i][record[i]].add(where)

  def get(self, indices_and_values, indices_to_get):
    # intersect the per-dimension index sets to find matching row positions
    ok = set(range(len(self.data)))
    for i, v in indices_and_values:
      ok.intersection_update(self.indices[i][v])
    # project the surviving records onto the requested columns
    result = set()
    for rec in (self.data[i] for i in ok):
      t = tuple(rec[i] for i in indices_to_get)
      result.add(t)
    return result

def main():
  c = Searchable()
  for r in ((1,2,3), (1,2,4), (1,5,4)):
    c.add(r)
  print c.get([(0,1),(1,2)], [2])

main()

This prints

set([(3,), (4,)])

and of course could be easily specialized to return results in other formats, accept indices (to index and/or to return) in different ways, etc. I believe it meets the requirements as edited / clarified since the extra storage is, for each indexed dimension/value, a set of the indices at which said value occurs on that dimension, and the search time is one set intersection per indexed dimension plus a loop on the number of items to be returned.

Alex Martelli
I second the sub-linear skepticism. I took a guess for what the asker wanted.
Hamish Grubijan
Third, I ignored the constraints as well and gave it Python's best attempt.
Jesse Dhillon
The time to build the structure is not a constraint. See edit.
BCS
The group by is supposed to eliminate duplicates. My SQL is a little rusty.
BCS
@BCS, just as a SQL refresher: `SELECT DISTINCT X` is the simplest way to eliminate duplicates.
Alex Martelli
The more I think about it the more I like that solution. OTOH its space overhead is worse than my reference implementation's. -- Also it's a more general solution than I asked for, and not in the direction that I would need: in the (unmentioned) cases where I would want more than one column of output, I'd want a tuple of sets, not a set of tuples; but that's an easy enough tweak.
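For instance, that tweak might look like this (a sketch of a hypothetical get_columns variant of Searchable.get above):

  def get_columns(self, indices_and_values, indices_to_get):
    # same intersection as get(), but collect one set per requested column
    ok = set(range(len(self.data)))
    for i, v in indices_and_values:
      ok.intersection_update(self.indices[i][v])
    result = tuple(set() for _ in indices_to_get)
    for rec in (self.data[j] for j in ok):
      for out, i in zip(result, indices_to_get):
        out.add(rec[i])
    return result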
BCS
+2  A: 

I'm assuming that you've tried the dictionary and you need something more flexible. Basically, what you need to do is index the values of x, y and z:

def build_index(vectors):
  index = {'x': {}, 'y': {}, 'z': {}}

  for position, vector in enumerate(vectors):
    if vector.x in index['x']:
      index['x'][vector.x].append(position)
    else:
      index['x'][vector.x] = [position]

    if vector.y in index['y']:
      index['y'][vector.y].append(position)
    else:
      index['y'][vector.y] = [position]

    if vector.z in index['z']:
      index['z'][vector.z].append(position)
    else:
      index['z'][vector.z] = [position]

  return index

What you have in index is a lookup table. You can say, for example, select x,y,z from vectors where x=42 by doing this:

def query_by(vectors, index, property, value):
  results = []
  for i in index[property][value]:
    results.append(vectors[i])
  return results

vecs_x_42 = query_by(vectors, index, 'x', 42)
# now vecs_x_42 is a list of all vectors where x is 42

Now to do a logical conjunction, say select x,y,z from vectors where x=42 and y=3, you can use Python's sets to accomplish this:

def query_by(vectors, index, criteria):
  sets = []
  for k, v in criteria.iteritems():
    if v not in index[k]:
      return []
    sets.append(set(index[k][v]))  # positions are stored as lists; intersect as sets

  results = []
  for i in set.intersection(*sets):
    results.append(vectors[i])

  return results

vecs_x_42_y_3 = query_by(vectors, index, {'x': 42, 'y': 3})

The intersection operation on sets produces only the values which appear in all of the sets, so you iterate only over the positions which satisfy every condition.

Now for the last part of your question, to group by x:

def group_by(vectors, property):
  result = {}
  for v in vectors:
    value = getattr(v, property)
    if value in result:
      result[value].append(v)
    else:
      result[value] = [v]

  return result

So let's bring it all together:

vectors = [...] # your vectors, as objects such that v.x, v.y produces the x and y values
index = build_index(vectors)
my_vectors = group_by(query_by(vectors, index, {'y':42, 'z': 3}), 'x')

# now you have in my_vectors a dictionary of vectors grouped by x value, where y=42 and z=3

Update: I updated the code above and fixed a few obvious errors. It works now and does what it claims to do. On my laptop, a 2GHz Core 2 Duo with 4GB RAM, build_index takes less than 1s. Lookups are very quick, even when the dataset has 100k vectors. If I have some time I'll do some formal comparisons against MySQL.

You can see the full code at this Codepad; if you time it or improve it, let me know.
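
If you want to reproduce the numbers, a harness along these lines should do (a sketch; the Vector namedtuple and random data are made up for the test):

import random
import time
from collections import namedtuple

Vector = namedtuple('Vector', 'x y z')  # made-up test data
vectors = [Vector(random.randint(0, 100), random.randint(0, 100),
                  random.randint(0, 100)) for _ in xrange(100000)]

start = time.time()
index = build_index(vectors)
print 'build_index: %.3fs' % (time.time() - start)

start = time.time()
query_by(vectors, index, {'y': 42, 'z': 3})
print 'query: %.6fs' % (time.time() - start)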

Jesse Dhillon
Interesting stuff ... care to time it?
Hamish Grubijan
Hah, honestly I'd never use something like this. I would use an RDBMS, since the nature of the problem is... relational. If I have some time I'll generate a test data set and time it compared with MySQL or something.
Jesse Dhillon
A: 

So you have 3 coordinates and one value for the start and end of the vector (x,y,z)?

How is it possible to know the seven known values? Do many coordinate triples occur multiple times?

You must be running a very tight loop around the lookup to be so concerned about lookup time, considering the small size of the data (10K).

Could you give an example of real input for the class you posted?

Tony Veijalainen
I intentionally didn't specify the length of the tuple (as it happens, I have more than one length I need to deal with) and they are significantly longer than 3 (if you need to know, one case is 8, but a solution that depends on that will be significantly less than ideal). Also I need to be able to do the lookup for any one column, not just one fixed column. As for real inputs: no, but think vectors of 3 to 30 small integers and you should be close enough.
BCS