views:

131

answers:

4

I have a set of lists that look like this:

conditions = [
["condition1", ["sample1", "sample2", "sample3"]],
["condition2", ["sample4", "sample5", "sample6"],
...]

how can I do the following things efficiently and elegantly in Python?

  1. Find all the elements in a certain condition?

    e.g. get all the samples in condition2. Right now I can do:

    for cond in conditions:
      cond_name, samples = cond
      if cond_name == requested_cond:
        return samples
    

    but that's clunky.

  2. Find the ordered union of a list of conditions? E.g. ordered_union(["condition1", "condition2"], conditions) should return:

    ["sample1", "sample2", "sample3", "sample4", "sample5", "sample6"]
    

How can I do this efficiently in Python? There are probably clever one liners?

+6  A: 

This looks more like a job for a dict:

conditions = {
"condition1": ["sample1", "sample2", "sample3"],
"condition2": ["sample4", "sample5", "sample6"],
...}

You could then get the "ordered union" using

>>> conditions["condition1"]+conditions["condition2"]
['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6']

In Python 3.1 or 2.7, you can preserve the order using an OrderedDict instead:

from collections import OrderedDict
conditions = OrderedDict([
["condition1", ["sample1", "sample2", "sample3"]],
["condition2", ["sample4", "sample5", "sample6"]]
])

You could then get the "ordered union", also for OrderedDicts of arbitrary size:

>>> import itertools
>>> [item for item in itertools.chain(*conditions.values())]
['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6']
Tim Pietzcker
This makes the assumption that the sample names can be sorted alphabetically like that, which happens to be true in my example but is not true in general in my code -- sorry for that confusion. condition1 could be called foo and condition2 could be bar, and the samples in them can be named arbitrarily in a non-easily sortable way.
Concatenating the lists won't be a union if there are repeated elements.
tzaman
@user248237: OK, how should they be sorted then? Or how did you arrive at `["sample1", "sample2", "sample3", "sample4", "sample5", "sample6"]`?
Tim Pietzcker
@tzaman: You're right. But we don't know what exactly he means by "ordered union". Let's wait for clarification.
Tim Pietzcker
@tim: I arrive at that ordering by simple concatenating the two lists. I assume no sample belongs to more than one condition. So the ordering is given by the list. This is why I used a list of lists instead of a dictionary.
Well, in this case, iterating over the entries of the `OrderedDict` and adding the values to a list will result in exactly that.
Tim Pietzcker
+2  A: 

You need to use a dict (dictionary) instead of a list. Also, you can keep the samples in a set if you want efficient set-based operations.

conditions = { "condition1" : set(["sample1", "sample2", "sample3"]),
               "condition2" : set(["sample4", "sample5", "sample6"]) }

print conditions["condition2"]
# set(['sample5', 'sample4', 'sample6'])
union = conditions["condition1"].union(conditions["condition2"])
print sorted(union)
# ['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6']
tzaman
+3  A: 

Ah well, if you're forced to keep that clunky data structure, you can't expect much. The one-liner equivalent of your first solution is going to be something like:

def samplesof(requested_cond, conditions):
    return next(s for c, s in conditions if c==requested_cond)

and for the second one, if you insist on one-liners, it's going to be something like:

def ordered_union(the_conds, conditions):
    return [s for c in the_conds for s in samplesof(c, conditions)]

There are faster ways to solve the second problem, but they're all multi-line, e.g.:

aux_set = set(the_conds)
samples_by_cond = dict((c, s) for c, s in conditions if c in aux_set)
return [s for c in the_conds for s in samples_by_cond[c]]

Note that the key to the reason this latter approach is faster is that it uses the right data structures (a set and a dict) -- unfortunately it has to build them itself, because the incoming conditions nested list is really the wrong data structure.

Couldn't you encapsulate conditions as a member variable of a class that builds the crucial (right, fast) auxiliary data structures just once? E.g.:

class Sensible(object):
  def __init__(self, conditions):
    self.seq = []
    self.dic = {}
    for c, s in conditions:
      self.seq.append(c)
      self.dic[c] = s
  def samplesof(self, requested_condition):
    return self.dic[requested_condition]
  def ordered_union(self, the_conds):
    return [s for c in the_conds for s in self.dic[c]]

Now that is fast and elegant!

I'm assuming that you need self.seq (the sequence of conditions) for something else (it's certainly not needed for the two operations you mention!), and that there are no repetitions in that sequence and in the samples (whatever your actual specs are they won't be hard to accomodate, but blindly trying to guess them when you mention nothing about them would be very hard and pointless;-).

Alex Martelli
+1 however reading the comments, the op will likely think this problem doesn't merit a *whole class to itself*!
gnibbler
Thanks for all your comments and replies, I appreciate it. I'm getting the impression that everyone thinks this should be done with a dictionary -- I am very open to that, as long as I can maintain ordering. The above class does that, but like gnibbler said, it's a whole class just for this. Is that what everyone would recommend? If so I am open to this solution, I am just wondering if it isn't an overkill. Basically, as far as I can tell, this has to be a dictionary with an extra parameter which is a list that provides the order.
in `return next( <some generator> )` above do you mean `return (<some generator>).next()` ? not sarcastic, just don't know such fn next()
Nas Banov
@EnTerr, in Python 2.6 and better there's a nice `next` built-in (use the method instead only if you're stuck with old releases, 2.5 or earlier). "A whole class" costs essentially nothing, so _of course_ it's worth it if it makes a solution nicer, faster, etc.
Alex Martelli
@Alex Martelli: thanks - i am learning quite a few new things on S.O. - yes, i am on 2.5; i agree that there is probably a better structure for the data than list of lists but we lack 'domain knowledge' to suggest what. Re the `class Sensible` above, while i agree it is *readily perceived by the senses*, i can't say it's use is *marked by the exercise of good judgment* (dictionary defs. of *sensible* :) ). It's just a single-use wrapper.PS. and of course i like my functional solution better
Nas Banov
@EnTerr, I love functional programming **when appropriate**, but your code below is really terrible (`reduce(list._add__, ...`, **eep**, not to mention the uselessly repeated building of intermediate `dict`s -- which `Sensible` sensibly does once and for all instead, and that's the crucial speed-up and not obtainable via functional programming in this case).
Alex Martelli
@Alex Martelli: thanks for the critique, appreciate it. The reason i had example #2 with `reduce(list.__add__, ...)` first was because i couldn't make sum of list of lists work initially - e.g. sum([[1],[2]]) was giving me `TypeError: unsupported operand type(s) for +: 'int' and 'list'` - and it took me a bit to figure out to force the type by adding empty list as second argument. I agree that creating dictionary N times is creepy - i was doing one-liner for sake of the example. Ok, edited to address your concern - two lines it is!
Nas Banov
@user248237, No this isn't overkill. Having classes such as this is no big deal in Python. More important is having readable+maintainable code.
gnibbler
@EnTerr, `sum` is meant to work on _numbers_ (I should know -- I was the designer and first implementer of it in the Python C runtime;-) and has bad performance on lists (in short: because it's defined to use `+`, _not_ `+=` -- this makes and tosses temporary lists and copies data around like mad, resulting `O(N squared)` performance where a simple loop with `+=` would be `O(N)`). There are many other such issues (`next` short-circuits, `dict` never can, etc), too many to discuss within a comment.
Alex Martelli
@Alex Martelli: by g*d, you are right! i am big fan of your `timeit` examples, so i did some benchmarks (len in 10, 100 ... 100000) and sum(l) time indeed is quadratic. OTOH i implemented `sum1` with `+=` and its performance is linear - it outperforms embedded `sum` for len(list)>15, ouch! Is there a good reason why sum() can't use `+=` for mutable objects?
Nas Banov
@EnTerr, you did right in checking for yourself -- now I think you should edit your answer accordingly.
Alex Martelli
@Alex Martelli: yes, i will edit mine - i think i found linear functional solution w/o sum. But take heed, for small list lengths sum(map(..)..) still beats Sensible() and your other, 3-line idea (Sensible() 16 uSec/pass, genUnionAM()=19uS, getUnionBySum()=13uS from 100k runs on the example from the question) ... and user248237 clarified in his comments that his lists are short
Nas Banov
@EnTerr, yep, it's typical for tiny inputs to have different performance characteristics than larger ones (e.g., insertion sort and other `O(N squared)` ones can easily beat mergesort, quicksort and other `O(N log N)` ones, provided only the input is tiny enough) -- other anomalous situations (e.g. doing a single operation on an object, affording no amortization of the object's construction [[as would happen in normal use patterns]]) can also produce "topsy turvy" results. But in general you don't **need** to care as much about the tiny cases: it's the `O(N**2)` on big input that kills you!-)
Alex Martelli
@Alex Martelli: ya, ya - i know - i was just pointing out that since OP mentioned his lists are short, it won't be a heresy to use sum(). Still, why is sum implementation impaired, i.e. does not use += for lists of mutables? quite a long comment list here - should i create separate q. regarding sum()?!
Nas Banov
@gnibbler: Why yes, it is overkill creating classes for that! That is, unless you live in the Kingdom of Nouns under King Java (http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom-of-nouns.html) - then you must. Verbosity is not pythonic - it hampers readability.
Nas Banov
@EnTerr, yep, this comment thread is way too long, make a (community wiki, maybe) Q for discussing `sum`'s proper use and implementation. Wrt "overkill creating classes", it's only **excessive** verbosity that is not Pythonic -- Python's Zen says "sparse is better than dense", minimizing verbosity gives you **perl**, definitely _not_ **python** (feel free to open another Q on this -- that one, definitely community wiki).
Alex Martelli
@EnTerr, if the structure is not quite suited to a `dict` or a `list`, isn't it better to encapsulate it's behaviour so the code that uses it is nice and tidy? The alternative is to litter the code with glue and temporary variables.
gnibbler
+1  A: 
Nas Banov
you could use `tmpdict.get` and/or `itertools.chain`.
J.F. Sebastian
@J.F. Sebastian: thanks re dic.get(i) - i used \__getitem__ because i was looking for the _exact_ fn equivalent of `dict[i]` and i know `dict.get()` returns `None` when item is not found while `[]` throws exception. Except there is no reason why i should prefer one over the other - trying to append `None` to a list will cause exception too, so simplest - get - wins. re itertools.chain - i don't know itertools (yet), will look into it - seems convenient!
Nas Banov
`reduce(operator.iadd,...)` is fast. http://stackoverflow.com/questions/406121/flattening-a-shallow-list-in-python/408281#408281
J.F. Sebastian