What's the best way, both aesthetically and from a performance perspective, to split a list of items into multiple lists based on a conditional? The equivalent of:

good = [x for x in mylist if x in goodvals]
bad  = [x for x in mylist if x not in goodvals]

Is there a more elegant way to do this?

Update: here's the actual use case, to better explain what I'm trying to do:

# files looks like: [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi'), ... ]
IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
images = [f for f in files if f[2].lower() in IMAGE_TYPES]
anims  = [f for f in files if f[2].lower() not in IMAGE_TYPES]
+8  A: 

First go (pre-OP-edit): Use sets:

mylist = [1,2,3,4,5,6,7]
goodvals = [1,3,7,8,9]

myset = set(mylist)
goodset = set(goodvals)

print list(myset.intersection(goodset))  # [1, 3, 7]
print list(myset.difference(goodset))    # [2, 4, 5, 6]

That's good for both readability (IMHO) and performance.

Second go (post-OP-edit):

Create your list of good extensions as a set:

IMAGE_TYPES = set(['.jpg','.jpeg','.gif','.bmp','.png'])

and that will increase performance. Otherwise, what you have looks fine to me.
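
The comprehensions from your update then stay exactly as they are; only the membership test gets cheaper, since a set lookup is O(1) on average rather than a linear scan of a tuple. Roughly:

# files and IMAGE_TYPES as in the question; only the container type of IMAGE_TYPES changes
images = [f for f in files if f[2].lower() in IMAGE_TYPES]
anims  = [f for f in files if f[2].lower() not in IMAGE_TYPES]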

RichieHindle
Not the best solution if the lists were in some order before splitting and you need them to stay in that order.
Daniyar
Wouldn't that remove duplicates?
mavnn
Thanks Rich. I hadn't asked the question very well, updated it with the actual use case. I don't think sets will work very smoothly in this case.
Parand
@Parand: OK, I've updated my answer.
RichieHindle
+1 for being mathematically elegant
Kevin D.
+2  A: 

Personally, I like the version you cited, assuming you already have a list of goodvals hanging around. If not, something like:

good = filter(lambda x: is_good(x), mylist)
bad = filter(lambda x: not is_good(x), mylist)

Of course, that's really very similar to using a list comprehension like you originally did, but with a function instead of a lookup:

good = [x for x in mylist if is_good(x)]
bad  = [x for x in mylist if not is_good(x)]

In general, I find the aesthetics of list comprehensions to be very pleasing. Of course, if you don't actually need to preserve ordering and don't need duplicates, using the intersection and difference methods on sets would work well too.

BJ Homer
+1  A: 

I agree with RichieHindle. You can also check this link:
http://house.hcn-strela.ru/doc/python/lib/types-set.html

Jace Jung
+1  A: 

For performance, try itertools.

The itertools module standardizes a core set of fast, memory efficient tools that are useful by themselves or in combination. Together, they form an “iterator algebra” making it possible to construct specialized tools succinctly and efficiently in pure Python.

See itertools.ifilter or imap.

itertools.ifilter(predicate, iterable)

Make an iterator that filters elements from iterable returning only those for which the predicate is True
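
For instance, a rough sketch of the question's split using ifilter and its complement ifilterfalse (names as in the question):

from itertools import ifilter, ifilterfalse

is_image = lambda f: f[2].lower() in IMAGE_TYPES  # IMAGE_TYPES as in the question

# Both calls return lazy iterators; wrap them in list() when real lists are needed.
images = list(ifilter(is_image, files))
anims  = list(ifilterfalse(is_image, files))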

gimel
+9  A: 

The problem with all the proposed solutions is that they scan the list and apply the filtering function twice. I'd make a simple small function like this:

def SplitIntoTwoLists(l, f):
  a = []
  b = []
  for i in l:
    if f(i):
      a.append(i)
    else:
      b.append(i)
  return (a,b)

That way you are not processing anything twice and also are not repeating code.
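
For the question's file tuples, usage would look something like this (a sketch, assuming IMAGE_TYPES from the question):

# Predicate picks out image extensions from the (name, size, ext) tuples
is_image = lambda f: f[2].lower() in IMAGE_TYPES
images, anims = SplitIntoTwoLists(files, is_image)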

winden
The second a.append() should be b.append().
balpha
I agree. I was looking for an "elegant" (i.e. here meaning short and built-in/implicit) way to do this without scanning the list twice, but this seems (without profiling) to be the way to go. Of course it would only matter anyway for large amounts of data.
Matthew Flaschen
IMHO, if you know a way of doing it with less CPU usage (and thus less power drain), there is no reason not to use it.
winden
+6  A: 

Here's the lazy iterator approach:

from itertools import tee

def split_on_condition(seq, condition):
    l1,l2 = tee((condition(item),item) for item in seq)
    return (i for p, i in l1 if p), (i for p, i in l2 if not p)

It evaluates the condition once per item and returns two generators, the first yielding values from the sequence where the condition is true, the other where it's false.

Because it's lazy you can use it on any iterator, even an infinite one:

from itertools import count, islice

def is_prime(n):
    return n > 1 and all(n % i for i in xrange(2,n))

primes, not_primes = split_on_condition(count(), is_prime)
print("First 10 primes", list(islice(primes, 10)))
print("First 10 non-primes", list(islice(not_primes, 10)))

Usually, though, the non-lazy list-returning approach is better:

def split_on_condition(seq, condition):
    a, b = [], []
    for item in seq:
        (a if condition(item) else b).append(item)
    return a,b

Edit: For your more specific use case of splitting items into different lists by some key, here's a generic function that does that:

DROP_VALUE = lambda _:_
def split_by_key(seq, resultmapping, keyfunc, default=DROP_VALUE):
    """Split a sequence into lists based on a key function.

        seq - input sequence
        resultmapping - a dictionary that maps from target lists to keys that go to that list
        keyfunc - function to calculate the key of an input value
        default - the target where items that don't have a corresponding key go, by default they are dropped
    """
    result_lists = dict((key, []) for key in resultmapping)
    appenders = dict((key, result_lists[target].append) for target, keys in resultmapping.items() for key in keys)

    if default is not DROP_VALUE:
        result_lists.setdefault(default, [])
        default_action = result_lists[default].append
    else:
        default_action = DROP_VALUE

    for item in seq:
        appenders.get(keyfunc(item), default_action)(item)

    return result_lists

Usage:

def file_extension(f):
    return f[2].lower()

split_files = split_by_key(files, {'images': IMAGE_TYPES}, keyfunc=file_extension, default='anims')
print split_files['images']
print split_files['anims']
Ants Aasma
That's a lot of code to replace two list comprehensions..
dbr
You're probably right that this violates the YAGNI principle. It is based on the assumption that the number of different lists that things can be partitioned into will grow in the future.
Ants Aasma
A: 

If you insist on clever, you could take Winden's solution and add just a bit of spurious cleverness:

def splay(l, f, d=None):
  d = d or {}
  for x in l: d.setdefault(f(x), []).append(x)
  return d
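
For the question's use case, a sketch of how it could be called (the category names 'images'/'anims' are chosen here just for illustration):

by_type = splay(files, lambda f: 'images' if f[2].lower() in IMAGE_TYPES else 'anims')
print by_type.get('images', [])
print by_type.get('anims', [])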
Anders Eurenius
The "d or {}" is a bit dangerous. If an empty dict gets passed in, it won't be mutated in place.
Brian
True, but it gets returned, so... Actually, this is the perfect example of why you *don't* want to add more clever to your code. :-P
Anders Eurenius
+7  A: 
good = [x for x in mylist if x in goodvals]
bad  = [x for x in mylist if x not in goodvals]

is there a more elegant way to do this?

That code is perfectly readable, and extremely clear!

# files looks like: [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi'), ... ]
IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
images = [f for f in files if f[2].lower() in IMAGE_TYPES]
anims  = [f for f in files if f[2].lower() not in IMAGE_TYPES]

Again, this is fine!

There might be slight performance improvements using sets, but it's a trivial difference, and I find the list comprehension far easier to read, and you don't have to worry about the order being messed up, duplicates being removed and so on.

In fact, I may go another step "backward", and just use a simple for loop:

images, anims = [], []

for f in files:
    if f[2].lower() in IMAGE_TYPES:
        images.append(f)
    else:
        anims.append(f)

A list comprehension or using set() is fine until you need to add some other check or another bit of logic - say you want to remove all 0-byte JPEGs, you just add something like:

if f[1] == 0:
    continue
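
Folding that check into the loop above would look roughly like:

images, anims = [], []

for f in files:
    if f[1] == 0:                       # skip 0-byte files
        continue
    if f[2].lower() in IMAGE_TYPES:
        images.append(f)
    else:
        anims.append(f)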
dbr
+2  A: 

itertools.groupby almost does what you want, except it requires the items to be sorted to ensure that you get a single contiguous range, so you need to sort by your key first (otherwise you'll get multiple interleaved groups for each type), e.g.:

import itertools

def is_good(f):
    return f[2].lower() in IMAGE_TYPES

files = [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi'), ('file3.gif', 123L, '.gif')]

for key, group in itertools.groupby(sorted(files, key=is_good), key=is_good):
    print key, list(group)

gives:

False [('file2.avi', 999L, '.avi')]
True [('file1.jpg', 33L, '.jpg'), ('file3.gif', 123L, '.gif')]

Similar to the other solutions, the key func can be defined to divide into any number of groups you want.
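
For example, grouping on the extension itself instead of a boolean gives one group per file type (a rough sketch):

ext_key = lambda f: f[2].lower()

for ext, group in itertools.groupby(sorted(files, key=ext_key), key=ext_key):
    print ext, list(group)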

Brian
+1  A: 

If you want to make it in FP style:

good, bad = [ sum(x, []) for x in zip(*(([y], []) if y in goodvals else ([], [y])
                                        for y in mylist)) ]

Not the most readable solution, but at least it iterates through mylist only once.

michau
A: 

I basically like Anders' approach as it is very general. Here's a version that puts the categorizer first (to match filter syntax) and uses a defaultdict.

from collections import defaultdict

def categorize(func, seq):
    """Return mapping from categories to lists
    of categorized items.
    """
    d = defaultdict(list)
    for item in seq:
        d[func(item)].append(item)
    return d
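
A sketch of usage for the question's split (category names chosen here just for illustration):

split = categorize(lambda f: 'images' if f[2].lower() in IMAGE_TYPES else 'anims', files)
print split['images']
print split['anims']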
Alan Isaac