views:

245

answers:

6
# I have 3 lists:
L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]
# I want to create another that is L1 minus L2's memebers and L3's memebers, so:
L4 = (L1 - L2) - L3  # Of course this isn't going to work

I'm wondering, what is the "correct" way to do this. I can do it many different ways, but Python's style guide says there should be only 1 correct way of doing each thing. I've never known what this was.

A: 

Assuming your individual lists won't contain duplicates....Use Set and Difference

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]
print(list(set(L1) - set(L2) - set(L3)))
st0le
This would lose order.
Brandon Craig Rhodes
Yeah, the key difference between a list and a set...
mepcotterell
If order/duplicates are NOT an issue, this is the cleanest option, IMO
Mark Peters
It's nice, but loses order and duplicates. I posted a (much less good-looking, I admit) solution below, that preserves both.
Santiago Lezica
that's a fairly big IF.
aaronasterling
And if you want to order the result set, S4, before turning it into a list; [n for n in L1 if n in S4].
istruble
+6  A: 

Here are some tries:

L4 = [ n for n in L1 if (n not in L2) and (n not in L3) ]  # parens for clarity

tmpset = set( L2 + L3 )
L4 = [ n for n in L1 if n not in tmpset ]

Now that I have had a moment to think, I realize that the L2 + L3 thing creates a temporary list that immediately gets thrown away. So an even better way is:

tmpset = set(L2)
tmpset.update(L3)
L4 = [ n for n in L1 if n not in tmpset ]

Update: I see some extravagant claims being thrown around about performance, and I want to assert that my solution was already as fast as possible. Creating intermediate results, whether they be intermediate lists or intermediate iterators that then have to be called into repeatedly, will be slower, always, than simply giving L2 and L3 for the set to iterate over directly like I have done here.

$ python -m timeit \
  -s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2)' \
  'ts = set(L2); ts.update(L3); L4 = [ n for n in L1 if n not in ts ]'
10000 loops, best of 3: 39.7 usec per loop

All other alternatives (that I can think of) will necessarily be slower than this. Doing the loops ourselves, for example, rather than letting the set() constructor do them, adds expense:

$ python -m timeit \
  -s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2)' \
  'unwanted = frozenset(item for lst in (L2, L3) for item in lst); L4 = [ n for n in L1 if n not in unwanted ]'
10000 loops, best of 3: 46.4 usec per loop

Using iterators, will all of the state-saving and callbacks they involve, will obviously be even more expensive:

$ python -m timeit \
  -s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2);from itertools import ifilterfalse, chain' \
  'L4 = list(ifilterfalse(frozenset(chain(L2, L3)).__contains__, L1))' 
10000 loops, best of 3: 47.1 usec per loop

So I believe that the answer I gave last night is still far and away (for values of "far and away" greater than around 5µsec, obviously) the best, unless the questioner will have duplicates in L1 and wants them removed once each for every time the duplicate appears in one of the other lists.

Brandon Craig Rhodes
It might be possible to eke out some more performance by building a frozen set from a chain of two list iterators.
intuited
No, frozen sets have exactly the same speed as normal ones, but normally require more expense to create because you have to create an intermediate result or loop yourself if, as here, you are building them from several input iterables.
Brandon Craig Rhodes
@Brandon - thanks for this
orokusaki
A: 

Doing such operations in Lists can hamper your program's performance very soon. What happens is with each remove, List operations do a fresh malloc & move elements around. This can be expensive if you have a very huge list or otherwise. So I would suggest this -

I am assuming your list has unique elements. Otherwise you need to maintain a list in your dict having duplicate values. Anyway for the data your provided, here it is-

METHOD 1

d = dict()
for x in L1: d[x] = True

# Check if L2 data is in 'd'
for x in L2:
    if x in d:
        d[x] = False

for x in L3:
    if x in d:
        d[x] = False

# Finally retrieve all keys with value as True.
final_list = [x for x in d if d[x]]

METHOD 2 If all that looks like too much code. Then you could try using set. But this way your list will loose all duplicate elements.

final_set  = set.difference(set(L1),set(L2),set(L3))
final_list = list(final_set)
MovieYoda
The list comprehension doesn't do remove operations which are the expensive ones.
aaronasterling
#aaron yes I know. I was referring to the solution posted by Santiago.
MovieYoda
Hey, you're basically using a dictionary as a set. They have a whole other data type for that: http://docs.python.org/library/stdtypes.html#types-set
intuited
@intuited yes I realized that after I posted the solution :) . Hence I poseted the method 2 with the shorter `set` version.
MovieYoda
Ah, okay. I didn't refresh; it didn't notify me. Clearly a bug in stackoverflow :) Though see the discussion for http://stackoverflow.com/questions/3947654/python-removing-items-from-lists/3947659#3947659
intuited
A: 

This may be less pythonesque than the list-comprehension answer, but has a simpler look to it:

l1 = [ ... ]
l2 = [ ... ]

diff = list(l1) # this copies the list
for element in l2:
    diff.remove(element)

The advantage here is that we preserve order of the list, and if there are duplicate elements, we remove only one for each time it appears in l2.

Santiago Lezica
That is incredibly expensive and is, to the contrary, more complicated to look at than a simple comprehension.
aaronasterling
A taste issue, it seems. I like lists comprehensions a lot, I actually tend to overuse them, but I don't think "n for n in L if n not in..." is nice in the eye. One way or another, it is, I'll admit, computationally expensive.
Santiago Lezica
+5  A: 
intuited
+1 Very informative. The most recent addition (with itertools) is very nice. I'd say you've earned your Ph.D in filtering lists based on inclusion in a set of lists.
aaronasterling
@aaron: It's taken years of study, but it was worth it.
intuited
Am I missing something or is your generator expression just `itertools.chain`? If yes, just use that (you can keep the explanation of generators and genrator expressions though, people need to learn about them).
delnan
No, frozen sets are *not* more efficient than normal sets, they merely have their "modify bit unset" so that you can use them as dictionary keys. They are, in other words, hashable, which is a really big deal in terms of what you can and cannot do with them in Python. (They can also be members of other sets that way.) It also means that you can pass them around to other routines without being afraid that they will come back modified, but hashability is a much bigger deal.But, as your answer illustrates, they are awkward to create. So I think your solution is overly complex with no "win".
Brandon Craig Rhodes
@Brandon: Thanks for the info; I had forgotten about the hashability aspect. As you say, this is pretty "key". The [link](http://labs.kortina.net/2010/10/13/list-dict-set-and-frozen-set-performance-in-python/) I posted does show a significant speed advantage for frozen sets; any comment on that?
intuited
@delnan: it seems to be a bit of a contentious issue whether `itertools.chain` is actually preferable to nested genexps. I gather that it's faster, but is arguably less self-explanatory than `for .. in .. for .. in ..` From what I've gathered (mostly from posts/comments on this site), a significant portion of the Python userbase finds nested genexps easier to read and/or is reluctant to import from `itertools` just to get different syntax.
intuited
@intuited: Yes! My comment is that you are being mislead by that post. If you will edit the sample code shown in the link, change the line `words_frozenset = frozenset(words_set)` so that it says `words_frozenset = set(words_set)` instead, and re-run the program, the second set *still* wins. The program is not showing that a frozenset is faster. It is showing that a set created directly from another set has a speed advantage, probably because the members get laid out in memory in a more efficient order than when the members are added in the order they happen to arrive from the input file.
Brandon Craig Rhodes
@Brandon: I'm not sure if I read you correctly or not. I swapped the way that the test's `set` and `frozenset` are created, and the frozenset test is consistently running in about 5/6 of the time of the set test. I tried a few different approaches, e.g. creating them both directly from the original list; the ratio stayed pretty much the same. [This gist revision](http://gist.github.com/629781/04aadc5621755cdb36d89a26c6938d89fc1e8c8e) does a swap of the instantiation of those variables. When I run it, the frozenset is still faster. Maybe you can figure out where the discrepancy lies?
intuited
@intuited: I will investigate why they are not running at the same speed. My only point is that if you will get your code from gist and change the frozen set to a normal set, then the normal set will also win the race. You will get two sets that have different speeds. Try it!
Brandon Craig Rhodes
@intuited: Having experimented with your branch a bit more, I note that it is always the second set-or-frozenset that gets tested that winds up winning by about 30%. My guess is that this author is doing something subtly wrong with their testing logic, which is generally what comes of trying to re-implement `timeit()`.
Brandon Craig Rhodes
@intuited: Yes! I think I see it! The problem, I think, is that rather than putting each test in its own loop, he does them all in one loop — so they each see an L1 and L2 cache in the state the previous guy left it in, which gives some of them an unfair advantage!
Brandon Craig Rhodes
@Brandon: looks like [that's it](https://gist.github.com/629781/9b34a593c9865622e2d6d693af55bcf4181a9782). I got rid of the list and dict tests, added 2 more sets of each of the set tests, and found that the times tend to even out in the later sets. Good eye.
intuited
@Brandon: Is what you said about frozen sets 'merely [having] their "modify bit unset"' documented somewhere, or did you learn that from reading the Python source, or from another source?
intuited
@intuited: I gave the "Mighty Dictionary" talk at PyCon, so I know how Python uses hash tables at the C level. With that knowledge, I didn't need to look at the set and frozenset source code; I already know how hash table lookup works, and I know that making the data structure read-only cannot make lookup any faster. My talk is here: http://blip.tv/file/3332763
Brandon Craig Rhodes
@Brandon: awesome, I'll check it out. Makes sense, now that I think about it, actually.. :) I guess it could theoretically choose the hashtable size in a way that would minimize collisions for that data set, but that would probably be a case of diminishing returns and escalating memory usage, and/or require significant investment at set instantiation.
intuited
@intuited: indeed, there would be little benefit — since hash tables grow automatically to stay ahead of the size of their contents, the only really bit mismatch can happen if you empty out a big set or dictionary and leave its slots mostly empty. I've enjoyed this exchange, by the way — we had to dig way farther into this guy's issue than I ever expected. :-) Are you coming to PyCon Atlanta 2011? I'll buy you a beer if I see you there!
Brandon Craig Rhodes
@intuited - thanks for this.
orokusaki
@orokusaki: Cheers! @Brandon: Cheers!
intuited
A: 

I think intuited's answer is way too long for such a simple problem, and Python already has a builtin function to chain two lists as a generator.

The procedure is as follows:

  1. Use itertools.chain to chain L2 and L3 without creating a memory-consuming copy
  2. Create a set from that (in this case, a frozenset will do because we don't change it after creation)
  3. Use list comprehension to filter out elements that are in L1 and also in L2 or L3. As set/frozenset lookup (x in someset) is O(1), this will be very fast.

And now the code:

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]

from itertools import chain
tmp = frozenset(chain(L2, L3))
L4 = [x for x in L1 if x not in tmp] # [1, 3, 6]

This should be one of the fastest, simplest and least memory-consuming solution.

AndiDog
It is not fastest; check the tests in my post. Putting an iterator in between the set and the already-iterable lists just slows things down.
Brandon Craig Rhodes
@Brandon Craig Rhodes: Ok, let's say "one of the fastest solutions". Thanks for posting your benchmark results.
AndiDog
Indeed — your solutions is definitely one of the fastest, and certainly one of the class of O(*n*log*m*) solutions that this problem deserves. I just wanted to make sure that programmers realize that iterators are not pixie dust that are somehow faster than looping over a container itself; every item returned by an iterator requires its scope to be re-activated and its code re-commenced, so their benefits do not come for free.
Brandon Craig Rhodes