tags:

views:

144

answers:

3

I have a list of tuples I am trying to sort and could use some help.

The field I want to sort by in the tuples looks like "XXX_YYY". First, I want to group the XXX values in reverse order, and then, within those groups, I want to place the YYY values in normal sort order. (NOTE: I am just as happy, actually, sorting the second item in the tuple in this way, reverse order first word, normal order second.)

Here is an example of what I have and what I would like in the end ... not sure how to do it.

mylist = [
    (u'community_news', u'Community: News & Information'), 
    (u'kf_video', u'KF: Video'), 
    (u'community_video', u'Community: Video'), 
    (u'kf_news', u'KF: News & Information'), 
    (u'kf_magazine', u'KF: Magazine')
]

I would like to perform some sort of sort() on this list that will change the output to:

sorted = [
    (u'kf_magazine', u'KF: Magazine'),
    (u'kf_news', u'KF: News & Information'), 
    (u'kf_video', u'KF: Video'), 
    (u'community_news', u'Community: News & Information'), 
    (u'community_video', u'Community: Video'), 
]

I suspect there may be a pythonic way to handle this but am not able to wrap my head around it.

+10  A: 
def my_cmp(x, y):
  x1, x2 = x[0].split('_')
  y1, y2 = y[0].split('_')
  return -cmp(x1, y1) or cmp(x2, y2)

my_list = [
    (u'community_news', u'Community: News & Information'), 
    (u'kf_video', u'KF: Video'), 
    (u'community_video', u'Community: Video'), 
    (u'kf_news', u'KF: News & Information'), 
    (u'kf_magazine', u'KF: Magazine')
]

sorted_list = [
    (u'kf_magazine', u'KF: Magazine'),
    (u'kf_news', u'KF: News & Information'), 
    (u'kf_video', u'KF: Video'), 
    (u'community_news', u'Community: News & Information'), 
    (u'community_video', u'Community: Video'), 
]

my_list.sort(cmp=my_cmp)
assert my_list == sorted_list
Ayman Hourieh
I was just about to edit mine to negate the cmp call instead when you posted your answer. :)
Kylotan
I've simplified the comparison further to `-cmp(x1, y1) or cmp(x2, y2)`. :)
Ayman Hourieh
you could also pass the key argument to sort and get rid of the split at the top of your function: my_list.sort(cmp=my_cmp,key=lambda x: x[0].split('_'))
Steven Graham
that is a beautiful thing, thank you!
thornomad
+2  A: 
>>> def my_cmp(tuple_1, tuple_2):
    xxx_1, yyy_1 = tuple_1[0].split('_')
    xxx_2, yyy_2 = tuple_2[0].split('_')
    if xxx_1 > xxx_2:
     return -1
    elif xxx_1 < xxx_2:
     return 1
    else:
     return cmp(yyy_1, yyy_2)


>>> import pprint
>>> pprint.pprint(sorted(mylist, my_cmp))
[(u'kf_magazine', u'KF: Magazine'),
 (u'kf_news', u'KF: News & Information'),
 (u'kf_video', u'KF: Video'),
 (u'community_news', u'Community: News & Information'),
 (u'community_video', u'Community: Video')]

Not the prettiest solution in the world...

Kylotan
+8  A: 

Custom comparison functions for sorting, as suggested in existing answers, do make it easy to sort in a mix of ascending and descending orders -- but they have serious performance issues and have been removed in Python 3, leaving only the preferred customization approach -- custom key-extraction functions... much speedier, though more delicate to use for the relatively rare use case of mixed ascending/descending sorts.

In Python 2.*, which supports either kind of customization (not both in the same call to sort or sorted:-), a custom comparison function can be passed as a cmp= named argument; or, a custom key-extraction function can be passed as a key= named argument. In Python 3.*, only the latter option is available.

It's definitely worth understanding the key-extraction approach, even if you think you've just solved your problem with a custom-comparison approach instead: not just for performance, but for future-proofness (Python 3) and for generality (the key= approach also applies to min, max, itertools.groupby... much more general than the cmp= approach!).

Key-extraction is very simple when all the key subfields are to be sorted the same way (all ascending, or all descending) -- you just extract them; it's still pretty easy if the subfields that go "the other way" are numbers (you just change their sign while extracting); the delicate case is exactly the one you have -- multiple string fields that must be compared in oppposite ways.

A reasonably simple approach to solving your problem is a tiny shim class:

class Reverser(object):
  def __init__(self, s): self.s = s
  def __lt__(self, other): return other.s < self.s
  def __eq__(self, other): return other.s == self.s

Note that you only have to supply __lt__ and __eq__ (the < and == operators) -- sort and friends synthesize all other comparisons, if needed, based on those two.

So, armed with this little auxiliary tool, we can proceed easily...:

def getkey(tup):
    a, b = tup[0].split('_')
    return Reverser(a), b

my_list.sort(key=getkey)

As you see, once you "get" the reverser and key extraction concepts, you pay essentially no price for using key extraction instead of custom comparison: the code I suggest is 4 statements for the reverser class (which you can write once and put into your "goodies bag" module somewhere), three for the key extraction function, and of course one for the sort or sorted call -- a total of eight vs the 4 + 1 == 5 of the custom comparison approach in the most compact form (i.e. the one using either cmp with a sign change, or cmp with swapped arguments). Three statements are not much of a price to pay for key-extraction's advantages!-)

Performance is clearly not a big issue with such a short list, but with an even modestly longer (10 times) one...:

# my_list as in the Q, my_cmp as per top A, getkey as here

def bycmp():
  return sorted(my_list*10, cmp=my_cmp)

def bykey():
  return sorted(my_list*10, key=getkey)

...

$ python -mtimeit -s'import so' 'so.bykey()'
1000 loops, best of 3: 548 usec per loop
$ python -mtimeit -s'import so' 'so.bycmp()'
1000 loops, best of 3: 995 usec per loop

I.e., the key= approach is already showing a performance gain of almost two times (sorting the list twice as fast) when working on a 50-items list -- well worth the modest price of "8 lines rather than 5", particularly with all the other advantages I already mentioned!

Alex Martelli
Wow, I like your solution. I didn't know that the cmp= approach had such a penalty.
Steven Graham
@Steven, tx -- yep, not everybody understands why cmp= was removed in Python 3 (as an "attractive nuisance" which tempts people into suffering a performance penalty!-), that's exactly why I posted this detailed explanation, so thanks for confirming it can help!-)
Alex Martelli
@Alex: I hesitate to edit one of *your* answers but perhaps my_list.key(cmp=my_cmp) should be my_list.sort(key=getkey) ?
Ned Deily
@Ned, absolutely, silly me, editing now ("quandoque bonus dormitat Homerus...!-)
Alex Martelli
This is really nice. +1 for Reverser to be a library class
gnibbler
@Alex, I think you are doing `bycmp()` a favour! By having so many duplicates the performance is much closer to O(n) that it would be in real life
gnibbler
@Alex, how is the key extraction technique able to be faster? Does Python store previously extracted keys for an element (with an accompanying cost in memory)?
Kylotan
@gnibbler, maybe: feel free to benchmark with random data!-). @Kylotan: yes, a single pass (O(N), all cases, in both time and space) extracts all keys, then sorting proper occurs (O(N) in the best case, O(N log N) typical and worst case). `cmp` needs O(N log N) (typical and worst case) extra time. The "cost in memory" is of course totally transient (handled internally in C, free of garbage collection issues, and comparable to and typically less than the input -- and the extra working memory, for `sorted`).
Alex Martelli
@Alex: it sounds like it could still be a lot of memory in some cases, especially compared to a plain sort which theoretically requires no extra memory allocation. It seems like something you might need to be able to bypass - is this possible? I couldn't find anything elucidating at docs.python.org/3.1 , except that the docs still say "The sort() and reverse() methods modify the sequence in place for economy of space" which would seem to be at odds with taking copies of all the keys!
Kylotan
@Kylotan, you're probably misled by thinking of "copies" where no copy needs to happen (_references_ are in play). No, it's not possible to "bypass" key extraction when customizing sorting in Python 3 -- exactly because your fears about "a lot of extra memory" are simply overblown. Working in place rather than making a new list while leaving the old one alone halves the total amount of (reference) memory in play _in a way that's not intrinsically transient and does have to deal with gc_. Why not do some measurements rather than play Chicken Little?-)
Alex Martelli