tags:

views:

148

answers:

4

I'm completely new to Python and while trying various random bits and pieces I've struck upon a problem that I believe I've "solved", but the code doesn't feel right - I strongly suspect there is going to be a better way to get the desired result.

FYI - I'm using whatever the latest version of Python 3 is, on Windows.

Problem definition

Briefly, what I'm doing is sorting a list of pairs, such that the pairs containing the elements that appears in the fewest pairs are sorted to the front.

The pairs are in the form [i,j] with 0 <= i <= j < n, where n is a known maximum value for the elements. There are no duplicate pairs within the list.

The count of an element i is a simple count of the number of pairs (not pair elements) in the forms [i,j],[j,i] and [i,i] where j is any value that results in a valid pair.

In the sorted result, a pair [i,j] should appear before a pair [k,l] if count(i) < count(k) or count(i) == count(k) and count(j) < count(l) (If count(j) == count(l) the two can be in either order - I'm not bothered about the sort being stable, would be a bonus though).

In the sorted result, a pair [i,j] should appear before a pair [k,l] if
min(count(i),count(j)) < min(count(k),count(l)) or
min(count(i),count(j)) == min(count(k),count(l)) and max(count(i),count(j)) < max(count(k),count(l)).
In otherwords, if the pair is [0,1] and 1 has a count of one, but 0 has a count of four hundred, the pair should still be at (or at least very near) the front of the list - they need sorting by the least frequent element in the pair.

Here's a contrived example I've built:

input   [[0,0],[1,2],[1,4],[2,2],[2,3],[3,3],[3,4]]

Here's the individual element counts and the source pairs they come from:

0: 1   [0,0]
1: 2   [1,2],[1,4]
2: 3   [1,2],[2,2],[2,3]
3: 3   [2,3],[3,3],[3,4]
4: 2   [1,4],[3,4]

And here's the result, along with the pair scores:

output: [[0,0],[1,4],[1,2],[3,4],[2,2],[2,3],[3,3]]
scores:   1     1-2   1-3   2-3   3     3     3

Here, 0 has a count of one (it appears in one pair, albeit twice) so comes first. 1 has a count of two, so appears second - with [1,4] before [1,2] because 4 has a count of two and 2 has a count of three, et cetera.

My current solution

As said, I believe this implimentation works accurately, but it just feels that there must be a better way to go about doing this. Anyway, here's what I've got so far:

#my implementation uncommented to reduce post size, see history for comments
def sortPairList( data , n ):
    count = []
    for i in range(0,n):
        count.append( 0 )

    #count up the data
    for p in data:
        count[p[0]] += 1
        if p[1] != p[0]:
            count[p[1]] += 1

    maxcount = 0
    for i in range(0,n):
        if count[i] > maxcount:
            maxcount = count[i]

    def elementFrequency(p):
        if count[ p[0] ] < count[ p[1] ]:
            return count[ p[0] ] + float(count[ p[1] ]) / (maxcount+1)
        else:
            return count[ p[1] ] + float(count[ p[0] ]) / (maxcount+1)

    data.sort( key=elementFrequency )

Any suggestions on a more "Python" way of doing this?
Or anything that's wrong with my current attempt?

New Test Case (see answer's comments)

input:    [[0,0],[0,3],[0,5],[0,7],[1,1],[1,2],[1,8],[2,4],[2,5],[3,4],[3,5],[3,9],[4,4],[4,7],[4,8],[6,8],[7,7],[7,9],[8,9]]
expected: [[6,8],[1,1],[1,2],[2,5],[0,5],[1,8],[3,5],[3,9],[7,9],[8,9],[2,4],[0,0],[0,3],[0,7],[7,7],[3,4],[4,7],[4,8],[4,4]]
+3  A: 

I would probably use a Counter (needs Python ≥2.7 or ≥3.1) for tallying.

from collections import Counter
from itertools import chain
def sortPairList2(data):
    tally = Counter(chain(*map(set, data)))
    data.sort(key=lambda x: sorted(tally[i] for i in x))

Note that:

  1. You can create an anonymous function with lambda. For example,

    >>> c = 4
    >>> a = lambda p: p - c
    >>> a(7)
    3
    
  2. The sort key need not be a number. Anything comparable can be used as the return value of the key function. In my code, a list is used for ordering.

  3. There are many simpler idioms in Python for your original code.

    • The count can be initialized using count = [0] * n instead of that loop.
    • The maxcount can be obtained with the max function. maxcount = max(count)
  4. List comprehension is used a lot in Python. If your target is to transform an iterable into another iterable, prefer comprehension over loops.

KennyTM
That looks promising, much *much* more concise. Unfortunatly, I don't have a Python interpreter available from my current location - but I'll give this a try once I'm home.
DMA57361
@DMA: There is an [online compiler](http://www.ideone.com) for many language. For this code, see http://www.ideone.com/5VFuw.
KennyTM
@KennyTM - I'd tried briefy in codepad and it failed on the import (probably a version mismatch). Your other points also look pretty useful as well, thanks. I'll test with some real input this evening and see what result I get.
DMA57361
@KennyTM - I wasn't getting the expected results from your suggestion on a second test case, it turns out I'd botched the problem definition up. Sorry about that, the question is updated if you're interested. I don't yet really understand your suggestion yet, I've got some reading to do before I can attempt to fix it.
DMA57361
@DMA: See update.
KennyTM
@KennyTM - thank's for the help, that does the job nicely, and the additional points in your answer look like they will be very useful.
DMA57361
+1  A: 
>>> n = 4
>>> freqs = {i: sum(i in j for j in inp) for i in range(n+1)}
>>> def key(x):
    a, b = x
    return min(freqs[a], freqs[b]), max(freqs[a], freqs[b])

>>> sorted(inp, key=key)

P.S. Please note that input is a bad name for a variable as it shadows built-in.

SilentGhost
@SilentGhost `input` was only for the purposes of my question, not a real variable name, but I'll make sure I watch out for that in the future. These look promising, I'll give them a test with some real input this evening.
DMA57361
@SilentGhost - I wasn't getting the expected results from your suggestion on a second test case, it turns out I'd botched the problem definition up. Sorry about that, the question is updated if you're interested. I've fixed your first suggestion by replacing the return statement with `if( aa < bb ): return aa * n + bb else: return bb * n + aa` (I think using the sum method to precomute frequencies might work for me). I don't understand your second one yet, so I'm going to have a read.
DMA57361
@DMA: see my edit
SilentGhost
@SilentGhost - thanks for the help, that works well, but I'm giving the tick to KennyTM today because of the extra information in the post.
DMA57361
@DMA: I hope you're going to use his suggestion then and not mine, thanks.
SilentGhost
A: 

While KennyTM solution works I tried to do it myself.

My solution precomputes frequencies and stores it in dictionary where str(n) is key. I had some trouble with changing compare function known from Python2 to key used with Python3, but I found recipe at ActiveState code

item_cnt = {}

def icount(n):
    return item_cnt[str(n)]

def add_item(n):
    sn = str(n)
    try:
        item_cnt[sn] += 1
    except KeyError:
        item_cnt[sn] = 1

# sort callback
def cmp_items(ij, kl):
    i, j = ij
    k, l = kl
    if icount(i) < icount(k) or icount(i) == icount(k) and icount(j) < icount(l):
        return -1
    return 1

input = [[0,0],[1,2],[1,4],[2,2],[2,3],[3,3],[3,4]]
# count all items
for (i, j) in input:
    add_item(i)
    add_item(j)

# works with Python 2.x
#input.sort(cmp_items)
# works with Python2.6 and Python 3.x
# to convert compare function to key look at:
# http://code.activestate.com/recipes/576653-convert-a-cmp-function-to-a-key-function/
input.sort(key=cmp_to_key(cmp_items))
print(input)
Michał Niklas
A: 

Similar to KennyTM's solution, but for Python 2.5 or greater:

import collections

def sort_by_occurence(sequences):
    tally = collections.defaultdict(int)
    for sequence in sequences:
        for item in sequence:
            tally[item] += 1
    sequences.sort(key=lambda x:map(tally.get, x))


pair_list = [[0,0],[1,2],[1,4],[2,2],[2,3],[3,3],[3,4]]
sort_by_occurence(pair_list)
print pair_list
pillmuncher