I am wanting to zip up a list of entities with a new entity to generate a list of coordinates (2-tuples), but I want to assure that for (i, j) that i < j is always true.
However, I am not extremely pleased with my current solutions:
from itertools import repeat
mems = range(1, 10, 2)
mem = 8
def ij(i, j):
if i < j:
return (i, j)
else:
return (j, i)
def zipij(m=mem, ms=mems, f=ij):
return map(lambda i: f(i, m), ms)
def zipij2(m=mem, ms=mems):
return map(lambda i: tuple(sorted([i, m])), ms)
def zipij3(m=mem, ms=mems):
return [tuple(sorted([i, m])) for i in ms]
def zipij4(m=mem, ms=mems):
mems = zip(ms, repeat(m))
half1 = [(i, j) for i, j in mems if i < j]
half2 = [(j, i) for i, j in mems[len(half1):]]
return half1 + half2
def zipij5(m=mem, ms=mems):
mems = zip(ms, repeat(m))
return [(i, j) for i, j in mems if i < j] + [(j, i) for i, j in mems if i > j]
Output for above:
>>> print zipij() # or zipij{2-5}
[(1, 8), (3, 8), (5, 8), (7, 8), (8, 9)]
Instead of normally:
>>> print zip(mems, repeat(mem))
[(1, 8), (3, 8), (5, 8), (7, 8), (9, 8)]
Timings: snipped (no longer relevant, see much faster results in answers below)
For len(mems) == 5
, there is no real issue with any solution, but for zipij5() for instance, the second list comprehension is needlessly going back over the first four values when i > j
was already evaluated to be True
for those in the first comprehension.
For my purposes, I'm positive that len(mems)
will never exceed ~10000, if that helps form any answers for what solution is best. To explain my use case a bit (I find it interesting), I will be storing a sparse, upper-triangular, similarity matrix of sorts, and so I need the coordinate (i, j)
to not be duplicated at (j, i)
. I say of sorts because I will be utilizing the new Counter()
object in 2.7 to perform quasi matrix-matrix and matrix-vector addition. I then simply feed counter_obj.update()
a list of 2-tuples and it increments those coordinates how many times they occur. SciPy sparse matrices ran about 50x slower, to my dismay, for my use cases... so I quickly ditched those.
So anyway, I was surprised by my results... The first methods I came up with were zipij4
and zipij5
, and yet they are still the fastest, despite building a normal zip()
and then generating a new zip after changing the values. I'm still rather new to Python, relatively speaking (Alex Martelli, can you hear me?), so here are my naive conclusions:
tuple(sorted([i, j]))
is extremely expensive (Why is that?)map(lambda ...)
seems to always do worse than a list comp (I think I've read this and it makes sense)- Somehow
zipij5()
isn't much slower despite going over the list twice to check for i-j inequality. (Why is this?)
And lastly, I would like to know which is considered most efficient... or if there are any other fast and memory-inexpensive ways that I haven't yet thought of. Thank you.
Current Best Solutions
## Most BRIEF, Quickest with UNSORTED input list:
## truppo's
def zipij9(m=mem, ms=mems):
return [(i, m) if i < m else (m, i) for i in ms]
## Quickest with pre-SORTED input list:
## Michal's
def zipij10(m=mem, ms=mems):
i = binsearch(m, ms) ## See Michal's answer for binsearch()
return zip(ms[:i], repeat(m)) + zip(repeat(m), ms[i:])
Timings
# Michal's
Presorted - 410µs per loop
Unsorted - 2.09ms per loop ## Due solely to the expensive sorted()
# truppo's
Presorted - 880µs per loop
Unsorted - 896µs per loop ## No sorted() needed
Timings were using mems = range(1, 10000, 2)
, which is only ~5000 in length. sorted()
will probably become worse at higher values, and with lists that are more shuffled. random.shuffle()
was used for the "Unsorted" timings.