views:

527

answers:

8

I need to store a big list of integers in Bigtable (the DB). For efficiency, I am storing them as the diff between each pair of consecutive items.

For example:

 original_list = [1005, 1004, 1003, 1004, 1006] 

I am storing the above list (which actually contains more than 1000k items) as:

start = 1005
diff = [-1, -1, 1, 2]

The closest I could manage is:

ltp = [start]
map(lambda x: ltp.append(ltp[-1] + x), tick)

I am looking for an efficient way to convert it back into the original list.

A: 

Although I don't get why this should be more efficient, I am pretty sure a for loop will give the best performance:

l = [start]
for i in diff:
    l.append(l[-1] + i)
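
Applied to the question's example data, the loop reproduces the original list (shown here with Python 3's `print` call; a small sketch, not part of the original answer):

```python
start = 1005
diff = [-1, -1, 1, 2]

# Seed the result with the start value, then add each diff to the last item.
l = [start]
for i in diff:
    l.append(l[-1] + i)

print(l)  # [1005, 1004, 1003, 1004, 1006]
```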
bayer
sudhakar
List comprehensions are usually faster than for loops
fsanches
Care to elaborate how you would construct this list comprehension? I can't think of one that's elegant and does not create a new list every time.
bayer
mshsayem made a simple one, but it seems he deleted his answer. Here's my take (not as good as his): `[start.append(start[-1]+x) for x in diff]`. Anyway, map is also faster than for loops (according to Lutz's book again) if you don't use lambdas.
fsanches
map and list comprehensions are only faster if you use them to construct a list. Here, you are using them to make function calls (and to construct a list of Nones as a side effect).
bayer
@fsanches list comprehension creates an additional list (apart from the start list to which values are appended)
sudhakar
+6  A: 

The following works for me:

orig = [start]
for x in diff:
    orig.append(orig[-1] + x)

Using map will create a new list of the same size, filled with None. I also find a simple for loop more readable, and in this case it's about as fast as you can get.
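
To see this, forcing the `map` call (shown here in Python 3, where `map` is lazy and must be wrapped in `list`) exposes the throwaway list of `None` values — a small sketch, not the answerer's code:

```python
start = 1005
diff = [-1, -1, 1, 2]

ltp = [start]
# Each call to list.append returns None, so the mapped list is pure waste.
wasted = list(map(lambda x: ltp.append(ltp[-1] + x), diff))

print(ltp)     # [1005, 1004, 1003, 1004, 1006]
print(wasted)  # [None, None, None, None]
```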

spatz
sudhakar
@sudhakar: In Python, lambda is very inefficient -- function calls are slow, and calls to anonymous functions are even slower.
John Millikin
@fsanches: list comprehensions are faster for some cases, but in this one (building a list using previous values) would be significantly slower because it would require intermediate lists to be created.
John Millikin
In terms of speed, according to Mark Lutz's book Learning Python: list comprehension > map > for loop
fsanches
@sudhakar map is faster than for loops, but lambdas aren't, as John stated.
fsanches
Thanks spatz/John. I will use your solution, unless someone comes up with a more efficient one :)
sudhakar
+2  A: 

Several of the other respondents have reasonable implementations of the algorithm you asked for, but I'm unclear on exactly what problem it is you're really trying to solve.

Unless the numbers being stored are very large (i.e., overflow an integer and require bignums), your list of diffs won't gain you any efficiency -- an integer is an integer from the Python runtime POV, so your example "diff" list of [-1, -1, 1, 2] will consume just as much memory as the original list [1005, 1004, 1003, 1004, 1006].

rcoder
He noted he's storing the values in a DB. I assumed he's using a small data type, such as byte, in the DB to store a large list of large integers, which *is* more efficient.
spatz
He's using the `array` type, presumably with 8- or 16-bit values.
John Millikin
Yes, I am using the array type with an 8-bit unsigned char: `tickvalue = ArrayProperty('b', default=None)`
sudhakar
The main reason I am using diffs is that App Engine has a 1 MB limit. Using the original list, I can store up to 65k*13 items in a single record before I reach the 1 MB limit.
sudhakar
+6  A: 

For such large data structures numpy will work well. For this example, it's over 200x faster (see below), and a bit easier to code, basically just

add.accumulate(diff)

Comparison between numpy and direct list manipulation:

import numpy as nx
import timeit

N = 10000

diff_nx = nx.zeros(N, dtype=int)
diff_py = list(diff_nx)

start = 1005

def f0():
    orig = [start]
    for x in diff_py: 
        orig.append(orig[-1] + x)

def f1():
    diff_nx[0] = start
    nx.add.accumulate(diff_nx)

t = timeit.Timer("f0()", "from __main__ import f0, f1, diff_nx, diff_py, nx, start")
print t.timeit(number=1000)
t = timeit.Timer("f1()", "from __main__ import f0, f1, diff_nx, diff_py, nx, start")
print t.timeit(number=1000)

gives

13.4044158459     # for list looping
0.0474112033844   # for numpy accumulate

Really, though, it seems better to reuse an established compression algorithm, as can easily be done with PyTables, rather than rolling your own as it seems you're doing here.

Also, note that I'm suggesting you read the data in with room for the prepended start term, rather than rebuilding the list with the term prepended, so you don't have to do the copy.
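
As an aside (this postdates the thread): since Python 3.2, the standard library's `itertools.accumulate` performs the same running sum without numpy. A minimal sketch:

```python
from itertools import accumulate

start = 1005
diff = [-1, -1, 1, 2]

# Prepend start so the running sum begins from it, then accumulate.
orig = list(accumulate([start] + diff))

print(orig)  # [1005, 1004, 1003, 1004, 1006]
```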

tom10
+1 I've been looking for an accumulate function!
chrispy
Unfortunately, App Engine doesn't support numpy yet :(
sudhakar
+3  A: 

Perfect for generators:

def diff2abs( diffs, start ):
    yield start
    for diff in diffs:
        start += diff
        yield start

start = 1005
diffs = [-1, -1, 1, 2]
original_list = list( diff2abs( diffs, start ))
THC4k
+1 for using generators
sudhakar
+2  A: 
class runningtotal:
    def __init__(self, start = 0):
        self.total = start
    def __call__(self, value):
        self.total += value
        return self.total

Now try:

>>> map(runningtotal(start), [0,]+diff)
[1005, 1004, 1003, 1004, 1006]
chrispy
+1 for using different approach
sudhakar
+1  A: 

As mshsayem suggested, use list comprehensions - they are generally faster than for loops or map/lambdas (according to Mark Lutz's book Learning Python).

If you really want a more FP-ish solution, the proper function would be "scan", which [I believe] isn't implemented in Python, so you would have to implement it yourself (not a hard task).

"scan" is basically a reduce, but instead of reducing the list to a single value, it stores the result of each "iteration" in a new list.

If you implemented it, you could do something like:

scan(lambda x, y: x + y, [start] + diff)
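
For reference, a minimal `scan` can be written as a short generator (a sketch; this implementation is mine, not part of the answer):

```python
def scan(f, xs):
    """Like reduce, but yields every intermediate result."""
    it = iter(xs)
    acc = next(it)     # the first element seeds the accumulator
    yield acc
    for x in it:
        acc = f(acc, x)
        yield acc

start = 1005
diff = [-1, -1, 1, 2]
print(list(scan(lambda x, y: x + y, [start] + diff)))
# [1005, 1004, 1003, 1004, 1006]
```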
fsanches
You're right, there is no built-in `scan` in Python.
Mark Rushakoff
I really like this scan method. Let me see if i can come up with a good implementation. +1 for different approach
sudhakar
List comprehensions are not always faster - especially if (as in your example) for every iteration a new list is created. In that case, runtime goes up to n^2.
bayer
@usualyuseless: My example isn't a list comprehension and is creating only one list.
fsanches
Correct, my fault. However, the key point is still valid: you create a new list for every step.
bayer
No, it creates only one list if you use mutable lists (and there's no reason you wouldn't use them here).
fsanches
A: 

I don't know about your reasoning for storing the integers as diffs -- rcoder gave a good answer about why this generally is not more efficient than storing the integers themselves -- but if you don't need to have access to the entire list at once, it's more efficient memory-wise for you to use a generator. Since you say this is a "big list," you can save a lot of memory this way, instead of allocating the entire list at once. Here's a generator comprehension to get your list back:

start = 1005
def mod_start(x):
    global start
    start += x
    return start
int_generator = (mod_start(i) for i in diff)

You can then iterate over int_generator like you would any list, without having the entire list in memory at once. Note, however, that you cannot subscript or slice a generator, but you can use it in many useful situations.

You can clean up the example so that the start variable does not need to be global. It just can't be local to the mod_start function.
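
One way to do that cleanup (a sketch; the `reconstruct` name is mine): keep `start` inside an enclosing function and return the generator from there, so the state lives in a closure rather than a global:

```python
def reconstruct(start, diffs):
    def running():
        total = start
        for d in diffs:
            total += d
            yield total
    return running()

# Note: like the original generator expression, this yields the values
# *after* start, not start itself.
int_generator = reconstruct(1005, [-1, -1, 1, 2])
print(list(int_generator))  # [1004, 1003, 1004, 1006]
```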

Edit: You don't have to use the generator comprehension to get a generator. You can also use a generator function with the yield expression, like THC4k did. That avoids the start variable scope issue and is probably a little cleaner. You can also get a list from a generator at any time by passing it to the list() built-in function.

Jeff