I have a series of data points (tuples) in a list with a format like:

points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]

The first item in each tuple is an integer, and the tuples are guaranteed to be sorted by it. The second value in each tuple is an arbitrary string.

I need them grouped into lists by their first value, in fixed-size intervals. So given an interval of 3, the above list would be broken into:

[['a', 'b', 'a', 'd'], ['c']]

I wrote the following function, which works fine on small data sets. However, it is inefficient for large inputs. Any tips on how to rewrite/optimize/minimize this so I can process large data sets?

def split_series(points, interval):
    series = []

    start = points[0][0]
    finish = points[-1][0]

    marker = start
    next = start + interval
    while marker <= finish:
        series.append([point[1] for point in points if marker <= point[0] < next])
        marker = next
        next += interval

    return series
+2  A: 

One way to do it (no promises on speed):

Break your list of tuples into two lists: [1,2,2,3,4] and ['a','b','a','d','c']

Since the first list is sorted, you can just keep iterating over it until you get to an element out of the range. Then you know the indexes of the start and end elements, so you can just slice the strings out of the second list. Continue until you've got all the intervals.

I'm not sure how efficient that'll be with traditional Python lists, but if your dataset is large enough, you could try using a NumPy array, which will slice really quickly.
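
A rough, untested sketch of that idea (it uses bisect to find the slice boundaries rather than a plain scan, and assumes points is non-empty):

from bisect import bisect_left

def split_series(points, interval):
    # Break the tuples into two parallel lists: the sorted keys and the strings.
    keys = [k for k, _ in points]
    values = [v for _, v in points]

    series = []
    start = keys[0]
    lo = 0
    while lo < len(keys):
        # Index of the first key outside the current interval; everything
        # from lo up to (but not including) hi belongs to this group.
        hi = bisect_left(keys, start + interval, lo)
        series.append(values[lo:hi])
        lo = hi
        start += interval
    return series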

perimosocordiae
A traditional Python list is an array, so the subscripting and slicing should be pretty efficient. IMO good answer.
Steve314
+2  A: 

Your code is O(n^2). Here's an O(n) solution:

def split_series(points, interval):
    series = []
    current_group = []
    marker = points[0][0]
    for value, data in points:
        if value >= marker + interval:
            series.append(current_group)
            current_group = []
            marker += interval
        current_group.append(data)

    if current_group:
        series.append(current_group)

    return series

points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]
print split_series(points, 3)  # Prints [['a', 'b', 'a', 'd'], ['c']]
RichieHindle
A linked list would make this an O(log(n)) problem, but I have no idea how to implement that efficiently in Python.
ralu
This version is much faster and very readable, but it doesn't store empty groups for empty intervals. See Nicholas Riley's answer and comments.
Corey Goldberg
+1  A: 

From your code, I'm assuming my prior comment is correct. The problem here appears to be that the performance is O(n^2): you repeat the list comprehension (which iterates over all the items) multiple times.

I say, use a simple for loop. If the current item belongs in the same group as the previous one, append it to the existing inner list ([["a"], ["b"]] -> [["a"], ["b", "c"]]). If it doesn't, append a new inner list, perhaps adding empty padding lists first (see the sketch below).
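
One possible shape for that loop (an untested sketch; it computes the group index directly from the key and pads with empty lists for skipped intervals):

def split_series(points, interval):
    start = points[0][0]
    series = [[]]
    for value, data in points:
        group = (value - start) // interval  # which inner list this item belongs to
        while len(series) <= group:          # add empty padding lists first, if needed
            series.append([])
        series[group].append(data)
    return series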

Steve314
+1  A: 

Expanding on Am's answer, use a defaultdict, and floor-divide the key by the interval to break them up correctly.

from collections import defaultdict
def split_series(points, interval):
    vals = defaultdict(list)
    for key, value in points:
        vals[(key-1)//interval].append(value)
    return vals.values()
ABentSpoon
This version is fast, but it doesn't store empty groups for empty intervals. See Nicholas Riley's answer and comments.
Corey Goldberg
+2  A: 

For completeness, here's a solution with itertools.groupby, but the dictionary solution will probably be faster (not to mention a lot easier to read).

import itertools
import operator

def split_series(points, interval):
    start = points[0][0]

    return [[v for k, v in grouper] for group, grouper in
            itertools.groupby((((n - start) // interval, val)
                               for n, val in points), operator.itemgetter(0))]

Note that the above assumes you've got at least one item in each group, otherwise it'll give different results from your script, i.e.:

>>> split_series([(1, 'a'), (2, 'b'), (6, 'a'), (6, 'd'), (11, 'c')], 3)
[['a', 'b'], ['a', 'd'], ['c']]

instead of

[['a', 'b'], ['a', 'd'], [], ['c']]

Here's a fixed-up dictionary solution. At some point the dictionary lookup time will begin to dominate, but maybe it's fast enough for you like this.

from collections import defaultdict

def split_series(points, interval):
    offset = points[0][0]
    maxval = (points[-1][0] - offset) // interval
    vals = defaultdict(list)
    for key, value in points:
        vals[(key - offset) // interval].append(value)
    return [vals[i] for i in xrange(maxval + 1)]
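
With the gappy input from above, this version does keep the empty group:

>>> split_series([(1, 'a'), (2, 'b'), (6, 'a'), (6, 'd'), (11, 'c')], 3)
[['a', 'b'], ['a', 'd'], [], ['c']]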
Nicholas Riley
If you factor out the key function, it actually becomes quite readable, I think.
hop
I actually need to know when a group is empty, so I can't make that assumption.
Corey Goldberg
OK, try this one. Since itertools.groupby is implemented in C (at least in 2.6) it'll be hard to beat the performance by implementing it in Python, so I think dictionary lookup will be faster. (When in doubt, benchmark, of course.)
Nicholas Riley
Cool, I'll run some benchmarks.
Corey Goldberg
Your dict version is the only one that correctly printed empty groups. It is also very fast. Accepting this answer.
Corey Goldberg
A: 

How about using iterators for lazy evaluation?

This should be the equivalent of your initial solution:

from itertools import groupby

def split_series(points, interval):
    """
    >>> points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]
    >>> print list(split_series(points, 3))
    [['a', 'b', 'a', 'd'], ['c']]
    """

    def interval_key(t):
        return (t[0] - points[0][0]) // interval

    groups = groupby(points, interval_key)

    for group in groups:
        yield [v for _, v in group[1]]
hop
+1  A: 

Here's a lazy approach that uses the step behavior of xrange:

def split_series(points, interval):
    end_of_chunk = interval
    chunk = []
    for marker, item in points:
        if marker > end_of_chunk:
            # Yield the finished chunk, plus one empty chunk for every
            # interval that was skipped entirely before this marker.
            for end_of_chunk in xrange(end_of_chunk, marker, interval):
                yield chunk
                chunk = []
            end_of_chunk += interval
        chunk.append(item)
    yield chunk
Ants Aasma
Other than that it's returning a generator instead of a list by design (the generator can then be materialized as a list as needed using `list()`), where does the output differ? Or is the problem the assumption that the chunks always start at 1? In that case it's easy to modify to peel off the first marker-item pair from points, calculate the first end_of_chunk off of that, and stick the item in the first chunk.
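Roughly (an untested sketch of that modification):

def split_series(points, interval):
    points = iter(points)
    # Peel off the first pair so the chunk boundaries are anchored at its marker.
    first_marker, first_item = next(points)
    end_of_chunk = first_marker + interval - 1
    chunk = [first_item]
    for marker, item in points:
        if marker > end_of_chunk:
            for end_of_chunk in xrange(end_of_chunk, marker, interval):
                yield chunk
                chunk = []
            end_of_chunk += interval
        chunk.append(item)
    yield chunk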
Ants Aasma
Oops, my bad. It does give correct output. I am erasing my previous comment and upvoting this solution.
Corey Goldberg
Also, this seems to be slightly faster than the defaultdict version.
Corey Goldberg