I have a series of data points (tuples) in a list with a format like:

points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]

The first item in each tuple is an integer, and the tuples are guaranteed to be sorted by it. The second value in each tuple is an arbitrary string.

I need them grouped into lists by their first value, in fixed-size intervals. So given an interval of 3, the above list would be broken into:

[['a', 'b', 'a', 'd'], ['c']]

I wrote the following function, which works fine on small data sets. However, it is inefficient for large inputs. Any tips on how to rewrite/optimize/minimize this so I can process large data sets?

def split_series(points, interval):
    series = []

    start = points[0][0]
    finish = points[-1][0]

    marker = start
    next = start + interval
    while marker <= finish:
        series.append([point[1] for point in points if marker <= point[0] < next])
        marker = next
        next += interval

    return series
+2  A: 

One way to do it (no promises on speed):

Break your list of tuples into two lists: [1,2,2,3,4] and ['a','b','a','d','c']

Since the first list is sorted, you can just keep iterating over it until you get to an element out of the range. Then you know the indexes of the start and end elements, so you can just slice the strings out of the second list. Continue until you've got all the intervals.

I'm not sure how efficient that'll be with traditional Python lists, but if your dataset is large enough, you could try using a NumPy array, which will slice really quickly.
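
A rough, untested sketch of that idea (it uses bisect to find the slice boundaries rather than a plain scan, and assumes points is non-empty):

from bisect import bisect_left

def split_series(points, interval):
    # Break the tuples into two parallel lists: the sorted keys and the strings.
    keys = [k for k, _ in points]
    values = [v for _, v in points]

    series = []
    start = keys[0]
    lo = 0
    while lo < len(keys):
        # Index of the first key outside the current interval; everything
        # from lo up to (but not including) hi belongs to this group.
        hi = bisect_left(keys, start + interval, lo)
        series.append(values[lo:hi])
        lo = hi
        start += interval
    return series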

perimosocordiae
A traditional Python list is an array, so the subscripting and slicing should be pretty efficient. IMO good answer.
Steve314
+2  A: 

Your code is O(n^2). Here's an O(n) solution:

def split_series(points, interval):
    series = []
    current_group = []
    marker = points[0][0]
    for value, data in points:
        if value >= marker + interval:
            series.append(current_group)
            current_group = []
            marker += interval
        current_group.append(data)

    if current_group:
        series.append(current_group)

    return series

points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]
print split_series(points, 3)  # Prints [['a', 'b', 'a', 'd'], ['c']]
RichieHindle
A linked list would make this an O(log(n)) problem, but I have no idea how to implement that efficiently in Python.
ralu
This version is much faster and very readable, but it doesn't store empty groups for empty intervals. See Nicholas Riley's answer and comments.
Corey Goldberg
+1  A: 

From your code, I'm assuming my prior comment is correct. The problem here appears to be that the performance is O(n^2): you repeat the list comprehension (which iterates over all the items) multiple times.

I say, use a simple for loop. If the current item belongs in the same group as the previous one, append it to the existing inner list ([["a"], ["b"]] -> [["a"], ["b", "c"]]). If it doesn't, append a new inner list, perhaps adding empty padding lists first (see the sketch below).
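
One possible shape for that loop (an untested sketch; it computes the group index directly from the key and pads with empty lists for skipped intervals):

def split_series(points, interval):
    start = points[0][0]
    series = [[]]
    for value, data in points:
        group = (value - start) // interval  # which inner list this item belongs to
        while len(series) <= group:          # add empty padding lists first, if needed
            series.append([])
        series[group].append(data)
    return series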

Steve314
+1  A: 

Expanding on Am's answer, use a defaultdict, and floor-divide the key by the interval to break them up correctly.

from collections import defaultdict
def split_series(points, interval):
    vals = defaultdict(list)
    for key, value in points:
        vals[(key-1)//interval].append(value)
    return vals.values()
ABentSpoon
This version is fast, but it doesn't store empty groups for empty intervals. See Nicholas Riley's answer and comments.
Corey Goldberg
+2  A: 

For completeness, here's a solution with itertools.groupby, but the dictionary solution will probably be faster (not to mention a lot easier to read).

import itertools
import operator

def split_series(points, interval):
    start = points[0][0]

    return [[v for k, v in grouper] for group, grouper in
            itertools.groupby((((n - start) // interval, val)
                               for n, val in points), operator.itemgetter(0))]

Note that the above assumes you've got at least one item in each group, otherwise it'll give different results from your script, i.e.:

>>> split_series([(1, 'a'), (2, 'b'), (6, 'a'), (6, 'd'), (11, 'c')], 3)
[['a', 'b'], ['a', 'd'], ['c']]

instead of

[['a', 'b'], ['a', 'd'], [], ['c']]

Here's a fixed-up dictionary solution. At some point the dictionary lookup time will begin to dominate, but maybe it's fast enough for you like this.

from collections import defaultdict

def split_series(points, interval):
    offset = points[0][0]
    maxval = (points[-1][0] - offset) // interval
    vals = defaultdict(list)
    for key, value in points:
        vals[(key - offset) // interval].append(value)
    return [vals[i] for i in xrange(maxval + 1)]
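
With the gappy input from above, this version does keep the empty group:

>>> split_series([(1, 'a'), (2, 'b'), (6, 'a'), (6, 'd'), (11, 'c')], 3)
[['a', 'b'], ['a', 'd'], [], ['c']]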
Nicholas Riley
If you factor out the key function, it actually becomes quite readable, I think.
hop
I actually need to know when a group is empty, so I can't make that assumption.
Corey Goldberg
OK, try this one. Since itertools.groupby is implemented in C (at least in 2.6) it'll be hard to beat the performance by implementing it in Python, so I think dictionary lookup will be faster. (When in doubt, benchmark, of course.)
Nicholas Riley
Cool, I'll run some benchmarks.
Corey Goldberg
Your dict version is the only one that correctly printed empty groups. It is also very fast. Accepting this answer.
Corey Goldberg
A: 

How about using iterators for lazy evaluation?

This should be the equivalent of your initial solution:

from itertools import groupby

def split_series(points, interval):
    """
    >>> points = [(1, 'a'), (2, 'b'), (2, 'a'), (3, 'd'), (4, 'c')]
    >>> print list(split_series(points, 3))
    [['a', 'b', 'a', 'd'], ['c']]
    """

    def interval_key(t):
        return (t[0] - points[0][0]) // interval

    groups = groupby(points, interval_key)

    for group in groups:
        yield [v for _, v in group[1]]
hop
+1  A: 

Here's a lazy approach that uses the step behavior of xrange:

def split_series(points, interval):
    end_of_chunk = interval
    chunk = []
    for marker, item in points:
        if marker > end_of_chunk:
            # Yield the finished chunk, plus one empty chunk for every
            # interval that was skipped entirely before this marker.
            for end_of_chunk in xrange(end_of_chunk, marker, interval):
                yield chunk
                chunk = []
            end_of_chunk += interval
        chunk.append(item)
    yield chunk
Ants Aasma
Other than that it's returning a generator instead of a list by design (the generator can then be materialized as a list as needed using `list()`), where does the output differ? Or is the problem the assumption that the chunks always start at 1? In that case it's easy to modify to peel off the first marker-item pair from points, calculate the first end_of_chunk off of that, and stick the item in the first chunk.
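Roughly (an untested sketch of that modification):

def split_series(points, interval):
    points = iter(points)
    # Peel off the first pair so the chunk boundaries are anchored at its marker.
    first_marker, first_item = next(points)
    end_of_chunk = first_marker + interval - 1
    chunk = [first_item]
    for marker, item in points:
        if marker > end_of_chunk:
            for end_of_chunk in xrange(end_of_chunk, marker, interval):
                yield chunk
                chunk = []
            end_of_chunk += interval
        chunk.append(item)
    yield chunk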
Ants Aasma
Oops, my bad. It does give correct output. I am erasing my previous comment and upvoting this solution.
Corey Goldberg
Also, this seems to be slightly faster than the defaultdict version.
Corey Goldberg