ansaurus

Question

How to list an image sequence in an efficient way? Numerical sequence comparison in Python

Answer 1

+5 A:

Here is a working implementation of what you want to achieve, using the code you added as a starting point:

#!/usr/bin/env python

import itertools
import re

# This algorithm only works if DATA is sorted.
DATA = ["image_0001", "image_0002", "image_0003",
        "image_0010", "image_0011",
        "image_0011-1", "image_0011-2", "image_0011-3",
        "image_0100", "image_9999"]

def extract_number(name):
    # Match the last number in the name and return it as a string,
    # including leading zeroes (that's important for formatting below).
    return re.findall(r"\d+$", name)[0]

def collapse_group(group):
    if len(group) == 1:
        return group[0][1]  # Unique names collapse to themselves.
    first = extract_number(group[0][1])  # Fetch range
    last = extract_number(group[-1][1])  # of this group.
    # Cheap way to compute the string length of the upper bound,
    # discarding leading zeroes.
    length = len(str(int(last)))
    # Now we have the length of the variable part of the names,
    # the rest is only formatting.
    return "%s[%s-%s]" % (group[0][1][:-length],
        first[-length:], last[-length:])

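# Consecutive names keep (index - number) constant, so groupby() puts
# each run of consecutive names into one group.
groups = [collapse_group(tuple(group))
    for key, group in itertools.groupby(enumerate(DATA),
        lambda pair: pair[0] - int(extract_number(pair[1])))]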

print(groups)

This prints ['image_000[1-3]', 'image_00[10-11]', 'image_0011-[1-3]', 'image_0100', 'image_9999'], which is what you want.
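
To see why the grouping key works, here is a minimal sketch on plain integers. The function name consecutive_runs is made up for illustration, and the lambda is written in a form that also runs on Python 3. Within a run of consecutive numbers the index and the value increase together, so index minus value stays constant; groupby starts a new group whenever that difference changes.

```python
import itertools

def consecutive_runs(numbers):
    """Split a sorted list of integers into runs of consecutive values."""
    runs = []
    # pair is (index, value); the difference is constant inside a run.
    for _, group in itertools.groupby(enumerate(numbers),
                                      lambda pair: pair[0] - pair[1]):
        runs.append([value for _, value in group])
    return runs

print(consecutive_runs([1, 2, 3, 10, 11, 100]))
# [[1, 2, 3], [10, 11], [100]]
```

The answer's code applies the same idea to names by taking the value from extract_number().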

HISTORY: I initially answered the question backwards, as @Mark Ransom pointed out below. For the sake of history, my original answer was:

You're looking for glob. Try:

import glob
images = glob.glob("image_[0-9]*")

Or, using your example:

images = [glob.glob(pattern) for pattern in ("image_000[1-3]*",
    "image_00[10-11]*", "image_0011-[1-3]*", "image_9999*")]
images = [image for seq in images for image in seq]  # flatten the list
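
A caveat about those patterns, sketched with the standard fnmatch module (glob applies the same matching rules to file names): square brackets match a single character from a set, so [10-11] is the set containing '0' and '1', not the numeric range 10 to 11. The sample names below are illustrative only.

```python
import fnmatch

pattern = "image_00[10-11]*"
names = ["image_0001", "image_0010", "image_0011", "image_0012", "image_0020"]

# [10-11] matches ONE character that is '0' or '1'; the trailing * matches the rest.
matched = [name for name in names if fnmatch.fnmatchcase(name, pattern)]
print(matched)
```

Here image_0001 and image_0012 match as well, which is broader than the 10 to 11 range the pattern seems to suggest.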

Frédéric Hamidi 2010-10-13 20:39:16

I think this solution is backwards from what the question is asking. Given a flattened list, how would you derive the glob patterns?

Mark Ransom 2010-10-13 21:54:46

@Mark, you're right, I misunderstood the question (and its title really should be "Given a flattened list, how would you derive the glob patterns?"). I think I'll get some sleep before giving it another try :]

Frédéric Hamidi 2010-10-13 22:00:30

@Frédéric @Mark, thank you both for your assistance. I'm really enjoying this problem. I'm learning as I go.

2010-10-13 22:06:35

@Frédéric, based on the number of upvotes you've gotten I'd say you're not the only one who misunderstood.

Mark Ransom 2010-10-13 22:16:39

@Frédéric - THANK YOU - that's exactly what I wanted to achieve. This is a really clear and readable result. I'll experiment with it and get back in touch. :o) Perfect

2010-10-14 22:23:18

You're welcome, your question was quite an interesting challenge to solve :)

Frédéric Hamidi 2010-10-14 22:26:34

Answer 2

+2 A:

def ranges(sorted_list):
    first = None
    for x in sorted_list:
        if first is None:
            first = last = x
        elif x == increment(last):
            last = x
        else:
            yield first, last
            first = last = x
    if first is not None:
        yield first, last

The increment function is left as an exercise for the reader.

Edit: here's an example of how it would be used with integers instead of strings as input.

def increment(x): return x+1

list(ranges([1,2,3,4,6,7,8,10]))
[(1, 4), (6, 8), (10, 10)]

For each contiguous range in the input you get a pair indicating the start and end of the range. If an element isn't part of a range, the start and end values are identical.
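
For string input, one possible increment is sketched below. It is not part of the original answer and assumes names end in a zero-padded number, such as image_0001; returning None for names without a trailing number is my own choice, so that such a name never extends a range.

```python
import re

def increment(name):
    """Return name with its trailing number increased by one, keeping zero padding."""
    match = re.search(r"\d+$", name)
    if match is None:
        return None  # no trailing number, so no name can follow this one
    digits = match.group(0)
    bumped = str(int(digits) + 1).zfill(len(digits))
    return name[:match.start()] + bumped

print(increment("image_0001"))  # image_0002
print(increment("image_0099"))  # image_0100
```

With this increment, ranges(['image_0001', 'image_0002', 'image_0003', 'image_0010', 'image_0011']) yields ('image_0001', 'image_0003') and then ('image_0010', 'image_0011').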

Mark Ransom 2010-10-13 21:18:34

@Mark Ransom, thanks, but I don't really understand. Assume I've sorted the files into a list: sorted_list = ['image_0001','image_0002','image_0003','image_0010', 'image_0011']. Can you explain what you have shown me? For each item in sorted_list, do you increment it and then check whether the result exists in the rest of the list?

2010-10-13 21:28:06

@user, this algorithm tests each element to see if it should be included in the current sequence by checking whether it's equal to last+1. If it is, the current sequence is extended; otherwise the sequence is yielded as a tuple and the current sequence is reset to the new element. If we can ensure that the input is not empty, this could even be simplified.

Mark Ransom 2010-10-13 21:37:30

@Mark Ransom, thank you. OK, so I understand that it tests each element to see if it is equal to the previous element+1. I don't understand "otherwise the sequence is yielded as a tuple"...

2010-10-13 21:44:33

Answer 3

+3 A:

Okay, so I found your question to be a fascinating puzzle. I've left the "compression" of the numeric ranges up to you (marked as a TODO), as there are different ways to do it depending on how you want the output formatted and whether you want the minimum number of elements or the shortest string description.

This solution uses a simple regular expression (digit strings) to split each string into two kinds of pieces: static and variable. After the data is classified, I use groupby to collect items with the same static content into the longest matching groups to achieve the summary effect. I mix integer index sentinels into the grouping key (in matchGrouper) so I can re-select the varying parts from all elements (in unpack).

import re
import glob
from itertools import groupby
from operator import itemgetter

def classifyGroups(iterable, reObj=re.compile(r'\d+')):
    """Yields successive match lists, where each item in the list is either
    static text content, or a list of matching values.

     * `iterable` is a list of strings, such as glob('images/*')
     * `reObj` is a compiled regular expression that describes the
            variable section of the iterable you want to match and classify
    """
    def classify(text, pos=0):
        """Use a regular expression object to split the text into match and non-match sections"""
        r = []
        for m in reObj.finditer(text, pos):
            m0 = m.start()
            r.append((False, text[pos:m0]))
            pos = m.end()
            r.append((True, text[m0:pos]))
        r.append((False, text[pos:]))
        return r

    def matchGrouper(each):
        """Returns index of matches or original text for non-matches"""
        return [(i if t else v) for i,(t,v) in enumerate(each)]

    def unpack(k,matches):
        """If the key is an integer, unpack the value array from matches"""
        if isinstance(k, int):
            k = [m[k][1] for m in matches]
        return k

    # classify each item into matches
    matchLists = (classify(t) for t in iterable)

    # group the matches by their static content
    for key, matches in groupby(matchLists, matchGrouper):
        matches = list(matches)
        # Yield a list of content matches.  Each entry is either text
        # from static content, or a list of matches
        yield [unpack(k, matches) for k in key]
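
To relate this to the example names, here is a small standalone sketch of what the classify step and the grouping key look like for a single file name. It re-implements classify outside classifyGroups purely for illustration, so treat it as an approximation of the code above rather than a part of it.

```python
import re

def classify(text, reObj=re.compile(r'\d+')):
    """Split text into (is_match, substring) pairs: digits vs. everything else."""
    parts = []
    pos = 0
    for m in reObj.finditer(text):
        parts.append((False, text[pos:m.start()]))  # static text before the digits
        parts.append((True, m.group(0)))            # the digits themselves
        pos = m.end()
    parts.append((False, text[pos:]))               # static text after the last digits
    return parts

parts = classify('images/image_0011-2.jpg')
print(parts)
# [(False, 'images/image_'), (True, '0011'), (False, '-'), (True, '2'), (False, '.jpg')]

# The grouping key keeps static text and replaces each digit run by its position.
key = [(i if is_match else text) for i, (is_match, text) in enumerate(parts)]
print(key)
# ['images/image_', 1, '-', 3, '.jpg']
```

File names that produce the same key and sit next to each other in the sorted list end up in one group, and the integer positions tell unpack which pieces to collect from every member of the group.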

Finally, we add enough logic to perform pretty printing of the output, and run an example.

def makeResultPretty(res):
    """Formats data somewhat like the question"""
    r = []
    for e in res:
        if isinstance(e, list):
            # TODO: collapse and simplify ranges as desired here
            if len(set(e))<=1:
                # it's a list of the same element
                e = e[0]
            else:
                # prettify the list
                e = '['+' '.join(e)+']'
        r.append(e)
    return ''.join(r)

fnList = sorted(glob.glob('images/*'))
re_digits = re.compile(r'\d+')
for res in classifyGroups(fnList, re_digits):
    print(makeResultPretty(res))

My directory of images was created from your example. You can replace fnList with the following list for testing:

fnList = [
 'images/image_0001.jpg',
 'images/image_0002.jpg',
 'images/image_0003.jpg',
 'images/image_0010.jpg',
 'images/image_0011-1.jpg',
 'images/image_0011-2.jpg',
 'images/image_0011-3.jpg',
 'images/image_0011.jpg',
 'images/image_9999.jpg']

And when I run against this directory, my output looks like:

StackOverflow/3926936% python classify.py
images/image_[0001 0002 0003 0010].jpg
images/image_0011-[1 2 3].jpg
images/image_[0011 9999].jpg
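
One possible way to fill in the TODO is sketched below. It is not part of the original answer: collapse_ranges is a hypothetical helper, it assumes every value is a string of digits (which holds for the re_digits pattern), and it only joins neighbours that have the same width so zero padding stays readable.

```python
def collapse_ranges(values):
    """Collapse digit strings such as ['0001', '0002', '0003', '0010']
    into ['0001-0003', '0010'] by joining runs of consecutive numbers."""
    collapsed = []
    start = prev = None
    for value in values:
        consecutive = (prev is not None
                       and len(value) == len(prev)
                       and int(value) == int(prev) + 1)
        if consecutive:
            prev = value  # extend the current run
            continue
        if start is not None:
            # Close the previous run: single values stay as they are.
            collapsed.append(start if start == prev else start + '-' + prev)
        start = prev = value
    if start is not None:
        collapsed.append(start if start == prev else start + '-' + prev)
    return collapsed

print(collapse_ranges(['0001', '0002', '0003', '0010']))
# ['0001-0003', '0010']
```

Inside makeResultPretty, the "prettify the list" branch could then read e = '[' + ' '.join(collapse_ranges(e)) + ']', which would turn the first line of the output above into images/image_[0001-0003 0010].jpg.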

Shane Holloway 2010-10-13 23:03:56

@Shane Holloway. Thanks, I am very unsure of what you are doing. Could you add some comments that help me relate to the example image_0002, image_0003 etc... If you could add a test list, I might be able to step through and run your solution bit by bit.

2010-10-13 23:48:59

THANK YOU for your time, Shane. I'll continue looking at your itertools solution; I think I can learn a lot from it. The Edit2 in the original post was a result of googling/studying your well-commented solution.

2010-10-14 22:29:14

Answer 4

A:

Stack Overflow doesn't seem to allow me to post comments above :/

Anyway, I have been following this thread, but I see two problems with the examples provided:

@Frédéric your code raises an exception if the DATA list contains names without any numbers in them. My Python's not good enough to figure out a way around it.

@Shane your example groups items multiple times. For instance, fnList = ['m_20091008-1118a.dat', 'm_20100407-1248a.dat'] returns m_[20091008 20100407]-[1118 1248]a.dat rather than m_20091008-1118a.dat m_20100407-1248a.dat

Jack 2010-10-18 11:32:54

Yup, I know :) It also only works if `DATA` is sorted, and fails if `DATA` is empty, or not a sequence, or if we can't import `itertools` because we're running under Python 2.2, or if we run out of memory during `groupby()`, etc. My goal was to deliver a clear implementation to the questioner; he's free to double the code's size with error handling if he so wishes.

Frédéric Hamidi 2010-10-18 15:07:04

It wasn't meant as a criticism; I guess what I'm asking is what I would need to change in the code to make it work in this situation.

Jack 2010-10-19 07:44:51

@Jack, well, to solve the problem you pointed out, I would have `extract_number()` return some unique value if the name does not end with a number. `-1` would be optimal since it can never be returned with "valid" names. Then I would add a special case in `collapse_group()` that checks if `extract_number(group[0][1])` returns `-1`. If it does, I'd know I'm processing a group containing invalid names, and I would be able to return each one of them, like in the unique name case. I would probably have to change `collapse_group()` to return a sequence to achieve this, however.
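
A minimal sketch of one way to handle names that do not end in a number, along the lines suggested above but not the commenter's own code. It assumes Python 3, uses None rather than -1 as the marker, and changes the grouping key instead of having collapse_group() return a sequence; group_key and collapse are hypothetical names.

```python
import itertools
import re

def extract_number(name):
    """Return the trailing digits of name as a string, or None if there are none."""
    match = re.search(r"\d+$", name)
    return match.group(0) if match else None

def group_key(pair):
    index, name = pair
    number = extract_number(name)
    if number is None:
        # The index makes this key unique, so an unnumbered name
        # always ends up alone in its own group.
        return ("unnumbered", index)
    return ("numbered", index - int(number))

def collapse_group(names):
    if len(names) == 1:
        return names[0]  # unique names collapse to themselves
    first = extract_number(names[0])
    last = extract_number(names[-1])
    length = len(str(int(last)))  # width of the varying part
    return "%s[%s-%s]" % (names[0][:-length], first[-length:], last[-length:])

def collapse(sorted_names):
    """Collapse a sorted list of names into range patterns."""
    return [collapse_group([name for _, name in group])
            for _, group in itertools.groupby(enumerate(sorted_names), group_key)]

print(collapse(["image_0001", "image_0002", "notes", "readme", "shot_0007"]))
# ['image_000[1-2]', 'notes', 'readme', 'shot_0007']
```

As in the original code, the input still has to be sorted, and names with different prefixes but consecutive numbers are not told apart.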

Frédéric Hamidi 2010-10-19 10:02:31

Thanks for that. I will play around with the code and see.

Jack 2010-10-20 09:45:50
