
Hello,

I am looking for advice on how to process a single list with two nested loops as fast as possible, avoiding len(list)^2 comparisons and avoiding putting duplicate files into groups.

More precisely: I have a list of 'file' objects, each of which has a timestamp. I want to group the files by their timestamps and a time offset. E.g., starting from a file X, I want to create a group with all the files whose timestamp < (timestamp(X) + offset).

For this, I did:

for file_a in list:
   temp_group = group()
   temp_group.add(file_a)
   list.remove(file_a)
   for file_b in list:
      if (file_b.timestamp < (file_a.timestamp + offset)):
         temp_group.add(file_b)
         list.remove(file_b)

   groups.add(temp_group)

(OK, the real code is more complicated, but this is the main idea.)

This obviously doesn't work, because I am modifying the list during the loops, and strange things happen :)
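
To illustrate the kind of surprise I mean, here is a minimal sketch with plain integers instead of my file objects (hypothetical data, not my real code):

nums = [1, 2, 3, 4]
for n in nums:
    nums.remove(n)

print nums   # prints [2, 4] - when 1 is removed, 2 slides into its slot and the iterator skips it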

I thought I had to use a copy of 'list' for the loops, but that doesn't work either:

for file_a in list[:]:
   temp_group = group()
   temp_group.add(file_a)
   list.remove(file_a)
   for file_b in list[:]:
      if (file_b.timestamp < (file_a.timestamp + offset)):
         temp_group.add(file_b)
         list.remove(file_b)

   groups.add(temp_group)

Well... I know I can do this without removing elements from the list, but then I need to mark the ones that were already 'processed', and check the mark each time - which is a speed penalty.
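
For reference, this is roughly what I mean by marking - a sketch only, reusing the 'files', 'group' and 'groups' names from above:

processed = set()
for i, file_a in enumerate(files):
    if i in processed:
        continue
    processed.add(i)
    temp_group = group()
    temp_group.add(file_a)
    for j, file_b in enumerate(files):
        if j in processed:
            continue   # this membership test on every element, every pass, is the penalty
        if file_b.timestamp < file_a.timestamp + offset:
            temp_group.add(file_b)
            processed.add(j)
    groups.add(temp_group)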

Can anyone give me some advice about how this can be done in the fastest/best way?

Thank you,

Alex

EDIT: I have found another solution that doesn't exactly answer the question, but it is what I actually needed (my mistake for asking the question that way). I am posting it here because it may help someone looking into related issues with loops over lists in Python.

It may not be the fastest (considering the number of 'passes' through the list), but it was quite simple to understand and implement, and it does not require the list to be sorted.

The reason I avoid sorting is that it takes some extra time, and because after I make the first set of groups, some of them will be 'locked', and the unlocked groups will be 'dissolved' and regrouped using a different time offset. (And when dissolving groups, the file order may change, so they would require sorting again.)

Anyway, the solution was to control the loop index myself. If I delete a file from the list, I skip incrementing the index (e.g., when I delete index 3, the former index 4 becomes index 3, and I don't want to increment the loop counter, because I would skip over it). If I don't delete any item in that iteration, the index is incremented normally. Here's the code (with some extras; ignore all the 'bucket' stuff):

def regroup(self, time_offset):
    #create list of files to be used for regrouping
    regroup_files_list = []

    if len(self.groups) == 0:
        #on the first 'regroup' we start with a copy of jpeg_list, so that
        #we do not modify it later on (requires 'import copy' at module level)
        regroup_files_list = copy.copy(self.jpeg_list)

    else:
        i = 0
        while True:
            try:
                group = self.groups[i]
            except IndexError:
                break

            if not group.is_locked:
                regroup_files_list.extend(group)
                del self.groups[i] #delete by index; the old position is reused next iteration
                continue
            else:
                i += 1

    bucket_group = FilesGroup()
    bucket_group.name = c_bucket_group_name

    while regroup_files_list: #we create groups until there are no files left
        file_a = regroup_files_list.pop(0) #take the first file and remove it from the list

        temp_group = FilesGroup()
        temp_group.start_time = file_a._iso_time
        temp_group.add(file_a)

        #manually manage the list index when iterating for file_b, because we're removing files
        i = 0

        while True:
            try:
                file_b = regroup_files_list[i]
            except IndexError:
                break

            #compute the absolute time difference between the two files
            timediff = file_a._iso_time - file_b._iso_time
            if timediff.days < 0: #a negative timedelta always has days < 0
                timediff = file_b._iso_time - file_a._iso_time

            if timediff < time_offset:
                temp_group.add(file_b)
                del regroup_files_list[i] #delete by index (remove() would search the list again)
                continue #reuse the old index - the elements after it just shifted left

            else:
                i += 1 #the index is increased normally

        #if the temp group is too small, move its files to the bucket group;
        #otherwise keep it as a regular group (an unconditional append here
        #would add large groups twice and leave emptied small groups behind)
        if c_bucket_group_enabled and len(temp_group) < c_bucket_group_min_count:
            for file in list(temp_group): #iterate over a copy, since we remove while looping
                bucket_group.add(file)
                temp_group.remove(file)
        else:
            self.groups.append(temp_group)

    if len(bucket_group) > 0:
        self.groups.append(bucket_group)
+3  A: 

A simple solution that works by sorting the list then using a generator to create groups:

def time_offsets(files, offset):

   files = sorted(files, key=lambda x: x.timestamp)

   group = []
   timestamp = None

   for f in files:
      if timestamp is not None and f.timestamp < timestamp + offset:
         group.append(f)
      else:
         if group:                  # don't yield an empty first group
            yield group
         timestamp = f.timestamp    # anchor the new group at this file
         group = [f]                # start the group with the file itself

   if group:
      yield group                   # don't forget the last group

# Now you can do this...
for group in time_offsets(files, 86400):
   print group

And here's a complete script you can run to test:

class File:
   def __init__(self, timestamp):
      self.timestamp = timestamp

   def __repr__(self):
      return "File: <%d>" % self.timestamp

def gen_files(num=100):
   import random
   files = []
   for i in range(num):
      timestamp = random.randint(0,1000000)
      files.append(File(timestamp))

   return files


def time_offsets(files, offset):

   files = sorted(files, key=lambda x: x.timestamp)

   group = []
   timestamp = None

   for f in files:
      if timestamp is not None and f.timestamp < timestamp + offset:
         group.append(f)
      else:
         if group:                  # don't yield an empty first group
            yield group
         timestamp = f.timestamp    # anchor the new group at this file
         group = [f]                # start the group with the file itself

   if group:
      yield group                   # don't forget the last group

# Now you can do this to group files by day (assuming timestamp in seconds)
files = gen_files()
for group in time_offsets(files, 86400):
   print group
Triptych
Thanks! I've tested this, and it works very close to what I need. I have to have a look at generators (I'm not sure what 'yield' does), and lambda function syntax is still new to me. Thank you for the complete example :)
Alex
+1  A: 

The best solution I can think of is O(n log n).

listA = getListOfFiles()
listB = stableMergesort(listA, lambda el: el.timestamp)
listC = groupAdjacentElementsByTimestampRange(listB, offset)

Note that groupAdjacentElementsByTimestampRange is O(n).
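
Here is one possible Python sketch of groupAdjacentElementsByTimestampRange - a single pass over the sorted list (the exact semantics, anchoring each group at its first file, are an assumption based on the question):

def groupAdjacentElementsByTimestampRange(sorted_files, offset):
    groups = []
    for f in sorted_files:
        if groups and f.timestamp < groups[-1][0].timestamp + offset:
            groups[-1].append(f)   # still within the current group's window
        else:
            groups.append([f])     # start a new group anchored at f
    return groups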

Justice
yup - pretty much what i did.
Triptych
Yep, that is exactly what you did!
Justice
+1  A: 

I'm not exactly sure what you are trying to do - it seems to me that the order of the list will affect the groupings - but your existing code can be modified to work like this:

#This is O(n^2)
while lst:
    file_a = lst.pop()
    temp_group = group()
    temp_group.add(file_a)
    while lst:
        file_b = lst[-1]
        if file_b.timestamp < (file_a.timestamp + offset):
            temp_group.add(lst.pop())
        else:
            break #stop once the last remaining file is outside the offset
    groups.add(temp_group)

Does the group have to start at file_a.timestamp?

# This is O(n)
from collections import defaultdict
groups = defaultdict(list)  # This is why you shouldn't use `list` as a variable name
for item in lst:
    groups[item.timestamp // offset].append(item)  # integer division chops time into fixed buckets

This is a much simpler way to chop the files up into groups with similar timestamps.
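
For example, with a hypothetical Item class standing in for the file objects (illustration only):

from collections import defaultdict

class Item(object):
    def __init__(self, timestamp):
        self.timestamp = timestamp
    def __repr__(self):
        return "Item(%d)" % self.timestamp

lst = [Item(t) for t in (3, 5, 11, 12, 27)]
offset = 10

groups = defaultdict(list)
for item in lst:
    groups[item.timestamp // offset].append(item)

for key in sorted(groups):
    print key, groups[key]   # 0 [Item(3), Item(5)] / 1 [Item(11), Item(12)] / 2 [Item(27)]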

gnibbler
Thank you for both code snippets! The second one seems very interesting, but I have python 2.6.2 (linux), and apparently I don't have 'defaultdict' available anymore. I will try to adapt the code and see how it goes, because right now I am not sure how it works (I'm new to Python). Thank you!
Alex
Sorry, brain f*rt on my part. defaultdict comes from collections. It most certainly is in 2.6.
gnibbler