Hello,
I am looking for advice on how to process a single list, using two nested loops, in the fastest way, avoiding len(list)^2 comparisons and avoiding duplicate files across groups.
More precisely: I have a list of 'file' objects, each of which has a timestamp. I want to group the files by their timestamp and a time offset. E.g. starting from a file X, I want to create a group with all the files whose timestamp < (timestamp(X) + offset).
For this, I did:
for file_a in list:
    temp_group = group()
    temp_group.add(file_a)
    list.remove(file_a)
    for file_b in list:
        if file_b.timestamp < (file_a.timestamp + offset):
            temp_group.add(file_b)
            list.remove(file_b)
    groups.add(temp_group)
(ok, the code is more complicated, but this is the main idea)
This obviously doesn't work, because I am modifying the list while iterating over it, so elements get skipped :)
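To illustrate the skipping, here is a minimal standalone sketch (plain integers instead of my file objects):

items = [1, 2, 3, 4]
for x in items:
    items.remove(x)     #the rest shifts left, and the iterator skips the element that moved into x's slot
print(items)            #leaves [2, 4], not [] as one might expect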
I thought I had to use a copy of 'list' for the loops, but that doesn't work either:
for file_a in list[:]:
    temp_group = group()
    temp_group.add(file_a)
    list.remove(file_a)
    for file_b in list[:]:
        if file_b.timestamp < (file_a.timestamp + offset):
            temp_group.add(file_b)
            list.remove(file_b)
    groups.add(temp_group)
Well... I know I can do this without removing elements from the list, but then I need to mark the ones that have already been 'processed', and check that mark on every pass, which is a speed penalty.
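For reference, that 'mark as processed' variant would look roughly like this (a sketch with a made-up name, group_by_offset; the repeated 'already seen?' checks are the overhead I mean):

def group_by_offset(files, offset):
    groups = []
    seen = set()                    #indexes that already landed in a group
    for i, file_a in enumerate(files):
        if i in seen:
            continue                #already grouped on an earlier pass
        seen.add(i)
        current = [file_a]
        for j in range(i + 1, len(files)):
            if j in seen:
                continue            #the extra check done on every iteration
            if files[j].timestamp < file_a.timestamp + offset:
                current.append(files[j])
                seen.add(j)
        groups.append(current)
    return groups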
Can anyone give me some advice about how this can be done in the fastest/best way?
Thank you,
Alex
EDIT: I have found another solution, which doesn't exactly answer the question, but it is what I actually needed (my mistake for asking the question that way). I am posting it here because it may help someone looking into related issues with loops over lists in Python.
It may not be the fastest (considering the number of 'passes' through the list), but it was quite simple to understand and implement, and it does not require the list to be sorted.
The reason I avoid sorting is that it takes some extra time, and because after I make the first set of groups, some of them will be 'locked' while the unlocked ones will be 'dissolved' and regrouped using a different time offset. (And when groups are dissolved, the file order may change, so the list would need sorting again.)
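For comparison only, a sort-then-sweep version would look roughly like this (a sketch with a made-up name, group_sorted; not what I use, for the reasons above):

def group_sorted(files, offset):
    ordered = sorted(files, key=lambda f: f.timestamp)   #O(n log n) up front
    groups = []
    for f in ordered:
        #compare against the first file of the current group (the anchor X)
        if groups and f.timestamp < groups[-1][0].timestamp + offset:
            groups[-1].append(f)    #still inside the anchor's window
        else:
            groups.append([f])      #start a new group anchored at f
    return groups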
Anyway, the solution was to control the loop index myself. If I delete a file from the list, I skip incrementing the index (e.g. when I delete index 3, the element that was at index 4 is now at index 3, and I don't want to advance the loop counter, because I would skip over it). If I don't delete any item during that iteration, the index is incremented normally. Here's the code (with some extras; ignore all that 'bucket' stuff):
def regroup(self, time_offset):
    #create the list of files to be used for regrouping
    regroup_files_list = []
    if len(self.groups) == 0:
        #on the first 'regroup' we start from a copy of jpeg_list, so that we do not change it further on
        regroup_files_list = copy.copy(self.jpeg_list)   #requires 'import copy' at module level
    else:
        #dissolve the unlocked groups and collect their files
        i = 0
        while True:
            try:
                group = self.groups[i]
            except IndexError:
                break
            if not group.is_locked:
                regroup_files_list.extend(group)
                self.groups.remove(group)
                continue    #the next group shifted into position i, so don't advance
            else:
                i += 1

    bucket_group = FilesGroup()
    bucket_group.name = c_bucket_group_name

    while len(regroup_files_list) > 0:   #we create groups until there are no files left
        file_a = regroup_files_list[0]
        regroup_files_list.remove(file_a)

        temp_group = FilesGroup()
        temp_group.start_time = file_a._iso_time
        temp_group.add(file_a)

        #manually manage the list index when iterating for file_b, because we're removing files
        i = 0
        while True:
            try:
                file_b = regroup_files_list[i]
            except IndexError:
                break

            #absolute time difference between the two files
            timediff = file_a._iso_time - file_b._iso_time
            if timediff.days < 0 or timediff.seconds < 0:
                timediff = file_b._iso_time - file_a._iso_time

            if timediff < time_offset:
                temp_group.add(file_b)
                regroup_files_list.remove(file_b)
                continue    # :D we reuse the old position, because all elements were shifted to the left
            else:
                i += 1      #the index is increased normally

        #move the files to the bucket group if the temp group is too small,
        #otherwise keep it as a regular group
        if c_bucket_group_enabled and len(temp_group) < c_bucket_group_min_count:
            for f in temp_group[:]:   #iterate over a copy, since we remove while looping
                bucket_group.add(f)
                temp_group.remove(f)
        else:
            self.groups.append(temp_group)

    if len(bucket_group) > 0:
        self.groups.append(bucket_group)
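Stripped of the file/group details, the core pattern is just this (a minimal sketch, written with a length check instead of the try/except above):

items = [1, 5, 2, 8, 3]
i = 0
while i < len(items):
    if items[i] % 2 == 0:   #condition under which the element is removed
        items.pop(i)        #the next element shifts into position i
    else:
        i += 1              #nothing was removed, advance normally
#items is now [1, 5, 3]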