views:

151

answers:

1

Not sure if this is the usual sort of question that gets asked around here, or if I'll get any answers to this one, but I'm looking for a pseudo-code approach to generating DB linking records from a folder structure containing image files.

I have a set of folders, structured as folllows:

+-make_1/
  | +--model_1/
  |    +-default_version/
  |    |   +--1999
  |    |   +--2000
  |    |   |   +--image_01.jpg
  |    |   |   +--image_02.jpg
  |    |   |   +--image_03.jpg
  |    |   |   ...
  |    |   +--2001
  |    |   +--2002
  |    |   +--2003
  |    |   ...
  |    |   +--2009
  |    +--version_1/
  |    |   +--1999
  |    |   ...
  |    |   +--2009
  |    +--version_2/
  |    |   +--1999
  |    |   +--2000
  |    |   +--2001
  |    |   |   +--image_04.jpg
  |    |   |   +--image_05.jpg
  |    |   |   +--image_06.jpg
  |    |   |   ...
  |    |   +--2002
  |    |   +--2003
  |    |   |   +--image_07.jpg
  |    |   |   +--image_08.jpg
  |    |   |   +--image_09.jpg
  |    |   ...
  |    |   +--2009
  ...  ... ...

In essence, it represents possible images for vehicles, by year starting in 1999.

Makes and models (e.g. Make: Alfa Romeo, Model: 145) come in various trims or versions. Each trim, or version may be found in a number of vehicles which will look the same but have say differences in fuel type or engine capacity.

To save duplication, the folder structure above makes use of a default folder... And images appear for the default version from 2000 onwards. I need to produce the links table for each version - based on whether the have their own overriding images, or whether make use of the default version...

So for example, version_1 has no image files, so I need to make links for to the default images, starting in 2000 and continuing until 2009.

Version 2 on the other hand starts out using the default images in 2000, but then uses two new sets first for 2001-2002, and then 2003-2009. The list of links required are therefore...

version    start     end   file_name
=======    =====   =====   =========
version_1   2000    2009   image_01.jpg
version_1   2000    2009   image_02.jpg
version_1   2000    2009   image_03.jpg
...
version_2   2000    2001   image_01.jpg
version_2   2000    2001   image_02.jpg
version_2   2000    2001   image_03.jpg
version_2   2001    2003   image_04.jpg
version_2   2001    2003   image_05.jpg
version_2   2001    2003   image_06.jpg
version_2   2003    2009   image_07.jpg
version_2   2003    2009   image_08.jpg
version_2   2003    2009   image_09.jpg
...

(Default is just that - a place holder, and no links are required for it.)

At the moment I'm running through the folders, building arrays, and then trimming the fat at the end. I was just wondering if there was a short cut, using some sort of text-processing approach? There are about 45,000 folders, most of which are empty :-)

+1  A: 

Here's some Python pseudocode, pretty close to executable (needs suitable imports and a def for a writerow function that will do the actual writing -- be it to an intermediate file, DB, CSV, whatever):

# first, collect all the data in a dict of dicts of lists
# first key is version, second key is year (only for non-empty years)

tree = dict()
for root, dirs, files in os.walk('make_1/model_1'):
    head, tail = os.path.split(root)
    if dirs:
       # here, tail is a version
       tree[tail] = dict
    elif files:
       # here, tail is a year
       tree[os.path.basename(head)][tail] = files

# now specialcase default_version
default_version = tree.pop('default_version')
# determine range of years; rule is quite asymmetrical:
#   for min, only years with files in them count
min_year = min(d for d in default_version if default_version[d])
#   for max, all years count, even if empty
max_year = max(default_version)

for version, years in tree.iteritems():
    current_files = default_version[min_year]
    years.append(max_year + 1)
    y = min_year
    while years:
        next_change = min(years)
        if y < next_change:
            for f in current_files:
                writerow(version, y, next_change-1, f)
        y = next_change
        current_files = years.pop(y)

One ambiguity in the spec and example is whether it's possible for the default_version to change the set of files in some years - here, I'm assuming that doesn't happen (only specific versions change that way, the default version always carries one set of files).

If this is not the case, what happens if the default version changes in years (say) 1999 and 2003, and version1 changes in 2001 and 2005 -- what files should version 1 use for 03 and 04, the new ones in the default version, or those it specified in 01?

In the most complicated version of the spec (where both default_version and a specific one can change, with the most recent change taking precedence, and if both specific and default change in the same year then the specific taking precedence) one needs to get all the "next change year" sequence, for each specific version, by careful "priority merging" of the sequences of change years for default and specific version, instead of just using years (the sequence of changes in the specific version) as I do here -- and each change year placed in the sequence must be associated with the appropriate set of files of course.

So if the exact spec can please be expressed, down to the corner cases, I can show how to do the needed merging by modifying this pseudocode -- I'd rather not do the work until the exact specs are clarified, because, if the specs are indeed simpler, the work would be unneeded!-)

Edit: as a new comment clarified, the exact specs is indeed the most complex one, so we have do do the merging appropriately. So the loop at the end of the simplistic answer above changes to:

for version, years_dict in tree.iteritems():
    # have years_dict override default_version when coincident
    merged = dict(default_version, **years_dict)
    current_files = merged.pop(min_year)
    merged[max_year + 1] = None
    y = min_year
    while merged:
        next_change = min(merged)
        for f in current_files:
            writerow(version, y, next_change-1, f)
        y = next_change
        current_files = merged.pop(y)

The key change is the merged = dict(... line: in Python, this means to make merged a new dict (a dict is a generic mapping, would be typically called a hashmap in other languages) which is the sum, or merge, of default_version and years_dict, but when a key is present in both of those, the value from years_dict takes precedence -- which meets the key condition for a year that's present (i.e., is a year with a change in files) in both.

After that it's plain sailing: anydict.pop(somekey) returns the value corresponding to the key (and also removes it from anydict); min(anydict) returns the minimum key in the dictionary. Note the "sentinel" idiom at merged[max_year + 1] = None: this says that the year "one after the max one" is always deemed to be a change-year (with a dummy placeholder value of None), so that the last set of rows is always written properly (with a maximum year of max_year + 1 - 1, that is, exactly max_year, as desired).

This algorithm is not maximally efficient, just simplest! We're doing min(merged) over and over, making it O(N squared) -- I think we can afford that because each merged should have a few dozen change-years at most, but a purist would wince. We can of course present an O(N logN) solution -- just sort the years once and for all and walk that sequence to get the successive values for next_change. Just for completeness...:

default_version[max_year + 1] = None

for version, years_dict in tree.iteritems():
    merged = dict(default_version, **years_dict)
    for next_change in sorted(merged):
        if next_change > min_year:
            for f in merged[y]:
                writerow(version, y, next_change-1, f)
        y = next_change

Here sorted gives a list with the keys of merged in sorted order, and I've switched to the for statement to walk that list from beginning to end (and an if statements to output nothing the first time through). The sentinel is now put in default_version (so it's outside the loop, for another slight optimization). It's funny to see that this optimized version (essentially because it works at a slightly higher level of abstraction) turns out to be smaller and simpler than the previous ones;-).

Alex Martelli
Good point, it's a poor spec! :-) Actually, default versions can have multiple filled folders.so when "the default version changes in years (say) 1999 and 2003, and version1 changes in 2001 and 2005... "... the default ones take precedence over the older version1 images. However! chances are the version1 folder will ALSO have new images at the same time, and in this case these should then take precendence. Hope that's a little clearer.(BTW, I keep promising myself I will learn python - I have several full of books on my 'to read' shelf - your solution may be just the incentive I need...)
Dycey
OK, let me edit the answer per newly clarified specs. (BTW I hope some of those Python books are mine?-)
Alex Martelli
Oh, that Martelli! Yes, the cookbook is up there ;-) Funnily enough, before I'd read your comment I was just about to thank you for such a concise and clear explanation - and suggest you should write a book!
Dycey
Heh, thanks, yes I should, but SO's been taking most of my free time since I joined two months ago!-)
Alex Martelli