I have a bunch of files in a single directory that I would like to organize in sub-directories.

This directory structure (which file would go in which directory) is specified in a file list that looks like this:

Directory: Music\

-> 01-some_song1.mp3

-> 02-some_song2.mp3

-> 03-some_song3.mp3

Directory: Images\

-> 01-some_image1.jpg

-> 02-some_image2.jpg

......................

I was thinking of extracting the data (directory names and file names) and storing it in a dictionary that would look like this:

dictionary = {'Music': ('01-some_song1.mp3', '02-some_song2.mp3',
                        '03-some_song3.mp3'),
              'Images': ('01-some_image1.jpg', '02-some_image2.jpg'),
              ......................................................
}

After that I would copy/move the files into their respective directories.

I already extracted the directory names and created the empty dirs.

For the dictionary values I tried to get a list of lists by doing the following:

def get_values(file):
    values = []
    tmp = []
    pattern = re.compile(r'^-> (.+?)$')
    for line in file:
        if line.strip().startswith('->'):
            match = re.search(pattern, line.strip())
            if match:
                tmp.append(match.group(1))
        elif line.strip().startswith('Directory'):
            values.append(tmp)
            del tmp[:]
    return values

This doesn't seem to work. Each list from the values list contains the same 4 file names over and over again.

What am I doing wrong?

I would also like to know what other ways there are of doing this whole thing. I'm sure there's a better/simpler/cleaner way.

+1  A: 

I think the cause is that you always reuse the same list.

del tmp[:] clears the list in place; it doesn't create a new instance, so every entry in values still refers to that one list. In your case, you need to create a new list with tmp = []
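The difference can be seen in isolation. This is only a minimal sketch with made-up file names; `shared` and `fresh` are illustrative names, not part of the original code. Clearing a list in place also empties every reference to it that was stored earlier, while rebinding the name leaves the stored list alone.

```python
# Clearing in place: both entries of `values` are the same list object.
values = []
tmp = ['a.mp3']
values.append(tmp)
del tmp[:]            # empties the one shared list
tmp.append('b.jpg')
values.append(tmp)
shared = values       # both entries show the latest contents

# Rebinding: each entry is its own list.
values = []
tmp = ['a.mp3']
values.append(tmp)
tmp = []              # new list object; the stored one is untouched
tmp.append('b.jpg')
values.append(tmp)
fresh = values        # entries stay independent
```

Here `shared[0] is shared[1]` is true, which is why every group ended up with the same file names.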

The following fix should work (I didn't test it):

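import re

def get_values(file):
    values = []
    tmp = None  # no 'Directory' line seen yet
    pattern = re.compile(r'^-> (.+?)$')
    for line in file:
        line = line.strip()
        if line.startswith('->'):
            match = pattern.search(line)
            if match and tmp is not None:
                tmp.append(match.group(1))
        elif line.startswith('Directory'):
            # Start a fresh list for each directory and store it right away,
            # so the last group is kept and no empty leading list is added.
            tmp = []
            values.append(tmp)
    return values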
luc
It works. Thanks.
+1  A: 

There's no need to use a regular expression:

d = {}
for line in open("file"):
    line=line.strip()
    if line.endswith("\\"):
        directory = line.split(":")[-1].strip().replace("\\","")
        d.setdefault(directory,[])
    if line.startswith("->"):
        song=line.split(" ")[-1]
        d[directory].append(song)
print d

Output:

# python python.py
{'Images': ['01-some_image1.jpg', '02-some_image2.jpg'], 'Music': ['01-some_song1.mp3', '02-some_song2.mp3', '03-some_song3.mp3']}
ghostdog74
I like your solution. It's simpler. I hadn't thought of doing it this way. The only problem is that in my file, the file names contain spaces, so I can't split on space. I'll just split on ">" instead and then use strip() to remove the remaining space. Thanks.
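The adjustment described in this comment might look roughly like the sketch below. `parse_listing` and the sample lines are hypothetical, and it takes an iterable of lines instead of opening a file so that it is easy to try out. It assumes file names never contain ">" before the actual name starts.

```python
def parse_listing(lines):
    """Group file names under the most recent 'Directory:' line."""
    d = {}
    directory = None
    for line in lines:
        line = line.strip()
        if line.startswith("Directory:"):
            # Text after the first colon, minus the trailing backslash.
            directory = line.partition(":")[2].strip().rstrip("\\")
            d.setdefault(directory, [])
        elif line.startswith("->") and directory is not None:
            # Split on the first ">" only, so spaces in the name survive.
            d[directory].append(line.split(">", 1)[1].strip())
    return d

sample = [
    "Directory: Music\\",
    "-> 01 some song.mp3",
    "Directory: Images\\",
    "-> 01 some image.jpg",
]
result = parse_listing(sample)
```

Passing `1` as the second argument to `split` limits it to a single split, so a ">" later in a file name would not cut the name short.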
A: 

If you use collections.defaultdict(list), you get a dictionary whose values are lists. If a key is not found, it is added with an empty list as its value, so you can start appending to the list immediately. That's what this line does:

d[dir].append(match.group(1))

It adds the directory name as a key if it does not exist yet and appends the file name that was found to that key's list.
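As a small stand-alone illustration of that behaviour (the keys and file names are just the ones from the question, and `as_plain_dict` is a name made up for this sketch):

```python
import collections

d = collections.defaultdict(list)

had_music_before = 'Music' in d           # membership tests do not create keys
d['Music'].append('01-some_song1.mp3')    # key is created on first access
d['Music'].append('02-some_song2.mp3')
d['Images'].append('01-some_image1.jpg')

as_plain_dict = dict(d)                   # ordinary dict with the same contents
```

Note that merely reading `d[key]` for a missing key also creates it, so use `key in d` or `d.get(key)` when you only want to look.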

BTW, if you are having problems getting your regexes to work, try compiling them with the debug flag. Its symbolic name is re.DEBUG (numeric value 128). So if you do this:

file_regex = re.compile(r'^-> (.+?)$', re.DEBUG)

You get this additional output:

at at_beginning
literal 45
literal 62
literal 32
subpattern 1
  min_repeat 1 65535
    any None
at at_end

You can see a start-of-line match, then the literal '-> ' (character codes 45, 62, 32), then a repeated any-character pattern, and finally an end-of-line match. Very useful for debugging.

Code:

from __future__ import with_statement

import re
import collections

def get_values(file):
    d = collections.defaultdict(list)
    dir = ""
    dir_regex = re.compile(r'^Directory: (.+?)\\$')
    file_regex = re.compile(r'\-\> (.+?)$')
    with open(file) as f:
        for line in f:
            line = line.strip()
            match = dir_regex.search(line)
            if match:
                dir = match.group(1)
            else:
                match = file_regex.search(line)
                if match:
                    d[dir].append(match.group(1))
    return d

if __name__ == '__main__':
    d = get_values('test_file')
    for k, v in d.items():
        print k, v

Result:

Images ['01-some_image1.jpg', '02-some_image2.jpg']
Music ['01-some_song1.mp3', '02-some_song2.mp3', '03-some_song3.mp3']
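With a dictionary like this in hand, the copy/move step mentioned in the question could be sketched as follows. This is only an assumption about how that step might be done: `organize` is a hypothetical helper, it assumes all files sit directly in `base_dir`, and it silently skips names that are not found on disk.

```python
import os
import shutil

def organize(mapping, base_dir):
    """Move each listed file from base_dir into base_dir/<directory>."""
    moved = []
    for directory, names in mapping.items():
        target = os.path.join(base_dir, directory)
        if not os.path.isdir(target):
            os.makedirs(target)
        for name in names:
            src = os.path.join(base_dir, name)
            if os.path.isfile(src):   # skip names missing on disk
                shutil.move(src, os.path.join(target, name))
                moved.append(name)
    return moved
```

It would be called as `organize(get_values('test_file'), '.')`. Swapping `shutil.move` for `shutil.copy` would copy the files instead of moving them.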
hughdbrown
Thanks for the detailed answer. While I find ghostdog's solution simpler, your answer was equally informative. Thank you.