Ant has a nice way to select groups of files, most handily using ** to indicate a directory tree. E.g.

**/CVS/*            # All files immediately under a CVS directory.
mydir/mysubdir/**   # All files recursively under mysubdir

More examples can be seen here:

http://ant.apache.org/manual/dirtasks.html

How would you implement this in python, so that you could do something like:

files = get_files("**/CVS/*")
for file in files:
    print file

=>
CVS/Repository
mydir/mysubdir/CVS/Entries
mydir/mysubdir/foo/bar/CVS/Entries
+2  A: 

os.walk is your friend. Look at the example in the Python manual (http://www.python.org/doc/2.6/library/os.html#os.walk) and try to build something from that.

To match "**/CVS/*" against a file name you get, you can do something like this:

import fnmatch

def match(pattern, filename):
    if pattern.startswith("**"):
        # Drop one leading "*": fnmatch's "*" matches "/" too (see below).
        return fnmatch.fnmatch(filename, pattern[1:])
    else:
        return fnmatch.fnmatch(filename, pattern)

In fnmatch.fnmatch, "*" matches anything (including slashes).
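
A minimal sketch of how this could be wired up with os.walk, reusing the match() helper above (get_files is the name from the question; note that this only approximates the Ant semantics, e.g. a leading **/ will not match files directly under the starting directory):

import os

def get_files(pattern, root="."):
    # Walk the whole tree and yield paths, relative to root and written
    # with "/" separators, that the match() helper above accepts.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            relpath = os.path.relpath(os.path.join(dirpath, name), root)
            if match(pattern, relpath.replace(os.sep, "/")):
                yield relpath

for f in get_files("**/CVS/*"):
    print f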

dkagedal
That's fine for enumerating a tree of files. The problem would be determining whether a particular file matches the given search pattern. E.g. Does '/mydir/CVS/Entries' match the given '**/CVS/*' pattern?
izb
Yes, but once you've started walking, matching files is simply a matter of translating the fileset expression into a suitable regular expression and filtering the results.
Aaron Maenpaa
A: 

Yup. Your best bet is, as has already been suggested, to work with 'os.walk'. Or perhaps write wrappers around the 'glob' and 'fnmatch' modules.
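
For instance, a thin wrapper along these lines (find is just an illustrative name, not an existing API) covers the simple case of matching file names anywhere in the tree:

import fnmatch
import os

def find(root, name_pattern):
    # Walk the tree and yield every file whose basename matches the
    # given fnmatch-style pattern.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, name_pattern):
            yield os.path.join(dirpath, name)

# e.g. every "Entries" file below the current directory:
for path in find(".", "Entries"):
    print path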

ayaz
A: 

os.walk is your best bet for this. I did the example below with .svn because I had that handy, and it worked great:

import os
import re

for (dirpath, dirnames, filenames) in os.walk("."):
    # Only report files in directories whose path ends in ".svn"
    if re.search(r'\.svn$', dirpath):
        for file in filenames:
            print file
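
To get output closer to what the question asks for (relative paths of the files inside CVS directories), a small variation of the same idea, assuming "/" as the separator produced by os.walk, could join dirpath and the file name:

import os
import re

for dirpath, dirnames, filenames in os.walk("."):
    # Look for directories named CVS instead of .svn
    if re.search(r'(^|/)CVS$', dirpath):
        for name in filenames:
            # Print the path relative to the starting directory
            print os.path.normpath(os.path.join(dirpath, name))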
Jack M.
+3  A: 

As soon as you come across a **, you're going to have to recurse through the whole directory structure, so I think at that point, the easiest method is to iterate through the directory with os.walk, construct a path, and then check if it matches the pattern. You can probably convert to a regex by something like:

import os
import re

def glob_to_regex(pat, dirsep=os.sep):
    # Escape the whole pattern, then turn the escaped glob wildcards back
    # into regex: "**" may cross directory boundaries, "*" and "?" may not.
    dirsep = re.escape(dirsep)
    regex = (re.escape(pat).replace("\\*\\*"+dirsep,".*")
                           .replace("\\*\\*",".*")
                           .replace("\\*","[^%s]*" % dirsep)
                           .replace("\\?","[^%s]" % dirsep))
    return re.compile(regex+"$")

(Note that this isn't fully featured - it doesn't support [a-z]-style glob patterns, for instance, though that could probably be added.) (The first '**/' replacement covers cases like '**/CVS' matching ./CVS; the bare '**' replacement handles a trailing '**' in the pattern.)
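
For example, with the pattern from the question and "/" as the separator:

rx = glob_to_regex("**/CVS/*", dirsep="/")
print bool(rx.match("CVS/Repository"))               # True
print bool(rx.match("mydir/mysubdir/CVS/Entries"))   # True
print bool(rx.match("mydir/mysubdir/CVS/foo/bar"))   # False: "*" stops at "/"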

However, obviously you don't want to recurse through everything below the current dir when not processing a ** pattern, so I think you'll need a two-phase approach. I haven't tried implementing the below, and there are probably a few corner cases, but I think it should work (a rough sketch follows the list):

  1. Split the pattern on your directory separator. i.e. '**/CVS/*'.split('/') -> ['**', 'CVS', '*']

  2. Recurse through the directories, and look at the relevant part of the pattern for this level. i.e. n levels deep -> look at pat[n].

  3. If pat[n] == '**' switch to the above strategy:

    • Reconstruct the pattern with dirsep.join(pat[n:])
    • Convert to a regex with glob_to_regex()
    • Recursively os.walk through the current directory, building up the path relative to the level you started at. If the path matches the regex, yield it.
  4. If pat[n] is not '**' and it is the last element in the pattern, then yield all files/dirs matching glob.glob(os.path.join(curpath, pat[n]))

  5. If pat[n] is not '**' and it is NOT the last element in the pattern, then for each directory, check if it matches (with glob) pat[n]. If so, recurse down through it, incrementing depth (so it will look at pat[n+1]).
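
A rough sketch of that two-phase approach, reusing glob_to_regex from above (get_files and _expand are just illustrative names, and corner cases such as trailing separators aren't handled):

import glob
import os

def get_files(pattern, root="."):
    return _expand(root, pattern.split("/"), 0)

def _expand(curpath, pat, n):
    if pat[n] == "**":
        # Phase two: walk everything below curpath and test the remaining
        # pattern, converted to a regex, against each relative path.
        rx = glob_to_regex("/".join(pat[n:]), dirsep="/")
        for dirpath, dirnames, filenames in os.walk(curpath):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), curpath)
                rel = rel.replace(os.sep, "/")
                if rx.match(rel):
                    yield os.path.join(curpath, *rel.split("/"))
    elif n == len(pat) - 1:
        # Last pattern element: a plain glob in the current directory.
        for path in glob.glob(os.path.join(curpath, pat[n])):
            yield path
    else:
        # Intermediate element: glob for matching directories and recurse,
        # moving on to the next pattern element.
        for path in glob.glob(os.path.join(curpath, pat[n])):
            if os.path.isdir(path):
                for found in _expand(path, pat, n + 1):
                    yield found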

Brian