




Ant has a nice way to select groups of files, most handily using ** to indicate a directory tree. E.g.

**/CVS/*            # All files immediately under a CVS directory.
mydir/mysubdir/**   # All files recursively under mysubdir

More examples can be seen here:


How would you implement this in python, so that you could do something like:

files = get_files("**/CVS/*")
for file in files:
    print file

+2  A: 

os.walk is your friend. Look at the example in the Python manual (http://www.python.org/doc/2.6/library/os.html#os.walk) and try to build something from that.

To match "**/CVS/*" against a file name you get, you can do something like this:

import fnmatch

def match(pattern, filename):
    if pattern.startswith("**"):
        # fnmatch's "*" already spans slashes, so "**" collapses to "*"
        return fnmatch.fnmatch(filename, pattern[1:])
    return fnmatch.fnmatch(filename, pattern)

In fnmatch.fnmatch, "*" matches anything (including slashes).
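A quick way to check that claim, and to see where it differs from Ant: because fnmatch knows nothing about path separators, the trailing "*" also accepts files deeper than "immediately under" a CVS directory. The paths below are made up for illustration.

```python
import fnmatch

# "*" in fnmatch can span several directory levels.
spans_dirs = fnmatch.fnmatch("a/b/c/CVS/Entries", "*/CVS/*")

# Unlike Ant's "**/CVS/*", the trailing "*" also crosses slashes,
# so a file two levels below CVS is accepted as well.
too_deep = fnmatch.fnmatch("mydir/CVS/sub/Entries", "*/CVS/*")

# A CVS directory at the top level has no leading "/" to match against.
top_level = fnmatch.fnmatch("CVS/Entries", "*/CVS/*")

print(spans_dirs, too_deep, top_level)
```

So match() works as a first filter, but it is looser than Ant's semantics in one direction and stricter in another.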

That's fine for enumerating a tree of files. The problem would be determining whether a particular file matches the given search pattern. E.g. Does '/mydir/CVS/Entries' match the given '**/CVS/*' pattern?
Yes, but once you've started walking, matching files is simply a matter of translating the fileset expression into a suitable regular expression and filtering the results.
Aaron Maenpaa

Yup. Your best bet is, as has already been suggested, to work with 'os.walk'. Or, write wrappers around 'glob' and 'fnmatch' modules, perhaps.


os.walk is your best bet for this. I did the example below with .svn because I had that handy, and it worked great:

import os
import re

# Print every file that sits directly inside a .svn directory
for (dirpath, dirnames, filenames) in os.walk("."):
    if re.search(r'\.svn$', dirpath):
        for file in filenames:
            print file
Jack M.
+3  A: 

As soon as you come across a **, you're going to have to recurse through the whole directory structure, so I think at that point, the easiest method is to iterate through the directory with os.walk, construct a path, and then check if it matches the pattern. You can probably convert to a regex by something like:

import os
import re

def glob_to_regex(pat, dirsep=os.sep):
    dirsep = re.escape(dirsep)
    # Order matters: handle "**/" and a bare "**" before the single "*"
    regex = (re.escape(pat).replace("\\*\\*"+dirsep,".*")
                           .replace("\\*\\*",".*")
                           .replace("\\*","[^%s]*" % dirsep)
                           .replace("\\?","[^%s]" % dirsep))
    return re.compile(regex+"$")

(Note that this isn't fully featured: it doesn't support [a-z] style glob patterns, for instance, though this could probably be added.) (The first **/ match is to cover cases like '**/CVS' matching ./CVS, as well as having just ** to match at the tail.)
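Putting the os.walk idea and the regex together, a minimal sketch of the get_files() from the question might look like the following. The helper name ant_to_regex and the root parameter are my own, not from any library. The translation uses the same chain of replaces, except that "**/" becomes an optional "anything followed by a slash" group, so a directory named fooCVS is not mistaken for CVS. Paths are matched relative to root with "/" as the separator, and only files are yielded; both are assumptions.

```python
import os
import re


def ant_to_regex(pat):
    """Compile an Ant-style pattern that uses "/" as separator (no [a-z] support)."""
    regex = (re.escape(pat)
             .replace("\\*\\*" + re.escape("/"), "(?:.*/)?")  # "**/": zero or more dirs
             .replace("\\*\\*", ".*")                          # bare "**": anything
             .replace("\\*", "[^/]*")                          # "*": within one name
             .replace("\\?", "[^/]"))                          # "?": one character
    return re.compile(regex + "$")


def get_files(pattern, root="."):
    """Yield files under root whose path relative to root matches pattern."""
    regex = ant_to_regex(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")  # normalise Windows separators
            if regex.match(rel):
                yield rel
```

This always walks the whole tree under root, which is simple but wasteful for patterns that contain no ** at all.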

However, obviously you don't want to recurse through everything below the current dir when not processing a ** pattern, so I think you'll need a two-phase approach. I haven't tried implementing the below, and there are probably a few corner cases, but I think it should work:

  1. Split the pattern on your directory separator, i.e. pat.split('/') -> ['**','CVS','*']

  2. Recurse through the directories, and look at the relevant part of the pattern for this level. ie. n levels deep -> look at pat[n].

  3. If pat[n] == '**' switch to the above strategy:

    • Reconstruct the pattern with dirsep.join(pat[n:])
    • Convert to a regex with glob_to_regex()
    • Recursively os.walk through the current directory, building up the path relative to the level you started at. If the path matches the regex, yield it.
  4. If pat[n] is not "**", and it is the last element in the pattern, then yield all files/dirs matching glob.glob(os.path.join(curpath,pat[n]))

  5. If pat[n] is not "**", and it is NOT the last element in the pattern, then for each directory, check if it matches (with glob) pat[n]. If so, recurse down through it, incrementing depth (so it will look at pat[n+1])
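The five steps above could be sketched roughly as follows. This is untested against real Ant behaviour and the names find_matches, _expand and tail_to_regex are hypothetical. It assumes "/" as the pattern separator and, following step 4, yields directories as well as files. One known corner case: glob's "*" skips names starting with a dot, while the regex branch does not.

```python
import glob
import os
import re


def tail_to_regex(pattern):
    """Regex for the part of the pattern from the first "**" onwards."""
    regex = (re.escape(pattern)
             .replace("\\*\\*" + re.escape("/"), "(?:.*/)?")
             .replace("\\*\\*", ".*")
             .replace("\\*", "[^/]*")
             .replace("\\?", "[^/]"))
    return re.compile(regex + "$")


def find_matches(pattern, root="."):
    """Yield paths under root matching pattern, walking only where needed."""
    return _expand(root, pattern.split("/"), 0)             # step 1


def _expand(curpath, parts, n):
    part = parts[n]                                         # step 2
    if part == "**":                                        # step 3
        regex = tail_to_regex("/".join(parts[n:]))
        for dirpath, dirnames, filenames in os.walk(curpath):
            for name in dirnames + filenames:
                full = os.path.join(dirpath, name)
                # Path relative to the level where "**" was met
                rel = os.path.relpath(full, curpath).replace(os.sep, "/")
                if regex.match(rel):
                    yield full
    elif n == len(parts) - 1:                               # step 4
        for path in glob.glob(os.path.join(curpath, part)):
            yield path
    else:                                                   # step 5
        for path in glob.glob(os.path.join(curpath, part)):
            if os.path.isdir(path):
                for found in _expand(path, parts, n + 1):
                    yield found
```

With this layout, a pattern such as mydir/*/z.txt never triggers os.walk at all, and mydir/mysubdir/** only walks below mysubdir.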
