views:

374

answers:

3

Assuming that the given directory tree is of reasonable size: say an open source project like Twisted or Python, what is the fastest way to traverse and iterate over the absolute path of all files/directories inside that directory?

I want to do this from within Python (subprocess is allowed). os.path.walk is slow. So I tried ls -lR and tree -fi. For a project with about 8337 files (including tmp, pyc, test, .svn files):

$ time tree -fi > /dev/null 

real    0m0.170s
user    0m0.044s
sys     0m0.123s

$ time ls -lR > /dev/null 

real    0m0.292s
user    0m0.138s
sys     0m0.152s

$ time find . > /dev/null 

real    0m0.074s
user    0m0.017s
sys     0m0.056s
$

tree appears to be faster than ls -lR (though ls -R is faster than tree, but it does not give full paths). find is the fastest.

Can anyone think of a faster and/or better approach? On Windows, I may simply ship a 32-bit binary tree.exe or ls.exe if necessary.

Update 1: Added find

Update 2: Why do I want to do this? ... I am trying to make a smart replacement for cd, pushd, etc.. and wrapper commands for other commands relying on passing paths (less, more, cat, vim, tail). The program will use file traversal occasionally to do that (eg: typing "cd sr grai pat lxml" would automatically translate to "cd src/pypm/grail/patches/lxml"). I won't be satisfied if this cd replacement took, say, half a second to run. See http://github.com/srid/pf

A: 

One solution you have not mentioned is 'os.walk'. I'm not sure it'd be any faster than os.path.walk, but it's objectively better.

You have not said what you're going to do with the list of directories when you have it, so it's hard to give more specific suggestions.

moshez
I timed it, os.walk is 15% slower on my system probably because it yields all files instead of just directories. But, as I note in my answer 2.3 seconds isn't all that large compared to 0.7s.
msw
what I am going to do with the filtered list of paths? I'd do a regex match. see my reply to Mike above. also http://github.com/srid/pf/blob/master/src/pf/__init__.py
Sridhar Ratnakumar
+1  A: 

It would be hard to get much better than find in performance, but the question is how much faster and why do you need it to be so fast? You claim that os.path.walk is slow, indeed, it is ~3 times slower on my machine over a tree of 16k directories. But then again, we're talking about the difference between 0.68 seconds and 1.9 seconds for Python.

If setting a speed record is your goal, you can't beat hard coded C which is completely 75% system call bound and you can't make the OS go faster. That said, 25% of the Python time is spent in system calls. What is it that you want to do with the traversed paths?

msw
"why do I need it to be so fast?" - I've updated the question.
Sridhar Ratnakumar
+3  A: 

Your approach in pf is going to be hopelessly slow, even if os.path.walk took no time at all. Doing a regex match containing 3 unbounded closures across all extant paths will kill you right there. Here is the code from Kernighan and Pike that I referenced, this is a proper algorithm for the task:

/* spname:  return correctly spelled filename */
/*
 * spname(oldname, newname)  char *oldname, *newname;
 *  returns -1 if no reasonable match to oldname,
 *           0 if exact match,
 *           1 if corrected.
 *  stores corrected name in newname.
 */

#include <sys/types.h>
#include <sys/dir.h>

spname(oldname, newname)
    char *oldname, *newname;
{
    char *p, guess[DIRSIZ+1], best[DIRSIZ+1];
    char *new = newname, *old = oldname;

    for (;;) {
        while (*old == '/') /* skip slashes */
            *new++ = *old++;
        *new = '\0';
        if (*old == '\0')   /* exact or corrected */
            return strcmp(oldname,newname) != 0;
        p = guess;  /* copy next component into guess */
        for ( ; *old != '/' && *old != '\0'; old++)
            if (p < guess+DIRSIZ)
                *p++ = *old;
        *p = '\0';
        if (mindist(newname, guess, best) >= 3)
            return -1;  /* hopeless */
        for (p = best; *new = *p++; ) /* add to end */
            new++;                    /* of newname */
    }
}

mindist(dir, guess, best)   /* search dir for guess */
    char *dir, *guess, *best;
{
    /* set best, return distance 0..3 */
    int d, nd, fd;
    struct {
        ino_t ino;
        char  name[DIRSIZ+1];   /* 1 more than in dir.h */
    } nbuf;

    nbuf.name[DIRSIZ] = '\0';   /* +1 for terminal '\0' */
    if (dir[0] == '\0')     /* current directory */
        dir = ".";
    d = 3;  /* minimum distance */
    if ((fd=open(dir, 0)) == -1)
        return d;
    while (read(fd,(char *) &nbuf,sizeof(struct direct)) > 0)
        if (nbuf.ino) {
            nd = spdist(nbuf.name, guess);
            if (nd <= d && nd != 3) {
                strcpy(best, nbuf.name);
                d = nd;
                if (d == 0)     /* exact match */
                    break;
            }
        }
    close(fd);
    return d;
}

/* spdist:  return distance between two names */
/*
 *  very rough spelling metric:
 *  0 if the strings are identical
 *  1 if two chars are transposed
 *  2 if one char wrong, added or deleted
 *  3 otherwise
 */

#define EQ(s,t) (strcmp(s,t) == 0)

spdist(s, t)
    char *s, *t;
{
    while (*s++ == *t)
        if (*t++ == '\0')
            return 0;       /* exact match */
    if (*--s) {
        if (*t) {
            if (s[1] && t[1] && *s == t[1] 
              && *t == s[1] && EQ(s+2, t+2))
                return 1;   /* transposition */
            if (EQ(s+1, t+1))
                return 2;   /* 1 char mismatch */
        }
        if (EQ(s+1, t))
            return 2;       /* extra character */
    }
    if (*t && EQ(s, t+1))
        return 2;           /* missing character */
    return 3;
}

Note: this code was written way before ANSI C, ISO C, or POSIX anything was even imagined when one read directory files raw. The approach of the code is far more useful than all the pointer slinging.

msw
Thanks a lot, I'll look into this this weekend and come back with questions, if any.
Sridhar Ratnakumar