ansaurus

Question

Answer 1

A:

Have you looked at MongoDB? What about mod_python? mod_python should allow you to do your os.walk() and just store the data in python data structures, since the script is persistent between connections.

robert 2010-08-21 10:53:47

is MongoDB easily bundle-able?, sorry I'm not sure why you're suggesting mod_python? isn't that limited to HTTP?

Daniel Hill 2010-08-21 11:17:11

Answer 2

+2 A:

You don't need to persist a tree structure -- in fact, your code is busily dismantling the natural tree structure of the directory tree into a linear sequence, so why would you want to restart from a tree next time?

Looks like what you need is just an ordered sequence:

i   X    result of os.path.join for X

where X, a string, names either a file or directory (you treat them just the same), i is a progressively incrementing integer number (to preserve the order), and the result column, also a string, is the result of os.path.join(name, *os.path.join(root, &c.

This is perfectly easy to put in a SQL table, of course!

To create the table the first time, just remove the guards if fnmatch.fnmatch (and the string argument) from your populate function, yield the dir or file before the os.path.join result, and use a cursor.executemany to save the enumerate of the call (or, use a self-incrementing column, your pick). To use the table, populate becomes essentially a:

select result from thetable where X LIKE '%foo%' order by i

where string is foo.

Alex Martelli 2010-08-21 15:02:51

Thanks for the answer, how would you say modify the a directory? so all the children are modified to?

Daniel Hill 2010-08-22 03:32:30

@Daniel, sorry, this new question out of the blue seems to have nothing to do with your original one -- and I have no idea what you mean (and there's hardly space to clarify in comments -- no code, etc). Why not close this one question that I answered and open another for your new, different issue? "One question per question" seems a wise and sane policy to me.

Alex Martelli 2010-08-22 05:14:50

@Alex sorry maybe I needed to clarify, Your provided a data structure and a way of adding data to the structure. but I see problems updating the data, when specifically when a Directory is renamed or moved you'll have to go through the entire structure and match(using LIKE) on column 3 for each file in that directory then update it compared with the first example in the link I provided which would only require a single node to be changed. I just wanted your take on this problem?

Daniel Hill 2010-08-22 08:37:12

@Daniel, if prefix `/a/b/c/` is renamed to `/a/b/d/`, `UPDATE thetable SET X='/a/b/d/'||SUBSTR(X,8)' WHERE X LIKE '/a/b/c/%'` is the general SQL solution. If you know there are no homonym dicts or subtrees, `SET X=REPLACE(X,'/a/b/c/','/a/b/d/')` or even just `SET X=REPLACE(X,'/c/','/d/')` might suffice, depending on what homonyms (if any) you may rule out.

Alex Martelli 2010-08-22 14:53:19

Answer 3

+1 A:

I misunderstood the question at first, but I think I have a solution now (and sufficiently different from my other answer to warrant a new one). Basically, you do the normal query the first time you run walk on a directory, but you store the yielded values. The second time around, you just yield those stored values. I've wrapped the os.walk() call because it's short, but you could just as easily wrap your generator as a whole.

cache = {}
def os_walk_cache( dir ):
   if dir in cache:
      for x in cache[ dir ]:
         yield x
   else:
      cache[ dir ]    = []
      for x in os.walk( dir ):
         cache[ dir ].append( x )
         yield x
   raise StopIteration()

I'm not sure of your memory requirements, but you may want to consider periodically cleaning out cache.

robert 2010-08-22 11:05:21

well the use-case would only have 1-10 very large directoriesnot sure if a directory tree would actually take much memory though, that's why I was considering a database engine because it'll take care of the optimizing of memory plus persistence when i need it

Daniel Hill 2010-08-23 14:58:13

ansaurus

tags:

views:

answers:

os.walk() caching/speeding up

related questions