In my app I need to watch a directory for new files. Traffic is very heavy: at a minimum, hundreds of new files will appear per second. Currently I'm using a busy loop along these lines:

import os
import time

while True:
  time.sleep(0.2)
  if len(os.listdir('.')) > 0:
    pass  # do stuff

After running profiling I'm seeing a lot of time spent in the sleep, and I'm wondering if I should change this to use polling instead.

I'm trying to use one of the available classes in select to poll my directory, but I'm not sure if it actually works, or if I'm just doing it wrong.

I get an fd for my directory with:

fd = os.open('.', os.O_DIRECT)

I've then tried several methods to see when the directory changes. As an example, one of the things I tried was:

poll = select.poll()
poll.register(fd, select.POLLIN)

poll.poll()  # returns [(fd, 1)], where 1 is POLLIN, meaning 'ready to read'

os.read(fd, 4096) # prints largely gibberish, but I can see that I'm at least pulling the files/folders contained in the directory

poll.poll()  # returns [(fd, 1)] again

os.read(fd, 4096) # empty string - no more data

Why is poll() acting like there is more information to read? I assumed that it would only do that if something had changed in the directory.

Is what I'm trying to do here even possible?

If not, is there a better alternative to while True: look for changes?

+3  A: 

Why not use a Python wrapper for one of the file-change monitoring libraries, such as gamin or inotify (search for pyinotify; I'm only allowed to post one hyperlink as a new user)? That's sure to be faster, and the low-level work is already done for you in C, using kernel interfaces.

David Fraser
I'm using BSD so inotify isn't usable and it looks like gamin isn't either.
gdm
The gamin docs say it's usable on FreeBSD, but that it uses a less optimal polling solution there. It may still be faster than anything else, though.
David Fraser
+1  A: 

"After running profiling I'm seeing a lot of time spent in the sleep, and I'm wondering if I should change this to use polling instead."

Looks like you are already polling synchronously, by checking the state at regular intervals. Don't worry about the time "spent" in sleep; it doesn't consume CPU time. The call just passes control to the operating system, which wakes the process up after the requested timeout.
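To illustrate the point about sleep, here is a minimal sketch that compares wall-clock time with CPU time across a single time.sleep call. The helper name measure_sleep is made up for this example. A profiler that reports wall-clock time will show the sleep as expensive even though the process is idle.

```python
import time

def measure_sleep(seconds):
    """Return (wall_elapsed, cpu_elapsed) for one time.sleep call.

    measure_sleep is a hypothetical helper, used here only for illustration.
    """
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    time.sleep(seconds)
    wall_elapsed = time.perf_counter() - wall_start
    cpu_elapsed = time.process_time() - cpu_start
    return wall_elapsed, cpu_elapsed

wall, cpu = measure_sleep(0.2)
print(f"wall clock: {wall:.3f}s, CPU: {cpu:.3f}s")
```

The wall-clock figure should come out at roughly the requested 0.2 seconds, while the CPU figure should stay close to zero.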

You could consider an asynchronous event loop, using a library that listens for the filesystem change notifications provided by the operating system, but first consider whether it gives you any real benefit in this particular situation.

Adam Byrtek
+4  A: 

FreeBSD, and thus Mac OS X, provides an analog of inotify called kqueue. Type man 2 kqueue on a FreeBSD machine for more information. For kqueue on FreeBSD there is PyKQueue, available at http://people.freebsd.org/~dwhite/PyKQueue/; unfortunately it is not actively maintained, so your mileage may vary.

Kurt
A: 

You might want to have a look at select.kqueue. I've not used it, but I believe kqueue is the right interface for this under BSD: you can monitor files and directories and be called back when, and only when, they change.
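A rough, untested-on-your-setup sketch of what that could look like with the standard library's select.kqueue follows. The function name wait_for_change and its timeout parameter are invented for this example. It assumes that a KQ_NOTE_WRITE vnode event on the directory's descriptor is what signals added or removed entries, and it only works on platforms where select.kqueue exists (FreeBSD, Mac OS X).

```python
import os
import select

def wait_for_change(path, timeout=None):
    """Block until the directory at `path` is modified or `timeout` seconds pass.

    Returns True if a change event arrived, False on timeout.
    wait_for_change is a hypothetical helper, not part of any library.
    """
    if not hasattr(select, "kqueue"):
        raise NotImplementedError("select.kqueue is not available on this platform")
    fd = os.open(path, os.O_RDONLY)
    kq = select.kqueue()
    try:
        # Watch the directory's vnode for writes; the assumption here is
        # that creating or deleting an entry counts as a write to it.
        change = select.kevent(
            fd,
            filter=select.KQ_FILTER_VNODE,
            flags=select.KQ_EV_ADD | select.KQ_EV_CLEAR,
            fflags=select.KQ_NOTE_WRITE,
        )
        # Register the event and wait for at most one notification.
        events = kq.control([change], 1, timeout)
        return len(events) > 0
    finally:
        kq.close()
        os.close(fd)
```

A caller could loop on wait_for_change('.') and run os.listdir only when it returns True. Because this sketch registers the event afresh on every call, files created between two calls can be missed; a long-running watcher would more likely keep one kqueue open and call control repeatedly on it.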

Nick Craig-Wood