I would like to develop a command line program that worked like so:

myprogram /c [some_executable_here]

It would launch the command specified by the user and "watch" the process (and any sub-processes) for read I/O. When the program exits, it would print a listing of the files that were "read" (that is, files that ultimately resulted in a read() system call).

My initial OS for implementation is Windows, but I'd like to do the same kind of thing on Linux as well.

All the FileSystem watch-like APIs I've seen so far are geared towards watching directories (or individual files) though, and not processes, so I'm not sure what the best way to go about this is.

EDIT: I'm looking for code examples of how to ultimately implement this (or at least pointers to APIs that I could follow) to do this on Windows and Linux.

Also, to be clear, it can't use a method like OpenedFilesView, procmon, or grepping strings from some system-level tool that can't definitively identify the process by ID (and any sub-processes) from the beginning to the end of its execution. In other words, there can't be any timing issues or any possibility of a false positive from searching for "foo.exe" and getting the wrong one.

+5  A: 

Try "Process Monitor" (procmon.exe). It lets you specify a filter (the name of the process to watch) and then lists all the files and the operations performed on them.

On Linux, try lsof for a snapshot of the current state and strace for continuous monitoring. You'll have to filter the output with grep.

Snapshot tools such as lsof check the process structure (i.e. the data structure which the OS uses to manage a process) and enumerate the handles/file descriptors recorded there; strace instead intercepts the system calls the process makes as they happen. Either way, this is not a function of the filesystem API but of the process management API.
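To illustrate the snapshot approach on Linux: the kernel exposes each process's descriptor table as symlinks under /proc/<pid>/fd. The following is only a minimal sketch under that assumption; the function name open_files is made up for this example, and it shows what a process holds open right now, not everything it has read.

```python
import os

def open_files(pid):
    """Return {fd: target} for the descriptors a process currently holds.

    Linux-only assumption: reads the symlinks in /proc/<pid>/fd. This is a
    snapshot, not a history of everything the process has ever read.
    """
    fd_dir = "/proc/%d/fd" % pid
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result

if __name__ == "__main__":
    # Example: list the descriptors of the current process.
    for fd, target in sorted(open_files(os.getpid()).items()):
        print(fd, target)
```

Because a file can be opened, read and closed between two snapshots, polling like this cannot satisfy the "no timing issues" requirement from the question; it only shows where the descriptor information lives.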

[EDIT] See the section "How does it work" on this page to get started writing your own tool on Windows.

Aaron Digulla
I know about procmon, but it only works for a specified time period. Also, this is something I want to figure out how to implement in my own code.
Garen
OpenedFilesView v1.45 comes with an explanation of how to do it on Windows (see my edits).
Aaron Digulla
The web page mentions that it uses the "NtQuerySystemInformation API" and has a kernel driver (NirSoftOpenedFilesDriver.sys), but it has no other info that I can see on how to do it programmatically. That's a good clue for Windows, but not enough to get me started as a developer who has never written Windows kernel drivers.
Garen
Ask the author and reuse the existing driver?
Aaron Digulla
Sent. Hopefully he replies. :)
Garen
+5  A: 

On Linux, I'd definitely use strace -- it's simple and powerful. E.g.:

$ strace -o/tmp/blah -f -eopen,read bash -c "cat ciao.txt"

This runs the requested command (including the subprocesses it spawns, due to -f) and leaves a log in /tmp/blah (120 lines in my case for this example) detailing all the open and read calls made by these processes, and their results.
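If you want to drive this from your own program rather than from the shell, one possible sketch in Python follows. The helper names strace_argv and run_traced are invented for this example; the strace options are the same -f, -o and -e used above, spelled in the longer -e trace=... form.

```python
import subprocess

def strace_argv(command, log_path, syscalls=("open", "read")):
    """Build the argv for running `command` under strace.

    -f follows child processes, -o writes the trace to log_path, and
    -e trace=... limits the trace to the listed syscalls.
    """
    return ["strace", "-f", "-o", log_path,
            "-e", "trace=" + ",".join(syscalls)] + list(command)

def run_traced(command, log_path):
    """Run the command under strace and return its exit status."""
    return subprocess.call(strace_argv(command, log_path))
```

Calling run_traced(["bash", "-c", "cat ciao.txt"], "/tmp/blah") would be the equivalent of the command line above. On newer systems the C library typically opens files with the openat syscall, so you may need to pass syscalls=("open", "openat", "read") or a similar list for your platform.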

You do need a little processing afterwards to extract just the set of files that were successfully read, as you require; for example, with Python, you could do:

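import re

# Matches strace -f -o output lines: "<pid> <syscall>(<args>) = <results>"
linere = re.compile(r'^(\d+)\s+(\w+)\(([^)]+)\)\s+=\s*(.*)$')

def main():
    openfiles = dict()   # fd -> filename
    filesread = set()
    with open('/tmp/blah') as f:
        for line in f:
            mo = linere.match(line)
            if mo is None:
                print("Unmatched line %r" % line)
                continue  # nothing to unpack for this line
            pid, command, args, results = mo.groups()
            if command == 'open':
                fn = args.split(',', 1)[0].strip('"')
                fd = results.split(' ', 1)[0]
                openfiles[fd] = fn
            elif command == 'read':
                # skip end-of-file (0) and failed (-1 ...) reads
                if results != '0' and not results.startswith('-'):
                    fd = args.split(',', 1)[0]
                    if fd in openfiles:  # e.g. stdin was never open()ed
                        filesread.add(openfiles[fd])
            else:
                print("Unknown command %r" % command)
    print(sorted(filesread))

if __name__ == '__main__':
    main()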

This is a bit oversimplified (you need to watch some other syscalls such as dup &c) but, I hope, shows the gist of the work needed. In my example, this emits:

['/lib/libc.so.6', '/lib/libdl.so.2', '/lib/libncurses.so.5',
 '/proc/meminfo', '/proc/sys/kernel/ngroups_max',
 '/usr/share/locale/locale.alias', 'ciao.txt']

so it also counts as "reads" those that are done to get dynamic libraries &c, not just "data files"... at syscall level, there's little difference. I imagine you could filter non-data files out, if that's what you need.
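As a rough sketch of the two refinements mentioned above (following dup-style calls, and filtering out non-data files), the parser could keep one descriptor table per pid. Everything named here (files_read, data_files_only, NON_DATA_PREFIXES) is hypothetical, and the prefix list is only a guess at what "non-data" means on one particular system. It still ignores descriptors inherited across fork/clone and paths containing commas.

```python
import re

# Same line shape as the script above: "<pid> <syscall>(<args>) = <result>".
LINE_RE = re.compile(r'^(\d+)\s+(\w+)\((.*)\)\s+=\s*(-?\d+)')

# Hypothetical prefixes for "non-data" files; adjust for the system at hand.
NON_DATA_PREFIXES = ('/lib/', '/usr/lib/', '/proc/', '/usr/share/locale/')

def files_read(lines):
    """Return the set of paths that were read, tracking descriptors per pid."""
    tables = {}   # pid -> {fd: path}
    read = set()
    for line in lines:
        mo = LINE_RE.match(line)
        if mo is None:
            continue  # signals, exits, unfinished/resumed calls, ...
        pid, call, args, result = mo.groups()
        result = int(result)
        if result < 0:
            continue  # the call failed
        fds = tables.setdefault(pid, {})
        parts = [a.strip() for a in args.split(',')]
        if call == 'open':
            fds[result] = parts[0].strip('"')
        elif call == 'openat':
            # First argument is the directory fd, second is the path.
            fds[result] = parts[1].strip('"')
        elif call in ('dup', 'dup2', 'dup3'):
            old = fds.get(int(parts[0]))
            if old is None:
                fds.pop(result, None)  # new fd now refers to an untracked file
            else:
                fds[result] = old
        elif call == 'close':
            fds.pop(int(parts[0]), None)
        elif call == 'read' and result > 0:
            path = fds.get(int(parts[0]))
            if path is not None:
                read.add(path)
    return read

def data_files_only(paths):
    """Drop paths under the (assumed) library and pseudo-file prefixes."""
    return sorted(p for p in paths if not p.startswith(NON_DATA_PREFIXES))
```

With a trace file produced as above, data_files_only(files_read(open('/tmp/blah'))) would give the filtered listing, provided dup, dup2, dup3 and close are added to the list of traced syscalls.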

I find strace so handy for such purposes that, were I tasked to do the same job on Windows, my first try would be to go for StraceNT -- not 100% compatible, and of course the underlying syscall names &c differ, but I think I could account for these differences in my Python code (preparing and executing the strace command, and post-processing the results).

Unfortunately, some other Unix systems, to my knowledge, only offer these kinds of facilities if you're root (super-user) -- e.g. on Mac OS X you need to go via sudo in order to run tracing utilities such as dtrace and dtruss. I don't know of a straightforward port of strace to the Mac, nor of other ways to perform such tasks without root privileges.

Alex Martelli