What is the best practice (interface and implementation) for a command line tool that processes selected files in a directory tree?

Here is an example that comes to mind, but I am looking for a 'best practice':

flipcase foo.txt foo2.txt

could process foo.txt and save the result as foo2.txt.

flipcase -rv *.txt

could process all text files in the current directory.
-r or --recursive will include all subdirectories.
-v will print some information to stdout while processing.

One problem I see with this example is that the *.txt argument is sometimes expanded by the shell (Unix and Vista), so I can't apply this pattern when walking subdirectories.
I guess the reason is that on Unix such tools are combined with a call to find, but this does not seem to be common on Windows. It also makes it hard to print a summary at the end.

Requirements:

  • MUST run on Unix, Windows XP, Windows 7 and Mac
  • SHOULD follow common conventions on these platforms. (Yes, I know. But I am looking for a reasonable compromise. For example, it's OK to use - instead of / on Windows.)
  • SHOULD NOT rely on a separate find command, like grep does.
  • MUST work for single files, file patterns and patterns in directory hierarchies.
  • SHOULD be built with standard Python libs, e.g. OptionParser and os.walk.
  • COULD handle multiple patterns, e.g. *.txt,*.html.

Other questions on design decisions:

  • What should this tool return (status code)?
  • Which ctrl-keys should this tool handle, and in what way?
  • Should stdin be supported instead of a single file? Configurable or auto-detect?
  • Should output redirection be supported? Configurable or auto-detect? How should debug output be handled in this case?
  • Should the pattern be glob syntax, or a regular expression?
  • Is there a common pattern syntax that supports recursion? Maybe recursive:*.txt. In this case the -r option would not be necessary.
  • What is the best practice for creating backups of modified files? An option -b, or rather backups by default plus a --no-backup option?
  • For single files it should be possible to specify a target file name. How?
  • What status info should be printed, and how should this be configured? Should it be verbose by default, with -q for quiet? Or always print a little and allow -v (or -vv) to boost this, or -q to shut up completely?

I don't really expect to get one single right answer, but maybe a handful of thoughts and pointers to good sample projects.

+2  A: 

In my experience, the best starting point is to build a tool that follows basic Unix principles -- namely, to read from standard input and write to standard output. This allows people to use your tool in a flexible way:

flipcase input.txt > output.txt
othercommand | flipcase > output.txt
flipcase | othercommand > output.txt
flipcase input1.txt  input2.txt > output.txt
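
A minimal Python sketch of that filter behavior, assuming the tool's real work is a simple case swap. The names flip_case and run_filter are hypothetical, chosen only for illustration:

```python
import sys


def flip_case(text):
    """Swap upper and lower case; a stand-in for the tool's real work."""
    return text.swapcase()


def run_filter(paths, out=sys.stdout):
    """Process each named file, or standard input when no files are given.

    All results go to `out` (standard output by default), so the shell
    can redirect or pipe them.
    """
    if not paths:
        for line in sys.stdin:
            out.write(flip_case(line))
        return
    for path in paths:
        with open(path) as handle:
            for line in handle:
                out.write(flip_case(line))
```

A script would call run_filter(sys.argv[1:]) once options have been stripped from the argument list.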

The next feature might be in-place editing:

# Modify input files directly.
flipcase -i input.txt

# Create backup copies before modifying originals.
flipcase -i --backup-suffix '_BAK' input.txt
flipcase -i --backup-prefix 'BAK_' input.txt

# Regex for power users.
flipcase -i --backup-regex 's/foo/bar/' input.txt
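
In Python, in-place editing with an optional backup could look roughly like this. It is a sketch only: edit_in_place and its parameters are hypothetical names, only the suffix variant is shown, and the whole file is read into memory, which assumes the files are small:

```python
import shutil


def edit_in_place(path, transform, backup_suffix=None):
    """Rewrite `path` with transform(text), optionally keeping a backup.

    The backup is made before the original is touched, so a failed
    run leaves a usable copy behind.
    """
    if backup_suffix:
        # copy2 also preserves timestamps and permission bits
        shutil.copy2(path, path + backup_suffix)
    with open(path) as handle:
        text = handle.read()
    with open(path, "w") as handle:
        handle.write(transform(text))
```

For example, edit_in_place("input.txt", str.swapcase, backup_suffix="_BAK") would leave the original content in input.txt_BAK.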

In verbose mode, the tool should not write to standard output, because that would conflict with the core principles above. It should write to standard error or a user-defined log file.

flipcase -v         input.txt > output.txt
flipcase -v log.txt input.txt > output.txt
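
One way to keep diagnostics off standard output in Python is the standard logging module. This is a sketch under assumptions: make_logger, the logger name "flipcase" and the log_path parameter are hypothetical:

```python
import logging
import sys


def make_logger(verbose, log_path=None):
    """Return a logger that never writes to standard output.

    Messages go to `log_path` if one is given, otherwise to standard error.
    """
    logger = logging.getLogger("flipcase")
    logger.handlers = []      # start clean if called more than once
    logger.propagate = False  # keep messages away from the root logger
    if log_path:
        handler = logging.FileHandler(log_path)
    else:
        handler = logging.StreamHandler(sys.stderr)
    logger.addHandler(handler)
    # Progress messages (INFO) appear only in verbose mode.
    logger.setLevel(logging.INFO if verbose else logging.WARNING)
    return logger
```

With this, log.info("processing %s", path) is visible only when -v was given, and the processed data on standard output stays clean either way.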

After that, you add recursive behavior. The direction is less clear-cut here, but I'll toss out a few ideas. In the typical recursive case, the program's arguments are probably directories, and the user would need to supply additional options to define various types of filtering behavior (that is, which types of files to process).

flipcase -r -i --backup-suffix '_BAK' --filter-glob '*.txt' dir1 dir2
flipcase -r -i --backup-suffix '_BAK' --filter-glob '*.txt' --filter-glob 'log*.dat' dir
flipcase -r -i --backup-suffix '_BAK' --filter-regex 'log\w+\.(txt|log)$' dir1 dir2

# Don't do in-place editing. Instead create new files within the structure.
flipcase -r --newname-suffix '_NEW'              --filter-glob '*.txt' dir1 dir2
flipcase -r --newname-regex 's/\.txt$/_new.txt/' --filter-glob '*.txt' dir1 dir2

# Create the backups or the new files in a parallel directory
# structure rather than within the original structure.
flipcase -r -i --backup-tree 'backup_dir'   --filter-glob '*.txt' dir1 dir2
flipcase -r -i --new-tree    'newfiles_dir' --filter-glob '*.txt' dir1 dir2
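
The glob-filtered recursion could be implemented with the standard os.walk and fnmatch modules. A rough sketch, where find_files is a hypothetical helper and the patterns are matched against file names only, not against full paths:

```python
import fnmatch
import os


def find_files(roots, patterns):
    """Yield paths under each root whose file name matches any glob pattern."""
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                    yield os.path.join(dirpath, name)
```

Because the patterns are handled by the tool and not by the shell, the user has to quote them on Unix, as in the examples above. Note that fnmatch.fnmatch follows the platform's case rules, so '*.txt' also matches FOO.TXT on Windows.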
FM
Thanks for that comprehensive input!
mar10
Are the options names you are using 'common', i.e. are there well known tools that use them?
mar10
@mar10 Only in some cases. The `-v` and `-r` options are commonly used for verbose and recursive. The `-i` option reflects my Perl background, where it is used for in-place file editing (Perl probably inherited the convention from `sed`). The longer options that I proposed are just rough ideas. You might want to look at the other recursive Unix tools for ideas regarding option naming: `find`, `rsync`, and perhaps others.
FM
+1  A: 

What is the best practice (interface and implementation) for a command line tool that processes selected files in a directory tree?

I don't think there's a single standard or "best practice" when it comes to the implementation of a command line tool. However, you'll gain a lot of insight by looking at and experimenting with well-built tools such as the GNU coreutils.

Also, I think you're looking for something like this as well: http://www.gnu.org/prep/standards/html_node/Command_002dLine-Interfaces.html

Reading about and experimenting with the Unix way of doing this addresses many of your concerns regarding design decisions.

One problem that I see with this example is, that the *.txt argument is sometimes expanded by the shell (Unix and Vista), so I can't apply this pattern when walking sub directories.

On Unix, the shell expands * automatically. I'm not sure about Windows, but if I'm not mistaken, * is not expanded there, so you can simply use glob.glob(sys.argv[1]). A workaround for Unix would be to escape the wildcard, but there must be a better way.
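
A sketch of how a tool might cope with both cases: arguments that still contain wildcard characters are expanded with glob.glob, and everything else is passed through unchanged. The helper name expand_args is hypothetical, and the check for wildcard characters is a simplifying assumption:

```python
import glob


def expand_args(args):
    """Expand wildcard arguments that the shell left untouched.

    Arguments without wildcard characters, and patterns that match
    nothing, are kept as they are so later code can report them.
    """
    expanded = []
    for arg in args:
        if any(char in arg for char in "*?["):
            matches = sorted(glob.glob(arg))
        else:
            matches = []
        expanded.extend(matches or [arg])
    return expanded
```

If the shell has already expanded the pattern, the arguments are plain file names and pass through unchanged, so the same code path serves both platforms.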

Coding District
Thanks for the pointer, GNU is a good reference. (Btw., Vista seems to expand *, but older versions of Windows do not, as far as I know.)
mar10
A: 

Recursive processing has traditionally been done with the callback-based os.path.walk, but a generator-based walk is much more command-line friendly: a pipe receives the output as it is processed. Here is a tested and documented proof of concept.

You don't have to write your own, though: os.walk is already a generator (it has been since Python 2.3), and Python 3 removed os.path.walk altogether.

After that, follow FM's advice to build the CLI using optparse.
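
A minimal optparse sketch for the options discussed in this thread. The function name parse_args and the help texts are hypothetical, and the choice of -v as a counter (so that -vv is more verbose than -v) is just one possible answer to the verbosity question:

```python
from optparse import OptionParser


def parse_args(argv):
    """Parse flipcase-style options; returns (options, positional_args)."""
    parser = OptionParser(usage="%prog [options] PATH...")
    parser.add_option("-r", "--recursive", action="store_true", default=False,
                      help="descend into subdirectories")
    parser.add_option("-v", "--verbose", action="count", default=0,
                      help="print progress to stderr; repeat for more detail")
    parser.add_option("-q", "--quiet", action="store_true", default=False,
                      help="suppress all status output")
    return parser.parse_args(argv)
```

A script would call parse_args(sys.argv[1:]) and then hand the positional arguments to whatever walks the directory tree.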

e-satis
+1  A: 

To address the globbing part of your question, the odd man out in your list is really Windows support. The UNIX way, and also a good way, to do it is to let the shell handle the globbing. You just get a list of files. I know of no UNIX tool that does its own globbing (in basic cases like this). I'd suggest you don't do it yourself either, but rely on the shell.

On Windows, you could refer people to using a shell with Cygwin, or something like that. Of course, Windows users usually eschew the command line, so if you build a GUI they'll be happy too.

That doesn't cover your -r switch. But it gets difficult there. Do you want to provide to users the ability to specify "all files in subdirectories that have the extension .txt"? Note that modern shells like ZSH can do globs that recurse into directories, like:

rm **/*.tmp

and, as you say, you can always use find instead. So a recommendation here really needs to factor in the specifics of your tool. rsync benefits from implementing its own -r switch, but a hypothetical flipcase probably wouldn't.

loevborg
I guess requiring Cygwin is too much for most Windows users. I like the 'rm **/*.tmp' syntax. But this seems hard to implement due to the shell globbing (given that I do not want to rely on a specific shell like ZSH).
mar10