I have a Perl script (or any executable) E which takes a file foo.xml and writes a file foo.txt. I use a Beowulf cluster to run E on a large number of XML files, and I'd like to write a simple job-server script in shell (bash) which doesn't overwrite existing .txt files.

I'm currently doing something like

#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep -o "$PATTERN"`;
isdone=`ls *.txt | grep -o "$PATTERN"`;

whatsleft=todo - isdone; # what's the unix magic?

#tack the .xml suffix back on with sed or something

#and then call the job server; 
jobserve E "$whatsleft";

and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq over something like a for loop with grep inside, but I'm not sure how to wire it up (pipes? temporary files?).

As a bonus question, is there a way to do lookahead search in bash grep?

To clarify/extend the problem:

I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.

So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
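The existence check described above can be sketched in bash like this (the directory layout follows the {branch}/{pattern} conventions above; everything else, including the variable names, is illustrative):

```shell
#!/bin/bash
shopt -s nullglob          # an empty glob expands to nothing, not to itself

whatsleft=()
for in_file in data/*/special/*.xml
do
  branch=${in_file#data/}; branch=${branch%%/*}     # the {branch} component
  pattern=$(basename "$in_file" .xml)               # the {pattern} component
  out_file="results/special/${branch}-${pattern}.dat"
  [ -f "$out_file" ] || whatsleft+=( "$in_file" )   # output missing: still to do
done
printf '%s\n' "${whatsleft[@]}"
```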

N.B. I don't need to check concurrency yet or lock any files.

So a simple, clear way to solve the above problem (in pseudocode) might be

for i in `/bin/ls *.xml`
do
   replace xml suffix with txt
   if [that file exists]
      add to whatsleft list
   end
done

but I'm looking for something more general.

A: 

i am not exactly sure what you want, but you can check for the existence of the file first and, if it exists, create a new name. (Or do this check inside E, your Perl script.)

if [ -f "$file" ]; then
  newname="...."
fi
...
jobserve E .... > "$newname"

if that's not what you want, describe more clearly in your question what you mean by "don't overwrite files".

ghostdog74
that's the behavior i want, but I don't want to count on the perl script/executable to prevent overwriting.
johndashen
+1  A: 

The question title suggests that you might be looking for:

 set -o noclobber

The question content indicates a wholly different problem!
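For reference, `noclobber` makes a plain `>` redirection fail rather than truncate an existing file, while `>|` overrides it explicitly. A minimal demonstration (the filename is illustrative):

```shell
set -o noclobber
echo first > demo.txt                     # succeeds: the file is new
if ! echo second > demo.txt 2>/dev/null
then echo "refused to overwrite demo.txt" # redirection fails, file untouched
fi
echo third >| demo.txt                    # >| bypasses noclobber on purpose
rm demo.txt
```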

It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:

 todo=""
 for file in *.xml
 do [ -f "${file%.xml}.txt" ] || todo="$todo $file"
 done
 jobserve E $todo

This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
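A sketch of that array variant (assuming the same `jobserve E` invocation as above; the `echo` is there for a dry run and would be dropped to actually submit). `"${todo[@]}"` expands to one word per stored filename, so spaces survive:

```shell
#!/bin/bash
todo=()
for file in *.xml
do [ -f "${file%.xml}.txt" ] || todo+=( "$file" )
done
# quoted [@] expansion keeps each filename as a single argument
echo jobserve E "${todo[@]}"
```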

If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing, that minimizes the chance of duplicated effort. Or consider separating the processed files from the unprocessed ones, so that the 'E' process moves each '.xml' file from a 'to-be-done' directory to a 'done' directory (and writes the '.txt' file to the 'done' directory too). Done carefully, this avoids most of the multi-processing problems. For example, you could link the '.xml' into the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
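The move-to-a-'done'-directory idea can be sketched like this (the directory names are illustrative, and the `E` function here is a stand-in stub for the real converter):

```shell
#!/bin/bash
E() { : > "${1%.xml}.txt"; }   # stub: the real E would do the conversion

mkdir -p to-be-done done
for f in to-be-done/*.xml
do
  claimed="done/$(basename "$f")"
  # mv within one filesystem is atomic: only one worker wins a race for $f
  if mv "$f" "$claimed" 2>/dev/null
  then
    E "$claimed"               # writes the .txt next to the claimed .xml
  fi
done
```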

Jonathan Leffler
this will work for me, as the script E won't be accessing any overlapping files between calls. i have a few followup questions since I'm fairly new to bash scripting: (1) can i use a glob with multiple asterisks in the for-in clause, as in \*/special/\*.xml? (2) does the % syntax remove all instances of .xml?
johndashen
(1) Yes; (2) No. The single % removes the last '.xml' only (so x.xml.xml.xml --> x.xml.xml).
Jonathan Leffler
+1  A: 
whatsleft=$( ls *.xml *.txt | grep -o "$PATTERN" | sort | uniq -u )

Note this actually gets a symmetric difference.
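To see why: any stem present as both `.xml` and `.txt` appears twice in the sorted stream, and `uniq -u` prints only lines that occur exactly once, so paired names vanish and unpaired ones survive. A minimal demonstration with illustrative data:

```shell
# bar appears twice, foo and baz once; uniq -u keeps only the singletons
printf '%s\n' foo bar baz bar | sort | uniq -u
# → baz
# → foo
```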

slacker
this would work for me in the example, but i simplified it slightly: i'd like to make this work for different patterns as well, such as from *.xml -> *-reordered.xml, and across directories as well. in this case i used ls with --ignore: can you modify your command to accommodate that?
johndashen
@johndashen: I don't see why it would not work, or maybe I simply don't understand what you mean :). Could you explain more clearly, preferably with an example?
slacker
if i replace *.txt in your example with *-reordered.xml, i will always get a copy of *-reordered.xml twice ... but uniq takes care of that, so it's not actually a problem. huh. =)
johndashen
A: 

for posterity's sake, this is what i found to work:

TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep -o "$PATTERN" > "$TMPA";
ls *.txt | grep -o "$PATTERN" > "$TMPB";
whatsleft=`sort "$TMPA" "$TMPB" | uniq -u | sed 's/$/.xml/' | xargs`;
rm "$TMPA" "$TMPB";
johndashen
It would be more cool if $TMPA and $TMPB were actually named pipes.
slacker
See the answer I gave, which doesn't require temporary files, and only uses a single external command (`comm`) rather than three (`sort`, `uniq` and `sed`).
Charles Duffy
+1  A: 
#!/bin/bash

shopt -s extglob # allow extended glob syntax, for matching the filenames

LC_COLLATE=C     # use a sort order comm is happy with

IFS=$'\n'        # so filenames can have spaces but not newlines
                 # (newlines don't work so well with comm anyhow;
                 # shame it doesn't have an option for null-separated
                 # input lines).

files_todo=( *([A-Z])0[1-2][a-j]*.xml )
files_done=( *([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
  $(comm -23 --nocheck-order \
    <(printf "%s\n" "${files_todo[@]%.xml}") \
    <(printf "%s\n" "${files_done[@]%.txt}") ))

echo jobserve E $(for f in "${files_remaining[@]}"; do printf "%s\n" "${f}.xml"; done)

This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.

Note the use of extended globs rather than parsing ls, which is considered very poor practice.

To transform input to output names without using anything other than shell builtins, consider the following:

if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]] ; then
  out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
  : # ...handle here the fact that you have a noncompliant name...
fi
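For example, with a hypothetical input name following that layout:

```shell
in_name=data/trunk/special/A01a.xml       # illustrative branch/pattern pair
if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]]; then
  # BASH_REMATCH[1] is the branch, BASH_REMATCH[2] the pattern
  echo "results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat"
fi
# → results/special/trunk-A01a.dat
```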
Charles Duffy
that looks great. i didn't know about either IFS or comm. Can you explain what the shopt and LC_COLLATE lines do?
johndashen
The `shopt` line sets the `extglob` flag, which lets us match the files using extended glob syntax (effectively, what I'm doing to match only the relevant files without a regex). `LC_COLLATE=C` is setting the default sort order (in this case, for the globbed files) to something that `comm` will be happy with.
Charles Duffy
Good point about `ls`. Though I think that replacing it with `find` would be much simpler and more readable here.
slacker
could you extend this to multiple pattern matching within files, say from data/{branch}/special/{pattern}.xml -> results/archive/{branch}-{pattern}.dat, if you just change the internal printf statements? you don't have to show the whole example code again for that.
johndashen
@johndashen - sorry, I don't quite understand what you're asking for here. Do you want to pick the branch name out of the files (for use in other names), or select files with only specific branch names, or something else?
Charles Duffy
i want to pick out from each input file the branch, pattern pair and check if the corresponding output file exists
johndashen
@johndashen - I've updated this answer to show you how to do that. However, I seriously believe you're abusing the system by squashing so many separate questions into one "question", resulting in answers which are less targeted and less clear for people who have questions about only one of the things you're asking. Part of the purpose of StackOverflow is to create a knowledge base; a useful knowledge base has general-purpose questions and general-purpose answers, not questions and answers so detailed to one person's use case as to be hard to find and reuse.
Charles Duffy
thanks for helping me with this issue. i realized afterwards that my question was pretty ill-suited for this forum, and i'll try to make my questions more atomic in the future.
johndashen