ansaurus

Question

shell scripting: search/replace & check file exist

Answer 1

A:

I am not exactly sure what you want, but you can check whether the file exists first and, if it does, create a new name. (Or do this check inside your E (Perl) script.)

if [ -f "$file" ];then
  newname="...."
fi
...
jobserve E .... > $newname

If that's not what you want, describe more clearly in your question what you mean by "don't overwrite files".
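A minimal sketch of the "create a new name" step, assuming a numbered suffix is acceptable. `unique_name` is a hypothetical helper, not part of the original script, and the check-then-create sequence is still racy if several processes pick names at once.

```shell
# unique_name (hypothetical helper): print the first candidate name
# that does not exist yet, trying base.ext, base.1.ext, base.2.ext, ...
unique_name() {
  local base=$1 ext=$2 n=1
  local candidate="$base$ext"
  while [ -e "$candidate" ]; do
    candidate="$base.$n$ext"
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}

# Demo in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
touch "$workdir/report.txt"
newname=$(unique_name "$workdir/report" .txt)
echo "would write to: $newname"
```

The result would then be used as the redirection target in place of the fixed name.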

ghostdog74 2010-04-16 23:35:20

that's the behavior i want, but I don't want to count on the perl script/executable to prevent overwriting.

johndashen 2010-04-16 23:40:46

Answer 2

+1 A:

The question title suggests that you might be looking for:

 set -o noclobber

The question content indicates a wholly different problem!
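For reference, a small sketch of what `noclobber` does, run in a scratch directory. The file name `result.txt` is only an illustration.

```shell
workdir=$(mktemp -d)
out="$workdir/result.txt"

set -o noclobber
echo "first run" > "$out"        # file does not exist yet, so this succeeds

# A second plain redirection is refused because the file already exists.
# The subshell keeps the shell's own error message out of the way.
if ! ( echo "second run" > "$out" ) 2>/dev/null; then
  echo "refused to overwrite $out"
fi
kept=$(cat "$out")               # still the content from the first write

echo "forced" >| "$out"          # >| explicitly overrides noclobber
set +o noclobber
```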

It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:

 todo=""
 for file in *.xml
 do [ -f "${file%.xml}.txt" ] || todo="$todo $file"
 done
 jobserve E $todo

This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
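A sketch of the Bash array variant mentioned above. It is Bash-specific, the sample file names are made up, and `echo` stands in for the real `jobserve` call so the example is a dry run.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch "a b.xml" "a b.txt" "c.xml" "d e.xml"

shopt -s nullglob            # an unmatched *.xml expands to nothing
todo=()
for file in *.xml; do
  # keep only the .xml files that have no matching .txt yet
  [ -f "${file%.xml}.txt" ] || todo+=("$file")
done

# Quoting "${todo[@]}" passes each name as one argument, spaces included.
if [ "${#todo[@]}" -gt 0 ]; then
  echo jobserve E "${todo[@]}"
fi
```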

If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing it, that minimizes the chance of duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' to the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
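One way to sketch the "create the '.txt' file as processing starts" idea in the shell is to claim the output file with `noclobber`, so that only the first claimant succeeds. `claim` is a hypothetical helper. As far as I know Bash creates the file exclusively in this mode, but exclusive creation has historically been less dependable on some network filesystems, so treat this as an assumption to verify on your cluster.

```shell
# claim (hypothetical helper): try to create the .txt marker for an .xml
# file. With noclobber set, the redirection fails if the file already
# exists, so a second caller gets a non-zero status.
claim() {
  local out="${1%.xml}.txt"
  ( set -o noclobber; : > "$out" ) 2>/dev/null
}

workdir=$(mktemp -d)
touch "$workdir/job1.xml"

if claim "$workdir/job1.xml"; then
  echo "claimed job1.xml, would now run: jobserve E job1.xml"
fi
if ! claim "$workdir/job1.xml"; then
  echo "job1.xml already claimed, skipping"
fi
```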

Jonathan Leffler 2010-04-16 23:41:06

this will work for me, as the script E won't be accessing any overlapping files between calls. i have a few followup questions since I'm fairly new to bash scripting: (1) can i use a glob with multiple asterisks in the for-in clause, as in */special/*.xml? (2) does the % syntax remove all instances of .xml?

johndashen 2010-04-16 23:48:32

(1) Yes; (2) No. The single % removes the last '.xml' only (so x.xml.xml.xml --> x.xml.xml).
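A quick sketch of the difference, using the same sample name. Note that `%` and `%%` only behave differently when the pattern contains a wildcard; the loop at the end is just one possible way to strip every trailing '.xml'.

```shell
name="x.xml.xml.xml"

one="${name%.xml}"          # x.xml.xml  (only the last .xml is removed)
shortest="${name%.xml*}"    # x.xml.xml  (shortest suffix matching .xml*)
longest="${name%%.xml*}"    # x          (longest suffix matching .xml*)

# Remove every trailing .xml by repeating the single-% expansion
# until it no longer changes the value.
stripped=$name
while [ "${stripped%.xml}" != "$stripped" ]; do
  stripped=${stripped%.xml}
done

printf '%s\n' "$one" "$shortest" "$longest" "$stripped"
```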

Jonathan Leffler 2010-04-16 23:52:05

Answer 3

+1 A:

whatsleft=$( ls *.xml *.txt | grep -o "$PATTERN" | sort | uniq -u )

Note this actually gets a symmetric difference.
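To make the "symmetric difference" point concrete, here is a small sketch comparing `uniq -u` with `comm -23`. The file names are invented, and the extension is stripped with parameter expansion rather than the `$PATTERN` grep.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.xml b.xml a.txt orphan.txt

# Base names without extension, one per line, sorted in the C locale.
for f in *.xml; do printf '%s\n' "${f%.xml}"; done | LC_ALL=C sort > xml.list
for f in *.txt; do printf '%s\n' "${f%.txt}"; done | LC_ALL=C sort > txt.list

# uniq -u keeps names that occur in exactly one list, whichever list it is.
symmetric=$(LC_ALL=C sort xml.list txt.list | uniq -u)

# comm -23 keeps only the names unique to the first list (.xml without .txt).
todo=$(LC_ALL=C comm -23 xml.list txt.list)

printf 'symmetric difference:\n%s\n' "$symmetric"
printf 'xml without txt:\n%s\n' "$todo"
```

Here `orphan` shows up in the symmetric difference even though there is no `orphan.xml` to process, which only matters if stray '.txt' files can exist.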

slacker 2010-04-16 23:53:00

this would work for me in the example, but i simplified it slightly: i'd like to make this work for different patterns as well, such as from *.xml -> *-reordered.xml, and across directories as well. in this case i used ls with --ignore: can you modify your command to accommodate that?

johndashen 2010-04-17 00:01:06

@johndashen: I don't see why it would not work, or maybe I simply don't understand what you mean :). Could you explain more clearly, preferably with an example?

slacker 2010-04-17 00:12:03

if i replace *.txt in your example with *-reordered.xml, i will always get a copy of *-reordered.xml twice ... but uniq takes care of that, so it's not actually a problem. huh. =)

johndashen 2010-04-17 00:20:58

Answer 4

A:

for posterity's sake, this is what i found to work:

TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep -o "$PATTERN" > "$TMPA"
ls *.txt | grep -o "$PATTERN" > "$TMPB"
# names present in only one list, with .xml appended to each
whatsleft=$(sort "$TMPA" "$TMPB" | uniq -u | sed 's/$/.xml/')
rm "$TMPA" "$TMPB"

johndashen 2010-04-16 23:56:46

It would be more cool if $TMPA and $TMPB were actually named pipes.

slacker 2010-04-17 00:07:32

See the answer I gave, which doesn't require temporary files, and only uses a single external command (`comm`) rather than three (`sort`, `uniq` and `sed`).

Charles Duffy 2010-04-17 00:12:09

Answer 5

+1 A:

#!/bin/bash

shopt -s extglob # allow extended glob syntax, for matching the filenames

LC_COLLATE=C     # use a sort order comm is happy with

IFS=$'\n'        # so filenames can have spaces but not newlines
                 # (newlines don't work so well with comm anyhow;
                 # shame it doesn't have an option for null-separated
                 # input lines).

files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
  $(comm -23 --nocheck-order \
    <(printf "%s\n" "${files_todo[@]%.xml}") \
    <(printf "%s\n" "${files_done[@]%.txt}") ))

echo jobserve E $(for f in "${files_remaining[@]%.xml}"; do printf "%s\n" "${f}.txt"; done)

This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.

Note the use of extended globs rather than parsing ls, which is considered very poor practice.

To transform input to output names without using anything other than shell builtins, consider the following:

if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml$ ]] ; then
  out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
  : # ...handle here the fact that you have a noncompliant name...
fi
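A sketch of how that mapping could drive the existence check over a whole tree. It assumes the data/{branch}/special/{pattern}.xml to results/archive/{branch}-{pattern}.dat layout from the comments; `out_name_for` and the sample branch names are hypothetical.

```shell
#!/bin/bash
# out_name_for (hypothetical helper): map an input path to its output
# path, or return non-zero if the input does not follow the layout.
out_name_for() {
  local re='^data/([^/]+)/special/([^/]+)\.xml$'
  if [[ $1 =~ $re ]]; then
    printf 'results/archive/%s-%s.dat\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

# Scratch tree with one finished and one unfinished input.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p data/trunk/special data/dev/special results/archive
touch data/trunk/special/p1.xml data/dev/special/p2.xml results/archive/trunk-p1.dat

pending=()
for in_name in data/*/special/*.xml; do
  out_name=$(out_name_for "$in_name") || continue   # skip noncompliant names
  [ -e "$out_name" ] || pending+=("$in_name")
done

printf 'still to do: %s\n' "${pending[@]}"
```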

Charles Duffy 2010-04-17 00:08:23

that looks great. i didn't know about either IFS or comm. Can you explain what the shopt and LC_COLLATE lines do?

johndashen 2010-04-17 00:28:49

The `shopt` line sets the `extglob` flag, which lets us match the files using extended glob syntax (effectively, what I'm doing to match only the relevant files without a regex). `LC_COLLATE=C` is setting the default sort order (in this case, for the globbed files) to something that `comm` will be happy with.
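A small sketch of the extended-glob syntax in isolation. The pattern is a simplified illustration rather than the exact one from the answer, `matches` is a hypothetical helper, and `LC_ALL=C` is set here so that `[A-Z]` cannot match lowercase letters in locales with a mixed-case collation order.

```shell
#!/bin/bash
shopt -s extglob     # enables *(...), +(...), ?(...), @(...), !(...)
LC_ALL=C             # byte-order collation for ranges such as [A-Z]

# *([A-Z]) means "zero or more uppercase letters".
matches() {
  [[ $1 == *([A-Z])0[1-2][a-j]*.xml ]]
}

for name in AB01c-data.xml 01a.xml ab01a.xml A03a.xml; do
  if matches "$name"; then
    echo "$name: match"
  else
    echo "$name: no match"
  fi
done
```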

Charles Duffy 2010-04-17 00:40:15

Good point about `ls`. Though I think that replacing it with `find` would be much simpler and more readable here.
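For comparison, a sketch of what a `find`-based variant might look like. It assumes a `find` that supports `-print0` (GNU and BSD versions do, but it is not required by POSIX) and uses Bash's `read -d ''`; the sample files are invented.

```shell
#!/bin/bash
workdir=$(mktemp -d)
touch "$workdir/a.xml" "$workdir/a.txt" "$workdir/b c.xml"

todo=()
# NUL-separated names survive spaces and newlines in file names.
while IFS= read -r -d '' file; do
  [ -f "${file%.xml}.txt" ] || todo+=("$file")
done < <(find "$workdir" -type f -name '*.xml' -print0)

printf 'no .txt yet: %s\n' "${todo[@]}"
```

Unlike a plain glob, `find` also descends into subdirectories, which may or may not be what is wanted.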

slacker 2010-04-17 00:44:05

could you extend this to multiple pattern matching within files, say from data/{branch}/special/{pattern}.xml -> results/archive/{branch}-{pattern}.dat, if you just change the internal printf statements? you don't have to show the whole example code again for that.

johndashen 2010-04-17 00:48:38

@johndashen - sorry, I don't quite understand what you're asking for here. Do you want to pick the branch name out of the files (for use in other names), or select files with only specific branch names, or something else?

Charles Duffy 2010-04-17 01:13:02

i want to pick out from each input file the branch, pattern pair and check if the corresponding output file exists

johndashen 2010-04-18 10:04:25

@johndashen - I've updated this answer to show you how to do that. However, I seriously believe you're abusing the system by squashing so many separate questions into one "question", resulting in answers which are less targeted and less clear for people who have questions about only one of the things you're asking. Part of the purpose of StackOverflow is to create a knowledge base; a useful knowledge base has general-purpose questions and general-purpose answers, not questions and answers so detailed to one person's use case as to be hard to find and reuse.

Charles Duffy 2010-04-18 17:31:18

thanks for helping me with this issue. i realized afterwards that my question was pretty ill-suited for this forum, and i'll try to make my questions more atomic in the future.

johndashen 2010-04-27 03:40:00
