views: 312
answers: 4

Hi Guys,

I'm having some rather unusual problems using grep in a bash script. Below is an example of the bash script code that I'm using that exhibits the behaviour:

UNIQ_SCAN_INIT_POINT=1
cat "$FILE_BASENAME_LIST" | uniq -d >> $UNIQ_LIST
sed '/^$/d' $UNIQ_LIST >> $UNIQ_LIST_FINAL
UNIQ_LINE_COUNT=`wc -l $UNIQ_LIST_FINAL | cut -d \  -f 1`
while [ -n "`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`" ]; do
    CURRENT_LINE=`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`
    CURRENT_DUPECHK_FILE=$FILE_DUPEMATCH-$CURRENT_LINE 
    grep $CURRENT_LINE $FILE_LOCTN_LIST >> $CURRENT_DUPECHK_FILE
    MATCH=`grep -c $CURRENT_LINE $FILE_BASENAME_LIST`
    CMD_ECHO="$CURRENT_LINE matched $MATCH times," cmd_line_echo
    echo "$CURRENT_DUPECHK_FILE" >> $FILE_DUPEMATCH_FILELIST
    let UNIQ_SCAN_INIT_POINT=UNIQ_SCAN_INIT_POINT+1
done

On numerous occasions, when grepping for the current line in the file location list, grep has written nothing to the current dupechk file even though the file location list definitely contains matches for the current line (running the same command manually in a terminal works fine).

I've rummaged around the internet to see if anyone else has had similar behaviour, and so far all I have found is that it may be something to do with buffered versus unbuffered output from the commands that run before grep in the Bash script.

However, no one seems to have found a solution, so basically I'm asking you guys if you have ever come across this, and for any ideas/tips/solutions to this problem...

Regards

Paul

A: 

Are there any directories with spaces in their names in $FILE_LOCTN_LIST? Because if there are, those spaces will need to be escaped somehow. Some combination of find and xargs can usually deal with that for you, especially xargs -0.
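For illustration, here is a minimal sketch of the -print0/-0 idea; the temp directory and file names below are made up. find -print0 emits NUL-delimited names and xargs -0 splits on NULs, so a path containing spaces arrives at md5sum as a single argument:

```shell
# Made-up demo tree; in the real script this would be $SCAN_DIRNAME
d=$(mktemp -d)
mkdir -p "$d/dir with spaces"
echo data > "$d/dir with spaces/a file.txt"

# NUL-delimited pipeline: the space-laden path stays one argument
find "$d" -type f -print0 | xargs -0 md5sum
```

With plain `find ... | xargs md5sum`, the same path would be split into three arguments and md5sum would report three missing files.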

Andrew McGregor
I am currently using this command to compile the $FILE_LOCTN_LIST:

echo $SCAN_DIRNAME | xargs -I {/} find {/} -type f > $FILE_LOCTN_LIST

I think xargs -I performs similarly to xargs -0?
paultop6
Ok, so it isn't going to be escaping if it's the name of a single file.
Andrew McGregor
+1  A: 

The 'problem' is the standard I/O library. When it is writing to a terminal it is unbuffered, but if it is writing to a pipe then it buffers its output.

try changing

CURRENT_LINE=`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`

to

CURRENT_LINE=`sed "$UNIQ_SCAN_INIT_POINT"'q;d' $UNIQ_LIST_FINAL`
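For reference, the whole loop can also be driven by `while read`, which reads the list once instead of re-running cat|sed for every line. A minimal sketch with made-up file contents; the question's $UNIQ_LIST_FINAL and $FILE_LOCTN_LIST would replace the temp files here:

```shell
# Made-up stand-ins for the question's list files
UNIQ_LIST_FINAL=$(mktemp)
FILE_LOCTN_LIST=$(mktemp)
printf 'report.txt\n' > "$UNIQ_LIST_FINAL"
printf '/home/a/report.txt\n/home/b/report.txt\n' > "$FILE_LOCTN_LIST"

while IFS= read -r CURRENT_LINE; do
    # -F: treat the line as a literal string, not a regex;
    # quoting "$CURRENT_LINE" keeps any embedded spaces intact
    MATCH=$(grep -cF -- "$CURRENT_LINE" "$FILE_LOCTN_LIST")
    echo "$CURRENT_LINE matched $MATCH times"
done < "$UNIQ_LIST_FINAL"
```

This prints `report.txt matched 2 times`, and quoting the grep arguments also sidesteps the word-splitting problems discussed in the other answer.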
ennuikiller
Lifesaver. I understand now that you explained it, but would never have even thought of that otherwise, thanks!
paultop6
A: 

A small bash script using md5sum and sort that detects duplicate files in the current directory:

CURRENT=""
md5sum * |
  sort |
  while read -r md5 filename; do
    [[ $CURRENT == "$md5" ]] && echo "$filename is duplicate"
    CURRENT=$md5
  done
ar
A: 

You tagged linux, so I assume you have tools like GNU find, md5sum, uniq, sort etc. Here's a simple example to find duplicate files:

$ echo "hello world">file
$ md5sum file
6f5902ac237024bdd0c176cb93063dc4  file
$ cp file file1
$ md5sum file1
6f5902ac237024bdd0c176cb93063dc4  file1
$ echo "blah" > file2
$ md5sum file2
0d599f0ec05c3bda8c3b8a68c32a1b47  file2
$ find . -type f -exec md5sum "{}" \; | sort | uniq -w32 -D
6f5902ac237024bdd0c176cb93063dc4  ./file
6f5902ac237024bdd0c176cb93063dc4  ./file1
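The same idea can be made safe for file names containing spaces (the concern raised in the first answer) by feeding md5sum through a NUL-delimited pipeline; the temp directory below is illustrative only:

```shell
# Made-up demo: two files with identical content, names containing spaces
dir=$(mktemp -d)
echo "hello world" > "$dir/my file"
cp "$dir/my file" "$dir/my file copy"

# -print0/-0 keep each name as one argument; uniq -w32 compares only
# the 32-character md5 field, -D prints every member of a duplicate group
find "$dir" -type f -print0 | xargs -0 md5sum | sort | uniq -w32 -D
```

This prints both duplicate entries, one md5sum line per file.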
ghostdog74