I have several files in a specific directory. A specific string in one file can also occur in other files. If the string does occur in other files, then all but one of the files containing it should be deleted, so that only one file with the string remains.

Example:

file1
ShortName "Blue Jeans"
price 89.47
cur EURO

file2
ShortName "Blue Jeans"
Price 59.47
CUR USD

file3
ShortName "Blue Jeans"
Price 99.47
CUR GBP

Since the value ShortName "Blue Jeans" also occurs in file2 and file3, both of those files should be deleted. The same applies to files sharing any other ShortName. Could anyone please help me with how this can be done via a script (ksh, sed, awk)? I am on Solaris.

Thank you

A: 

A gawk solution for only these 3 files, since no other information was provided at the time of writing:

awk 'FNR==NR && FNR==1{ get=$0; next}      # first line of the first file: remember it as the key
FNR!=NR && FNR==1 && $0 ~ get{             # first line of each later file: does it match the key?
 cmd="rm \047"FILENAME"\047"               # \047 wraps the filename in single quotes
 print cmd
 # system(cmd) #uncomment to use
}' 1.txt 2.txt 3.txt
ghostdog74
Hello ghostdog74, thank you for your reply. There are n files in the directory, not limited to 3. How can I extend your script to n files in a directory? Thank you.
premier_de
Not tested, but you can try `1.txt *.txt`. It should work if I am not wrong.
ghostdog74
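
For reference, that extension would look like the sketch below (untested; on Solaris, substitute nawk, as the thread notes further down). One caveat: *.txt expands to include 1.txt itself a second time, so its own first line matches the key and the file would be flagged for removal unless it is skipped explicitly:

awk 'FNR==NR && FNR==1{ get=$0; next}
FNR!=NR && FNR==1 && FILENAME!="1.txt" && $0 ~ get{
 cmd="rm \047"FILENAME"\047"
 print cmd
 # system(cmd) #uncomment to use
}' 1.txt *.txt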
Hello, I tried your script; somehow I keep getting a syntax error near line 1: awk 'FNR==NR
premier_de
what OS are you using?
ghostdog74
Use nawk on Solaris!
ghostdog74
I used nawk on Solaris: nawk 'FNR==NR next} FNR!=NR
premier_de
What is "rm >>> " and <<< 047? My code doesn't have ">>>"!
ghostdog74
If I execute the script, it throws an error: nawk: syntax error at source line 1, context is FNR==NR
premier_de
Please post the actual code you used in your first question.
ghostdog74
Hello, thanks, it can be executed now. However, the script does not iterate over all the files in the directory. It picks up the first file from the directory and looks for the occurrence of its ShortName "Blue Jeans" in the remaining files. If another file has a different ShortName, such as "Yellow T Shirt", it does not process it.
premier_de
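
A single-pass nawk sketch in the same spirit would address this by grouping files by whatever ShortName each one carries (untested; it assumes every file holds exactly one ShortName line and keeps the first file seen for each name):

nawk -F'"' '$1 ~ /^ShortName/{
 if ($2 in seen){                  # name already seen in an earlier file
  cmd="rm \047"FILENAME"\047"
  print cmd
  # system(cmd) #uncomment to use
 } else seen[$2]=1                 # first file with this name: keep it
}' *.txt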
A: 

This script finds all duplicates and leaves only one example of each. For example, let's say there are three "Blue Jeans" files, two "Plaid Shirt" files, one "Sneakers" file and several files with no "ShortName". After running this script, you should have one of each: "Blue Jeans", "Plaid Shirt" and "Sneakers", and the other files should be untouched. Price and currency are completely ignored.

Paranoid disclaimer: This is ugly and guaranteed to blow up. Caveat emptor. No refunds.

#!/bin/bash
dir="apparel"
saveIFS="$IFS"
IFS=$'\n'    # split only on newlines so each "count name" line stays intact
strings=($(sed -n 's/ShortName "\(.*\)"/\1/p' ${dir}/*|sort|uniq -c))
IFS="$saveIFS"
for string in "${strings[@]}"
do
    count=${string:0:7}    # uniq -c right-justifies the count in the first columns
    count=${count// }      # strip the padding spaces
    string=${string:8}     # the remainder of the line is the name itself
    if (( count > 1 ))     # numeric comparison, rather than the string test [[ $count > 1 ]]
    then
        first=1
        for f in $(grep -l "$string" ${dir}/*)
        do
            if [[ $first ]]
            then
                unset first    # keep the first file containing this name
            else
                echo rm "$f"
            fi
        done
    fi
done

Remove the echo after you've tested it to make the rm work.

Dennis Williamson
Hello Dennis, yes, I tested it and it is working. Under which circumstances will it blow up? What are its pitfalls? I tested it on some 30 files, but in production I am expecting more than 200 files. Can it survive? Your view, please.
premier_de
My confidence is somewhat improved after reading your comment clarifying the initial conditions (the comment which references "Yellow T Shirt"). The pitfalls: I have not tested this extensively; the search string could be more robust (in any of a number of ways; here's one: `^[ \t]*ShortName[ \t]\+"\(.*\)"[ \t]*$`); parsing the output of `uniq -c` using `${var:n:m}` might need different values for n and m on some systems; and you could `mv` the files into a temporary directory, check it, then delete it. I don't think it will have a problem if filenames have spaces or other such characters.
Dennis Williamson
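
A sketch of one such hardening, combining the anchored pattern from the comment above with `uniq -d` so the count parsing disappears entirely (untested; it assumes GNU sed, since stock Solaris sed may not accept `\+` or `\t`, and it uses `grep -F` to search for the name as a fixed string):

#!/bin/bash
dir="apparel"
sed -n 's/^[ \t]*ShortName[ \t]\+"\(.*\)"[ \t]*$/\1/p' "${dir}"/* | sort | uniq -d |
while IFS= read -r name                        # uniq -d prints each duplicated name once
do
    first=1
    for f in $(grep -lF "ShortName \"$name\"" "${dir}"/*)
    do
        if [[ $first ]]
        then
            unset first                        # keep the first file holding this name
        else
            echo rm "$f"                       # remove the echo once verified
        fi
    done
done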
If it works on your 30-file test, there's no reason it shouldn't work for 200 files. As always, backups are your friend.
Dennis Williamson