Hello,

I have a file with lines having some fields delimited by "|".

I have to extract the lines that are identical based on some of the fields (i.e. find lines that contain the same values in fields 1, 2, 3, 12, and 13). The contents of the other fields do not matter for the comparison, but the extracted lines must be output in full.

Can anyone tell me how to do that in KSH scripting? For example, a script whose arguments (order-dependent) define the field separator and the fields to compare when looking for duplicate lines in the input file.

Thanks in advance and kind regards

Oli

A: 

This prints duplicate lines based on matching fields. It uses an associative array, which could grow large depending on the nature of your input file. The output is not sorted, so the lines of a duplicate set are generally not grouped together (except the first two of each set). The key fields are joined with FS so that different field values (for example "ab|c" and "a|bc") cannot produce the same key.

awk -F'|' '{ idx=$1 FS $2 FS $3 FS $12 FS $13; if (array[idx] == 1) {print} else if (array[idx]) {print array[idx]; print; array[idx]=1} else {array[idx]=$0}}' inputfile.txt
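
If holding whole lines in memory is a concern, one alternative is to read the file twice: the first pass only counts keys, and the second prints every line whose key occurred more than once, in input order. The following is a minimal sketch under that assumption; the function name `print_dups` is hypothetical, and because the file is read twice it needs a real file rather than a pipe.

```shell
# Hypothetical helper: print every line whose key fields occur more than once.
# The same file is given to awk twice; NR == FNR is true only during the first read.
print_dups() {
    awk -F'|' '
        { key = $1 FS $2 FS $3 FS $12 FS $13 }   # build the key for every line
        NR == FNR { count[key]++; next }         # pass 1: count keys only
        count[key] > 1                           # pass 2: print lines with a repeated key
    ' "$1" "$1"
}
```

It would be called as `print_dups inputfile.txt`. Only one counter per distinct key is stored, at the cost of reading the input twice.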

You could probably build up your index list in a shell variable in a wrapper script something like this:

#!/bin/ksh
idx=
for arg
do
    case $arg in    # validate input (could be better)
        +([0-9]) ) # integers only
            idx="${idx:+$idx FS }\$$arg"    # builds: $1 FS $2 FS $3 ...
            ;;
        * )
            echo "Invalid field specifier: $arg" >&2
            exit 1
            ;;
    esac
done
awk -F'|' '{ idx='"$idx"'; if (array ...
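
A different way to parametrize the command, without splicing shell text into the awk program, is to pass the field numbers with `-v` and build the key inside awk. This is only a sketch: the function name `find_dups` and its argument order (separator first, then field numbers) are assumptions for illustration, not part of the answer above. It sticks to POSIX shell syntax, so it should also run under ksh. Note that a separator longer than one character is treated by awk as a regular expression.

```shell
# Hypothetical usage: find_dups SEP FIELD [FIELD...] < inputfile
find_dups() {
    if [ "$#" -lt 2 ]; then
        echo "usage: find_dups SEP FIELD [FIELD...]" >&2
        return 1
    fi
    sep=$1
    shift
    fields=
    for arg
    do
        case $arg in
            ''|*[!0-9]*)    # reject anything that is not a plain integer
                echo "Invalid field specifier: $arg" >&2
                return 1
                ;;
        esac
        fields="${fields:+$fields,}$arg"    # e.g. 1,2,3,12,13
    done
    awk -F"$sep" -v fields="$fields" '
        BEGIN { n = split(fields, f, ",") }
        {
            key = ""
            for (i = 1; i <= n; i++)
                key = key $(f[i]) SUBSEP    # SUBSEP keeps field boundaries unambiguous
            count[key]++
            if (count[key] == 1) {
                first[key] = $0             # remember the line until a match shows up
            } else {
                if (count[key] == 2) {      # first match: emit the remembered line too
                    print first[key]
                    delete first[key]
                }
                print
            }
        }
    '
}
```

For the case in the question, the call would look like `find_dups '|' 1 2 3 12 13 < inputfile.txt`.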

You can sort the output by piping it through a command such as this:

awk ... | sort  --field-separator='|' --key=1,1 --key=2,2 --key=3,3 --key=12,12 --key=13,13
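
The long options `--field-separator` and `--key` are GNU extensions; the POSIX short forms `-t` and `-k` do the same job on other systems. A small sketch, in which the function name `sort_by_key` is hypothetical and `LC_ALL=C` is an assumption made to get plain byte-order comparison:

```shell
# Hypothetical helper: sort '|'-delimited lines on fields 1, 2, 3, 12 and 13.
sort_by_key() {
    LC_ALL=C sort -t '|' -k1,1 -k2,2 -k3,3 -k12,12 -k13,13 "$@"
}
```

It reads standard input when no file is given, so it can sit at the end of the pipeline in place of the `sort` command above.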
Dennis Williamson
A: 

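This prints only the first line seen for each combination of the key fields, so every set of duplicates is reduced to a single line (lines that have no duplicate are printed as well):

awk -F'|' '!arr[$1,$2,$3,$12,$13]++' inputfile > outputfile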
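
Note that `!arr[key]++` is true the first time any key is seen, including keys that never repeat. If the goal is instead one line per set of duplicates, with unique lines left out, a possible variation is to compare the counter with 1, which is true only on the second occurrence of a key. This is a sketch; the function name `dup_once` is hypothetical.

```shell
# Hypothetical helper: print one line (the second occurrence) for each key
# that appears at least twice; keys that appear once print nothing.
dup_once() {
    awk -F'|' 'seen[$1,$2,$3,$12,$13]++ == 1' "$@"
}
```

It would be used as `dup_once inputfile > outputfile`.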
jim mcnamara
