ansaurus

Question

Find duplicate lines based on some delimited fields on the line

Answer 1

A:

This prints duplicate lines based on matching fields. It uses an associative array, which could grow large depending on the nature of your input file. The output is in input order rather than sorted, so the members of a duplicate set are generally not adjacent (apart from the first two, which are printed together).

awk -F'|' '{ idx=$1$2$3$12$13; if (array[idx] == 1) {print} else if (array[idx]) {print array[idx]; print; array[idx]=1} else {array[idx]=$0}}' inputfile.txt
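One caveat with `$1$2$3$12$13`: plain concatenation cannot tell `ab|c` from `a|bc`, and a stored line that is empty, `0` or `1` would confuse the `array[idx]` tests. Here is a minimal sketch of a variant that avoids both, assuming a POSIX awk. The function name `find_dups` and the `count`/`first` arrays are my own and not part of the command above.

```shell
#!/bin/sh
# Hypothetical helper: print every line whose key (fields 1,2,3,12,13)
# occurs more than once. Reads the named files, or stdin if none are given.
find_dups() {
    awk -F'|' '{
        # SUBSEP between fields keeps "ab","c" distinct from "a","bc"
        idx = $1 SUBSEP $2 SUBSEP $3 SUBSEP $12 SUBSEP $13
        if (++count[idx] == 1) {
            first[idx] = $0          # remember the first line, print nothing yet
        } else {
            if (count[idx] == 2) {   # second hit: emit the remembered line once
                print first[idx]
                delete first[idx]    # free the stored copy
            }
            print
        }
    }' "$@"
}

# Usage: find_dups inputfile.txt
```

Counting occurrences instead of overloading the stored value with `1` is what makes the content of the line irrelevant to the logic.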

You could probably build up your index list in a shell variable in a wrapper script something like this:

#!/bin/ksh
for arg
do
    case $arg in   # validate input (could be better)
        +([0-9]) ) # integers only
            idx="$idx\$$arg"   # builds the literal text $1$2... for awk
            ;;
        * )
            echo "Invalid field specifier" >&2
            exit 1
            ;;
    esac
done
awk -F'|' '{ idx='$idx'; if (array ...
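An alternative to splicing the field list into the awk program text is to pass it in with `-v` and let awk build the key in a loop. This is only a sketch under my own assumptions: the function name `dups_by_fields`, its argument order (file first, then field numbers) and the array names are hypothetical, and it is written for POSIX sh rather than ksh.

```shell
#!/bin/sh
# Hypothetical wrapper: dups_by_fields FILE FIELD...
# Prints every line of FILE whose key (the listed fields) occurs more than once.
dups_by_fields() {
    file=$1
    shift
    fields=
    for arg in "$@"; do
        case $arg in
            ''|*[!0-9]*)             # reject anything that is not all digits
                echo "Invalid field specifier: $arg" >&2
                return 1
                ;;
        esac
        fields="$fields${fields:+ }$arg"
    done
    if [ -z "$fields" ]; then
        echo "No fields given" >&2
        return 1
    fi
    awk -F'|' -v fields="$fields" '
        BEGIN { n = split(fields, f, " ") }
        {
            idx = ""
            for (i = 1; i <= n; i++)
                idx = idx SUBSEP $(f[i])     # key built from the chosen fields
            if (++count[idx] == 1) {
                first[idx] = $0
            } else {
                if (count[idx] == 2) { print first[idx]; delete first[idx] }
                print
            }
        }' "$file"
}

# Usage: dups_by_fields inputfile.txt 1 2 3 12 13
```

Because the field numbers arrive as data rather than as program text, no shell quoting tricks are needed inside the awk script.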

You can sort the output by piping it through a command such as this:

awk ... | sort  --field-separator='|' --key=1,1 --key=2,2 --key=3,3 --key=12,12 --key=13,13

Dennis Williamson 2010-05-26 14:07:35

Answer 2

A:

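This prints one line per distinct key: the first line seen for each combination of the fields. Lines whose key occurs only once are printed too, so the result is a de-duplicated copy of the input rather than a list of the duplicates: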

awk -F'|' '!arr[$1$2$3$12$13]++' inputfile > outputfile
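To make the behaviour of the `!arr[...]++` idiom concrete, here is a small sketch with two helpers whose names are my own. `first_per_key` is the same idiom written with comma subscripts (awk joins them with `SUBSEP`, so `ab|c` and `a|bc` stay distinct), and `dup_once` is a variant that prints one line only for keys that occur at least twice.

```shell
#!/bin/sh
# Hypothetical helpers; both read the named files, or stdin if none are given.

# Keep the first line seen for each key, including keys that occur only once.
first_per_key() {
    awk -F'|' '!seen[$1,$2,$3,$12,$13]++' "$@"
}

# Print one line per duplicated key: the second occurrence, and nothing else.
dup_once() {
    awk -F'|' 'seen[$1,$2,$3,$12,$13]++ == 1' "$@"
}

# Usage:
#   first_per_key inputfile > outputfile
#   dup_once inputfile > duplicates
```

The post-increment returns the old count, so `!seen[k]++` is true only on the first sighting of a key and `seen[k]++ == 1` only on the second.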

jim mcnamara 2010-05-29 14:22:35
