I have files with this kind of duplicate line, where only the last field differs:

OST,0202000070,01-AUG-09,002735,6,0,0202000068,4520688,-1,0,0,0,0,0,55
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,5
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,55
OST,0202000068,01-AUG-09,003019,6,0,0202000071,4520690,-1,0,0,0,0,0,55

I need to remove the first occurrence of the line and leave the second one.

I've tried:

awk '!x[$0]++ {getline; print $0}' file.csv

but it's not working as intended: it also removes non-duplicate lines.

A: 

As a general strategy (I'm not much of an AWK pro despite taking classes with Aho) you might try:

  1. Concatenate all the fields except the last.
  2. Use this string as a key to a hash.
  3. Store the entire line as the value to a hash.
  4. When you have processed all lines, loop through the hash printing out the values.

This isn't AWK-specific and I can't easily provide sample code, but it's what I would try first.
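The four steps above can be sketched in awk; the sample data here is invented for illustration, and because later lines overwrite earlier ones in the hash, the last of each near-duplicate group is the one kept:

```shell
# Sketch of the strategy above (sample data invented for illustration).
printf '%s\n' 'A,1,x' 'A,1,y' 'B,2,z' |
awk '{
    key = $0
    sub(/,[^,]*$/, "", key)              # step 1: drop the last field
    if (!(key in line)) order[++n] = key # remember first-seen order
    line[key] = $0                       # steps 2-3: last line wins
}
END {
    for (i = 1; i <= n; i++)             # step 4: print the kept lines
        print line[order[i]]
}'
```

Looping over the hash in `for (k in line)` order would also work, but the extra `order` array keeps the output in the file's original line order.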

Willi Ballenthin
+2  A: 
#!/bin/awk -f
{
    s = substr($0, 1, match($0, /,[^,]+$/) - 1)
    if (!seen[s]) {
        print $0
        seen[s] = 1
    }
}
Steven Huwig
This one needs an asterisk after the closing square bracket to match the correct substring. Apart from that, it is identical to `awk '!x[substr($0, 1,16)]++ ' file.csv`. They both suffer in that they print the first of a set of near duplicates, rather than the last.
Ewan Todd
Identical wrt this training data, that is
Ewan Todd
Thanks for the correction, and good catch on the OP's requirements
Steven Huwig
You can make this work "correctly" by sandwiching it between invocations of `tac`, e.g. `tac | script.awk file.txt | tac`. If you're lucky enough to have tac, of course. :)
Steven Huwig
I meant `tac | script.awk | tac file.txt`
Steven Huwig
tac file.csv | script.awk | tac
Ewan Todd
Nice solution! Can be combined with Dennis' solution in the case he identified.
Ewan Todd
See my edited `tac`-free version.
Dennis Williamson
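For concreteness, the `tac` sandwich from the comments might look like this (sample data invented; assumes GNU `tac` is available). Reversing the file first turns "keep the last of each group" into "keep the first", which the `seen[]` filter already does, and the second `tac` restores the original order:

```shell
# The tac sandwich sketched with invented sample data (requires GNU tac).
printf '%s\n' 'A,1,x' 'A,1,y' 'B,2,z' |
tac |
awk '{ s = substr($0, 1, match($0, /,[^,]*$/) - 1); if (!seen[s]++) print }' |
tac
```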
+1  A: 

If your near-duplicates are always adjacent, you can just compare to the previous entry and avoid creating a potentially huge associative array.

#!/bin/awk -f
{
    s = substr($0, 1, match($0, /,[^,]*$/) - 1)
    if (NR > 1 && s != prev) {
        print prev0
    }
    prev = s
    prev0 = $0
}
END {
    print prev0
}

Edit: Changed the script so it prints the last one in a group of near-duplicates (no tac needed).

Dennis Williamson