Hi all,

I was wondering if there was a way to use bash/awk to remove duplicate rows based on a known field range. For example:

Easy Going                  USA:22 May 1926
Easy Going Gordon               USA:6 August 1925   
Easy Life                   USA:20 May 1944
Easy Listening                  USA:14 January 2002 
Easy Listening                  USA:10 October 2002 
Easy Listening                  USA:27 January 2004 
Easy Living                     USA:7 July 1937 
Easy Living                     USA:16 July 1937
Easy Living                     USA:4 September 2009

I would like to remove duplicate movie titles. The movie title will always be from $1 through $(NF-3). Ideally I would like to keep the first occurrence (earliest date), but if that's not possible then it doesn't matter which one is kept.

Thanks,

Tomek

A: 

This could be a quick answer

sort -t':' -k1,1 -u your-file
enzotib
Not really. You are making the country name `USA` part of the movie name.
codaddict
Yeah, I said it is a quick answer, and it should not be a big problem: are there movies from different countries with the same title? Possible, but not probable.
enzotib
If this data is from IMDB dump, those are the *release* dates.
codaddict
+2  A: 
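#!/bin/bash

# mktime is a gawk extension, not POSIX awk.
awk 'BEGIN{
   # Map month names to two-digit month numbers.
   m=split("January|February|March|April|May|June|July|August|September|October|November|December",d,"|")
   for(o=1;o<=m;o++){
      months[d[o]]=sprintf("%02d",o)
   }
}
{
   sub(/.*:/,"",$(NF-2))                                 # strip "USA:" so only the day remains
   t=mktime($(NF)" "months[$(NF-1)]" "$(NF-2)" 0 0 0")   # "YYYY MM DD HH MM SS"
   time[t]=$(NF-2) FS $(NF-1) FS $(NF)                   # printable date for this timestamp
   $(NF-2)=$(NF-1)=$(NF)=""                              # blank the date fields, leaving the title
   gsub(/ +$/,"")
   # Keep the earliest timestamp seen for each title.
   if (!($0 in array) || t < array[$0]){ array[$0]=t }
}
END{
  for(i in array){ print "->",i,time[array[i]]  }
} ' file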

output

$ ./shell.sh
-> Easy Living 7 July 1937
-> Easy Going Gordon 6 August 1925
-> Easy Listening 14 January 2002
-> Easy Going 22 May 1926
-> Easy Life 20 May 1944
ghostdog74
How would you preserve the country?
Dennis Williamson
Use another array, with the date found as the key and the country as the value.
ghostdog74
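One possible way to keep the country, sketched under a few assumptions: instead of a second array keyed by date, it stores the whole original line for the earliest date seen per title, so the country comes along unchanged. The function name dedupe_earliest and the YYYYMMDD numeric key are illustrative choices, not part of the answer above; the key replaces mktime so the sketch does not depend on gawk. It assumes every line ends in COUNTRY:DAY MONTH YEAR with a single-word country, as in the question.

```shell
#!/bin/bash

# Hypothetical helper: print one line per title, keeping the earliest date
# and the original line (country included). Output order is unspecified.
dedupe_earliest() {
  awk '
  BEGIN {
    n = split("January February March April May June July August September October November December", names, " ")
    for (i = 1; i <= n; i++) month[names[i]] = i
  }
  NF < 4 { next }                      # too short to hold a title plus the date fields
  {
    line = $0
    sub(/[ \t]+$/, "", line)           # drop trailing whitespace from the saved line
    split($(NF-2), cd, ":")            # cd[1] = country, cd[2] = day
    key = $NF * 10000 + month[$(NF-1)] * 100 + cd[2]   # sortable YYYYMMDD number
    $(NF-2) = $(NF-1) = $NF = ""       # blank the date fields, leaving the title
    sub(/ +$/, "")
    title = $0
    if (!(title in best) || key < best[title]) {
      best[title] = key
      keep[title] = line
    }
  }
  END { for (t in keep) print keep[t] }
  ' "$@"
}
```

Something like `dedupe_earliest movies.txt | sort` would then give one line per title in sorted order; the input does not need to be sorted by date for this to pick the earliest release.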
+1  A: 
awk '
    {
        line = $0
        $(NF-2) = $(NF-1) = $NF = ""
        if ( ! ($0 in movies)) 
            movies[$0] = line
    }
    END {
        for (m in movies) print movies[m] 
    }
' movies.txt 

That does not preserve the original line ordering. You might want to sort the output.

glenn jackman
Note that this also ignores the date.
Dennis Williamson
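If the input is already sorted by title and then by date, as the sample in the question appears to be, a variation that prints each title's first line as soon as it is seen would keep both the original ordering and the earliest date. This is only a sketch under that assumption; the function name dedupe_first and the seen array are illustrative, not from the answer above.

```shell
#!/bin/bash

# Hypothetical helper: print the first line seen for each title,
# in input order. Assumes the title is $1 through $(NF-3).
dedupe_first() {
  awk '
  NF < 4 { print; next }               # pass through lines without a title plus date fields
  {
    line = $0
    $(NF-2) = $(NF-1) = $NF = ""       # blank the date fields so $0 is the title key
    if (!seen[$0]++) print line        # true only the first time this title appears
  }
  ' "$@"
}
```

It could be called as `dedupe_first movies.txt`. On unsorted input it still keeps the first occurrence of each title, but that is then not necessarily the earliest date.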