I want to remove all lines that share the same value in the second column (05408736032 in the sample below).

0009300|05408736032|89|01|001|0|0|0|1|NNNNNNYNNNNNNNNN|asdf|
0009367|05408736032|89|01|001|0|0|0|1|NNNNNNYNNNNNNNNN|adff|

These lines are not consecutive. It's fine to remove all of the lines; I don't have to keep one of them around.

Sorry, my Unix fu is really weak from non-usage :).

+2  A: 

Assuming that they're consecutive and you want to remove subsequent ones, the following awk script will do it:

awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}'

It works by printing the first line and storing the second column. Then for subsequent lines, it skips ones where the stored value and second column are the same (if different, it prints the line and updates the stored value).

If they're not consecutive, I'd opt for a Perl solution where you maintain an associative array to detect and remove duplicates - I'd code it up, but my 3yo daughter has just woken up, it's midnight and she has a cold - see you all tomorrow, if I survive the night :-)
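A minimal sketch of that associative-array idea, in Python rather than Perl (the filename input.txt is a placeholder, and the second column is assumed to be index 1 after splitting on '|'): count each second-column value in a first pass, then print only the lines whose value occurred once.

import sys

# first pass: count how often each second-column value occurs
counts = {}
for line in open('input.txt'):
    key = line.split('|')[1]
    counts[key] = counts.get(key, 0) + 1

# second pass: print only the lines whose value occurred exactly once
for line in open('input.txt'):
    if counts[line.split('|')[1]] == 1:
        sys.stdout.write(line)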

paxdiablo
Oh, they are not consecutive.
Surya
+6  A: 

If all your input data is formatted as above - i.e. fixed-size fields - and the order of the lines in the output doesn't matter, sort -k1.9,1.19 -u should do the trick (character positions 9-19 cover the second column). If the order does matter, but duplicate lines are always consecutive, uniq -s 8 -w 11 will work (skip the first 8 characters, compare the next 11). If the fields are not fixed-width but duplicate lines are always consecutive, Pax's awk script will work. In the most general case we're probably looking at something slightly too complicated for a one-liner, though.

moonshadow
+1  A: 

Most Unix systems include Python, so the following few-liner may be just what you need:

f = open('input.txt', 'rt')
d = {}
for s in f:
    l = s.split('|')
    if l[1] not in d:    # the second column is at index 1
        print s,         # trailing comma avoids a second newline (s already ends with one)
        d[l[1]] = True

This will work without requiring fixed-length fields, and even if identical values are not neighbours.

redtuna
That won't remove all lines with duplicate values -- it will print the first instance.
glenn jackman
indeed. The question says "it's fine to remove all instances" - so removing all is not a requirement, it's OK to leave one representative of each. At least, that's how I understood it.
redtuna
A: 

This awk will print only those lines where the second column is not 05408736032:

awk -F'|' '{if ($2 != 05408736032) {print}}' filename

Do you need quotes around the number? Does it get interpreted as an octal number because of the leading zero? Or does it not get interpreted as octal because of the 8 appearing (invalid in octal, of course), but what about if there was no 8 or 9 in the number?
Jonathan Leffler
A: 

If the columns are not fixed width, you can still use sort.

The -t flag sets the separator:

sort -t '|' --key=10,10 -g FILENAME

The -g flag is just for general numeric ordering.

daveb
Use '-k' for maximal (POSIX-compliant) portability (and no '='). Also, why 10,10 for the second column?
Jonathan Leffler
Two reasons. One, when you're using -t, sort uses fields, not characters (i.e. 10, not a higher number). Two, the end of the key (,10) is specified to stop sort from using the rest of the line from that point on.
daveb
A: 

Takes two passes over the input file: 1) find the duplicate values, 2) remove them

awk -F\| '
    {count[$2]++} 
    END {for (x in count) {if (count[x] > 1) {print x}}}
' input.txt >input.txt.dups

awk -F\| '
    NR==FNR {dup[$1]++; next}
    !($2 in dup) {print}
' input.txt.dups input.txt

If you use bash, you can omit the temp file by combining the two commands into one line using process substitution: (deep breath)

awk -F\| 'NR==FNR {dup[$1]++; next} !($2 in dup) {print}' <(awk -F\| '{count[$2]++} END {for (x in count) {if (count[x] > 1) {print x}}}' input.txt) input.txt

(phew!)

glenn jackman
A: 
This keeps only the first line seen for each distinct second-column value:

awk -F"|" '!_[$2]++' file
ghostdog74
A: 

Put the lines in a hash, using the second column as the key and the line as the value, then iterate over the hash (this should work in almost any programming language: awk, Perl, etc.).
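One way that might look in Python (input.txt is again a placeholder; note that iterating a hash does not preserve the original line order):

import sys

seen = {}
for line in open('input.txt'):
    key = line.split('|')[1]    # second column
    if key not in seen:
        seen[key] = line        # keep the first line for each value

for line in seen.values():
    sys.stdout.write(line)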

Helper Method