views: 204

answers: 7
To use the uniq command, you have to sort your file first.

But in the file I have, the order of the information is important. How can I keep the original order of the file but still get rid of the duplicate content?

A: 

You could use some horrible O(n^2) thing, like this (Pseudo-code):

file2 = EMPTY_FILE
for each line in file1:
  if not line in file2:
    file2.append(line)

This is potentially rather slow, especially if implemented at the Bash level. But if your files are reasonably short, it will probably work just fine, and it would be quick to implement (not line in file2 is then just a negated grep, and so on).
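
A rough Bash sketch of that idea (file1 and file2 are just placeholder names; the "not line in file2" test becomes a quiet, fixed-string, whole-line grep):

: > file2
while IFS= read -r line; do
    # append the line only if an identical line is not already in file2
    grep -qxF -- "$line" file2 || printf '%s\n' "$line" >> file2
done < file1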

Otherwise you could of course code up a dedicated program, using some more advanced data structure in memory to speed it up.

unwind
Thanks unwind. The file I have right now is just a sample file so it's rather short. But the files I will be using it on are going to be large. I'll see what others suggest, and I'll probably try out your suggestion for now.
Dennis
+1  A: 

You can run uniq -d on the sorted version of the file to find the duplicate lines, then run some script that says:

if this_line is in duplicate_lines {
    if not i_have_seen[this_line] {
        output this_line
        i_have_seen[this_line] = true
    }
} else {
    output this_line
}
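
A runnable awk sketch of that pseudo-code (file stands for the input file; the array names mirror the pseudo-code; the duplicate list is piped in as the first input and the original file is read as the second):

sort file | uniq -d | awk '
    FNR == NR { duplicate_lines[$0]; next }   # first input: remember the duplicated lines
    !($0 in duplicate_lines) { print; next }  # not a duplicate: always print
    !i_have_seen[$0]++                        # a duplicate: print only its first occurrence
' - file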
chaos
The benefit of doing this as opposed to slightly simpler solutions, btw, is that you're not keeping a mapping of every line in the file, only the duplicate lines.
chaos
Oh wait, I didn't think about -d. Silly litb. Well, the comm can be cut out then in favor of it :)
Johannes Schaub - litb
Final version after putting in -d instead of using comm: sort file.txt | uniq -d | awk 'FNR==NR { dups[$0]; } FNR!=NR { if($0 in dups) { if(!($0 in lines)) { print $0; lines[$0]; } } else print $0; }' - file.txt
Johannes Schaub - litb
+4  A: 

This awk keeps the first occurrence. Same algorithm as other answers use:

awk '!($0 in lines) { print $0; lines[$0]; }'

Here's one that only needs to store the duplicated lines (as opposed to all lines), again using awk. While FNR == NR it is reading the piped list of duplicates (the - argument) into dups; on the second pass over file it prints every non-duplicate line and only the first occurrence of each duplicate:

sort file | uniq -d | awk '
   FNR == NR { dups[$0] }
   FNR != NR && (!($0 in dups) || !lines[$0]++)
' - file
Johannes Schaub - litb
A: 
sort file1 | uniq | while IFS= read -r line; do
    # find the first occurrence of each unique line, keeping its line number
    grep -nxF -m1 -- "$line" file1 >> out
done

sort -n out

first do the sort,

for each unique value grep for the first match (-m1)

and preserve the line numbers

sort the output numerically (-n) by line number.

you could then remove the line #'s with sed or awk
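
For instance, a small sed sketch that strips the NUMBER: prefix grep -n added:

sort -n out | sed 's/^[0-9]*://'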

Steve B.
+10  A: 

Another awk version:

awk '!_[$0]++' infile
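
In the one-liner, _[$0]++ is 0 (false) the first time a line appears, so !_[$0]++ is true and awk's default action prints the line; every later occurrence has a non-zero count and is skipped. Spelled out with a friendlier array name (seen is my choice, not in the original):

awk '{ if (!seen[$0]) print; seen[$0]++ }' infile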
radoulov
O(n) solution in 8 bytes. +1
ashawley
haha, cute! how does it work? (+1)
Johannes Schaub - litb
ah, now i see :)
Johannes Schaub - litb
Print only when seen for the first time.
radoulov
+4  A: 

There's also the "line-number, double-sort" method: number the lines, sort on the content (field 2 onward, dropping duplicates), sort again numerically on the line number to restore the original order, then cut the line numbers back off.

 nl -n ln | sort -u -k 2 | sort -k 1n | cut -f 2-
ashawley
+1 for a solution that works with very large files. But shouldn't that be "sort -k 1n" (numeric sort)?
Aaron Digulla
yes, you're right.
ashawley
+1  A: 

Using only sort, uniq and grep:

Create d.sh:

#!/bin/sh
# keep the first occurrence of each line while preserving the original order
sort "$1" | uniq > "${1}_uniq"
while IFS= read -r line; do
    # print the line the first time it is seen ...
    grep -m1 -xF -- "$line" "${1}_uniq" >> "${1}_out"
    # ... then drop it from the lookup file so later occurrences are skipped
    grep -v -xF -- "$line" "${1}_uniq" > "${1}_uniq2"
    mv "${1}_uniq2" "${1}_uniq"
done < "$1"
rm "${1}_uniq"

Example:

./d.sh infile
Wadih M.