ansaurus

Question

How to keep a file's format if you use the uniq command (in shell)?

Answer 1

A:

You could use some horrible O(n^2) thing, like this (Pseudo-code):

file2 = EMPTY_FILE
for each line in file1:
  if not line in file2:
    file2.append(line)

This is potentially rather slow, especially if implemented at the Bash level. But if your files are reasonably short, it will probably work just fine, and it would be quick to implement (the "not line in file2" test is then just a negated grep -qxF, and so on).

Otherwise you could of course code up a dedicated program, using some more advanced data structure in memory to speed it up.
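
At the shell level, the quadratic loop above might look like the following. This is only a sketch: the function name dedup_quadratic is made up for illustration, and the membership test is a negated grep -qxF (quiet, whole-line, fixed-string match).

```shell
# dedup_quadratic IN OUT  (hypothetical helper name)
# Copy IN to OUT, skipping every line that has already been written to OUT.
# O(n^2): OUT is rescanned by grep once for every input line.
dedup_quadratic() {
    : > "$2"                                   # start with an empty output file
    while IFS= read -r line; do
        # -q: exit status only, -x: whole-line match, -F: fixed string, not a regex
        if ! grep -qxF -e "$line" "$2"; then
            printf '%s\n' "$line" >> "$2"
        fi
    done < "$1"
}
```

Usage would be something like: dedup_quadratic file1 file2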

unwind 2009-03-13 15:12:31

Thanks unwind. The file I have right now is just a sample file so it's rather short. But the files I will be using it on are going to be large. I'll see what others suggest, and I'll probably try out your suggestion for now.

Dennis 2009-03-13 15:15:08

Answer 2

+1 A:

You can run uniq -d on the sorted version of the file to find the duplicate lines, then run some script that says:

if this_line is in duplicate_lines {
    if not i_have_seen[this_line] {
        output this_line
        i_have_seen[this_line] = true
    }
} else {
    output this_line
}

chaos 2009-03-13 15:15:37

The benefit of doing this as opposed to slightly simpler solutions, btw, is that you're not keeping a mapping of every line in the file, only the duplicate lines.

chaos 2009-03-13 15:20:38

Oh wait, I didn't think about -d. Silly litb. Well, the comm can be cut out then in favor of it :)

Johannes Schaub - litb 2009-03-13 15:45:48

Final edition, after putting in -d instead of using comm: sort file.txt | uniq -d | awk 'FNR==NR { dups[$0]; } FNR!=NR { if($0 in dups) { if(!($0 in lines)) { print $0; lines[$0]; } } else print $0; }' - file.txt

Johannes Schaub - litb 2009-03-13 15:47:20

Answer 3

+4 A:

This awk keeps the first occurrence. Same algorithm as other answers use:

awk '!($0 in lines) { print $0; lines[$0]; }'

Here's one that only needs to store duplicated lines (as opposed to all lines) using awk:

sort file | uniq -d | awk '
   FNR == NR { dups[$0] }
   FNR != NR && (!($0 in dups) || !lines[$0]++)
' - file
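
One caveat with the FNR == NR test: if the file contains no duplicates at all, the first input (stdin) is empty, so FNR == NR stays true while the file itself is read and nothing is printed. A possible variant, sketched here under the assumption that awk processes var=value operands in order between input files (as POSIX awk does), marks the two passes explicitly. The names dedup_dups_only, pass and seen are illustrative, not from the answer.

```shell
# dedup_dups_only FILE  (hypothetical helper name)
# Keep the first occurrence of every line, storing only the duplicated lines.
dedup_dups_only() {
    sort "$1" | uniq -d | awk '
        pass == 1 { dups[$0]; next }     # stdin: remember lines that occur more than once
        !($0 in dups) || !seen[$0]++     # file: print unique lines always, duplicated lines once
    ' pass=1 - pass=2 "$1"
}
```

The pass=1 and pass=2 operands are assigned just before awk opens the input that follows them, so the second rule still runs when stdin is empty.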

Johannes Schaub - litb 2009-03-13 15:18:23

Answer 4

A:

sort file1 | uniq | while IFS= read -r line; do
    # -x -F: match the whole line as a fixed string; -n: prefix the line number
    grep -n -m1 -xF -e "$line" file1 >>out
done

sort -n out

First do the sort.

For each unique value, grep for the first match (-m1)

and preserve the line numbers (grep -n).

Sort the output numerically (-n) by line number.

You could then remove the line numbers with sed or awk.
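
Put together, including the last step of stripping the line numbers, the approach could be sketched as below. The function name dedup_grep_first is made up for illustration, and the sketch assumes a grep that supports -m, as GNU grep does. It runs one grep per distinct line, so it will be slow on large files.

```shell
# dedup_grep_first FILE  (hypothetical helper name)
# Print FILE with later duplicates removed, keeping the original line order.
dedup_grep_first() {
    sort "$1" | uniq | while IFS= read -r line; do
        # First exact match only, printed as "N:line"
        grep -n -m1 -xF -e "$line" "$1"
    done |
        sort -t: -k1,1n |        # back into original order, by line number
        sed 's/^[0-9]*://'       # drop the "N:" prefix added by grep -n
}
```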

Steve B. 2009-03-13 15:21:00

Answer 5

+10 A:

Another awk version:

awk '!_[$0]++' infile

radoulov 2009-03-13 15:37:11

O(n) solution in 8 bytes. +1

ashawley 2009-03-13 15:42:51

haha, cute! how does it work? (+1)

Johannes Schaub - litb 2009-03-13 15:49:45

ah, now i see :)

Johannes Schaub - litb 2009-03-13 15:54:56

Print only when seen for the first time.
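
Spelled out, the one-liner relies on two awk defaults: an array element that was never set evaluates to 0, and a pattern without an action prints the current line. A longer sketch that should behave the same way (the names dedup_first_seen and seen are illustrative):

```shell
# dedup_first_seen [FILE...]  (hypothetical helper name)
# Long form of: awk '!_[$0]++'
dedup_first_seen() {
    awk '{
        if (seen[$0] == 0) {   # counter still 0: this line has not occurred before
            print
        }
        seen[$0]++             # count the occurrence, so later copies are skipped
    }' "$@"
}
```

Every distinct line ends up as a key in the array, so memory use grows with the number of distinct lines.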

radoulov 2009-03-13 16:21:35

Answer 6

+4 A:

There's also the "line-number, double-sort" method.

 nl -b a -n ln | sort -t "$(printf '\t')" -u -k 2 | sort -k 1n | cut -f 2-
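
The pipeline can be read stage by stage. The sketch below wraps it in a hypothetical function, dedup_double_sort, that numbers blank lines as well (nl -b a) and uses the tab that nl inserts after the number as the field separator, so the padding after the number never becomes part of the sort key. It assumes sort -u keeps the first line of each run of equal keys, as GNU sort does.

```shell
# dedup_double_sort FILE  (hypothetical helper name)
# Line-number, double-sort method: memory use is left to sort, which can spill to disk.
dedup_double_sort() {
    tab=$(printf '\t')
    nl -b a -n ln "$1" |            # 1. prefix every line with "N<TAB>", blank lines included
        sort -t "$tab" -u -k 2 |    # 2. sort by content, keep one line per distinct content
        sort -k 1n |                # 3. restore the original order by line number
        cut -f 2-                   # 4. strip the line numbers again
}
```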

ashawley 2009-03-13 15:41:17

+1 for a solution that works with very large files. But shouldn't that be "sort -k 1n" (numeric sort)?

Aaron Digulla 2009-03-13 16:36:04

yes, you're right.

ashawley 2009-03-13 17:48:24

Answer 7

+1 A:

Using only uniq and grep:

Create d.sh:

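#!/bin/sh
# Keep the first occurrence of each line of "$1", in original order, in "$1_out".
sort "$1" | uniq > "${1}_uniq"
while IFS= read -r line; do
    # Emit the line if it is still in the pool of unseen lines...
    grep -m1 -xF -e "$line" "${1}_uniq" >> "${1}_out"
    # ...then remove it from the pool so later copies print nothing.
    grep -v -xF -e "$line" "${1}_uniq" > "${1}_uniq2"
    mv "${1}_uniq2" "${1}_uniq"
done < "$1"
rm "${1}_uniq"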

Example:

./d.sh infile

Wadih M. 2009-03-13 16:08:02
