I have a tool that generates tests and predicts the output. The idea is that if I have a failure I can compare the prediction to the actual output and see where they diverged. The problem is the actual output contains some lines twice, which confuses diff. I want to remove the duplicates, so that I can compare them easily. Basically, something like sort -u but without the sorting.

Is there any unix commandline tool that can do this?

+9  A: 

uniq(1)

SYNOPSIS

uniq [OPTION]... [INPUT [OUTPUT]]

DESCRIPTION

Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output).
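For example (illustrative input, not from the question), uniq only collapses runs of adjacent identical lines; a repeated line that reappears later survives:

```shell
# uniq removes duplicates only when they are next to each other
printf 'a\na\nb\na\n' | uniq
# prints: a b a -- the last 'a' survives because it is not adjacent to the first pair
```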

Or, if you want to remove non-adjacent duplicate lines as well, this fragment of perl will do it:

while(<>) {
    print $_ if (!$seen{$_});
    $seen{$_}=1;
}
Paul
The Perl answer only works if you want the first item. The last would be a different solution.
Xetius
And for those who don't know how to use Perl, this is all you need to type: perl -ne 'print unless $seen{$_}++' [INPUT] > OUTPUT
reinierpost
@Xetius, they're the same line :) If you do want the last line, set the seen entry to the line number, don't print in the loop, and then print the lines out in order of line number at the end. But I don't think that's needed in this case.
Paul
@reinierpost, yep, I can never recall the command line options to do that so I tend to resort to full scripts...
Paul
+2  A: 

If you are interested in removing adjacent duplicate lines, use uniq.

If you want to remove all duplicate lines, not just adjacent ones, then it's trickier.
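One standard idiom for the non-adjacent case (not part of this answer; this uses awk's associative arrays, which the other answers do in Perl) is:

```shell
# awk keeps the first occurrence of each line, preserving the original order
printf 'a\nb\na\nc\n' | awk '!seen[$0]++'
# prints: a b c
```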

Chris Jester-Young
A: 

Here's what I came up with while I was waiting for an answer here (though the first (and accepted) answer came in about 2 minutes). I used this substitution in VIM:

%s/^\(.*\)\n\1$/\1/

Which means: look for a line that is immediately followed by an identical line, and replace the pair with a single copy of the captured line.

uniq is definitely easier, though.

Nathan Fellman
+7  A: 

This is complementary to the uniq answers, which work great if you don't mind sorting your file first. If you need to remove non-adjacent duplicates (or want to remove duplicates without rearranging your file), the following Perl one-liner should do it (stolen from here):

cat textfile | perl -ne '$H{$_}++ or print'
Matt J
I think this is a neat answer. Been programming in Perl for about 6 years now and wouldn't have thought of something so concise
Xetius
The Perl part is really nifty. This does, however, qualify for the "Useless Use of cat" award :-) (see http://partmaps.org/era/unix/award.html). Just use "<textfile" at the end.
sleske
I'd never heard of that award! Yeah, I do use cat rather gratuitously sometimes; I have no idea why "cat x | " looks any better than "< x" to me.. it just does :) It may have something to do with the fact that I very often redirect stdout as well, and "./prog < x > y" makes my eyes bleed :P
Matt J
Useless use of cat award! Use perl -ne ...whatever... textfile
Bklyn
A: 

cat textfile | perl -ne '$H{$_}++ or print'

This works well, but how would you use it if you only want to KEEP duplicate lines?

josh
perl -ne '($H{$_}++==1) and print' < filename
jasonmp85