I have a file with a bunch of lines. Each one of these lines has 8 semi-colon delimited columns.

How can I (in Linux) return duplicate lines but only based on column number 2? Should I be using grep or something else?

+1  A: 

grep may do it, but I'm guessing you'll have a much easier time with awk (aka gawk, on some systems).

The most effective command chain or script for your need depends on a few extra bits of info. For example, is the input file already sorted, and how big is the input (or rather, is it huge, or a stream)?

Assuming sorted input (either originally or from piping through sort), the awk script would look something like this (caution: untested):

Check the solutions provided by Jonathan Leffler or Hai Vu for a way to achieve the same result without the pre-sort requirement.

#!/usr/bin/awk -f
# *** Simple AWK script to output duplicate lines found in input ***
#    Assumes input is sorted on field 2

BEGIN {
    FS = ";";   #delimiter
    dupCtr = 0;       # number of duplicate _instances_
    dupLinesCtr = 0;  # total number of duplicate lines

    firstInSeries = 1;   #used to detect if this is first in series

    prevLine = "";
    prevCol2 = "";  # use another string in case empty field is valid
}

{
  if ($2 == prevCol2) {
    if (firstInSeries == 1) {
      firstInSeries = 0;
      dupCtr++;
      dupLinesCtr++;
      print prevLine
    }
    dupLinesCtr++;
    print $0
  }
  else
     firstInSeries = 1
  prevCol2 = $2
  prevLine = $0
}

END { #optional display of counts etc.
  print "*********"
  print "Duplicate instances = " dupCtr "   Duplicate lines = " dupLinesCtr "   Total lines = " NR;
}
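
For example, assuming the script above is saved as dup_sorted.awk (that file name is just for illustration), it could be run with a pre-sort on field 2 along these lines (also untested):

sort -t';' -k2,2 file.txt | awk -f dup_sorted.awk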
mjv
I don't believe I have that tool. When I type "awk --help" I get a "no such option" message.
goe
Hum... maybe you have gawk, which is in fact "better" (IMHO)
mjv
gawk doesn't exist either
goe
I can't use that since I don't have awk; is there any command-line based solution using grep and/or uniq?
goe
I'm not current on Linux distros nowadays; I figured this came with most (and indeed it may be installed but not in the path). If you want/can, you can get gawk (GNU awk) from http://www.gnu.org/software/gawk/gawk.html. It is a tiny, stand-alone, command-line tool: add it to your bin and you'll be in business. Again, there _may_ be ways to achieve this with grep, but I'm guessing it would be somewhat contrived.
mjv
It sounds like you have a non-GNU version of awk - they do not support `--help`. You do have some version of awk - you would get a command not found message rather than an option not recognized message if it was missing altogether. And you can always install awk if you need it.
Jonathan Leffler
I'm not fluent with uniq, but I think it would be difficult for two reasons: (1) it can only skip the first x fields, it cannot isolate one field (or skip past field y); (2) the delimiter is assumed to be whitespace, and I'm not sure it can be changed. There may be several flavors of uniq, but most idioms I've seen that do field-based duplicate removal involve both sort and gawk, possibly as a preliminary sequence to uniq.
mjv
Using 'uniq' would be hard; using 'sort -u' won't work readily, either. I'm not convinced I like this solution since it requires the data to be sorted beforehand, whereas my solution and Hai Vu's solution both work regardless of whether the data is sorted or not. If the lines must be preserved in order, then you have to do some extra work - the awk 'for (i in arrayname)' construct returns the rows in an indeterminate order.
Jonathan Leffler
@Jonathan and @Hai Vu: +1 for a sort-less solution with minimal added complexity. The only thing to be aware of with this approach is that it could push the limits of in-awk memory if the input were huge (whereas I think sort would handle that better). So until the OP said differently, I went with the sorted assumption. Indeed, he may not use awk at all... some kind of a *nix system he's got ;-)
mjv
+1  A: 

Have a convoluted awk script.

awk 'BEGIN { FS=";" } { c[$2]++; l[$2,c[$2]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }' file.txt

It works by keeping a count of all occurrences of each value in the second field, along with the lines that have that value; at the end it prints out the lines whose counters are greater than 1.

Replace all instances of $2 with whichever field number you need, and the file.txt at the end with your filename.
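
For readability, here is the same one-liner laid out as a multi-line command with comments (identical logic, just reformatted):

awk 'BEGIN { FS = ";" }            # semi-colon delimited fields
{
    c[$2]++                        # count occurrences of each field-2 value
    l[$2, c[$2]] = $0              # remember every line, keyed by value and occurrence number
}
END {
    for (i in c)                   # for each distinct field-2 value...
        if (c[i] > 1)              # ...that was seen more than once,
            for (j = 1; j <= c[i]; j++)
                print l[i, j]      # ...print all the lines that carried it
}' file.txt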

jtbandes
Convoluted and works! You can't beat that :-)
Hai Vu
You can, actually: convoluted and works *and* is documented!
jtbandes
+2  A: 

As @mjv surmised - awk (or Perl, or Python) is a better choice:

awk -F';' ' {
    if (assoc[$2]) {          # This field 2 has been seen before
        if (assoc[$2] != 1) { # The first occurrence has not been printed
            print assoc[$2];  # Print first line with given $2
            assoc[$2] = 1;    # Reset array entry so we know we've printed it;
                              # a full line has 8 fields with semi-colons and
                              # cannot be confused with 1.
        }
        print $0;             # Print this duplicate entry
    }
    else {
        assoc[$2] = $0;       # Record line in associative array, indexed by
                              # second field.  
    }
}' <<!
a;b;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;2;c;d;e;f;g;h
a;z;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;x;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
!

This works, but it can slightly reorder the data - because it prints the first occurrence of a duplicated line when the second occurrence appears. The sample output is:

a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
a;4;c;d;e;f;g;h

This variant of the awk script reorders the testing, leading to a slightly more compact notation. It also explicitly ignores malformed data lines that do not contain 8 fields separated by semi-colons. It is packaged as a shell script, but without any option handling so you can only supply a list of files to scan (it reads standard input if there are no files listed). I removed the Perl-ish semi-colons in the script; awk does not need them.

#!/bin/sh

awk -F';' '
NF == 8 {
    if (!assoc[$2]) assoc[$2] = $0
    else if (assoc[$2] != 1)
    {
        print assoc[$2]
        assoc[$2] = 1
        print $0
    }
    else print $0
}' "$@"
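
Hypothetical usage, assuming the script above is saved as dups.sh (the name is just for illustration) and made executable:

chmod +x dups.sh
./dups.sh data1.txt data2.txt    # scan the named files
./dups.sh < data.txt             # or read standard input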

Also, @mjv commented that there could be memory problems with a solution such as this one if the input is huge, because it keeps a record of each distinct field 2 value in the associative array 'assoc'. We can eliminate that if the data fed into awk is sorted, which we can ensure using sort, of course. Here's a variant script that deals with monstrous inputs (because sort spills data to disk if necessary to hold intermediate results):

sort -t';' -k 2,2 "$@" |
awk -F';' '
BEGIN { last = ";"; line = "" }
NF == 8 {
    if ($2 != last)
    {
        last = $2
        line = $0
    }
    else if (line != "")
    {
        print line
        line = ""
        print $0
    }
    else print $0
}'

This only keeps a copy of one line of input. The output from the sample data is given in sorted order, of course.

Jonathan Leffler
+2  A: 

See my comments in the awk script.

$ cat data.txt 
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416

$ cat dup.awk 
BEGIN { FS = ";" }

{
    # Keep count of the fields in second column
    count[$2]++;

    # Save the line the first time we encounter a unique field
    if (count[$2] == 1)
        first[$2] = $0;

    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$2] == 2)
        print first[$2];

    # From the second time onward, always print, because the field is
    # duplicated
    if (count[$2] > 1)
        print
}

Example output:

$ awk -f dup.awk data.txt
John Thomas;jd;301
John Tomas;jd;302
Alex Tremble;atrem;415
Alex Trebe;atrem;416


Here is my solution #2:

awk -F';' '{print $2}' data.txt |sort|uniq -d|fgrep -f - data.txt

The beauty of this solution is that it preserves the line order, at the expense of using many tools together (awk, sort, uniq, and fgrep).

The awk command prints out the second field, and its output is then sorted. Next, the uniq -d command picks out the duplicated strings. At this point, the standard output contains a list of duplicated second fields, one per line. We then pipe that list into fgrep. The '-f -' flag tells fgrep to read the strings to search for from standard input.

Yes, you can go all out with the command line. I like the second solution better for exercising many tools and for its clearer logic (at least to me). The drawback is the number of tools and possibly the memory used. Also, the second solution is inefficient because it scans the data file twice: the first time with the awk command and the second time with the fgrep command. This consideration matters only when the input file is large.
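
If awk turns out not to be available (as came up in the comments above), roughly the same pipeline should work with cut extracting the second field instead; this variant is untested:

cut -d';' -f2 data.txt | sort | uniq -d | fgrep -f - data.txt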

Hai Vu
Using separate arrays for counting and first instance of line is perhaps neater than my version using one array for both tasks.
Jonathan Leffler
Great solution. Avoids the pre-sort requirement!
mjv
Watch out for non-stable sorting. It can report the first line is a duplicate of the second.
Mr.Ree
The snag I see with the second solution is that if, for the sake of example, the second field contained (just) 'e' and it was repeated, the `fgrep` command would pick up every line with an 'e' anywhere on it. You could reduce the probability of that being a problem by having `awk` do `{print ";" $2 ";"}`; now you would only print stray lines if the value in a repeated $2 also appeared as a value in $1 or $3..$8. For 'e', that might be unlikely; for '1', maybe not so unlikely. To use `grep`, you would have to be able to use a regex such as '^[^;]*;X;' (to restrict X to appearing in the second field).
Jonathan Leffler
A: 

Borrowing from Hai Vu:

% cat data.txt
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416

There's the really easy way (with GNU sort & gawk), though this will re-order the output! Caveat: without --stable, the sort can reorder lines so that the second occurrence comes before the first. Watch out for that!

cat data.txt | sort -k2,2 -t';' --stable | gawk -F';' '{if ( $2==old ) { print $0 }; old=$2; }'

There's also the perl way...

cat data.txt | perl -e 'while(<>) { @data = split(/;/); if ( defined( $test{$data[1]} ) ) { print $_; } $test{$data[1]} = $_; }'


Mr.Ree
A: 

I'm assuming that you're not relying on any particular ordering of the input (that it might not have been pre-sorted on the key (second) field) and that you'd prefer to preserve the order of the input lines in your output ... printing copies of the first and all subsequent lines which contain duplicate values in the second field.

Here's the quickest snippet of code I could come up with in Python:

    import fileinput

    seen = dict()   # key -> (already printed?, first line seen with this key)
    for line in fileinput.input():
        fields = line.split(';')
        key = fields[1]
        if key in seen:
            if not seen[key][0]:
                # First duplicate: print the saved first line, just once
                print seen[key][1],
                seen[key] = (True, seen[key][1])
            # Always print the current line, since its key is duplicated
            print line,
        else:
            seen[key] = (False, line)

The fileinput module lets us handle our input lines in a manner similar to the default awk file/input processing ... or to the semantics of Perl's -n command line switch.

From there we simply keep track of the first line we see with a unique value in the second field, along with a flag indicating whether we've printed it before. When we first find a duplicate, we print the first line that had that key and mark it as having been printed, then we print the current line. For all subsequent duplicates we just print the current line. Obviously, for any non-dupe we simply store it as an entry in our dictionary.

There's probably a more elegant way to handle that "first dupe" boolean ... but this was most obvious to me and shouldn't pose any undue additional overhead. Creating a very simple object/class with its own state ("I've been printed") would be an option, but I think that would make the overall gist of the code more difficult to understand.

It should be obvious that this can be done in any scripting or programming language that supports associative arrays (hashes, dictionaries, tables, whatever your preferred language calls them). The only difference between this code and most of the other examples I've seen in this thread is in the assumptions I'm making about your requirements (that you'd prefer to preserve the relative order of the input and output lines).

Jim Dennis