I have a file with a bunch of lines. Each of these lines has 8 semicolon-delimited columns.
How can I (in Linux) return the duplicate lines, but only based on column number 2? Should I be using grep or something else?
grep may do it, but I'm guessing you'll have a much easier time with awk (aka gawk, on some systems).
The most effective pipeline or script for your need depends on a few extra bits of information: for example, is the input file already sorted, and how big is the input (in particular, is it huge, or a stream)?
Assuming sorted input (either originally, or from piping through sort), the awk script would look something like this (attention: untested).
Check the solutions provided by Jonathan Leffler or Hai Vu for a way to achieve the same result without the pre-sort requirement.
#!/usr/bin/awk -f
# *** Simple AWK script to output duplicate lines found in input ***
# Assumes input is sorted on field 2
BEGIN {
    FS = ";";          # delimiter
    dupCtr = 0;        # number of duplicate _instances_
    dupLinesCtr = 0;   # total number of duplicate lines
    firstInSeries = 1; # used to detect if this is the first line in a series of duplicates
    prevLine = "";
    prevCol2 = "";     # use another string in case an empty field is valid
}
{
    if ($2 == prevCol2) {
        if (firstInSeries == 1) {
            firstInSeries = 0;
            dupCtr++;
            dupLinesCtr++;
            print prevLine;
        }
        dupLinesCtr++;
        print $0;
    }
    else {
        firstInSeries = 1;
    }
    prevCol2 = $2;
    prevLine = $0;
}
END { # optional display of counts etc.
    print "*********";
    print "Total duplicate instances = " dupCtr ", total duplicate lines = " dupLinesCtr ", total lines = " NR;
}
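For completeness, here is a hedged usage sketch (untested; the file name input.txt and script name finddups.awk are placeholders) showing how to satisfy the pre-sort requirement by piping through sort first:

sort -t';' -k2,2 input.txt | awk -f finddups.awk   # -t';' sets the delimiter, -k2,2 sorts on field 2 only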
Have a convoluted awk script.
awk 'BEGIN { FS=";" } { c[$2]++; l[$2,c[$2]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }' file.txt
It works by keeping a counter of all occurrences of each value in the second field, along with the lines that have that value, and then printing out the lines whose counters are greater than 1.
Replace all instances of $2 with whichever field number you need, and the file.txt at the end with your filename.
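As a hedged variation (untested; the field number and file name are placeholders), you can pass the field number in as an awk variable instead of editing the program text:

awk -F';' -v col=2 '{ c[$col]++; l[$col,c[$col]] = $0 } END { for (i in c) if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] }' file.txt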
As @mjv surmised - awk (or Perl, or Python) is a better choice:
awk -F';' ' {
    if (assoc[$2]) {          # This field 2 has been seen before
        if (assoc[$2] != 1) { # The first occurrence has not been printed
            print assoc[$2];  # Print first line with given $2
            assoc[$2] = 1;    # Reset array entry so we know we've printed it;
                              # a full line has 8 fields with semi-colons and
                              # cannot be confused with 1.
        }
        print $0;             # Print this duplicate entry
    }
    else {
        assoc[$2] = $0;       # Record line in associative array, indexed by
                              # second field.
    }
}' <<!
a;b;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;2;c;d;e;f;g;h
a;z;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;x;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
!
This works, but it can slightly reorder the data - because it prints the first occurrence of a duplicated line when the second occurrence appears. The sample output is:
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
a;4;c;d;e;f;g;h
This variant of the awk script reorders the testing, leading to a slightly more compact notation. It also explicitly ignores malformed data lines that do not contain 8 fields separated by semi-colons. It is packaged as a shell script, but without any option handling, so you can only supply a list of files to scan (it reads standard input if there are no files listed). I removed the Perl-ish semi-colons in the script; awk does not need them.
#!/bin/sh
awk -F';' '
NF == 8 {
    if (!assoc[$2]) assoc[$2] = $0
    else if (assoc[$2] != 1)
    {
        print assoc[$2]
        assoc[$2] = 1
        print $0
    }
    else print $0
}' "$@"
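A hedged usage sketch (the script name finddups.sh and the file names are hypothetical): save the script above, make it executable, and either pass it the files to scan or pipe data in on standard input.

chmod +x finddups.sh
./finddups.sh data1.txt data2.txt    # scan the named files
cat data1.txt | ./finddups.sh        # or read standard input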
Also, @mjv commented that there could be memory problems with a solution such as this one if the input is huge, because it keeps a record of each distinct field 2 value in the associative array 'assoc'. We can eliminate that if the data fed into awk is sorted, something we can ensure using sort, of course. Here's a variant script that does deal with monstrous inputs (because sort spills data to disk if necessary to hold intermediate results):
sort -t';' -k 2,2 "$@" |
awk -F';' '
BEGIN { last = ";"; line = "" }
NF == 8 {
    if ($2 != last)
    {
        last = $2
        line = $0
    }
    else if (line != "")
    {
        print line
        line = ""
        print $0
    }
    else print $0
}'
This only keeps a copy of one line of input. The output from the sample data is given in sorted order, of course.
See my comments in the awk script
$ cat data.txt
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
$ cat dup.awk
BEGIN { FS = ";" }
{
    # Keep a count of the values seen in the second column
    count[$2]++;

    # Save the line the first time we encounter a unique field
    if (count[$2] == 1)
        first[$2] = $0;

    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$2] == 2)
        print first[$2];

    # From the second time onward, always print because the field is
    # duplicated
    if (count[$2] > 1)
        print
}
Example output:
$ sort -t ';' -k 2 data.txt | awk -f dup.awk
John Thomas;jd;301
John Tomas;jd;302
Alex Tremble;atrem;415
Alex Trebe;atrem;416
Here is my solution #2:
awk -F';' '{print $2}' data.txt |sort|uniq -d|fgrep -f - data.txt
The beauty of this solution is that it preserves the line order, at the expense of using many tools together (awk, sort, uniq, and fgrep).
The awk command prints out the second field, whose output is then sorted. Next, the uniq -d command picks out the duplicated strings. At this point, the standard output contains a list of duplicated second fields, one per line. We then pipe that list into fgrep. The '-f -' flag tells fgrep to read those search strings from the standard input.
Yes, you can go all out with the command line. I like the second solution better, for exercising many tools and for a clearer logic (at least to me). The drawback is the number of tools and possibly the memory used. Also, the second solution is inefficient because it scans the data file twice: the first time with the awk command and the second with the fgrep command. This consideration matters only when the input file is large.
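To make the flow concrete, here is a hedged, annotated sketch of the same pipeline (untested here; it assumes the data.txt shown above):

awk -F';' '{print $2}' data.txt |   # emit just the second field of every line
    sort |                          # group identical values together
    uniq -d |                       # keep only the values that occur more than once
    fgrep -f - data.txt             # print every line of data.txt containing one of those values

Note that fgrep matches the listed strings anywhere in a line, not only in field 2, which is fine for this data but worth keeping in mind.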
Borrowing from Hai Vu:
% cat data.txt
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
There's the really easy way (with GNU sort & gawk), though this will re-order the output!
(Caveat: without --stable, sort may reorder lines that share a key, so the second occurrence could come before the first. Watch out for that!)
cat data.txt | sort -k2,2 -t';' --stable | gawk -F';' '{if ( $2==old ) { print $0 }; old=$2; }'
There's also the perl way...
cat data.txt | perl -e 'while(<>) { @data = split(/;/); if ( defined( $test{$data[1]} ) ) { print $_; } $test{$data[1]} = $_; }'
I'm assuming that you're not relying on any particular ordering of the input (that it might not have been pre-sorted on the key (second) field) and that you'd prefer to preserve the order of the input lines in your output ... printing copies of the first and all subsequent lines which contain duplicate values in the second field.
Here's the quickest snippet of code I could come up with in Python:
import fileinput

seen = dict()
for line in fileinput.input():
    fields = line.split(';')
    key = fields[1]
    if key in seen:
        if not seen[key][0]:
            print seen[key][1],
            seen[key] = (True, seen[key][1])
        print line,
    else:
        seen[key] = (False, line)
The fileinput module lets us handle our input lines in a manner similar to the default awk file/input processing ... or to the semantics of Perl's -n command line switch.
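In other words (a hedged usage sketch; the script name dup2.py is a placeholder, and since this is Python 2 code a Python 2 interpreter is assumed), the script will accept file names as arguments or read standard input:

python dup2.py data.txt            # read the named file
cat data.txt | python dup2.py      # or read standard input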
From there we simply keep track of the first line we see with a unique value in the second field, and a flag indicating whether we've printed this one before. When we first find a duplicate we print the first line that had that key, and mark it as having been printed, then we print the current line. For all subsequent duplicates we just print the current line. Obviously for any non-dupe we simply post it as an entry to our dictionary.
There's probably a more elegant way to handle that "first dupe" boolean ... but this was most obvious to me and shouldn't pose any undue additional overhead. Creating a very simple object/class with its own state (I've been printed) would be an option. But I think that would make the overall gist of the code more difficult to understand.
It should be obvious that this can be done in any scripting or programming language which supports associative arrays (hashes, dictionaries, tables, or whatever your preferred language calls them). The only difference between this code and most of the other examples I've seen in this thread is in the assumptions I'm making about your requirements (that you'd prefer to preserve the relative order of the input and output lines).