ansaurus

Question

Swapping columns in a file and removing duplicates

Answer 1

+1 A:

Sorting the words within each line and then sorting the lines is easy with Perl. Pipe the result through uniq to drop the duplicates:

./scriptbelow.pl < datafile.txt | uniq

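#!/usr/bin/perl
use strict;
use warnings;

# Normalize every input line, then sort the lines so that identical
# pairs end up adjacent for uniq.
foreach (sort map { reorder($_) } <>) {
    print;
}

# Return the words of one line in sorted order, newline-terminated.
sub reorder {
    my ($line) = @_;
    # split ' ' skips leading whitespace and drops the trailing newline
    return join(' ', sort { $a cmp $b } split(' ', $line)) . "\n";
}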

h0tw1r3 2010-04-12 20:06:17

Answer 2

+1 A:

In Perl:

while($t=<>) {
 @ts=sort split(/\s+/, $t);
 $t1 = join(" ", @ts);
 print $t unless exists $done{$t1};
 $done{$t1}++;
}

Or:

perl -n -e 'print join(" ", sort split) . "\n";' yourfile | sort -u

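I'm not sure which one performs better for huge files. The first builds a Perl hash in memory with one entry per distinct pair and keeps the original line order; the second normalizes every line and hands the deduplication to an external "sort" command, so its output comes out sorted.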
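
As a rough illustration of the sort-based route without Perl, the sketch below normalizes each pair in awk and lets sort -u remove the duplicates. It assumes exactly two whitespace-separated columns per line, and the function name normalize_and_sort is made up for this example. External sort implementations such as GNU sort can typically spill to temporary files rather than hold everything in memory, at the price of losing the original line order.

```shell
# Hypothetical helper: assumes two whitespace-separated columns per line.
normalize_and_sort() {
    # Print each pair with the smaller word first; the "" concatenation
    # forces a string comparison even for numeric-looking fields.
    LC_ALL=C awk '{
        if (($1 "") > ($2 "")) print $2, $1
        else print $1, $2
    }' "$@" | LC_ALL=C sort -u
}
```

Usage would look like `normalize_and_sort yourfile`, or the function can read from a pipe when called without arguments.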

leonbloy 2010-04-12 20:16:55

Answer 3

+1 A:

To preserve original ordering, a simple (but not necessarily fast and/or memory-efficient) solution in awk:

awk '!seen[$1 " " $2] && !seen[$2 " " $1] { seen[$1 " " $2] = 1; print }'
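
One possible variation on this idea stores a single order-independent key per pair, so each line needs only one array lookup and no extra entries are created by the tests themselves. This is a sketch under the assumption of exactly two columns per line; the name dedupe_pairs is hypothetical.

```shell
# Hypothetical helper: assumes two whitespace-separated columns per line.
dedupe_pairs() {
    awk '{
        # Build an order-independent key: smaller word first. The ""
        # concatenation forces a string comparison.
        key = (($1 "") < ($2 "")) ? ($1 FS $2) : ($2 FS $1)
        # Print only the first line seen for each key.
        if (!(key in seen)) { seen[key] = 1; print }
    }' "$@"
}
```

Called as `dedupe_pairs file`, it prints the first occurrence of each pair in the original input order.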

Edit: Sorting alternative in ruby:

ruby -n -e 'puts $_.split.sort.join(" ")' | sort | uniq

Arkku 2010-04-12 20:27:51

Answer 4

+1 A:

If the file is very, very long, maybe you should consider writing your program in C or C++. I think that would be the fastest solution, especially if you have to scan the whole file for each line that you read. Processing with bash functions gets very slow with big files and repetitive operations.

2010-04-12 21:37:03

Then he would spend time on low-level work such as memory management. Tools like awk, Perl, and Python are capable of handling large files.

ghostdog74 2010-04-13 00:57:45

Answer 5

+1 A:

If you want to remove both "term1 term2" and "term2 term1":

join -v 1 -1 1 <(sort input_file) -v 2 -2 2 <(sort -k 2 input_file) | uniq

Dennis Williamson 2010-04-12 23:19:11

Answer 6

+1 A:

awk '($2FS$1 in _){
 delete _[$1FS$2];delete _[$2FS$1]
 next
} { _[$1FS$2] }
END{ for(i in _)  print i } ' file

output

$ cat file
term1 term2
term3 term4
term2 term1
term5 term3
term3 term5
term6 term7

$ ./shell.sh
term6 term7
term3 term4
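
The for (i in _) loop above visits entries in an unspecified order, which is why the output is not in input order. If input order matters, a two-pass variant might look like the sketch below. It is not equivalent in every corner case: it needs a regular file because the file is read twice, it does not collapse exact duplicate lines, and it drops a line such as "x x" because that line is its own mirror. The name drop_mirrored_pairs is made up for illustration.

```shell
# Hypothetical helper: needs a regular file (it is read twice), two columns.
drop_mirrored_pairs() {
    awk '
        # First pass: remember every pair exactly as written.
        NR == FNR { seen[$1 FS $2] = 1; next }
        # Second pass: print a line only if its mirror never occurred.
        # A line such as "x x" is its own mirror and is dropped too.
        !(($2 FS $1) in seen)
    ' "$1" "$1"
}
```

With the sample file above, `drop_mirrored_pairs file` would print "term3 term4" followed by "term6 term7".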

ghostdog74 2010-04-12 23:52:35

Answer 7

+1 A:

The way I would do it (if you don't need to keep the double columns) is:

sed 's/ /\n/g' test.txt | sort -u

Here's what the output looks like (ignore my funky prompt):

[~]
==> cat test.txt
term1 term2
term3 term4
term2 term1
term5 term3
[~]
==> sed 's/ /\n/g' test.txt | sort -u
term1
term2
term3
term4
term5
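
The \n in the sed replacement is understood by GNU sed but may not work with other sed implementations. A sketch of the same idea with tr, which should behave the same across systems, is shown below. It assumes lines have no leading whitespace, and the function name unique_terms is hypothetical.

```shell
# Hypothetical helper: lists each distinct term once, one per line.
unique_terms() {
    # Map spaces and tabs to newlines; -s squeezes runs into a single one.
    tr -s ' \t' '\n\n' | LC_ALL=C sort -u
}
```

It reads standard input, so the equivalent of the command above would be `unique_terms < test.txt`.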

DevNull 2010-04-17 05:23:25
