How do I remove duplicate characters from each line, keeping only the first occurrence of each? For example, my input is:

EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU

Expected output is:

EFUAH
UEH
UJHACDEF

I came across perl -pe 's/$1//g while /(.).*\1/' which is wonderful, but it removes even the single remaining occurrence of each duplicated character.

+5  A: 

This can be done using a positive lookahead:

perl -pe 's/(.)(?=.*?\1)//g' FILE_NAME

The regex used is: (.)(?=.*?\1)

  • . : match any single character.
  • the first () : capture (remember) the matched character.
  • (?=...) : positive lookahead.
  • .*? : match anything in between, non-greedily.
  • \1 : backreference to the remembered character.
  • (.)(?=.*?\1) : match and remember any character, but only if it appears again later in the string.
  • s/// : Perl's substitution operator.
  • g : substitute globally, i.e. don't stop after the first substitution.
  • s/(.)(?=.*?\1)//g : delete a character from the input only if that character appears again later in the string.

This will not maintain the order of the characters in the input, because for every unique character we retain its last occurrence, not its first.

To keep the relative order intact, we can do what KennyTM suggests in one of the comments:

  • reverse the input line
  • do the substitution as before
  • reverse the result before printing

The Perl one-liner for this is:

perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' FILE_NAME

Since we print manually after the reversal, we use the -n flag instead of -p.

I'm not sure if this is the best one-liner to do this. I welcome others to edit this answer if they have a better alternative.

codaddict
The order is changed (e.g. "EFAHU") - wonder if it matters.
Gavin Brock
@Gavin: That can be fixed by reversing the string initially, and reversing it again after the replacement.
KennyTM
Well this is amazing!!!! But can you explain in a bit more detail what s/(.) and (?=.*?\1)// are doing? Also, is it possible to keep the same order as in my original query? For example, currently I am getting EFAHU instead of EFUAH, which would be more helpful. Thanx a ton :)
manu
@KennyTM: Thanks :) @Manu: I've updated my answer with a short explanation of what's going on.
codaddict
@Downvoter: Care to explain?
codaddict
This is working exactly. Thanx again for the kind reply and clear explanation of all the stuff. Thank u all :)
manu
A: 

for a file containing the data you list named foo.txt

python -c "print set(open('foo.txt').read())"
jkyle
sets in Python do not have order... and he wants Perl.
ghostdog74
His original post did not specify perl as a requirement (though he tagged it perl), only pointed out he found a perl one-liner as a possible way to do it. He also did not say order mattered, only uniqueness. Also, the use of a one-liner indicates that the method doesn't really matter.
jkyle
+1  A: 

Tie::IxHash is a good module for preserving hash insertion order (but it may be slow; you will need to benchmark if speed is important). Example with tests:

use Test::More 0.88;

use Tie::IxHash;
sub dedupe {
  my $str  = shift;
  my $hash = Tie::IxHash->new(map { $_ => 1 } split //, $str);
  return join('', $hash->Keys);
}

{
  my $str = 'EFUAHUU';
  is(dedupe($str), 'EFUAH');
}

{
  my $str = 'EFUAHHUU';
  is(dedupe($str), 'EFUAH');
}

{
  my $str = 'UJUJHHACDEFUCU';
  is(dedupe($str), 'UJHACDEF');
}

done_testing();
Alexandr Ciornii
+2  A: 
perl -ne'my%s;print grep!$s{$_}++,split//'
Hynek -Pichi- Vychodil
This is also working, and is shorter than the earlier one. I am overwhelmed by the response :) I would like to know how it works, if possible.
manu
It works the same way as Giant Hare's solution, but is more idiomatic Perl and faster.
Hynek -Pichi- Vychodil
A nice one, I agree. Almost a one-liner except for the `my %s`. Though I don't see where the speedup is coming from. Could it be the fresh hash instead of resetting one? Or is grep more efficient than the explicit loop?
Giant Hare
@gianthare: there is a difference between calling `print` for each character and calling `print` once with an array argument. Your code will be slower for lines with a larger number of unique chars. `%seen=();` should be almost as fast as my `my %s;`.
Hynek -Pichi- Vychodil
@Hynek -Pichi- Vychodil: Thanks, you are probably right
Giant Hare
A: 

From the shell, this works:

sed -e 's/$/<EOL>/ ; s/./&\n/g' test.txt | uniq | sed -e :a -e '$!N; s/\n//; ta ; s/<EOL>/\n/g'

In words: mark every linebreak with a <EOL> string, then put every character on a line of its own, then use uniq to remove duplicate lines, then strip out all the linebreaks, then put back linebreaks instead of the <EOL> markers.

I found the -e :a -e '$!N; s/\n//; ta part in a forum post and I don't understand the separate -e :a part, or the $!N part, so if anyone can explain those, I'd be grateful.
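For reference, here is one reading of that idiom, annotated (a sketch of the standard sed join-all-lines pattern; behaviour shown on a tiny input, independent of the rest of the pipeline):

```shell
# :a      defines a branch label named "a"
# $!N     on every line except the last ("$!" = not-last-line address),
#         N appends the next input line to the pattern space,
#         separated by an embedded newline
# s/\n//  deletes that embedded newline, joining the two lines
# ta      branches back to label "a" if the last s/// succeeded,
#         so the whole input ends up folded into one line
printf 'ab\ncd\n' | sed -e :a -e '$!N; s/\n//; ta'
# prints "abcd"
```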

Hmm, that one only removes consecutive duplicates; to eliminate all duplicates you could do this:

cat test.txt | while read line ; do echo $line | sed -e 's/./&\n/g' | sort | uniq | sed -e :a -e '$!N; s/\n//; ta' ; done

That puts the characters in each line in alphabetical order though.

Jean Jordaan
+1  A: 

This looks like a classic application of positive lookbehind, but unfortunately perl doesn't support that. In fact, doing this (matching the preceding text of a character in a string with a full regex whose length is indeterminable) can only be done with .NET regex classes, I think.

However, positive lookahead supports full regexes, so all you need to do is reverse the string, apply positive lookahead (like unicornaddict said):

perl -pe 's/(.)(?=.*?\1)//g' 

And reverse it back, because without the reverse that'll only keep the duplicate character at the last place in a line.

MASSIVE EDIT

I've been spending the last half an hour on this, and it looks like it works, without the reversing.

perl -pe 's/\G$1//g while (/(.).*(?=\1)/g)' FILE_NAME

I don't know whether to be proud or horrified. I'm basically doing the positive lookahead, then substituting on the string with \G specified - which makes the regex engine start its matching from the last place matched (internally represented by the pos() variable).

With test input like this:

aabbbcbbccbabb

EFAUUUUH

ABCBBBBD

DEEEFEGGH

AABBCC

The output is like this:

abc

EFAUH

ABCD

DEFGH

ABC

I think it's working...

Explanation - Okay, in case my explanation last time wasn't clear enough: the lookahead will go and stop at the last match of a duplicated character [in the code you can do a print pos(); inside the loop to check], and the s/\G//g will remove it [you don't really need the /g]. So within the loop, the substitution keeps removing characters until all such duplicates are zapped. Of course, this might be a little too processor-intensive for your tastes... but so are most of the regex-based solutions you'll see. The reversing/lookahead method will probably be more efficient than this, though.

Deep-B
More precisely, it's *variable-length* lookbehinds Perl doesn't support. Besides .NET, they're supported by JGSoft (EditPad Pro, PowerGrep) and in more limited form by Java.
Alan Moore
Edited and added a new solution. Not sure if it's foolproof or not... too much caffeine. :-P
Deep-B
+1  A: 

Use uniq from List::MoreUtils:

perl -MList::MoreUtils=uniq -ne 'print uniq split ""'
mscha
+1  A: 

If the set of characters that can be encountered is restricted, e.g. to letters only, then the easiest solution is tr:
perl -p -e 'tr/a-zA-Z/a-zA-Z/s'
It replaces every letter by itself, leaving other characters unaffected, and the /s modifier squeezes repeated occurrences of the same character (after replacement), thus removing duplicates.

My bad - it removes only adjoining appearances. Disregard.

Giant Hare
+4  A: 

Here is a solution that I think should work faster than the lookahead one; it is not regexp-based and uses a hash table.

perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}' 

It splits every line into characters and prints only the first appearance of each, by counting appearances in the %seen hash table.

Giant Hare
A: 
use strict;
use warnings;

my @result;

sub uniq {
    my $seq  = shift;
    my $uniq = '';
    for (split '', $seq) {
        # \Q...\E quotes the character in case it is a regex metachar
        $uniq .= $_ unless $uniq =~ /\Q$_\E/;
    }
    push @result, $uniq;
}

while (<DATA>) {
    uniq($_);
}
print @result;

__DATA__
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU

The output:

EFUAH
UEH
UJHACDEF
Mike
+2  A: 

if Perl is not a must, you can also use awk. Here's a fun benchmark of the Perl one-liners posted here against awk: awk is 10+ seconds faster for a file with 3 million+ lines.

$ wc -l <file2
3210220

$ time awk 'BEGIN{FS=""}{delete _;for(i=1;i<=NF;i++){if(!_[$i]++) printf $i};print""}' file2 >/dev/null

real    1m1.761s
user    0m58.565s
sys     0m1.568s

$ time perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}'  file2 > /dev/null

real    1m32.123s
user    1m23.623s
sys     0m3.450s

$ time perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' file2 >/dev/null

real    1m17.818s
user    1m10.611s
sys     0m2.557s

$ time perl -ne'my%s;print grep!$s{$_}++,split//' file2 >/dev/null

real    1m20.347s
user    1m13.069s
sys     0m2.896s
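As a sanity check (separate from the benchmark), the awk one-liner gives the same output as the Perl versions on the question's sample line:

```shell
# Same first-occurrence logic in awk: FS="" makes every character its
# own field (a gawk/mawk extension), and the _ array tracks seen chars;
# "delete _" (whole-array delete) resets it for each line.
printf 'UJUJHHACDEFUCU\n' |
  awk 'BEGIN{FS=""}{delete _;for(i=1;i<=NF;i++){if(!_[$i]++) printf $i};print""}'
# prints "UJHACDEF"
```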
ghostdog74
+1, nice work :)
codaddict
I am amazed at how fast the regexp solution is
Giant Hare