What is an efficient algorithm to remove all duplicates in a string? For example if I have aaaabbbccdbdbcd, I will get back abcd.
Use a hashtable to store the characters discovered so far (O(1) access) and then loop through the string. If a character is already in the hashtable, discard it. If it isn't, add it to the hashtable and to a result string.
Overall: O(n) time (and space).
The naive solution is to search for the character in the result string as you process each one. That's O(n²).
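A minimal sketch of the hashtable approach, here in C++ with std::unordered_set standing in for the hashtable (variable names are just illustrative):
#include <iostream>
#include <string>
#include <unordered_set>
int main()
{
    std::string input = "aaaabbbccdbdbcd";
    std::unordered_set<char> seen;      // characters discovered so far, O(1) average lookup
    std::string result;
    for (char c : input) {
        if (seen.insert(c).second) {    // insert() reports whether c was new
            result += c;                // first occurrence: append to the result
        }
    }
    std::cout << result << std::endl;   // prints "abcd"
}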
Keep an array of 256 "seen" booleans, one for each possible character. Stream your string. If you haven't seen the character before, output it and set the "seen" flag for that character.
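A sketch of that idea in C++, assuming single-byte characters and indexing with unsigned char so the flag array covers 0..255:
#include <iostream>
#include <string>
int main()
{
    std::string input = "aaaabbbccdbdbcd";
    bool seen[256] = { false };          // one "seen" flag per possible byte value
    std::string output;
    for (unsigned char c : input) {      // unsigned so the index is always 0..255
        if (!seen[c]) {
            seen[c] = true;              // mark the character as seen
            output += static_cast<char>(c);
        }
    }
    std::cout << output << std::endl;    // prints "abcd"
}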
string newString = new string("aaaaabbbbccccdddddd".ToCharArray().Distinct().ToArray());
or
char[] characters = "aaaabbbccddd".ToCharArray();
string result = string.Empty;
foreach (char c in characters)
{
    if (result.IndexOf(c) < 0)
        result += c.ToString();
}
In Python
>>> ''.join(set("aaaabbbccdbdbcd"))
'acbd'
If the order needs to be preserved
>>> q="aaaabbbccdbdbcd" # this one is not
>>> ''.join(sorted(set(q),key=q.index)) # so efficient
'abcd'
or
>>> S=set()
>>> res=""
>>> for c in "aaaabbbccdbdbcd":
... if c not in S:
... res+=c
... S.add(c)
...
>>> res
'abcd'
or
>>> S=set()
>>> L=[]
>>> for c in "aaaabbbccdbdbcd":
... if c not in S:
... L.append(c)
... S.add(c)
...
>>> ''.join(L)
'abcd'
In Python 3.1
>>> from collections import OrderedDict
>>> ''.join(list(OrderedDict((c,0) for c in "aaaabbbccdbdbcd").keys()))
'abcd'
This is closely related to the question: Detecting repetition with infinite input.
The hashtable approach may not be optimal depending on your input. Hashtables have a certain amount of overhead (buckets, entry objects), which is huge compared to the actual stored char. (If your target environment is Java it is even worse, as the HashMap is of type Map<Character,?>.) The worst-case runtime for a hashtable access is O(n) due to collisions.
You need only 8 KB to represent all 2-byte Unicode characters in a plain BitSet. This can be shrunk further if your input character set is more restricted, or by using a compressed BitSet (as long as the set bits are sparse). The runtime performance of a BitSet is also favorable: access is O(1).
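The paragraph above refers to Java's BitSet; a rough C++11 equivalent of the same idea, using std::bitset<65536> (8 KB, one bit per 16-bit code unit), might look like this:
#include <bitset>
#include <iostream>
#include <string>
int main()
{
    std::u16string input = u"aaaabbbccdbdbcd";
    std::bitset<65536> seen;             // 65536 bits = 8 KB, one per 16-bit code unit
    std::u16string result;
    for (char16_t c : input) {
        if (!seen[c]) {                  // O(1) test
            seen[c] = true;              // O(1) set
            result += c;                 // keep the first occurrence only
        }
    }
    // result now holds u"abcd"; printing a u16string portably is omitted here
    std::cout << "unique characters kept: " << result.size() << std::endl;
}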
In C++, you'd probably use an std::set:
std::string input("aaaabbbccddd");
std::set<char> unique_chars(input.begin(), input.end());
In theory you could use std::unordered_set instead of std::set, which should give O(N) expected overall complexity (though O(N²) worst case), whereas this one is O(N lg M) (where N = total number of characters, M = number of unique characters). Unless you have long strings with a lot of unique characters, the std::set version will probably be faster, though.
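If you do want the hash-based variant, only the container type changes (this needs the <unordered_set> header); a minimal sketch reusing the input string from above:
std::unordered_set<char> unique_chars(input.begin(), input.end()); // unordered, O(1) average insert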
You can sort the string and then remove the now-adjacent duplicate characters. This is O(n log n) and the output comes out sorted.
#include <iostream>
#include <algorithm>
#include <string>
int main()
{
    std::string s = "aaaabbbccdbdbcd";
    std::sort(s.begin(), s.end());
    s.erase(std::unique(s.begin(), s.end()), s.end());
    std::cout << s << std::endl;
}
C++ - O(n) time, O(1) space, and the output is sorted.
// needs <limits>, <string>, <vector>
std::string characters = "aaaabbbccddd";
// one "seen" flag per possible char value (max - min + 1 entries)
std::vector<bool> seen(std::numeric_limits<char>::max() - std::numeric_limits<char>::min() + 1);
for (std::string::iterator it = characters.begin(), endIt = characters.end(); it != endIt; ++it) {
    seen[(*it) - std::numeric_limits<char>::min()] = true;
}
characters = "";
// loop with an int so incrementing past the maximum char value cannot overflow
for (int ch = std::numeric_limits<char>::min(); ch <= std::numeric_limits<char>::max(); ++ch) {
    if (seen[ch - std::numeric_limits<char>::min()]) {
        characters += static_cast<char>(ch);
    }
}