views:

453

answers:

6

Is it possible to remove duplicate characters from a string without saving each character you've seen in an array and checking to see if new characters are already in that array? That seems highly inefficient. Surely there must be a quicker method?

+1  A: 

Using linq

string someString = "Something I wrote quickly";
char[] distinctChars = someString.ToCharArray().Distinct();
string newString = new string(distinctChars);
Daniel A. White
string anotherString = "Mississippi"; :D
mingos
+1  A: 

You can to use a regular expression to match that duplicate characters at once.

Rubens Farias
+1  A: 

I don't know if there is a simpler algorithm. An alternative way is to examine the first character, then go through the rest of the string and remove all characters that are equal. Then do this for the 2nd character, 3rd character, and so on. That might save on memory, but would be O(n^2).

The algorithm you suggested would be O(n*m), m < n, since it loops through the array for each character in the string. It would most likely be faster than the alternative above due to less characters in the array than in the string. The array would add a little additional memory requirements, but not much.

In a majority of real applications, however, I doubt that the efficiency of the method you suggested would have any noticeable impact on performance. There are probably other methods (such as regexes, or LINQ distincts) that may have even more performance overhead, but would probably be worth it due to the code simplification.

Kaleb Brasee
Using an array is O(n^2), too.
sepp2k
Nitpickers' corner: The original poster's lookup-in-an-array technique is really O(n*m), where n is the length of the string, and m is the bound on the number of unique characters you might have.
quixoto
Ah yes you are right, and m would be less than n.
Kaleb Brasee
A: 

It's going to depend on what the characteristics of your data are. Is the string superlong? Are many duplicates expected? What is the range of possible characters in the string (is it English? Chinese?) How much memory do you have available? Does the resulting string need to still be ordered?

Keeping a set of characters you've already seen as you traverse is reasonable. So might be sorting the string and then removing dupes as you walk the string, if you can mutate the string like that.

If the string is really long, you'll want to keep running time close to O(n), which means keeping a bit set (generally) or in rarer cases a hash (if the list of possible chars is large: Chinese?) or the like and keeping track of characters you've seen so you can evict them as you walk the string. Lots of implementation details here, too, around whether you have to shift back all the rest of the string in memory every time you remove a character, or whether you can replace it with a blank or something else in-situ.

So again, depends on what you're trying to accomplish and what environment you're in.

quixoto
+6  A: 

You can use a boolean array indexed by character:

bool seen[256];

For byte-sized ASCII-like characters, the above would be appropriate. For 16-bit Unicode:

bool seen[65536];

and so on. Then, for each character in the string it's a simple lookup to see whether that boolean has already been set.

Greg Hewgill
Ah, yes, good idea... then you would be able to pull it off in O(n). Though restructuring the string would probably take more time than anything else.
Kaleb Brasee
You can do that in O(n) too, copy each character you want to keep into a new string.
Greg Hewgill
smaaaart. This is a good idea.
Chris
A: 

Python:

>>> ''.join(set("Something I wrote quickly"))
' cegihkmlonqISrutwy'

Obviously this doesn't preserve order.

Seth