ansaurus

Question

How can you efficiently remove duplicate characters from a string?

Answer 1

+1 A:

Using linq

string someString = "Something I wrote quickly";
char[] distinctChars = someString.ToCharArray().Distinct();
string newString = new string(distinctChars);

Daniel A. White 2010-01-11 00:14:39

string anotherString = "Mississippi"; :D

mingos 2010-01-11 00:18:24

Answer 2

+1 A:

You can to use a regular expression to match that duplicate characters at once.

Rubens Farias 2010-01-11 00:14:57

Answer 3

+1 A:

I don't know if there is a simpler algorithm. An alternative way is to examine the first character, then go through the rest of the string and remove all characters that are equal. Then do this for the 2nd character, 3rd character, and so on. That might save on memory, but would be O(n^2).

The algorithm you suggested would be O(n*m), m < n, since it loops through the array for each character in the string. It would most likely be faster than the alternative above due to less characters in the array than in the string. The array would add a little additional memory requirements, but not much.

In a majority of real applications, however, I doubt that the efficiency of the method you suggested would have any noticeable impact on performance. There are probably other methods (such as regexes, or LINQ distincts) that may have even more performance overhead, but would probably be worth it due to the code simplification.

Kaleb Brasee 2010-01-11 00:17:44

Using an array is O(n^2), too.

sepp2k 2010-01-11 00:19:50

Nitpickers' corner: The original poster's lookup-in-an-array technique is really O(n*m), where n is the length of the string, and m is the bound on the number of unique characters you might have.

quixoto 2010-01-11 00:28:03

Ah yes you are right, and m would be less than n.

Kaleb Brasee 2010-01-11 00:29:47

Answer 4

A:

It's going to depend on what the characteristics of your data are. Is the string superlong? Are many duplicates expected? What is the range of possible characters in the string (is it English? Chinese?) How much memory do you have available? Does the resulting string need to still be ordered?

Keeping a set of characters you've already seen as you traverse is reasonable. So might be sorting the string and then removing dupes as you walk the string, if you can mutate the string like that.

If the string is really long, you'll want to keep running time close to O(n), which means keeping a bit set (generally) or in rarer cases a hash (if the list of possible chars is large: Chinese?) or the like and keeping track of characters you've seen so you can evict them as you walk the string. Lots of implementation details here, too, around whether you have to shift back all the rest of the string in memory every time you remove a character, or whether you can replace it with a blank or something else in-situ.

So again, depends on what you're trying to accomplish and what environment you're in.

quixoto 2010-01-11 00:18:40

Answer 5

+6 A:

You can use a boolean array indexed by character:

bool seen[256];

For byte-sized ASCII-like characters, the above would be appropriate. For 16-bit Unicode:

bool seen[65536];

and so on. Then, for each character in the string it's a simple lookup to see whether that boolean has already been set.

Greg Hewgill 2010-01-11 00:18:42

Ah, yes, good idea... then you would be able to pull it off in O(n). Though restructuring the string would probably take more time than anything else.

Kaleb Brasee 2010-01-11 00:28:34

You can do that in O(n) too, copy each character you want to keep into a new string.

Greg Hewgill 2010-01-11 00:31:01

smaaaart. This is a good idea.

Chris 2010-01-11 00:40:22

Answer 6

A:

Python:

>>> ''.join(set("Something I wrote quickly"))
' cegihkmlonqISrutwy'

Obviously this doesn't preserve order.

Seth 2010-01-11 00:27:25

ansaurus

tags:

views:

answers:

How can you efficiently remove duplicate characters from a string?

related questions