views:

1029

answers:

6

In C#, what is the fastest way to detect duplicate characters in a String and remove them (removal including 1st instance of the duplicated character)?

Example Input: nbHHkRvrXbvkn

Example Output: RrX

+21  A: 

Fastest as in fewest-lines-of-code:

var s = "nbHHkRvrXbvkn";
var duplicates = s.Where(ch => s.Count(c => c == ch) > 1);
var result = new string(s.Except(duplicates).ToArray()); // = "RrX"

Fastest as in fastest-performance would probably be something like this (does not preserve order):

var h1 = new HashSet<char>();
var h2 = new HashSet<char>();

foreach (var ch in "nbHHkRvrXbvkn")
{
    if (!h1.Add(ch))
    {
        h2.Add(ch);
    }
}

h1.ExceptWith(h2); // remove duplicates

var chars = new char[h1.Count];
h1.CopyTo(chars);
var result = new string(chars); // = "RrX"


Performance test

When in doubt -- test it :)

Yuriy Faktorovich's answer        00:00:00.2360900
Luke's answer                     00:00:00.2225683
My 'few lines' answer             00:00:00.5318395
My 'fast' answer                  00:00:00.1842144
dtb
Very nice. Great performance comparison too. Performance variation is probably even more visible with very large strings.
Alex
I've repeated the performance test in Release build with detached debugger (but same input string). I'm surpised by the performance of Yuriy's answer; it's pretty fast!
dtb
@dtb: The thing that slows down my answer compared to yours is that I'm preserving the original order in the output string, which necessitates an additional loop through the input string. The technique that you and I use for actually finding the dupes is *exactly* the same.
LukeH
see my solution below ... you had the right idea but using an array for known data is 4 times faster than HashSet in my tests
gabe
c# has var? ... i've never seen the use of var in c# before... BTW great piece of code good job! +1
baeltazor
stands for variant, usually used with LINQ for anonymous types, discouraged otherwise.
Yuriy Faktorovich
A: 

This algorithm is general, can be applied to any language

  1. create a map(HashTable) char->int that holds the count of each character found, initially empty
  2. scan the string once to populate the map.
  3. create a new empty string that'll hold the output, may be you'll need to use a StringBuilder.
  4. scan the string(orthe map, whichever is shorter) copying only characters which an occurrence of 1 to the output string/StringBuilder
Diaa Sami
+9  A: 

Here is a pretty fast one preserving order. But I'd be a little worried about how LINQ does Group and Where:

string s = "nbHHkRvrXbvkn";
Console.WriteLine( 
    s.ToCharArray()
        .GroupBy(c => c)
        .Where(g => g.Count() == 1)
        .Aggregate(new StringBuilder(), (b, g) => b.Append(g.Key)));

Edit: This one beats Luke's in some cases still slower than dtb's, but it preserves the order

private static string MyMethod(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (var g in s.ToCharArray().GroupBy(c => c))
        if (g.Count() == 1) sb.Append(g.Key);

    return sb.ToString();
}
Yuriy Faktorovich
+1. Very clean solution. It's amazingly fast, too!
dtb
Agree. +1 !
Alex
+3  A: 

This one should be pretty quick (and it preserves the original order):

public static string RemoveDuplicates(string source)
{
    HashSet<char> found = new HashSet<char>();
    HashSet<char> dupes = new HashSet<char>();

    foreach (char c in source)
    {
        if (!found.Add(c))
        {
            dupes.Add(c);
        }
    }

    StringBuilder sb = new StringBuilder(source.Length);
    foreach (char c in source)
    {
        if (!dupes.Contains(c))
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
LukeH
Why do you think creating a StringBuilder that is probably too large will take less time than allowing it to get space on the fly?
Yuriy Faktorovich
@Yuri: I benchmarked! I tested with millions of random strings and pre-sizing the `StringBuilder` was faster in most cases. Of course, in the real-world the strings probably wouldn't be purely random. In that situation the performance difference would depend on the ratio of dupes to non-dupes in the source string.
LukeH
@Yuriy: I just benchmarked on a different machine (Vista64 vs XP32) and the results were much closer. On the 64-bit machine it appears to make no real difference whether the `StringBuilder` is pre-allocated or not. (In which case it probably makes sense to not bother pre-allocating and save ourselves some RAM.)
LukeH
+2  A: 

This preserves order and, based on my tests, is 4 times faster than using a HashSet. This assumes your character range is 0-255 but you can extend that easily. If you're planning on using this in a loop, move the int[] c = new int[255]; out and in the function do an Array.Clear(c,0,255).


        private static string RemoveDuplicates(string s)
        {
            int[] c = new int[255];
            for (int i = 0; i < s.Length; i++)
            {
                c[s[i]]++;
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.Length; i++)
            {
                if (c[s[i]] == 1) sb.Append(s[i]);
            }
            return sb.ToString();
        }
gabe
also, i don't know if the compiler will unroll those loops for you, but you can try that too http://en.wikipedia.org/wiki/Loop_unwinding
gabe
`char.MaxValue` is 65535
dtb
What is your test timing/stopwatch result with the sample string?
Alex
With the sample string, this method would be 1/4 or less compared to others.
Yuriy Faktorovich
@Yuriy: Is that with the array size correctly set to 65536? In my tests there's not really much difference between this method and using a `HashSet` once the array is sized correctly.
LukeH
No, once the Array is set correctly on average working on a 100 character string it was 60% worse.
Yuriy Faktorovich
A: 

Following is strictly O(n) and fastest.

    public static string RemoveDuplicates(this string input)
    {
        var lastIndex = new int[255];
        var result = new char[input.Length + 1];
        int i = 1;
        int pos;
        char ch2;
        int start = 1;
        foreach (char ch in input)
        {
            pos = lastIndex[ch];
            if (pos == 0)
            {
                lastIndex[ch] = i;
                result[i++] = ch;

            }
            else if (pos > 0)
            {
                ch2 = result[start];
                result[pos] = ch2;
                lastIndex[ch2] = pos;
                start++;
                lastIndex[ch] = -1;
            }
        }
        return new String(result, start, i-start);
    }

If you want to take full character set in consideration you can replace lastIndex with Dictionary.

    public static string RemoveDuplicates(this string input)
    {
        var lastIndex = new Dictionary<char, int>(input.Length);
        var result = new char[input.Length + 1];
        int i = 1;
        int pos;
        char ch2;
        int start = 1;
        foreach (char ch in input)
        {
            lastIndex.TryGetValue(ch, out pos);
            if (pos == 0)
            {
                lastIndex[ch] = i;
                result[i++] = ch;

            }
            else if (pos > 0)
            {
                ch2 = result[start];
                result[pos] = ch2;
                lastIndex[ch2] = pos;
                start++;
                lastIndex[ch] = -1;
            }
        }
        return new String(result, start, i-start);
    }

If you are ok with unsafe code and do not want to use Dictionary following is the solution.

    [DllImport("Kernel32.dll", EntryPoint = "RtlZeroMemory", SetLastError = false)]
    static extern void ZeroMemory(IntPtr dest, IntPtr size);

    static int[] lastIndex = new int[Char.MaxValue];
    public static string RemoveDuplicates(this string input)
    {
        unsafe
        {
            fixed (int* pointer = lastIndex)
                ZeroMemory((IntPtr)pointer, (IntPtr)lastIndex.Length);
        }
        var result = new char[input.Length + 1];
        int i = 1;
        int pos;
        char ch2;
        int start = 1;
        foreach (char ch in input)
        {
            pos = lastIndex[ch];
            if (pos == 0)
            {
                lastIndex[ch] = i;
                result[i++] = ch;                    
            }
            else if (pos > 0)
            {
                ch2 = result[start];
                result[pos] = ch2;
                lastIndex[ch2] = pos;
                start++;
                lastIndex[ch] = -1;
            }
        }
        return new String(result, start, i-start);
    }
Hasan Khan
@Hasan: This doesn't preserve the original order (but neither does the accepted answer so I guess that's ok). More importantly, a `char` has 65536 possible values, not 256, and when your `lastIndex` array is correctly sized to 65536 the performance isn't quite as impressive.
LukeH
@Luke The point of posting was displaying the algorithm. Array could easily be replaced with a dictionary. However I've pasted the dictionary implementation for reference.
Hasan Khan
@Hasan: It's certainly a fast, and interesting, algorithm, which is what the question is asking for. (Having said that, I think in most real-world situations the readability/maintainability of the code is more important than out-and-out performance, so long as performance is reasonably good.)
LukeH