ansaurus

Question

Fastest way to implement Duplicate Character Removal in String (C#)

Answer 1

+21 A:

Fastest as in fewest-lines-of-code:

var s = "nbHHkRvrXbvkn";
var duplicates = s.Where(ch => s.Count(c => c == ch) > 1);
var result = new string(s.Except(duplicates).ToArray()); // = "RrX"

Fastest as in fastest-performance would probably be something like this (does not preserve order):

var h1 = new HashSet<char>();
var h2 = new HashSet<char>();

foreach (var ch in "nbHHkRvrXbvkn")
{
    if (!h1.Add(ch))
    {
        h2.Add(ch);
    }
}

h1.ExceptWith(h2); // remove duplicates

var chars = new char[h1.Count];
h1.CopyTo(chars);
var result = new string(chars); // = "RrX"

Performance test

When in doubt -- test it :)

Yuriy Faktorovich's answer        00:00:00.2360900
Luke's answer                     00:00:00.2225683
My 'few lines' answer             00:00:00.5318395
My 'fast' answer                  00:00:00.1842144

dtb 2009-08-27 21:55:20

Very nice. Great performance comparison too. Performance variation is probably even more visible with very large strings.

Alex 2009-08-27 22:39:59

I've repeated the performance test in Release build with detached debugger (but same input string). I'm surpised by the performance of Yuriy's answer; it's pretty fast!

dtb 2009-08-27 22:44:46

@dtb: The thing that slows down my answer compared to yours is that I'm preserving the original order in the output string, which necessitates an additional loop through the input string. The technique that you and I use for actually finding the dupes is *exactly* the same.

LukeH 2009-08-27 22:55:45

see my solution below ... you had the right idea but using an array for known data is 4 times faster than HashSet in my tests

gabe 2009-08-28 16:04:33

c# has var? ... i've never seen the use of var in c# before... BTW great piece of code good job! +1

baeltazor 2009-08-30 06:11:39

stands for variant, usually used with LINQ for anonymous types, discouraged otherwise.

Yuriy Faktorovich 2009-09-02 01:43:05

Answer 2

A:

This algorithm is general, can be applied to any language

create a map(HashTable) char->int that holds the count of each character found, initially empty
scan the string once to populate the map.
create a new empty string that'll hold the output, may be you'll need to use a StringBuilder.
scan the string(orthe map, whichever is shorter) copying only characters which an occurrence of 1 to the output string/StringBuilder

Diaa Sami 2009-08-27 21:59:03

Answer 3

+9 A:

Here is a pretty fast one preserving order. But I'd be a little worried about how LINQ does Group and Where:

string s = "nbHHkRvrXbvkn";
Console.WriteLine( 
    s.ToCharArray()
        .GroupBy(c => c)
        .Where(g => g.Count() == 1)
        .Aggregate(new StringBuilder(), (b, g) => b.Append(g.Key)));

Edit: This one beats Luke's in some cases still slower than dtb's, but it preserves the order

private static string MyMethod(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (var g in s.ToCharArray().GroupBy(c => c))
        if (g.Count() == 1) sb.Append(g.Key);

    return sb.ToString();
}

Yuriy Faktorovich 2009-08-27 22:16:41

+1. Very clean solution. It's amazingly fast, too!

dtb 2009-08-27 22:46:39

Agree. +1 !

Alex 2009-08-27 22:54:56

Answer 4

+3 A:

This one should be pretty quick (and it preserves the original order):

public static string RemoveDuplicates(string source)
{
    HashSet<char> found = new HashSet<char>();
    HashSet<char> dupes = new HashSet<char>();

    foreach (char c in source)
    {
        if (!found.Add(c))
        {
            dupes.Add(c);
        }
    }

    StringBuilder sb = new StringBuilder(source.Length);
    foreach (char c in source)
    {
        if (!dupes.Contains(c))
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

LukeH 2009-08-27 22:28:05

Why do you think creating a StringBuilder that is probably too large will take less time than allowing it to get space on the fly?

Yuriy Faktorovich 2009-08-28 03:19:09

@Yuri: I benchmarked! I tested with millions of random strings and pre-sizing the `StringBuilder` was faster in most cases. Of course, in the real-world the strings probably wouldn't be purely random. In that situation the performance difference would depend on the ratio of dupes to non-dupes in the source string.

LukeH 2009-08-28 08:59:46

@Yuriy: I just benchmarked on a different machine (Vista64 vs XP32) and the results were much closer. On the 64-bit machine it appears to make no real difference whether the `StringBuilder` is pre-allocated or not. (In which case it probably makes sense to not bother pre-allocating and save ourselves some RAM.)

LukeH 2009-08-29 09:13:53

Answer 5

+2 A:

This preserves order and, based on my tests, is 4 times faster than using a HashSet. This assumes your character range is 0-255 but you can extend that easily. If you're planning on using this in a loop, move the int[] c = new int[255]; out and in the function do an Array.Clear(c,0,255).


        private static string RemoveDuplicates(string s)
        {
            int[] c = new int[255];
            for (int i = 0; i < s.Length; i++)
            {
                c[s[i]]++;
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.Length; i++)
            {
                if (c[s[i]] == 1) sb.Append(s[i]);
            }
            return sb.ToString();
        }

gabe 2009-08-28 16:00:04

also, i don't know if the compiler will unroll those loops for you, but you can try that too http://en.wikipedia.org/wiki/Loop_unwinding

gabe 2009-08-28 16:08:19

`char.MaxValue` is 65535

dtb 2009-08-28 16:31:13

What is your test timing/stopwatch result with the sample string?

Alex 2009-08-28 20:04:39

With the sample string, this method would be 1/4 or less compared to others.

Yuriy Faktorovich 2009-08-29 00:52:09

@Yuriy: Is that with the array size correctly set to 65536? In my tests there's not really much difference between this method and using a `HashSet` once the array is sized correctly.

LukeH 2009-08-29 09:06:43

No, once the Array is set correctly on average working on a 100 character string it was 60% worse.

Yuriy Faktorovich 2009-08-29 14:10:13

Answer 6

A:

Following is strictly O(n) and fastest.

    public static string RemoveDuplicates(this string input)
    {
        var lastIndex = new int[255];
        var result = new char[input.Length + 1];
        int i = 1;
        int pos;
        char ch2;
        int start = 1;
        foreach (char ch in input)
        {
            pos = lastIndex[ch];
            if (pos == 0)
            {
                lastIndex[ch] = i;
                result[i++] = ch;

            }
            else if (pos > 0)
            {
                ch2 = result[start];
                result[pos] = ch2;
                lastIndex[ch2] = pos;
                start++;
                lastIndex[ch] = -1;
            }
        }
        return new String(result, start, i-start);
    }

If you want to take full character set in consideration you can replace lastIndex with Dictionary.

    public static string RemoveDuplicates(this string input)
    {
        var lastIndex = new Dictionary<char, int>(input.Length);
        var result = new char[input.Length + 1];
        int i = 1;
        int pos;
        char ch2;
        int start = 1;
        foreach (char ch in input)
        {
            lastIndex.TryGetValue(ch, out pos);
            if (pos == 0)
            {
                lastIndex[ch] = i;
                result[i++] = ch;

            }
            else if (pos > 0)
            {
                ch2 = result[start];
                result[pos] = ch2;
                lastIndex[ch2] = pos;
                start++;
                lastIndex[ch] = -1;
            }
        }
        return new String(result, start, i-start);
    }

If you are ok with unsafe code and do not want to use Dictionary following is the solution.

    [DllImport("Kernel32.dll", EntryPoint = "RtlZeroMemory", SetLastError = false)]
    static extern void ZeroMemory(IntPtr dest, IntPtr size);

    static int[] lastIndex = new int[Char.MaxValue];
    public static string RemoveDuplicates(this string input)
    {
        unsafe
        {
            fixed (int* pointer = lastIndex)
                ZeroMemory((IntPtr)pointer, (IntPtr)lastIndex.Length);
        }
        var result = new char[input.Length + 1];
        int i = 1;
        int pos;
        char ch2;
        int start = 1;
        foreach (char ch in input)
        {
            pos = lastIndex[ch];
            if (pos == 0)
            {
                lastIndex[ch] = i;
                result[i++] = ch;                    
            }
            else if (pos > 0)
            {
                ch2 = result[start];
                result[pos] = ch2;
                lastIndex[ch2] = pos;
                start++;
                lastIndex[ch] = -1;
            }
        }
        return new String(result, start, i-start);
    }

Hasan Khan 2009-08-29 09:01:23

@Hasan: This doesn't preserve the original order (but neither does the accepted answer so I guess that's ok). More importantly, a `char` has 65536 possible values, not 256, and when your `lastIndex` array is correctly sized to 65536 the performance isn't quite as impressive.

LukeH 2009-08-30 00:01:37

@Luke The point of posting was displaying the algorithm. Array could easily be replaced with a dictionary. However I've pasted the dictionary implementation for reference.

Hasan Khan 2009-08-30 05:51:49

@Hasan: It's certainly a fast, and interesting, algorithm, which is what the question is asking for. (Having said that, I think in most real-world situations the readability/maintainability of the code is more important than out-and-out performance, so long as performance is reasonably good.)

LukeH 2009-08-30 12:43:36

ansaurus

tags:

views:

answers:

Fastest way to implement Duplicate Character Removal in String (C#)

related questions