I'm using .NET 3.5. I have two string arrays, which may share one or more values:

string[] list1 = new string[] { "apple", "orange", "banana" };
string[] list2 = new string[] { "banana", "pear", "grape" };

I'd like a way to merge them into one array with no duplicate values:

{ "apple", "orange", "banana", "pear", "grape" }

I can do this with LINQ:

string[] result = list1.Concat(list2).Distinct().ToArray();

but I imagine that's not very efficient for large arrays.

Is there a better way?

A: 

I don't think there's any way to merge two arrays in that manner that performs better than O(n); you have to look at every element of both arrays at least once.

Dan Herbert
A: 

Creating a hashtable with your values as keys (adding only those not already present) and then converting the keys to an array would be a viable solution.
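
A minimal sketch of that idea (the helper name MergeViaDictionary is made up here for illustration; assumes System.Collections.Generic and System.Linq are imported):

private static string[] MergeViaDictionary(string[] list1, string[] list2)
{
    // The dictionary's keys act as the set; the Boolean values are ignored.
    Dictionary<string, bool> keys = new Dictionary<string, bool>();
    foreach (string item in list1)
        if (!keys.ContainsKey(item))
            keys.Add(item, true);
    foreach (string item in list2)
        if (!keys.ContainsKey(item))
            keys.Add(item, true);
    return keys.Keys.ToArray();
}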

petr k.
+1  A: 

Disclaimer: this is premature optimization. For your example arrays, use the 3.5 extension methods. Until you know you have a performance problem in this area, you should use library code.


If you can sort the arrays, or they're sorted when you get to that point in the code, you can use the following methods.

These will pull one item from each source and produce the "lowest" one, then fetch a new item from the corresponding source, until both sources are exhausted. When the current items fetched from the two sources are equal, it will produce the one from the first source and advance both sources past it.

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2)
{
    return Merge(source1, source2, Comparer<T>.Default);
}

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2, IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source1))
        throw new ArgumentNullException("source1");
    if (Object.ReferenceEquals(null, source2))
        throw new ArgumentNullException("source2");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T>
        enumerator1 = source1.GetEnumerator(),
        enumerator2 = source2.GetEnumerator())
    {
        Boolean more1 = enumerator1.MoveNext();
        Boolean more2 = enumerator2.MoveNext();

        while (more1 && more2)
        {
            Int32 comparisonResult = comparer.Compare(
                enumerator1.Current,
                enumerator2.Current);
            if (comparisonResult < 0)
            {
                // enumerator 1 has the "lowest" item
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            else if (comparisonResult > 0)
            {
                // enumerator 2 has the "lowest" item
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
            else
            {
                // they're considered equivalent, only yield it once
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
                more2 = enumerator2.MoveNext();
            }
        }

        // Yield rest of values from non-exhausted source
        while (more1)
        {
            yield return enumerator1.Current;
            more1 = enumerator1.MoveNext();
        }
        while (more2)
        {
            yield return enumerator2.Current;
            more2 = enumerator2.MoveNext();
        }
    }
}

Note that if one of the sources contains duplicates, you might see duplicates in the output. If you want to remove these duplicates in the already sorted lists, use the following method:

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
{
    return CheapDistinct<T>(source, Comparer<T>.Default);
}

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
    IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source))
        throw new ArgumentNullException("source");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T> enumerator = source.GetEnumerator())
    {
        if (enumerator.MoveNext())
        {
            T item = enumerator.Current;

            // scan until different item found, then produce
            // the previous distinct item
            while (enumerator.MoveNext())
            {
                if (comparer.Compare(item, enumerator.Current) != 0)
                {
                    yield return item;
                    item = enumerator.Current;
                }
            }

            // produce last item that is left over from above loop
            yield return item;
        }
    }
}

Note that none of these will internally use a data structure to keep a copy of the data, so they will be cheap if the input is sorted. If you can't, or won't, guarantee that, you should use the 3.5 extension methods that you've already found.

Here's example code that calls the above methods:

String[] list_1 = { "apple", "orange", "apple", "banana" }; // note the duplicate "apple"
String[] list_2 = { "banana", "pear", "grape" };

Array.Sort(list_1);
Array.Sort(list_2);

IEnumerable<String> items = Merge(
    CheapDistinct(list_1),
    CheapDistinct(list_2));
foreach (String item in items)
    Console.Out.WriteLine(item);
Lasse V. Karlsen
+1 for thinking out of the box: what if they're sorted? And for lots of code. Then again, the time it takes to sort them might defeat the whole purpose. Hence the disclaimer :)
Lucas
+1  A: 

You don't know which approach is faster until you measure it. The LINQ way is elegant and easy to understand.

Another way is to implement a set as a hash table (Dictionary) and add all the elements of both arrays to the set. Then use the set.Keys.ToArray() method to create the resulting array.

danatel
+1  A: 

.NET 3.5 introduced the HashSet class which could do this:

IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

Not sure of performance, but it should beat the LINQ example you gave.

EDIT: I stand corrected. The lazy implementations of Concat and Distinct have a key memory AND speed advantage. Concat/Distinct is about 10% faster, and saves multiple copies of the data.

I confirmed through code:

Setting up arrays of 3000000 strings overlapping by 300000
Starting Hashset...
HashSet: 00:00:02.8237616
Starting Concat/Distinct...
Concat/Distinct: 00:00:02.5629681

is the output of:

int num = 3000000;
int num10Pct = (int)(num / 10);

Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));
// list1 holds "1".."3000000"; list2 starts 10% before the end of list1's
// range, so roughly num10Pct values appear in both arrays.
string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();
string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray();

Console.WriteLine("Starting Hashset...");
Stopwatch sw = new Stopwatch();
sw.Start();
string[] merged = new HashSet<string>(list1).Union(list2).ToArray();
sw.Stop();
Console.WriteLine("HashSet: " + sw.Elapsed);

Console.WriteLine("Starting Concat/Distinct...");
sw.Reset();
sw.Start();
string[] merged2 = list1.Concat(list2).Distinct().ToArray();
sw.Stop();
Console.WriteLine("Concat/Distinct: " + sw.Elapsed);
TheSoftwareJedi
Actually, I'd expect that to be *less* efficient than the Concat/Distinct way, as Union will need to form a second HashSet.
Jon Skeet
+7  A: 

Why do you imagine that it would be inefficient? As far as I'm aware, both Concat and Distinct are evaluated lazily, using a HashSet behind the scenes for Distinct to keep track of the elements which have already been returned.

I'm not sure how you'd manage to make it more efficient than that in a general way :)

EDIT: Distinct actually uses Set (an internal class) instead of HashSet, but the gist is still correct. This is a really good example of just how neat LINQ is. The simplest answer is pretty much as efficient as you can achieve without more domain knowledge.

The effect is the equivalent of:

public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    HashSet<T> returned = new HashSet<T>();
    foreach (T element in first)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
    foreach (T element in second)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
}
Jon Skeet
+16  A: 
string[] result = list1.Union(list2).ToArray();

from msdn: "This method excludes duplicates from the return set. This is different behavior to the Concat(TSource) method, which returns all the elements in the input sequences including duplicates."
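
For the arrays in the question, that looks like this (assuming a using System.Linq directive; ToArray is added to materialize the result as an array):

string[] list1 = new string[] { "apple", "orange", "banana" };
string[] list2 = new string[] { "banana", "pear", "grape" };

string[] result = list1.Union(list2).ToArray();
// result: { "apple", "orange", "banana", "pear", "grape" }

Union preserves the order of first appearance, so this produces exactly the output the question asks for.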

Wonko
I came back to this topic to post exactly this solution. It's ideal in every way, I believe!
Jon Skeet