views:

796

answers:

6

I have what is essentially a jagged array of name value pairs - i need to generate a set of unique name values from this. the jagged array is approx 86,000 x 11 values. It does not matter to me what way I have to store a name value pair (a single string "name=value" or a specialised class for example KeyValuePair).
Additional Info: There are 40 distinct names and a larger number of distinct values - probably in the region 10,000 values.

I am using C# and .NET 2.0 (and the performance is so poor I am thinking that it may be better to push my entire jagged array into a sql database and do a select distinct from there).

Below is the current code Im using:

List<List<KeyValuePair<string,string>>> vehicleList = retriever.GetVehicles();
this.statsLabel.Text = "Unique Vehicles: " + vehicleList.Count;

Dictionary<KeyValuePair<string, string>, int> uniqueProperties = new Dictionary<KeyValuePair<string, string>, int>();
foreach (List<KeyValuePair<string, string>> vehicle in vehicleList)
{
    foreach (KeyValuePair<string, string> property in vehicle)
    {
        if (!uniqueProperties.ContainsKey(property))
        {
            uniqueProperties.Add(property, 0);
        }
    }
}
this.statsLabel.Text += "\rUnique Properties: " + uniqueProperties.Count;
A: 

if you don't need any specific correlation between each key/value pair and the unique values you're generating, you could just use a GUID? I'm assuming the problem is that your current 'Key' isn't unique in this jagged array.

Dictionary<System.Guid, KeyValuePair<string, string>> myDict 
   = new Dictionary<Guid, KeyValuePair<string, string>>();


foreach of your key values in their current format
   myDict.Add(System.Guid.NewGuid(), new KeyValuePair<string, string>(yourKey, yourvalue))

Sounds like it would store what you need but I don't know how you would pull data back from this as there would be no semantic relationship between the generate Guid & what you originally had...

Can you provide any more info in your question?

Eoin Campbell
A: 

Use KeyValuePair as a wrapper class and then create a dictionary with to create a set perhaps? Or implement your own wrapper that overrides the Equals and GetHashCode.

Dictionary<KeyValuePair, bool> mySet;

for(int i = 0; i < keys.length; ++i)
{
    KeyValuePair kvp = new KeyValuePair(keys[i], values[i]);
    mySet[kvp] = true;
}
Mats Fredriksson
A: 

Instead of using a Dictionary why not extend KeyedCollection<TKey, TItem>? According to the documentation:

Provides the abstract base class for a collection whose keys are embedded in the values.

You then need to override the protected TKey GetKeyForItem(TItem item) function. As it is an hybrid between IList<T> and IDictionary<TKey, TValue> I think it's likely to be quite fast.

Leandro López
A: 

How about:

Dictionary<NameValuePair,int> hs = new Dictionary<NameValuePair,int>();
foreach (i in jaggedArray)
{
    foreach (j in i)
    {
        if (!hs.ContainsKey(j))
        {
            hs.Add(j, 0);
        }
    }
}
IEnumerable<NameValuePair> unique = hs.Keys;

of course, if you were using C# 3.0, .NET 3.5:

var hs = new HashSet<NameValuePair>();
hs.UnionWith(jaggedArray.SelectMany(item => item));

would do the trick.

Eric Smith
this is almost exactly the code I am currently using - unfortunately after about 20 mins I get impatient and kill the application.
dice
In C# 3 you could just use `.Distinct()`, too.
Konrad Rudolph
@ Konrad Rudolph: Yes, and it would be just as slow.
Binary Worrier
A: 

Have you profiled your code? You are certain that the foreach loops are the bottleneck, and not retriever.GetVehicles()?

I did create a small test project where I fake the retriever and let it return 86.000 X 11 values. My first attempt ran at 5 seconds, creating the data included.

I used the same value for both the key and value where the first key was "0#0" and the last "85999#10".

Then I switched to guids. Same result.

Then I made the key longer, like this:

        var s = Guid.NewGuid().ToString();
        return s + s + s + s + s + s + s+ s + s + s;

Now it took almost 10 seconds.

Then I made the keys insanely long and got an out of memory exception. I don't have a swap file on my computer, so I got this exception immediately.

How long are your keys? Are your virtual memory consumption the reason for your poor performance?

Thomas Eyde
GetVehicles() is fairly fast in my case - the difference I guess is the data - your data would contain all unique values whereas mine would not - still it is suprising just how fast it runs for you. It should be 86,000 in the outer loop and 11 in the inner.
dice
+6  A: 

I have it running in 0.34 seconds down from 9+ minutes

The problem is when comparing the KeyValuePair structs. I worked around it by writing a comparer object, and passing an instance of it to the Dictionary.

From what I can determine, the KeyValuePair.GetHashCode() returns the hashcode of the it's Key object (in this example the least unique object).

As the dictionary adds (and check existence of) each item, it uses both Equals and GetHashCode functions, but has to rely on the Equals function when the hashcode is less unique.

By providing a more unique GetHashCode function, it excerises the Equals function far less often. I also optimised the Equals function to compare the more unique Values before the less unqiue Keys.

86,000 * 11 items with 10,000 unique properties runs in 0.34 seconds using the comparer object below (without the comparer object it takes 9 minutes 22 seconds)

Hope this helps :)

    class StringPairComparer
        : IEqualityComparer<KeyValuePair<string, string>>
    {
        public bool Equals(KeyValuePair<string, string> x, KeyValuePair<string, string> y)
        {
            return x.Value == y.Value && x.Key == y.Key;
        }
        public int GetHashCode(KeyValuePair<string, string> obj)
        {
            return obj.Key.GetHashCode() * obj.Value.GetHashCode();
        }
    }

EDIT: If it was just one string (instead of a KeyValuePair, where string = Name+Value) it would be approx twice as fast. It's a nice intresting problem, and I have spent faaaaaar too much time on it (I learned quiet a bit though)

Binary Worrier