views:

117

answers:

3

I wrote a hashmap in C# as a self study exercise. I wanted to implement chaining as a collision handling technique. At first I thought I'd simply use GetHashCode as my hashing algorithm, but I quickly found that use the numbers returned by GetHashCode would not always be viable (size of the int causes a out of mem if you want to index and array by the number and numbers can be negative :(). So, I came up with a kludgey method of narrowing the numbers (see MyGetHashCode).

Does anyone have any pointers/tips/criticism for this implementation (of the hash function and in general)? Thanks in advance!

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace HashMap
{
    class Program
    {

        public class MyKVP<T, K>
        {
            public T Key { get; set; }
            public K Value { get; set; }
            public MyKVP(T key, K value)
            {
                Key = key;
                Value = value;
            }
        }


        public class MyHashMap<T, K> : IEnumerable<MyKVP<T,K>>
            where T:IComparable
        {

            private const int map_size = 5000;
            private List<MyKVP<T,K>>[] storage;
            public MyHashMap()
            {
                storage = new List<MyKVP<T,K>>[map_size];
            }

            System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
            public IEnumerator<MyKVP<T, K>> GetEnumerator()
            {
                foreach (List<MyKVP<T, K>> kvpList in storage)
                {
                    if (kvpList != null)
                    {
                        foreach (MyKVP<T, K> kvp in kvpList)
                        {
                            yield return kvp;
                        }
                    }
                }
            }


            private int MyGetHashCode(T key)
            {
                int i = key.GetHashCode();
                if (i<0) i=i*-1;
                return i / 10000;
            }

            public void Add(T key, K data)
            {
                int value = MyGetHashCode(key);

                SizeIfNeeded(value);

                //is this spot in the hashmap null?
                if (storage[value] == null)
                {
                    //create a new chain
                    storage[value] = new List<MyKVP<T, K>>();
                    storage[value].Add(new MyKVP<T, K>(key, data));
                }
                else
                { 
                    //is this spot taken?
                    MyKVP<T, K> myKvp = Find(value, key);
                    if (myKvp != null) //key exists, throw
                    {
                        throw new Exception("This key exists. no soup for you.");
                    }

                    //if we didn't throw, then add us
                    storage[value].Add(new MyKVP<T, K>(key, data));
                }

            }

            private MyKVP<T, K> Find(int value, T key)
            {
                foreach (MyKVP<T, K> kvp in storage[value])
                {
                    if (kvp.Key.CompareTo(key) == 0)
                    {
                        return kvp;
                    }
                }

                return null;
            }

            private void SizeIfNeeded(int value)
            {
                if (value >= storage.Length)
                {
                    List<MyKVP<T, K>>[] temp = storage;
                    storage = new List<MyKVP<T, K>>[value+1];
                    Array.Copy(temp, storage, temp.Length);
                }
            }

            public K this[T key]
            {

                get 
                {
                    int value = MyGetHashCode(key);
                    if (value > storage.Length) { throw new IndexOutOfRangeException("Key does not exist."); }
                    MyKVP<T, K> myKvp = Find(value, key);
                    if (myKvp == null) throw new Exception("key does not exist");
                    return myKvp.Value;
                }
                set 
                {
                    Add(key, value);
                }
            }


            public void Remove(T key)
            {
                int value = MyGetHashCode(key);
                if (value > storage.Length) { throw new IndexOutOfRangeException("Key does not exist."); }
                if (storage[value] == null) { throw new IndexOutOfRangeException("Key does not exist."); }

                //loop through each kvp at this hash location
                MyKVP<T, K> myKvp = Find(value, key);
                if (myKvp != null)
                {
                    storage[value].Remove(myKvp);
                }
            }
        }

        static void Main(string[] args)
        {
            MyHashMap<string, int> myHashMap = new MyHashMap<string, int>();
            myHashMap.Add("joe", 1);
            myHashMap.Add("mike", 2);
            myHashMap.Add("adam", 3);
            myHashMap.Add("dad", 4);

            Assert.AreEqual(1, myHashMap["joe"]);
            Assert.AreEqual(4, myHashMap["dad"]);
            Assert.AreEqual(2, myHashMap["mike"]);
            Assert.AreEqual(3, myHashMap["adam"]);

            myHashMap.Remove("joe");

            try 
            {
                if (myHashMap["joe"] == 3) { }; //should throw 
            }
            catch (Exception) 
            {
                try { myHashMap.Add("mike",1); }
                catch (Exception) {

                    foreach (MyKVP<string, int> kvp in myHashMap)
                    { 
                        Console.WriteLine(kvp.Key + " " + kvp.Value.ToString());
                    }


                    return;
                }

            }

            throw new Exception("fail");
        }
    }
}
+5  A: 
  1. A Remove method should never throw an exception. You are trying to remove an item. No harm is done if it have already been removed. All collection classes in .Net uses bool as a return value to indicate if an item was really removed.

  2. Do not throw Exception, throw specific one. Browse through all exceptions in the Collection namespaces to find suitable ones.

  3. Add a TryGetValue

  4. Use KeyValuePair which already is a part of .Net instead of creating your own.

  5. Add a constructor which can define map size.

  6. When throwing exceptions include details to why it was thrown. For instance, instead of writing "This key exists", write string.Format("Key '{0}' already exists", key)

jgauffin
A: 

Sorry to say this, but this class won't be working as HashMap or even simple dictionary.

First of all, value returned from GetHashCode() is not unique. Two different objects, e.g. two strings, can possibly return same hash code value. The idea to use hash code as the array index then simply leads to record loss in case of hash code clashing. I would suggest reading about GetHashCode() method and how to implement it from MSDN. Some obvious example is if you get hash code of all possible Int64 values starting at 0, the hash code will surely be clashed at some point.

Another thing is, the for-loop lookup is slow. You should consider using binary search for look up. To do so, you must maintained your key-value pair sorted by the key at any time, which imply that you should use List instead of array for the storage variable so when adding new key-value pair you can insert it at the appropriate index.

After all, make sure that when you are coding for real hash map, you realized that hash code can be the same for different keys, and never do the look up with for-loop from 0 to len-1.

tia
Isn't that what chaining is for?
Dr.HappyPants
@tia. This is precisely how GetHashCode() is intended to be used. I would suggest you read about the GetHashCode() method a bit more yourself. Hash maps of arbitary objects (rather than those for which a perfect hash can be generated because they are from a known set) always have a risk of hash collision, for which there are various techniques such as reprobing and chaining (the querant is using chaining here). The person implementing GetHashCode() should always avoid collision, but the person using it must deal with the fact that collision can still happen.
Jon Hanna
You might want to look at the code and why this has a -1.
Dr.HappyPants
Yes my bad and I must apologize that I skimmed your code too lightly. My above comment is completely inaccurate.
tia
+3  A: 

Your hash method is of a fixed range. This means that a single item could cause 214748 buckets to be created (if it's hashcode rehashed to 214747). A more commonly used (and almost always better approach) is to start with an initial size that is either known (due to knowledge of the domain) to be big enough for all values or to start small and have hashmap resize itself as appropriate. With re-probing the obvious measure of a need to resize is how much reprobing was needed. With chaining as you are experimenting with here, you'll want to keep both average and maximum chain sizes down. This keeps down your worse-case lookup time, and hence your average lookup time closer to the best-case O(1).

The two most common approaches to such hashing (and hence to initial table size) is to either use prime numbers or powers of two. The former is considered (though there is some contention on the point) to offer better distribution of keys while the latter allows for faster computation (both cases do a modulo on the input-hash, but with a number known to be a power of 2, the modulo can be quickly done as a binary-and operation). Another advantage of using a power of two when you are chaining, is that its possible to test a chain to see if resizing the hash would actually cause that chain to be split or not (if you have an 8-value table and there's a chain whose hashes are all either 17, 1 or 33 then doubling the table size would still leave them in the same chain, but quadrupling it would re-distribute them).

You don't have a method offering replace semantics, which is usual with .NET dictionary types (where adding will error if there's already an item with that key, but assigning to an index won't).

Your error on a retrieval that would try to go beyond the number of buckets will make no sense to the user, who doesn't care whether the bucket existed or not, only the key (they need not know how your implementation works at all). Both cases where a key isn't found should throw the same error (System.Collections.Generic.KeyNotFoundException has precisely the right semantics, so you could reuse that.).

Using a List is rather heavy in this case. Generally I'd frown on anyone saying a BCL collection was too heavy, but when it comes to rolling your own collections, its generally either because (1) you want to learn from the exercise or (2) the BCL collections don't suit your purposes. In case (1) you should learn how to complete the job you started, and in case (2) you need to be sure that List doesn't have whatever failing you found with Dictionary.

Your removal both throws a nonsensical error for someone who doesn't know about the implementation details, and an inconsistent error (whether something else existed in that bucket is not something they should care about). Since removing a non-existent item isn't harmful it is more common to merely return a bool indicating whether the item had been present or not, and let the user decide if that indicates an error or not. It is also wasteful in continuing to search the entire bucket after the item has been removed.

Your implementation does now allow null keys, which is reasonable enough (indeed, the documentation for IDictionary<TKey, TValue> says that implementations may or may not do so). However, the way you reject them is by having the NullReferenceException caused by trying to call GetHashCode() on null be returned, rather than checking and throwing a ArgumentNullException. For the user to receive a NullReferenceException suggests that the collection itself was null. This is hence a clear bug.

Jon Hanna