I have a Dictionary<string,int> that has the potential to contain upwards of 10 million unique keys. I am trying to reduce the amount of memory this takes, while still maintaining the functionality of the dictionary.

I had the idea of storing a hash of the string as a long instead; this decreases the app's memory usage to an acceptable amount (~1.5 GB to ~0.5 GB), but I don't feel very good about my method for doing this.

long longKey = BitConverter.ToInt64(cryptoTransformSHA1.ComputeHash(enc.GetBytes(strKey)), 0);

Basically this chops off the end of the SHA1 hash, keeping only the first 8 bytes as a long, which I then use as the key. While this works, at least for the data I'm testing with, I don't feel like it is a very reliable solution due to the increased possibility of key collisions.

Are there any other ways of reducing the Dictionary's memory footprint, or is the method I have above not as horrible as I think it is?

[edit] To clarify, I need to maintain the ability to look up a value contained in the Dictionary using a string. Storing the actual string in the dictionary takes far too much memory. What I would like to do instead is use a Dictionary<long,int> where the long is the result of a hashing function on the string.
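
Roughly the shape I'm after (Hash64 below is just a placeholder for whatever 64-bit hash function ends up being used, not a real API):

// Sketch of the intended usage -- Hash64 is a stand-in for the chosen hash.
Dictionary<long, int> map = new Dictionary<long, int>();

long key = Hash64(strKey);      // e.g. a truncated SHA1, FNV, etc.
map[key] = someIntValue;

int value;
if (map.TryGetValue(Hash64(strKey), out value))
{
    // found the value via the string, without storing the string itself
}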

+3  A: 

Why don't you just use GetHashCode() to get a hash of the string?

Andrew Hare
GetHashCode() is not reliable at all...
Diadistis
I tried that first, but it caused collisions.
blogsdon
I was not aware that GetHashCode was not reliable - more info?
Andrew Hare
What is wrong with collisions? (See my answer)
Brian Genisio
Diadistis: what makes you say that? This is what the function is *there for*! No other general-purpose function will perform equally well.
Konrad Rudolph
From my understanding, GetHashCode on String IS reliable (deterministic), but not unique.
Brian Genisio
thanks Brian, that makes sense
Andrew Hare
http://blogs.msdn.com/brada/archive/2003/09/30/50396.aspx and http://vkreynin.wordpress.com/2008/07/05/explaining-gethashcode-method/
Diadistis
+2  A: 

With the hashtable implementations I have worked with in the past, the hash brings you to a bucket, which is often a linked list of other objects that have the same hash. Hashes are not unique, but they are good enough to split your data up into very manageable lists (sometimes only 2 or 3 entries long) that you can then search through to find your actual item.

The key to a good hash is not its uniqueness, but its speed and distribution capabilities... you want it to distribute as evenly as possible.
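
As a rough illustration (this is only a sketch of the bucket/linked-list idea, not how Dictionary<TKey,TValue> is actually implemented):

using System.Collections.Generic;

// Toy chained hash table: the hash picks a bucket, the bucket's list
// soaks up any keys that happen to share that bucket.
class ChainedMap
{
    private readonly List<KeyValuePair<string, int>>[] buckets =
        new List<KeyValuePair<string, int>>[1024];

    public void Add(string key, int value)
    {
        int b = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[b] == null)
            buckets[b] = new List<KeyValuePair<string, int>>();
        buckets[b].Add(new KeyValuePair<string, int>(key, value));
    }

    public bool TryGet(string key, out int value)
    {
        int b = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[b] != null)
            foreach (KeyValuePair<string, int> pair in buckets[b])
                if (pair.Key == key) { value = pair.Value; return true; }
        value = 0;
        return false;
    }
}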

Brian Genisio
Dictionary doesn't work this way; it won't allow key collisions. You'd have to use a different data structure, and to handle collisions you'd need to store both the hashed key and the real key -- unless you also know the value you are looking for. That wouldn't save any memory.
tvanfosson
Hashed keys may be congruent but not equivalent. He is using a hashed string AS the key, which is why he can't use string.GetHashCode() as the key: it produces duplicates at his sample size.
Nicholas Mancuso
+5  A: 

By the way, cryptographic hashes / hash functions are exceptionally bad for dictionaries. They’re big and slow. By solving the one problem (size) you’ve only introduced another, more severe problem: the function won’t spread the input evenly any longer, thus destroying the single most important property of a good hash for approaching collision-free addressing (as you seem to have noticed yourself).

/EDIT: As Andrew has noted, GetHashCode is the solution for this problem since that's its intended use. And as in a real dictionary, you will have to work around collisions. One of the best schemes for that is double hashing. Unfortunately, the only 100% reliable way will be to actually store the original values; otherwise you'd have created infinite compression, which we know can't exist.
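
To illustrate the probing idea (a sketch only: the table size is arbitrary, and ideally the step hash h2 would come from a second, independent hash function rather than being derived from the same hash code):

// Double hashing (open addressing): when slot h1(key) is occupied,
// probe h1(key) + i * h2(key) for i = 1, 2, ... until a free slot is found.
int Probe(string key, int attempt, int tableSize)
{
    int hash = key.GetHashCode() & 0x7FFFFFFF;
    int h1 = hash % tableSize;
    int h2 = 1 + (hash % (tableSize - 1));   // step size, never zero
    return (int)((h1 + (long)attempt * h2) % tableSize);
}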

Konrad Rudolph
In effect that _is_ what he is doing. Instead of Dictionary<string, int> it's Dictionary<long, int>, and the key is the crypto hash of the original string, whereas before string.GetHashCode was causing duplicate keys across the original sample.
Nicholas Mancuso
Nicholas, you're right – but a (crippled) crypto hash is *still* a bad hash, even if used in double hashing.
Konrad Rudolph
You can turn that frown upside down by encapsulating the signature in a class and pretending the signature itself is some opaque object. My example below does just that. Keep in mind he should be using a database anyways...
sixlettervariables
A: 

A possible solution, although I'm not a big fan of it and think you should use a DB, is the following inelegant kludge. It uses the SHA1 value as the dictionary key. The example below assumes a CryptoKey class (given further down) that can be hashed and tested for equality (i.e. a correct IEquatable<> implementation).

Dictionary<CryptoKey, int> bigDictionary = ...;

CryptoKey key = new CryptoKey(sha1Algorithm);
key.FromString(enc, myStr);
if (!bigDictionary.ContainsKey(key))
{
    bigDictionary[key] = someIntValue;
}

if (bigDictionary.ContainsKey(key))
{
    // it exists! yay
}

All in all you should look at the space requirements versus your strings. Each SHA1 hash is 20 bytes: are all of your strings longer than 20 bytes? I can assure you the dictionary is falling apart at huge sizes due to string interning keeping the strings around, which raises the question of why a local file-based database isn't appropriate. SQLite is very fast.

EDIT It occurs to me you could just use the CryptoKey class and a dictionary of ints, making life easy. Class provided for reference (USE A DATABASE WRAAA):

public class CryptoKey : IEquatable<CryptoKey> {
    private HashAlgorithm algorithm;
    private byte[] bytes;

    public CryptoKey(HashAlgorithm algo) {
        this.algorithm = algo;
    }

    public void SetBytes(byte[] bytes) {
        this.bytes = bytes;
    }

    public void FromString(Encoding encoding, string value) {
        this.SetBytes(algorithm.ComputeHash(encoding.GetBytes(value)));
    }

    public bool Equals(CryptoKey other) {
        if (other == null) return false;
        if (this.bytes == null || other.bytes == null) return false;
        if (this.bytes.Length != other.bytes.Length) return false;
        for (int ii = 0; ii < this.bytes.Length; ++ii) {
            if (this.bytes[ii] != other.bytes[ii]) return false;
        }

        return true;
    }

    public override bool Equals(object other) {
        CryptoKey otherKey = other as CryptoKey;
        if (otherKey == null) return false;
        return this.Equals(otherKey);
    }

    public override int GetHashCode() {
        // GetHashCode must return int to actually override Object.GetHashCode
        if (this.bytes == null) return 0;
        unchecked {
            int hashValue = this.bytes[0];
            for (int ii = 1; ii < this.bytes.Length; ++ii) {
                hashValue *= 33;
                hashValue += this.bytes[ii];
            }
            return hashValue;
        }
    }
}
sixlettervariables
I don't like this solution btw, but it will work (note you could conceivably have both a .NET hash code collision AND a SHA1 collision while the data is actually different... but I doubt that happens in practice).
sixlettervariables
+7  A: 

With 10 million-odd records, have you considered using a database with a non-clustered index? Databases have a lot more tricks up their sleeve for this type of thing.

Hashing, by definition, and under any algorithm, has the potential of collisions - especially with high volumes. Depending on the scenario, I'd be very cautious of this.

Using the strings might take space, but it is reliable... if you are on x64 this needn't be too large (although it definitely counts as "big" ;-p)

Marc Gravell
+8  A: 

I have done something similar recently, and for reasons fairly unique to my application I did not use a database; in fact, I was trying to stop using a database. I have found that GetHashCode is significantly improved in 3.5. One important note: NEVER PERSIST THE RESULTS OF GetHashCode. NEVER EVER. They are not guaranteed to be consistent between versions of the framework.

You really need to analyze your data, since different hash functions may work better or worse on it. You also need to account for speed. As a general rule, cryptographic hash functions should not produce many collisions even as the number of hashes moves into the billions. For things that I need to be unique I typically use SHA1Managed. In general the CryptoAPI has terrible performance, even when the underlying hash functions perform well.

For a 64-bit hash I currently combine Lookup3 and FNV1, which are both 32-bit hashes. For a collision to occur, both would need to collide on the same pair of inputs, which is mathematically improbable and which I have not seen happen over roughly 100 million hashes. The code for both is publicly available on the web.
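
A rough sketch of the combination idea: FNV-1 is shown in full, while the second 32-bit hash below is a simple stand-in for Lookup3 (the real Lookup3 is longer and best taken from the published source).

using System.Text;

// Sketch: pack two independent 32-bit hashes into one 64-bit key.
static class CombinedHash
{
    public static long Hash64(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        // a collision requires BOTH 32-bit hashes to collide on the same pair
        return ((long)Fnv1(bytes) << 32) | SecondHash32(bytes);
    }

    // FNV-1, 32-bit (standard offset basis 2166136261 and prime 16777619)
    private static uint Fnv1(byte[] data)
    {
        uint hash = 2166136261;
        foreach (byte b in data)
        {
            hash *= 16777619;
            hash ^= b;
        }
        return hash;
    }

    // Stand-in second hash (djb2-style), used purely for illustration;
    // swap in Lookup3 or another independent 32-bit hash here.
    private static uint SecondHash32(byte[] data)
    {
        uint hash = 5381;
        foreach (byte b in data)
            hash = hash * 33 + b;
        return hash;
    }
}

The Dictionary<long, int> is then keyed by CombinedHash.Hash64(strKey) instead of the string itself.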

Still, conduct your own analysis. What has worked for me may not work for you. Within my own office, different applications with different requirements use different hash functions or combinations of hash functions.

I would avoid any unproven hash function. There are as many hash functions as there are people who think they should be writing them. Do your research and test, test, test.

Steve
I implemented a version of your 64-bit hash idea, and preliminary tests went well. I am going to conduct some further tests, but this looks like the best trade-off between memory size and access time for my purposes.
blogsdon
Cool. I like the 64bit hash technique. Which hash functions did you use?
Steve
+1 for actually answering the question and not just recommending a relational database.
lubos hasko
Thanks. Relational databases are great, but there are times when you don't want to use one. I leave it to people's judgement when those times are.
Steve
+2  A: 

Just go get SQLite. You're not likely to beat it, and even if you do, it probably won't be worth the time/effort/complexity.
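
Something along these lines (a sketch using the System.Data.SQLite ADO.NET provider; the database file, table, and column names are made up, and strKey/someIntValue stand for the string key and value from the question):

using System.Data.SQLite;   // ADO.NET provider for SQLite

// Sketch: keep the string -> int mapping in an indexed SQLite table on disk
// instead of an in-memory Dictionary.
using (var conn = new SQLiteConnection("Data Source=keys.db"))
{
    conn.Open();

    using (var cmd = new SQLiteCommand(
        "CREATE TABLE IF NOT EXISTS map (str_key TEXT PRIMARY KEY, int_value INTEGER)", conn))
    {
        cmd.ExecuteNonQuery();
    }

    // insert or update a value
    using (var cmd = new SQLiteCommand(
        "INSERT OR REPLACE INTO map (str_key, int_value) VALUES (@k, @v)", conn))
    {
        cmd.Parameters.AddWithValue("@k", strKey);
        cmd.Parameters.AddWithValue("@v", someIntValue);
        cmd.ExecuteNonQuery();
    }

    // look a value up by its string key
    using (var cmd = new SQLiteCommand(
        "SELECT int_value FROM map WHERE str_key = @k", conn))
    {
        cmd.Parameters.AddWithValue("@k", strKey);
        object result = cmd.ExecuteScalar();   // null if the key is absent
    }
}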

SQLite.