I have a Dictionary<string,int> that has the potential to contain upwards of 10 million unique keys. I am trying to reduce the amount of memory this takes, while still maintaining the functionality of the dictionary.

I had the idea of storing a hash of the string as a long instead; this decreases the app's memory usage to an acceptable amount (~1.5 GB to ~0.5 GB), but I don't feel very good about my method for doing this.

long longKey = BitConverter.ToInt64(cryptoTransformSHA1.ComputeHash(enc.GetBytes(strKey)), 0);

Basically this chops off the end of the SHA1 hash, keeping only the first 8 bytes as a long, which I then use as the key. While this works, at least for the data I'm testing with, I don't feel like it is a very reliable solution due to the increased possibility of key collisions.

Are there any other ways of reducing the Dictionary's memory footprint, or is the method I have above not as horrible as I think it is?

[edit] To clarify, I need to maintain the ability to look up a value contained in the Dictionary using a string. Storing the actual string in the dictionary takes far too much memory. What I would like to do instead is use a Dictionary<long,int> where the long is the result of a hashing function on the string.
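
Roughly the shape I'm after (Hash64 below is just a placeholder for whatever 64-bit hash function ends up being used, not a real API):

// Sketch of the intended usage -- Hash64 is a stand-in for the chosen hash.
Dictionary<long, int> map = new Dictionary<long, int>();

long key = Hash64(strKey);      // e.g. a truncated SHA1, FNV, etc.
map[key] = someIntValue;

int value;
if (map.TryGetValue(Hash64(strKey), out value))
{
    // found the value via the string, without storing the string itself
}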

+3  A: 

Why don't you just use GetHashCode() to get a hash of the string?

Andrew Hare
GetHashCode() is not reliable at all...
Diadistis
I tried that first, but it caused collisions.
blogsdon
I was not aware that GetHashCode was not reliable - more info?
Andrew Hare
What is wrong with collisions? (See my answer)
Brian Genisio
Diadistis: what makes you say that? This is what the function is *there for*! No other general-purpose function will perform equally well.
Konrad Rudolph
From my understanding, GetHashCode on String IS reliable (deterministic), but not unique.
Brian Genisio
thanks Brian, that makes sense
Andrew Hare
http://blogs.msdn.com/brada/archive/2003/09/30/50396.aspx and http://vkreynin.wordpress.com/2008/07/05/explaining-gethashcode-method/
Diadistis
+2  A: 

With the hashtable implementations I have worked with in the past, the hash brings you to a bucket, which is often a linked list of other objects that have the same hash. Hashes are not unique, but they are good enough to split your data up into very manageable lists (sometimes only 2 or 3 entries long) that you can then search through to find your actual item.

The key to a good hash is not its uniqueness, but its speed and distribution capabilities... you want it to distribute as evenly as possible.
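
As a rough illustration (this is only a sketch of the bucket/linked-list idea, not how Dictionary<TKey,TValue> is actually implemented):

using System.Collections.Generic;

// Toy chained hash table: the hash picks a bucket, the bucket's list
// soaks up any keys that happen to share that bucket.
class ChainedMap
{
    private readonly List<KeyValuePair<string, int>>[] buckets =
        new List<KeyValuePair<string, int>>[1024];

    public void Add(string key, int value)
    {
        int b = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[b] == null)
            buckets[b] = new List<KeyValuePair<string, int>>();
        buckets[b].Add(new KeyValuePair<string, int>(key, value));
    }

    public bool TryGet(string key, out int value)
    {
        int b = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[b] != null)
            foreach (KeyValuePair<string, int> pair in buckets[b])
                if (pair.Key == key) { value = pair.Value; return true; }
        value = 0;
        return false;
    }
}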

Brian Genisio
Dictionary doesn't work this way; it won't allow key collisions. You'd have to use a different data structure, and to handle collisions you'd need to store both the hashed key and the real key -- unless you also know the value you are looking for. That wouldn't save any memory.
tvanfosson
Hashed keys may be congruent but not equivalent. He is using a hashed string AS the key, which is why he can't use string.GetHashCode() as the key: it produces duplicates at his sample size.
Nicholas Mancuso
+5  A: 

By the way, cryptographic hashes / hash functions are exceptionally bad for dictionaries. They’re big and slow. By solving the one problem (size) you’ve only introduced another, more severe problem: the function won’t spread the input evenly any longer, thus destroying the single most important property of a good hash for approaching collision-free addressing (as you seem to have noticed yourself).

/EDIT: As Andrew has noted, GetHashCode is the solution for this problem since that's its intended use. And as in a real dictionary, you will have to work around collisions. One of the best schemes for that is double hashing. Unfortunately, the only 100% reliable way will be to actually store the original values; otherwise you'd have created infinite compression, which we know can't exist.
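
To illustrate the probing idea (a sketch only: the table size is arbitrary, and ideally the step hash h2 would come from a second, independent hash function rather than being derived from the same hash code):

// Double hashing (open addressing): when slot h1(key) is occupied,
// probe h1(key) + i * h2(key) for i = 1, 2, ... until a free slot is found.
int Probe(string key, int attempt, int tableSize)
{
    int hash = key.GetHashCode() & 0x7FFFFFFF;
    int h1 = hash % tableSize;
    int h2 = 1 + (hash % (tableSize - 1));   // step size, never zero
    return (int)((h1 + (long)attempt * h2) % tableSize);
}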

Konrad Rudolph
In effect that _is_ what he is doing. Instead of Dictionary<string, int> it's Dictionary<long, int>, and the key is the crypto hash of the original string, whereas before string.GetHashCode was causing duplicate keys across the original sample.
Nicholas Mancuso
Nicholas, you're right – but a (crippled) crypto hash is *still* a bad hash, even if used in double hashing.
Konrad Rudolph
You can turn that frown upside down by encapsulating the signature in a class and pretending the signature itself is some opaque object. My example below does just that. Keep in mind he should be using a database anyways...
sixlettervariables
A: 

A possible solution, although I'm not a big fan of it and think you should use a DB, is the following inelegant kludge. It uses the SHA1 value as the dictionary key. The example below assumes a CryptoKey class (given further down) that can be hashed and tested for equality (i.e. a correct IEquatable<> implementation).

Dictionary<CryptoKey, int> bigDictionary = ...;

CryptoKey key = new CryptoKey(sha1Algorithm);
key.FromString(enc, myStr);
if (!bigDictionary.ContainsKey(key))
{
    bigDictionary[key] = someIntValue;
}

if (bigDictionary.ContainsKey(key))
{
    // it exists! yay
}

All in all you should look at the space requirements versus your strings. Each SHA1 hash is 20 bytes: are all of your strings longer than 20 bytes? I can assure you the dictionary is falling apart at huge sizes due to string interning keeping the strings around, which raises the question of why a local file-based database isn't appropriate. SQLite is very fast.

EDIT It occurs to me you could just use the CryptoKey class and a dictionary of ints, making life easy. Class provided for reference (USE A DATABASE WRAAA):

public class CryptoKey : IEquatable<CryptoKey> {
    private HashAlgorithm algorithm;
    private byte[] bytes;

    public CryptoKey(HashAlgorithm algo) {
        this.algorithm = algo;
    }

    public void SetBytes(byte[] bytes) {
        this.bytes = bytes;
    }

    public void FromString(Encoding encoding, string value) {
        this.SetBytes(algorithm.ComputeHash(encoding.GetBytes(value)));
    }

    public bool Equals(CryptoKey other) {
        if (other == null) return false;
        if (this.bytes == null || other.bytes == null) return false;
        if (this.bytes.Length != other.bytes.Length) return false;
        for (int ii = 0; ii < this.bytes.Length; ++ii) {
            if (this.bytes[ii] != other.bytes[ii]) return false;
        }

        return true;
    }

    public override bool Equals(object other) {
        CryptoKey otherKey = other as CryptoKey;
        if (otherKey == null) return false;
        return this.Equals(otherKey);
    }

    public override int GetHashCode() {
        // GetHashCode must return int to actually override Object.GetHashCode
        if (this.bytes == null) return 0;
        unchecked {
            int hashValue = this.bytes[0];
            for (int ii = 1; ii < this.bytes.Length; ++ii) {
                hashValue *= 33;
                hashValue += this.bytes[ii];
            }
            return hashValue;
        }
    }
}
sixlettervariables
I don't like this solution btw, but it will work (note you could conceivably have both a .NET hash code collision AND a SHA1 collision while the data is actually different... but I doubt that happens in practice).
sixlettervariables
+7  A: 

With 10 million-odd records, have you considered using a database with a non-clustered index? Databases have a lot more tricks up their sleeve for this type of thing.

Hashing, by definition, and under any algorithm, has the potential of collisions - especially with high volumes. Depending on the scenario, I'd be very cautious of this.

Using the strings might take space, but it is reliable... if you are on x64 this needn't be too large (although it definitely counts as "big" ;-p)

Marc Gravell
+8  A: 

I have done something similar recently, and for reasons fairly unique to my application I did not use a database; in fact, I was trying to stop using a database. I have found that GetHashCode is significantly improved in 3.5. One important note: NEVER PERSIST THE RESULTS OF GetHashCode. NEVER EVER. They are not guaranteed to be consistent between versions of the framework.

You really need to analyze your data, since different hash functions may work better or worse on it. You also need to account for speed. As a general rule, cryptographic hash functions should not produce many collisions even as the number of hashes moves into the billions. For things that I need to be unique I typically use SHA1Managed. In general the CryptoAPI has terrible performance, even when the underlying hash functions perform well.

For a 64-bit hash I currently combine Lookup3 and FNV1, which are both 32-bit hashes. For a collision to occur, both would need to collide on the same pair of inputs, which is mathematically improbable and which I have not seen happen over roughly 100 million hashes. The code for both is publicly available on the web.
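
A rough sketch of the combination idea: FNV-1 is shown in full, while the second 32-bit hash below is a simple stand-in for Lookup3 (the real Lookup3 is longer and best taken from the published source).

using System.Text;

// Sketch: pack two independent 32-bit hashes into one 64-bit key.
static class CombinedHash
{
    public static long Hash64(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        // a collision requires BOTH 32-bit hashes to collide on the same pair
        return ((long)Fnv1(bytes) << 32) | SecondHash32(bytes);
    }

    // FNV-1, 32-bit (standard offset basis 2166136261 and prime 16777619)
    private static uint Fnv1(byte[] data)
    {
        uint hash = 2166136261;
        foreach (byte b in data)
        {
            hash *= 16777619;
            hash ^= b;
        }
        return hash;
    }

    // Stand-in second hash (djb2-style), used purely for illustration;
    // swap in Lookup3 or another independent 32-bit hash here.
    private static uint SecondHash32(byte[] data)
    {
        uint hash = 5381;
        foreach (byte b in data)
            hash = hash * 33 + b;
        return hash;
    }
}

The Dictionary<long, int> is then keyed by CombinedHash.Hash64(strKey) instead of the string itself.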

Still, conduct your own analysis. What has worked for me may not work for you. Within my own office, different applications with different requirements use different hash functions or combinations of hash functions.

I would avoid any unproven hash function. There are as many hash functions as there are people who think they should be writing them. Do your research and test, test, test.

Steve
I implemented a version of your 64-bit hash idea, and preliminary tests went well. I am going to conduct some further tests, but this looks like the best trade-off between memory size and access time for my purposes.
blogsdon
Cool. I like the 64bit hash technique. Which hash functions did you use?
Steve
+1 for actually answering the question and not just recommending a relational database.
lubos hasko
Thanks. Relational databases are great, but there are times when you don't want to use one. I leave it to people's judgement when those times are.
Steve
+2  A: 

Just go get SQLite. You're not likely to beat it, and even if you do, it probably won't be worth the time/effort/complexity.
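
Something along these lines (a sketch using the System.Data.SQLite ADO.NET provider; the database file, table, and column names are made up, and strKey/someIntValue stand for the string key and value from the question):

using System.Data.SQLite;   // ADO.NET provider for SQLite

// Sketch: keep the string -> int mapping in an indexed SQLite table on disk
// instead of an in-memory Dictionary.
using (var conn = new SQLiteConnection("Data Source=keys.db"))
{
    conn.Open();

    using (var cmd = new SQLiteCommand(
        "CREATE TABLE IF NOT EXISTS map (str_key TEXT PRIMARY KEY, int_value INTEGER)", conn))
    {
        cmd.ExecuteNonQuery();
    }

    // insert or update a value
    using (var cmd = new SQLiteCommand(
        "INSERT OR REPLACE INTO map (str_key, int_value) VALUES (@k, @v)", conn))
    {
        cmd.Parameters.AddWithValue("@k", strKey);
        cmd.Parameters.AddWithValue("@v", someIntValue);
        cmd.ExecuteNonQuery();
    }

    // look a value up by its string key
    using (var cmd = new SQLiteCommand(
        "SELECT int_value FROM map WHERE str_key = @k", conn))
    {
        cmd.Parameters.AddWithValue("@k", strKey);
        object result = cmd.ExecuteScalar();   // null if the key is absent
    }
}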

SQLite.