views:

169

answers:

6

I have a 2D MxN grid (or matrix). The cells in the matrix may hold an integer. A cell with a non-zero integer is said to be populated. The set of populated cells in the matrix is known as a "configuration".

I want to come up with an encoding or hashing algorithm that wil allow me to uniquely identify a configuration in the matrix, by computing its encoded value (which should generate a unique number).

I prefer encoding to hashing, since collisions will be totally undesirable.

Can anyone suggest an encoding algorithm I can use to compute a unique "id" for a given configuration?

A: 

So it's an array of 1s and 0s? How about an RLE or LZW compression of that array?

Paul Mitchell
It should be noted that these algorithms can produce a set of data larger than the matrix itself - hardly an "ID".
James Jones
A: 

For a representation which affords exact comparison is not possible to do better than an optimal compression of a sequence of bits which represent the configuration.

If you require MxN boolean values to be uniquely encoded in an integer, you require 2M*N values. Whether that's feasible using your platform's fixed precision integers depends on the size of M and N; if not you'll have to use a bit string or a big integer.

Because the original data is any integer value rather than just 1 or 0, a naive bitstring ID of a naive matrix will give a compression of 8 * sizeof ( matrix::cell_type ). A bitstring optimised for sparse values could be better. Good implementations of sparse bitstrings compress the data, so this will reduce the storage space of the representation, and allow fast exact comparison, which are the requirements.

If the patterns are guaranteed sparse to a certain level, there are optimisations you can do compressing the information, but you need to give more information.

For example, are you using a sparse matrix representation ( banded, diagonal, row compressed etc ), and have access to the internals of the sparse matrix storage then that will naturally and efficiently map to a compressed bitstring.

Looking at your other post, it seems that the matrix is being used as the grid for a game rather than as a matrix. In that case it's probably best to use a run length compression on the bitstring, as this yields another useful property - the encoded bitstring representation of the matrices whose configurations are translations of each other will only differ in the first value of the encoding.

Pete Kirkham
This is not an algorithm - this is merely transferring the entire matrix into another data type.
James Jones
I didn't think it worth writing out the algorithm to test each entry in the matrix and set the corresponding bit in a bitstring. It's only worth using compression on that bitstring if you know that more about the data than what is given.
Pete Kirkham
It isn't worth writing out. I understand what you meant. What I'm saying is, your method is simply transferring the matrix into another form - in this case, a bitstring.
James Jones
The problem with this method is the largest matrix that the question asker could store in a 64-bit integer would be an 8x8 matrix. As I've been saying, you're simply transferring the matrix into a bit string.
James Jones
The OP asked an encoding. An encoding **by definition** is transferring data from one representation to another.
Pete Kirkham
A: 

It's not clear what you want to achieve but maybe a custom bloom filter implementation might be suitable to your problem.

Nick D
A: 

Depending on what problem you're trying to solve, the follwoing comes to mind:

  • use a lookup-table to associate matrices width ids
  • only store the values of the populated fields; their position in the matrix can be encoded either in a seperate per-field value or using a bitmap for the whole matrix
Christoph
+1  A: 

I would suggest using a hashing algorithm that will have a 99.999999999% chance of generating a unique ID. In most scenarios, it is acceptable to have a collision every billionth hash. My suggestion is to use the CRC algorithm, since it generates highly distributed set of hashes and has a relatively low rate of collisions.

James Jones
A: 

It should not matter if there is a collsion or not. Even if there is a collision, you can then continue to check the matrix int by int to see if it in fact is similar.

As long as collisions happen very infrequently the overhead is 0

so a hashing function could be just as simple as to add all int's together. If this is sufficient it depends on the possible values of the ints and how many of them there are (So if in the entire matrix only 1 or 2 cells have a value, this hashing won't work)

Toad