views:

77

answers:

3

Let's say I'm trying to generate a monster for use in a roleplaying game from an arbitrary piece of input data. Think Barcode Battler or a more-recent iPod game whose name escapes me.

It seems to me like the most straightforward way to generate a monster would be to use a hash function on the input data (say, an MP3 file) and use that hash value to pick from some predetermined set of monsters, or use pieces of the hash value to generate statistics for a custom monster.

The question is, are there obvious methods for taking an arbitrary piece of input data and hashing it to one of a fixed set of values? The primary goal of hashing algorithms is, after all, to avoid collisions. Instead, I'm suggesting that we want to guarantee them - that, given a predetermined set of 100 monsters, we want any given MP3 file to map to one of them.

This question isn't bound to a particular language, but I'm working in C#, so that would be my preference for discussion. Thanks!

+4  A: 

Hash the file using any hash function of your choice, convert the result into an integer, and take the result modulo 100.

monsterId = hashResult % 100;

Note that if you later decide to add a new monster and change the code to % 101, nearly all hashes will suddenly map to different monsters.

Mark Byers
Ah, that's the obvious solution I'm missing, thanks. Unfortunately, the issue you point out could be a pretty big liability. It depends on how important it is for the same input data to map to the same monster, as opposed to a purely random one....of course, were that the goal, a RNG would suffice...I'll share any workarounds that leap to mind. :)
djacobson
I think it would be difficult to fix this problem in a clean way. You can certainly hack it without any difficulty (split the input range and make f_new = (hash < x) ? 100 : f_old(x), which would change the output of only x inputs, but the function would become more and more complicated each time you release a new versions. Maybe there is a standard solution to this problem, but I've never seen it before.
Mark Byers
A: 

You can use the MD5, SHA1, or SHA2 of a file as a unique finger print for the file. Each hash function will give you a larger, less overlapping fingerprint and each can be obtained by library functions already in the base libraries.

In truth you could probably hash a much smaller portion of the file, for instance the first 1-3MB of the file and still get a fairly unique fingerprint, without the expense of processing a larger file (like an AVI).

Look in the System.Security namespace for the MD5Crypto provider for an example of how to generate a MD5 from a byte sequence.

Edit: If you want to ensure that the hash collides in a relatively short order you can use CRC2, 4, 6, 8, 16, 32 which will collide fairly frequently (especially CRC2 :)) but be the same for the same file. It is easy to generate.

GrayWizardx
Edited to add collision information
GrayWizardx
CRC is a good suggestion - alas that the .NET API doesn't seem to include a library function for it, as it does for MD5. Time to put my algorithm-implementing pants on...
djacobson
+1  A: 

Okay, that's a very nice question. I would say: don't use hash, because this won't be a nice way for the player to predict patterns. From cognitive theory we know that one thing that is interesting in games is that player can learn by trial and error. So if player gives the input of an image of a red dragon and another image of a red dragon with slightly different pixels, he would like to have the same monster appearing, right? If you use hashes that would not be the case.

Instead, I would recommend doing much simpler things. Imagine that your raw piece of input is just a byte[] , it is itself already a list of numbers. Unfortunately it's only a list of numbers from 0 to 255, so if you for example do an average, you can get 1 number from 0 to 255 . That you could map to a number of monsters already, if you need more, you can read pairs of bytes and just compose Int16, that way you will be able to go up to 65536 possible monsters :)

Kel
Thanks, Kel. That solution might be a better angle of attack than trying to mold a hash function (whose purpose is ultimately to provide a unique output for every input) into producing a fixed set of outputs for all inputs. And if we use large input files (such as MP3s), we've got a lot of flexibility as to what combos of bytes we use to generate the average.
djacobson