tags:

views:

136

answers:

4

What is the simplest way I can hide a sensitive identifier, while providing some equivalent means of identifying the data from outside?

For example, lets say I have a database table with records and one of them is an sensitive ID field.

ID
2A
1S
etc...

then I want to have a second record:

ID    PublicID
2A    AXXX44328
1S    KKKZJSAAS

such that when I am given a PublicID I can always determine what ID it refers to:

H(PublicID) = ID

but nobody else is able to do so.

Also note, that I want to be able to reproduce the string in at least two different locations. So if I have two servers/database, the ID 2A has to map to string AXX44328 on each one of them independently.

I suspect this is like, encryption - with throwing away a public key?

+2  A: 

It's sufficient to generate a random, unique string of some kind and store it in the database as your public ID. Index the table on the public ID and you can easily retrieve the real ID (and other row values) given the public ID. As the database is private, nobody can work out the ID given the public ID.

A simple way to generate the random, unique string is to take a hash (SHA-1 for example) of the real ID + some salt value, e.g.

my $public_id = sha1( $salt . $id );

The $salt value should be a long, random string that is generated once, kept on the server and never revealed publicly. It makes it very difficult (nearly impossible) for an attacker to reverse engineer the real ID from the public ID by brute-forcing the hash (which can be quite easy without a salt, if the ID is short and numeric)

The advantage of this approach is that the same $id will always map to the same $public_id, as long as the $salt value stays constant.


If that is not an option, generate a random key and encrypt the real ID with it, and use the encrypted version as a public ID. You can then decrypt this ID later to get the real ID back.

rjh
Yes, I know it's an option, but I want to be able to reproduce the random string a two different locations. So if I have two servers/database, the ID 2A has to map to string AXX44328 on each one of them independantly.
drozzy
I've re-added the section about taking a sha1 hash. This solves your problem, provided both servers can access the $salt value.
rjh
Sorry - but SHA1 has a possibility of collision, whereas I want that to be 0.
drozzy
The possibility is collision is 2^-160, which is practically zero. To clarify, you will need to generate 10^17 hashes before you even have a 0.0001% chance of collision, and the person who encounters the first SHA-1 collision will be famous.
rjh
Sorry - I don't want 0.00001%, I want 0. Call me picky :-)
drozzy
Then stick a UNIQUE column on the public ID, and if you generate a duplicate, call up Bruce Schneier and he'll give you a million dollars.
rjh
+1  A: 

You didn't specify a programming language. Here's an example in PHP, similar to what RJH suggested with SHA1, but uses a proper symmetric encryption algorithm rather than SHA1, eliminating the (even remote) possibility of collisions:


define('KEY', 'S4mPhZg3rQga');

function encrypt($text)
{
    return base64_encode(mcrypt_encrypt(MCRYPT_RIJNDAEL_256, KEY, $text, MCRYPT_MODE_ECB, mcrypt_create_iv(mcrypt_get_iv_size(MCRYPT_RIJNDAEL_256, MCRYPT_MODE_ECB), MCRYPT_RAND)));
}

function decrypt($text)
{
    return mcrypt_decrypt(MCRYPT_RIJNDAEL_256, KEY, base64_decode($text), MCRYPT_MODE_ECB, mcrypt_create_iv(mcrypt_get_iv_size(MCRYPT_RIJNDAEL_256, MCRYPT_MODE_ECB), MCRYPT_RAND));
}

// example usage:
$C = encrypt('1234');
echo("Public ID: $C\n");

$P = decrypt($C);
echo("Private ID: $P\n");

The value of KEY should be set once, with the same value in both servers, and should never be revealed. You would use encrypt() when displaying data and decrypt() when accepting data from outside. There is no need to actually store the PublicID, you just compute it on the fly.

Alex R
Hmm. I was going to suggest this but had doubts about the security of block ciphers which such small plaintexts. Am I wrong?
rjh
So this key should be part of the source code, for example, on each server?
drozzy
With AES you still have possibility of collisions, no?
drozzy
RJH, I'm pretty sure the large IV mitigates size of plaintext in this case. Keep in mind if the original IDs are very small numbers, they can be probably guessed by exhaustive search anyway.
Alex R
Symmetric encryption is guaranteed round-trip, no collisions.
Alex R
@Alex R, thanks, that makes sense.
rjh
+1  A: 

If your IDs are relatively short (15 bytes or less) then I suggest encrypting them with a block cipher, namely the AES. The AES uses a secret key K, which has length 128, 192 or 256 bits (128 bits are enough). Since AES processes a block of exactly 16 bytes, you have to pad your ID a bit. The "usual" padding (known as "PKCS#5") consists in adding n bytes (n >= 1), all of them having value n, such that the resulting length is appropriate (here, you want a length of 16).

So the transformation of ID (the sensitive data) into S (the scrambled string which can be shown to the public at large) is: S = AESencrypt_K(pad(ID)). The reverse operation is: ID = unpad(AESdecrypt_K(S)). If ID is 16 bytes or more, then encryption will use several invocations of AES, and there are subtleties with regards to how those invocations are linked together. The keyword is chaining mode and the usual answer is "CBC".

Knowledge of the secret key K (the same K) is needed for both operations. This means that whoever can compute S from ID can also compute ID from S, and vice versa.

Now if you need some entities to be able to compute S from ID without giving them the power to do the reverse operation, then things are more complex. In particular, you must not have a deterministic process: if there is a single S which can be computed from ID then anybody can try an exhaustive search on the possible values of ID until a match with a given S is found. So you have to relax the model, in that a given ID may yield a great number of possible scrambled strings S', such that all those S' may be converted back into ID by someone who has the "right" secret value. This is what you would get from asymmetric encryption. The usual asymmetric encryption algorithm is RSA. With a 1024-bit RSA key (a typical size for proper security), ID could have a size up to 117 bytes, and S' will be 128-byte long (the size increase corresponds to the injected random data which makes the process non-deterministic). If 128 bytes are too much, you can get shorter encrypted messages with El-Gamal encryption over elliptic curves (down to about 40 bytes or so, for an up-to-20-byte ID), but you may have a hard time finding an existing implementation.

Thomas Pornin
Don't you mean S=pad(AESencrypt_K(ID))? Because you say "you have to pad your ID"
drozzy
No, you pad the ID _then_ you encrypt. What you suggest is to pad the encryption result, which is meaningless since the point of padding is to get something which can be encrypted, hence you cannot have an encryption result before having padded.
Thomas Pornin
No, what I mean is "S=pad(AESencrypt_K(ID))" is what YOU wrote. In your answer.
drozzy
Woops. Sorry. Edited. Need more coffee.
Thomas Pornin
+1  A: 

Since you want to be able to recreate the identifier on two, disconnected, databases then you'll need to have some kind of shared key.

This is a perfect place for a HMAC. To steal from RFC-2104 by way of Wikipedia:

Let:
H(·) be a cryptographic hash function
K be a secret key padded to the right with extra zeros to the block size of the hash function
m be the message to be authenticated
∥ denote concatenation
⊕ denote exclusive or (XOR)
opad be the outer padding (0x5c5c5c…5c5c, one-block-long hexadecimal constant)
ipad be the inner padding (0x363636…3636, one-block-long hexadecimal constant)

Then HMAC(K,m) is mathematically defined by
HMAC(K,m) = H((K ⊕ opad) ∥ H((K ⊕ ipad) ∥ m)).

But, you don't have to implement it yourself! Use your language of choice's standard library. For example, in Python:

>>> import hmac
>>> hmac.new(key='abc123secret make me long', msg='This is my unique key #1')
<hmac.HMAC instance at 0xb77bdbac>
>>> _.hexdigest()
'c23a224afa917d13fbef58ee14884269'

Now you have a calculable unique ID. Pre-compute as the primary-keys in your database. Lookup as necessary!


As a sidenote, do NOT use salted hash (Google: "don't hash secrets") and do NOT use an encrypted version of your data. The former because of message-extension attacks. The latter because you're unnecessarily exposing the data in a manner that replies solely on your key security.

I'd link with more references, but I'm a new user. :-\

Scott Robinson
The article you reference actually states "In the password example, you can hash a password as long as you salt it correctly." So if you're referring to my answer, I think my usage of a salted hash is correct (it's identical to the example). OTOH, I've also seen PunBB and other applications do `sha1($salt . sha1($salt . $password))` which seems to be an approximation of HMAC.
rjh
A salt *does not* provide authenticity. It only increases the computational complexity necessary to generate a matching message to the hash. And, re-hashing a salted hash provides negligible, if not no, additional benefit (see: http://stackoverflow.com/questions/348109/is-double-hashing-a-password-less-secure-than-just-hashing-it-once).If the ID field is sensitive and-- given the recent comments-- presumably short. Therefore, brute-forcing the hash is a legitimate issue. In which case he should be looking at bcrypt or PBKDF-- and not a salt.
Scott Robinson
If the "salt" is being used to ensure the identifier came from one of the other machines, then it's a matter of authenticity and it calls for an HMAC. A salt, and even double-hashing a salt, is never an approximation of an HMAC. It's a different thing entirely.
Scott Robinson
Will hmac produce collisions? Since it still uses hash functions?
drozzy
Yes, an HMAC "can" collide. But, as I feel wasn't explained very well, "can" is theoretical. If you get collision, a lot more than just your app would break. Hashes are fundamental to almost all unique identifiers, including UUIDs and GUIDs (http://en.wikipedia.org/wiki/Universally_Unique_Identifier). By definition, if you want a unique ID for something from that something that *isn't* that something, then you have two choices: encryption or hashing. Which you use and in what way depends on your security requirements. Can you elaborate on those, as everyone is guessing right now?
Scott Robinson
If you encrypt, then you expose some components of your data. (size, for example) Additionally, if someone gets the key, they'll have your sensitive information.If you HMAC, then you expose significantly less about your data. Additionally, if someone gets the key, they will only be able to make new "valid" encrypted IDs. But, as you're performing a lookup into a database, does this matter?If you only hash, then you can't confirm an ID legitimately came from one of your other systems. But, this might also not be a concern for you?
Scott Robinson
I feel as if this is getting to complicated for my needs. All I need is to have reproducible way of transforming one id string into another. I am not sure if I even NEED it to "non-reversible". I just want something that is easy to use in URLs for example (letters+numbers and no funny chars as in some of my internal ID's).
drozzy
Hash the ID. SHA1 is fine.
Scott Robinson