Having recently started using cryptography in my application, I find myself puzzled by the relationship between the input text length and the ciphertext it results in. Before applying crypto, it was easy to determine the database column size. Now, however, the column size varies slightly.

Two questions:

  1. Am I correct in assuming this is due to the padding of my input, so that it fits the cipher's requirements?
  2. Is there a way to accurately predict the maximum length of the ciphertext based on the maximum length of the input?

And for bonus points: should I store the ciphertext base64-encoded in a varchar, or keep it as raw bytes and store them in a varbinary? Are there risks involved with storing the bytes in my database? (I'm using parameterized queries, so in theory accidental breaking of the escaping should not be an issue.)

TIA!

Supplemental: The cipher I'm using is AES/Rijndael-256 - does this relation vary between the algorithms available?

+1  A: 

From my understanding, in block modes (cbc, ecb) the output length will be rounded up to a multiple of the block size, as returned by mcrypt_enc_get_block_size. Plus, you need to store the IV along with the data, so the total size will be strlen(data) rounded up to the block size, plus mcrypt_enc_get_iv_size().

As for the base64 encoding, I wouldn't bother (but make sure to use hex encoding when dumping your db).

stereofrog
+4  A: 

The relation depends on the padding and the chaining modes you are using, and the algorithm block size (if it is a block cipher).

Some encryption algorithms are stream ciphers which encrypt data "bit by bit" (or "byte by byte"). Most of them produce a key-dependent stream of pseudo-random bytes, and encryption is performed by XORing that stream with the data (decryption is identical). With a stream cipher, the encrypted length is equal to the plain data length.
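To illustrate why the length is preserved, here is a minimal sketch of the XOR step in C. The function name `xor_keystream` is made up for this example, and the keystream is assumed to be supplied by a real stream cipher; generating it is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR `len` bytes of `in` with `keystream` and write the result to `out`.
 * Assumption: `keystream` holds at least `len` bytes produced by a real
 * stream cipher. Producing those bytes is outside this sketch. */
void xor_keystream(const uint8_t *in, const uint8_t *keystream,
                   uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = in[i] ^ keystream[i];
    }
}
```

Calling the function a second time on the output with the same keystream gives back the original bytes, which is why encryption and decryption are the same operation, and the output is always exactly `len` bytes.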

Other encryption algorithms are block ciphers. A block cipher, nominally, encrypts a single block of data of a fixed length. AES is a block cipher with 128-bit blocks (16 bytes). Note that AES-256 also uses 128-bit blocks; the "256" is about the key length, not the block length. The chaining mode is about how the data is to be split into several such blocks (this is not easy to do securely, but CBC mode is fine). Depending on the chaining mode, the data may require some padding, i.e. a few extra bytes added at the end so that the length is appropriate for the chaining mode. The padding must be such that it can be unambiguously removed when decrypting.

With CBC mode, the input data must have a length multiple of the block length, so it is customary to add PKCS#5 padding: if the block length is n, then at least 1 byte is added, at most n, such that the total size is a multiple of n, and the last added bytes (possibly all of them) have numerical value k where k is the number of added bytes. Upon decryption, it suffices to look at the last decrypted byte to recover k and thus know how many padding bytes must be ultimately removed.
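The padding rule described above can be sketched in C roughly as follows. The names `pkcs_pad`, `pkcs_unpad` and `BLOCK_LEN` are made up for this example; a real crypto library normally does this for you. The unpad function reads k from the last byte as described, and additionally checks that all k padding bytes have the value k as a sanity check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN 16 /* block length n in bytes (16 for AES) */

/* Append k bytes of value k after the `len` data bytes in `buf`,
 * where 1 <= k <= BLOCK_LEN. The caller must make sure `buf` has room
 * for len + BLOCK_LEN bytes. Returns the padded length. */
size_t pkcs_pad(uint8_t *buf, size_t len)
{
    size_t k = BLOCK_LEN - (len % BLOCK_LEN);
    memset(buf + len, (int)k, k);
    return len + k;
}

/* Recover k from the last byte and strip the padding.
 * Returns the unpadded length, or (size_t)-1 if the padding is malformed. */
size_t pkcs_unpad(const uint8_t *buf, size_t padded_len)
{
    if (padded_len == 0 || padded_len % BLOCK_LEN != 0) {
        return (size_t)-1;
    }
    size_t k = buf[padded_len - 1];
    if (k < 1 || k > BLOCK_LEN) {
        return (size_t)-1;
    }
    for (size_t i = padded_len - k; i < padded_len; i++) {
        if (buf[i] != k) {
            return (size_t)-1;
        }
    }
    return padded_len - k;
}
```

Note that an input whose length is already a multiple of 16 still gets a full extra block of padding; otherwise the decrypter could not tell padding from data.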

Hence, with CBC mode and AES, assuming PKCS#5 padding, if the input data has length d then the encrypted length is (d + 16) & ~15. I am using C-like notation here; in plain words, the length is between d+1 and d+16, and is a multiple of 16.
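As a small sketch, the formula can be wrapped in a C function (the name `cbc_ciphertext_len` is made up for this example). The cast to `size_t` makes sure the mask is as wide as the length being masked.

```c
#include <stddef.h>

/* Ciphertext length in bytes for AES-CBC with PKCS#5 padding,
 * for a plaintext of d bytes. The IV is not included. */
size_t cbc_ciphertext_len(size_t d)
{
    return (d + 16) & ~(size_t)15;
}
```

A few worked values: d = 0 gives 16, d = 15 gives 16, d = 16 gives 32, and d = 100 gives 112. Since the result never decreases as d grows, plugging in the maximum input length gives the maximum ciphertext length.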

There is a mode called CTR (as "counter") in which the block cipher encrypts successive values of a counter, yielding a stream of pseudo-random bytes. This effectively turns the block cipher into a stream cipher, and thus a message of length d is encrypted into d bytes.

Warning: almost all encryption systems (including stream ciphers) and modes require an extra value called the IV (Initial Value, also known as the initialization vector). Each message shall have its own IV, and no two messages encrypted with the same key shall use the same IV. Some modes have extra requirements; in particular, for both CBC and CTR, the IV shall be selected randomly and uniformly with a cryptographically strong pseudo-random number generator. The IV is not secret, but must be known by the decrypter. Since each message gets its own IV, it is often necessary to store the IV along with the encrypted message. With CBC or CTR, the IV has length n, so, for AES, that's an extra 16 bytes. I do not know what mcrypt does with the IV, but, cryptographically speaking, the IV must be managed at some point.
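For sizing a database column, the IV can simply be added to the ciphertext length. The sketch below assumes the IV is stored in the same field as the ciphertext (for example, prepended to it); the function names are made up for this example. If the IV goes into its own column, drop the extra 16 bytes.

```c
#include <stddef.h>

#define AES_BLOCK_LEN 16 /* AES block length; also the IV length for CBC and CTR */

/* Bytes needed to store IV + AES-CBC ciphertext (PKCS#5 padding)
 * for a plaintext of at most max_plain_len bytes. */
size_t cbc_stored_len(size_t max_plain_len)
{
    size_t ciphertext = (max_plain_len / AES_BLOCK_LEN + 1) * AES_BLOCK_LEN;
    return AES_BLOCK_LEN + ciphertext;
}

/* Same for CTR mode: no padding, so just the IV plus the plaintext length. */
size_t ctr_stored_len(size_t max_plain_len)
{
    return AES_BLOCK_LEN + max_plain_len;
}
```

For example, a plaintext of up to 255 bytes needs 272 bytes with CBC (16 for the IV plus 256 of ciphertext) and 271 bytes with CTR.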

As for Base64, it is good for transferring binary data over text-only media, but this should not be necessary for a proper database. Also, Base64 enlarges data by about 33%, so it should not be applied blindly. I think you are best avoiding Base64 here.
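If you do end up using Base64, the encoded size is easy to predict. This sketch assumes standard Base64 with "=" padding and no line breaks; the function name `base64_len` is made up for this example.

```c
#include <stddef.h>

/* Number of characters produced by standard, padded Base64
 * for n input bytes: every group of up to 3 bytes becomes 4 characters. */
size_t base64_len(size_t n)
{
    return ((n + 2) / 3) * 4;
}
```

So 48 bytes of ciphertext become 64 characters, which is where the 33% figure comes from; for short inputs the overhead is proportionally larger (1 byte becomes 4 characters).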

Thomas Pornin
+1 @Thomas, good explanation. Isn't it possible to skip the IV for AES CBC mode if you generate a new session key for every file you wish to encrypt? Thanks
Raj
If you generate a new key for each file, then you can use a conventional IV (e.g. an all-zero IV) which does not need to be encoded. But generating a new random secret key for each file is at least as difficult as generating a new random IV for each file. Whichever is best depends on the situation.
Thomas Pornin
Thank you for your very detailed answer - I'm storing the IV in a separate column from the encrypted data, in the same record, because the IV is a fixed length. I'll stay away from Base64'ing the ciphertext.
kander
A: 

For AES CBC block cipher with PKCS#5 padding,

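#define BLOCKSIZE 16  /* AES block size in bytes */

/* PKCS#5 always adds 1 to BLOCKSIZE bytes of padding, so round up to the next full block. */
size_t CipherTextLen = (PlainTxtLen / BLOCKSIZE + 1) * BLOCKSIZE;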

This doesn't take the initialisation vector into account.

Raj