Having recently started using cryptography in my application, I find myself puzzled by the relationship between the input text length and the ciphertext it results in. Before applying crypto, it was easy to determine the database column size. Now, however, the column size varies slightly.
Two questions:
And for bonus points: should I store the ciphertext base64-encoded in a varchar, or keep it as raw bytes and store them in a varbinary? Are there risks involved with storing the bytes in my database (I'm using parameterized queries, so in theory accidental breaking of the escaping should not be an issue)?
TIA!
Supplemental: The cipher I'm using is AES/Rijndael-256 - does this relation vary between the algorithms available?
The relation depends on the padding and the chaining modes you are using, and the algorithm block size (if it is a block cipher).
Some encryption algorithms are stream ciphers which encrypt data "bit by bit" (or "byte by byte"). Most of them produce a key-dependent stream of pseudo-random bytes, and encryption is performed by XORing that stream with the data (decryption is identical). With a stream cipher, the encrypted length is equal to the plain data length.
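To make the length property concrete, here is a minimal sketch of the XOR construction described above. The keystream generator is a toy built from SHA-256 and a counter purely for illustration (the names `toy_keystream` and `xor_stream` are made up for this example); it is not a vetted cipher and should not be used to protect real data.

```python
import hashlib


def toy_keystream(key: bytes, length: int) -> bytes:
    """Toy key-dependent byte stream: SHA-256(key || counter), concatenated.

    Illustration only; a real application should use a vetted stream cipher.
    """
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_stream(key: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream; the same call encrypts and decrypts."""
    stream = toy_keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


message = b"hello world"
ciphertext = xor_stream(b"secret key", message)
print(len(message), len(ciphertext))      # both lengths are the same
print(xor_stream(b"secret key", ciphertext) == message)
```

Whatever the keystream source, the output has exactly as many bytes as the input, because XOR works byte for byte.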
Other encryption algorithms are block ciphers. A block cipher, nominally, encrypts a single block of data of a fixed length. AES is a block cipher with 128-bit blocks (16 bytes). Note that AES-256 also uses 128-bit blocks; the "256" is about the key length, not the block length. The chaining mode is about how the data is to be split into several such blocks (this is not easy to do securely, but CBC mode is fine). Depending on the chaining mode, the data may require some padding, i.e. a few extra bytes added at the end so that the length is appropriate for the chaining mode. The padding must be such that it can be unambiguously removed when decrypting.
With CBC mode, the input data must have a length that is a multiple of the block length, so it is customary to add PKCS#5 padding (formally called PKCS#7 when the block is 16 bytes; the scheme is the same): if the block length is n, then at least 1 byte is added, at most n, such that the total size is a multiple of n, and the added bytes all have numerical value k, where k is the number of added bytes. Upon decryption, it suffices to look at the last decrypted byte to recover k and thus know how many padding bytes must be removed.
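The padding rule can be sketched in a few lines. This is only an illustration of the scheme as described above, with hypothetical helper names `pkcs_pad` and `pkcs_unpad`; in practice the crypto library normally does this for you.

```python
def pkcs_pad(data: bytes, block_size: int = 16) -> bytes:
    """Append k bytes of value k, with 1 <= k <= block_size."""
    k = block_size - (len(data) % block_size)
    return data + bytes([k]) * k


def pkcs_unpad(padded: bytes, block_size: int = 16) -> bytes:
    """Read k from the last byte, check the padding, and strip it."""
    if not padded or len(padded) % block_size != 0:
        raise ValueError("padded length must be a positive multiple of the block size")
    k = padded[-1]
    if k < 1 or k > block_size or padded[-k:] != bytes([k]) * k:
        raise ValueError("invalid padding")
    return padded[:-k]


print(len(pkcs_pad(b"abc")))        # 16: 3 data bytes + 13 padding bytes
print(len(pkcs_pad(b"a" * 16)))     # 32: a full extra block is added
print(pkcs_unpad(pkcs_pad(b"abc")))
```

Note that an input that is already a multiple of the block size still gets a full block of padding; otherwise the last data byte could be mistaken for a padding count.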
Hence, with CBC mode and AES, assuming PKCS#5 padding, if the input data has length d then the encrypted length is (d + 16) & ~15. I am using C-like notation here; in plain words, the length is between d+1 and d+16, and is a multiple of 16.
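For sizing a database column, the formula can be evaluated directly. The sketch below (the function name `cbc_ciphertext_length` is invented for this example) generalizes the constant 16 to any power-of-two block size and prints a few values around a block boundary.

```python
def cbc_ciphertext_length(d: int, block_size: int = 16) -> int:
    """Length of CBC ciphertext with PKCS#5-style padding, IV not included.

    Assumes block_size is a power of two, as it is for AES (16 bytes).
    """
    return (d + block_size) & ~(block_size - 1)


for d in (0, 1, 15, 16, 17, 31, 32):
    print(d, cbc_ciphertext_length(d))
# 0 16
# 1 16
# 15 16
# 16 32
# 17 32
# 31 32
# 32 48
```

So a plaintext column of at most 100 bytes needs 112 bytes of ciphertext with AES in CBC mode, before accounting for any IV stored alongside it.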
There is a mode called CTR (as "counter") in which the block cipher encrypts successive values of a counter, yielding a stream of pseudo-random bytes. This effectively turns the block cipher into a stream cipher, and thus a message of length d is encrypted into d bytes.
Warning: almost all encryption systems (including stream ciphers) and modes require an extra value called the IV (Initialization Vector). Each message shall have its own IV, and no two messages encrypted with the same key shall use the same IV. Some modes have extra requirements; in particular, for both CBC and CTR, the IV shall be selected randomly and uniformly with a cryptographically strong pseudo-random number generator. The IV is not secret, but must be known by the decrypter. Since each message gets its own IV, it is often necessary to store or transmit the IV along with the encrypted message. With CBC or CTR, the IV has length n, so, for AES, that's an extra 16 bytes. I do not know what mcrypt does with the IV, but, cryptographically speaking, the IV must be managed at some point.
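As a rough sizing aid, the sketch below assumes the IV is simply stored in front of the ciphertext in the same column. That layout is a common convention, not something mcrypt is known to do for you, and the function names are hypothetical. The IV is drawn from Python's `secrets` module, which uses the operating system's cryptographically strong generator.

```python
import secrets

BLOCK_SIZE = 16  # AES block size in bytes, also the IV length for CBC and CTR


def new_iv() -> bytes:
    """Fresh random IV; generate a new one for every message."""
    return secrets.token_bytes(BLOCK_SIZE)


def stored_length_cbc(d: int) -> int:
    """Bytes needed for IV + CBC ciphertext with PKCS#5-style padding."""
    return BLOCK_SIZE + ((d + BLOCK_SIZE) & ~(BLOCK_SIZE - 1))


def stored_length_ctr(d: int) -> int:
    """Bytes needed for IV + CTR ciphertext (no padding)."""
    return BLOCK_SIZE + d


print(len(new_iv()))            # 16
print(stored_length_cbc(100))   # 128
print(stored_length_ctr(100))   # 116
```

If the IV goes in its own column instead, subtract 16 from these figures and size that column at 16 bytes.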
As for Base64, it is good for transferring binary data over text-only media, but this should not be necessary for a proper database. Also, Base64 enlarges data by about 33%, so it should not be applied blindly. I think you are best avoiding Base64 here.
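If you do end up with a varchar column anyway, the Base64 length is easy to compute: every 3 input bytes become 4 output characters, with the last group padded using `=`. The helper name `base64_length` is invented for this sketch, which checks the formula against Python's standard `base64` module.

```python
import base64


def base64_length(n: int) -> int:
    """Number of characters in standard padded Base64 for n input bytes."""
    return 4 * ((n + 2) // 3)


# Cross-check the formula against the standard library encoder.
for n in range(100):
    assert len(base64.b64encode(bytes(n))) == base64_length(n)

print(base64_length(112))   # 152 characters for 112 bytes of ciphertext
print(base64_length(128))   # 172 characters for 128 bytes (IV + ciphertext)
```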