Module crypto::aessafe [] [src]

The aessafe module implements the AES algorithm completely in software without using any table lookups or other timing dependant mechanisms. This module actually contains two seperate implementations - an implementation that works on a single block at a time and a second implementation that processes 8 blocks in parallel. Some block encryption modes really only work if you are processing a single blocks (CFB, OFB, and CBC encryption for example) while other modes are trivially parallelizable (CTR and CBC decryption). Processing more blocks at once allows for greater efficiency, especially when using wide registers, such as the XMM registers available in x86 processors.

AES Algorithm

There are lots of places to go to on the internet for an involved description of how AES works. For the purposes of this description, it sufficies to say that AES is just a block cipher that takes a key of 16, 24, or 32 bytes and uses that to either encrypt or decrypt a block of 16 bytes. An encryption or decryption operation consists of a number of rounds which involve some combination of the following 4 basic operations:

  • ShiftRows
  • MixColumns
  • SubBytes
  • AddRoundKey

Timing problems

Most software implementations of AES use a large set of lookup tables - generally at least the SubBytes step is implemented via lookup tables; faster implementations generally implement the MixColumns step this way as well. This is largely a design flaw in the AES implementation as it was not realized during the NIST standardization process that table lookups can lead to security problems [1]. The issue is that not all table lookups occur in constant time - an address that was recently used is looked up much faster than one that hasn't been used in a while. A careful adversary can measure the amount of time that each AES operation takes and use that information to help determine the secret key or plain text information. More specifically, its not table lookups that lead to these types of timing attacks - the issue is table lookups that use secret information as part of the address to lookup. A table lookup that is performed the exact same way every time regardless of the key or plaintext doesn't leak any information. This implementation uses no data dependant table lookups.

Bit Slicing

Bit Slicing is a technique that is basically a software emulation of hardware implementation techniques. One of the earliest implementations of this technique was for a DES implementation [4]. In hardware, table lookups do not present the same timing problems as they do in software, however they present other problems - namely that a 256 byte S-box table takes up a huge amount of space on a chip. Hardware implementations, thus, tend to avoid table lookups and instead calculate the contents of the S-Boxes as part of every operation. So, the key to an efficient Bit Sliced software implementation is to re-arrange all of the bits of data to process into a form that can easily be applied in much the same way that it would be in hardeware. It is fortunate, that AES was designed such that these types of hardware implementations could be very efficient - the contents of the S-boxes are defined by a mathematical formula.

A hardware implementation works on single bits at a time. Unlike adding variables in software, however, that occur generally one at a time, hardware implementations are extremely parallel and operate on many, many bits at once. Bit Slicing emulates that by moving all "equivalent" bits into common registers and then operating on large groups of bits all at once. Calculating the S-box value for a single bit is extremely expensive, but its much cheaper when you can amortize that cost over 128 bits (as in an XMM register). This implementation follows the same strategy as in [5] and that is an excellent source for more specific details. However, a short description follows.

The input data is simply a collection of bytes. Each byte is comprised of 8 bits, a low order bit (bit 0) through a high order bit (bit 7). Bit slicing the input data simply takes all of the low order bits (bit 0) from the input data, and moves them into a single register (eg: XMM0). Next, all of them 2nd lowest bits are moved into their own register (eg: XMM1), and so on. After completion, we're left with 8 variables, each of which contains an equivalent set of bits. The exact order of those bits is irrevent for the implementation of the SubBytes step, however, it is very important for the MixColumns step. Again, see [5] for details. Due to the design of AES, its them possible to execute the entire AES operation using just bitwise exclusive ors and rotates once we have Bit Sliced the input data. After the completion of the AES operation, we then un-Bit Slice the data to give us our output. Clearly, the more bits that we can process at once, the faster this will go - thus, the version that processes 8 blocks at once is roughly 8 times faster than processing just a single block at a time.

The ShiftRows step is fairly straight-forward to implement on the Bit Sliced state. The MixColumns and especially the SubBytes steps are more complicated. This implementation draws heavily on the formulas from [5], [6], and [7] to implement these steps.


Both implementations work basically the same way and share pretty much all of their code. The key is first processed to create all of the round keys where each round key is just a 16 byte chunk of data that is combined into the AES state by the AddRoundKey step as part of each encryption or decryption round. Processing the round key can be expensive, so this is done before encryption or decryption. Before encrypting or decrypting data, the data to be processed by be Bit Sliced into 8 seperate variables where each variable holds equivalent bytes from the state. This Bit Sliced state is stored as a Bs8State, where T is the type that stores each set of bits. The first implementation stores these bits in a u32 which permits up to 8 * 32 = 1024 bits of data to be processed at once. This implementation only processes a single block at a time, so, in reality, only 512 bits are processed at once and the remaining 512 bits of the variables are unused. The 2nd implementation uses u32x4s - vectors of 4 u32s. Thus, we can process 8 * 128 = 4096 bits at once, which corresponds exactly to 8 blocks.

The Bs8State struct implements the AesOps trait, which contains methods for each of the 4 main steps of the AES algorithm. The types, T, each implement the AesBitValueOps trait, which containts methods necessary for processing a collection or bit values and the AesOps trait relies heavily on this trait to perform its operations.

The Bs4State and Bs2State struct implement operations of various subfields of the full GF(28) finite field which allows for efficient computation of the AES S-Boxes. See [7] for details.


[1] - "Cache-Collision Timing Attacks Against AES". Joseph Bonneau and Ilya Mironov. [2] - "Software mitigations to hedge AES against cache-based software side channel vulnerabilities". Ernie Brickell, et al. [3] - "Cache Attacks and Countermeasures: the Case of AES (Extended Version)". Dag Arne Osvik, et al.‎. [4] - "A Fast New DES Implementation in Software". Eli Biham. [5] - "Faster and Timing-Attack Resistant AES-GCM". Emilia K ̈asper and Peter Schwabe. [6] - "FAST AES DECRYPTION". Vinit Azad. search? Vinit_Azad_MS_Report.doc%3Fsequence%3D2+&cd=4&hl=en&ct=clnk&gl=us&client=ubuntu. [7] - "A Very Compact Rijndael S-box". D. Canright.