Performance comparison between AES256 GCM vs AES 256 SHA 256

Question

I understand GCM Crypto uses ESP Encryption only for ESP and Authentication algorithm. whereas AES 256 SHA256 uses AES for ESP Encrypt and SHA256 for Auth algorithm.

Could someone help clarify the reason of getting a better performance with AES256GCM as compared to ASE256SHA256.

While not a complete answer, this article may shed some light on it. — , Jul 04 '20 at 08:05

score 5 · Answer 1 · answered Jul 04 '20 at 08:37

AES256SHA256 uses AES in CBC mode for encryption. CBC mode encrypts a block of data by first xoring the block with the ciphertext of the previous block (or the IV) and then passing this through the block cipher. This means that the encryption of a block is dependent on the ciphertext of the previous block so, encryption of a complete message has to proceed sequentially.

GCM, in contrast, uses counter mode for encryption. It encrypts a counter and xors the encryption of the counter with the plaintext. Since the encryption of each block is independent of other blocks, the encryption of multiple blocks can be parallelized.

The fact that GCM can be parallelized while CBC cannot is probably what is causing the performance difference you are experiencing.

Z.T. · Answer 2 · 2020-07-04T11:58:49.337

A cryptographic mode of operation has two parts: encryption/decryption and message authentication (anti tampering). You must use both, because encryption is malleable. Every message sent and received must be fully processed by both.

The possible options for encryption/decryption:

AES CBC

CBC mode has an inherent limitation that in the encryption direction (decryption doesn't have this problem) the 16 byte blocks need to be processed one after another, limiting parallelism. When you need to process many messages / streams in parallel, you can get enough parallelism to saturate AESNI units, but that depends on traffic pattern.

AES CTR

CTR mode and modes based on CTR (like GCM and CCM), on the other hand, can process all blocks in parallel. This means that with long enough messages, a single stream of data can saturate AESNI units. 768 bytes is long enough.

AES is either slow or insecure without special hardware support. Luckily, since ~all x86 chips since 2010 have AESNI, AES is fast on x86. On the order of a few gigabytes (not gigabits) per second per core, when AES blocks can be processed in parallel (so CTR and CTR based, not CBC).

ChaCha20

ChaCha20 exists to be fast on chips that don't have hardware AES, like phones and tablets. On x86 (for example, on the other side of a connection from a phone) it is fast enough, maybe 1.6GB/sec. On phones and tablets, something like 100MB/sec, depending on the micro-architecture and the power/cooling budget.

The possible options for MAC:

HMAC SHA

HMAC is a bit more complicated than the raw hash function, but for longer messages it is just a bit slower than the raw hash function. SHA-256 is slow, on the order of 400MB/sec. With AVX when processing parallel streams or with Intel SHA Extensions, it can be ok, up to a few gigabytes per second per core (e.g. see this). The SHA instructions are new, not common. Support is bad. Intel's IPSec library supports them though.

CBCMAC (AES-XCBC-MAC-96 or CCM)

CBCMAC uses the same AES execution units to process the same amount of data, so it at best halves your throughput compared to just encryption/decryption. It uses CBC internally, so it's slower than CTR. The advantage is that a single accelerator circuit can be used for both encryption and MAC, saving silicon area / cost, and also lower complexity (chance of bugs).

GMAC (for GCM)

GCM uses GMAC for MAC (and CTR for encryption). GMAC is slow without hardware support in the form of CLMUL instructions. Luckily ~all x86 chips since 2010 have that.

Poly1305

Poly1305 was invented to be fast on chips that don't have special hardware like SHA extensions or CLMUL, like phones and tablets. It's fast enough there (maybe 100MB/sec) and on x86 (maybe 2GB/sec).

In theory, the encryption and MAC modes can be mixed and matched, but in practice they are used in only a few combinations.

Now, how to combine the performance numbers for the encryption part with the performance numbers for the MAC?

If they cannot be overlapped, you need to use (X*Y)/(X+Y) where X and Y are the performance of the two individual parts in the same units (e.g. MB/sec). Thus, for example, 520MB/sec encryption and 670MB/sec MAC leads to 293MB/sec combined throughout.

But in some cases they can be overlapped and then you might get the performance of the slowest of the two, with the other being "free". This is the case with GCM, which can run at the same speed as CTR, with GMAC being effectively for free. This is why it is popular. TLS and SSH switched to GCM (with ChaPoly for mobile). WireGuard is ChaPoly-only, for simplicity.

The above is generalities. I would suggest you always run your own benchmarks of the exact implementation on the exact hardware you intend to use. Message sizes play a role, processing multiple message streams in parallel plays a role and ability to overlap encryption and MAC plays a role.

Performance comparison between AES256 GCM vs AES 256 SHA 256

2 Answers2