

# Practical Use of Finite Fields and their Performance on Modern CPUs GHASH in AES-GCM

**Daniel Kales** 

October 31st , 2019

> www.laik.tugraz.at

# Outline

1 AES-GCM

2 GHASH

3 Finite Field Multiplication on x86 CPUs

4 Performance

# **AES-GCM**

### Introduction

- Authenticated Encryption (AE)
  - Very important building block in TLS
  - TLS 1.3: Only AE modes allowed
- AE modes in TLS 1.3
  - AES-{128,256}-GCM
  - ChaCha20-Poly1305
  - AES-128-CCM

#### **AES-GCM**

- AES in Galois Counter Mode
- Combination of:
  - AES in Counter Mode
  - GHASH authenticator



www.iaik.tugraz.at ■
GHASH

# **GHASH**

## Universal Hash Functions

- Concept by Carter and Wegman in 1977<sup>1</sup>
  - Family of hash funtions  $H = \{h : U \mapsto [m]\}$  is called universal if

$$\forall x, y \in U, x \neq y : \Pr_{h \in H}[h(x) = h(y)] = \frac{1}{m}$$

- In other words, collision probability is as low as if hash values are truly randomly assigned for each key.
- Cyptographic properties such as preimage resistance not required!

<sup>&</sup>lt;sup>1</sup>Larry Carter and Mark N. Wegman, Universal Classes of Hash Functions, Proceedings of the 9th Annual ACM Symposium on Theory of Computing, 1977

## Authenticators from Universal Hash Functions

Building a message authentication code (MAC) from universal hash functions:

- Use universal hash function (selected by secret key) and hash message to short digest
- 2. Encrypt short digest by adding a one-time key

Provably secure in the information theoretic setting!

- one-time key not good for usability
  - replaced by function of nonce and secret key

#### **GHASH**

Polynomial based on input message blocks  $S_i$  (in AES-GCM this is the ciphertext), evaluated at secret key  $H = \mathsf{AES}_k(0^{128})$ 

$$\mathsf{GHASH}(H, A, C) = X_{m+n+1}$$

$$X_i = \sum_{j=1}^i S_j \cdot H^{i-j+1} = \begin{cases} 0 & \text{for } i = 0 \\ (X_{i-1} + S_i) \cdot H & \text{otherwise.} \end{cases}$$

Second form is iterative and used in most cases.

#### **AES-GCM**

- Look at structure of GHASH
  - Iterative form
  - final "one-time" key addition



Finite Field Multiplication on x86 CPUs

# Finite Field Multiplication on x86 CPUs

#### **PCLMULQDQ**

- Specialized CPU instruction
- Carryless Multiplication
  - 64-bit  $\times$  64-bit  $\mapsto$  128-bit
- Carryless multiplication is equivalent to multiplication of polynomials over field GF(2)
  - $GF(2^n)$  is usually represented as polynomials in GF(2)[X]
  - Finite field multiplication is polynomial multiplication
    - followed by reduction with respect to some irreducible polynomial

# PCLMULQDQ (cont.)

Representation of polynomial coefficients as bitstring:

$$a_{n-1}X^{n-1} + \dots + a_1X + a_0 \leftrightarrow a_{n-1}||\dots||a_1||a_0$$

Example:

$$X^7 + X^3 + X + 1 \leftrightarrow 10001011$$

Store one GF(2<sup>128</sup>) element in a 128-bit CPU register

| 127 64                         | 64 63 0                          |  |
|--------------------------------|----------------------------------|--|
| $11001100101\cdots 1101010010$ | $011111110001 \cdots 1010001001$ |  |

# PCLMULQDQ (cont.)

PCLMULQDQ takes two 128-bit registers and a selection value imm

# Multiplication of $GF(2^{128})$ elements

#### Combine the result of 4 sub-multiplications

| 127                      | 64 63                 | 0                      |
|--------------------------|-----------------------|------------------------|
| xmm1 11001100101 · · · 1 | 101010010 01111111000 | $1 \cdots 1010001001$  |
|                          |                       |                        |
| 127                      | 64 63                 | 0                      |
| xmm2 01001011101 · · · 0 | 101011111 0110110001  | $1 \cdots 01011111101$ |
|                          |                       |                        |

#### Modular Reduction

- Result needs to be reduced modulo the irreducible polynomial
  - No specialized instruction
- Option 1: Standard long division
  - Slow and tedious to implement
- Option 2: efficient reduction algorithm
  - Realizes division using only 2 multiplications
  - Special form of Barret reduction algorithm<sup>2</sup>

<sup>&</sup>lt;sup>2</sup>P. Barrett, Implementing the Rivest, Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor, Master's Thesis, University of Oxford, UK, 1986

#### Efficient Modular Reduction

• We have a 256-bit polynomial a(X) and want to reduce it modulo g(X):

$$r(X) = a(X) \bmod g(X)$$
.

use linearity to split calculation in two halves:

$$\begin{split} r(X) &= \left(c(X) \cdot X^{128} + b(X)\right) \bmod p(X) = c(X) \cdot X^{128} \bmod p(X) + b(X) \bmod p(X) \\ &= c(X) \cdot X^{128} \bmod p(X) + b(X) \,. \end{split}$$

- Focus on effcient calculation of  $p(X) = c(X) \cdot X^{128} \mod p(X)$ 
  - "Reduce upper half of 256-bit register and xor result with lower half"

# Efficient Modular Reduction (cont.)

$$p(X) = c(X) \cdot X^{128} \bmod g(X) = g(X) \cdot q(X) \bmod X^{128} \,,$$

where q(X) is the result of the division of  $c(X) \cdot X^{128}$  by g(X):

$$c(X) \cdot X^{128} = g(X) \cdot q(X) + p(X)$$
.

## Why does the first equality hold?

Since the 128 least significant terms of  $c(X) \cdot X^{128}$  are zero, the least significant 128 bits of  $g(X) \cdot q(X)$  and p(X) must be equal, so they cancel to zero in the addition. This reduces the modular reduction to finding the quotient q(X) and a finite field multiplication (the final reduction modulo  $X^{128}$  is equivalent to taking the lower 128 bit of the result).

# Efficient Modular Reduction (cont.)

$$p(X) = c(X) \cdot X^{128} \mod g(X) = g(X) \cdot q(X) \mod X^{128}$$

where q(X) is the result of the division of  $c(X) \cdot X^{128}$  by g(X):

$$c(X) \cdot X^{128} = g(X) \cdot q(X) + p(X)$$
.

## Why does the first equality hold?

Since the 128 least significant terms of  $c(X) \cdot X^{128}$  are zero, the least significant 128 bits of  $g(X) \cdot q(X)$  and p(X) must be equal, so they cancel to zero in the addition. This reduces the modular reduction to finding the quotient q(X) and a finite field multiplication (the final reduction modulo  $X^{128}$  is equivalent to taking the lower 128 bit of the result).

# Efficient Modular Reduction (cont.)

How to find q(X)?

$$q(X) = MSB(c(X) \cdot q^{+}(X)),$$

where  $q^+(X)$  is the result of the division of  $X^{256}$  by g(X) (precompute once):

$$X^{256} = g(X) \cdot q^{+}(X) + p^{+}(X).$$

This is similar to the calculation of the quotient in the Barrett reduction algorithm. For more details on this step, refer to the Intel White Paper<sup>3</sup>.

 $<sup>^3</sup> https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf$ 

# Efficient Modular Reduction Algorithm

- Precompute  $q^+(X)$  for the given irreducible polynomial g(X) Modular reduction algorithm:
- 1. Multiply c(X), the upper half of the input, with  $q^+(X)$
- 2. Take the upper 128-bit half of the result and multiply it with  $g(\boldsymbol{X})$
- 3. Add b(X), the lower half of the input to the result of the calculation
- 4. Return the 128 least significant bits as the reduced result

# Performance

## Performance of AES-GCM

| Features           | Throughput [MB/s] |
|--------------------|-------------------|
| (CT) Software only | 67.24             |
| AES-NI             | 224.20            |
| PCLMULQDQ          | 87.73             |
| Both               | 1013.63           |

Table: Performance of AES-128-GCM on Intel Xeon E3-1220 @ 3.1GHz4

Without specialized instructions for PCLMULQDQ, the calculation of GHASH bottlenecks the AES computations with AES-NI.

<sup>&</sup>lt;sup>4</sup>Based on Thomas Pornin's BearSSL benchmarks: https://www.bearssl.org/speed.html

Questions?

# Questions?