Embeddable, dependency-free C11 implementation of ML-KEM from the FIPS 203 initial public draft (IPD) with scalar, AVX-512, and Neon backends.

FIPS 203 is (or will be) NIST's standardized version of Kyber, a post-quantum key encapsulation mechanism (KEM).

Notes on this implementation:

Uses constant-time Barrett reduction. Not vulnerable to KyberSlash.
Multiple SIMD-accelerated backends: AVX-512 and Neon.
Coefficients are reduced modulo Q during polynomial deserialization, as per this discussion.
Builds with GCC and Clang on x86-64 and ARM in Linux and MacOS.
Test suite is built-in to fips203ipd.c (see bottom of file) and compiles with common sanitizers enabled.
Includes (and passes) test vectors from the NIST PQC Intermediate Values for Draft ML-KEM.
Full Doxygen API documentation. Available online here.
Randomness for keygen() and encaps() is specified as a function parameter.
Uses my SHA-3 implementation internally.

Use make to build a minimal self test application, make doc to build the HTML-formatted API documentation, and make test to run the test suite.

Example

Minimal example of Alice and Bob exchanging a shared secret with KEM512:

//
// hello.c: minimal example of a two parties "alice" and "bob"
// generating a shared secret with KEM512.
//
#include <stdio.h> // fputs()
#include <string.h> // memcmp()
#include "hex.h" // hex_write()
#include "rand-bytes.h" // rand_bytes()
#include "fips203ipd.h" // fips203ipd_*()
 
int main(void) {
  //
  // alice: generate keypair
  //
  uint8_t ek[FIPS203IPD_KEM512_EK_SIZE] = { 0 }; // encapsulation key
  uint8_t dk[FIPS203IPD_KEM512_DK_SIZE] = { 0 }; // decapsulation key
  {
    // alice: get 64 random bytes for keygen()
    uint8_t keygen_seed[64] = { 0 };
    rand_bytes(keygen_seed, sizeof(keygen_seed));
 
    fputs("alice: keygen random (64 bytes) = ", stdout);
    hex_write(stdout, keygen_seed, sizeof(keygen_seed));
    fputs("\n", stdout);
 
    // alice: generate encapsulation/decapsulation key pair
    fips203ipd_kem512_keygen(ek, dk, keygen_seed);
  }
  fputs("alice: generated encapsulation key `ek` and decapsulation key `dk`:\n", stdout);
  printf("alice: ek (%d bytes) = ", FIPS203IPD_KEM512_EK_SIZE);
  hex_write(stdout, ek, sizeof(ek));
  printf("\nalice: dk (%d bytes) = ", FIPS203IPD_KEM512_DK_SIZE);
  hex_write(stdout, dk, sizeof(dk));
  fputs("\n", stdout);
 
  // alice send `ek` to bob
  fputs("alice: sending encapsulation key `ek` to bob\n\n", stdout);
 
  //
  // bob: generate shared secret and ciphertext
  //
  uint8_t b_key[32] = { 0 }; // shared secret
  uint8_t ct[FIPS203IPD_KEM512_CT_SIZE] = { 0 }; // ciphertext
  {
    // bob: get 32 random bytes for encaps()
    uint8_t encaps_seed[32] = { 0 };
    rand_bytes(encaps_seed, sizeof(encaps_seed));
 
    fputs("bob: encaps random (32 bytes) = ", stdout);
    hex_write(stdout, encaps_seed, sizeof(encaps_seed));
    fputs("\n", stdout);
 
    // bob:
    // 1. get encapsulation key `ek` from alice.
    // 2. generate random shared secret.
    // 3. use `ek` from step #1 to encapsulate the shared secret from step #2.
    // 3. store the shared secret in `b_key`.
    // 4. store the encapsulated shared secret (ciphertext) in `ct`.
    fips203ipd_kem512_encaps(b_key, ct, ek, encaps_seed);
  }
 
  fputs("bob: generated secret `b_key` and ciphertext `ct`:\nbob: b_key (32 bytes) = ", stdout);
  hex_write(stdout, b_key, sizeof(b_key));
  printf("\nbob: ct (%d bytes) = ", FIPS203IPD_KEM512_CT_SIZE);
  hex_write(stdout, ct, sizeof(ct));
  fputs("\n", stdout);
 
  // bob sends ciphertext `ct` to alice
  fputs("bob: sending ciphertext `ct` to alice\n\n", stdout);
 
  //
  // alice: decapsulate shared secret
  //
 
  // alice:
  // 1. get ciphertext `ct` from bob.
  // 2. use decapsulation key `dk` to decapsulate shared secret from `ct`.
  // 2. store shared secret in `a_key`.
  uint8_t a_key[32] = { 0 };
  fips203ipd_kem512_decaps(a_key, ct, dk);
 
  fputs("alice: used `dk` to decapsulate secret from `ct` into `a_key`:\nalice: a_key (32 bytes) = ", stdout);
  hex_write(stdout, a_key, sizeof(a_key));
  fputs("\n\n", stdout);
 
  // check result
  // (note: don't use memcmp() in real applications; it's not constant-time)
  if (!memcmp(a_key, b_key, sizeof(a_key))) {
    // success: alice and bob have the same shared secret
    fputs("SUCCESS! alice secret `a_key` and bob secret `b_key` match.\n", stdout);
    return 0;
  } else {
    // failure: alice and bob do not have the same shared secret
    fputs("FAILURE! alice secret `a_key` and bob secret `b_key` do not match.\n", stdout);
    return -1;
  }
}

See examples/0-hello-kem/ for the full buildable example, including a Makefile and support files.

Documentation

API documentation is available online here and also in fips203ipd.h. If you have Doxygen installed, you can build HTML-formatted API documentation by typing make doc.

Tests

Use make test to build and run the test suite.

The test suite checks each component of this implementation for expected answers and is built with common sanitizers supported by both GCC and Clang. The source code for the test suite is embedded at the bottom of fips203ipd.c behind a TEST_FIPS203IPD define.

You can also build a quick test application by typing make in the top-level directory. The test application does the following 1000 times for each parameter set:

Generate a random encapsulation/decapsulation key pair.
Encapsulate a secret using the encapsulation key.
Decapsulate the secret using the decapsulation key.
Verify that the secrets generated in steps #2 and #3 match.

Usage

There are safer and faster alternatives, but if you want to use this library anyway:

Copy fips203ipd.h and fips203ipd.c into your source tree.
Update your build system to compile fips203ipd.o.
Include fips203ipd.h in your application.
Use fips203ipd_*() functions in your code.

Benchmarks

A minimal libcpucycles-based benchmarking tool is available in examples/3-bench/. The bench tool repeatedly measures the number of CPU cycles used for key generation, encapsulation, and decapsulation for each parameter set and then prints summary statistics to standard output in CSV format.

bench results from a few of my systems are shown below.

Lenovo ThinkPad X1 Carbon, 6th Gen (x86-64 i7-1185G7, AVX-512 backend)

set	function	median	mean	stddev	min	max
kem512	keygen	17633	17723	676	17302	54660
kem512	encaps	21602	21705	938	21304	75753
kem512	decaps	25733	25894	4934	25376	372564
kem768	keygen	29384	29528	1557	28885	133739
kem768	encaps	32511	32632	603	31974	59495
kem768	decaps	38176	38223	529	37368	46061
kem1024	keygen	39829	40111	4865	39166	365645
kem1024	encaps	45250	45437	781	44510	76805
kem1024	decaps	52425	52605	636	51282	63121

Raspberry Pi 5 (ARM Cortex-A76, Neon backend)

set	function	median	mean	stddev	min	max
kem512	keygen	53711	53824	945	53355	106488
kem512	encaps	61366	61487	600	61010	70577
kem512	decaps	73559	73666	586	73202	85529
kem768	keygen	92560	92754	896	92026	119126
kem768	encaps	104842	105065	802	104352	114855
kem768	decaps	121485	121718	826	120907	130430
kem1024	keygen	140219	143515	10211	139507	214890
kem1024	encaps	154949	158546	11176	154237	224903
kem1024	decaps	176131	180226	12692	175330	231178

Odroid N2L (ARM Cortex-A73, Neon backend)

set	function	median	mean	stddev	min	max
kem512	keygen	96450	96734	1455	94875	121425
kem512	encaps	107550	107831	1552	106050	132825
kem512	decaps	126375	126681	1704	125250	165450
kem768	keygen	168450	168962	2058	165900	209175
kem768	encaps	186975	187449	2169	184425	214350
kem768	decaps	212925	213580	2769	209550	259725
kem1024	keygen	260325	261156	2861	256275	304950
kem1024	encaps	281175	282064	2891	277575	326850
kem1024	decaps	314250	315165	3441	310875	410850

Backends

This library includes several accelerated backends which are selectable at compile time via the optional BACKEND parameter. By default the fastest available backend is selected at compile-time.

The available backends are:

Auto-detect (BACKEND=0): Automatically detect the appropriate backend at compile-time (default).
Scalar (BACKEND=1): Fallback if no faster backend is available.
AVX-512 (BACKEND=2): AVX-512 acceleration. Selected by default if AVX-512 is supported.
Neon (BACKEND=3): ARM Neon acceleration. Selected by default if Neon is supported..

When the a backend is enabled, the following functions are replaced with accelerated equivalents:

Functions	Description
`permute()`	block permutation for SHA3-256 and SHA3-512
`poly_{ntt,inv_ntt}()`	Number-Theoretic Transform (NTT) and inverse NTT
`poly_{add,add2,sub,mul}()`	Polynomial arithmetic
`poly_encode()`, `poly_encode_{11,10,5,4,1}bit()`	Polynomial encoding
`poly_decode()`, `poly_decode_{11,10,5,4,1}bit()`	Polynomial decoding
`pke{512,768,1024}_keygen_sample()`	Key generation (see below)
`pke{512,768,1024}_encrypt_sample()`	Encryption (see below)

You can use the fips203ipd_backend() function to get the backend name at run-time.

AVX-512 Backend

The AVX-512-enabled key generation and encryption functions perform coefficient sampling in stages and use the 64-bit lanes of 25 512-bit registers to permute and sample from up to 8 SHAKE128 and SHAKE256 contexts at once.

Here are how the lanes of the AVX-512 registers are used for each parameter set, function, and stage:

Parameter Set	Function	Stage	Lane 0	Lane 1	Lane 2	Lane 3	Lane 4	Lane 5	Lane 6	Lane 7
KEM512	keygen	1	A[0,0]	A[0,1]	A[1,0]	A[1,1]	s[0]	s[1]	e[0]	e[1]
KEM512	encrypt	1	A[0,0]	A[1,0]	A[0,1]	A[1,1]	r[0]	r[1]	n/a	n/a
KEM512	encrypt	2	e1[0]	e1[1]	e2	n/a	n/a	n/a	n/a	n/a
KEM768	keygen	1	A[0,0]	A[0,1]	A[0,2]	A[1,0]	A[1,1]	A[1,2]	A[2,0]	A[2,1]
KEM768	keygen	2	A[2,2]	s[0]	s[1]	s[2]	e[0]	e[1]	e[2]	n/a
KEM768	encrypt	1	A[0,0]	A[1,0]	A[2,0]	A[0,1]	A[1,1]	A[2,1]	A[0,2]	A[1,2]
KEM768	encrypt	2	A[2,2]	r[0]	r[1]	r[2]	e1[0]	e1[1]	e1[2]	e2
KEM1024	keygen	1	A[0,0]	A[0,1]	A[0,2]	A[0,3]	A[1,0]	A[1,1]	A[1,2]	A[1,3]
KEM1024	keygen	2	A[2,0]	A[2,1]	A[2,2]	A[2,3]	A[3,0]	A[3,1]	A[3,2]	A[3,3]
KEM1024	keygen	3	s[0]	s[1]	s[2]	s[3]	e[0]	e[1]	e[2]	e[3]
KEM1024	encrypt	1	A[0,0]	A[1,0]	A[2,0]	A[3,0]	A[0,1]	A[1,1]	A[2,1]	A[3,1]
KEM1024	encrypt	2	A[0,2]	A[1,2]	A[2,2]	A[3,2]	A[0,3]	A[1,3]	A[2,3]	A[3,3]
KEM1024	encrypt	3	r[0]	r[1]	r[2]	r[3]	e1[0]	e1[1]	e1[2]	e1[3]
KEM1024	encrypt	4	e2	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Notes:

Unlike the Neon backend and most other ML-KEM implementations, the AVX-512 backend does not alternate between squeezing to a buffer and then sampling polynomial coefficients from the buffer. Instead, the AVX-512 backend transposes the registers containing the SHAKE context and then samples from them directly. This aproach is novel, but it uses a lot of registers and probably slower in the common case. A future implementation will probably switch to the alternating method used by the Neon backend and other implementations.

Neon Backend

The Neon key generation and encryption functions perform coefficient sampling in stages and use the 64-bit lanes of 25 128-bit registers to permute and sample from 2 SHAKE128 or SHAKE256 contexts at once.

Here are how the lanes of the 128-bit Neon registers are used for each parameter set, function, and stage:

Parameter Set	Function	Stage	Lane 0	Lane 1
KEM512	keygen	1	A[0,0]	A[0,1]
KEM512	keygen	2	A[1,0]	A[1,1]
KEM512	keygen	3	s[0]	s[1]
KEM512	keygen	4	e[0]	e[1]
KEM512	encrypt	1	A[0,0]	A[1,0]
KEM512	encrypt	2	A[0,1]	A[1,1]
KEM512	encrypt	3	r[0]	r[1]
KEM512	encrypt	4	e1[0]	e1[1]
KEM512	encrypt	5	e2	n/a
KEM768	keygen	1	A[0,0]	A[0,1]
KEM768	keygen	2	A[0,2]	A[1,0]
KEM768	keygen	3	A[1,1]	A[1,2]
KEM768	keygen	4	A[2,0]	A[2,1]
KEM768	keygen	5	A[2,2]	n/a
KEM768	keygen	6	s[0]	s[1]
KEM768	keygen	7	s[2]	e[0]
KEM768	keygen	8	e[1]	e[2]
KEM768	encrypt	1	A[0,0]	A[1,0]
KEM768	encrypt	2	A[2,0]	A[0,1]
KEM768	encrypt	3	A[1,1]	A[2,1]
KEM768	encrypt	4	A[0,2]	A[1,2]
KEM768	encrypt	5	A[2,2]	n/a
KEM768	encrypt	6	r[0]	r[1]
KEM768	encrypt	7	r[2]	e1[0]
KEM768	encrypt	8	e1[1]	e1[2]
KEM768	encrypt	9	e2	n/a
KEM1024	keygen	1	A[0,0]	A[0,1]
KEM1024	keygen	2	A[0,2]	A[0,3]
KEM1024	keygen	3	A[1,0]	A[1,1]
KEM1024	keygen	4	A[1,2]	A[1,3]
KEM1024	keygen	5	A[2,0]	A[2,1]
KEM1024	keygen	5	A[2,2]	A[2,3]
KEM1024	keygen	6	A[3,0]	A[3,1]
KEM1024	keygen	7	A[3,2]	A[3,3]
KEM1024	keygen	8	s[0]	s[1]
KEM1024	keygen	9	s[2]	s[3]
KEM1024	keygen	10	e[0]	e[1]
KEM1024	keygen	11	e[2]	e[3]
KEM1024	encrypt	1	A[0,0]	A[1,0]
KEM1024	encrypt	2	A[2,0]	A[3,0]
KEM1024	encrypt	3	A[0,1]	A[1,1]
KEM1024	encrypt	4	A[2,1]	A[3,1]
KEM1024	encrypt	5	A[0,2]	A[1,2]
KEM1024	encrypt	5	A[2,2]	A[3,2]
KEM1024	encrypt	6	A[0,3]	A[1,3]
KEM1024	encrypt	7	A[2,3]	A[3,3]
KEM1024	encrypt	8	r[0]	r[1]
KEM1024	encrypt	9	r[2]	r[3]
KEM1024	encrypt	10	e1[0]	e1[1]
KEM1024	encrypt	11	e1[2]	e1[3]
KEM1024	encrypt	12	e2	n/a

Notes:

Unlike the AVX-512 backend, the Neon backend alternates between squeezing bytes to a buffer and then sampling polynomial coefficients from the buffer.
The Neon backend does not use the EOR3 or RAX1 instructions, because I do not have a system which supports them.
Polynomial encoding and decoding are not currently Neon-accelerated. This will be fixed in a future release.

Scalar Backend

Notes: The scalar backend is the default if no accelerated backend is available. It is not particularly optimized.

As of this writing, Clang does a better job optimizing the scalar backend than GCC (see bench results in examples/3-bench/README.md).

References

License

MIT No Attribution (MIT-0)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.