
# When sorting your data costs more than cracking AES-128

How does Kyber512 stack up against AES128? In which models of computation is Kyber512 more secure? Is it ever less secure? What about in reality?

Of course reality is what matters. But the reality question is the hardest to answer.

# "In reality"

If your adversary can achieve their goal by:

1. recovering a Kyber512 session key from the public messages, or by
2. decrypting an AES128-encrypted message,

which will they choose? That's what I mean by "in reality". That's the question that real adversaries ask themselves. Real codebreakers want to optimize the use of their resources. Their budget and their hardware are fixed. They need to get the job done by Tuesday.

Here's another framing: What is the adversary's relative success probability in (1) vs (2) if they are given access to machine M for time T?

This question is much easier to answer if you replace Kyber512 with another symmetric cipher: the hardware required to break a symmetric cipher is practically independent of the cipher itself, so the relative security of, say, AES and Serpent is practically independent of M and T.

The security of Kyber512 varies wildly with M and T.

# A correction

In the security analysis of Kyber, my co-authors and I focused on an algorithm
that costs 2^{0.292b} operations in the RAM model. In the security
analysis of NTRU, we additionally considered an algorithm that costs
2^{0.349b} operations in a "local" model of computation (something like
a mesh network of RAM machines with constant-sized memories).
The 0.349 constant was a conjecture, not an established fact.

I no longer believe that 2^{0.349b} is a realistic estimate for local
models (*edit: See "Update 2023-11-17" below. I now, once again, believe that
the 0.349b estimate is realistic. Thanks to Sam Jaques and Léo Ducas for
clarifying discussions.*).

The NTRU Prime team, which I was not involved with, also considered local
models. They did not consider the 2^{0.349b} attack. They considered a
2^{0.396b} attack. I worry that my support for the 2^{0.349b}
estimate may have damaged the credibility of the 2^{0.396b} estimate.
This matters because the best exponent was 0.415 a decade ago, and the 0.349
and 0.292 estimates play into a false narrative that the cost of lattice
attacks has been falling precipitously since then.

So, as a personal correction to my past use of 0.349, I'm going to describe a
machine that breaks Kyber512 with *a lot* of sorting and little else. It runs
the 0.396 algorithm.

Contrary to popular misconceptions about the 0.396 algorithm, the machine I
describe below is not a giant circuit laid out in two dimensions, and it does
not use n^{3/2}-operation sorting algorithms at all stages.
Rather, it is substantially taller than any computer ever built, and it uses
n log n-operation sorting algorithms on exabit-sized random access
memories.

So read on and imagine, with me, a trillion server racks sorting a quadrillion exabits of data (and a crime against Nevada).

But please... don't mistake this for anything more than a fun exercise. A rigorous analysis of this attack, and a comparison with lower-memory alternatives, should be left to the academic literature. The NTRU Prime team has already worked out a lot of the details.

# Sieving

The best known attack on Kyber512 finds a short vector in a lattice of dimension 1025. It does this by repeatedly solving lattice problems in dimension 384. Each dimension 384 lattice problem is solved using "sieving".

Here's a recipe for sieving:

1. Collect enough random-looking lattice vectors of roughly the same length so that, by a simple volume argument, many pairs of vectors are guaranteed to be "close" (v and w are close if |v-w| < |v| <= |w|).
2. Use a nearest-neighbor search algorithm to find pairs of vectors that might be close.
3. Identify the actually-close pairs and take differences to obtain a collection of
*shorter* vectors in the lattice.

If you input N = sqrt(4/3)^{384} = (4/3)^{192} vectors into
Step (1) then you get enough shorter vectors out of Step (3) to meet the
preconditions for Step (1) again. You can iterate. You can find vectors that
are as short as you like.
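The recipe above can be sketched in a few lines. This is a toy illustration in the lattice Z^2, written by me for this post: a brute-force pair search stands in for the nearest-neighbor search of Step (2), and the parameters are arbitrary. It is not a real sieving implementation.

```python
import itertools
import random

def norm2(v):
    return sum(c * c for c in v)

def toy_sieve_step(vectors):
    """One sieving pass: find close pairs (v, w) with |v - w| < min(|v|, |w|)
    by brute force, and return the shorter difference vectors. A real sieve
    replaces this quadratic pair search with near-neighbor search."""
    shorter = []
    for v, w in itertools.combinations(vectors, 2):
        d = tuple(a - b for a, b in zip(v, w))
        if 0 < norm2(d) < min(norm2(v), norm2(w)):
            shorter.append(d)
    return shorter

# Random integer vectors stand in for lattice vectors of roughly equal
# length; differences of lattice vectors stay in the lattice.
random.seed(1)
pool = [(random.randint(-50, 50), random.randint(-50, 50)) for _ in range(100)]
out = toy_sieve_step(pool)
```

Iterating `toy_sieve_step` on its own output is exactly the loop described above: each pass meets the preconditions of the next.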

**Disclaimer:** My "384" is an approximation to the actual sieving dimension for attacks on Kyber512. It's
impossible to determine the actual value until you fix a machine model, and that's what this post is about.
There's a bit of a chicken and egg problem here.

# BDGL

The best RAM algorithm known for Step (2) is due to Becker, Ducas, Gama, and Laarhoven [1].

The BDGL algorithm covers the unit sphere in dimension 384 with
(3/2)^{192} overlapping regions. It stores the (4/3)^{192}
input vectors, redundantly, in buckets associated with these regions; each
input vector is stored in (9/8)^{192} buckets. It can find close pairs
in any bucket essentially for free, because there aren't many vectors in any
single bucket.

The algorithm has a very clever "list-decoding" subroutine for enumerating the
regions that contain a vector. The cost of this subroutine scales linearly with
the number of regions that the vector is in. So here list-decoding costs
(9/8)^{192} operations per input vector.

To find all close pairs, the BDGL sieve runs the list-decoding algorithm on every
input vector. This costs
(4/3)^{192} · (9/8)^{192} =
(3/2)^{192} operations, which matches the cost of the bucketing stage
up to subexponential factors.

**Disclaimer:** My (4/3)^{b/2} is shorthand for
2^{(0.207...b+o(b))} and my (9/8)^{b/2} is shorthand for
2^{(0.084...b+o(b))}. My (3/2)^{b/2} is the product of these,
2^{(0.292...b+o(b))}. These numbers represent the volume of simple
geometric regions on the sphere. Exact values for fixed b can be computed with
the software from my Asiacrypt 2020 paper with Albrecht, Gheorghiu, and
Postlethwaite [2].
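The per-dimension constants behind this shorthand can be reproduced in two lines each; the variable names are mine.

```python
from math import log2

# Per-dimension cost exponents behind the shorthand in the disclaimer:
# (4/3)^{b/2} = 2^{0.207...b}  (input list size)
# (9/8)^{b/2} = 2^{0.084...b}  (buckets per input vector)
# (3/2)^{b/2} = 2^{0.292...b}  (total time-optimal BDGL cost)
e_list = 0.5 * log2(4 / 3)
e_decode = 0.5 * log2(9 / 8)
e_total = 0.5 * log2(3 / 2)
```

Since (4/3)·(9/8) = 3/2, the first two exponents sum to the third exactly.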

# Space-optimal BDGL

What I just described is the "time-optimal" variant of BDGL. Its time and space complexities are equal.
It uses (3/2)^{192} space.

You can tune BDGL to use exponentially less space if you're willing to accept a
subexponential increase in time. The trick is to iterate over the input list c
times, bucketing into a different 1/c fraction of the regions in each iteration.
In this configuration, the list-decoding routine takes (9/8)^{192}/c
operations per input vector per iteration. The "space-optimal" variant of BDGL
takes c = (9/8)^{192} and stores just (4/3)^{192} nearly-empty
buckets in any iteration.
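A quick sanity check of the vector counts in the two variants, dropping the o(b) terms as in the disclaimer above (so these are the shorthand values, not exact ones):

```python
from math import log2

b = 384
log2_N = (b / 2) * log2(4 / 3)   # input list size, ~2^80 (shorthand)
log2_c = (b / 2) * log2(9 / 8)   # iterations in the space-optimal variant

# Time-optimal: all (4/3)^{b/2} * (9/8)^{b/2} bucketed copies live at once.
log2_time_opt_entries = log2_N + log2_c   # ~2^112 stored vector copies
# Space-optimal: c passes, each storing only ~N nearly-empty buckets.
log2_space_opt_entries = log2_N           # ~2^80 stored vector copies
```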

# A dose of reality

Assuming 16 bit coordinates, the time-optimal variant stores 2^{112}
vectors in 2^{125} bits of memory and the space-optimal variant stores
2^{80} vectors in 2^{93} bits of memory.

I suggest you visualize 2^{125} bits of memory as a kilometer thick shell
of terabyte microSD cards surrounding the Earth. If you're bullish on technology, you can
imagine a meter thick shell of petabyte microSD cards. If you're a fanatic, you can imagine a
millimeter thick membrane of exabyte/mm^{3} DNA goop everywhere. Whatever assumptions
you make about technological progress, it's an unreasonable amount of memory.

So, for the remainder of this post, I'll assume that attacks that use more than 2^{100} bits of memory are
out-of-scope for comparison with AES-128. This is somewhat arbitrary. I may justify it in a follow-up post.

**Note**: By exact computation of "2^{(0.207...b+o(b))}" at b=384, I find that the space-optimal
variant actually needs to store 2^{85} vectors, i.e. 2^{98} bits. I'll use the exact
value in what follows.
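The vectors-to-bits conversion used in this section is just arithmetic under the stated assumption of 16-bit coordinates in dimension 384:

```python
from math import log2

bits_per_vector = 384 * 16   # dimension 384, 16-bit coordinates
lv = log2(bits_per_vector)   # log2(6144), about 12.6

# Time-optimal: 2^112 vectors; space-optimal: 2^80 (shorthand) or 2^85 (exact).
log2_time_opt_bits = 112 + lv
log2_space_opt_bits = 80 + lv
log2_space_opt_bits_exact = 85 + lv
```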

# The attack machine

The attack machine is a work of science fiction. The story goes like this.

The General Routing Corporation builds R-machines for the government. Their
machines all have a similar design: model R(x, y, z, m) has 2^{x}
aisles of server racks, 2^{y} racks per aisle, 2^{z} shelves
per rack, one (many-core) CPU per shelf, and 2^{m} bits of RAM per CPU.

The CPUs in an R-machine are networked. The CPUs in one rack have all-to-all connectivity. The racks have nearest-neighbor connectivity.

General Routing's glossy marketing literature advertises R-machines that come packaged as Dyson spheres. But R(20, 20, 10, 50) is the largest machine they've built. It's housed in Nevada. All of Nevada. What used to be Nevada.

The all-to-all connectivity inside a rack
provides each CPU with RAM-like access to 2^{z+m} bits of data.
For R(20, 20, 10, 50) that's one exabit per rack. There are a trillion racks.
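The machine-size bookkeeping is easy to reproduce. The helper below is mine, not part of any real spec; note that the total memory, 2^{100} bits, just clears the 2^{98}-bit requirement computed earlier.

```python
def r_machine(x, y, z, m):
    """Log2 CPU count and log2 total memory (in bits) of an R(x, y, z, m)."""
    log2_cpus = x + y + z
    return log2_cpus, log2_cpus + m

log2_cpus, log2_bits = r_machine(20, 20, 10, 50)   # the Nevada machine
log2_rack_bits = 10 + 50    # 2^{z+m} = 2^60 bits, about one exabit per rack
log2_racks = 20 + 20        # 2^40 racks, about a trillion
```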

**Note:** After accounting for power, cooling, and wiring (all handled by
separate subcontractors), the racks are one meter square at the base. So Nevada
was actually a little small. General Routing and the government agreed to a
change order during construction: they swapped the R(20, 20, 10, 50) for four
R(19, 19, 10, 50)s stacked vertically. This didn't change the performance of
the machine, but it was very expensive.

# Sketch of the computation

The computation that R(20, 20, 10, 50) performs to break Kyber512 uses the space-optimal BDGL sieve as described above.

The machine performs (9/8)^{192} iterations of a loop. In each
iteration, the collection of N = 2^{85} input vectors is laid out
(arbitrarily) in memory across the cluster. The machine:

1. Uses the BDGL near-neighbor search algorithm to tag each vector with (zero or more) region labels.
2. Sorts the labelled vectors.
3. Identifies close pairs among vectors that share the same label.

**Note**: The list-decoding algorithm uses a "random product code". I won't go
into the details here, but I do need to mention that every CPU in R(20, 20, 10,
50) needs a copy of this random product code. The code is represented by
mN^{1/m} unit vectors in dimension 384/m. We need m to be large enough
so that the code size is negligible compared to each CPU's share of the input.
But if we take m to be too large then the list-decoding algorithm becomes
the dominant cost of the attack. This is a balancing act, but something like
m=6 should be fine.
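The m = 6 claim is easy to check numerically. The calculation below assumes 16-bit coordinates for the code vectors, as elsewhere in this post:

```python
from math import log2

b, m_code = 384, 6
log2_N = 85   # exact list size, N = 2^85
# The code is m * N^(1/m) unit vectors in dimension b/m, 16 bits/coordinate.
log2_code_vectors = log2(m_code) + log2_N / m_code
log2_code_bits = log2_code_vectors + log2((b // m_code) * 16)

# Each CPU's share of the 2^98-bit input, spread over 2^50 CPUs.
log2_share_bits = 98 - 50
```

At m = 6 the code is roughly 2^27 bits, comfortably below each CPU's 2^48-bit share of the input.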

**Note**: The random product code must be generated fresh in each iteration of
the outer loop. This can be done in parallel by pre-sharing a DRBG seed across
the cluster.

**Note**: Step 1 maps each input vector to zero or more labelled vectors. The
"zero" and "more than one" cases must be handled carefully, but both can be
solved by giving each CPU slightly more memory than it needs to store its share of
the input list.

# The sort

A region label is a string of log_{2} N = 85 bits. We'll think of these 85 bits as a
quadruple (w, x, y, z) where w is a 35-bit tag and (x, y, z) indexes a CPU.

The sort proceeds in three steps:

- Sort by z in each rack such that the CPU in the i-th shelf receives all of the vectors with z = i.

Since the CPUs in each rack are all-to-all connected, this can be done with
radix sort in 10 rounds of communication. In each round, each of the
2^{10} CPUs exchanges about 2^{50} bits of data with a neighbor. The
cost is 2^{50} in-rack bit moves per CPU per round. This is
2^{63} in-rack bit moves per rack and 2^{103} in-rack bit moves
across the entire cluster.

- Sort by (x,y) within each z-layer.

Since the racks have nearest-neighbor connectivity, this can be done with the
Schnorr–Shamir mesh sort in 2^{20} steps. In each step, each of the
2^{50} CPUs in the cluster exchanges about 2^{50} bits of data
with a neighbor in an adjacent rack. This gives a total of
2^{(20+50+50)} = 2^{120} adjacent-rack bit move operations.

- Sort by w on each CPU.

This can be done using any fast RAM sort at negligible cost.
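The bit-move accounting for the three steps can be reproduced directly from the parameters of R(20, 20, 10, 50):

```python
from math import log2

x, y, z, m = 20, 20, 10, 50

# Step 1: in-rack radix sort on z. 10 rounds; in each round, each of the
# 2^z CPUs in each of the 2^{x+y} racks moves ~2^m bits.
log2_step1 = log2(10) + z + m + x + y   # ~2^103 in-rack bit moves

# Step 2: Schnorr-Shamir mesh sort over the 2^x-by-2^y grid of racks.
# ~2^20 steps; in each step, each of the 2^{x+y+z} CPUs moves ~2^m bits.
log2_step2 = 20 + (x + y + z) + m       # 2^120 adjacent-rack bit moves

# Step 3: local RAM sorts, negligible by comparison.
```

The mesh sort dominates by a factor of about 2^17, which is why the final cost estimate counts only adjacent-rack bit moves.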

# Close pair detection

After the sort, vectors with the same (w, x, y, z) label are adjacent in the
memory of CPU (x, y, z). There are not many vectors with any
given (w, x, y, z) label, so the actually-close pairs can be identified using
parallel, nearly-linear-time searches. This step involves something like
2^{85} vector comparisons. It's safe to assume that each comparison
costs less than 2^{35} adjacent-rack bit moves, so this is negligible.
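The per-CPU scan can be sketched as follows. This is my illustration, with toy labels and 2-dimensional vectors standing in for the real 384-dimensional, 16-bit-coordinate vectors:

```python
from itertools import combinations, groupby
from operator import itemgetter

def norm2(v):
    return sum(c * c for c in v)

def close_pairs_in_sorted(labelled):
    """Scan runs of equal labels in a label-sorted list and compare only
    within each run (nearly linear time when runs are short)."""
    pairs = []
    for _, run in groupby(labelled, key=itemgetter(0)):
        vecs = [v for _, v in run]
        for v, w in combinations(vecs, 2):
            d = tuple(a - b for a, b in zip(v, w))
            if norm2(d) < min(norm2(v), norm2(w)):
                pairs.append(d)
    return pairs

# Toy labelled list, already sorted by label.
labelled = [(0, (5, 0)), (0, (4, 1)), (1, (9, 9)), (2, (3, 4)), (2, (3, 3))]
found = close_pairs_in_sorted(labelled)
```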

# The cost

The sorting and close-pair detection steps are iterated (9/8)^{192} =
2^{32} times. The total cost of the attack is dominated by
2^{120+32} = 2^{152} adjacent-rack bit moves. This is
significantly more expensive than the 2^{143} logic gates that are
needed to break AES-128.

**Note:** A naive application (misinterpretation?) of the NTRU Prime submission's "Non-quantum 0.396b,
non-hybrid, real cost of memory" estimate gives something similar (0.396
· 384 = 152), but this is a coincidence. This naive 0.396b estimate is
less realistic than mine in some respects and more realistic in others. It is
less realistic because I used an exact computation to arrive at N =
2^{85} instead of N = 2^{80}. It is more realistic because it
does not allow unit-cost random access to the exabit memories.

# What if I can only afford an R(20, 20, 10, 40)?

If you only have 2^{90} bits of memory, then you cannot iterate a
BDGL-based sieve. So you cannot use BDGL to find dimension 384 vectors that
are as short as you like. You cannot use BDGL to break Kyber512.

This point bears repeating, as the problem compounds at higher security levels.

The standard attack on Kyber768 uses something like 2^{138} bits of
storage. If you can't afford an R(20, 20, 10, 90)—and who can these
days?—you can't use BDGL to attack Kyber768. If all you have lying around is an R(20,
20, 10, 50)... then you start looking at other sieves—exponentially slower sieves that use less space. You
might end up pretty far down the list of low-memory sieving algorithms. You might
even have to consider superexponential time lattice point enumeration.

After all, your goal as a codebreaker is to maximize the success probability of the computation that you can afford to run. You're not in it to daydream about Dyson spheres.

# Update 2023-11-17: better attack parameters

Instead of covering the sphere with (3/2)^{b/2} overlapping regions, we
can cover it with fewer, but larger, regions. As we decrease the number of
regions that we use, the number of vectors in each bucket increases, and the
cost of close-pair detection grows (quadratically). But this is offset by the
fact that the number of close pairs that we find per bucket also increases, and
the total number of buckets that we need to consider decreases.

When using an R(x, x, z, m), the optimal number of vectors per bucket is roughly 2^{x}. That's because the sorting step moves the full 2^{2x+z+m} bits of memory 2^{x} times, while the close-pair detection step (implemented as a "systolic array" in which a copy of the data is moved past the original in a cycle) moves those 2^{2x+z+m} bits of data k times, where k is the average number of vectors in a bucket.
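A minimal sketch of this balancing argument, under my reading of the paragraph above (bit and repetition counts interpreted as base-2 exponents, o(1) factors dropped):

```python
def log2_iteration_cost(x, z, m, log2_k):
    """Rough per-iteration cost of an R(x, x, z, m), in log2 bit moves:
    sorting moves the full 2^{2x+z+m}-bit memory about 2^x times; the
    systolic close-pair pass moves it about k times, for average bucket
    size k."""
    log2_mem = 2 * x + z + m
    return max(log2_mem + x, log2_mem + log2_k)

# The two terms balance when k ~ 2^x:
balanced = log2_iteration_cost(20, 10, 50, 20)
oversized = log2_iteration_cost(20, 10, 50, 35)   # detection dominates
```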

The RAM model gives you an R(0, 0, 0, m). In this case
(3/2)^{b/2} regions (which corresponds to covering the sphere with
spherical caps of angle π/3) is optimal.

I used an R(20, 20, 10, 50) above, and I failed to balance the cost of sorting
and close-pair detection. Using this Python script, which I based on
the software from ePrint 2019/1161, I find that caps of angle approximately
61π/173 are optimal. Properly parameterized, the attack performs
2^{142} adjacent-rack bit moves and an equal number of vector
comparisons. (Both of these operations cost significantly more than one logic
gate.)

What happens as the sieving dimension, b, grows? Locality forces the attacker
to keep m constant. Power and cooling constraints force them to keep z
constant. So they will use an R(x, x, c_{z}, c_{m}) for the
smallest x that gives them enough memory. They will tune the algorithm so that
there are 2^{x} vectors in each bucket, which is roughly the square root
of the input list size. And, ultimately, their cost will grow like
2^{((0.349...+o(1))b)}, although the convergence is pretty slow.

They'll use an R(18.3, 18.3, 10, 50) for b = 384, which costs
2^{0.371·384} = 2^{142} operations.

They'll use an R(38.7, 38.7, 10, 50) for b = 576, which costs
2^{0.366·576} = 2^{211} operations.

They'll use an R(59, 59, 10, 50) for b = 768, which costs
2^{0.363·768} = 2^{279} operations.

The asymptotics don't really matter, as realistic adversaries will use an asymptotically slower algorithm if it has a higher success probability when run on a machine that they can afford (and those algorithms exist).

It's much more important to nail down the c_{z} and c_{m}. For
instance, if we only allow gigabit RAMs, then the adversary will use an
R(33.3, 33.3, 0, 30) or an R(33.3, 33.3, 10, 20) for b = 384, which will
cost 2^{0.392·384} = 2^{151} operations. If we only
allow kilobit RAMs, the cost is 2^{0.407·384} =
2^{156} operations.