When sorting your data costs more than cracking AES-128

How does Kyber512 stack up against AES128? In which models of computation is Kyber512 more secure? Is it ever less secure? What about in reality?

Of course reality is what matters. But the reality question is the hardest to answer.

"In reality"

If your adversary can achieve their goal by:

  1. recovering a Kyber512 session key from the public messages, or by
  2. decrypting an AES128 encrypted message,

which will they choose? That's what I mean by "in reality". That's the question that real adversaries ask themselves. Real codebreakers want to optimize the use of their resources. Their budget and their hardware are fixed. They need to get the job done by Tuesday.

Here's another framing: What is the adversary's relative success probability in (1) vs (2) if they are given access to machine M for time T?

This question is a lot easier to answer if you replace Kyber512 with a second symmetric cipher, since the hardware required to break a symmetric cipher is practically independent of the cipher itself. The relative security of AES and Serpent is practically independent of M and T.

The security of Kyber512 varies wildly with M and T.

A correction

In the security analysis of Kyber, my co-authors and I focused on an algorithm that costs 2^(0.292b) operations in the RAM model. In the security analysis of NTRU, we additionally considered an algorithm that costs 2^(0.349b) operations in a "local" model of computation (something like a mesh network of RAM machines with constant-sized memories). The 0.349 constant was a conjecture, not an established fact.

I no longer believe that 2^(0.349b) is a realistic estimate for local models (edit: see "Update 2023-11-17" below; I now, once again, believe that the 0.349b estimate is realistic. Thanks to Sam Jaques and Léo Ducas for clarifying discussions).

The NTRU Prime team, which I was not involved with, also considered local models. They did not consider the 2^(0.349b) attack. They considered a 2^(0.396b) attack. I worry that my support for the 2^(0.349b) estimate may have damaged the credibility of the 2^(0.396b) estimate. This matters because the best exponent was 0.415 a decade ago, and the 0.349 and 0.292 estimates play into a false narrative that the cost of lattice attacks has been falling precipitously since then.

So, as a personal correction to my past use of 0.349, I'm going to describe a machine that breaks Kyber512 with a lot of sorting and little else. It runs the 0.396 algorithm.

Contrary to popular misconceptions about the 0.396 algorithm, the machine I describe below is not a giant circuit laid out in two dimensions, and it does not use n^(3/2)-operation sorting algorithms at all stages. Rather, it is substantially taller than any computer ever built, and it uses n log n-operation sorting algorithms on exabit-sized random access memories.

So read on and imagine, with me, a trillion server racks sorting a quadrillion exabits of data (and a crime against Nevada).

But please... don't mistake this for anything more than a fun exercise. A rigorous analysis of this attack, and a comparison with lower-memory alternatives, should be left to the academic literature. The NTRU Prime team has already worked out a lot of the details.

Sieving

The best known attack on Kyber512 finds a short vector in a lattice of dimension 1025. It does this by repeatedly solving lattice problems in dimension 384. Each dimension 384 lattice problem is solved using "sieving".

Here's a recipe for sieving:

  1. Collect enough random-looking lattice vectors of roughly the same length so that, by a simple volume argument, many pairs of vectors are guaranteed to be "close" (v and w are close if |v-w| < |v| <= |w|).
  2. Use a nearest-neighbor search algorithm to find pairs of vectors that might be close.
  3. Identify the actually-close pairs and take differences to obtain a collection of shorter vectors in the lattice.

If you input N = sqrt(4/3)^384 = (4/3)^192 vectors into Step (1), then you get enough shorter vectors out of Step (3) to meet the preconditions for Step (1) again. You can iterate. You can find vectors that are as short as you like.
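Here's the recipe as a toy Python sketch. Everything about it is illustrative: the 4-dimensional basis is made up, and a brute-force quadratic search stands in for the nearest-neighbor step. A real attack runs in dimension roughly 384 and uses BDGL, described next.

    import random

    # Toy sieve. The basis below is made up, and brute force stands in
    # for the nearest-neighbor search; this only illustrates the recipe.
    BASIS = [(101, 0, 0, 0), (7, 93, 0, 0), (12, 4, 87, 0), (3, 9, 2, 111)]

    def norm2(v):
        return sum(a * a for a in v)

    def random_vector():
        # Step (1): a random-looking lattice vector from a small random
        # integer combination of the basis vectors.
        coeffs = [random.randint(-4, 4) for _ in BASIS]
        return tuple(sum(c * b[i] for c, b in zip(coeffs, BASIS)) for i in range(4))

    def sieve_step(vectors):
        # Steps (2) and (3): find close pairs (|v-w| < |v| <= |w|) and
        # take differences to obtain shorter vectors.
        shorter = set()
        for i, v in enumerate(vectors):
            for w in vectors[i + 1:]:
                d = tuple(a - b for a, b in zip(v, w))
                if 0 < norm2(d) < min(norm2(v), norm2(w)):
                    shorter.add(d)
        return sorted(shorter, key=norm2)

    vectors = [v for v in (random_vector() for _ in range(400)) if norm2(v) > 0]
    for _ in range(4):
        vectors = sieve_step(vectors)[:400] or vectors
    print(min(vectors, key=norm2))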

Disclaimer: My "384" is an approximation to the actual sieving dimension for attacks on Kyber512. It's impossible to determine the actual value until you fix a machine model, and that's what this post is about. There's a bit of a chicken and egg problem here.

BDGL

The best RAM algorithm known for Step (2) is due to Becker, Ducas, Gama, and Laarhoven [1].

The BDGL algorithm partitions the unit ball in dimension 384 into (3/2)^192 overlapping regions. It stores the (4/3)^192 input vectors, redundantly, in buckets associated with these regions; each input vector is stored in (9/8)^192 buckets. It can find close pairs in any bucket essentially for free, because there aren't many vectors in any single bucket.

The algorithm has a very clever "list-decoding" subroutine for enumerating the regions that contain a vector. The cost of this subroutine scales linearly with the number of regions that the vector is in. So here list-decoding costs (9/8)^192 operations per input vector.

To find all close pairs, the BDGL sieve runs the list-decoding algorithm on every input vector. This costs (4/3)^192 · (9/8)^192 = (3/2)^192 operations, which matches the cost of the bucketing stage up to subexponential factors.

Disclaimer: My (4/3)^(b/2) is shorthand for 2^(0.207...b+o(b)) and my (9/8)^(b/2) is shorthand for 2^(0.084...b+o(b)). My (3/2)^(b/2) is the product of these, 2^(0.292...b+o(b)). These numbers represent the volumes of simple geometric regions on the sphere. Exact values for fixed b can be computed with the software from my Asiacrypt 2020 paper with Albrecht, Gheorghiu, and Postlethwaite [2].
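If you only want the leading terms, the arithmetic is a few lines of Python. These figures drop the o(b), so they disagree slightly with the exact fixed-b values (see the note on N = 2^85 below):

    from math import log2

    # Leading-order exponents behind the shorthand above; o(b) dropped.
    for name, ratio in [("(4/3)^(b/2)", 4 / 3), ("(9/8)^(b/2)", 9 / 8), ("(3/2)^(b/2)", 3 / 2)]:
        e = log2(ratio) / 2
        print(f"{name} = 2^({e:.4f}b); at b = 384 that's 2^{e * 384:.1f}")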

Space-optimal BDGL

What I just described is the "time-optimal" variant of BDGL. Its time and space complexities are equal. It uses (3/2)^192 space.

You can tune BDGL to use exponentially less space if you're willing to accept a subexponential increase in time. The trick is to iterate over the input list c times and bucket into a different 1/c fraction of the regions in each iteration. In this configuration, the list-decoding routine takes (9/8)^192/c operations per input vector per iteration. The "space-optimal" variant of BDGL takes c = (9/8)^192 and stores just (4/3)^192 nearly-empty buckets in any iteration.
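Here's the tradeoff arithmetic at b = 384, leading-order terms only (my own back-of-the-envelope, not the exact computation mentioned later):

    from math import log2

    # Space-optimal BDGL at b = 384, leading-order arithmetic only.
    b = 384
    list_size = log2(4 / 3) / 2 * b   # ~2^80 vectors stored at any time
    c = log2(9 / 8) / 2 * b           # iterate c = (9/8)^192 ~ 2^33 times
    # Each iteration decodes every vector into a 1/c fraction of the
    # regions, i.e. O(1) buckets per vector, so an iteration costs about
    # one pass over the list.
    time = c + list_size              # ~2^112: same as time-optimal
    print(f"space ~ 2^{list_size:.0f} vectors, time ~ 2^{time:.0f} operations")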

A dose of reality

Assuming 16-bit coordinates, the time-optimal variant stores 2^112 vectors in 2^125 bits of memory and the space-optimal variant stores 2^80 vectors in 2^93 bits of memory.
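The arithmetic behind those figures, with 384 coordinates at 16 bits each (about 2^12.6 bits per vector):

    from math import log2

    # Memory footprints quoted above: 384 coordinates * 16 bits each.
    bits_per_vector = log2(384 * 16)  # ~12.6
    for variant, vecs in [("time-optimal", 112), ("space-optimal", 80)]:
        print(f"{variant}: 2^{vecs} vectors ~ 2^{vecs + bits_per_vector:.0f} bits")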

I suggest you visualize 2^125 bits of memory as a kilometer-thick shell of terabyte microSD cards surrounding the Earth. If you're bullish on technology, you can imagine a meter-thick shell of petabyte microSD cards. If you're a fanatic, you can imagine a millimeter-thick membrane of exabyte/mm^3 DNA goop everywhere. Whatever assumptions you make about technological progress, it's an unreasonable amount of memory.

So, for the remainder of this post, I'll assume that attacks that use more than 2^100 bits of memory are out of scope for comparison with AES-128. This is somewhat arbitrary. I may justify it in a follow-up post.

Note: By exact computation of "2^(0.207...b+o(b))" at b = 384, I find that the space-optimal variant actually needs to store 2^85 vectors, i.e. 2^98 bits. I'll use the exact value in what follows.

The attack machine

The attack machine is a work of science fiction. The story goes like this.

The General Routing Corporation builds R-machines for the government. Their machines all have a similar design: model R(x, y, z, m) has 2^x aisles of server racks, 2^y racks per aisle, 2^z shelves per rack, one (many-core) CPU per shelf, and 2^m bits of RAM per CPU.

The CPUs in an R-machine are networked. The CPUs in one rack have all-to-all connectivity. The racks have nearest-neighbor connectivity.

General Routing's glossy marketing literature advertises R-machines that come packaged as Dyson spheres. But R(20, 20, 10, 50) is the largest machine they've built. It's housed in Nevada. All of Nevada. What used to be Nevada.

The all-to-all connectivity inside a rack provides each CPU with RAM-like access to 2^(z+m) bits of data. For R(20, 20, 10, 50) that's one exabit per rack. There are a trillion racks.
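In code, the naming scheme works out as follows (the RMachine helper is my own convenience, not anything from General Routing's catalog):

    from dataclasses import dataclass

    # A hypothetical helper for the R(x, y, z, m) naming scheme.
    @dataclass
    class RMachine:
        x: int  # log2(aisles)
        y: int  # log2(racks per aisle)
        z: int  # log2(CPUs per rack)
        m: int  # log2(bits of RAM per CPU)

        def log2_racks(self):
            return self.x + self.y

        def log2_bits_per_rack(self):
            return self.z + self.m

        def log2_total_bits(self):
            return self.x + self.y + self.z + self.m

    nevada = RMachine(20, 20, 10, 50)
    print(nevada.log2_racks())          # 40: about a trillion racks
    print(nevada.log2_bits_per_rack())  # 60: about an exabit per rack
    print(nevada.log2_total_bits())     # 100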

Note: After accounting for power, cooling, and wiring (all handled by separate subcontractors), the racks are one meter square at the base. So Nevada was actually a little small. General Routing and the government agreed to a change order during construction: they swapped the R(20, 20, 10, 50) for four R(19, 19, 10, 50)s stacked vertically. This didn't change the performance of the machine, but it was very expensive.

Sketch of the computation

The computation that R(20, 20, 10, 50) performs to break Kyber512 uses the space-optimal BDGL sieve as described above.

The machine performs (9/8)^192 iterations of a loop. In each iteration, the collection of N = 2^85 input vectors is laid out (arbitrarily) in memory across the cluster. The machine:

  1. Uses the BDGL near-neighbor search algorithm to tag each vector with (zero or more) region labels.
  2. Sorts the labelled vectors.
  3. Identifies close pairs among vectors that share the same label.

Note: The list-decoding algorithm uses a "random product code". I won't go into the details here, but I do need to mention that every CPU in R(20, 20, 10, 50) needs a copy of this random product code. The code is represented by m·N^(1/m) unit vectors in dimension 384/m. We need m to be large enough that the code size is negligible compared to each CPU's share of the input. But if we take m too large, the list-decoding algorithm becomes the dominant cost of the attack. This is a balancing act, but something like m = 6 should be fine.
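A quick size check for m = 6, reusing the 16-bit coordinates from earlier (my own arithmetic):

    from math import log2

    # Code size at m = 6: m * N^(1/m) unit vectors, each in dimension
    # 384/m with 16-bit coordinates.
    N_bits, m = 85, 6
    log2_code_bits = log2(m) + N_bits / m + log2(384 / m * 16)
    print(f"~2^{log2_code_bits:.0f} bits, well under the 2^50 bits per CPU")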

Note: The random product code must be generated fresh in each iteration of the outer loop. This can be done in parallel by pre-sharing a DRBG seed across the cluster.

Note: Step 1 maps each input vector to zero or more labelled vectors. The "zero" and "more than one" cases must be handled carefully, but both can be solved by giving each CPU slightly more memory than it needs to store its share of the input list.

The sort

A region label is a string of log2(N) = 85 bits. We'll think of these 85 bits as a quadruple (w, x, y, z), where w is a 35-bit tag and (x, y, z) indexes a CPU.
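Concretely, a label might be packed into an 85-bit integer and unpacked like this (the particular bit layout is my own illustrative choice):

    # Unpack an 85-bit region label into (w, x, y, z). The layout here
    # (shelf in the low bits, tag in the high bits) is an arbitrary choice.
    def split_label(label):
        z = label & (2**10 - 1)          # 10 bits: shelf
        y = (label >> 10) & (2**20 - 1)  # 20 bits: rack
        x = (label >> 30) & (2**20 - 1)  # 20 bits: aisle
        w = label >> 50                  # 35 bits: tag
        return (w, x, y, z)

    assert split_label((1 << 50) | 5) == (1, 0, 0, 5)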

The sort proceeds in three steps:

  1. Sort by z in each rack such that the CPU in the i-th shelf receives all of the vectors with z = i.

Since the CPUs in each rack are all-to-all connected, this can be done with radix sort in 10 rounds of communication. In each round, each of the 2^10 CPUs exchanges about 2^50 bits of data with a neighbor. The cost is 2^50 in-rack bit moves per CPU per round. This is 2^63 in-rack bit moves per rack and 2^103 in-rack bit moves across the entire cluster.

  2. Sort by (x, y) within each z-layer.

Since the racks have nearest-neighbor connectivity, this can be done with the Schnorr–Shamir mesh sort in 2^20 steps. In each step, each of the 2^50 CPUs in the cluster exchanges about 2^50 bits of data with a neighbor in an adjacent rack. This gives a total of 2^(20+50+50) = 2^120 adjacent-rack bit-move operations.

  3. Sort by w on each CPU.

This can be done using any fast RAM sort at negligible cost.
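Tallying the three steps (with the same leading constants as above):

    from math import log2

    # Bit-move tallies for the three-step sort on R(20, 20, 10, 50).
    x, y, z, m = 20, 20, 10, 50
    # Step 1: 10 radix-sort rounds; each of the 2^10 CPUs in each of the
    # 2^40 racks moves ~2^50 bits per round.
    in_rack = log2(10) + m + z + x + y    # ~2^103 in-rack bit moves
    # Step 2: Schnorr-Shamir over the 2^20-by-2^20 grid of racks: 2^20
    # steps, each moving ~2^50 bits per CPU across all 2^50 CPUs.
    adjacent_rack = 20 + (x + y + z) + m  # 2^120 adjacent-rack bit moves
    print(f"step 1: ~2^{in_rack:.0f} in-rack; step 2: 2^{adjacent_rack} adjacent-rack")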

Close pair detection

After the sort, vectors with the same (w, x, y, z) label are adjacent in the memory of CPU (x, y, z). There are not many vectors with any given (w, x, y, z) label, so the actually-close pairs can be identified using parallel, nearly-linear-time searches. This step involves something like 2^85 vector comparisons. It's safe to assume that each comparison costs less than 2^35 adjacent-rack bit moves, so this is negligible.

The cost

The sorting and close-pair detection steps are iterated (9/8)^192 = 2^32 times. The total cost of the attack is dominated by 2^(120+32) = 2^152 adjacent-rack bit moves. This is significantly more expensive than the 2^143 logic gates that are needed to break AES-128.
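The same tally in code (note that (9/8)^192 is really about 2^32.6; I'm rounding down to 2^32 as above):

    from math import log2

    # Outer loop count times the per-iteration sort cost.
    iterations = log2(9 / 8) / 2 * 384  # 32.6..., rounded to 32 above
    print(f"(9/8)^192 = 2^{iterations:.1f} iterations")
    print(f"total: 2^(120+32) = 2^{120 + 32} adjacent-rack bit moves")
    print("vs ~2^143 logic gates for AES-128")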

Note: A naive application (misinterpretation?) of the NTRU Prime submission's "Non-quantum 0.396b, non-hybrid, real cost of memory" estimate gives something similar (0.396 · 384 = 152), but this is a coincidence. This naive 0.396b estimate is less realistic than mine in some respects and more realistic in others. It is less realistic because I used an exact computation to arrive at N = 2^85 instead of N = 2^80. It is more realistic because it does not allow unit-cost random access to the exabit memories.

What if I can only afford an R(20, 20, 10, 40)?

If you only have 2^90 bits of memory, then you cannot iterate a BDGL-based sieve. So you cannot use BDGL to find dimension-384 vectors that are as short as you like. You cannot use BDGL to break Kyber512.

This point bears repeating, as the problem compounds at higher security levels.

The standard attack on Kyber768 uses something like 2^138 bits of storage. If you can't afford an R(20, 20, 10, 90) (and who can these days?), you can't use BDGL to attack Kyber768. If all you have lying around is an R(20, 20, 10, 50)... then you start looking at other sieves: exponentially slower sieves that use less space. You might end up pretty far down the list of low-memory sieving algorithms. You might even have to consider superexponential-time lattice point enumeration.

After all, your goal as a codebreaker is to maximize the success probability of the computation that you can afford to run. You're not in it to daydream about Dyson spheres.


Update 2023-11-17: better attack parameters

Instead of covering the sphere with (3/2)^(b/2) overlapping regions, we can cover it with fewer, but larger, regions. As we decrease the number of regions that we use, the number of vectors in each bucket increases, and the cost of close-pair detection grows (quadratically). But this is offset by the fact that the number of close pairs that we find per bucket also increases, and the total number of buckets that we need to consider decreases.

When using an R(x, x, z, m), the optimal number of vectors per bucket is roughly 2^x. That's because the sorting step moves 2^(2x+z+m) bits of memory 2^x times, and the close-pair detection step (implemented as a "systolic array" where a copy of the data is moved past the original in a cycle) moves 2^(2x+z+m) bits of data k times, where k is the average number of vectors in a bucket.
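Here is that balancing argument as a simplified per-iteration cost function (my own sketch; the real optimization is in the script linked below). Growing the buckets is free until k reaches about 2^x, and larger buckets mean fewer buckets to consider, so you push k right up to the balance point:

    from math import log2

    # Per-iteration cost of sort-plus-detect on an R(x, x, z, m), in log2.
    # Simplified: constants and the outer-loop count are ignored.
    def log2_iteration_cost(x, z, m, log2_k):
        memory = 2 * x + z + m       # bits held by the whole machine
        sorting = x + memory         # mesh sort: 2^x passes over memory
        detection = log2_k + memory  # systolic detection: k passes
        return log2(2**sorting + 2**detection)

    # Flat until the bucket size k reaches ~2^x, then detection dominates:
    for log2_k in (5, 15, 20, 25, 35):
        print(log2_k, round(log2_iteration_cost(20, 10, 50, log2_k), 1))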

The RAM model gives you an R(0, 0, 0, m). In this case, (3/2)^(b/2) regions (which corresponds to covering the sphere with spherical caps of angle π/3) is optimal.

I used an R(20, 20, 10, 50) above, and I failed to balance the cost of sorting and close-pair detection. Using this python script, which I based on the software from ePrint 2019/1161, I find that caps of angle approximately 61π/173 are optimal. Properly parameterized, the attack performs 2^142 adjacent-rack bit moves and an equal number of vector comparisons. (Both of these operations cost significantly more than one logic gate.)

What happens as the sieving dimension, b, grows? Locality forces the attacker to keep m constant. Power and cooling constraints force them to keep z constant. So they will use an R(x, x, c_z, c_m) for the smallest x that gives them enough memory. They will tune the algorithm so that there are 2^x vectors in each bucket, which is roughly the square root of the input list. And, ultimately, their cost will grow like 2^((0.349...+o(1))b), although the convergence is pretty slow.

They'll use an R(18.3, 18.3, 10, 50) for b = 384, which costs 2^(0.371·384) = 2^142 operations.

They'll use an R(38.7, 38.7, 10, 50) for b = 576, which costs 2^(0.366·576) = 2^211 operations.

They'll use an R(59, 59, 10, 50) for b = 768, which costs 2^(0.363·768) = 2^279 operations.

The asymptotics don't really matter, as realistic adversaries will use an asymptotically slower algorithm if it has a higher success probability when run on a machine that they can afford (and those algorithms exist).

It's much more important to nail down the c_z and c_m. For instance, if we only allow gigabit RAMs, then the adversary will use an R(33.3, 33.3, 0, 30) or an R(33.3, 33.3, 10, 20) for b = 384, which will cost 2^(0.392·384) = 2^151 operations. If we only allow kilobit RAMs, the cost is 2^(0.407·384) = 2^156 operations.