Why are neural networks and cryptographic ciphers so similar?
At first glance, training language models and encrypting data seem like completely different problems: one learns patterns from examples to generate text, the other scrambles information to hide it. Yet their underlying algorithms share a curious resemblance, and it’s not for lack of creativity.
Sequence processing: the sequential version
Consider the venerable recurrent neural network, feeding text token by token into a recurrent state before generating the output text:
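A minimal sketch of that loop, with hypothetical `update` and `generate` functions standing in for the learned recurrent cell and output head:

```python
# A minimal sketch of the recurrent loop. `update` and `generate` are
# hypothetical stand-ins for the learned recurrent cell and output head.
def run_rnn(tokens, update, generate, state, n_outputs):
    # Absorb: fold each input token into the fixed-size recurrent state.
    for token in tokens:
        state = update(state, token)
    # Generate: emit output tokens one at a time, updating the state as we go.
    outputs = []
    for _ in range(n_outputs):
        token, state = generate(state)
        outputs.append(token)
    return outputs
```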
This is structurally identical to the Sponge construction in SHA-3, absorbing bytes into a state before squeezing out the hash:
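A sketch in the same style, with `permute` standing in for the Keccak-f permutation that SHA-3 actually uses and `rate` for the number of exposed state bytes:

```python
# A minimal sketch of the sponge. `permute` is a stand-in for the Keccak-f
# permutation; `rate` is how many state bytes are exposed per block.
def sponge_hash(message_blocks, permute, state, rate, n_output_blocks):
    # Absorb: XOR each message block into the outer part of the state,
    # then apply the permutation.
    for block in message_blocks:
        outer = bytes(s ^ b for s, b in zip(state[:rate], block))
        state = permute(outer + state[rate:])
    # Squeeze: read output from the outer part, permuting between reads.
    digest = b""
    for _ in range(n_output_blocks):
        digest += state[:rate]
        state = permute(state)
    return digest
```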
Perhaps this similarity isn’t surprising: to process variable-length input into a fixed-size state, absorbing sequentially is a natural choice.
Sequence processing: the parallel version
Modern hardware is parallel all the way down, so sequential absorbing wastes performance. Both fields found the same solution: run the expensive function on all chunks in parallel rather than sequentially, then combine with simple addition:
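A sketch of the shared shape, with `process` standing in for the expensive per-chunk function (a learned chunk encoder in one field, a keyed block transform in the other):

```python
from concurrent.futures import ThreadPoolExecutor

# A sketch of the parallel shape. `process` is a stand-in for the expensive
# per-chunk function; the final combine is plain addition.
def combine_chunks(chunks, process, zero):
    with ThreadPoolExecutor() as pool:
        processed = list(pool.map(process, chunks))  # all chunks in parallel
    total = zero
    for p in processed:
        total = total + p                            # cheap, order-insensitive combine
    return total
```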
Addition loses ordering information, so both approaches recover ordering by adding position encodings to each chunk.
In neural networks, this construction¹ drives the Transformer architecture, which improved upon sequential recurrent networks. In cryptography, this construction powers the fastest Message Authentication Codes².
The basic primitive: alternating linear and nonlinear layers, repeated identically
Strip away the variable-length processing. What’s inside the core function? The same pattern in both fields: linear transform, nonlinear transform, repeat:
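A sketch of that core using numpy; the weight matrices stand in for learned parameters in one field and fixed constants in the other, and ReLU stands in for whichever nonlinearity (activation or S-box) is used:

```python
import numpy as np

# A sketch of the core: the same layer repeated many times.
def core(state, weights, n_rounds):
    for r in range(n_rounds):
        state = weights[r] @ state        # linear mix: every element touches every element
        state = np.maximum(state, 0.0)    # elementwise nonlinearity (ReLU here)
    return state
```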
Linear transforms provide “mixing” between different vector positions, allowing many vector elements to influence many other vector elements. Nonlinear transforms provide complexity: without them, the whole stack of layers would degenerate to a single linear transform.
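To see why the nonlinearity matters, here is a small numpy check that two stacked linear layers collapse into a single one:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
x = rng.standard_normal(4)

# Two linear layers with no nonlinearity in between are just one linear layer:
# applying W1 then W2 equals applying the single matrix W2 @ W1.
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)
```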
Both fields repeat this identical layer many times rather than crafting bespoke structures. This focuses research and engineering effort: one layer type to analyze, and to optimize in software or in silicon.
Efficient mixing: alternating rows and columns
Zoom in further. Both fields organize their state as a grid and alternate between mixing rows and mixing columns:
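A sketch of the factored pattern on a square 2D state, with `mix` standing in for the expensive operation applied to one row (or one column) at a time:

```python
import numpy as np

# A sketch of factored mixing on a square grid (a 2D numpy array).
def mix_grid(state, mix, n_rounds):
    for _ in range(n_rounds):
        state = np.stack([mix(row) for row in state])       # mix within each row
        state = np.stack([mix(col) for col in state.T]).T   # mix within each column
    return state
```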
In neural nets: attention mixes across sequence positions (rows), while feed-forward layers mix within each position (columns). In the AES cipher: ShiftRows permutes across columns while MixColumns combines within them. The ChaCha20 cipher alternates column-wise and diagonal mixing.
This factored approach often beats mixing the entire state at once. It's often asymptotically faster when the mixing step is slower than linear: e.g. under quadratic mixing, splitting a state of n elements into √n rows of √n elements costs n^1.5 (√n rows, each costing (√n)² = n), versus n² for mixing the full state at once. More importantly, each row is processed independently and with a small working set, offering more parallelism and fitting better in caches and registers.
What’s causing the similarities?
The similarities do not appear to be due to shallow copying of ideas: the research papers and histories of the two fields show little borrowing from each other. Instead, there are underlying similarities between the problem statements themselves.
What distinguishes neural networks and symmetric cryptography from other fields of algorithm design are the following three properties.
1. The correctness property demanded of the algorithm is remarkably weak
Most algorithms face strong correctness requirements. Compilers must preserve program meaning. Databases must return exactly what was stored. Network routers must deliver the packet.
In comparison, cryptography needs just invertibility, to avoid losing information, and neural networks need just differentiability, for gradient descent. You can build a wide range of invertible or differentiable functions simply by composing smaller invertible or differentiable functions.
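One concrete illustration of how permissive this is: a Feistel round turns any function at all into an invertible step. A hedged sketch with a made-up round function, not a real cipher:

```python
# Invertibility from arbitrary parts: a single Feistel round is invertible no
# matter what the round function F is. F here is a made-up example.
def feistel_round(left, right, F):
    return right, left ^ F(right)

def feistel_round_inverse(left, right, F):
    return right ^ F(left), left

F = lambda x: (x * x + 12345) & 0xFFFFFFFF   # arbitrary, not itself invertible
l, r = feistel_round(7, 42, F)
assert feistel_round_inverse(l, r, F) == (7, 42)
```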
This freedom enables radical simplicity. Both fields build from two or three simple primitives repeated in a loop: simple enough to implement in 20 lines of code. This freedom also enables rapid experimentation: 50+ SHA-3 submissions, hundreds of attention variants. When almost any function could “work”, you can optimize your other goals more aggressively.
2. Quality requirements focus on complexity and mixing
Beyond the basic correctness requirement, both fields share a similar notion of quality. Cryptography needs every output bit to depend on every input bit in complicated ways. Neural networks need the outputs to make the best use of all the input information. Both reward designs that let every part of the state interact with every other part, over and over again. Hence the repeated mixing layers: information must flow between positions not once but many times, creating rich interdependencies.
Other fields value mixing but not complexity: sorting requires every output to be compared against every input; network topologies such as Clos networks require every output to be reachable from every input. These fields tend to produce algorithms in which all inputs interact with each other exactly once and then finish, whereas cryptography and neural networks repeat the interaction many times.
3. Unusually large emphasis on performance
These fields are rare among algorithmic fields in the emphasis placed on low-level hardware performance, routinely including assembly implementations and custom hardware. This emphasis arises from economic pressures such as the ubiquity of encryption and the massive scale of neural networks.
Emphasizing performance rewards simple algorithms: it makes assembly implementations and custom hardware tractable. It also rewards the parallelism we saw at every level of the design: parallel sequence processing at the top level, parallel mixers such as alternating rows and columns in the middle, and linear algebra, which is easily parallelizable, at the lowest level.
Convergent evolution in algorithms
These parallels suggest something fundamental: when we demand algorithms that mix thoroughly and in a complex way, have few other correctness requirements, and perform extremely well on hardware, the best solutions may look very similar. Just as biological evolution independently invented eyes multiple times, human research seems to have invented the “deeply parallel repeated-layer mixers” structure multiple times.
We’ve already seen ideas jump between fields. RevNets brought cryptography’s Feistel networks to neural networks, enabling reversible layers that save memory. What’s next? Are there neural network analogs of Column Parity Mixers or “unaligned mixers”?