Glue and coprocessor architectures
2024 Sep 02
Special thanks to Justin Drake, Georgios Konstantopoulos, Andrej
Karpathy, Michael Gao, Tarun Chitra and various Flashbots contributors
for feedback and review.
If you analyze any resource-intensive computation being done in the
modern world in even a medium amount of detail, one feature that you
will find again and again is that the computation can be broken up into
two parts:
- A relatively small amount of complex, but not very computationally
intensive, "business logic"
- A large amount of intensive, but highly structured,
"expensive work"
These two forms of computation are best handled in different ways:
the former, with an architecture that may have lower efficiency but
needs to have very high generality, and the latter, with an architecture
that may have lower generality, but needs to have very high
efficiency.
What are some examples of this separation in practice?
To start off, let us look under the hood of the environment I am most
familiar with: the Ethereum Virtual Machine (EVM). Here
is the geth debug trace of a recent Ethereum transaction that I did:
updating the IPFS hash of my blog on ENS. The transaction consumes a
total of 46,924 gas, which can be categorized in this way:
- Base cost: 21,000
- Calldata: 1,556
- EVM execution: 24,368
    - SLOAD opcode: 6,400
    - SSTORE opcode: 10,100
    - LOG opcode: 2,149
    - Other: 6,719
EVM trace of an ENS hash update. The second-to-last column is gas consumption.
The moral of the story is: most of the execution
(~73% if you look at the EVM alone, ~85% if you include the portion of
the base cost that covers computation) is concentrated in a very
small number of structured expensive operations: storage reads
and writes, logs, and cryptography (the base cost includes 3,000 to pay
for signature verification, and the EVM also includes 272 to pay for
hashing). The rest of the execution is "business
logic": fiddling around with the bits of the calldata to
extract the ID of the record I am trying to set and the hash I am
setting it to, and so on. In a token transfer, this would include adding
and subtracting balances; in a more advanced application, it might
include a loop; and so on.
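If you want to reproduce this kind of breakdown yourself, a short script that tallies gas per opcode from a saved geth debug_traceTransaction result looks roughly like the sketch below. The field names follow geth's standard struct-logger output; the file path is just an illustrative placeholder.

```python
import json
from collections import Counter

# Minimal sketch: tally gas usage per opcode from a saved geth
# `debug_traceTransaction` result (the "structLogs" struct-logger format).
# "trace.json" is a placeholder path for a trace you have already exported.
with open("trace.json") as f:
    trace = json.load(f)

gas_by_op = Counter()
for step in trace["structLogs"]:
    # Note: for call-type opcodes, gasCost also includes gas forwarded to the callee.
    gas_by_op[step["op"]] += step["gasCost"]

for op, gas in gas_by_op.most_common(10):
    print(f"{op:10s} {gas:8d}")
```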
In the EVM, these two forms of execution are handled in different
ways. The high-level business logic is written in a higher-level
language, often Solidity, which compiles to the EVM. The expensive work
is still triggered by EVM opcodes (SLOAD, etc.), but > 99% of the actual
computation is done in specialized modules written directly inside of
client code (or even libraries).
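To make that split concrete, here is a toy sketch (not any real client's code) of how an EVM-style interpreter can route ordinary opcodes through a generic loop while handing the expensive, structured work to an optimized native implementation:

```python
import hashlib

# Toy sketch of the glue/coprocessor split inside a client: a generic interpreter
# loop for ordinary opcodes, with the expensive structured work (here, hashing)
# delegated to an optimized native implementation. Purely illustrative.
NATIVE = {
    "SHA256": lambda data: hashlib.sha256(data).digest(),  # optimized C under the hood
}

def execute(program, stack=None):
    stack = stack or []
    for op, arg in program:
        if op in NATIVE:            # "coprocessor" path: specialized module
            stack.append(NATIVE[op](stack.pop()))
        elif op == "PUSH":          # "glue" path: generic, interpreted opcodes
            stack.append(arg)
        elif op == "ADD":
            stack.append(stack.pop() + stack.pop())
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack

# Example: hash some data, then add two numbers.
print(execute([("PUSH", b"hello"), ("SHA256", None), ("PUSH", 1), ("PUSH", 2), ("ADD", None)]))
```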
To reinforce our understanding of this pattern, let's explore it in
another context: AI code written in Python using torch.
Forward pass of one block of a transformer model, source.
What do we see here? We see a relatively small amount of "business
logic", written in Python, which describes the structure of the
operations that are being done. In an actual application, there will
also be another type of business logic, which determines details like
how you get the input and what you do to the output. But if we peek
into each of the individual operations themselves (self.norm,
torch.cat, +, *, the various steps inside self.attn...), we
see vectorized computation: the same operation getting computed
on a large number of values in parallel. Similarly to the first example,
a small portion of the compute is spent on business logic, and the bulk
of the compute is spent on performing the big structured matrix and
vector operations - in fact, the majority is just matrix
multiplication.
Just like in the EVM example, the two types of work are handled in
two different ways. The high-level business logic code is written in
Python, a highly general and flexible language which is also very slow,
and we just accept the inefficiency because it only touches a small part
of the total computational cost. Meanwhile, the intensive operations are
written in highly optimized code, often CUDA code running on a GPU.
Increasingly, we're even starting to see LLM inference being
done on ASICs.
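To make the pattern concrete, here is a minimal sketch of a pre-norm transformer block in torch (illustrative only, not the exact code from the figure above): a few lines of Python "business logic", each of which dispatches large vectorized tensor operations.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Minimal pre-norm transformer block; an illustrative sketch, not the figure's code.
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # A few lines of "business logic" describing the structure of the computation...
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # ...each dispatching big matrix ops
        x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(2, 16, 64)      # (batch, sequence, embedding)
print(Block(64, 8)(x).shape)    # torch.Size([2, 16, 64])
```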
Modern programmable
cryptography, such as SNARKs,
follows a similar pattern yet again, on two levels.
First, the prover can be written in a high-level language where the
heavy work is done with vectorized operations, just like the AI example
above. My circle
STARK code here shows this in action. Second, the program that is
being executed inside the cryptography can itself be written in a way
that is split between generalized business logic and highly structured
expensive work.
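For a flavor of what "heavy work done with vectorized operations" means inside a prover, consider batched arithmetic over a small prime field: a single numpy call performs millions of field multiplications at once, and essentially all of the prover's time is spent in such calls rather than in the surrounding Python. The modulus below is just an illustrative Mersenne prime, not taken from any particular codebase.

```python
import numpy as np

P = 2**31 - 1  # illustrative 31-bit Mersenne prime, in the spirit of small-field STARKs

def batch_mul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # One vectorized call = millions of field multiplications.
    return (a.astype(np.uint64) * b.astype(np.uint64)) % P

def eval_poly(coeffs: np.ndarray, xs: np.ndarray) -> np.ndarray:
    # Horner's rule evaluated over the whole domain at once.
    acc = np.zeros_like(xs, dtype=np.uint64)
    for c in coeffs[::-1]:
        acc = (batch_mul(acc, xs) + c) % P
    return acc

xs = np.arange(1, 1_000_001, dtype=np.uint64) % P
print(eval_poly(np.array([3, 1, 4, 1, 5], dtype=np.uint64), xs)[:3])
```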
To see how this works, we can look at one of the latest trends in
STARK proving. To be general-purpose and easy to use, teams are
increasingly building STARK provers for widely-adopted minimal virtual
machines, such as RISC-V. Any program
whose execution needs to be proven can be compiled into RISC-V, and then
the prover can prove the RISC-V execution of that code.
Diagram from RiscZero
documentation
This is super convenient: it means that we only need to write the
prover logic once, and from that point forward any program that needs to
be proven can just be written in any "conventional" programming language
(eg. RiscZero supports Rust). However, there is a problem: this approach
incurs significant overhead. Programmable cryptography is already very
expensive; adding the overhead of running code inside a RISC-V
interpreter is too much. And so developers have come up with a hack: you
identify the specific expensive operations that make up the bulk of the
computation (often that's hashes and signatures), and you create
specialized modules to prove those operations extremely efficiently. And
then you just combine the inefficient-but-general RISC-V proving system
and the efficient-but-specialized proving systems together, and you get
the best of both worlds.
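Schematically, the hack looks something like the sketch below: walk the execution trace, route the structured expensive steps to a dedicated prover, and prove everything else with the general-purpose zkVM. The names are purely illustrative, not RiscZero's or any other project's actual API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    op: str
    data: bytes = b""

# Placeholder "provers": real systems are vastly more involved; these are stubs.
def prove_riscv(steps):        return {"kind": "riscv", "steps": len(steps)}    # general but slow
def prove_sha256_table(steps): return {"kind": "sha256", "steps": len(steps)}   # specialized but fast
def combine(*proofs):          return list(proofs)   # recursion/aggregation in practice

def prove_program(trace):
    generic, hashes = [], []
    for step in trace:
        # Route structured expensive work to the specialized module,
        # and everything else (the "business logic") to the RISC-V prover.
        (hashes if step.op == "SHA256" else generic).append(step)
    return combine(prove_riscv(generic), prove_sha256_table(hashes))

print(prove_program([Step("ADD"), Step("SHA256", b"x"), Step("BEQ")]))
```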
Programmable cryptography other than ZK-SNARKs, such as multi-party
computation (MPC) and fully homomorphic encryption (FHE), will likely be
optimized using a similar approach.
What is the general pattern at play?
Modern computation is increasingly following what I call a
glue and coprocessor architecture: you have some
central "glue" component, which has high generality but low efficiency,
which is responsible for shuttling data between one or more
coprocessor components, which have low generality but
high efficiency.
This is a simplification: in practice, there are almost always more
than two levels along the tradeoff curve between efficiency and
generality. GPUs, and other chips that are often called "coprocessors"
in industry, are less general than CPUs but more general than ASICs.
There are complicated tradeoffs of how far to specialize, which
are decided based on projections and intuitions about what parts of an
algorithm will still be the same in five years, and which parts will
change in six months. In a ZK-proving architecture, we often similarly
see multiple layers of specialization. But for a broad mental model,
it's sufficient to think about two levels. There are parallels to this
in many domains of computation:
| Domain | Glue | Coprocessor |
| --- | --- | --- |
| Ethereum | EVM | Dedicated opcodes/precompiles for specialized operations |
| AI (eg. LLMs) | Python (often) | GPU via CUDA; ASICs |
| Web apps | Javascript | WASM |
| Programmable cryptography | RISC-V | Specialized modules (eg. for hashes and signatures) |
From the above examples, it might feel like a law of nature that of
course computation can be split in this way. And indeed, you can find
examples of this kind of specialization in computing going back decades.
However, I would argue that this separation is increasing. I think this
is true for a few key reasons:
- We have only relatively recently hit the limits of increasing CPU
clock speed, and so further gains can only come from
parallelization. However, parallelization is hard to
reason about, and so it's often more practical for developers to
continue reasoning sequentially, and let the parallelization happen in
the backend, wrapped inside specialized modules built for specific
operations (a short sketch after this list illustrates this).
- Computation has only recently become so fast that the
computational costs of business logic have become truly
negligible. In this world, it makes sense to optimize the VM
that business logic runs in for goals other than compute efficiency:
developer friendliness, familiarity, security, and other similar
objectives. Meanwhile, the specialized "coprocessor" modules can
continue to be designed for efficiency, and get their security and
developer friendliness properties from the relatively simple "interface"
that they have with the glue.
- It's becoming clearer what the most important expensive
operations are. This is most obvious in cryptography, where
it's clear what kinds of specific expensive operations are most likely
to be used: modular arithmetic, elliptic curve linear
combinations (aka multi-scalar multiplications), Fast Fourier
transforms, and so on. It's also becoming clearer in AI, where the bulk
of computation has been "mostly matrix multiplication" (albeit with different
levels of precision) for
over two decades. Similar trends appear in other domains. There are just
far fewer unknown unknowns in (computationally intensive) computing
than there were 20 years ago.
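The sketch promised in the first point above: both snippets compute the same matrix product, but the second reads as a single sequential statement while numpy hands the heavy work to an optimized, parallel backend. A deliberately tiny illustration, not a benchmark.

```python
import numpy as np

n = 128
A, B = np.random.rand(n, n), np.random.rand(n, n)

# What the developer "reasons about": a plain sequential triple loop (very slow).
def matmul_naive(A, B):
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

# What the developer actually writes: one sequential-looking line, with the
# parallelization hidden inside the optimized BLAS backend.
C = A @ B
assert np.allclose(matmul_naive(A, B), C)
```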
What does this imply?
A key takeaway is that glue should optimize for being good
glue, and coprocessors should optimize for being good
coprocessors. We can explore the implications of this in a few
key areas.
EVM
Blockchain virtual machines (eg. EVM) don't need to be
efficient, they just need to be familiar. Computation in an
inefficient VM can be made almost as efficient in practice as
computation in a natively efficient VM by just adding the right
coprocessors (aka "precompiles"). The overhead incurred by eg. the EVM's
256-bit registers is relatively small, while the benefits from the EVM's
familiarity and existing developer ecosystem are great and durable.
Developer teams optimizing the EVM are even finding that lack of
parallelization is often not a primary barrier to scalability.
The best ways to improve the EVM may well just be (i) adding
better precompiles or specialized opcodes, eg. some combination
of EVM-MAX
and SIMD may be
justified, and (ii) improving the storage layout, which
eg. the Verkle tree
changes do as a side effect by greatly reducing the cost of
accessing storage slots that are beside each other.
Storage optimizations in the Ethereum Verkle tree proposal,
putting adjacent storage keys together and adjusting gas costs to
reflect this. Optimizations like this, together with better precompiles,
may well matter more than tweaking the EVM itself.
Secure computing and open hardware
One of the big challenges with improving security of modern computing
at the hardware layer is the overcomplicated and proprietary nature of
it: chips are designed to be highly efficient, which requires
proprietary optimizations. Backdoors are easy to hide, and side
channel vulnerabilities keep getting discovered.
There continues to be a valiant effort to push more open and more
secure alternatives from multiple angles. Some computations are
increasingly done in trusted execution environments, including on users'
phones, and this has increased security for users already. The push
toward more-open-source consumer hardware continues, with recent
victories like a RISC-V
laptop running Ubuntu.
RISC-V laptop running Debian, source
However, efficiency continues to be a problem. The author of the
above-linked article writes:
"It's unfeasible for a newer, open-source chip design like RISC-V to
go toe to toe with processor technologies that have been around for and
refined for decades. Progress has a starting point."
More paranoid ideas, like this design for
building a RISC-V computer on top of an FPGA, face even more
overhead. But what if glue and coprocessor architectures mean
that this overhead does not actually matter? What if we accept
that open and secure chips will be slower than proprietary chips, if
needed even giving up on common optimizations like speculative execution
and branch prediction, but try to compensate for this by adding (if
needed, proprietary) ASIC modules for specific types of computation that
are the most intensive? Sensitive computations can be done in a "main
chip" that would be optimized for security, open source design and
side-channel resistance. More intensive computations (eg. ZK-proving,
AI) would be done in the ASIC modules, which would learn less
information (potentially, with cryptographic blinding, perhaps in some
cases even zero information) about the computation being performed.
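As a toy illustration of how such blinding could work, the sketch below uses additive masking over a prime field to hide the input from the accelerator doing a matrix-vector product. This is not a production protocol: note in particular that the main chip still needs A·r, which in a real deployment would be precomputed offline or handled by a more sophisticated scheme.

```python
import numpy as np

P = 65_521  # a small prime so the dot products fit in int64; real systems use larger fields

rng = np.random.default_rng()
A = rng.integers(0, P, size=(256, 256), dtype=np.int64)  # model weights (held by the ASIC)
x = rng.integers(0, P, size=256, dtype=np.int64)         # sensitive input (held by the main chip)

# Main chip: blind the input with a uniformly random mask before offloading.
r = rng.integers(0, P, size=256, dtype=np.int64)
x_blinded = (x + r) % P          # the ASIC sees only x + r, a one-time-padded value

# ASIC: does the heavy linear algebra on blinded data.
y_blinded = (A @ x_blinded) % P

# Main chip: unblind using A @ r (precomputed offline in practice).
y = (y_blinded - (A @ r) % P) % P
assert np.array_equal(y, (A @ x) % P)
```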
Cryptography
Another key takeaway is that this is all very optimistic for
cryptography, especially programmable
cryptography, going mainstream. We're already seeing hyper-optimized
implementations of some specific highly structured computations in
SNARKs, MPC and other settings: overhead for some hash
functions is in the range of being only a few hundred times more
expensive than running the computation directly, and extremely low
overhead for AI (which is mostly just matrix multiplications) is also
possible. Further improvements like GKR will
likely reduce this further. Fully general purpose VM execution,
especially if executed inside a RISC-V interpreter, will likely continue
to have something like ten-thousand-fold overhead, but for the reasons
described in this post, this will not matter: as long as the
most intensive parts of a computation are handled separately using
efficient dedicated techniques, the total overhead will be
manageable.
A simplified diagram of a dedicated MPC for matrix
multiplication, the largest component in AI model inference. See this paper for more
details, including ways to keep both the model and the input
private.
One exception to the idea that "glue only needs to be familiar, not
efficient" is latency, and to a smaller extent data
bandwidth. If a computation involves doing repeated heavy
operations on the same data dozens of times (as cryptography and AI both
do), any delays that result from an inefficient glue layer can become a
primary bottleneck to running time. Hence, glue also has efficiency
requirements, though they are more specific ones.
Conclusion
On the whole, I consider the above trends to be very positive
developments from several perspectives. First, it is the logical way to
maximize computing efficiency while preserving developer friendliness,
and being able to get more of both at the same time benefits everyone.
In particular, by enabling more efficiency gains from specialization on
the client side, it improves our ability to run computations that are
both sensitive and performance-demanding (eg. ZK-proving, LLM inference)
locally on the user's hardware. Second, it creates a large
window of opportunity to ensure that the drive for efficiency does not
compromise other values, most notably security, openness and
simplicity: side-channel security and openness in computer
hardware, reducing circuit complexity in ZK-SNARKs, and reducing
complexity in virtual machines. Historically, the drive for efficiency
has led to these other factors taking a back seat. With
glue-and-coprocessor architectures, it no longer needs to. One
part of the machine optimizes for efficiency, and the other part
optimizes for generality and other values, and the two work
together.
The trend is also very beneficial for cryptography, because
cryptography itself is a major example of "expensive structured
computation", which gets accelerated by this trend. This adds a further
opportunity to increase security. A security increase also becomes
possible in the world of blockchains: we can worry less about optimizing
virtual machines, and instead focus more on optimizing precompiles and
other features that live alongside virtual machines.
Third, this trend presents an opportunity for smaller and
newer players to participate. If computation is becoming less
monolithic, and more modular, that greatly decreases the barrier to
entry. Even with an ASIC for one type of computation, it's possible to
make a difference. The same will be true in the ZK-proving space, and in
EVM optimization. Writing code that has near-frontier-level efficiency
becomes much easier and more accessible. Auditing and formally
verifying such code becomes easier and more accessible. And
finally, because these very different domains of computing are
converging on some common patterns, there is more room for collaboration
and learning between them.