Kaspa 20 minute BlockDAG freeze post-mortem
Last Saturday (16/12/2023) Kaspa mining halted for about 20–30 minutes and no new blocks were added to the BlockDAG because of a bug, which was fixed by the release of version v0.12.15. I delayed posting a post-mortem until 99% of the miners had upgraded to the new version, in order to prevent potential attackers from taking advantage of the situation. Now that this threshold has been reached (validated with Michael Sutton’s cool tool), here is a detailed explanation of the bug and how it was fixed.
In February, an Italian group that was working on implementing atomic swaps on Kaspa noticed something weird: even after formulating a perfect atomic swap transaction, their transaction couldn’t be accepted to the mempool. They reached out to me and asked about it, and after some investigation I found that the cause was a peculiar bug:
When a transaction is received from RPC, it’s converted to an internal golang type called DomainTransaction, which is then used when processing the transaction in the mempool, or inside the consensus layer as part of a block. It turned out there was a small bug that copied the content of one field, called lockTime, to the value of another field, called gas.
The lockTime field is related to implementing contracts where funds are locked for some time (for more info, you can read Andreas Antonopoulos’ explanation about it), and is utilized by atomic swaps. The gas field is reserved for future integration of subnetworks in Kaspa, and since subnetwork support is not currently enabled, any valid transaction has to set it to 0. For a typical transaction, the lockTime field is set to 0, so the bug simply copied its 0 value to the gas field and the transaction remained valid. However, in the case of an atomic swap transaction, the lockTime field is set to a non-zero value, which is copied to the gas field and makes the transaction invalid.
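To make the mechanism concrete, here is a minimal sketch of this kind of conversion bug. The type and function names (RPCTransaction, convertBuggy, convertFixed, isValid) are hypothetical simplifications for illustration and are not the actual kaspad code; only the two fields discussed above are modeled.

```go
package main

import "fmt"

// RPCTransaction and DomainTransaction are hypothetical, stripped-down
// stand-ins for the real types; only lockTime and gas are kept.
type RPCTransaction struct {
	LockTime uint64
	Gas      uint64
}

type DomainTransaction struct {
	LockTime uint64
	Gas      uint64
}

// convertBuggy illustrates the mistake described above: Gas is filled
// from the wrong source field.
func convertBuggy(tx RPCTransaction) DomainTransaction {
	return DomainTransaction{
		LockTime: tx.LockTime,
		Gas:      tx.LockTime, // bug: should be tx.Gas
	}
}

// convertFixed copies each field from its matching source field.
func convertFixed(tx RPCTransaction) DomainTransaction {
	return DomainTransaction{LockTime: tx.LockTime, Gas: tx.Gas}
}

// isValid applies the single rule from the text: while subnetworks are
// disabled, a valid transaction must have gas set to 0.
func isValid(tx DomainTransaction) bool {
	return tx.Gas == 0
}

func main() {
	regular := RPCTransaction{LockTime: 0, Gas: 0}
	swap := RPCTransaction{LockTime: 1, Gas: 0}

	fmt.Println(isValid(convertBuggy(regular))) // true: 0 copied over 0, bug is invisible
	fmt.Println(isValid(convertBuggy(swap)))    // false: gas became 1
	fmt.Println(isValid(convertFixed(swap)))    // true
}
```

The sketch shows why the bug stayed hidden for so long: with lockTime set to 0 the faulty copy is a no-op, so only transactions that actually use lockTime are affected.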
The solution for this was simple, but I wanted to wait for more feedback from the Italian group about my solution before publishing it. At the time, I thought the fix was not urgent, since the function of locking funds is currently not available to non-developers, and if some developer tried to use that function, the worst that could happen was that their transaction would get rejected.
And I was very wrong about that. The bug fix was forgotten, and on December 16th the consequences arrived. A fresh recruit to the Rust team, Maxim Biryukov, started experimenting with HTLC transactions, and in one of his tests he used a transaction with lockTime set to 1 (just to clarify: it is completely OK that Maxim ran this test, as such transactions weren’t expected to be problematic). Since Maxim used a rusty-kaspa node that was free of this bug, his transaction was accepted to his mempool without any problem. The transaction, which we’ll call mtx from now on (short for “Maxim Transaction”), was then broadcast to the rest of the network in the p2p layer. Since the bug doesn’t appear in the p2p code, all of the golang nodes accepted the transaction to their mempool without any problem.
But here comes the tricky part. When any miner found a block, it included mtx (since it was not mined yet). The miner then sent the block to their golang node via RPC, so that the node could propagate it to the rest of the network. And then, when converting the RPC block transactions to DomainTransactions, the bug came into action: the conversion code took mtx, and since mtx.lockTime was set to 1, it also set mtx.gas to 1. This change in the block transactions made the merkle root (the part of a block header that commits to the block content) invalid, since it was built before mtx was changed. Since no block containing mtx could be accepted, mtx remained in everyone’s mempool, so every newly mined block included it and was invalid as well.
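The effect on the header commitment can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the commitment below is a plain SHA-256 hash over the serialized fields rather than Kaspa's real merkle tree and hash function, and Tx, commitment and buggyConvert are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Tx is a hypothetical, stripped-down transaction.
type Tx struct {
	LockTime uint64
	Gas      uint64
}

// commitment is a toy stand-in for a merkle root: a SHA-256 hash over the
// serialized fields of every transaction, in order.
func commitment(txs []Tx) [32]byte {
	h := sha256.New()
	buf := make([]byte, 8)
	for _, tx := range txs {
		binary.LittleEndian.PutUint64(buf, tx.LockTime)
		h.Write(buf)
		binary.LittleEndian.PutUint64(buf, tx.Gas)
		h.Write(buf)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// buggyConvert copies the transactions and applies the faulty
// lockTime-to-gas copy to each of them.
func buggyConvert(txs []Tx) []Tx {
	out := make([]Tx, len(txs))
	for i, tx := range txs {
		out[i] = Tx{LockTime: tx.LockTime, Gas: tx.LockTime}
	}
	return out
}

func main() {
	mtx := Tx{LockTime: 1, Gas: 0}
	block := []Tx{mtx}

	// The miner commits to the transactions before submitting the block.
	headerRoot := commitment(block)

	// The node converts the submitted transactions, silently changing mtx.
	received := buggyConvert(block)

	// Validation recomputes the commitment from the converted transactions.
	fmt.Println("root matches:", commitment(received) == headerRoot) // root matches: false
}
```

Any commitment scheme behaves the same way here: once a single committed field changes after the header was built, the recomputed root can no longer match.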
When pools reported to me that they were getting ErrBadMerkleRoot when submitting blocks, and I heard about Maxim’s tests, I connected the dots and released a new version with the fix.
To summarize:
- Maxim creates mtx and sends it to his rusty-kaspa node (where mtx.lockTime=1 and mtx.gas=0).
- Maxim’s node propagates mtx to the rest of the network, where it reaches the miners’ nodes (still, mtx.lockTime=1 and mtx.gas=0).
- When a miner solves a block b, they include mtx in it and send it to their golang node via RPC.
- When converting the RPC transactions of b to DomainTransactions, the miner’s node sets mtx.gas=mtx.lockTime=1. Then, when the block is validated, it throws an ErrBadMerkleRoot error and the block is rejected.
- This results in no mining until the fix is released.
Conclusion
This case can teach us a few things:
- Even when the consensus layer is well tested with nearly 100% coverage, a bug in the communication between components (in this case, the bridge between the RPC layer and the consensus layer) can result in a lot of damage. To prevent this, more integration tests that involve several components should be written.
- Even with perfect testing conventions, no system is immune to bugs, and we need to think of better ways to handle them when they happen. For example, I talked with Yonatan Sompolinsky about the case, and we are thinking about implementing a fallback mechanism that will mine empty blocks in case of such bugs, and will keep securing the network in such extreme events. More on that in the future.
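As an illustration of the first lesson, a check on the boundary between layers could look roughly like the sketch below. It is not taken from the kaspad test suite; RPCTx, DomainTx and checkConversion are hypothetical names. The key idea is to feed the conversion a transaction whose fields hold distinct non-zero values, so that a misplaced copy cannot hide behind zeros.

```go
package main

import "fmt"

// RPCTx and DomainTx are hypothetical, simplified transaction types.
type RPCTx struct{ LockTime, Gas uint64 }
type DomainTx struct{ LockTime, Gas uint64 }

// checkConversion runs a conversion function on a transaction with distinct
// non-zero field values and reports the first field that was not preserved.
func checkConversion(convert func(RPCTx) DomainTx) error {
	in := RPCTx{LockTime: 7, Gas: 11}
	out := convert(in)
	if out.LockTime != in.LockTime {
		return fmt.Errorf("lockTime changed: got %d, want %d", out.LockTime, in.LockTime)
	}
	if out.Gas != in.Gas {
		return fmt.Errorf("gas changed: got %d, want %d", out.Gas, in.Gas)
	}
	return nil
}

func main() {
	// A conversion with the faulty lockTime-to-gas copy.
	buggy := func(tx RPCTx) DomainTx {
		return DomainTx{LockTime: tx.LockTime, Gas: tx.LockTime}
	}
	// A conversion that copies each field from its matching source.
	fixed := func(tx RPCTx) DomainTx {
		return DomainTx{LockTime: tx.LockTime, Gas: tx.Gas}
	}

	fmt.Println("buggy:", checkConversion(buggy))
	fmt.Println("fixed:", checkConversion(fixed))
}
```

A check like this is cheap to run, and it fails on exactly the class of bug described in this post, whereas test data full of zero values would have passed.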