Inside Sui's Network Turbulence: A Technical Post-Mortem
The Sui network recently suffered three distinct outages over two days. The core development team has reviewed the incidents and released a detailed account of what went wrong.
The Catalyst: Unforeseen Interactions from an Upgrade
The first two outages stemmed from a critical bug triggered by an unforeseen interaction between the network's gas metering logic and the "Address Balances" feature introduced in the v1.72 upgrade. The resulting crash halted network consensus.
In response to the first outage, engineers deployed a temporary fix. The priority was to restore network availability quickly while a permanent solution was being developed. The team knew this interim patch carried a small risk of causing a subsequent failure but accepted it to keep downtime short.
This known risk materialized in a slightly different form the following morning, leading to the second network halt.
The Third Halt: A Latent Bug Surfaces
The third incident occurred during a scheduled epoch change. As validators restarted nodes to deploy the permanent fix for the earlier issue, a long-dormant bug related to randomness state persistence was inadvertently activated, causing another consensus failure.
Timeline and Key Impact
- Outage 1: Began Thu ~7:00 PT, recovered ~13:30 PT.
- Outage 2: Began Fri ~5:00 PT, recovered ~8:30 PT.
- Outage 3: Began Fri ~13:30 PT, recovered ~19:20 PT.
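The timeline above implies roughly how long the network was unavailable. The sketch below simply does that arithmetic on the approximate times listed. The helper names (`hm`, `fmt`, `durations`) are illustrative and not part of any Sui tooling, and because the source times are approximate, the results are too.

```python
from datetime import timedelta

# Approximate start and recovery times (PT), copied from the timeline.
# Each outage starts and ends on the same day, so clock times suffice.
OUTAGES = {
    "Outage 1": ("07:00", "13:30"),
    "Outage 2": ("05:00", "08:30"),
    "Outage 3": ("13:30", "19:20"),
}

def hm(text):
    """Parse an 'HH:MM' clock time into a timedelta since midnight."""
    hours, minutes = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))

def fmt(delta):
    """Format a timedelta as e.g. '6h30m'."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}h{total_minutes % 60:02d}m"

def durations(outages):
    """Return the duration of each outage as recovery time minus start time."""
    return {name: hm(end) - hm(start) for name, (start, end) in outages.items()}

per_outage = durations(OUTAGES)
total = sum(per_outage.values(), timedelta())

for name, delta in per_outage.items():
    print(f"{name}: ~{fmt(delta)}")
print(f"Total: ~{fmt(total)}")
```

On these figures the three halts lasted about 6h30m, 3h30m, and 5h50m, for a combined downtime of roughly 15h50m.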
Throughout the incidents, the team emphasized that user assets remained safe. No confirmed transactions were rolled back when the network recovered, so finality was preserved for all users.
Resolution and Path Forward
Validators have now applied patches that address both the underlying gas charging bug and the randomness state flaw, and network operations have returned to normal. The episode underscores how difficult it is to maintain and upgrade a complex blockchain protocol, and how much depends on a robust incident response process.