An Unprecedented Alliance Reshapes AI's Backbone

As demand for artificial intelligence computing power surges, network communication has emerged as the critical bottleneck limiting the scale and efficiency of model training. A newly released technical specification, born from a unique collaboration, aims to fundamentally redesign how AI clusters connect.

Tackling the Achilles' Heel of Massive-Scale Training

With AI models reaching into the trillions of parameters, training clusters now incorporate tens of thousands of GPUs. Traditional network architectures reveal a fatal flaw at this scale: even a minuscule packet delay or a single link failure can idle vast arrays of expensive processors, halting training tasks and incurring massive costs. The larger the cluster, the more frequent and severe these network-induced disruptions become.
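Why disruptions grow more frequent with cluster size can be illustrated with a rough back-of-the-envelope sketch. It assumes that each link is disrupted independently with the same small probability during a given training step; both that independence assumption and the figure used for P_LINK are hypothetical, chosen for illustration rather than taken from the specification.

```python
def p_any_disruption(num_links: int, p_link: float) -> float:
    """Chance that at least one of num_links links is disrupted.

    Assumes links fail independently with identical probability p_link,
    which is a simplification made for this illustration.
    """
    return 1.0 - (1.0 - p_link) ** num_links


if __name__ == "__main__":
    P_LINK = 1e-5  # hypothetical per-link disruption probability per step
    for links in (1_000, 10_000, 100_000):
        print(f"{links:>7} links: {p_any_disruption(links, P_LINK):.4f}")
```

Under these assumptions the chance of at least one disruption climbs steeply as links are added, even though each individual link stays just as reliable as before.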

The MRC Protocol: Parallelism Over Raw Speed

The core innovation of the Multi-path Reliable Connection (MRC) protocol is a paradigm shift from chasing pure bandwidth to ensuring robustness and resilience. Instead of relying on a single, monolithic high-speed link, MRC logically splits a physical network interface into multiple independent, parallel sub-paths.

In practice, the protocol allows a high-speed port to connect to several different network switches, creating a parallel web of connections. For instance, an 800Gb/s interface can be configured as eight separate 100Gb/s paths. This architecture delivers key benefits:

  • Enhanced Reliability: Traffic instantly reroutes through healthy paths if one sub-link congests or fails, preventing a single point of failure from crippling the entire cluster.
  • Reduced Latency: Fine-grained path management enables smarter data scheduling, cutting down queueing delays.
  • Improved Utilization: Parallel paths balance load more evenly, boosting overall network throughput efficiency.
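The idea of one port split into parallel sub-paths can be sketched as a toy model. The class names, the least-loaded scheduling rule, and the packet counters below are all hypothetical and exist only for illustration; they are not the MRC wire protocol or any vendor's API. The only figures taken from the text are the 800Gb/s port and its eight 100Gb/s paths.

```python
from dataclasses import dataclass, field


@dataclass
class SubPath:
    """One logical sub-path of a physical port (hypothetical model)."""
    path_id: int
    gbps: int
    healthy: bool = True
    queued: int = 0  # packets assigned to this sub-path so far


@dataclass
class MultipathPort:
    """Toy port divided into equal sub-paths; not the real MRC design."""
    total_gbps: int
    num_paths: int
    paths: list = field(init=False)

    def __post_init__(self):
        per_path = self.total_gbps // self.num_paths
        self.paths = [SubPath(i, per_path) for i in range(self.num_paths)]

    def mark_failed(self, path_id: int) -> None:
        """Take a sub-path out of service, e.g. after a link failure."""
        self.paths[path_id].healthy = False

    def usable_gbps(self) -> int:
        """Bandwidth still available across the healthy sub-paths."""
        return sum(p.gbps for p in self.paths if p.healthy)

    def send(self, num_packets: int) -> dict:
        """Assign each packet to the least-loaded healthy sub-path."""
        healthy = [p for p in self.paths if p.healthy]
        if not healthy:
            raise RuntimeError("no healthy sub-paths available")
        for _ in range(num_packets):
            target = min(healthy, key=lambda p: p.queued)
            target.queued += 1
        return {p.path_id: p.queued for p in self.paths}


if __name__ == "__main__":
    port = MultipathPort(total_gbps=800, num_paths=8)
    port.mark_failed(3)          # simulate one failed sub-link
    print(port.usable_gbps())    # bandwidth left on the seven healthy paths
    print(port.send(70))         # per-path packet counts after rerouting
```

In this model a failed sub-link costs the port one eighth of its bandwidth rather than the whole connection, and new traffic is spread evenly over the paths that remain.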

Industry-Wide Push for Standardization

This achievement stems from a rare convergence of leaders across the entire tech stack, from silicon design and hardware manufacturing to software and cloud services. The two-year joint development effort underscores a shared industry commitment to solving this common challenge. By releasing the specification through the Open Compute Project (OCP), the consortium aims to keep it neutral and open, which should encourage adoption across global data centers.

Networking solutions implementing the MRC protocol are already operational in supercomputing clusters powered by the latest acceleration platforms, where they handle extreme-scale AI workloads. This marks more than a technical breakthrough; it is a demonstration of industry collaboration on AI infrastructure, paving the way for training the next generation of models with a trillion parameters and beyond.