Orchestrating Thousands of GPUs: Engineering Patterns for Large-Scale Model Training

Training large AI models requires more than raw compute. It demands careful orchestration of multi-node GPU systems, robust communication, and disciplined engineering trade-offs. This session traces the shift from traditional computing models to large-scale parallel training, explaining how distributed training works beneath the surface and what it takes to make it reliable in production. The talk examines real-world challenges in distributed data processing, introduces the five dimensions of parallelism, and walks through practical heuristics and trade-off decisions used to scale AI training architectures across diverse hardware environments.

What You Will Learn

How gradient synchronization, collective operations, and fault tolerance operate in practice, including the role of communication backends such as NCCL, Gloo, and MPI
The five dimensions of parallelism (data, tensor, pipeline, expert, and context) and how each is applied at scale
Engineering trade-offs across communication patterns, memory management, network topology, and resource utilization in distributed training systems

Who Should Attend