< session />

Orchestrating Thousands of GPUs: Engineering Patterns for Large-Scale Model Training

Tue, 21 April

Training large AI models requires more than raw compute. It demands careful orchestration of multi-node GPU systems, robust communication, and disciplined engineering trade-offs. This session traces the shift from traditional computing models to large-scale parallel training, explaining how distributed training works beneath the surface and what it takes to make it reliable in production. The talk examines real-world challenges in distributed data processing, introduces the five dimensions of parallelism, and walks through practical heuristics and trade-off decisions used to scale AI training architectures across diverse hardware environments.

What You Will Learn

  • How gradient synchronization, collective operations, and fault tolerance operate in practice, including the role of communication libraries and backends such as NCCL, Gloo, and MPI

  • The five dimensions of parallelism (data, tensor, pipeline, expert, and context) and how each is applied at scale

  • Engineering trade-offs across communication patterns, memory management, network topology, and resource utilization in distributed training systems
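
For readers new to the topic, gradient synchronization in data-parallel training can be pictured as an all-reduce: every worker contributes its local gradients and every worker receives the same averaged result. The sketch below simulates that in a single process with plain Python lists. It is an illustration only, not material from the session; the function name `all_reduce_mean` is hypothetical, and real systems delegate this step to a collective library such as NCCL, Gloo, or MPI.

```python
def all_reduce_mean(per_worker_grads):
    """Simulate an averaging all-reduce over per-worker gradient vectors.

    per_worker_grads: one list of floats per worker, all of equal length.
    Returns one list per worker; every worker ends up with the same values.
    (Hypothetical helper for illustration; no real communication happens.)
    """
    num_workers = len(per_worker_grads)
    # Reduce step: sum each gradient element across all workers.
    summed = [sum(values) for values in zip(*per_worker_grads)]
    # Average so the update matches a single large-batch step.
    averaged = [s / num_workers for s in summed]
    # Broadcast step: each worker receives its own copy of the result.
    return [list(averaged) for _ in range(num_workers)]


# Two workers, each holding gradients for the same two parameters.
local_grads = [[1.0, 2.0], [3.0, 4.0]]
synced = all_reduce_mean(local_grads)
print(synced[0])  # [2.0, 3.0]
```

Because every worker applies the same averaged gradient, model replicas stay identical after each step, which is the property that data parallelism depends on.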

Who Should Attend

  • Software Architects

  • Platform Engineers

  • Distributed Systems Engineers

  • Infrastructure and Systems Practitioners

  • Technical Leads working on large-scale compute platforms

< speaker_info />

About the speaker

Krishnaswamy Subramanian

Principal Consultant, Thoughtworks

Krishnaswamy Subramanian is a Principal Consultant at Thoughtworks with over 18 years of experience in custom software development. As an "expert generalist," he specializes in solving complex technical challenges across full-stack development, mobile applications, and DevOps. His expertise encompasses databases, infrastructure, and Kubernetes, with a proven track record of leading large-scale infrastructure projects.

Throughout his career, Krishnaswamy has served as technical leader, advisor, and principal architect. He is passionate about empowering teams and delivering impactful, scalable solutions. A dedicated knowledge sharer, he has presented at multiple conferences and actively contributes to open-source projects, demonstrating his commitment to technological innovation and community collaboration.

His technical approach focuses on understanding system architectures and creating innovative solutions through strategic development.