Lucas Atkins: How Not to Blow Up. Training a 400B MoE to 17T Tokens Without Loss Spikes

8/13/2026

11:45 am

-

12:15 pm

Speakers

Lucas Atkins

About the Session

LLM progress now depends heavily on one practical issue: training stability at scale. Sparse Mixture-of-Experts (MoE) models are especially sensitive, since routing drift can overload experts, collapse utilization, and stall learning.

In this talk, I will share an "anti-loss-spike" playbook from a recent open-weight run: a 400B-parameter MoE with 13B active parameters per token, trained on 17T tokens with zero loss spikes, even in the unsmoothed loss curve. I will start with the failure pattern we saw (router drift, expert overload, MaxVio divergence, and plateau), then cover the fixes that restored steady convergence: bounded, momentum-based expert-bias updates (SMEBU), z-loss for logit stabilization, a precision fallback from MXFP8 to BF16, better balancing objectives, and data/packing choices that reduced step-to-step variance.
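For readers unfamiliar with the terms in the abstract, here is a minimal sketch of three of them in plain Python. It is an illustration under stated assumptions, not the method used in the run: `max_violation` follows the usual MaxVio definition (maximum expert load minus mean load, divided by mean load), `router_z_loss` follows the common router z-loss form (mean squared log-sum-exp of the router logits), and `update_expert_bias` is a hypothetical bounded, momentum-smoothed bias update whose exact rule, names, and default values (`lr`, `beta`, `bound`) are assumptions rather than the SMEBU formula.

```python
import math


def max_violation(expert_loads):
    """MaxVio = (max load - mean load) / mean load; 0 means perfect balance."""
    mean_load = sum(expert_loads) / len(expert_loads)
    return (max(expert_loads) - mean_load) / mean_load


def update_expert_bias(bias, velocity, expert_loads, lr=0.01, beta=0.9, bound=1.0):
    """Hypothetical bounded, momentum-smoothed expert-bias update.

    Assumption: this is an illustrative rule, not the talk's exact SMEBU update.
    """
    mean_load = sum(expert_loads) / len(expert_loads)
    new_bias, new_velocity = [], []
    for b, v, load in zip(bias, velocity, expert_loads):
        # Positive error: the expert is underloaded and should be favored.
        error = (mean_load - load) / mean_load
        # Momentum smooths the step-to-step noise in the load signal.
        v = beta * v + (1.0 - beta) * error
        # Clamping keeps the routing bias inside [-bound, bound].
        b = max(-bound, min(bound, b + lr * v))
        new_bias.append(b)
        new_velocity.append(v)
    return new_bias, new_velocity


def router_z_loss(router_logits, coeff=1e-3):
    """Mean squared log-sum-exp of per-token router logits, scaled by coeff."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse * lse
    return coeff * total / len(router_logits)


# Example: four experts, the first one receiving twice its fair share of tokens.
loads = [40, 20, 20, 20]
bias, velocity = update_expert_bias([0.0] * 4, [0.0] * 4, loads)
```

In this example, `max_violation(loads)` is 0.6, and a single update nudges the overloaded expert's bias slightly negative while the other three move slightly positive. Tracking MaxVio per layer over training steps is one plausible way to spot the divergence pattern described above.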

You will leave with a concrete checklist for stability instrumentation and first-response fixes to keep large open-weight runs on track.

Related Sessions

Emerging Tech

Abhi Desai: Agents at Scale for Marketing - Troubleshooting with AI Reasoning

Real-time systems generate a constant stream of events, yet most AI models are still designed to run offline and produce reports rather than actions. This gap between prediction and decision is where many production systems fail. Developers often have accurate models but no safe, reliable way to use them in live environments. This talk focuses on how to design AI-driven decision systems that act on live events with low latency and high reliability. It explains how models connect to event streams, APIs, and rule engines, and how decisions flow through real software systems. The emphasis is on architecture and system behavior rather than algorithms or math. Using real examples from large-scale retail and marketing platforms, the session highlights common failure modes such as delayed signals, noisy data, unstable actions, and lack of explainability. It then shows practical design patterns like guardrails, decision thresholds, rollback strategies, and continuous monitoring that allow AI systems to operate safely in production. The ideas and patterns discussed translate directly to WebRTC, telephony, and real-time communication systems where decisions such as routing, prioritization, or optimization must happen quickly and predictably. Attendees will leave with a clear reference architecture and practical guidelines they can apply to their own real-time systems.

10:00 am

-

10:30 am

Emerging Tech

Abhishek Rai: SLMs and the Shift Toward Lightweight, Everywhere AI

For years, the story of AI has been about going bigger — bigger models, bigger data, bigger GPUs. But a powerful counter-trend is emerging: going smaller and smarter. Small Language Models (SLMs) and Tiny LMs are reshaping how we think about deploying and using AI. In this talk, we’ll explore how this shift is enabling organizations and individuals to run advanced language capabilities on edge devices, low-cost GPUs, and even mobile hardware. We’ll look at what’s driving this movement — from efficiency breakthroughs like pruning and quantization to new training approaches that let smaller models punch far above their weight. More importantly, we’ll talk about why this matters: making AI more accessible, energy-efficient, privacy-friendly, and deployable in real-world environments where massive compute isn’t an option. We’ll also look ahead at what the next few years might hold for the SLM ecosystem — including personalized on-device models, hybrid AI architectures, and new business opportunities enabled by this “small is powerful” era. Whether you’re an AI practitioner, product leader, or just curious about where the field is heading, this session will give you a clear view of the trend, future, and impact of this major shift in AI.

10:00 am

-

10:30 am

Emerging Tech

Pavel Oborin: AI and the Future of Cognitive Work. What's Changing, What Remains

AI is automating cognitive work faster than most organisations realise, but the gap between what the technology can theoretically do and what it's actually doing in practice is still enormous. This talk looks at where that gap sits today, what current labour market research says about how it's closing, and which parts of cognitive work are genuinely at risk versus which parts remain distinctly human. Grounded in real deployment experience and recent research, not forecasts.

10:00 am

-

10:30 am