Emerging Tech

Lucas Atkins: How Not to Blow Up. Training a 400B MoE to 17T Tokens Without Loss Spikes

8/13/2026
 
11:45 am - 12:15 pm

About the Session

LLM progress now depends heavily on one practical issue: training stability at scale. Sparse Mixture-of-Experts (MoE) models are especially sensitive, since routing drift can overload experts, collapse utilization, and stall learning.

In this talk, I will share an "anti-loss-spike" playbook from a recent open-weight run: a 400B-parameter MoE with 13B active parameters per token, trained for 17T tokens with an unsmoothed loss curve and zero loss spikes. I will start with the failure pattern we saw (router drift, expert overload, MaxVio divergence, and a plateau), then cover the fixes that restored steady convergence: bounded, momentum-based expert-bias updates (SMEBU), z-loss for logit stabilization, a precision fallback from MXFP8 to BF16, better balancing objectives, and data and packing choices that reduced step-to-step variance.
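
For readers new to two of the signals named above, here is a minimal sketch of how they are commonly computed. It is an illustration under stated assumptions, not the code from this run: z-loss is taken as the mean squared log-sum-exp of the router logits, MaxVio as the busiest expert's load in excess of the mean load, relative to the mean, and the function names and toy values are hypothetical.

```python
import math
from collections import Counter


def router_z_loss(logits):
    """Mean over tokens of (log-sum-exp of that token's router logits) squared."""
    total = 0.0
    for row in logits:
        m = max(row)  # subtract the row max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(logits)


def max_violation(expert_ids, num_experts):
    """(max expert load - mean load) / mean load; 0.0 means perfectly balanced."""
    counts = Counter(expert_ids)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = len(expert_ids) / num_experts
    return (max(loads) - mean_load) / mean_load


# Toy batch: 4 tokens, 4 experts, top-1 routing (hypothetical values).
logits = [
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [1.5, 0.5, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
]
# Route each token to its highest-scoring expert (ties go to the lowest index).
assignments = [max(range(4), key=lambda e: row[e]) for row in logits]

print("z-loss:", router_z_loss(logits))
print("MaxVio:", max_violation(assignments, num_experts=4))
```

In this toy batch three of the four tokens land on expert 0, so MaxVio is well above zero. Logging both values every step is one simple way to see router drift before it shows up in the loss.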

You will leave with a concrete checklist for stability instrumentation and first-response fixes to keep large open-weight runs on track.
