LLM progress now depends heavily on one practical issue: training stability at scale. Sparse Mixture-of-Experts (MoE) models are especially sensitive, since routing drift can overload experts, collapse utilization, and stall learning.
In this talk, I will share an "anti-loss-spike" playbook from a recent open-weight run: a 400B-parameter MoE with 13B active parameters per token, trained for 17T tokens with zero loss spikes, even in the unsmoothed loss curve. I will start with the failure pattern we saw (router drift, expert overload, MaxVio divergence, and a plateau), then cover the fixes that restored steady convergence: bounded, momentum-based expert-bias updates (SMEBU), z-loss for logit stabilization, a precision fallback from MXFP8 to BF16, better balancing objectives, and data/packing choices that reduced step-to-step variance.
You will leave with a concrete checklist for stability instrumentation and first-response fixes to keep large open-weight runs on track.
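
As a rough illustration of the kind of instrumentation this checklist covers, here is a minimal NumPy sketch of two quantities named above. It assumes the commonly used definitions: MaxVio as the relative excess of the busiest expert's load over the mean load, and router z-loss as the mean squared log-sum-exp of the router logits. The function names `max_violation` and `router_z_loss` are hypothetical, and the exact formulas used in the run may differ.

```python
import numpy as np

def max_violation(expert_ids: np.ndarray, num_experts: int) -> float:
    """MaxVio under the assumed definition: (max load - mean load) / mean load.

    expert_ids holds the expert index chosen for each routed token slot.
    0.0 means perfectly balanced; larger values mean one expert is overloaded.
    """
    load = np.bincount(expert_ids.ravel(), minlength=num_experts).astype(np.float64)
    mean_load = load.mean()
    return float((load.max() - mean_load) / mean_load)

def router_z_loss(logits: np.ndarray) -> float:
    """Router z-loss under the assumed definition: mean over tokens of logsumexp(logits)^2.

    logits has shape (num_tokens, num_experts). Penalizing this term discourages
    router logits from growing large in magnitude.
    """
    # Subtract the per-token max before exponentiating for numerical stability.
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return float(np.mean(lse ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ids = rng.integers(0, 8, size=(1024, 2))   # hypothetical top-2 routing over 8 experts
    logits = rng.normal(size=(1024, 8))        # hypothetical router logits
    print("MaxVio:", max_violation(ids, 8))
    print("z-loss:", router_z_loss(logits))
```

Logging both values at every step, rather than only the smoothed loss, is one way to catch router drift before it shows up as a plateau.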