Modern retail platforms operate at a scale where traditional observability—metrics, logs, and alerts—is necessary but no longer sufficient. At Walmart Global Tech, our checkout platform processes millions of real-time transactions daily, where even brief degradation translates directly into customer impact and lost revenue. In such environments, engineers don’t just need visibility—they need reasoning.
This talk explores how agentic AI systems can move beyond passive observability to actively reason about failures in large-scale, highly distributed retail systems. Instead of alert floods and manual triage, we apply AI agents that correlate signals across dependencies, deployments, traffic patterns, and historical incidents to infer why a failure is happening—and what to do next.
Drawing from real production lessons and my open-source work on Dependency-OPS-Sentinel (DOS)—an AI-driven DevOps intelligence system adopted by teams and featured in the PySpark community—I will demonstrate how failure reasoning can be modeled as a graph problem, not a dashboard problem. AI agents traverse dependency graphs, evaluate blast radius, detect change-induced instability, and recommend mitigations such as rollback, traffic shaping, or graceful degradation.
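To make the graph framing concrete, here is a minimal sketch of two of the traversals mentioned above: blast-radius evaluation and a first pass at spotting change-induced instability. It is an illustration only, not the DOS implementation. The service names, the `DEPENDS_ON` map, and the helpers `blast_radius` and `upstream_suspects` are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical service graph: each key depends on the services in its list.
DEPENDS_ON = {
    "checkout": ["cart", "payments", "inventory"],
    "cart": ["pricing"],
    "payments": ["fraud-check"],
    "inventory": [],
    "pricing": [],
    "fraud-check": [],
}


def reverse_graph(depends_on):
    """Map each service to the set of services that depend on it directly."""
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)
    return dependents


def blast_radius(depends_on, failing):
    """Return every service that transitively depends on `failing`."""
    dependents = reverse_graph(depends_on)
    seen = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def upstream_suspects(depends_on, symptomatic, recently_deployed):
    """Return recently deployed services that `symptomatic` relies on (itself included)."""
    seen = {symptomatic}
    queue = deque([symptomatic])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen & set(recently_deployed)


if __name__ == "__main__":
    print(sorted(blast_radius(DEPENDS_ON, "fraud-check")))
    print(sorted(upstream_suspects(DEPENDS_ON, "checkout", ["pricing", "search"])))
```

Both helpers are plain breadth-first searches: one walks the reversed edges to find who is affected, the other walks the forward edges to find which recent changes could be the cause. A production agent would presumably weight these results with telemetry and incident history rather than treat reachability alone as evidence.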
Attendees will learn:
Why observability alone breaks down at extreme scale and under strict availability targets
How agentic AI differs from rule-based automation in incident response
Architectural patterns for AI-assisted failure reasoning using telemetry and dependency graphs
Guardrails for building deterministic, trustworthy AI agents in mission-critical systems
Practical lessons from deploying these ideas in global retail environments
This session is aimed at architects, SREs, and platform engineers who build real-time, highly available systems and want to evolve from reactive monitoring to self-reasoning operational intelligence.