Kapil Kumar Reddy Poreddy: Beyond Observability. Teaching AI Agents to Reason About Failures at Retail Scale

8/11/2026

11:15 am

11:45 am

Speakers

Kapil Kumar Reddy Poreddy

About the Session

Modern retail platforms operate at a scale where traditional observability—metrics, logs, and alerts—is necessary but no longer sufficient. At Walmart Global Tech, our checkout platform processes millions of real-time transactions daily, where even brief degradation translates directly into customer impact and lost revenue. In such environments, engineers don’t just need visibility—they need reasoning.

This talk explores how agentic AI systems can move beyond passive observability to actively reason about failures in large-scale, highly distributed retail systems. Instead of alert floods and manual triage, we apply AI agents that correlate signals across dependencies, deployments, traffic patterns, and historical incidents to infer why a failure is happening—and what to do next.

Drawing from real production lessons and my open-source work on Dependency-OPS-Sentinel (DOS)—an AI-driven DevOps intelligence system adopted by teams and featured in the PySpark community—I will demonstrate how failure reasoning can be modeled as a graph problem, not a dashboard problem. AI agents traverse dependency graphs, evaluate blast radius, detect change-induced instability, and recommend mitigations such as rollback, traffic shaping, or graceful degradation.
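To make the graph framing concrete, here is a minimal sketch of two of the steps named above: estimating blast radius and flagging change-induced suspects. It is not the Dependency-OPS-Sentinel implementation. The service names, the `DEPENDS_ON` map, and the functions `blast_radius` and `suspect_changes` are all hypothetical, chosen only to illustrate the traversal.

```python
from collections import deque

# Hypothetical service graph: each key depends on the services in its list.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "cart": ["inventory"],
    "payments": ["fraud-check"],
    "inventory": [],
    "fraud-check": [],
}


def _hops_from(start, neighbours):
    """Breadth-first search returning {node: hop distance from start}."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, []):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    return hops


def blast_radius(depends_on, failing):
    """Services that call `failing` directly or transitively, with hop distance."""
    # Invert the edges so we can walk from a dependency to its callers.
    dependents = {svc: [] for svc in depends_on}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)
    hops = _hops_from(failing, dependents)
    del hops[failing]  # the failing service is the cause, not part of the radius
    return hops


def suspect_changes(depends_on, alerting, recent_deploys):
    """Recently deployed services in the dependency chain of `alerting`.

    Returns (service, hops) pairs, nearest first; ties are ordered by name.
    """
    hops = _hops_from(alerting, depends_on)
    hits = [(svc, dist) for svc, dist in hops.items() if svc in recent_deploys]
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    print(blast_radius(DEPENDS_ON, "inventory"))
    print(suspect_changes(DEPENDS_ON, "checkout", {"fraud-check", "inventory"}))
```

In this toy graph, a failure in `inventory` reaches `cart` after one hop and `checkout` after two. A production agent would presumably weight these edges with telemetry such as traffic volume or error rates before recommending rollback, traffic shaping, or graceful degradation; the sketch ranks by hop distance only.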

Attendees will learn:

Why observability breaks down at extreme scale and high availability targets

How agentic AI differs from rule-based automation in incident response

Architectural patterns for AI-assisted failure reasoning using telemetry and dependency graphs

Guardrails for building deterministic, trustworthy AI agents in mission-critical systems

Practical lessons from deploying these ideas in global retail environments

This session is aimed at architects, SREs, and platform engineers who build real-time, highly available systems and want to evolve from reactive monitoring to self-reasoning operational intelligence.

