SPEAKER

Kapil Kumar Reddy Poreddy

Here's what keeps me up at night: What if we could make technology truly understand us? Not just process our requests, but anticipate our needs, learn from our behaviors, and adapt in real time to make our lives genuinely better. For the past 19 years, I've been on a mission to answer that question. I've led teams building AI systems that don't just work; they think. From retail platforms that predict what customers need before they know it themselves to healthcare systems that catch critical issues before they become emergencies, I've seen firsthand how intelligent systems can transform entire industries. But here's what I've learned: the most powerful technology isn't the one with the most features; it's the one that disappears. The best AI doesn't announce itself; it quietly makes millions of lives easier, one interaction at a time. That's what drives me: building systems so intuitive, so responsive, that they feel less like technology and more like magic. Today, I'm focused on the next frontier: creating autonomous, intelligent ecosystems where AI doesn't just respond to problems but prevents them, where systems don't just scale but evolve, and where technology doesn't replace human judgment but amplifies it.

Topic:

Kapil Kumar Reddy Poreddy: Beyond Observability. Teaching AI Agents to Reason About Failures at Retail Scale

Modern retail platforms operate at a scale where traditional observability (metrics, logs, and alerts) is necessary but no longer sufficient. At Walmart Global Tech, our checkout platform processes millions of real-time transactions daily, where even brief degradation translates directly into customer impact and lost revenue. In such environments, engineers don’t just need visibility; they need reasoning.

This talk explores how agentic AI systems can move beyond passive observability to actively reason about failures in large-scale, highly distributed retail systems. Instead of alert floods and manual triage, we apply AI agents that correlate signals across dependencies, deployments, traffic patterns, and historical incidents to infer why a failure is happening and what to do next.

Drawing from real production lessons and my open-source work on Dependency-OPS-Sentinel (DOS), an AI-driven DevOps intelligence system adopted by teams and featured in the PySpark community, I will demonstrate how failure reasoning can be modeled as a graph problem, not a dashboard problem. AI agents traverse dependency graphs, evaluate blast radius, detect change-induced instability, and recommend mitigations such as rollback, traffic shaping, or graceful degradation.

Attendees will learn:

- Why observability breaks down at extreme scale and high availability targets
- How agentic AI differs from rule-based automation in incident response
- Architectural patterns for AI-assisted failure reasoning using telemetry and dependency graphs
- Guardrails for building deterministic, trustworthy AI agents in mission-critical systems
- Practical lessons from deploying these ideas in global retail environments

This session is aimed at architects, SREs, and platform engineers building real-time, highly available systems and looking to evolve from reactive monitoring to self-reasoning operational intelligence.
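
To make the "graph problem" framing in the abstract concrete, here is a minimal Python sketch of two of the steps it names: estimating blast radius and narrowing down change-induced instability. It is an illustration under stated assumptions, not the DOS implementation or the method presented in the talk. The service names, the `DEPENDS_ON` map, and the functions `reverse_graph`, `blast_radius`, and `suspect_changes` are all hypothetical.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration):
# each service lists the services it calls directly.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "cart": ["inventory"],
    "payments": ["fraud-check"],
    "inventory": [],
    "fraud-check": [],
}


def reverse_graph(depends_on):
    """Map each service to the services that call it directly."""
    callers = {service: [] for service in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)
    return callers


def blast_radius(depends_on, failing):
    """Return every service that transitively depends on `failing`."""
    callers = reverse_graph(depends_on)
    seen = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    seen.discard(failing)  # a cycle could lead back to the failing service
    return seen


def suspect_changes(depends_on, symptomatic, recently_deployed):
    """List recently deployed services reachable from `symptomatic`.

    Breadth-first order puts the closest dependencies first, which is one
    simple way to rank candidates for a change-induced failure.
    """
    seen = {symptomatic}
    queue = deque([symptomatic])
    suspects = []
    while queue:
        current = queue.popleft()
        if current in recently_deployed:
            suspects.append(current)
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return suspects


if __name__ == "__main__":
    print(sorted(blast_radius(DEPENDS_ON, "inventory")))
    print(suspect_changes(DEPENDS_ON, "checkout", {"inventory", "fraud-check"}))
```

With this sample map, a failure in `inventory` reaches `cart` and `checkout`, and a degraded `checkout` with recent deploys to `inventory` and `fraud-check` yields both as rollback candidates. A production agent would presumably weigh these candidates against telemetry and incident history rather than graph distance alone.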

11:15 am

11:45 am
