What OpenAI & Google engineers learned deploying 50+ AI products in production
Episode Overview
- Explores the fundamental shift required to move from building deterministic software (predictable inputs/outputs) to non-deterministic AI products (probabilistic behavior).
- Details the "V1-V3 Framework" for safely increasing AI agency, guiding teams on how to move from human-in-the-loop copilots to fully autonomous agents.
- Examines the "Success Triangle" necessary for AI adoption, emphasizing that organizational culture and vulnerable leadership are often bigger bottlenecks than technology.
- Provides a roadmap for establishing "Continuous Calibration" (CC/CD) to manage model drift and changing user behaviors in production environments.
Key Concepts
- Deterministic vs. Non-Deterministic Engineering: Traditional software relies on a contract where specific inputs yield predictable outputs. AI development breaks this contract. Product teams must shift from writing rigid specifications to designing for "behavior calibration," managing infinite variations of natural language input and probabilistic model output.
- The Agency vs. Control Trade-off: There is an inverse relationship between an AI's autonomy (agency) and human control. You cannot start at high agency; you must earn the right to increase autonomy by proving reliability. The progression typically flows as follows (see the agency-level sketch after this list):
  - V1 (High Control): Routing or Consulting (AI suggests, Human acts).
  - V2 (Shared Control): Copilot (AI drafts, Human reviews/edits).
  - V3 (High Agency): Autonomy (AI acts, Human audits).
- The "Copilot" Data Strategy: The Copilot (V2) phase is not just a safety measure; it is a data generation strategy. When an AI drafts a response and a human edits it, the difference between the draft and the final version acts as a "golden dataset" for error analysis (see the logging sketch after this list). This correction data is essential for training the model to eventually handle the task autonomously.
- Continuous Calibration (CC/CD): Traditional CI/CD (Continuous Integration/Deployment) is insufficient for AI. Teams need CC/CD (Continuous Calibration), which acknowledges that AI systems degrade not just because of code bugs, but because of "Model Drift" (the underlying LLMs change) and "Behavioral Drift" (users change how they prompt, or try to game the system, over time). See the calibration sketch after this list.
- Evals vs. Monitoring (The Reality Loop): Companies often confuse these two or prioritize one over the other (see the regression-eval sketch after this list).
  - Evals: Codified product thinking representing known scenarios (what you expect the system to do).
  - Monitoring: Capturing production reality (unknown unknowns and edge cases).
  - The Cycle: Monitoring discovers new failure modes, which must then be turned into new evals to prevent regression.
- Pain as the New Moat: In an era where implementation costs are dropping to near zero and "building" is cheap, technical knowledge is no longer the primary differentiator. The new competitive advantage comes from the willingness to endure the painful, tedious process of cleaning data, calibrating models, and refining judgment ("taste") that others avoid.
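A minimal sketch of the V1-V3 progression as an explicit agency setting in a support-ticket workflow. Everything here (the `AgencyLevel` enum, `handle_ticket`, and the ticket shape) is an illustrative assumption, not an implementation described in the episode:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class AgencyLevel(Enum):
    """The V1-V3 progression: higher levels hand more decisions to the AI."""
    V1_ROUTING = 1     # AI suggests a route or category; a human acts on it
    V2_COPILOT = 2     # AI drafts a response; a human reviews and edits
    V3_AUTONOMOUS = 3  # AI acts on its own; humans audit after the fact


@dataclass
class Ticket:
    text: str


def handle_ticket(
    ticket: Ticket,
    level: AgencyLevel,
    classify: Callable[[str], str],     # e.g. an LLM call returning a queue name
    draft_reply: Callable[[str], str],  # e.g. an LLM call returning a draft answer
    send_reply: Callable[[str], None],  # side effect: actually answers the customer
) -> Optional[str]:
    """Route the same request through different amounts of AI agency."""
    if level is AgencyLevel.V1_ROUTING:
        # High control: the model only labels; a person does the work.
        return f"suggested queue: {classify(ticket.text)}"
    if level is AgencyLevel.V2_COPILOT:
        # Shared control: the model drafts; the draft goes to a human review queue.
        return draft_reply(ticket.text)
    # High agency: the model answers directly; keep an audit trail elsewhere.
    send_reply(draft_reply(ticket.text))
    return None
```

Making the level an explicit value means promotion from V1 to V2 to V3 becomes a deliberate, reviewable decision rather than an implicit side effect of shipping new code.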
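One hedged way to turn the Copilot (V2) phase into a data generation strategy: log the AI draft next to the human's final version so the edits become a reusable correction dataset. The log format and function name (`log_copilot_interaction`) are assumptions for illustration:

```python
import difflib
import json
import time
from pathlib import Path

LOG_PATH = Path("copilot_corrections.jsonl")  # hypothetical append-only log


def log_copilot_interaction(request: str, ai_draft: str, human_final: str) -> dict:
    """Record draft vs. final so human edits become a 'golden dataset'."""
    # A cheap similarity score: 1.0 means the human accepted the draft verbatim.
    similarity = difflib.SequenceMatcher(None, ai_draft, human_final).ratio()
    record = {
        "ts": time.time(),
        "request": request,
        "ai_draft": ai_draft,
        "human_final": human_final,
        "accepted_verbatim": ai_draft == human_final,
        "similarity": round(similarity, 3),
        # A unified diff makes later error analysis ("what kinds of edits?") easier.
        "diff": "\n".join(
            difflib.unified_diff(ai_draft.splitlines(), human_final.splitlines(), lineterm="")
        ),
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Records with low `similarity` are the ones worth reading during error analysis, and `human_final` doubles as a reference answer for future evals.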
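Continuous Calibration can be approximated by re-running the same eval suite on a schedule and flagging drops in the pass rate. The eval-case shape and the 5-point drift tolerance below are assumed for illustration:

```python
from typing import Callable, Dict, List, Tuple

# Each eval case pairs an input with a checker that encodes product thinking
# about what an acceptable answer looks like.
EvalCase = Tuple[str, Callable[[str], bool]]


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of eval cases the current model/prompt passes."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases) if cases else 1.0


def calibration_check(
    model: Callable[[str], str],
    cases: List[EvalCase],
    history: List[float],
    drift_tolerance: float = 0.05,  # assumed: alert on a 5-point drop vs. the last run
) -> Dict[str, object]:
    """One CC/CD cycle: re-run the evals and flag drift against the previous run."""
    score = run_evals(model, cases)
    drifted = bool(history) and (history[-1] - score) > drift_tolerance
    history.append(score)
    return {"pass_rate": score, "drifted": drifted}
```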
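And a sketch of the reality loop itself: each failure caught by production monitoring becomes a regression eval so it cannot silently reappear. The helper names and the minimal check are assumptions; a real team would replace the check with a stronger, hand-written assertion once the failure is understood:

```python
from typing import Callable, List, Tuple

EvalCase = Tuple[str, Callable[[str], bool]]


def eval_from_incident(user_input: str, bad_output: str) -> EvalCase:
    """Convert a monitored failure into a regression eval.

    Simplest useful check: the system must at least stop producing
    the exact answer that caused the incident.
    """
    def check(new_output: str) -> bool:
        return new_output.strip() != bad_output.strip()
    return (user_input, check)


def fold_incidents_into_suite(
    suite: List[EvalCase], incidents: List[Tuple[str, str]]
) -> None:
    """Monitoring -> evals: append one regression case per observed failure."""
    for user_input, bad_output in incidents:
        suite.append(eval_from_incident(user_input, bad_output))
```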
Quotes
- At 0:08:23 - "You're pretty much working with a non-deterministic API as compared to traditional software... In traditional software, you pretty much have a very well-mapped decision engine or workflow." - Ash explaining the fundamental engineering shift required when moving from standard SaaS to AI products.
- At 0:10:12 - "Every time you hand over decision-making capabilities or autonomy to agentic systems, you're kind of relinquishing some amount of control on your end." - Ash highlighting the inherent risk management required in AI product design.
- At 0:11:49 - "If your objective is to hike Half Dome in Yosemite... you don't start hiking it every day. You start training yourself... in minor paths and then you slowly improve." - Kiriti using an analogy to explain why you shouldn't launch fully autonomous agents on Day 1.
- At 0:19:08 - "It's also the most beautiful part of AI... we're all much more comfortable talking than following a bunch of buttons... The bar to using AI products is much lower... but that's also the problem." - Ash on the double-edged sword of natural language interfaces: they are accessible but highly unpredictable.
- At 0:25:50 - "It's never always technical. Every technology problem is a people problem first... It's these three dimensions: great leaders, good culture, and technical prowess." - Shreya Rajpal explaining that organizational structure is often the bigger blocker to AI adoption than the LLMs themselves.
- At 0:26:22 - "Leaders have built intuitions over 10 or 15 years... but now with AI in the picture, those intuitions will have to be relearned... You probably are the dumbest person in the room and you want to learn from everyone." - Shreya Rajpal on why executives cannot delegate AI strategy and must rebuild intuition.
- At 0:33:52 - "Think of what are evals? Evals are basically your trusted product thinking... This is the kind of problems that my agent should not do... Production monitoring... allows you to catch different kinds of problems that you might encounter." - Kirite Vashee dismantling the idea that you only need one or the other; evals test your hypothesis, monitoring tests reality.
- At 0:35:22 - "If you don't like the answer, sometimes customers don't give you thumbs down, but actually regenerate the answer. So that is a clear indication that... the initial answer... is not meeting the customer's expectation." - Kirite Vashee on the importance of implicit signals over explicit feedback for training data.
- At 0:41:42 - "If someone is selling you one-click agents, it’s pure marketing. You don't want to buy into that. I would rather go with a company that says 'We're going to build this pipeline for you that will learn over time'." - Shreya Rajpal warning against the hype of instant AI solutions versus pipelines that learn.
- At 0:50:35 - "Try to think of lower agency iterations in the beginning... constrain the number of decisions your AI systems can make... and then increase that over time because you're building a flywheel of behavior." - Shreya Rajpal advocating for a stepped deployment approach.
- At 0:56:04 - "When you do this [Copilot stage], you're also logging human behavior, which means that... you're actually getting error analysis for free... because you're literally logging everything that the user is doing that you could then build back into your flywheel." - Aishwarya Ashok explaining why "human-in-the-loop" is a data generation strategy.
- At 0:58:27 - "There's not really a rulebook you can follow, but it's all about minimizing surprise... If you figure out that you're not seeing new data distribution patterns... and that's when you know you can actually go to the next stage." - Aishwarya Ashok defining the specific trigger for when it is safe to move an agent from Copilot to Autonomy.
- At 1:02:17 - "People have this notion of... I'm going to divide the responsibilities based on functionality and somehow expect all of this to work together in some sort of gossip protocol... is misunderstood." - Kirthi Kumar critiquing the over-hyped view of autonomous multi-agent swarms versus structured supervision.
- At 1:12:46 - "You can learn anything overnight... having that persistence and going through that pain of learning this, implementing this, and understanding what works and what doesn't work... that pain is the new moat." - Kirthi Kumar identifying the new source of career and business value in an AI world.
Takeaways
- Start with "Routing" not "Solving": Begin your AI implementation by asking the model to classify and route tasks (V1) rather than solve them. This exposes messy internal data taxonomies without risking customer trust.
- Use "Boredom" as the Metric for Autonomy: Monitor the logs of your human-in-the-loop systems. Only transition to fully autonomous agents when the human supervisors are "bored" because they are accepting 99% of the AI's suggestions without edits.
- Harvest Implicit Feedback: Do not rely on "thumbs up/down" buttons. Build listeners for implicit signals, such as a user regenerating an answer, copying only part of a code block, or rage-quitting, to find the true failure points (see the event-log sketch after this list).
- Schedule Executive "Hands-On" Blocks: Leaders must discard 15 years of deterministic software intuition. Schedule dedicated time (e.g., 4 AM to 6 AM) to personally use and break the AI tools to understand their non-deterministic nature.
- Implement Continuous Calibration: Treat your AI prompts and models like living organisms. Schedule regular "calibration" cycles to update evaluations, as user behavior will evolve to game the system and underlying models will drift.
- Clean Your "Data Debt": Before buying an AI agent, audit your internal permissions and taxonomies. An AI cannot magically fix broken business logic; if your categorization of support tickets is messy, the AI will fail.
- Avoid the "One-Click Agent" Hype: Be skeptical of tools promising instant autonomy. Prioritize vendors or internal builds that offer a "pipeline" approach—starting with drafting and learning from corrections over time.
- Don't separate Evals and Monitoring: Use your "Evals" to test for errors you can predict, and use "Production Monitoring" to catch the errors you couldn't predict. Feed the monitoring discoveries back into your Evals suite.
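A small sketch of how the "boredom" threshold could be measured from the copilot correction log sketched under Key Concepts. The 99% figure comes from the takeaway itself; the field names and minimum sample size are assumptions:

```python
from typing import Dict, List


def ready_for_autonomy(
    correction_log: List[Dict],          # records like those from log_copilot_interaction
    acceptance_threshold: float = 0.99,  # "humans accept 99% of suggestions without edits"
    min_samples: int = 500,              # assumed: don't promote on a handful of examples
) -> bool:
    """Promote from V2 (copilot) to V3 (autonomy) only when reviewers are 'bored'."""
    if len(correction_log) < min_samples:
        return False
    accepted = sum(1 for r in correction_log if r.get("accepted_verbatim"))
    return accepted / len(correction_log) >= acceptance_threshold
```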
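And a sketch of harvesting one implicit signal, the regenerate click, from a product event log. The event names and the two-minute window are assumptions, since the episode describes the signal but not an implementation:

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Event:
    session_id: str
    kind: str  # e.g. "answer_shown", "regenerate_clicked", "copy", "thumbs_down"
    timestamp: float
    payload: str = ""


def implicit_failures(events: Iterable[Event], window_s: float = 120.0) -> List[Event]:
    """Flag answers the user regenerated shortly after seeing them.

    A regenerate within the window is treated as an implicit thumbs-down,
    even though the user never touched an explicit feedback button.
    """
    last_answer = {}  # session_id -> most recent "answer_shown" event
    flagged: List[Event] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "answer_shown":
            last_answer[ev.session_id] = ev
        elif ev.kind == "regenerate_clicked":
            shown = last_answer.get(ev.session_id)
            if shown and ev.timestamp - shown.timestamp <= window_s:
                flagged.append(shown)  # the shown answer is the likely failure
    return flagged
```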