What OpenAI & Google engineers learned deploying 50+ AI products in production

Lenny's Podcast Jan 11, 2026

Audio Brief

In this conversation, the discussion centers on the fundamental engineering shift required to move from building deterministic software to managing non-deterministic AI products. There are four key takeaways to understand about safely increasing AI agency and managing the transition to autonomous systems.

First, product teams must respect the inverse relationship between AI agency and human control. You cannot launch fully autonomous agents on day one. Instead, teams should follow a stepped progression known as the V1 to V3 Framework. This begins with V1, where the AI only routes tasks or acts as a consultant. It moves to V2, a Copilot model where the AI drafts content for human review. You only reach V3, full autonomy, once the system has proven reliability. The operational metric for this transition is boredom. You should only grant full autonomy when human supervisors are bored because they are accepting 99 percent of the AI's suggestions without edits.

Second, the Copilot phase is not just a safety buffer; it is a critical data generation strategy. When a human edits an AI-generated draft, the difference between the draft and the final version creates a golden dataset for error analysis. This correction data is essential for training the model to handle tasks independently. It allows you to harvest implicit feedback, such as a user re-generating an answer or editing a code block, rather than relying on unreliable explicit feedback like thumbs-up buttons.

Third, traditional Continuous Integration and Deployment is insufficient for AI. Teams need to implement Continuous Calibration. AI systems degrade not just because of code bugs, but due to model drift and changing user behaviors. To manage this, teams must distinguish between Evals and Monitoring. Evals test for known scenarios and codified product thinking, while Production Monitoring captures the unknown unknowns of reality. The cycle requires feeding discoveries from monitoring back into the evaluation suite to prevent regression.

Finally, organizational culture is often a bigger bottleneck than technology. Executives must discard years of deterministic software intuition and rebuild their understanding of how probabilistic systems fail. The new competitive advantage is no longer just technical knowledge, but the willingness to endure the painful, tedious process of cleaning data and refining judgment. Leaders must schedule hands-on time to break the tools themselves to understand the non-deterministic nature of the products they are building.

Ultimately, success in AI adoption depends less on buying one-click agents and more on building a pipeline that learns, adapts, and calibrates over time.

Episode Overview

  • Explores the fundamental shift required to move from building deterministic software (predictable inputs/outputs) to non-deterministic AI products (probabilistic behavior).
  • Details the "V1-V3 Framework" for safely increasing AI agency, guiding teams on how to move from human-in-the-loop copilots to fully autonomous agents.
  • Examines the "Success Triangle" necessary for AI adoption, emphasizing that organizational culture and vulnerable leadership are often bigger bottlenecks than technology.
  • Provides a roadmap for establishing "Continuous Calibration" (CC/CD) to manage model drift and changing user behaviors in production environments.

Key Concepts

  • Deterministic vs. Non-Deterministic Engineering: Traditional software relies on a contract where specific inputs yield predictable outputs. AI development breaks this contract. Product teams must shift from building rigid specifications to designing for "behavior calibration," managing infinite variations of natural language input and probabilistic model output.

  • The Agency vs. Control Trade-off: There is an inverse relationship between an AI's autonomy (Agency) and human control. You cannot start at high agency. You must earn the right to increase autonomy by proving reliability. The progression typically flows (a minimal code sketch follows the list below):

  • V1 (High Control): Routing or Consulting (AI suggests, Human acts).
  • V2 (Shared Control): Copilot (AI drafts, Human reviews/edits).
  • V3 (High Agency): Autonomy (AI acts, Human audits).
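
A minimal sketch of what this staged hand-off could look like in application code, assuming hypothetical `model`, `review_queue`, and `audit_log` interfaces (none of these names come from the episode):

```python
from enum import Enum

class AgencyLevel(Enum):
    V1_ROUTE = 1       # AI suggests a route/category; a human does the work
    V2_COPILOT = 2     # AI drafts; a human reviews and edits before anything ships
    V3_AUTONOMOUS = 3  # AI acts directly; humans audit samples after the fact

def handle_ticket(ticket, level, model, review_queue, audit_log):
    """Dispatch one support ticket according to the current agency level."""
    if level is AgencyLevel.V1_ROUTE:
        category = model.classify(ticket.text)         # AI only classifies
        review_queue.assign(ticket, category)          # a human resolves it
    elif level is AgencyLevel.V2_COPILOT:
        draft = model.draft_reply(ticket.text)         # AI proposes a reply
        review_queue.submit_for_review(ticket, draft)  # a human edits/approves
    else:  # AgencyLevel.V3_AUTONOMOUS
        reply = model.draft_reply(ticket.text)
        ticket.respond(reply)                          # AI acts on its own
        audit_log.sample(ticket, reply)                # humans spot-check later
```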

  • The "Copilot" Data Strategy The "Copilot" (V2) phase is not just a safety measure; it is a data generation strategy. When an AI drafts a response and a human edits it, the difference between the draft and the final version acts as a "golden dataset" for error analysis. This correction data is essential for training the model to eventually handle the task autonomously.

  • Continuous Calibration (CC/CD): Traditional CI/CD (Continuous Integration/Deployment) is insufficient for AI. Teams need CC/CD (Continuous Calibration). This acknowledges that AI systems degrade not just because of code bugs, but because of "Model Drift" (underlying LLMs change) and "Behavioral Drift" (users change how they prompt or try to game the system over time).
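
A minimal sketch of a scheduled calibration job that re-scores the eval suite against the live model and flags degradation relative to a stored baseline (the threshold and interfaces are assumptions, not from the episode):

```python
def calibration_run(eval_cases, model, baseline_score, tolerance=0.03):
    """Re-score the eval suite against the live model and flag drift vs. the baseline."""
    passed = sum(1 for case in eval_cases if case.check(model.respond(case.prompt)))
    score = passed / len(eval_cases)
    return {
        "score": round(score, 3),
        "baseline": baseline_score,
        "drifted": score < baseline_score - tolerance,  # model drift or behavioral drift
    }
```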

  • Evals vs. Monitoring (The Reality Loop): Companies often confuse these two or prioritize one over the other (a sketch of the feedback loop follows the list below).

  • Evals: Codified product thinking representing known scenarios (what you expect the system to do).
  • Monitoring: Capturing production reality (unknown unknowns and edge cases).
  • The Cycle: Monitoring discovers new failure modes, which must then be turned into new Evals to prevent regression.
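
A minimal sketch of that feedback loop, assuming production traces carry some failure flag (all names here are illustrative):

```python
def promote_failures_to_evals(production_traces, eval_suite):
    """Turn failures caught by monitoring into regression evals so they stay fixed."""
    for trace in production_traces:
        if trace.flagged_as_failure:  # e.g. regenerated answer, escalation, heavy edit
            eval_suite.add_case(
                prompt=trace.user_input,
                expected_behavior=trace.corrected_output,  # human-approved fix, if any
                source="production_monitoring",
            )
    return eval_suite
```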

  • Pain as the New Moat: In an era where implementation costs are dropping to near-zero and "building" is cheap, technical knowledge is no longer the primary differentiator. The new competitive advantage comes from the willingness to endure the painful, tedious process of cleaning data, calibrating models, and refining judgment ("taste") that others avoid.

Quotes

  • At 0:08:23 - "You're pretty much working with a non-deterministic API as compared to traditional software... In traditional software, you pretty much have a very well-mapped decision engine or workflow." - Ash explaining the fundamental engineering shift required when moving from standard SaaS to AI products.
  • At 0:10:12 - "Every time you hand over decision-making capabilities or autonomy to agentic systems, you're kind of relinquishing some amount of control on your end." - Ash highlighting the inherent risk management required in AI product design.
  • At 0:11:49 - "If your objective is to hike Half Dome in Yosemite... you don't start hiking it every day. You start training yourself... in minor paths and then you slowly improve." - Kiriti using an analogy to explain why you shouldn't launch fully autonomous agents on Day 1.
  • At 0:19:08 - "It's also the most beautiful part of AI... we're all much more comfortable talking than following a bunch of buttons... The bar to using AI products is much lower... but that's also the problem." - Ash on the double-edged sword of natural language interfaces: they are accessible but highly unpredictable.
  • At 0:25:50 - "It's never always technical. Every technology problem is a people problem first... It's these three dimensions: great leaders, good culture, and technical prowess." - Shreya Rajpal explaining that organizational structure is often the bigger blocker to AI adoption than the LLMs themselves.
  • At 0:26:22 - "Leaders have built intuitions over 10 or 15 years... but now with AI in the picture, those intuitions will have to be relearned... You probably are the dumbest person in the room and you want to learn from everyone." - Shreya Rajpal on why executives cannot delegate AI strategy and must rebuild intuition.
  • At 0:33:52 - "Think of what are evals? Evals are basically your trusted product thinking... This is the kind of problems that my agent should not do... Production monitoring... allows you to catch different kinds of problems that you might encounter." - Kirite Vashee dismantling the idea that you only need one or the other; evals test your hypothesis, monitoring tests reality.
  • At 0:35:22 - "If you don't like the answer, sometimes customers don't give you thumbs down, but actually regenerate the answer. So that is a clear indication that... the initial answer... is not meeting the customer's expectation." - Kirite Vashee on the importance of implicit signals over explicit feedback for training data.
  • At 0:41:42 - "If someone is selling you one-click agents, it’s pure marketing. You don't want to buy into that. I would rather go with a company that says 'We're going to build this pipeline for you that will learn over time'." - Shreya Rajpal warning against the hype of instant AI solutions versus pipelines that learn.
  • At 0:50:35 - "Try to think of lower agency iterations in the beginning... constrain the number of decisions your AI systems can make... and then increase that over time because you're building a flywheel of behavior." - Shreya Rajpal advocating for a stepped deployment approach.
  • At 0:56:04 - "When you do this [Copilot stage], you're also logging human behavior, which means that... you're actually getting error analysis for free... because you're literally logging everything that the user is doing that you could then build back into your flywheel." - Aishwarya Ashok explaining why "human-in-the-loop" is a data generation strategy.
  • At 0:58:27 - "There's not really a rulebook you can follow, but it's all about minimizing surprise... If you figure out that you're not seeing new data distribution patterns... and that's when you know you can actually go to the next stage." - Aishwarya Ashok defining the specific trigger for when it is safe to move an agent from Copilot to Autonomy.
  • At 1:02:17 - "People have this notion of... I'm going to divide the responsibilities based on functionality and somehow expect all of this to work together in some sort of gossip protocol... is misunderstood." - Kirthi Kumar critiquing the over-hyped view of autonomous multi-agent swarms versus structured supervision.
  • At 1:12:46 - "You can learn anything overnight... having that persistence and going through that pain of learning this, implementing this, and understanding what works and what doesn't work... that pain is the new moat." - Kirthi Kumar identifying the new source of career and business value in an AI world.

Takeaways

  • Start with "Routing" not "Solving": Begin your AI implementation by asking the model to classify and route tasks (V1) rather than solve them. This exposes messy internal data taxonomies without risking customer trust.
  • Use "Boredom" as the Metric for Autonomy: Monitor the logs of your human-in-the-loop systems. Only transition to fully autonomous agents when the human supervisors are "bored" because they are accepting 99% of the AI's suggestions without edits.
  • Harvest Implicit Feedback: Do not rely on "thumbs up/down" buttons. Build listeners for implicit signals—like a user re-generating an answer, copying only part of a code block, or rage-quitting—to find the true failure points.
  • Schedule Executive "Hands-On" Blocks: Leaders must discard 15 years of deterministic software intuition. Schedule dedicated time (e.g., 4 AM to 6 AM) to personally use and break the AI tools to understand their non-deterministic nature.
  • Implement Continuous Calibration: Treat your AI prompts and models like living organisms. Schedule regular "calibration" cycles to update evaluations, as user behavior will evolve to game the system and underlying models will drift.
  • Clean Your "Data Debt": Before buying an AI agent, audit your internal permissions and taxonomies. An AI cannot magically fix broken business logic; if your categorization of support tickets is messy, the AI will fail.
  • Avoid the "One-Click Agent" Hype: Be skeptical of tools promising instant autonomy. Prioritize vendors or internal builds that offer a "pipeline" approach—starting with drafting and learning from corrections over time.
  • Don't separate Evals and Monitoring: Use your "Evals" to test for errors you can predict, and use "Production Monitoring" to catch the errors you couldn't predict. Feed the monitoring discoveries back into your Evals suite.
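
For the "boredom" threshold above, the signal can be computed straight from the copilot correction logs. A minimal sketch; the 99% bar is the figure quoted in the episode, everything else here (field names, sample size) is an assumption:

```python
def ready_for_autonomy(correction_log, threshold=0.99, min_samples=500):
    """True when reviewers accept nearly every draft untouched over a large enough sample."""
    if len(correction_log) < min_samples:
        return False  # not enough evidence to promote the agent yet
    accepted = sum(1 for record in correction_log if record["accepted_unchanged"])
    return accepted / len(correction_log) >= threshold
```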