Evals, error analysis, and better prompts: A systematic approach to improving your AI products

How I AI | Oct 13, 2025

Audio Brief

This episode discusses the most effective strategies for improving AI quality, emphasizing a human-centric, data-driven approach. There are four key takeaways from this conversation.

First, significantly improve your AI by manually reading real user conversations for a few hours. This simple, human-led process provides more immediate and actionable insights than complex automated systems. This initial bottom-up error analysis grounds your understanding of actual system failures, offering high returns.

Second, establish a continuous improvement loop. This virtuous cycle involves logging LLM interactions, evaluating them with specific, actionable criteria, and then using those insights to refine the system, whether through prompt engineering or fine-tuning. This iterative process generates better interactions, feeding the cycle.

Third, abandon vague, generic metrics and dashboards. Instead, create specific, domain-relevant, and human-validated evaluations that provide clear, actionable signals for improvement. Vague terms like "helpfulness" are ineffective because they lack actionability and cannot be reliably measured.

Fourth, treat human review as the essential engine for AI quality, not a bottleneck. Human expertise is crucial for identifying nuanced real-world problems, creating high-quality data, and clarifying system requirements that automated systems alone cannot define. Humans are uniquely capable of understanding what "good" actually looks like for an AI's output.

These insights highlight the power of targeted, human-driven data analysis in building robust and truly effective AI systems.

Episode Overview

  • The most effective way to improve AI quality is to start with a simple, human-led process of manually reviewing real user data to identify and categorize common failure modes (a minimal code sketch of this step follows this list).
  • This initial "error analysis" provides critical grounding for building a structured, three-tiered evaluation framework, moving from unit tests to human evaluations and finally to A/B testing.
  • The conversation emphasizes creating a "virtuous cycle" where insights from evaluations and data curation are continuously used to improve the AI system through prompt engineering or fine-tuning.
  • A central theme is the rejection of vague, generic metrics (like "helpfulness") in favor of specific, domain-relevant evaluations that are actionable and validated by human judgment.
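
As a concrete illustration of the bottom-up error analysis described in the first bullet, here is a minimal sketch in Python. It assumes, purely for illustration, that reviewed traces have been exported to a file named annotated_traces.jsonl and that each record carries a reviewer note and a failure_category label; the file and field names are placeholders, not part of any particular tool mentioned in the episode.

```python
import json
from collections import Counter

# Hypothetical export: one JSON object per line, each a trace a human has
# already reviewed, e.g.
#   {"trace_id": "abc123", "note": "cited the wrong policy document",
#    "failure_category": "bad_retrieval"}

def summarize_failures(path: str = "annotated_traces.jsonl") -> None:
    counts = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            category = record.get("failure_category")
            if category:  # skip traces marked as passing or not yet annotated
                counts[category] += 1
    # Rank failure modes by frequency to decide what to fix, and which
    # automated evals are worth writing, first.
    for category, count in counts.most_common():
        print(f"{count:4d}  {category}")

if __name__ == "__main__":
    summarize_failures()
```

The tooling is beside the point; a spreadsheet works just as well. What matters is ending up with a ranked list of real failure modes before writing any evals.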

Key Concepts

  • Bottom-Up Error Analysis: The foundational step for improving AI quality is to manually review real user interactions (traces) to observe, annotate, categorize, and prioritize actual system failures. This simple, human-driven process provides immediate clarity and a high return on investment.
  • The Virtuous Cycle of AI Improvement: An iterative flywheel process that involves logging LLM interactions, performing evaluation and curation (both human and automated), and using those insights to improve the model, which in turn generates better interactions to log.
  • Structured Evaluation Framework: A three-tiered approach to testing and validation: Level 1 (automated unit tests), Level 2 (manual human review and specific model-based evaluations), and Level 3 (live A/B testing).
  • Domain-Specific Evals: Effective evaluations must be specific to the product's domain, often with binary (pass/fail) outcomes; a hedged code sketch follows this list. Vague, generic metrics like "helpfulness" are useless because they are not actionable and cannot be reliably measured.
  • Human-in-the-Loop: Human expertise is not a bottleneck to be eliminated but is the core engine for driving AI improvement. Manual review is essential for identifying real-world problems, creating high-quality data, and clarifying system requirements.
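
To make "specific, binary, domain-relevant" concrete, the sketch below shows two Level 1 style pass/fail checks for a hypothetical scheduling assistant that must always include a booking link and must only offer dates that appear in its availability data. The product, function names, and trace fields are illustrative assumptions, not examples from the episode; more subjective criteria would sit at Level 2, typically as a human-validated LLM-judge prompt that still returns pass or fail.

```python
import re

def eval_includes_booking_link(response: str) -> bool:
    """Binary, domain-specific check: a reply to a scheduling request must
    contain a booking link. Pass/fail, not a 1-5 'helpfulness' score."""
    return bool(re.search(r"https://example\.com/book/\S+", response))

def eval_no_invented_dates(response: str, available_dates: set[str]) -> bool:
    """Binary check: every date the assistant offers must come from the
    availability data it was actually given."""
    offered = set(re.findall(r"\d{4}-\d{2}-\d{2}", response))
    return offered <= available_dates

# Running the checks against a single logged trace (fields are illustrative):
trace = {
    "response": "You're all set for 2025-10-20: https://example.com/book/42",
    "available_dates": {"2025-10-20", "2025-10-21"},
}
print(eval_includes_booking_link(trace["response"]))                        # True
print(eval_no_invented_dates(trace["response"], trace["available_dates"]))  # True
```

Because each check returns a plain pass or fail tied to a known failure mode, a drop in its pass rate points directly at something to fix, which is exactly what a generic "helpfulness" score cannot do.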

Quotes

  • At 0:05 - "The most important thing is looking at data." - Hamel states that the core principle for improving AI quality is analyzing real user interactions.
  • At 23:31 - "Some of my clients are so happy with just this process that they're like, 'That's great, Hamel, we're done.'" - Hamel highlights that the initial manual error analysis provides so much value and clarity that clients often feel it's a complete solution in itself.
  • At 23:55 - "This is something that no one talks about... Before you get into all that stuff, you need to have some grounding in like what eval you should even write." - Hamel stresses that understanding real-world failures through manual review is a prerequisite for designing meaningful and effective automated evaluations.
  • At 30:07 - "What the hell does that mean? Does anyone know what that means? Nobody knows..." - Hamel criticizes vague measurements like "Helpfulness," arguing they are not actionable and don't help teams solve concrete product issues.
  • At 33:21 - "The research shows... people are really bad at writing specifications or requirements until they need to react to what an LLM is doing to clarify and help them externalize what they want." - Hamel explains that the iterative process of prompting and reviewing an LLM's output is crucial for defining the desired behavior.

Takeaways

  • Start improving your AI by manually reading real user conversations for a few hours; this simple act provides more immediate and actionable insights than complex automated systems.
  • Build a continuous improvement loop: log interactions, evaluate them with specific criteria, use those insights to refine the system, and repeat the process (sketched in code after this list).
  • Abandon vague, generic dashboards and instead create specific, domain-relevant, and human-validated evaluations that provide a clear signal for improvement.
  • Treat human review as the essential engine for quality, not a bottleneck. Humans are uniquely capable of identifying nuanced failures and clarifying what "good" actually looks like.
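
For readers who think in code, the loop in the second takeaway can be pictured as the skeleton below. Every callable here is a stand-in for whatever logging, review, eval, and prompt-editing process a team already has; none of these names come from the episode or any specific library.

```python
from typing import Callable

def improvement_cycle(
    system_prompt: str,
    collect_traces: Callable[[str], list],     # log real LLM interactions
    annotate: Callable[[list], list],          # manual error analysis on a sample
    run_evals: Callable[[list], dict],         # specific pass/fail checks
    refine: Callable[[str, list, dict], str],  # prompt edit or fine-tuning step
    iterations: int = 3,
) -> str:
    """Skeleton of the log -> evaluate -> refine loop, repeated so that each
    pass is grounded in what the latest traces and evals actually show."""
    for _ in range(iterations):
        traces = collect_traces(system_prompt)
        notes = annotate(traces)
        results = run_evals(traces)
        system_prompt = refine(system_prompt, notes, results)
    return system_prompt
```

Passing the steps in as callables keeps the skeleton honest about the fact that several of them, especially annotation, are human processes rather than library calls.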