How Hex Builds AI Agents: Making Agents Reason Like Human Data Analysts | Izzy Miller, AI Engineer
Audio Brief
This episode covers the unique challenges of building AI agents for data analytics and why data is a uniquely cursed domain compared to standard software engineering.
There are three key takeaways from this conversation. First, the necessity of strict semantic models to ground agents in approved business logic. Second, the importance of dividing monolithic applications into specialized agents to prevent context overload. Third, the critical need for long-horizon testing and governed context over individualized user memory.
Unlike software code, which either compiles or doesn't, correctness in data work is highly subjective. AI cannot deduce company-specific business logic purely from raw database schemas. Curated semantic layers are required to define metrics like customer churn and keep agent outputs strictly on rails. Organizations must maintain strict separation between ephemeral user preferences and official organizational context to prevent dangerous hallucinations.
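The "on rails" idea above can be sketched as a minimal semantic layer: a governed registry of approved metric definitions that the agent must resolve every request through, failing loudly on anything undefined. The names (`SEMANTIC_LAYER`, `resolve_metric`) and the SQL are illustrative assumptions, not Hex's actual implementation.

```python
# Minimal sketch of a governed semantic layer: the agent may only
# reference metrics the data team has explicitly defined.
# Metric names and SQL are illustrative, not real Hex internals.

SEMANTIC_LAYER = {
    "customer_churn": {
        "description": "Share of customers active last month but not this month.",
        "sql": """
            SELECT 1.0 - COUNT(DISTINCT cur.customer_id)::float
                       / COUNT(DISTINCT prev.customer_id)
            FROM active_last_month prev
            LEFT JOIN active_this_month cur USING (customer_id)
        """,
    },
}

def resolve_metric(name: str) -> str:
    """Return approved SQL for a metric, refusing anything ungoverned."""
    try:
        return SEMANTIC_LAYER[name]["sql"]
    except KeyError:
        raise KeyError(
            f"Metric {name!r} has no governed definition; "
            "the agent should ask the data team rather than guess."
        )
```

The point of the hard failure is the episode's thesis in miniature: an agent that cannot find a governed definition should surface the gap, not improvise one.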
Equipping a single agent with a massive toolset routinely causes analysis paralysis and system collapse. When an AI is fed conflicting context, it can stall in a loop of second-guessing rather than delivering insights. Splitting a monolithic agent into purpose-built, specialized agents with constrained toolsets dramatically improves reliability. This structured approach lets agents handle the multi-step reasoning that real analytics requires.
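The split described above can be sketched as a router that dispatches each task to a specialist, so the model only ever sees a small, constrained toolset. The specialist names, tool names, and keyword routing are all hypothetical; a production router would likely be an LLM classifier.

```python
# Sketch of splitting one monolithic agent into specialists, each with
# a small, constrained toolset. All names here are hypothetical.

SPECIALISTS = {
    "sql":   {"tools": ["run_query", "list_tables"]},
    "chart": {"tools": ["render_chart"]},
    "docs":  {"tools": ["search_semantic_layer", "search_wiki"]},
}

def route(task: str) -> str:
    """Naive keyword router; a real system might use an LLM classifier."""
    lowered = task.lower()
    if any(w in lowered for w in ("plot", "chart", "visualize")):
        return "chart"
    if any(w in lowered for w in ("define", "definition", "meaning")):
        return "docs"
    return "sql"

def tools_for(task: str) -> list[str]:
    """The chosen specialist sees only its own tools, never the full set."""
    return SPECIALISTS[route(task)]["tools"]
```

The design choice worth noting: reliability comes less from smarter routing than from the guarantee that no specialist is ever handed the union of all tools.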
Analytics is rarely a simple, single-shot query process. It requires conversational exploration, where initial answers prompt users to drill down further. Because of this, current public benchmarks that test unrealistic, static scenarios are deeply flawed. Meaningful evaluation requires long-horizon simulations that mimic dynamic warehouse environments, schema changes, and complex user interactions over time.
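A long-horizon simulation of the kind described can be sketched as a scripted multi-turn session with a schema change injected mid-run, scoring whether the agent's later answers reflect the new state. The harness shape, the agent interface, and the per-turn checks are illustrative assumptions, not a real benchmark.

```python
# Sketch of a long-horizon eval: rather than scoring one static question,
# the harness scripts a multi-turn session and mutates the warehouse
# schema mid-run to test whether the agent keeps up. Illustrative only.

def run_long_horizon_eval(agent, turns, schema_change_at=2):
    schema = {"orders": ["id", "amount", "created_at"]}
    transcript, passed = [], 0
    for i, (question, check) in enumerate(turns):
        if i == schema_change_at:              # simulate a live warehouse:
            schema["orders"].append("region")  # columns change under the agent
        answer = agent(question, schema)
        transcript.append(answer)
        passed += check(answer)
    return passed / len(turns), transcript
```

A trivial agent that echoes the current schema is enough to exercise the harness; a real eval would plug in the production agent and much richer checks.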
To maintain accuracy during these long-horizon workflows, establishing robust human-in-the-loop feedback systems is critical. Data administrators need a comprehensive view to review flagged answers and continually update organizational context. This oversight builds the necessary trust for non-technical business stakeholders who cannot verify code themselves.
Ultimately, building effective data AI requires tightly governed infrastructure and specialized engineering talent driven by deep domain expertise rather than generalized technical skills.
Episode Overview
- Explores the unique challenges of building AI agents for data analytics, contrasting them with software engineering agents due to the highly subjective and iterative nature of data discovery.
- Traces the evolution of data AI from restrictive, single-shot query generation tools to complex, long-horizon agents that must actively manage context and recover from errors.
- Emphasizes that the future of reliable enterprise AI relies on human-in-the-loop systems, robust context management, and prioritizing domain empathy over pure technical ML skills.
Key Concepts
- The Limitation of Single-Shot Workflows: Early AI data tools relied on generating a single query from a prompt. This failed because real data analytics is an iterative process of discovery and "rabbit-holing," requiring conversational, multi-step capabilities.
- Coding vs. Data Analytics Agents: Software coding often involves executing a defined spec with a binary pass/fail state. Data analytics is subjective; an agent might write perfectly valid SQL that still returns the wrong business metric due to organizational nuances.
- Context Management as the Core Bottleneck: AI model reasoning is no longer the primary limitation. The main architectural challenge is dynamically retrieving and filtering context, as giving an agent too much or conflicting context causes it to stall or hallucinate.
- The Problem with "Showing Your Work": Simply displaying the generated SQL or Python code does not build trust with non-technical business users. Agents need stronger, governed ways to verify accuracy and prove confidence against organizational definitions.
- Human-in-the-Loop Observability: Because data "truth" is subjective, agents will inevitably make mistakes. Systems must allow data teams to review AI failures and update the core context (like semantic models or documentation) to continuously train the system.
- The Danger of User-Level Memory: Allowing an AI to "remember" an individual user's potentially flawed definition of a metric is dangerous for enterprises. It creates conflicting truths across the organization; governed, systemic context is necessary to ensure accuracy.
- Long-Horizon Evaluations: Traditional static, point-in-time AI benchmarks are inadequate for data agents. Effective evaluation requires simulating long-horizon tasks where an agent must explore an environment, encounter errors, and self-correct over time.
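The tension between user-level memory and governed truth in the concepts above can be sketched as a context store that accepts personal preferences but refuses user-level overrides of governed definitions, and always resolves governed context first. The class and its policy are illustrative assumptions, not Hex's design.

```python
# Sketch: user memory may hold preferences, but governed definitions
# (metrics, business logic) change only through the data team.
# Class name and policy are illustrative, not Hex's actual design.

class ContextStore:
    def __init__(self):
        self.governed = {"churn": "defined in the semantic layer"}
        self.user_memory = {}

    def remember(self, user, key, value):
        if key in self.governed:
            raise PermissionError(
                f"{key!r} is governed; users cannot redefine it locally."
            )
        self.user_memory[(user, key)] = value

    def lookup(self, user, key):
        # Governed context always wins over personal memory.
        return self.governed.get(key, self.user_memory.get((user, key)))
```

The `lookup` ordering encodes the takeaway directly: even if a flawed personal definition somehow landed in memory, the governed answer would still be the one the agent sees.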
Quotes
- At 0:02:48 - "we were very focused on this single shot type workflow, which I think is kind of in many ways uniquely cursed in data analytics because it is this iterative domain in which you get an answer and it's like, 'hmm, interesting. I want to follow up.'" - Explains why conversational, iterative agents are necessary for data tasks over simple code-generation tools.
- At 0:04:00 - "it was increasingly feeling like we had this high horsepower beast driving 25 in a school zone. It's like the thing we are harnessing here needs more context, not just from this cell, but from the whole project." - Illustrates the turning point where model capabilities outpaced the restrictive UI, necessitating a broader context window.
- At 0:07:49 - "A lot of the way that I code... you layout the plan up front and it executes... you're kind of either yes or no. And I think that when you do good data science... there are many decision points along the journey... in which you're not just saying 'good job, bad job,' but you're saying like, 'oh, that's interesting. Should we look at it this way?'" - Highlights the core difference between building coding agents and data agents.
- At 0:14:11 - "we break our concept of context up into dynamic and static context... the capabilities bundle together tools, static context, prompts, also some weird esoteric finicky stuff like final turn behavior." - Provides insight into how complex agents are architected behind the scenes to manage context and capabilities.
- At 0:24:26 - "if you inject in a piece of conflicting context, it goes against some other information it has, it will spend 30 minutes pondering going 'wait, wait, but let me see, hmm actually' and enter this crazy sort of collapse mode." - Demonstrates the fragility of LLMs when presented with contradictory information and the importance of clean context pipelines.
- At 0:28:20 - "I think that the concept of showing your work is not the right concept for data work in terms of citation and verifiability. I think it needs to be some stronger form of verification or confidence or accuracy." - Explains why code generation alone doesn't solve the trust problem for business stakeholders.
- At 0:38:35 - "memory, as users think about it, I think is a very user-level concept... I think that's less impactful and almost scary to data teams and admins who are worried that the exact opposite of what we just talked about might happen within a user's little memory" - Highlights the risk of localized AI memory overriding governed organizational truths.
- At 0:46:17 - "the core feedback loop of Hex improving today is driven by the data team. It's by users doing work, their work or sort of the exhaust of their work being surfaced to the data team in an interface that allows them to, you know, notice mistakes, improve the guides, the models, the warehouse context" - Details the practical mechanism for continuous agent improvement in an enterprise setting.
- At 1:00:00 - "as a general rule, most eval sets are bad. Unless they are being actively worked on." - Critiques the static nature of standard AI benchmarks and their inability to measure real-world performance accurately.
- At 1:04:12 - "the most interesting way to evaluate data agents and data work has to be long horizon." - Points toward the future of agent evaluation, emphasizing the need to test how agents navigate complex, multi-step processes over time.
- At 1:11:11 - "the best person is someone who really cares about the problem that you're trying to solve." - Summarizes the most important trait for modern AI engineers building product-focused agents.
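The "capability" concept in the 0:14:11 quote can be sketched as a bundle that travels together: tools, static context, a prompt, and final-turn behavior, with dynamic context merged in at request time. The field names are inferred from the quote and are guesses, not Hex's schema.

```python
# Sketch of the "capability" bundle from the 0:14:11 quote: tools,
# static context, and prompts packaged together, with dynamic context
# resolved per request. Field names are inferred, not Hex's schema.
from dataclasses import dataclass

@dataclass
class Capability:
    name: str
    tools: list[str]
    static_context: str                     # always-on docs and definitions
    prompt: str
    final_turn_behavior: str = "summarize"  # how the agent should end a turn

def assemble_context(cap: Capability, dynamic_context: str) -> str:
    """Combine the capability's static context with per-request context."""
    return f"{cap.prompt}\n\n{cap.static_context}\n\n{dynamic_context}"
```

Bundling like this keeps the static/dynamic split from the quote explicit: the capability owns everything that never changes, and only `dynamic_context` is retrieved fresh each turn.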
Takeaways
- Transition data tools away from single-shot generation features toward interfaces that support multi-turn, exploratory workflows.
- Prevent LLM context collapse by consolidating tool access and using intermediate retrieval agents rather than overwhelming the model with thousands of options.
- Implement observability systems that capture AI failures, allowing data teams to treat error exhaust as a continuous training loop for organizational semantic models.
- Restrict user-level AI memory regarding business logic; centrally govern all metric definitions to prevent conflicting organizational truths.
- Abandon static, single-turn benchmarks in favor of custom, long-horizon evaluations that test an agent's ability to iteratively fail and self-correct.
- Hire AI engineers based on deep domain empathy and product intuition rather than solely indexing on pure machine learning expertise.
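The "intermediate retrieval agents" takeaway above can be sketched as a pre-step that narrows a large tool catalog to a top-k shortlist before the main agent ever sees it. The catalog, the keyword-overlap scoring, and the function names are toy assumptions; a real system would likely embed and rank.

```python
# Sketch of an intermediate tool-retrieval step: score each catalog
# entry against the task and pass only the top-k tools onward, instead
# of overwhelming the agent with the full catalog. Toy keyword scoring.

TOOL_CATALOG = {
    "run_query": "execute sql query against the warehouse",
    "render_chart": "draw plot chart visualization",
    "search_wiki": "search internal documentation wiki",
    "list_tables": "list warehouse tables and schemas",
}

def retrieve_tools(task: str, k: int = 2) -> list[str]:
    """Rank tools by keyword overlap with the task; return the top k."""
    words = set(task.lower().split())
    ranked = sorted(
        TOOL_CATALOG,
        key=lambda t: len(words & set(TOOL_CATALOG[t].split())),
        reverse=True,
    )
    return ranked[:k]
```

The shape matters more than the scoring: whatever ranks the tools, the main agent's context only ever contains the shortlist, which is the consolidation the takeaway argues for.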