The Gradient Podcast - Davidad Dalrymple: Towards Provably Safe AI

The Gradient Sep 05, 2024

Audio Brief

In this conversation, Davidad Dalrymple discusses his evolving views on AI safety, moving from techno-optimism to focusing on existential risk mitigation through a strategic approach to technology development. There are four key takeaways from this discussion. First, the order of technological innovation is paramount. Second, AI safety should adopt a rules-based, deontological approach. Third, a practical strategy involves building provably safe, special-purpose AI for critical, narrow domains. Finally, long-term AI safety necessitates an ongoing, dynamic human governance process.

Dalrymple advocates for differential technology development, prioritizing safety, control, and verification technologies before building more powerful AI capabilities. This approach, central to his "Safeguarded AI" program, establishes a secure foundation. It recognizes that the sequence of innovation profoundly impacts AI outcomes, urging a shift from "alignment by default" to proactive risk mitigation.

AI safety should adopt a rules-based, deontological approach rather than a utilitarian one. This means focusing on establishing verifiable, often binary constraints to prevent specific harms. It moves away from trying to define and optimize for a single, broad definition of "good," which is vulnerable to Goodhart's Law and often proves fragile.

A practical strategy involves building provably safe, special-purpose AI for critical, narrow domains. Instead of attempting to solve the alignment problem for a hypothetical general superintelligence all at once, this approach focuses on creating verifiable safety in specific applications. These systems, built for domains like power grids or transportation, can then be safely composed.

Long-term AI safety necessitates an ongoing, dynamic human governance process. This multi-stakeholder, deliberative approach is crucial for frequently revising AI specifications and rules, preventing permanent value lock-in.

This strategy aims to shift the global AI race from a competitive "Prisoner's Dilemma" to a cooperative "Stag Hunt," demonstrating that safer development paths are viable and attractive. These insights underscore a critical shift towards pragmatic, verifiable, and cooperatively governed approaches to navigating the complex future of AI development.
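To make the constraint-based versus optimization-based contrast concrete, here is a minimal Python sketch using a hypothetical power-grid dispatch example (the plan names, thresholds, and numbers are illustrative assumptions, not from the episode). It compares two decision rules: pick the plan with the highest proxy "goodness" score, or accept only plans that pass explicit, binary safety constraints.

```python
from dataclasses import dataclass

@dataclass
class GridPlan:                       # hypothetical power-grid dispatch plan
    name: str
    proxy_score: float                # proxy metric a utility-maximizer would chase
    frequency_deviation_hz: float     # true safety-relevant quantities
    line_load_fraction: float

def satisfies_spec(plan: GridPlan) -> bool:
    """Binary, independently checkable constraints: each is pass/fail, not a weight in a score."""
    return (abs(plan.frequency_deviation_hz) <= 0.5   # keep grid frequency near nominal
            and plan.line_load_fraction <= 0.9)       # never overload transmission lines

plans = [
    GridPlan("aggressive", proxy_score=9.7, frequency_deviation_hz=0.8, line_load_fraction=0.97),
    GridPlan("balanced",   proxy_score=8.1, frequency_deviation_hz=0.3, line_load_fraction=0.85),
    GridPlan("cautious",   proxy_score=6.4, frequency_deviation_hz=0.1, line_load_fraction=0.60),
]

best_by_proxy = max(plans, key=lambda p: p.proxy_score)
admissible = [p.name for p in plans if satisfies_spec(p)]

print("Proxy optimization picks:", best_by_proxy.name)   # "aggressive" -- violates both constraints
print("Plans passing the spec:", admissible)              # ['balanced', 'cautious']
```

The proxy-maximizing rule selects the plan that violates the specification, while the constraint-based rule never admits it; that is the sense in which pass/fail constraints are less exposed to Goodhart-style proxy gaming than a single score to be maximized.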

Episode Overview

  • Davidad Dalrymple discusses his personal evolution on AI safety, moving from a techno-optimist who believed in "alignment by default" to a pragmatist focused on mitigating existential risks.
  • The conversation introduces "differential technology development," a core strategy that prioritizes creating safety, control, and verification technologies before developing more powerful AI capabilities.
  • Dalrymple advocates for a "deontological" or rules-based approach to AI safety, focusing on establishing verifiable constraints to prevent harm, rather than a "utilitarian" approach of optimizing for a single definition of good.
  • The strategy is to shift the global AI race from a competitive "Prisoner's Dilemma" to a cooperative "Stag Hunt" by demonstrating that safer development paths are viable and attractive.
  • The discussion covers the practical implementation of these ideas through the "Safeguarded AI" program at ARIA, which aims to build provably safe, special-purpose AI for critical domains.

Key Concepts

  • Differential Technology Development: The strategic principle that the order of technological innovation matters. The focus should be on developing safety, control, and governance technologies first to create a secure foundation for subsequent, more powerful AI systems.
  • Shift from Utilitarian to Deontological Safety: Moving away from trying to define and maximize a single utility function for "good," which is vulnerable to Goodhart's Law. The alternative is a rules-based approach that establishes verifiable, often binary (pass/fail) constraints to prevent specific harms.
  • The Orthogonality Thesis: The idea that an AI's level of intelligence is independent of its ultimate goals or values. Dalrymple's acceptance of this thesis was a key reason for his shift from believing AGI would be "aligned by default" to seeing it as a major risk.
  • Game Theory and AI Strategy: Modeling the global AI race as a game. The goal is to change the incentives from a "Prisoner's Dilemma" (where actors race ahead unsafely) to a "Stag Hunt" (where cooperation on safety becomes the most rational strategy); see the payoff sketch after this list.
  • Provably Safe, Special-Purpose AI: Rather than trying to solve safety for a hypothetical, all-powerful AGI, the strategy focuses on building verifiably safe AI for narrow, critical domains like power grids or transportation, and then composing them.
  • Human-in-the-Loop Governance: The critical need for an ongoing, multi-stakeholder, collective, and deliberative human process to frequently revise the rules and specifications guiding AI behavior, preventing permanent "value lock-in."
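As a rough illustration of the Prisoner's Dilemma versus Stag Hunt framing, here is a small Python sketch with illustrative payoff numbers (my own, not taken from the episode). In the dilemma, racing ahead is the only pure-strategy Nash equilibrium; in the stag hunt, mutual cooperation on safety is also an equilibrium, and the best-paying one, which is why demonstrating a viable safer path can change the game.

```python
from itertools import product

# Payoffs are (row player, column player); strategies: 0 = "cooperate" (safer path), 1 = "race".
PRISONERS_DILEMMA = {
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}
STAG_HUNT = {
    (0, 0): (5, 5), (0, 1): (0, 3),
    (1, 0): (3, 0), (1, 1): (2, 2),
}

def pure_nash_equilibria(game):
    """Profiles where neither player can gain by unilaterally switching strategies."""
    equilibria = []
    for r, c in product((0, 1), repeat=2):
        row_ok = all(game[(r, c)][0] >= game[(alt, c)][0] for alt in (0, 1))
        col_ok = all(game[(r, c)][1] >= game[(r, alt)][1] for alt in (0, 1))
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

NAMES = {0: "cooperate", 1: "race"}
for label, game in [("Prisoner's Dilemma", PRISONERS_DILEMMA), ("Stag Hunt", STAG_HUNT)]:
    profiles = [(NAMES[r], NAMES[c]) for r, c in pure_nash_equilibria(game)]
    print(f"{label}: pure-strategy equilibria -> {profiles}")
    # Prisoner's Dilemma -> only (race, race); Stag Hunt -> (cooperate, cooperate) and (race, race)
```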

Quotes

  • At 4:11 - "My opinion about the net effect on the well-being of the future of humanity from having AGI sooner has switched from positive to negative." - Dalrymple explains the most significant change in his thinking, driven by a deeper appreciation for AI safety risks.
  • At 22:06 - "one should think about path dependence in the order in which technologies are developed and how that affects their impact." - Dalrymple explains the core principle of differential technology development, emphasizing the importance of the sequence of innovation.
  • At 25:35 - "My theory of change is if we can demonstrate that this slower path is not as slow as it seems, then that opens up the possibility for a cooperative equilibrium." - He outlines his strategy to shift the global AI race from a competitive "Prisoner's Dilemma" to a cooperative "Stag Hunt" by making the safer route more attractive.
  • At 47:50 - "I'm not a utilitarian... I'm taking an approach that's a little bit more like... deontology, where I'm saying, 'Let's try not to mess things up,' rather than... 'try to do world optimization.'" - He clarifies his philosophical approach to AI safety, prioritizing the avoidance of harm through verifiable constraints over trying to define and maximize a universal "good."
  • At 52:48 - "It's really important that humans have a multi-stakeholder, collective, deliberative process for revising the specifications quite frequently." - Dalrymple stresses the need for an ongoing, human-led process to update AI values and rules, rather than locking them in permanently.

Takeaways

  • Prioritize the development of safety and verification technologies over raw AI capabilities; the order in which we innovate is critical for a safe outcome.
  • AI safety should focus on establishing clear, verifiable rules and constraints (a deontological approach) to prevent harm, as attempting to optimize for a single concept of "good" is fragile and prone to catastrophic failure.
  • A practical path forward is to build provably safe, specialized AI for specific, critical domains rather than attempting to solve the alignment problem for a hypothetical general superintelligence all at once.
  • Long-term safety requires establishing a dynamic, human-led governance process to continually update AI values and rules, ensuring that AI remains aligned with evolving societal needs.