Building AlphaGo from scratch – Eric Jang

Dwarkesh Patel May 15, 2026

Audio Brief

This episode covers the historical breakthrough of AlphaGo and how its architectural lessons apply to the current bottlenecks in large language model development. There are three key takeaways. First, complex algorithmic search can be amortized during training to create rapid neural intuition at runtime. Second, shifting from delayed, sparse rewards to dense, step-by-step supervision solves massive inefficiencies in reinforcement learning. Third, the true hurdle for automated AI self-improvement is no longer idea generation but the outer verification loop.

On the first point, AlphaGo conquered a game with a state space far exceeding the number of atoms in the universe. It achieved this by using a policy network to reduce the breadth of possible moves and a value network to reduce the depth of necessary search. By running intensive Monte Carlo Tree Search during training, the developers essentially distilled massive computational effort into the network weights. This amortizes compute, allowing the system to instantly predict what a long, exhaustive search would eventually have concluded.

The second takeaway highlights a critical flaw in modern language model training. Current reinforcement learning pipelines often treat an entire generated sequence as a single action, delivering just one delayed reward at the very end. This creates massive variance because the model struggles to assign credit to the specific tokens that caused a success or failure. AlphaGo solved this by using its search tree as an automated expert teacher that provides dense, step-by-step corrections. By training models on full probability distributions rather than binary outcomes, engineers can transfer significantly more relational information per sample.

Finally, we examine the frontier of automated scientific research. As laboratories attempt to build AI that can improve itself, generating new code or architectures is relatively easy; the profound bottleneck is the outer verification loop. Engineers must build automated environments that can definitively prove whether a new idea is actually an improvement over the baseline. To remain robust in production, these systems must also be trained on off-policy data, forcing models to learn how to recover from suboptimal states and compounding errors. Ultimately, deep learning networks should be viewed not simply as pattern recognizers, but as highly efficient mechanisms for caching and deploying complex algorithmic search.

Episode Overview

  • Explores the historical breakthrough of AlphaGo, detailing how combining Monte Carlo Tree Search (MCTS) with deep neural networks solved the practically infinite complexity of the game of Go.
  • Analyzes the critical mechanisms of "amortizing compute," where massive algorithmic search during training is distilled into rapid neural network intuition during inference.
  • Connects the architectural lessons of AlphaGo to modern AI challenges, specifically contrasting the dense supervision of MCTS with the high-variance, sparse rewards currently bottlenecking Reinforcement Learning (RL) in Large Language Models (LLMs).
  • Examines the frontier of AI development, highlighting why off-policy training is necessary for system robustness and why the "outer verification loop" remains the primary hurdle for creating automated AI researchers.

Key Concepts

  • The AlphaGo Dual-Network Breakthrough: Go's game tree (on the order of $361^{300}$ possible move sequences) is too vast for brute-force search. AlphaGo solved this by using a Policy Network to reduce the breadth of the search (focusing only on likely good moves) and a Value Network to reduce the depth (predicting the winner intuitively without playing to the end); see the two-headed network sketch after this list.
  • Monte Carlo Tree Search (MCTS) as an Improvement Operator: MCTS navigates game trees by balancing the exploitation of known good moves with the exploration of new ones. It acts as an automated expert, taking the neural network's baseline predictions and refining them through simulation into a highly confident, accurate distribution of the best next move (see the PUCT selection sketch after this list).
  • Amortizing Computation via Self-Play: By using the optimized, slow output of MCTS as the "ground truth" to train the base neural network, AlphaGo distills (amortizes) the massive computational cost of deep search into a fast, single forward pass. The system effectively learns to instantly predict what a long search would have concluded (see the distillation-loss sketch after this list).
  • The Variance and Credit Assignment Problem in RL: A major inefficiency in modern LLM training is treating an entire generated sequence as a single action with one delayed reward at the end. This creates massive variance because the model struggles to know which specific tokens caused the success or failure. AlphaGo circumvented this by using MCTS to provide dense, step-by-step supervision.
  • Advantage Estimation: To stabilize RL and reduce variance, algorithms use advantage estimation, which compares the reward of a specific action against an expected average baseline. This ensures the model explicitly learns to favor actions that are better than its standard behavior, rather than just learning from absolute rewards (see the baseline-subtraction sketch after this list).
  • Off-Policy Learning for Robustness: Training an AI purely on optimal "happy paths" leaves it vulnerable to compounding errors in the real world. Off-policy learning exposes the model to suboptimal or perturbed states, teaching it the crucial skill of recovering from a mistake and funneling back to a winning condition (see the recovery-data sketch after this list).
  • Information Density (Bits per FLOP) and Distillation: Training models using soft targets (the full probability distributions generated by MCTS or a larger model) is vastly more efficient than using sparse binary rewards (win/loss). Soft labels carry high entropy, teaching the model the relative value of all possible actions simultaneously (see the entropy comparison after this list).
  • The Outer Verification Loop Bottleneck: As labs push toward automated AI researchers, generating novel ideas via LLMs is no longer the main constraint. The profound challenge is building automated, computationally efficient verification environments that can definitively prove whether a new AI architecture or code change is actually an improvement (see the toy verification harness after this list).
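
To ground several of these concepts, the sketches below are minimal Python illustrations; every name, size, and constant is an assumption for exposition, not AlphaGo's actual implementation. First, the dual-network idea: a shared trunk feeds a policy head that prunes search breadth and a value head that prunes search depth.

```python
import torch
import torch.nn as nn

# Minimal sketch of a two-headed Go network: one trunk, two heads.
# Layer sizes are illustrative assumptions, not AlphaGo's architecture.
class PolicyValueNet(nn.Module):
    def __init__(self, board_size=19, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        flat = channels * board_size * board_size
        # Policy head: logits over every board point -> prunes search breadth.
        self.policy_head = nn.Linear(flat, board_size * board_size)
        # Value head: scalar in [-1, 1] estimating the winner -> prunes search depth.
        self.value_head = nn.Sequential(nn.Linear(flat, 1), nn.Tanh())

    def forward(self, board):  # board: (batch, 3, board_size, board_size)
        h = self.trunk(board).flatten(1)
        return self.policy_head(h), self.value_head(h)
```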
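
Next, the selection rule inside AlphaGo-family MCTS: each step down the tree balances exploitation (the mean backed-up value Q of a move) against exploration (an upper-confidence bonus U weighted by the policy network's prior and decayed by visits). This sketches the standard PUCT formula; the constant c_puct and the class layout are illustrative.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy network
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # W(s, a), sum of backed-up values
        self.children = {}        # action -> Node

    def q_value(self):
        # Mean value of the simulations that passed through this node.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    """Pick the child maximizing Q + U; U shrinks as a child is visited."""
    total_visits = sum(c.visit_count for c in node.children.values())
    def puct(child):
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
        return child.q_value() + u
    return max(node.children.items(), key=lambda kv: puct(kv[1]))
```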
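
Amortization closes the loop: the slow search's visit distribution and the eventual game outcome become the training targets for the fast network, so a single forward pass learns to predict what the search would have concluded. A hedged sketch of that training loss, assuming PyTorch tensors:

```python
import torch.nn.functional as F

def self_play_loss(policy_logits, value_pred, mcts_visits, outcome):
    # Normalize MCTS visit counts into a soft target distribution pi.
    pi = mcts_visits / mcts_visits.sum(dim=-1, keepdim=True)
    # Cross-entropy against the soft target: the policy head learns to
    # predict the search's conclusion directly, amortizing its compute.
    policy_loss = -(pi * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
    # The value head regresses toward the actual game result z in {-1, +1}.
    value_loss = F.mse_loss(value_pred.squeeze(-1), outcome)
    return policy_loss + value_loss
```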
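
For advantage estimation, the essential move is subtracting a baseline before weighting the policy gradient, so only better-than-average actions are reinforced. A toy REINFORCE-with-baseline objective; the batch-mean baseline used here is one simple choice among many:

```python
import torch

def policy_gradient_loss(log_probs, rewards):
    baseline = rewards.mean()                   # expected reward under current behavior
    advantages = (rewards - baseline).detach()  # positive only when better than average
    return -(advantages * log_probs).mean()     # REINFORCE-with-baseline objective
```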
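
For off-policy robustness, one common recipe is to perturb states before rollout so the learner collects expert-labeled data from exactly the kinds of positions its own mistakes would produce. Here perturb, expert_label, and the env interface are hypothetical helpers:

```python
import random

def collect_recovery_data(env, policy, expert_label, perturb, n_games=100):
    data = []
    for _ in range(n_games):
        state = env.reset()
        if random.random() < 0.5:
            state = perturb(state)       # inject a suboptimal, off-path position
        while not env.done(state):
            data.append((state, expert_label(state)))  # dense expert target
            state = env.step(state, policy(state))     # roll out the learner itself
    return data
```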
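
The bits-per-sample argument can be made concrete with a quick entropy calculation: a one-hot win/loss label carries essentially zero entropy, while a soft target ranks every action at once. The numbers below are invented for illustration:

```python
import torch
import torch.nn.functional as F

one_hot = torch.tensor([1.0, 0.0, 0.0, 0.0])
soft    = torch.tensor([0.55, 0.25, 0.15, 0.05])  # e.g. MCTS visit frequencies

def entropy(p):
    p = p.clamp_min(1e-12)        # avoid log(0)
    return -(p * p.log2()).sum()  # in bits

print(entropy(one_hot))  # ~0 bits: nothing beyond the argmax
print(entropy(soft))     # ~1.6 bits: the relative ranking of every action

def distill_loss(student_logits, teacher_probs):
    # KL divergence to the teacher's full distribution (expects batched inputs).
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    teacher_probs, reduction="batchmean")
```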
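
Finally, the outer verification loop reduces, at its simplest, to an automated baseline-versus-candidate comparison with enough repetition to separate signal from noise. This is a deliberately toy harness; eval_fn stands in for a trusted, un-gameable automated benchmark, which is the genuinely hard part:

```python
import statistics

def verify_candidate(eval_fn, baseline_model, candidate_model, n_seeds=5):
    base = [eval_fn(baseline_model, seed=s) for s in range(n_seeds)]
    cand = [eval_fn(candidate_model, seed=s) for s in range(n_seeds)]
    # Accept only if the candidate beats the baseline by more than run-to-run noise.
    return statistics.mean(cand) - statistics.mean(base) > statistics.stdev(base)
```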

Quotes

  • At 0:01:23 - "The decisions are actually the result of a very, very deep search. And it's always been very mysterious to me how a 10-layer network can sort of amortize the simulation of something so, so deep in the game tree." - Highlights the core mystery and breakthrough of AlphaGo: using relatively shallow neural networks to approximate an infinite search tree.
  • At 0:04:50 - "And this is what makes Go a very beautiful game, is that um, you can kind of, uh, lose the battle but win the war... as the board size increases, the complexity of these kind of like micro versus macro dynamics gets more interesting." - Explains the strategic depth of Go, where local sacrifices can lead to global advantages.
  • At 0:09:40 - "And this is what makes Go a very difficult game, is that you don't actually know who won until you really get to the end of the game." - Identifies the "sparse reward" problem in Go, necessitating deep search or accurate value approximation.
  • At 0:10:48 - "So this is something to the order of like, you know, 361 to 300, power of 300, which is far more than the number of atoms in the universe." - Illustrates the mathematical impossibility of solving Go through brute-force computation.
  • At 0:25:25 - "the human glances at the board and they know like I'm probably going to lose right and they're essentially running a neural network that looks at a board and implicitly they are amortizing a huge number of possible game playouts" - Explains the conceptual intuition for how value networks operate.
  • At 0:26:15 - "And this is one of the fundamental intuitions behind why AlphaGo works is that you can train a value function to look at a board and quickly resolve the game without playing out all of these trees into a very deep search depth." - Summarizes replacing exhaustive end-game simulations with rapid neural network evaluation.
  • At 0:27:30 - "conceptually there's kind of two problems there there's the breadth of the tree and then there's the depth of the tree and alphago gives us a way to basically uh shrink both of those to be very practically" - Describes the core innovation of AlphaGo's dual network architecture.
  • At 0:35:50 - "go is a perfect information game and in perfect information games um there does exist a Nash equilibrium strategy for which you can do no worse than any other strategy" - Lays out the theoretical mathematical foundation that makes AlphaGo possible.
  • At 0:42:49 - "over time the raw network actually takes all of the burden of that big TPU pod and just pushes it into the network and and you can do all of that work with one you know neural network process" - Explains how compute is shifted from test-time search to the network's weights.
  • At 0:58:14 - "The Monte Carlo Tree Search in Go is very, very sparse. It only considers the paths that you've expanded children nodes on." - Explains how MCTS efficiently navigates the massive branching factor compared to exhaustive search algorithms.
  • At 1:02:47 - "The beauty of how AlphaGo trains itself is that it actually can take this final search process... and tell the policy network, 'Hey, instead of having MCTS do all this legwork to arrive here, why don't you just predict that from the get-go?'" - Summarizes the core bootstrapping loop that continually improves the model's intuition.
  • At 1:03:59 - "What MCTS is doing is basically saying, you played this game where you eventually lost, but on every single action, I'm going to give you a strictly better action that you should take instead." - Connects MCTS to imitation learning, providing dense, step-by-step corrections rather than just sparse signals.
  • At 1:08:42 - "In current LLM RL, they treat this entire sequence as a single action... the scale of your variance is actually very bad because you only have one label out of this enormous dataset of actions." - Contrasts AlphaGo's dense supervision with the high-variance rewards currently bottlenecking Large Language Models.
  • At 1:28:20 - "Karpathy when he was on the podcast called it like sucking supervision through a straw." - Highlights the intense inefficiency of extracting meaningful learning signals from a single sequence-end reward.
  • At 1:33:02 - "You want to make sure that you're subtracting... penalizing the labels that don't help and only rewarding the ones that actually make you better." - Describes the core intuition behind advantage estimation, a crucial stabilizing force in RL.
  • At 1:35:19 - "Monte Carlo Tree Search is doing something very fundamentally different, which is it's not trying to do credit assignment on wins. It's trying to improve the label for any given action you took." - Clarifies the structural distinction between traditional RL credit assignment and simulation-based search.
  • At 1:44:59 - "This is actually why you want to do off policy training sometimes is that like you don't want to have a compounding error where if you make a mistake, you don't have the data of how to return back to your optimal distribution." - Explains the critical importance of off-policy learning for recovering from errors in dynamic environments.
  • At 2:12:03 - "RL... is even more information inefficient than you might think... you have to unroll a whole trajectory in order to get any learning signal at all." - Highlights the core computational bottleneck and economic pain of standard reinforcement learning approaches.
  • At 2:14:56 - "The entropy of this [soft target] distribution is far, far higher than the one-hot. So there's actually way more information and bits per sample in a soft label." - Explains the mathematical reason why distillation and MCTS targets train models much more effectively than binary outcomes.
  • At 2:24:43 - "Automated scientific research is one of the most exciting skills that the frontier labs are developing right now... it's important for everyone who's doing any kind of research to get a good intuition of what it can do now." - Contextualizes the current frontier of AI lab development and the broader implications for scientists and engineers.

Takeaways

  • Distill complex workflows by using intensive search or computation during training to generate data, allowing your system to execute fast, "intuitive" inference at runtime.
  • Incorporate both "value" (macro-evaluation of the end state) and "policy" (micro-evaluation of immediate actions) prediction mechanisms to efficiently prune massive decision spaces.
  • Shift your training paradigms from sparse, delayed rewards to dense, step-by-step supervision to drastically improve the learning efficiency of reinforcement learning models.
  • Utilize advantage estimation in RL pipelines to ensure you are consistently rewarding actions that outperform the expected baseline, rather than just reacting to absolute outcomes.
  • Intentionally expose AI models to off-policy data and sub-optimal states so they learn the necessary mechanics of recovering from mistakes, rather than just memorizing perfect "happy paths."
  • Maximize your "Bits per FLOP" efficiency by training models with soft targets (full probability distributions) rather than binary or one-hot labels, as they transfer significantly more relational information.
  • Use Monte Carlo Tree Search or similar simulation engines as an automated "expert teacher" to continuously generate and provide better target distributions for your base models to learn from.
  • Anticipate compounding errors in production environments by designing systems that specifically know how to funnel degraded or perturbed states back toward an optimal trajectory.
  • Avoid applying single, sequence-level RL rewards to granular, multi-step tasks, as the high variance makes accurate credit assignment nearly impossible; seek intermediate evaluation metrics instead.
  • Recognize that the true bottleneck for automated AI self-improvement is the "outer verification loop," and direct engineering efforts toward building reliable, automated testing and verification environments.
  • Leverage LLMs as a "universal grammar" or reasoning layer to bootstrap local feedback loops when attempting to build complex automated research or decision pipelines.
  • View deep learning networks not simply as pattern recognizers, but as highly efficient mechanisms for caching, distilling, and amortizing complex algorithmic search processes.