A Course in Reinforcement Learning (2nd Edition) - Dimitri P. Bertsekas

A deep dive into the math and meaning behind RL algorithms

machine learning
A Course in Reinforcement Learning (2nd Edition) by Dimitri P. Bertsekas presents a mathematically grounded yet accessible framework for understanding and applying reinforcement learning (RL) to sequential decision-making problems. Centered on the duality between offline training and online play, and inspired by systems like AlphaZero and TD-Gammon, the book explores a unified methodology grounded in approximate dynamic programming. With emphasis on approximation in value space, the text bridges concepts from control theory, operations research, and artificial intelligence. Suitable for graduate-level courses or self-study, it offers a modular structure, visual intuitions, and real-world insight into key methods such as rollout algorithms, policy iteration, and deep learning-based value approximation. This edition integrates contemporary advances and expands discussions on topics like multi-agent systems, adaptive control, and model predictive control, while remaining focused on rigorous yet intuitive explanations.
Author

Antonio Montano

Published

April 19, 2025

Modified

April 19, 2025

Review

Dimitri P. Bertsekas’ A Course in Reinforcement Learning (2nd Edition) is a deeply insightful and uniquely structured textbook that bridges the gap between classical optimization and the modern field of reinforcement learning (RL). Unlike most RL books that emerge from the machine learning community and focus heavily on empirical performance and implementation, Bertsekas approaches the subject from the lens of dynamic programming (DP), control theory, and operations research, offering a mathematically grounded, algorithmically rigorous, and conceptually unified perspective.

Conceptual focus: value-centric reinforcement learning

The core philosophy of the book revolves around approximation in value space—a concept that plays a foundational role in dynamic programming and is extended here to RL settings. This stands in contrast to the increasingly popular policy-gradient methods that dominate the deep RL literature. Rather than focusing solely on training end-to-end policies with neural networks, Bertsekas places heavy emphasis on computing or approximating value functions and using them to derive or improve policies.
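
To make this concrete, here is a minimal sketch of one-step lookahead in value space (my own illustration, not code from the book), assuming a deterministic model and generic callables for the dynamics, stage cost, and cost-to-go approximation: the approximate cost-to-go scores each candidate control, and the control with the smallest immediate-plus-future cost is applied.

```python
def one_step_lookahead(state, controls, step, stage_cost, J_tilde):
    """Pick the control minimizing stage cost plus approximate cost-to-go.

    step(x, u)       -> next state (deterministic model, for simplicity)
    stage_cost(x, u) -> immediate cost g(x, u)
    J_tilde(x)       -> approximate cost-to-go of state x
    """
    best_u, best_q = None, float("inf")
    for u in controls:
        q = stage_cost(state, u) + J_tilde(step(state, u))  # approximate Q-factor of (state, u)
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```

Better approximations make this greedy choice closer to optimal, which is why so much of the book is about how the cost-to-go approximation is constructed, trained, or improved.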

This is a natural continuation of the ideas developed in his seminal works on Dynamic Programming and Optimal Control. It also reflects the underlying architecture of high-performing RL systems such as AlphaZero and TD-Gammon, both of which use trained evaluators in conjunction with powerful online search or rollout techniques. In fact, one of the book’s major contributions is to formalize and explain these empirical successes within a Newton-like optimization framework for solving Bellman’s equation.
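
Stated informally and in standard notation (not quoted from the book), the framework rests on Bellman's equation for a discounted problem: the optimal cost function is the fixed point of the Bellman operator, and approximation in value space means acting greedily with respect to an approximate cost function.

```latex
% Bellman operator for a discounted problem with dynamics f, stage cost g, discount \alpha
(TJ)(x) = \min_{u \in U(x)} \mathbb{E}\bigl[\, g(x,u,w) + \alpha\, J\bigl(f(x,u,w)\bigr) \,\bigr]

% Bellman's equation: the optimal cost function J* is the fixed point of T
J^* = T J^*

% Approximation in value space: the one-step lookahead policy induced by \tilde{J}
\tilde{\mu}(x) \in \arg\min_{u \in U(x)} \mathbb{E}\bigl[\, g(x,u,w) + \alpha\, \tilde{J}\bigl(f(x,u,w)\bigr) \,\bigr]
```

In Bertsekas' geometric picture, this lookahead step acts like a Newton iteration for solving J = TJ, which is why a single policy improvement on top of a reasonable approximation can produce a large jump in performance.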

Structure and content overview

The book is divided into three major chapters, designed to support both linear reading and modular, instructor-tailored course planning:

Chapter 1: exact and approximate dynamic programming

This foundational chapter provides a broad overview of the RL landscape, rooted in dynamic programming. It begins with deterministic and stochastic finite-horizon problems, moves on to approximation methods, and presents the conceptual framework of offline training + online play, drawing detailed analogies with AlphaZero and model predictive control (MPC). It also introduces the reader to concepts like rollout, policy iteration, cost-to-go approximations, and terminal cost evaluation. The chapter is comprehensive and forms a standalone platform for readers unfamiliar with RL but versed in optimization.

Chapter 2: approximation in value space – rollout algorithms

Here, Bertsekas deepens the discussion of value function-based methods. He details variants of rollout algorithms—methods that combine a base heuristic with lookahead or simulation to improve decisions in real time. Topics include constrained optimization, Monte Carlo Tree Search (MCTS), randomized rollout, Bayesian optimization, and their application to deterministic and stochastic control, POMDPs, and even adversarial games. This chapter shows the versatility of the rollout paradigm across discrete and continuous domains.
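
As a rough sketch of the rollout idea (again my own illustration under assumed callables, not the book's code): each candidate control is scored by simulating the remainder of the trajectory with the base heuristic, and the best-scoring control is applied.

```python
def rollout_control(state, controls, step, stage_cost, base_policy, horizon):
    """One-step rollout: score each candidate control by simulating the base policy afterwards."""
    def simulate(x, first_u):
        total, u = 0.0, first_u
        for _ in range(horizon):
            total += stage_cost(x, u)   # accumulate the cost of applying u at x
            x = step(x, u)              # advance the (assumed deterministic) model
            u = base_policy(x)          # the base heuristic chooses all subsequent controls
        return total

    # apply the control whose simulated cost is smallest
    return min(controls, key=lambda u: simulate(state, u))
```

In stochastic problems one would average several Monte Carlo simulations per control rather than run a single deterministic trajectory, and a terminal cost approximation would typically truncate the simulation horizon.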

Chapter 3: learning values and policies

This is where the book engages with neural networks and other parametric function approximators. It examines how value functions and policies can be learned from data using fitted value iteration, Q-learning, SARSA, and policy gradient methods. While coverage of policy optimization is relatively lean compared to texts like Sutton & Barto's Reinforcement Learning: An Introduction, it is sufficient to understand the core ideas and their integration into the Newton-based value approximation framework. The focus remains on conceptual clarity and practical convergence issues.
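
To give a feel for the kind of algorithm covered here, below is a generic tabular Q-learning loop (a textbook-style sketch that assumes a Gymnasium-like environment interface; it is not code from the book).

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning for an environment with discrete states and actions."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated Q-factor

    def greedy(state):
        return max(range(env.action_space.n), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            action = env.action_space.sample() if random.random() < epsilon else greedy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # temporal-difference update toward the one-step lookahead target
            best_next = 0.0 if terminated else max(Q[(next_state, a)] for a in range(env.action_space.n))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

Note that this uses the reward-maximization convention common in the machine learning literature; Bertsekas works with costs and minimization, but the temporal-difference update has the same structure.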

Each chapter ends with detailed notes, sources, and exercises, often pointing to supplementary material in Bertsekas’ other books, such as Rollout, Policy Iteration, and Distributed Reinforcement Learning (2020) or Lessons from AlphaZero (2022). These references make the book an excellent launchpad for research or deeper specialization.

Highlights and unique strengths

What sets A Course in Reinforcement Learning apart is its rare synthesis of intellectual depth, conceptual clarity, and practical relevance. Bertsekas leverages decades of foundational work in optimization theory and dynamic programming, crafting a perspective on RL that feels both timeless and sharply attuned to contemporary developments like AlphaZero. Rather than presenting RL as a bag of tricks or an empirical race to outperform benchmarks, the book offers a structured, principled framework for thinking about sequential decision-making under uncertainty.

A central strength lies in the elegant articulation of the synergy between offline learning and online planning. Instead of treating training as a static preprocessing step, Bertsekas shows how real-time control and model predictive strategies can interact meaningfully with learned approximations—a conceptual bridge that is both underexplored and urgently needed in modern RL discourse. This also gives the book a unique relevance to high-stakes engineering domains where stability, safety, and interpretability matter.

Another distinguishing trait is the author’s ability to speak fluently across disciplinary boundaries. Readers from control theory, operations research, artificial intelligence, and applied mathematics will all find familiar anchors—but also be challenged to expand their mental models. The book avoids dense formalism without sacrificing rigor, preferring geometric insights, intuitive visualizations, and algorithmic thinking to heavy abstraction. This makes it an accessible yet intellectually satisfying read for those seeking more than surface-level understanding.

Moreover, the text is modular and customizable, making it ideal for various course structures—from short introductory classes to advanced research seminars on RL theory. Each chapter is self-contained yet richly interlinked, allowing instructors or self-learners to navigate the material according to their background and goals.

Enduring relevance in a fast-evolving field

As machine learning matures beyond supervised pattern recognition, the methods described in A Course in Reinforcement Learning are becoming increasingly vital. Bertsekas' emphasis on approximation in value space, policy iteration, and offline-online synergy aligns directly with the architecture of some of the most successful and well-known AI systems to date—AlphaZero, MuZero, and planning-augmented extensions of large language models such as ChatGPT. For example, MuZero (Schrittwieser et al., 2020) generalizes AlphaZero by learning its own model of the dynamics, yet it retains the value-based planning loop that Bertsekas formalizes through Newton-like updates on Bellman's equation. Similarly, the resurgence of model-based reinforcement learning (e.g., DreamerV3, EfficientZero) is built on the same principle: learning value estimates offline and refining them online via planning—precisely the synergy this book explores in depth.

In addition, current research on LLMs with planning capabilities (e.g., ReAct, Tree of Thoughts, and OpenAI’s work on tool-use agents) echoes the structure of lookahead planning guided by learned evaluators, another core theme in the book. These methods increasingly blend rollout-based reasoning, value estimation, and policy improvement—even if not framed as reinforcement learning per se. Likewise, in robotics and safety-critical applications, algorithms must generalize reliably, adapt to perturbations, and provide interpretable decision-making—objectives far better served by Bertsekas’ structured, optimization-rooted methods than by end-to-end neural policy training alone.

Thus, what this book presents is not a retrospective—it is a forward-looking foundation. As machine learning turns toward long-horizon reasoning, planning under uncertainty, and adaptive control, the concepts in Bertsekas’ work are proving to be not only relevant, but essential.

Who is this book for?

This textbook is not for everyone—and that’s a strength, not a weakness.

  • Ideal readers include:
    • Graduate students in electrical engineering, applied mathematics, operations research, or control systems.
    • Researchers and professionals interested in optimization, model predictive control, robotics, or multiagent systems.
    • Advanced undergraduates with a solid mathematical background and interest in decision-making under uncertainty.
    • Practitioners who have experience with algorithms and modeling, and want to understand the “why” behind reinforcement learning, not just the “how”.
  • Less ideal for:
    • Beginners looking for an introductory, code-heavy book. For this, Reinforcement Learning: An Introduction by Sutton & Barto or Deep Reinforcement Learning Hands-On by Maxim Lapan may be more approachable.
    • Readers wanting a focus on PyTorch/TensorFlow, environment setups, or software engineering best practices for RL pipelines.

Instead, Bertsekas offers something increasingly rare: a textbook that doesn’t treat reinforcement learning as a subfield of deep learning, but rather as a mathematically grounded discipline that can integrate with—but also stand apart from—modern AI trends.

Supplementary materials

The book is accompanied by video lectures, slides, and supplemental content from Bertsekas’ Arizona State University course, accessible via his website. This makes the book particularly useful for self-study, especially for learners who appreciate visual explanation and conceptual repetition across formats.

Verdict

A Course in Reinforcement Learning (2nd Edition) is a rigorous, insightful, and conceptually rich text that helps readers understand reinforcement learning as a structured decision-making framework, not just a toolbox of tricks. If you’re serious about applying RL to real-world, high-stakes systems—where stability, interpretability, and theoretical guarantees matter—this is the book you’ve been looking for.

It challenges, rewards, and broadens the reader’s view of what reinforcement learning is and what it can be. Highly recommended for those looking to build durable understanding, not just quick implementations.

Further readings

  • Bertsekas, D. P. (2022). Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. Athena Scientific. Download from author website

  • Bertsekas, D. P. (2020). Rollout, Policy Iteration, and Distributed Reinforcement Learning. Athena Scientific. More information

  • Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., … & Silver, D. (2020). Mastering Atari, Go, Chess and Shogi by planning with a learned model. Nature, 588(7839), 604–609. Read the paper

  • Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104. Read the paper

  • Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari Games with Limited Data. Advances in Neural Information Processing Systems, 34. Read the paper

  • Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv preprint arXiv:2305.10601. Read the paper

  • Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629. Read the paper

  • Aghzal, M., Plaku, E., Stein, G. J., & Yao, Z. (2025). A Survey on Large Language Models for Automated Planning. arXiv preprint arXiv:2502.12435. Read the paper

  • Tantakoun, M., Zhu, X., & Muise, C. (2025). LLMs as Planning Modelers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models. arXiv preprint arXiv:2503.18971. Read the paper

  • Li, H., Chen, Z., Zhang, J., & Liu, F. (2024). LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning. arXiv preprint arXiv:2409.01806. Read the paper

  • Qu, C., et al. (2024). Tool Learning with Large Language Models: A Survey. Frontiers of Computer Science. Read the paper

About the author

Dimitri P. Bertsekas is a world-renowned scholar in optimization and control theory. He received his Ph.D. in system science from MIT and has held faculty positions at Stanford, the University of Illinois, and MIT, where he remains McAfee Professor of Engineering. Since 2019, he has been Fulton Professor of Computational Decision Making at Arizona State University.

Over a prolific academic career, Bertsekas has authored over twenty influential books covering topics from nonlinear programming to data networks, and notably, dynamic programming and reinforcement learning. His contributions have earned him numerous accolades, including the IEEE Control Systems Award, INFORMS John von Neumann Theory Prize, and election to the U.S. National Academy of Engineering. His recent focus on RL reflects decades of foundational work in dynamic programming and optimization, making him a key figure in bridging classical control with modern machine learning.

Info

Title: A Course in Reinforcement Learning
Year: 2025 (2nd edition)
Author: Dimitri P. Bertsekas
Publisher: Athena Scientific
Language: English
Topics: Reinforcement learning, Model predictive control, Dynamic programming, Machine learning
Downloads: Book PDF
Other links: Course video lectures and materials
ISBN/DOI: 1-886529-29-9
Buy online
