The print version of the book is available from the publishing company Athena Scientific, or from Amazon.com. The book is also available as an Ebook from Google Books.

Click here for class notes based on this book.

Click here for preface and table of contents.

The purpose of the book is to consider large and challenging multistage decision problems, which can be solved in principle by dynamic programming and optimal control, but their exact solution is computationally intractable. We discuss solution methods that rely on approximations to produce suboptimal policies with adequate performance. These methods are collectively referred to as reinforcement learning, and also by alternative names such as approximate dynamic programming, and neuro-dynamic programming.

Our subject has benefited enormously from the interplay of ideas from optimal control and from artificial intelligence. One of the aims of this monograph is to explore the common boundary between these two fields and to form a bridge that is accessible by workers with background in either field.

The mathematical style of the book is somewhat different from the author's dynamic programming books, and the neuro-dynamic programming monograph, written jointly with John Tsitsiklis. We rely more on intuitive explanations and less on proof-based insights. Still we provide a rigorous short account of the theory of finite and infinite horizon dynamic programming, and some basic approximation methods, in an appendix. For this we require a modest mathematical background: calculus, elementary probability, and a minimal use of matrix-vector algebra.

The methods of this book have been successful in practice, and often spectacularly so, as evidenced by recent amazing accomplishments in the games of chess and Go. However, across a wide range of problems, their performance properties may be less than solid. This is a reflection of the state of the art in the field: there are no methods that are guaranteed to work for all or even most problems, but there are enough methods to try on a given challenging problem with a reasonable chance that one or more of them will be successful in the end. Accordingly, we have aimed to present a broad range of methods that are based on sound principles, and to provide intuition into their properties, even when these properties do not include a solid performance guarantee. Hopefully, with enough exploration with some of these methods and their variations, the reader will be able to address adequately his/her own problem.

Notes, videolectures, slides, and other material for the current course in Reinforcement Learning and Optimal Control (January 13-April 16, 2021), at Arizona State University:

Class Notes on Reinforcement Learning (extended version of Chapter 1 of the author's Reinforcement Learning Books)

Video-Lecture 1, Video-Lecture 2, Video-Lecture 3, Video-Lecture 4, Video-Lecture 5, Video-Lecture 6, Video-Lecture 7, Video-Lecture 8, Video-Lecture 9, Video-Lecture 10, Video-Lecture 11, Video-Lecture 12, Video-Lecture 13

Slides-Lecture 1, Slides-Lecture 2, Slides-Lecture 3, Slides-Lecture 4, Slides-Lecture 5, Slides-Lecture 6, Slides-Lecture 7, Slides-Lecture 8, Slides-Lecture 9, Slides-Lecture 10, Slides-Lecture 11, Slides-Lecture 12, Slides-Lecture 13

The final Lecture 13 is an overview of the entire course.

**"Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control,"** draft of a forthcoming (2022) research monograph; an short early version (August 2021) went on line as arXiv preprint arXiv:2108.10315.

This draft will be periodically updated until shortly before its formal publication. Check this site for the latest version.

** Lessons from AlphaZero Videolecture**: A summary presentation at KTH, Nov. 2021

**A brief description:** Some of the most exciting success stories in artificial intelligence and reinforcement learning have been in the area of games. Primary examples are the recent AlphaZero program (which plays chess), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon).
These programs were trained off-line extensively using sophisticated self-play/approximate policy iteration algorithms and neural networks. Yet the AlphaZero player that has been obtained off-line is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained off-line. The on-line player performs a form of policy improvement, which unlike the off-line player, is not degraded by neural network approximations. As a result, it has greatly improved performance.

Similarly, TD-Gammon performs on-line a policy improvement step using lookahead minimization that is not degraded by neural network approximations. To this end it uses an off-line neural-network trained terminal position evaluator, and importantly it also extends its on-line lookahead by rollout (simulation with the one-step lookahead player that is based on the position evaluator).

An important lesson from AlphaZero and TD-Gammon is that performance of an off-line trained controller can be greatly improved by on-line play, with long lookahead (involving minimization or rollout with an off-line obtained policy, or both), and terminal cost approximation that is obtained off-line. This performance enhancement is often dramatic and is due to a simple fact, which is the central point of our work: on-line play amounts to a step of Newton's method for solving Bellman's equation, while the starting point for the Newton step is based on the results of off-line training and may be enhanced by longer lookahead and on-line rollout. This process can be understood in terms of abstract models of dynamic programming and simple geometrical constructions. It manifests itself to some extent in model predictive control, but it seems that it has yet to be fully appreciated within the decision and control community.

In this work we aim to provide insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. While we will deemphasize mathematical proofs, there is considerable related analysis, which supports our conclusions and can be found in the author's recent RL books (2019, 2020). One of our principal aims is to show through the unifying principles of abstract DP that the AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.

The print version of the book is available from the publishing company Athena Scientific, and from Amazon.com. The book is also available as an Ebook from Google Books.

Click here for class notes based on this book.

This is a research monograph at the forefront of research on reinforcement learning, also referred to by other names such as approximate dynamic programming and neuro-dynamic programming. The purpose of the monograph is to develop in greater depth some of the methods from the author's recently published textbook on Reinforcement Learning (Athena Scientific, 2019). In particular, we present new research, relating to systems involving multiple agents, partitioned architectures, and distributed asynchronous computation. We pay special attention to the contexts of dynamic programming/policy iteration and control theory/model predictive control. We also discuss in some detail the application of the methodology to challenging discrete/combinatorial optimization problems, such as routing, scheduling, assignment, and mixed integer programming, including the use of neural network approximations within these contexts.

The book focuses on the fundamental idea of policy iteration, i.e., start from some policy, and successively generate one or more improved policies. If just one improved policy is generated, this is called rollout, which, based on broad and consistent computational experience, appears to be one of the most versatile and reliable of all reinforcement learning methods. Among others, it can be applied on-line using easily implementable simulation, and it can be used for discrete deterministic combinatorial optimization, as well as for stochastic Markov decision problems. Moreover, rollout can make on-line use of the policy produced off-line by policy iteration or by any other method (including a policy gradient method), and improve on the performance of that policy,

Much of the new research is inspired by the remarkable AlphaZero chess program, where policy iteration, value and policy networks, approximate lookahead minimization, and parallel computation all play an important role. In addition to the fundamental process of successive policy iteration/improvement, this program includes the use of deep neural networks for representation of both value functions and policies, the extensive use of large scale parallelization, and the simplification of lookahead minimization, through methods involving Monte Carlo tree search and pruning of the lookahead tree. In this book, we also focus on policy iteration, value and policy neural network representations, multiagent systems, parallel and distributed computation, and lookahead simplification. Thus while there are significant differences, the principal design ideas that form the core of this monograph are shared by the AlphaZero architecture, except that we develop these ideas within a broader and less application-specific framework.

Videolectures and slides for an intensive course in Reinforcement Learning and Optimal Control (January 8-February 21, 2019), at Arizona State University:

Slides-Lecture 1, Slides-Lecture 2, Slides-Lecture 3, Slides-Lecture 4, Slides-Lecture 5, Slides-Lecture 6, Slides-Lecture 7, Slides-Lecture 8, Slides-Lecture 9, Slides-Lecture 10, Slides-Lecture 11, Slides-Lecture 12, Slides-Lecture 13.

Videos of lectures from Reinforcement Learning and Optimal Control course at Arizona State University: (Click around the screen to see just the video, or just the slides, or both simultaneously).

Video-Lecture 1, Video-Lecture 2, Video-Lecture 3,Video-Lecture 4, Video-Lecture 5, Video-Lecture 6, Video-Lecture 7, Video-Lecture 8, Video-Lecture 9, Video-Lecture 10, Video-Lecture 11, Video-Lecture 12, Video-Lecture 13.

The final Lecture 13 is an overview of the entire course.

Video of an Overview Lecture on Distributed RL from IPAM workshop at UCLA, Feb. 2020 (Slides).

Video of an Overview Lecture on Multiagent RL from a lecture at ASU, Oct. 2020 (Slides).

Slides for an extended overview lecture on RL: Ten Key Ideas for Reinforcement Learning and Optimal Control.

**The following papers and reports have a strong connection to material in the reinforcement learning book, and amplify on its analysis and its range of applications.**

The fourth edition (February 2017) contains a substantial amount of new material, particularly on approximate DP in Chapter 6. This chapter was thoroughly reorganized and rewritten, to bring it in line, both with the contents of Vol. II, whose latest edition appeared in 2012, and with recent developments, which have propelled approximate DP to the forefront of attention.

Some of the highlights of the revision of Chapter 6 are an increased emphasis on one-step and multistep lookahead methods, parametric approximation architectures, neural networks, rollout, and Monte Carlo tree search. Among other applications, these methods have been instrumental in the recent spectacular success of computer Go programs. The material on approximate DP also provides an introduction and some perspective for the more analytically oriented treatment of Vol. II.

Click here for direct ordering from the publisher and preface, table of contents, supplementary educational material, lecture slides, videos, etc

Dynamic Programming and Optimal Control, Vol. I, ISBN-13: 978-1-886529-43-4, 576 pp., hardcover, 2017

The fourth edition of Vol. II of the two-volume DP textbook was published in June 2012. This is a major revision of Vol. II and contains a substantial amount of new material, as well as a reorganization of old material. The length has increased by more than 60% from the third edition, and most of the old material has been restructured and/or revised. Volume II now numbers more than 700 pages and is larger in size than Vol. I. It can arguably be viewed as a new book!

Approximate DP has become the central focal point of this volume, and occupies more than half of the book (the last two chapters, and large parts of Chapters 1-3). Thus one may also view this new edition as a followup of the author's 1996 book "Neuro-Dynamic Programming" (coauthored with John Tsitsiklis). A lot of new material, the outgrowth of research conducted in the six years since the previous edition, has been included.

A new printing of the fourth edition (January 2018) contains some updated material, particularly on undiscounted problems in Chapter 4, and approximate DP in Chapter 6. References were also made to the contents of the 2017 edition of Vol. I, and to high profile developments in deep reinforcement learning, which have brought approximate DP to the forefront of attention.

Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, ISBN-13: 978-1-886529-44-1, 712 pp., hardcover, 2012

Click here for an updated version of Chapter 4, which incorporates recent research on a variety of undiscounted problem topics, including

Click here for preface and detailed information.

Click here to order at Amazon.com

Lectures on Exact and Approximate Finite Horizon DP: Videos from a 4-lecture, 4-hour short course at the University of Cyprus on finite horizon DP, Nicosia, 2017. Videos from Youtube. (Lecture Slides: Lecture 1, Lecture 2, Lecture 3, Lecture 4.)

Videos from a 6-lecture, 12-hour short course at Tsinghua Univ., Beijing, China, 2014. From the Tsinghua course site, and from Youtube. Click here to download Approximate Dynamic Programming Lecture slides, for this 12-hour video course.

Click here to download lecture slides for a 7-lecture short course on Approximate Dynamic Programming, Caradache, France, 2012.

Click here to download lecture slides for the MIT course "Dynamic Programming and Stochastic Control (6.231), Dec. 2015. The last six lectures cover a lot of the approximate dynamic programming material.

Click here to download research papers and other material on Dynamic Programming and Approximate Dynamic Programming.

The 2nd edition of the research monograph "Abstract Dynamic Programming," is available in hardcover from the publishing company, Athena Scientific, or from Amazon.com.

The 2nd edition aims primarily to amplify the presentation of the semicontractive models of Chapter 3 and Chapter 4 of the first (2013) edition, and to supplement it with a broad spectrum of research results that I obtained and published in journals and reports since the first edition was written (see below). As a result, the size of this material more than doubled, and the size of the book increased by nearly 40%.

The book can be downloaded and used freely for noncommercial purposes. The version below corrects a few errata from the book's first printing, and is identical to the book's second printing (to appear in 2021).

Abstract Dynamic Programming, 2ND EDITION, Complete

In addition to the changes in Chapters 3, and 4, I have also eliminated from the second edition the material of the first edition that deals with restricted policies and Borel space models (Chapter 5 and Appendix C). These models are motivated in part by the complex measurability questions that arise in mathematically rigorous theories of stochastic optimal control involving continuous probability spaces. The restricted policies framework aims primarily to extend abstract DP ideas to Borel space models. Since this material is fully covered in Chapter 6 of the 1978 monograph by Bertsekas and Shreve, and followup research on the subject has been limited, I decided to omit Chapter 5 and Appendix C of the first edition from the second edition and just post them below.

**The following papers and reports have a strong connection to the book, and amplify on the analysis and the range of applications of the semicontractive models of Chapters 3 and 4:**

A series of 5 Videolectures on Abstract Dynamic Programming and corresponding slides posted at Youtube.

Visits since April 9, 2019