# Reinforcement Learning is Direct Adaptive Optimal Control


Stanford_cs229-notes12_Andrew_Ng Reinforcement Learning and Control

How should **Reinforcement learning** be viewed from a **control** systems perspective?

Control problems can be divided into two classes:

- **regulation** and **tracking** problems, in which the objective is to follow a **reference trajectory**;
- **optimal control** problems, in which the objective is to **extremize a functional** of the controlled system's behavior that is not necessarily defined in terms of a reference trajectory.

**Tracking** vs **Optimization**

In some problems, the control objective is defined in terms of a reference level or **reference trajectory** that the controlled system's output should match or **track as closely as possible**. **Stability** is the **key issue** in these regulation and **tracking problems**.

In other problems, the control objective is to **extremize a functional** of the controlled system’s behavior that is **not necessarily** defined in terms of a reference level or **trajectory**. The key issue in the latter problems is **constrained optimization**; here optimal-control methods based on the calculus of variations and dynamic programming have been extensively studied.

When a detailed and accurate model of the system to be controlled is not available, adaptive control methods can be applied. The overwhelming majority of **adaptive control** methods address regulation and **tracking** problems. However, **adaptive methods** for **optimal control** problems would be widely applicable if methods could be developed that were computationally feasible and that could be applied robustly to nonlinear systems.

For example, **trajectory planning** is a key and difficult problem in robot navigation tasks, as it is in other robot control tasks. To design a **robot** capable of **walking bipedally**, one may not be able to **specify a desired trajectory for the limbs** a priori, but one can specify the objective of moving forward, **maintaining equilibrium**, not damaging the robot, etc.

For both **tracking** and **optimal** control, it is usual to distinguish between **indirect** and **direct** adaptive control methods. An **indirect** method relies on a system identification procedure to form an explicit **model** of the controlled system and then determines the **control rule** from the **model**. **Direct** methods determine the **control rule without forming such a system model**.

| Control problems | Tracking problems | Optimal control problems |
| --- | --- | --- |
| Reference trajectory | Yes | Not necessary |
| Functional | No | Extremize a functional of the controlled system's behavior |
| Adaptive control methods | Indirect (explicit system model) / Direct | Direct (no system model, e.g. Q-learning) / Indirect |

These notes present **reinforcement learning** methods as a **direct** approach to **adaptive optimal control**.

**Reinforcement Learning**

Reinforcement learning is based on the common-sense idea that if an action is followed by a satisfactory state of affairs, or by an improvement in the state of affairs (as determined in some clearly defined way), then the **tendency** to produce that action is strengthened, i.e., **reinforced**. This idea plays a fundamental role in theories of animal learning, in parameter-perturbation adaptive-control methods, and in the theory of learning automata and bandit problems. Extending this idea to allow action selections to depend on state information introduces aspects of feedback control, pattern recognition, and associative learning. Further, it is possible to extend the idea of being "followed by a satisfactory state of affairs" to include the long-term consequences of actions.
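The "reinforced tendency" idea from the bandit setting can be sketched as a simple action-value learner: actions followed by better-than-average reward have their estimated value, and hence their selection tendency, strengthened. This is an illustrative sketch; the incremental-averaging rule, epsilon-greedy selection, and Gaussian rewards are my assumptions, not from the text.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: estimate each action's value and mostly
    pick the action whose tendency has been most reinforced."""
    rng = random.Random(seed)
    n = len(true_means)
    value = [0.0] * n   # estimated value (reinforced tendency) per action
    count = [0] * n
    for _ in range(steps):
        # explore occasionally, otherwise exploit the most reinforced action
        if rng.random() < epsilon:
            a = rng.randrange(n)
        else:
            a = max(range(n), key=lambda i: value[i])
        reward = rng.gauss(true_means[a], 1.0)
        count[a] += 1
        # incremental average: a satisfactory outcome strengthens value[a]
        value[a] += (reward - value[a]) / count[a]
    return value, count

values, counts = run_bandit([0.2, 0.8, 0.5])
# the action with the highest true mean should end up chosen most often
```

Note that this learner has no state input; letting the selection depend on an observed state is exactly the extension toward feedback control that the paragraph above describes.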

—————————

Conserved and Nonconserved Quantities

A conserved quantity cannot be created or destroyed and therefore has no sources or sinks; for conserved quantities such as atomic species $i$ or internal energy $U$,

$$\frac{\partial c_i}{\partial t} = -\nabla \cdot \vec{J}_i, \qquad \frac{\partial u}{\partial t} = -\nabla \cdot \vec{J}_u$$

For nonconserved quantities such as entropy, $S$,

$$\frac{\partial s}{\partial t} = -\nabla \cdot \vec{J}_S + \dot{\sigma}$$

where $\dot{\sigma}$ is the rate of entropy production per unit volume.

**Flux**: a flux of $i$, $\vec{J}_i$, describes the rate at which $i$ flows through a unit area fixed with respect to a specified coordinate system.

**Gradient**: $\nabla \phi$, the direction and rate of steepest spatial change of a scalar field $\phi$.

**Divergence**: $\nabla \cdot \vec{J}$, the net outflow of $\vec{J}$ per unit volume.

—————————

Studies of reinforcement-learning neural networks in nonlinear control problems have generally focused on one of two main types of algorithm: actor–critic learning or Q-learning; both are direct adaptive optimal control algorithms. An actor–critic learning system contains two distinct subsystems: a critic that estimates the long-term utility (value) of each state, and an actor that learns to choose the optimal action in each state (the policy). A Q-learning system maintains estimates of utilities for all state–action pairs and uses these estimates to select actions.
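The two-subsystem actor–critic structure can be sketched on a toy two-state chain, with the critic learning state utilities from temporal-difference errors and the actor adjusting action preferences by the same signal. The environment, softmax action selection, and learning rates here are illustrative assumptions, not from the text.

```python
import math
import random

rng = random.Random(0)
n_states, n_actions, gamma = 2, 2, 0.9
V = [0.0] * n_states                                  # critic: utility per state
pref = [[0.0] * n_actions for _ in range(n_states)]   # actor: action preferences

def softmax_choice(prefs):
    """Sample an action with probability proportional to exp(preference)."""
    z = [math.exp(p - max(prefs)) for p in prefs]
    r = rng.random() * sum(z)
    acc = 0.0
    for a, w in enumerate(z):
        acc += w
        if r < acc:
            return a
    return len(z) - 1

def step(x, a):
    """Toy dynamics: action 1 moves to state 1, which pays reward 1."""
    nx = 1 if a == 1 else 0
    return nx, (1.0 if nx == 1 else 0.0)

x = 0
for _ in range(2000):
    a = softmax_choice(pref[x])
    nx, r = step(x, a)
    td_error = r + gamma * V[nx] - V[x]   # critic's evaluation of the transition
    V[x] += 0.1 * td_error                # critic update
    pref[x][a] += 0.1 * td_error          # actor: reinforce actions that beat expectation
    x = nx
```

The key design point is the division of labor: the critic's TD error serves simultaneously as its own learning signal and as the reinforcement signal for the actor.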

**Q-Learning**

The name "Q-learning" comes purely from Watkins' notation.

Watkins-Dayan1992_Article_Q-learning

The objective is to find a control rule that maximizes, at each time step, the expected discounted sum of future reward. That is, at any time step $k$, the control rule should specify action $a_k$ **so as to maximize**

$$E\left[\sum_{i=0}^{\infty} \gamma^{i} r_{k+i}\right]$$

where $\gamma$, $0 \le \gamma < 1$, is a discount factor.
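For concreteness, the discounted sum of future reward for a finite reward sequence can be computed by folding from the end, a minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum over i >= 0 of gamma**i * rewards[i], i.e. the
    discounted return from the first step of a finite reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total   # G_k = r_k + gamma * G_{k+1}
    return total

# e.g. rewards of 1 at each of three steps with gamma = 0.5:
# 1 + 0.5*1 + 0.25*1 = 1.75
assert discounted_return([1.0, 1.0, 1.0], gamma=0.5) == 1.75
```

The backward fold mirrors the recursive form $G_k = r_k + \gamma G_{k+1}$, which is also the shape of the Q functional equation below.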

Reinforcement learning methods such as Q-learning do not estimate a system model. The basic idea in Q-learning is to estimate a *real-valued function* $Q$ of states and actions, where $Q(x, a)$ is the expected discounted sum of future reward for performing action $a$ in state $x$ and performing optimally thereafter. This function satisfies the following recursive relationship (the "**Q functional equation**"):

$$Q(x, a) = E\left[\, r_k + \gamma \max_{a'} Q(x_{k+1}, a') \,\middle|\, x_k = x,\ a_k = a \right]$$

The Q-learning procedure maintains an *estimate* $\hat{Q}$ of the function $Q$. At each transition from step $k$ to $k+1$, the learning system can observe $x_k$, $a_k$, $r_k$, and $x_{k+1}$. Based on these observations, $\hat{Q}$ is updated at time step $k+1$ as follows: $\hat{Q}$ remains unchanged for all pairs $(x, a) \neq (x_k, a_k)$, and

$$\hat{Q}(x_k, a_k) \leftarrow \hat{Q}(x_k, a_k) + \alpha_k \left[ r_k + \gamma \max_{a} \hat{Q}(x_{k+1}, a) - \hat{Q}(x_k, a_k) \right] \tag{1}$$

where $\alpha_k$ is a gain sequence (learning rate) such that

$$\sum_{k=0}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=0}^{\infty} \alpha_k^2 < \infty.$$

Watkins has shown that $\hat{Q}$ converges to $Q$ with probability one if all actions continue to be tried from all states.

Because it **does not** rely on an explicit model of the Markov process, Q-learning is a **direct** adaptive optimal control method.

The Q function combines information about state transitions and estimates of future reward **without** relying on explicit estimates of state transition probabilities.
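A minimal tabular implementation of update (1) might look like the following. The grid-world-style chain environment, constant learning rate, and epsilon-greedy exploration are illustrative assumptions; the text only requires that all actions continue to be tried from all states.

```python
import random

def q_learning(episodes=500, gamma=0.9, alpha=0.1, eps=0.2, seed=0):
    """Tabular Q-learning on a 4-state chain: states 0..3, actions
    0 = left, 1 = right; reaching state 3 pays reward 1 and ends the episode."""
    rng = random.Random(seed)
    Q = {(x, a): 0.0 for x in range(4) for a in (0, 1)}
    for _ in range(episodes):
        x = 0
        while x != 3:
            # keep trying all actions from all states (epsilon-greedy)
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda b: Q[(x, b)])
            nx = min(x + 1, 3) if a == 1 else max(x - 1, 0)
            r = 1.0 if nx == 3 else 0.0
            # update (1): move Q(x_k, a_k) toward r_k + gamma * max_a Q(x_{k+1}, a)
            target = r + gamma * max(Q[(nx, b)] for b in (0, 1))
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = nx
    return Q

Q = q_learning()
# after training, "right" should have the higher estimate in every
# non-terminal state, with no transition model ever being estimated
```

Note that nothing in the loop estimates transition probabilities; the sampled next state $x_{k+1}$ enters only through the bootstrapped target, which is exactly what makes the method direct.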

**Representing Q Function**

Like conventional DP methods, the Q-learning method given by (1) requires memory and overall computation proportional to the number of **state**–**action pairs**. In large problems, or in problems with continuous **state and action spaces** that must be quantized, these methods become **extremely complex** (**Bellman's "curse of dimensionality"**). One approach to reducing the severity of this problem is to represent $\hat{Q}$ not as a look-up table, but as a parameterized structure such as a low-order polynomial, k-d tree, decision tree, or neural network. In general, the local update rule for $\hat{Q}$ given by (1) can be adapted for use with any method for adjusting the parameters of a function representation via *supervised learning*. One can define a general way of moving from a unit of experience ($x_k$, $a_k$, $r_k$, and $x_{k+1}$, as in (1)) to a training example for $\hat{Q}$:

$$\text{input: } (x_k, a_k), \qquad \text{target: } r_k + \gamma \max_{a} \hat{Q}(x_{k+1}, a)$$

This training example can then be input to any supervised learning method, such as a parameter estimation procedure based on stochastic approximation. The choice of learning method will have a strong effect on generalization, the speed of learning, and the quality of the final result. This approach has been used successfully with supervised learning methods based on error backpropagation, CMACs, and nearest-neighbor methods.
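The experience-to-training-example recipe can be sketched with the simplest parameterized structure, a linear function of one-hot state–action features trained by stochastic gradient descent. The chain environment, features, and purely random exploratory policy are illustrative assumptions; with one-hot features this reduces to the tabular case, but the code shows the supervised-learning plumbing explicitly.

```python
import random

N_STATES, ACTIONS, GAMMA = 4, (0, 1), 0.9
w = [0.0] * (N_STATES * len(ACTIONS))   # parameters of the Q approximator

def features(x, a):
    """One-hot feature vector for the state-action pair (x, a)."""
    f = [0.0] * len(w)
    f[x * len(ACTIONS) + a] = 1.0
    return f

def q_hat(x, a):
    """Parameterized estimate of Q(x, a): a linear function of the features."""
    return sum(wi * fi for wi, fi in zip(w, features(x, a)))

rng = random.Random(0)
x = 0
for _ in range(5000):
    a = rng.choice(ACTIONS)             # all actions keep being tried
    nx = min(x + 1, N_STATES - 1) if a == 1 else max(x - 1, 0)
    r = 1.0 if nx == N_STATES - 1 else 0.0
    # training example from one unit of experience:
    #   input (x_k, a_k), target r_k + gamma * max_a q_hat(x_{k+1}, a)
    target = r + GAMMA * max(q_hat(nx, b) for b in ACTIONS)
    err = target - q_hat(x, a)
    for i, fi in enumerate(features(x, a)):
        w[i] += 0.1 * err * fi          # SGD step on the squared error
    x = 0 if nx == N_STATES - 1 else nx
```

Swapping `q_hat` and the gradient step for a neural network, CMAC, or nearest-neighbor learner leaves the surrounding loop unchanged, which is the generality the paragraph above is pointing at.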

**Reference about Optimal Control**

- Q-learning and Pontryagin’s Minimum Principle
- Lecture10_Pontryagin’s Minimum Principle
- Reinforcement Learning and Optimal Control Methods for Uncertain Nonlinear Systems
- RL_and_Optimal_Control_A_Selective_Overview_Slide
- Q-learning_for_Optimal_Control_of_Continuous-time_systems
- Slides_New Developments in Integral Reinforcement Learning Continuous-time Optimal Control and Games
- On_Stochasitc_Optimal_Control_and_Reinforcement_Learning_by_Approximate_Inference