Actor-Critic Algorithms



Math Analysis


Methods that learn approximations to both policy and value functions are often called actor-critic methods, where ‘actor’ is a reference to the learned policy, and ‘critic’ refers to the learned value function, usually a state-value function.

This paper proposes and analyzes a class of actor-critic algorithms for simulation-based optimization of a Markov decision process (MDP) over a parameterized family of randomized stationary policies.

Critic : TD learning with a linearly parameterized approximation architecture; it learns the value function.

Actor : uses the information supplied by the critic to update its parameters in an approximate gradient direction; it represents the decision making, i.e., the policy.

Most reinforcement learning and neuro-dynamic programming methods fall into one of the following two categories:

(a) Actor-only methods work with a parameterized family of policies. The gradient of the performance with respect to the actor parameters is estimated, and the parameters are updated in a direction of improvement. A possible drawback is that the gradient estimators may have large variance; furthermore, as the policy changes, a new gradient is estimated independently of past estimates, so there is no “learning” in the sense of accumulating and consolidating older information.

(b) Critic-only methods rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which will then hopefully prescribe a near-optimal policy. Such methods do not optimize directly over the policy space. They may succeed in constructing a “good” approximation of the value function, yet lack reliable guarantees on the near-optimality of the resulting policy.

Actor-critic methods combine the strengths of actor-only and critic-only methods. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement. Being gradient-based, such methods have desirable convergence properties, whereas critic-only methods guarantee convergence only in very limited settings; compared with actor-only methods, they promise faster convergence because of variance reduction.

The paper presents actor-critic algorithms and proves their convergence. The algorithms are based on an important observation: since the number of parameters the actor has to update is relatively small compared with the number of states, the critic need not compute or approximate the exact, high-dimensional value function. Ideally, the critic computes a “projection” of the value function onto a low-dimensional subspace spanned by a set of “basis functions” that are completely determined by the actor's parameterization.




Notation

S : MDP finite state space

A : MDP finite action space

g : S × A → R, the given cost function

μ : a randomized stationary policy (RSP), i.e., a mapping that assigns to each state x a probability distribution over the action space A

P=\left \{ \mu_{\theta};\theta \in R^{n} \right \} : a set of randomized stationary policies

θ : the parameter vector (in R^n) of the policy μ_θ

{\color{Red} \mu_{\theta}}({\color{DarkGreen} x},{\color{Red} u}): probability of taking action u when the state x is encountered under the policy corresponding to θ

\theta \mapsto {\color{Red} \mu_{\theta}}({\color{DarkGreen} x},{\color{Red} u}) : the map from the parameter θ to the policy probabilities; for each (x, u), the probability of taking action u in state x under the policy corresponding to θ.

p_{xy}(u) : probability that the next state is y, given that the current state is x and the current action is u.

{X_n} : the sequence of states.

{X_n , U_n} : the sequence of state-action pairs, a Markov chain on S × A.

{\color{Blue} \psi _{\theta}}(x,{\color{Red} u}) : an R^n-valued function, defined through the policy gradient below.

{\color{Red} \triangledown \mu_{\theta}}({\color{DarkGreen} x},u)=\mu_{\theta}({\color{DarkGreen} x},{\color{Red} u}){\color{Blue} \psi_{\theta}}({\color{DarkGreen} x},{\color{Red} u})  :  policy gradient

\theta \mapsto {\color{Blue} \psi_{\theta}}({\color{blue} x},{\color{Red} u}) : the map from θ to \psi_{\theta}(x,u)=\bigtriangledown \ln \mu_{\theta}(x,u); these functions later serve as the critic's basis functions (see the sketch after this notation list).

{\color{Red} \pi_{\theta}}(x) : the stationary probability of state x for the Markov chain {X_n}; i.e., the stationary state distribution under μ_θ.

{\color{Red} \eta _{\theta}}(x,u)={\color{Red} \pi_{\theta}}(x){\color{Red} \mu_{\theta}}(x,u) : the stationary probability of the pair (x,u) for the Markov chain {X_n , U_n}; i.e., the probability, under the stationary state distribution, of being in state x and taking action u.

{\color{Magenta} \lambda }(\theta)=\sum_{x\in S,u\in A} {\color{Magenta} g} (x,u){\color{Red} \eta _{\theta}}(x,u) : the average cost function, a map R^n \mapsto R.
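To make ψ_θ concrete, here is a minimal Python sketch, not from the paper: it uses a softmax (Gibbs) policy as an illustrative parameterization of μ_θ and checks the identity ∇μ_θ(x,u) = μ_θ(x,u) ψ_θ(x,u) by finite differences. The state/action sizes and the softmax choice are assumptions made only for this example.

import numpy as np

# Hypothetical setup: |S| = 3 states, |A| = 2 actions, so theta has n = |S|*|A| entries.
# The softmax policy mu_theta(x, u) proportional to exp(theta[x, u]) is an illustrative choice only.
S, A = 3, 2
rng = np.random.default_rng(0)
theta = rng.normal(size=(S, A))

def mu(theta):
    # Policy probabilities mu_theta(x, u), one row per state.
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def psi(theta, x, u):
    # psi_theta(x, u) = grad_theta ln mu_theta(x, u) for the softmax policy.
    grad = np.zeros_like(theta)
    grad[x, :] = -mu(theta)[x, :]   # d ln mu(x,u) / d theta[x, v] = 1{v = u} - mu(x, v)
    grad[x, u] += 1.0
    return grad

# Check the defining identity  grad_theta mu_theta(x,u) = mu_theta(x,u) * psi_theta(x,u)
# against a finite-difference gradient.
x, u, eps = 1, 0, 1e-6
num_grad = np.zeros_like(theta)
for i in range(S):
    for j in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[i, j] += eps
        tm[i, j] -= eps
        num_grad[i, j] = (mu(tp)[x, u] - mu(tm)[x, u]) / (2 * eps)
print(np.allclose(num_grad, mu(theta)[x, u] * psi(theta, x, u)))  # expect True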


The derivation of the q-function and the differential cost below uses the following quantities, restated from the notation above:

{\color{Red} \pi_{\theta}}(x) : stationary probability of state x for the Markov chain {X_n}.

{\color{Red} \mu_{\theta}}({\color{DarkGreen} x},{\color{Red} u}) : probability of taking action u in state x under the policy corresponding to θ.

{\color{Red} \eta _{\theta}}(x,u)={\color{Red} \pi_{\theta}}(x){\color{Red} \mu_{\theta}}(x,u) : stationary probability of the pair (x,u) for the Markov chain {X_n , U_n}.

{\color{Red} \triangledown \mu_{\theta}}({\color{DarkGreen} x},u)={\color{Red} \mu_{\theta}}({\color{DarkGreen} x},{\color{Red} u}){\color{Blue} \psi_{\theta}}({\color{DarkGreen} x},{\color{Red} u}) : policy gradient. This definition is clever: through the gradient of the policy probabilities it ties the probabilities to the value function; the resulting form of {\color{Blue} \psi_{\theta}} is derived explicitly below.

{\color{Magenta} \lambda }(\theta)=\sum_{x\in S,u\in A} {\color{Magenta} g} (x,u){\color{Red} \eta _{\theta}}(x,u)=\sum_{x\in S,u\in A} {\color{Magenta} g} (x,u) {\color{Red} \pi_{\theta}}(x) {\color{Red} \mu_{\theta}}(x,u) : average cost function. R^n \mapsto R

{\color{Blue} q_{\theta}}(x,u)={\color{Magenta} g}(x,u)-{\color{Magenta} \lambda}(\theta)+\sum_y {\color{Golden} V_{\theta}}(y)p_{xy}(u) : q-function


\begin{align*} {\color{Blue} \psi_{\theta}}(x,u) &=\frac{{\color{red} \mu_{\theta}}(x,u){\color{Blue} \psi_{\theta}}(x,u)}{{\color{Red} \mu_{\theta}}(x,u)} \\ &=\frac{{\color{Red} \bigtriangledown \mu_{\theta}}(x,u)}{\mu_{\theta}(x,u)} \\ &=\bigtriangledown \ln \mu_{\theta}(x,u) \\ \end{align*}

Together with {\color{Magenta} \lambda}(\theta), the differential value function {\color{Golden} V_{\theta}} is characterized by the Poisson equation

\begin{align*} {\color{Magenta} \lambda }(\theta)+ {\color{Golden} V_{\theta}}(x)&=\sum_{u\in A}{\color{Red} \mu_{\theta}}(x,u) \left [ {\color{Magenta} g} (x,u) + \sum_y {\color{Golden} V_{\theta}}(y) p_{xy}(u) \right ]\\ &=\sum_{u\in A}{\color{Red} \mu_{\theta}}(x,u)\,{\color{Blue} q_{\theta}}(x,u)+{\color{Magenta} \lambda}(\theta), \qquad x\in S, \end{align*}

so that {\color{Golden} V_{\theta}}(x)=\sum_{u\in A}{\color{Red} \mu_{\theta}}(x,u)\,{\color{Blue} q_{\theta}}(x,u).

{\color{Golden} V_{\theta}}(x) : can be viewed as the “disadvantage” of state x: it is the expected excess cost, on top of the average cost, incurred if we start at state x. It plays a role similar to that of the value function in total-cost or discounted-cost MDPs.
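To ground these definitions, the following sketch, my own illustration rather than anything from the paper, builds an arbitrary small random MDP, fixes a stochastic policy matrix standing in for μ_θ, and computes π_θ, λ(θ), V_θ (via the Poisson equation) and q_θ, then checks that V_θ(x) = Σ_u μ_θ(x,u) q_θ(x,u).

import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 2

# Arbitrary illustrative MDP: transition probabilities p[x, u, y] and costs g(x, u).
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((S, A))

# A fixed stochastic policy matrix standing in for mu_theta(x, u).
mu = rng.random((S, A)); mu /= mu.sum(axis=1, keepdims=True)

# State transition matrix under the policy and its stationary distribution pi_theta.
P_mu = np.einsum('xa,xay->xy', mu, P)
w, v = np.linalg.eig(P_mu.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

# Stationary state-action distribution eta_theta and average cost lambda(theta).
eta = pi[:, None] * mu
lam = np.sum(g * eta)

# Differential value V_theta from the Poisson equation
#   lambda + V(x) = sum_u mu(x,u) [ g(x,u) + sum_y p_xy(u) V(y) ],
# pinned down by the extra normalization sum_x pi(x) V(x) = 0.
g_mu = np.sum(mu * g, axis=1)
A_mat = np.vstack([np.eye(S) - P_mu, pi])
b = np.append(g_mu - lam, 0.0)
V = np.linalg.lstsq(A_mat, b, rcond=None)[0]

# q-function q_theta(x,u) = g(x,u) - lambda(theta) + sum_y p_xy(u) V(y),
# and the consistency check V(x) = sum_u mu(x,u) q(x,u).
q = g - lam + np.einsum('xay,y->xa', P, V)
print(np.allclose(V, np.sum(mu * q, axis=1)))  # expect True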


Theorem 1

\frac{\partial }{\partial \theta_i}{\color{Magenta} \lambda }(\theta)=\sum_{x,u} {\color{red} \eta_{\theta}} (x,u){\color{Blue} q_{\theta}}(x,u) {\color{blue} \psi _{\theta}^i}(x,u)

where

{\color{Blue} \psi_{\theta}^i}(x,u): the\ ith \ component\ of \ {\color{Blue} \psi_{\theta}}

Define the inner product \left \langle \cdot , \cdot \right \rangle_{\theta} of two real-valued functions {\color{Blue} q_1, q_2} on S × A by

\left \langle {\color{Blue} q_1},{\color{Blue} q_2} \right \rangle_{\theta}=\sum_{x,u}\eta_{\theta}(x,u){\color{Blue} q_1}(x,u){\color{Blue} q_2}(x,u)

so

\frac{\partial }{\partial \theta_i}{\color{Magenta} \lambda }(\theta)=\left \langle {\color{Blue} q_{\theta}},{\color{Blue} \psi_{\theta}^i} \right \rangle_{\theta}, \ i=1, .. .,n.

The corresponding norm: \left \| \cdot \right \|_{\theta}, with \left \| q \right \|_{\theta}^{2}=\left \langle q,q \right \rangle_{\theta}.

Let {\color{Blue} \Psi_{\theta}} denote the span of the vectors \left \{ {\color{Blue} \psi_{\theta}^{i}};\ 1\leq i\leq n \right \} in R^{\left | S \right |\left | A \right |}. Learning {\color{Blue} q_{\theta}} can then be reduced to learning the projection of {\color{Blue} q_{\theta}} onto the subspace {\color{Blue} \Psi_{\theta}}.
Let {\color{Golden} \Pi _{\theta}}: R^{\left | S \right |\left | A \right |} \mapsto \Psi_{\theta} be the {\color{Golden} projection} operator,

{\color{Golden} \Pi _{\theta}}q=\arg \min_{\hat{q} \in \Psi_{\theta}} \left \| q- \hat q \right \|_{\theta}.

Since \left \langle q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}=\left \langle {\color{Golden} \Pi _{\theta}}q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}, it is enough to compute (learn) the {\color{Golden} projection} of {\color{Blue} q_{\theta}} onto {\color{Blue} \Psi_{\theta}}.
Conclusion: it is enough to compute (“learn”) the projection of the value function onto the subspace.
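The identity \left \langle q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}=\left \langle {\color{Golden} \Pi _{\theta}}q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta} used above is just the orthogonality property of the projection; a one-step justification, added here for completeness and not spelled out in the original note:

\left \langle q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}-\left \langle {\color{Golden} \Pi _{\theta}}q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}=\left \langle q_{\theta}-{\color{Golden} \Pi _{\theta}}q_{\theta},\ \psi_{\theta}^{i} \right \rangle_{\theta}=0, \qquad i=1,...,n,

because q_{\theta}-{\color{Golden} \Pi _{\theta}}q_{\theta} is orthogonal, with respect to \left \langle \cdot,\cdot \right \rangle_{\theta}, to the subspace \Psi_{\theta}, while each \psi_{\theta}^{i} lies in \Psi_{\theta}.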

Actor-critic algorithms

We view actor-critic algorithms as stochastic gradient algorithms on the parameter space of the actor. When the actor parameter vector is θ, the critic's task is to compute an approximation of the projection {\color{Golden} \Pi _{\theta}}{\color{Blue} q_{\theta}} of q_θ onto the subspace {\color{Blue} \Psi_{\theta}}; the actor then uses this approximation to update its policy in an approximate gradient direction.

In the algorithms of this paper, both the control policy and the feature vectors change as the actor updates its parameters.

Critic

The paper describes two actor-critic algorithms, which differ only in the critic's update. The critic is a TD algorithm with a linearly parameterized approximation architecture for the q-function, of the form:

{\color{Blue} Q_r}^{{\color{Red} \theta}}(x,u)=\sum_{j=1}^m {\color{blue} r}^j {\color{Red}\phi_{ \theta}^j}(x,u),

where {\color{blue} r=(r^1,...,r^m)}\in R^m denotes the parameter vector of the critic, and the features {\color{Red} \phi_{\theta}^j}, j=1,...,m, used by the critic depend on the actor parameter θ. The span {\color{Red} \Phi_{\theta}} of these features is assumed to contain \Psi_{\theta}.

The critic maintains an average-cost estimate {\color{Magenta} \lambda}_k, a parameter vector {\color{Blue} r_k}, and an eligibility trace {\color{Blue} z_k}\in R^m, updated at each transition as follows:

 

{\color{Magenta} \lambda} _{k+1}=\lambda_k+\gamma _k({\color{Magenta} g}(X_k, U_k)-\lambda_k)

{\color{Blue} r_{k+1}}={\color{Blue} r_k}+{\color{blue} \gamma_k}\left ({\color{Magenta} g}(X_k,U_k)-{\color{Magenta} \lambda_k}+{\color{Blue} Q_{r_k}}^{{\color{Red} \theta_k}} (X_{k+1},U_{k+1})-{\color{Blue} Q_{r_k}}^{{\color{Red} \theta_k}}(X_k, U_k)\right ){\color{Blue} z_k}

{\color{blue} \gamma_k}: \ a \ positive\ stepsize\ {\color{Red} parameter}

The two critic variants differ only in the way they update the eligibility trace z_k (both rules are sketched in code after the TD(α) update below):

TD(1) Critic: Let x* be a state in S.

\begin{align*} {\color{Blue} z_{k+1}} &= z_k+\phi_{\theta_k}(X_{k+1},U_{k+1}), if\ X_{k+1} \neq x*,\\ &=\phi_{\theta_k}(X_{k+1},U_{k+1}), \ otherwise. \end{align*}

TD(α) Critic, 0 \leq \alpha < 1:

{\color{Blue} z_{k+1}} = {\color{Magenta} \alpha} z_k+ \phi_{\theta_k}(X_{k+1},U_{k+1})
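For reference, the two trace updates written as small Python helpers; a minimal sketch in which phi_next stands for φ_{θ_k}(X_{k+1}, U_{k+1}) as a NumPy vector and x_star for the chosen reference state x* (these names are mine, not the paper's):

def td1_trace(z, phi_next, x_next, x_star):
    # TD(1) critic: accumulate features, resetting whenever the reference state x* is visited.
    if x_next != x_star:
        return z + phi_next
    return phi_next.copy()

def td_alpha_trace(z, phi_next, alpha):
    # TD(alpha) critic, 0 <= alpha < 1: geometric decay of the trace.
    return alpha * z + phi_next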

Actor

Finally, the actor updates its parameter vector by letting

{\color{Red} \theta_{k+1}}=\theta_k - {\color{Red} \beta_k }{\color{Magenta} \Gamma (r_k)}{\color{Blue} Q{\color{Red} ^{\theta_k}}_{r_k}} (X_{k+1},U_{k+1}){\color{Blue} \psi_{{\color{Red} \theta_k}}}(X_{k+1},U_{k+1})

where {\color{Red} \beta _k} is a positive stepsize and {\color{Magenta} \Gamma(r_k)}>0 is a normalization factor; \Gamma(\cdot) is Lipschitz continuous and there exists C>0 such that {\color{Magenta} \Gamma}(r)\leq \frac{C}{1+\left \| r \right \|}.
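Putting the critic and the actor together, here is a minimal end-to-end simulation sketch of the TD(α) variant. Everything concrete in it is an assumption made for illustration only: a random MDP, a softmax policy standing in for μ_θ, features chosen as φ_θ = ψ_θ (one simple way to make Φ_θ contain Ψ_θ), Γ(r) = C/(1 + ‖r‖), and the particular stepsize schedules.

import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 2

# Arbitrary illustrative MDP: transitions p[x, u, y] and costs g(x, u).
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((S, A))

def mu(theta):
    # Softmax policy mu_theta(x, u), standing in for the parameterized RSP.
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def psi(theta, x, u):
    # psi_theta(x, u) = grad_theta ln mu_theta(x, u), flattened to a vector in R^n.
    grad = np.zeros_like(theta)
    grad[x, :] = -mu(theta)[x, :]
    grad[x, u] += 1.0
    return grad.reshape(-1)

phi = psi                 # critic features: phi_theta = psi_theta (so m = n)

theta = np.zeros((S, A))  # actor parameters
r = np.zeros(S * A)       # critic parameters
z = np.zeros(S * A)       # eligibility trace
lam = 0.0                 # running estimate of the average cost
alpha, C = 0.9, 10.0      # TD(alpha) parameter and normalization constant (illustrative)

def Q(theta, r, x, u):
    # Linearly parameterized q-function approximation Q_r^theta(x, u).
    return r @ phi(theta, x, u)

x = 0
u = rng.choice(A, p=mu(theta)[x])
for k in range(1, 20001):
    gamma_k = k ** -0.6   # critic stepsize
    beta_k = 1.0 / k      # actor stepsize; beta_k / gamma_k -> 0 (two time scales)

    # Simulate one transition under the current policy.
    x_next = rng.choice(S, p=P[x, u])
    u_next = rng.choice(A, p=mu(theta)[x_next])

    q_next = Q(theta, r, x_next, u_next)          # Q_{r_k}^{theta_k}(X_{k+1}, U_{k+1})
    psi_next = psi(theta, x_next, u_next)         # psi_{theta_k}(X_{k+1}, U_{k+1})

    # Critic: temporal difference, then lambda, r and z updates.
    d = g[x, u] - lam + q_next - Q(theta, r, x, u)
    lam += gamma_k * (g[x, u] - lam)
    r_new = r + gamma_k * d * z
    z = alpha * z + psi_next                      # TD(alpha) trace; phi = psi here

    # Actor: approximate gradient step with normalization Gamma(r_k) = C / (1 + ||r_k||).
    Gamma = C / (1.0 + np.linalg.norm(r))
    theta = theta - beta_k * Gamma * q_next * psi_next.reshape(S, A)

    r = r_new
    x, u = x_next, u_next

print("estimated average cost:", lam)

The update order is chosen so that the critic and the actor both use the quantities indexed by k in the formulas above: the temporal difference and the actor step use r_k, θ_k and λ_k before they are overwritten.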


Convergence of actor-critic algorithms

Actor-critic algorithms are gradient-based, so convergence to a globally optimal policy cannot be proved; what is proved is that the gradient of the average cost vanishes, \bigtriangledown \lambda (\theta_n) \to 0.

Since \frac{ {\color{Red} \beta_k}}{{\color{Blue} \gamma_k}} \to 0, the actor's stepsizes are negligible compared with the critic's, so from the critic's point of view the actor is essentially stationary. In other words, this two-time-scale structure addresses the large-variance problem of actor-only methods.
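For instance (an illustrative choice of mine, not one prescribed in this note), stepsizes \gamma_k = k^{-0.6} and \beta_k = k^{-1} give \frac{\beta_k}{\gamma_k}=k^{-0.4} \to 0, so the critic moves on the faster time scale while the actor's parameter drifts slowly underneath it.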


Policy Gradient vs Q-learning
