Actor-Critic Algorithms



Math Analysis


Methods that learn approximations to both policy and value functions are often called actor-critic methods, where ‘actor’ is a reference to the learned policy, and ‘critic’ refers to the learned value function, usually a state-value function.

This paper proposes and analyzes a class of actor-critic algorithms for simulation-based optimization of a Markov decision process (MDP) over a parameterized family of randomized stationary policies.

Critic : TD learning with a linearly parameterized approximation architecture; it learns the value function.

Actor : uses the information supplied by the critic to update its parameters in an approximate gradient direction; it represents the decision making, i.e., the policy.

Most reinforcement learning and neuro-dynamic programming methods fall into one of the following two categories:

(a) Actor-only methods work with a parameterized family of policies. The gradient of the performance with respect to the actor parameters is estimated, and the parameters are updated in a direction of improvement. A possible drawback is that the gradient estimators may have large variance; furthermore, as the policy changes, a new gradient is estimated independently of past estimates, so there is no “learning” in the sense of accumulating and consolidating older information.

(b) Critic-only methods rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which will then hopefully prescribe a near-optimal policy. Such methods do not optimize directly over the policy space. They may succeed in constructing a “good” approximation of the value function, yet lack reliable guarantees on the near-optimality of the resulting policy.

Actor-critic methods combine the strengths of actor-only and critic-only methods. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement. Being gradient-based, such methods have desirable convergence properties, whereas critic-only methods guarantee convergence only in very limited settings; compared with actor-only methods, they promise faster convergence because of variance reduction.

The paper presents actor-critic algorithms and proves their convergence. The algorithms are based on an important observation: since the number of parameters the actor has to update is relatively small compared with the number of states, the critic need not compute or approximate the exact, high-dimensional value function. Ideally, the critic computes a “projection” of the value function onto a low-dimensional subspace spanned by a set of “basis functions” that are completely determined by the actor's parameterization.




Notation

S : MDP finite state space

A : MDP finite action space

g : S × A → R, the given cost function

μ : a randomized stationary policy (RSP), i.e., a mapping that assigns to each state x a probability distribution over the action space A

P=\left \{ \mu_{\theta};\theta \in R^{n} \right \} : a set of randomized stationary policies

θ : the parameter vector (in R^n) of the policy μ_θ

{\color{Red} \mu_{\theta}}({\color{DarkGreen} x},{\color{Red} u}): probability of taking action u when the state x is encountered under the policy corresponding to θ

\theta \mapsto {\color{Red} \mu_{\theta}}({\color{DarkGreen} x},{\color{Red} u}) : the map from the parameter θ to the policy probabilities; for each (x, u), the probability of taking action u in state x under the policy corresponding to θ.

p_{xy}(u) : probability that the next state is y, given that the current state is x and the current action is u.

{X_n} : the sequence of states.

{X_n , U_n} : the sequence of state-action pairs, a Markov chain on S × A.

{\color{Blue} \psi _{\theta}}(x,{\color{Red} u}) : an R^n-valued function, defined through the policy gradient below.

{\color{Red} \triangledown \mu_{\theta}}({\color{DarkGreen} x},u)=\mu_{\theta}({\color{DarkGreen} x},{\color{Red} u}){\color{Blue} \psi_{\theta}}({\color{DarkGreen} x},{\color{Red} u})  :  policy gradient

\theta \mapsto {\color{Blue} \psi_{\theta}}({\color{blue} x},{\color{Red} u}) : the map from θ to \psi_{\theta}(x,u)=\bigtriangledown \ln \mu_{\theta}(x,u); these functions later serve as the critic's basis functions (see the sketch after this notation list).

{\color{Red} \pi_{\theta}}(x) : the stationary probability of state x for the Markov chain {X_n}; i.e., the stationary state distribution under μ_θ.

{\color{Red} \eta _{\theta}}(x,u)={\color{Red} \pi_{\theta}}(x){\color{Red} \mu_{\theta}}(x,u) : the stationary probability of the pair (x,u) for the Markov chain {X_n , U_n}; i.e., the probability, under the stationary state distribution, of being in state x and taking action u.

{\color{Magenta} \lambda }(\theta)=\sum_{x\in S,u\in A} {\color{Magenta} g} (x,u){\color{Red} \eta _{\theta}}(x,u) : the average cost function, a map R^n \mapsto R.
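To make ψ_θ concrete, here is a minimal Python sketch, not from the paper: it uses a softmax (Gibbs) policy as an illustrative parameterization of μ_θ and checks the identity ∇μ_θ(x,u) = μ_θ(x,u) ψ_θ(x,u) by finite differences. The state/action sizes and the softmax choice are assumptions made only for this example.

import numpy as np

# Hypothetical setup: |S| = 3 states, |A| = 2 actions, so theta has n = |S|*|A| entries.
# The softmax policy mu_theta(x, u) proportional to exp(theta[x, u]) is an illustrative choice only.
S, A = 3, 2
rng = np.random.default_rng(0)
theta = rng.normal(size=(S, A))

def mu(theta):
    # Policy probabilities mu_theta(x, u), one row per state.
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def psi(theta, x, u):
    # psi_theta(x, u) = grad_theta ln mu_theta(x, u) for the softmax policy.
    grad = np.zeros_like(theta)
    grad[x, :] = -mu(theta)[x, :]   # d ln mu(x,u) / d theta[x, v] = 1{v = u} - mu(x, v)
    grad[x, u] += 1.0
    return grad

# Check the defining identity  grad_theta mu_theta(x,u) = mu_theta(x,u) * psi_theta(x,u)
# against a finite-difference gradient.
x, u, eps = 1, 0, 1e-6
num_grad = np.zeros_like(theta)
for i in range(S):
    for j in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[i, j] += eps
        tm[i, j] -= eps
        num_grad[i, j] = (mu(tp)[x, u] - mu(tm)[x, u]) / (2 * eps)
print(np.allclose(num_grad, mu(theta)[x, u] * psi(theta, x, u)))  # expect True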


The derivation of the q-function and the differential cost below uses the following quantities, restated from the notation above:

{\color{Red} \pi_{\theta}}(x) : stationary probability of state x for the Markov chain {X_n}.

{\color{Red} \mu_{\theta}}({\color{DarkGreen} x},{\color{Red} u}) : probability of taking action u in state x under the policy corresponding to θ.

{\color{Red} \eta _{\theta}}(x,u)={\color{Red} \pi_{\theta}}(x){\color{Red} \mu_{\theta}}(x,u) : stationary probability of the pair (x,u) for the Markov chain {X_n , U_n}.

{\color{Red} \triangledown \mu_{\theta}}({\color{DarkGreen} x},u)={\color{Red} \mu_{\theta}}({\color{DarkGreen} x},{\color{Red} u}){\color{Blue} \psi_{\theta}}({\color{DarkGreen} x},{\color{Red} u}) : policy gradient. This definition is clever: through the gradient of the policy probabilities it ties the probabilities to the value function; the resulting form of {\color{Blue} \psi_{\theta}} is derived explicitly below.

{\color{Magenta} \lambda }(\theta)=\sum_{x\in S,u\in A} {\color{Magenta} g} (x,u){\color{Red} \eta _{\theta}}(x,u)=\sum_{x\in S,u\in A} {\color{Magenta} g} (x,u) {\color{Red} \pi_{\theta}}(x) {\color{Red} \mu_{\theta}}(x,u) : average cost function. R^n \mapsto R

{\color{Blue} q_{\theta}}(x,u)={\color{Magenta} g}(x,u)-{\color{Magenta} \lambda}(\theta)+\sum_y {\color{Golden} V_{\theta}}(y)p_{xy}(u) : q-function


\begin{align*} {\color{Blue} \psi_{\theta}}(x,u) &=\frac{{\color{red} \mu_{\theta}}(x,u){\color{Blue} \psi_{\theta}}(x,u)}{{\color{Red} \mu_{\theta}}(x,u)} \\ &=\frac{{\color{Red} \bigtriangledown \mu_{\theta}}(x,u)}{\mu_{\theta}(x,u)} \\ &=\bigtriangledown \ln \mu_{\theta}(x,u) \\ \end{align*}

Together with {\color{Magenta} \lambda}(\theta), the differential value function {\color{Golden} V_{\theta}} is characterized by the Poisson equation

\begin{align*} {\color{Magenta} \lambda }(\theta)+ {\color{Golden} V_{\theta}}(x)&=\sum_{u\in A}{\color{Red} \mu_{\theta}}(x,u) \left [ {\color{Magenta} g} (x,u) + \sum_y {\color{Golden} V_{\theta}}(y) p_{xy}(u) \right ]\\ &=\sum_{u\in A}{\color{Red} \mu_{\theta}}(x,u)\,{\color{Blue} q_{\theta}}(x,u)+{\color{Magenta} \lambda}(\theta), \qquad x\in S, \end{align*}

so that {\color{Golden} V_{\theta}}(x)=\sum_{u\in A}{\color{Red} \mu_{\theta}}(x,u)\,{\color{Blue} q_{\theta}}(x,u).

{\color{Golden} V_{\theta}}(x) : can be viewed as the “disadvantage” of state x: it is the expected excess cost, on top of the average cost, incurred if we start at state x. It plays a role similar to that of the value function in total-cost or discounted-cost MDPs.
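To ground these definitions, the following sketch, my own illustration rather than anything from the paper, builds an arbitrary small random MDP, fixes a stochastic policy matrix standing in for μ_θ, and computes π_θ, λ(θ), V_θ (via the Poisson equation) and q_θ, then checks that V_θ(x) = Σ_u μ_θ(x,u) q_θ(x,u).

import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 2

# Arbitrary illustrative MDP: transition probabilities p[x, u, y] and costs g(x, u).
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((S, A))

# A fixed stochastic policy matrix standing in for mu_theta(x, u).
mu = rng.random((S, A)); mu /= mu.sum(axis=1, keepdims=True)

# State transition matrix under the policy and its stationary distribution pi_theta.
P_mu = np.einsum('xa,xay->xy', mu, P)
w, v = np.linalg.eig(P_mu.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

# Stationary state-action distribution eta_theta and average cost lambda(theta).
eta = pi[:, None] * mu
lam = np.sum(g * eta)

# Differential value V_theta from the Poisson equation
#   lambda + V(x) = sum_u mu(x,u) [ g(x,u) + sum_y p_xy(u) V(y) ],
# pinned down by the extra normalization sum_x pi(x) V(x) = 0.
g_mu = np.sum(mu * g, axis=1)
A_mat = np.vstack([np.eye(S) - P_mu, pi])
b = np.append(g_mu - lam, 0.0)
V = np.linalg.lstsq(A_mat, b, rcond=None)[0]

# q-function q_theta(x,u) = g(x,u) - lambda(theta) + sum_y p_xy(u) V(y),
# and the consistency check V(x) = sum_u mu(x,u) q(x,u).
q = g - lam + np.einsum('xay,y->xa', P, V)
print(np.allclose(V, np.sum(mu * q, axis=1)))  # expect True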


Theorem 1

\frac{\partial }{\partial \theta_i}{\color{Magenta} \lambda }(\theta)=\sum_{x,u} {\color{red} \eta_{\theta}} (x,u){\color{Blue} q_{\theta}}(x,u) {\color{blue} \psi _{\theta}^i}(x,u)

where

{\color{Blue} \psi_{\theta}^i}(x,u): the\ ith \ component\ of \ {\color{Blue} \psi_{\theta}}

Define the inner product \left \langle \cdot , \cdot \right \rangle_{\theta} of two real-valued functions {\color{Blue} q_1, q_2} on S × A by

\left \langle {\color{Blue} q_1},{\color{Blue} q_2} \right \rangle_{\theta}=\sum_{x,u}\eta_{\theta}(x,u){\color{Blue} q_1}(x,u){\color{Blue} q_2}(x,u)

so

\frac{\partial }{\partial \theta_i}{\color{Magenta} \lambda }(\theta)=\left \langle {\color{Blue} q_{\theta}},{\color{Blue} \psi_{\theta}^i} \right \rangle_{\theta}, \ i=1, .. .,n.

The corresponding norm: \left \| \cdot \right \|_{\theta}, with \left \| q \right \|_{\theta}^{2}=\left \langle q,q \right \rangle_{\theta}.

Let {\color{Blue} \Psi_{\theta}} denote the span of the vectors \left \{ {\color{Blue} \psi_{\theta}^{i}};\ 1\leq i\leq n \right \} in R^{\left | S \right |\left | A \right |}. Learning {\color{Blue} q_{\theta}} can then be reduced to learning the projection of {\color{Blue} q_{\theta}} onto the subspace {\color{Blue} \Psi_{\theta}}.
Let {\color{Golden} \Pi _{\theta}}: R^{\left | S \right |\left | A \right |} \mapsto \Psi_{\theta} be the {\color{Golden} projection} operator,

{\color{Golden} \Pi _{\theta}}q=\arg \min_{\hat{q} \in \Psi_{\theta}} \left \| q- \hat q \right \|_{\theta}.

Since \left \langle q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}=\left \langle {\color{Golden} \Pi _{\theta}}q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}, it is enough to compute (learn) the {\color{Golden} projection} of {\color{Blue} q_{\theta}} onto {\color{Blue} \Psi_{\theta}}.
Conclusion: it is enough to compute (“learn”) the projection of the value function onto the subspace.
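The identity \left \langle q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}=\left \langle {\color{Golden} \Pi _{\theta}}q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta} used above is just the orthogonality property of the projection; a one-step justification, added here for completeness and not spelled out in the original note:

\left \langle q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}-\left \langle {\color{Golden} \Pi _{\theta}}q_{\theta},\psi_{\theta}^{i} \right \rangle_{\theta}=\left \langle q_{\theta}-{\color{Golden} \Pi _{\theta}}q_{\theta},\ \psi_{\theta}^{i} \right \rangle_{\theta}=0, \qquad i=1,...,n,

because q_{\theta}-{\color{Golden} \Pi _{\theta}}q_{\theta} is orthogonal, with respect to \left \langle \cdot,\cdot \right \rangle_{\theta}, to the subspace \Psi_{\theta}, while each \psi_{\theta}^{i} lies in \Psi_{\theta}.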

Actor-critic algorithms

We view actor-critic algorithms as stochastic gradient algorithms on the parameter space of the actor. When the actor parameter vector is θ, the critic's task is to compute an approximation of the projection {\color{Golden} \Pi _{\theta}}{\color{Blue} q_{\theta}} of q_θ onto the subspace {\color{Blue} \Psi_{\theta}}; the actor then uses this approximation to update its policy in an approximate gradient direction.

In the algorithms of this paper, both the control policy and the feature vectors change as the actor updates its parameters.

Critic

The paper describes two actor-critic algorithms, which differ only in the critic's update. The critic is a TD algorithm with a linearly parameterized approximation architecture for the q-function, of the form:

{\color{Blue} Q_r}^{{\color{Red} \theta}}(x,u)=\sum_{j=1}^m {\color{blue} r}^j {\color{Red}\phi_{ \theta}^j}(x,u),

where {\color{blue} r=(r^1,...,r^m)}\in R^m denotes the parameter vector of the critic, and the features {\color{Red} \phi_{\theta}^j}, j=1,...,m, used by the critic depend on the actor parameter θ. The span {\color{Red} \Phi_{\theta}} of these features is assumed to contain \Psi_{\theta}.

The critic maintains an average-cost estimate {\color{Magenta} \lambda}_k, a parameter vector {\color{Blue} r_k}, and an eligibility trace {\color{Blue} z_k}\in R^m, updated at each transition as follows:

 

{\color{Magenta} \lambda} _{k+1}=\lambda_k+\gamma _k({\color{Magenta} g}(X_k, U_k)-\lambda_k)

{\color{Blue} r_{k+1}}={\color{Blue} r_k}+{\color{blue} \gamma_k}\left ({\color{Magenta} g}(X_k,U_k)-{\color{Magenta} \lambda_k}+{\color{Blue} Q_{r_k}}^{{\color{Red} \theta_k}} (X_{k+1},U_{k+1})-{\color{Blue} Q_{r_k}}^{{\color{Red} \theta_k}}(X_k, U_k)\right ){\color{Blue} z_k}

{\color{blue} \gamma_k}: \ a \ positive\ stepsize\ {\color{Red} parameter}

The two critic variants differ only in the way they update the eligibility trace z_k (both rules are sketched in code after the TD(α) update below):

TD(1) Critic: Let x* be a state in S.

\begin{align*} {\color{Blue} z_{k+1}} &= z_k+\phi_{\theta_k}(X_{k+1},U_{k+1}), if\ X_{k+1} \neq x*,\\ &=\phi_{\theta_k}(X_{k+1},U_{k+1}), \ otherwise. \end{align*}

TD(α) Critic, 0 \leq \alpha < 1:

{\color{Blue} z_{k+1}} = {\color{Magenta} \alpha} z_k+ \phi_{\theta_k}(X_{k+1},U_{k+1})
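For reference, the two trace updates written as small Python helpers; a minimal sketch in which phi_next stands for φ_{θ_k}(X_{k+1}, U_{k+1}) as a NumPy vector and x_star for the chosen reference state x* (these names are mine, not the paper's):

def td1_trace(z, phi_next, x_next, x_star):
    # TD(1) critic: accumulate features, resetting whenever the reference state x* is visited.
    if x_next != x_star:
        return z + phi_next
    return phi_next.copy()

def td_alpha_trace(z, phi_next, alpha):
    # TD(alpha) critic, 0 <= alpha < 1: geometric decay of the trace.
    return alpha * z + phi_next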

Actor

Finally, the actor updates its parameter vector by letting

{\color{Red} \theta_{k+1}}=\theta_k - {\color{Red} \beta_k }{\color{Magenta} \Gamma (r_k)}{\color{Blue} Q{\color{Red} ^{\theta_k}}_{r_k}} (X_{k+1},U_{k+1}){\color{Blue} \psi_{{\color{Red} \theta_k}}}(X_{k+1},U_{k+1})

where {\color{Red} \beta _k} is a positive stepsize and {\color{Magenta} \Gamma(r_k)}>0 is a normalization factor; \Gamma(\cdot) is Lipschitz continuous and there exists C>0 such that {\color{Magenta} \Gamma}(r)\leq \frac{C}{1+\left \| r \right \|}.
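Putting the critic and the actor together, here is a minimal end-to-end simulation sketch of the TD(α) variant. Everything concrete in it is an assumption made for illustration only: a random MDP, a softmax policy standing in for μ_θ, features chosen as φ_θ = ψ_θ (one simple way to make Φ_θ contain Ψ_θ), Γ(r) = C/(1 + ‖r‖), and the particular stepsize schedules.

import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 2

# Arbitrary illustrative MDP: transitions p[x, u, y] and costs g(x, u).
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((S, A))

def mu(theta):
    # Softmax policy mu_theta(x, u), standing in for the parameterized RSP.
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def psi(theta, x, u):
    # psi_theta(x, u) = grad_theta ln mu_theta(x, u), flattened to a vector in R^n.
    grad = np.zeros_like(theta)
    grad[x, :] = -mu(theta)[x, :]
    grad[x, u] += 1.0
    return grad.reshape(-1)

phi = psi                 # critic features: phi_theta = psi_theta (so m = n)

theta = np.zeros((S, A))  # actor parameters
r = np.zeros(S * A)       # critic parameters
z = np.zeros(S * A)       # eligibility trace
lam = 0.0                 # running estimate of the average cost
alpha, C = 0.9, 10.0      # TD(alpha) parameter and normalization constant (illustrative)

def Q(theta, r, x, u):
    # Linearly parameterized q-function approximation Q_r^theta(x, u).
    return r @ phi(theta, x, u)

x = 0
u = rng.choice(A, p=mu(theta)[x])
for k in range(1, 20001):
    gamma_k = k ** -0.6   # critic stepsize
    beta_k = 1.0 / k      # actor stepsize; beta_k / gamma_k -> 0 (two time scales)

    # Simulate one transition under the current policy.
    x_next = rng.choice(S, p=P[x, u])
    u_next = rng.choice(A, p=mu(theta)[x_next])

    q_next = Q(theta, r, x_next, u_next)          # Q_{r_k}^{theta_k}(X_{k+1}, U_{k+1})
    psi_next = psi(theta, x_next, u_next)         # psi_{theta_k}(X_{k+1}, U_{k+1})

    # Critic: temporal difference, then lambda, r and z updates.
    d = g[x, u] - lam + q_next - Q(theta, r, x, u)
    lam += gamma_k * (g[x, u] - lam)
    r_new = r + gamma_k * d * z
    z = alpha * z + psi_next                      # TD(alpha) trace; phi = psi here

    # Actor: approximate gradient step with normalization Gamma(r_k) = C / (1 + ||r_k||).
    Gamma = C / (1.0 + np.linalg.norm(r))
    theta = theta - beta_k * Gamma * q_next * psi_next.reshape(S, A)

    r = r_new
    x, u = x_next, u_next

print("estimated average cost:", lam)

The update order is chosen so that the critic and the actor both use the quantities indexed by k in the formulas above: the temporal difference and the actor step use r_k, θ_k and λ_k before they are overwritten.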


Convergence of actor-critic algorithms

Actor-critic algorithms are gradient-based, so convergence to a globally optimal policy cannot be proved; what is proved is that the gradient of the average cost vanishes, \bigtriangledown \lambda (\theta_n) \to 0.

Since \frac{ {\color{Red} \beta_k}}{{\color{Blue} \gamma_k}} \to 0, the actor's stepsizes are negligible compared with the critic's, so from the critic's point of view the actor is essentially stationary. In other words, this two-time-scale structure addresses the large-variance problem of actor-only methods.
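For instance (an illustrative choice of mine, not one prescribed in this note), stepsizes \gamma_k = k^{-0.6} and \beta_k = k^{-1} give \frac{\beta_k}{\gamma_k}=k^{-0.4} \to 0, so the critic moves on the faster time scale while the actor's parameter drifts slowly underneath it.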


Policy Gradient vs Q-learning
