Policy Gradient Methods for Reinforcement Learning with Function Approximation

Math Analysis

Markov Decision Processes and Policy Gradient

So far in this book almost all the methods have been action-value methods; they learned the values of actions and then selected actions based on their estimated action values; their policies would not even exist without the action-value estimates. In this chapter we consider methods that instead learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but is not required for action selection.

Method	Value function	Policy
Action-value methods	Learn the values of actions	Would not even exist without the action-value estimates
Policy gradient methods	Not consulted for action selection; may still be used to learn the policy parameter	Learn a parameterized policy

$\pi (a \mid s, \theta) = Pr\{A_t=a \mid S_t=s, \theta_t=\theta\}$

<Reinforcement Learning, An Introduction> Richard S. Sutton and Andrew G. Barto

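In this paper the policy is represented by its own function approximator (FA), independent of the value function, and is updated according to the gradient of the expected reward with respect to the policy parameters. The paper's main new result is that this gradient can be written in a form suitable for estimation with the help of an approximate action-value or advantage function.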

$\frac{\partial {\color{Red} \rho}}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a,\theta)}{\partial \theta}Q^{\pi}(s,a)$

Notation

$P_{ss'}^a=Pr\{s_{t+1}=s'\ \mid s_t=s, a_t=a \}$ : state transition probabilities.

$R_s^a=E\{ r_{t+1} \mid s_t=s,a_t=a \}$ : expected rewards. $\forall s,s' \in S, a \in A$

$\pi(s,a,{\color{Red} \theta})=Pr\{ a_t = a \mid s_t = s, {\color{Red} \theta} \}$ : the policy, which characterizes the agent's decision-making procedure at each time step, $\forall s \in S, a \in A$, where ${\color{Red} \theta} \in R^l$, for $l \ll \left | S \right |$, is a parameter vector. We assume that $\frac{\partial \pi(s,a)}{\partial {\color{Red} \theta}}$ exists, and write $\pi(s,a)$ as shorthand for $\pi(s,a,{\color{Red} \theta})$.

${\color{Red} \rho }({\color{Magenta} \pi})$ : the performance measure of policy π, i.e. the long-term expected reward (per step in the average-reward formulation). There are two ways of formulating the agent's objective: the average-reward formulation and the start-state formulation. In the average-reward formulation, ${\color{Red} \rho }({\color{Magenta} \pi})$ is independent of the start state.

$d^{\pi}(s)=\lim_{t \to \infty}Pr\{ s_t=s\mid s_0, \pi \}$ : stationary distribution of states under π.

${\color{Red} \gamma} \in [0, 1]$ : a discount rate. In the start-state formulation, we define ${\color{Blue} d^{\pi}(s)}$ as a discounted weighting of states encountered starting at s₀ and then following π : ${\color{Blue} d^{\pi}(s)}=\sum_{t=0}^{\infty} {\color{Red} \gamma^t} Pr\{s_t=s\mid s_0,\pi\}$ .

Q^π : the value of a state-action pair given a policy

${\color{Magenta} \pi}$ : the function approximator for the policy.

${\color{Magenta} f_w}$ : S x A → R, our approximation to Q^π, with parameter ω; the function approximator for the value function.

${\color{Red} \hat{Q}^{\pi} (s_t,a_t)}$ : some unbiased estimator of Q^π(s_t, a_t ), perhaps R_t.
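The discounted state weighting defined above can be computed in closed form, since $\sum_t \gamma^t P_\pi^t = (I-\gamma P_\pi)^{-1}$. The following is a minimal sketch, assuming a hypothetical 3-state chain `P_pi` already induced by some fixed policy; the numbers, `gamma` and `s0` are illustrative and not taken from the paper.

```python
import numpy as np

gamma = 0.9
s0 = 0
# Hypothetical state-to-state transition matrix under a fixed policy pi (rows sum to 1).
P_pi = np.array([[0.5, 0.4, 0.1],
                 [0.3, 0.4, 0.3],
                 [0.2, 0.3, 0.5]])
e0 = np.zeros(3)
e0[s0] = 1.0

# Closed form: d^T = e0^T (I - gamma P_pi)^{-1}, i.e. solve (I - gamma P_pi)^T d = e0.
d_closed = np.linalg.solve((np.eye(3) - gamma * P_pi).T, e0)

# Truncated series: sum_t gamma^t Pr{s_t = s | s0, pi}.
d_series = np.zeros(3)
p_t = e0.copy()                  # distribution of s_t, starting at t = 0
for t in range(500):
    d_series += gamma**t * p_t
    p_t = p_t @ P_pi             # advance the chain by one step

print(d_closed, d_closed.sum())
```

Note that this weighting sums to $1/(1-\gamma)$ rather than 1, so in the start-state formulation $d^{\pi}$ is a weighting of states, not a probability distribution.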

Proof Key Steps about Theorem 1 (Policy Gradient)

Define

${\color{Blue} V^{\pi}(s)}=\sum_a \pi (s,a){\color{magenta} Q^{\pi}(s,a)}$ (1)

For the start-state formulation:

${\color{magenta} Q^{\pi}(s,a)}=R_s^a+\sum_{s'} \gamma P_{ss'}^a {\color{Purple} V^{\pi}(s')}$ (2)

$\begin{align*} \frac{\partial {\color{magenta}Q^{\pi}(s,a)}}{\partial \theta}&=\frac{\partial \left [ R_s^a + \sum_{s'}\gamma P_{ss'}^a {\color{Purple} V^{\pi}(s')} \right ]}{\partial \theta} \\ &=\sum_{s'}\gamma P_{ss'}^a {\color{Purple} \frac{\partial V^{\pi}(s')}{\partial \theta}} \end{align*}$ (3)

Next, differentiate (1) with respect to θ:

$\begin{align*} {\color{Purple}\frac{\partial V^{\pi}(s')}{\partial \theta}}&=\frac{\partial [\sum_{a'} \pi (s',a'){\color{magenta} Q^{\pi}(s',a')}]}{\partial \theta} \\ &=\sum_{a'} \left[ \frac{\partial \pi (s',a')}{\partial \theta} {\color{magenta} Q^{\pi}(s',a')} + \pi (s',a') \frac{\partial {\color{magenta} Q^{\pi}(s',a')}}{\partial \theta}\right ] \\ &=\sum_{a'} \left[ \frac{\partial \pi (s',a')}{\partial \theta} {\color{magenta} Q^{\pi}(s',a')} + \pi (s',a') \sum_{s''} \gamma P_{s's''}^{a'} {\color{Red} \frac{\partial V^{\pi}(s'')}{\partial \theta}} \right ] \\ \end{align*}$ (5)

$\begin{align*} {\color{red} \frac{\partial V^{\pi}(s'')}{\partial \theta}} &=\frac{\partial [\sum_{a''} \pi (s'',a''){\color{magenta} Q^{\pi}(s'',a'')}]}{\partial \theta} \\ &=\sum_{a''} \left[ \frac{\partial \pi (s'',a'')}{\partial \theta} {\color{magenta} Q^{\pi}(s'',a'')} + \pi (s'',a'') \frac{\partial {\color{magenta} Q^{\pi}(s'',a'')}}{\partial \theta}\right ] \\ &=\sum_{a''} \left[ \frac{\partial \pi (s'',a'')}{\partial \theta} {\color{magenta} Q^{\pi}(s'',a'')} + \pi (s'',a'') \sum_{s'''} \gamma P_{s''s'''}^{a''} {\color{DarkRed}\frac{\partial V^{\pi}(s''')}{\partial \theta}} \right ] \\ \end{align*}$ (6)

$\begin{align*} {\color{DarkRed} \frac{\partial V^{\pi}(s''')}{\partial \theta}} &=\frac{\partial [\sum_{a'''} \pi (s''',a'''){\color{magenta} Q^{\pi}(s''',a''')}]}{\partial \theta} \\ &=\sum_{a'''} \left[ \frac{\partial \pi (s''',a''')}{\partial \theta} {\color{magenta} Q^{\pi}(s''',a''')} + \pi (s''',a''') \frac{\partial {\color{magenta} Q^{\pi}(s''',a''')}}{\partial \theta}\right ] \\ &=\sum_{a'''} \left[ \frac{\partial \pi (s''',a''')}{\partial \theta} {\color{magenta} Q^{\pi}(s''',a''')} + \pi (s''',a''') \sum_{s''''} \gamma P_{s'''s''''}^{a'''} {\color{Golden}\frac{\partial V^{\pi}(s'''')}{\partial \theta}} \right ] \\ \end{align*}$ (7)

. . .

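Substituting (7) into (6), the result into (5), and that in turn, via (3), into the expansion of $\frac{\partial V^{\pi}(s)}{\partial \theta}$ (4), and continuing to unroll in the same way, gives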

$\begin{align*} \frac{\partial {\color{Blue} V^{\pi}(s)}}{\partial \theta} &=\frac{\partial [\sum_a\pi (s,a){\color{magenta} Q^{\pi}(s,a)}]}{\partial \theta} \\ &=\sum_{a} \left[ \frac{\partial \pi (s,a)}{\partial \theta} {\color{magenta} Q^{\pi}(s,a)} + \pi (s,a) \frac{\partial {\color{magenta} Q^{\pi}(s,a)}}{\partial \theta}\right ] \\ &=\sum_{\color{Red} x} \sum_{k=0}^{\infty} \gamma ^{\color{Magenta} k} Pr(s \to {\color{Red} x},{\color{Magenta} k},\pi)\sum_a \frac{\partial \pi({\color{Red} x},a)}{\partial \theta} Q^{\pi}({\color{Red} x},a) \end{align*}$ (4)

where

${\color{Red} x}$ : a state reachable from s, such as s''' or s'''', ...

${\color{Magenta} k}$ : the number of steps from state s to state x.

π : the policy.

$Pr(s \to {\color{Red} x}, {\color{Magenta} k}, \pi)$ : the probability of going from state s to state x in k steps under policy π.

Recall that in the start-state formulation:

${\color{Golden} d^{\pi}(s)=\sum_{k=0}^{\infty}{\color{Red} \gamma^k} Pr(s_0 \to s,k,\pi)}$ so

$\begin{align*} \frac{\partial \rho}{\partial \theta} &=\frac{\partial }{\partial \theta}E\left \{ \sum_{t=1}^{\infty} \gamma^{t-1}r_t\mid s_0,\pi\right \}=\frac{\partial {\color{Blue} V^{\pi}}}{\partial \theta}({\color{Red} s_0})\\ &=\sum_{\color{magenta} s}\sum_{k=0}^{\infty}\gamma^kPr({\color{Red} s_0} \to {\color{Magenta} s},k,\pi)\sum_a\frac{\partial \pi({\color{Magenta} s},a)}{\partial \theta}Q^{\pi}({\color{Magenta} s},a) \\ &=\sum_{\color{magenta} s}{\color{Golden} \sum_{k=0}^{\infty}\gamma^kPr({\color{Red} s_0} \to {\color{Magenta} s},k,\pi)}\sum_a\frac{\partial \pi({\color{Magenta} s},a)}{\partial \theta}Q^{\pi}({\color{Magenta} s},a) \\ &=\sum_{\color{Magenta} s}{\color{Golden} d^{\pi}(s)}\sum_a\frac{\partial \pi({\color{Magenta} s},a)}{\partial \theta}Q^{\pi}({\color{Magenta} s},a) \end{align*}$ (Q.E.D)
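The start-state result can be checked numerically on a small problem. The sketch below assumes a hypothetical random MDP (`P`, `R`) and a tabular softmax policy, neither of which comes from the paper; it evaluates the right-hand side of Theorem 1 exactly and compares it with a central finite-difference estimate of $\partial V^{\pi}(s_0)/\partial\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, s0 = 3, 2, 0.9, 0

# Hypothetical random MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def policy(theta):
    """Tabular softmax policy pi(s, a); theta has shape (nS, nA)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Return V^pi, Q^pi and the discounted state weighting d^pi from s0."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state matrix under pi
    r_pi = (pi * R).sum(axis=1)
    A = np.eye(nS) - gamma * P_pi
    V = np.linalg.solve(A, r_pi)                 # V = r_pi + gamma P_pi V
    Q = R + gamma * P @ V                        # equation (2)
    d = np.linalg.solve(A.T, np.eye(nS)[s0])     # sum_k gamma^k Pr(s0 -> s, k, pi)
    return V, Q, d

def policy_gradient(theta):
    """Right-hand side of Theorem 1: sum_s d(s) sum_a dpi(s,a)/dtheta Q(s,a)."""
    pi = policy(theta)
    _, Q, d = evaluate(theta)
    grad = np.zeros_like(theta)
    for s in range(nS):
        # Softmax Jacobian: dpi(s,a)/dtheta[s,b] = pi(s,a) (1[a=b] - pi(s,b)).
        J = np.diag(pi[s]) - np.outer(pi[s], pi[s])
        grad[s] = d[s] * (J @ Q[s])
    return grad

def rho(theta):
    """Start-state objective rho(pi) = V^pi(s0)."""
    return evaluate(theta)[0][s0]

theta = rng.normal(size=(nS, nA))
eps = 1e-6
fd = np.zeros_like(theta)
for idx in np.ndindex(*theta.shape):
    step = np.zeros_like(theta)
    step[idx] = eps
    fd[idx] = (rho(theta + step) - rho(theta - step)) / (2 * eps)

analytic = policy_gradient(theta)
print(np.abs(analytic - fd).max())   # should be close to zero
```

The finite-difference estimate perturbs the whole chain, including the state weighting, yet the analytic expression needs no derivative of $d^{\pi}$.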

—————————-

Stationary Distribution

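Different starting points lead to the same destination: a stationary distribution is left unchanged by any number of transitions,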

$\pi = \pi P^n$

${\color{Magenta} \pi} = {\color{Magenta} \pi} {\color{Red} P}$

where

P : the transition probability matrix

π : the stationary probability distribution

Example: Consider a Markov chain on the state space S = {0, 1, 2} whose one-step transition probability matrix is

${\color{Red} P}=\begin{bmatrix} 0.5 & 0.4 & 0.1\\ 0.3 & 0.4 & 0.3\\ 0.2 & 0.3 & 0.5 \end{bmatrix}$

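Determine whether its limiting distribution and its stationary distribution exist, and compute them.

Solution: It is easy to see that this chain is irreducible and ergodic.

Hence the limiting distribution exists, the stationary distribution exists and is unique, and the stationary distribution is exactly the limiting distribution.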

$\left\{\begin{matrix} {\color{Red} \pi = \pi P}\\ {\color{Magenta} \pi_0 +\pi_1+\pi_2 =1} \end{matrix}\right.$ $\Rightarrow$ $\begin{align*} \pi_0 =\frac{21}{62} \\ \pi_1=\frac{23}{62} \\ \pi_2=\frac{18}{62} \\ \end{align*}$

$\Rightarrow \pi=(\pi_0,\pi_1,\pi_2)=\left ( \frac{21}{62},\ \frac{23}{62},\ \frac{18}{62}\right )$

Verifying the result:

$\begin{align*} {\color{Red} \pi P} &= \left ( \frac{21}{62},\ \frac{23}{62},\ \frac{18}{62}\right )\begin{bmatrix} 0.5 & 0.4 & 0.1\\ 0.3 & 0.4 & 0.3\\ 0.2 & 0.3 & 0.5 \end{bmatrix}\\ &=\left( \frac{21}{62} \cdot 0.5 + \frac{23}{62} \cdot 0.3 + \frac{18}{62} \cdot 0.2 ,\ \frac{21}{62} \cdot 0.4 + \frac{23}{62} \cdot 0.4 + \frac{18}{62} \cdot 0.3,\ \frac{21}{62} \cdot 0.1 + \frac{23}{62} \cdot 0.3 + \frac{18}{62} \cdot 0.5 \right ) \\ &=\left ( \frac{21}{62},\ \frac{23}{62},\ \frac{18}{62}\right ) \\&= {\color{Red}\pi} \end{align*}$
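Because the chain is ergodic, the same limit should be reached from any initial distribution. A minimal sketch, assuming two arbitrary starting distributions and 100 iterations (both choices are illustrative), applies the transition matrix repeatedly:

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
target = np.array([21, 23, 18]) / 62     # the stationary distribution derived above

# Two arbitrary starting distributions (always in state 0, always in state 2).
starts = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])]
limits = []
for mu in starts:
    for _ in range(100):
        mu = mu @ P                      # one step of the chain: mu_{n+1} = mu_n P
    limits.append(mu)

for mu in limits:
    print(mu, np.abs(mu - target).max())
```

Both runs should end at the same vector, which is the limiting distribution of the chain.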

——————

${\color{Magenta} \pi} = {\color{Magenta}\pi }P$

In other words, finding the stationary distribution can be related to finding an eigenvector.

$\begin{align*} {\color{Magenta} \pi} &= {\color{Magenta}\pi }P\\ \pi ^T &= {\color{Red} P^T} \pi ^T\\ \lambda x &= {\color{Red} A}x \\ {\color{Magenta} x}&={\color{Red} A} {\color{Magenta} x} \\ \end{align*} {\color{Blue} \Leftrightarrow} \ {\color{Magenta} Stationary\ Distribution \xleftarrow[eigenvalue=1]{eigenvector} State\ Transition\ Probability\ Matrix\ transpose}$

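(A : an n × n square matrix, here the transpose of the one-step transition probability matrix on the state space S,

λ : an eigenvalue of A, equal to 1 here,

x : a nonzero vector, the eigenvector of the transposed transition probability matrix for eigenvalue 1, which is the stationary distribution once normalized to sum to 1)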

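Conclusion: finding the stationary distribution of a Markov chain on a state space S amounts to finding the eigenvector of P^T, the transpose of its one-step transition probability matrix, that belongs to eigenvalue 1.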
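This eigenvector view can be tried directly with NumPy's `np.linalg.eig`. The sketch below reuses the example matrix P from above; picking the eigenvalue closest to 1 and normalizing the eigenvector to sum to 1 are choices made here for illustration.

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Right eigenvectors of P^T are left eigenvectors of P.
eigvals, eigvecs = np.linalg.eig(P.T)

# Select the eigenvalue closest to 1 and take its eigenvector.
k = np.argmin(np.abs(eigvals - 1.0))
x = np.real(eigvecs[:, k])

# Eigenvectors are defined only up to scale, so normalize to a probability vector.
stationary = x / x.sum()

print(stationary)        # compare with (21/62, 23/62, 18/62)
print(stationary @ P)    # should reproduce the same vector
```

Dividing by `x.sum()` also fixes the sign, since `eig` may return the eigenvector with all entries negative.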

————————

Over the state space S, taking every action a into account when moving to the next state s', the stationary distribution in this paper is d^π. By the stationarity property ${\color{Magenta} \pi} = {\color{Magenta} \pi} {\color{Red} P}$ described above, the following relation holds.

For the average-reward formulation:

$\sum_s {\color{Red} d^{\pi}}(s)\sum_a \pi (s,a)P_{s{\color{Red} s'}}^a={\color{Red} d^{\pi}}({\color{Red} s'}), \quad \forall s' \in S$

—————————

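Because d^π is a stationary distribution, its probabilities sum to 1:

${\color{Blue} \sum_s d^{\pi}(s)}=1$

Since $\frac{\partial \rho}{\partial \theta}$ is independent of s,

${\color{Blue} \sum_s d^{\pi}(s)} \frac{\partial \rho}{\partial \theta}={\color{Blue} 1}\cdot \frac{\partial \rho}{\partial \theta}$

In the average-reward formulation $Q^{\pi}(s,a)=R_s^a-\rho(\pi)+\sum_{s'}P_{ss'}^aV^{\pi}(s')$, so differentiating (1) and solving for $\frac{\partial \rho}{\partial \theta}$ gives

$\frac{\partial \rho}{\partial \theta}=\sum_a\left[\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\pi(s,a)\sum_{s'}P_{ss'}^a\frac{\partial V^{\pi}(s')}{\partial \theta}\right]-\frac{\partial V^{\pi}(s)}{\partial \theta}$

Weighting both sides by d^π(s), summing over s, and using the stationarity of d^π to collapse the middle term into a sum over s', the last two terms cancel:

$\begin{align*} {\color{Blue} \sum_sd^{\pi}(s)}\frac{\partial \rho}{\partial \theta}&=\sum_sd^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a){\color{Orange} + \sum_{s'}d^{\pi}(s')\frac{\partial V^{\pi}(s')}{\partial \theta}} {\color{DarkOrange} -\sum_s d^{\pi}(s)\frac{\partial V^{\pi}(s)}{\partial \theta}}\\ \frac{\partial \rho}{\partial \theta}&=\sum_sd^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a) \end{align*}$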

(Q.E.D)
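The average-reward case can be checked numerically in the same spirit. This is a sketch under stated assumptions: a hypothetical random MDP with strictly positive transition probabilities (so every policy yields an ergodic chain), a tabular softmax policy, and differential values obtained by solving the Poisson equation with the normalization $\sum_s d^{\pi}(s)V^{\pi}(s)=0$. None of these choices comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 3, 2

# Hypothetical MDP with strictly positive transitions, so every policy is ergodic.
P = rng.random((nS, nA, nS)) + 0.1
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def policy(theta):
    """Tabular softmax policy pi(s, a); theta has shape (nS, nA)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Return rho(pi), the differential Q^pi and the stationary distribution d^pi."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    # Stationary distribution: (I - P_pi)^T d = 0 together with sum(d) = 1.
    A = np.vstack([(np.eye(nS) - P_pi).T, np.ones(nS)])
    b = np.append(np.zeros(nS), 1.0)
    d = np.linalg.lstsq(A, b, rcond=None)[0]
    rho = d @ r_pi                                   # average reward per step
    # Differential values: V = r_pi - rho + P_pi V, pinned down by d @ V = 0.
    V = np.linalg.solve(np.eye(nS) - P_pi + np.outer(np.ones(nS), d), r_pi - rho)
    Q = R - rho + P @ V
    return rho, Q, d

def policy_gradient(theta):
    """sum_s d(s) sum_a dpi(s,a)/dtheta Q(s,a) for the tabular softmax policy."""
    pi = policy(theta)
    _, Q, d = evaluate(theta)
    grad = np.zeros_like(theta)
    for s in range(nS):
        J = np.diag(pi[s]) - np.outer(pi[s], pi[s])  # softmax Jacobian in state s
        grad[s] = d[s] * (J @ Q[s])
    return grad

theta = rng.normal(size=(nS, nA))
eps = 1e-6
fd = np.zeros_like(theta)
for idx in np.ndindex(*theta.shape):
    step = np.zeros_like(theta)
    step[idx] = eps
    fd[idx] = (evaluate(theta + step)[0] - evaluate(theta - step)[0]) / (2 * eps)

analytic = policy_gradient(theta)
print(np.abs(analytic - fd).max())   # should be close to zero
```

Adding a constant to the differential values would leave the result unchanged, because $\sum_a \frac{\partial \pi(s,a)}{\partial \theta}=0$ in every state.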

Methods that learn approximations to both policy and value functions are often called actor–critic methods, where ‘actor’ is a reference to the learned policy, and ‘critic’ refers to the learned value function, usually a state-value function.

1. Policy Gradient Theorem

Theorem 1 (Policy Gradient)

For any MDP, in either the average-reward or start-state formulations,

	average-reward formulation	start-state formulation
$\rho(\pi)$	$\pi(s,a,{\color{Red} \theta})=Pr\{ a_t = a \mid s_t = s, {\color{Red} \theta} \}$ $\begin{align} \rho(\pi)&=\lim_{n \to \infty}\frac{1}{n}E\{r_1+r_2+...+r_n\mid \pi \}\\ &=\sum_s {\color{Red} d^{\pi}(s)} \sum_a\pi(s,a)R_s^a \end{align}$ $d^{\pi}(s)=\lim_{t \to \infty}Pr\{ s_t=s\mid s_0, \pi \}$	$\pi(s,a,{\color{Red} \theta})=Pr\{ a_t = a \mid s_t = s, {\color{Red} \theta} \}$ $\rho(\pi)=E\left \{\sum_{t=1}^{\infty} \gamma ^{t-1} r_t\mid s_0, \pi \right \}$ ${\color{Blue} d^{\pi}(s)}=\sum_{t=0}^{\infty}{\color{Red} \gamma ^t}Pr\{ s_t=s\mid {\color{DarkGreen} s_0}, {\color{Magenta} \pi} \}$ Define ${\color{Blue} d^{\pi}(s)}$ as a discounted weighting of states encountered starting at s₀ and then following π
$Q^\pi(s,a)$	$Q^{\pi}(s,a)=\sum_{t=1}^{\infty}E\{r_t-\rho(\pi) \mid s_0=s, a_0=a,\pi\}, \forall s\in S, a\in A$	$Q^{\pi}(s,a)=E \left \{ \sum_{k=1}^{\infty} \gamma ^{k-1} r_{t+k}\mid s_t=s,a_t=a, \pi \right \}$
Policy Gradient	$\frac{\partial \rho}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)$	$\frac{\partial \rho}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)$

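In any event, the key aspect of both expressions for the gradient is that there are no terms of the form $\frac{\partial d^{\pi}(s)}{\partial \theta}$: the effect of policy changes on the distribution of states does not appear.

In other words, the gradient can be estimated without knowing how a change in the policy changes the state distribution.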

2. Policy Gradient with Approximation

Theorem 2 (Policy Gradient with Function Approximation)

If ${\color{Magenta} f_w}$ satisfies

$\sum_s d^{\pi}(s)\sum_a \pi (s,a){\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] \frac{\partial {\color{Magenta} f_{\omega} (s,a)}}{\partial \omega}}=0$ (2a)

———————-

A brief note on where (2a) comes from: ${\color{Magenta} f_w}$ is a learned approximation to the true value Q^π. While following policy π, its parameter ω is updated by the rule

$\begin{align*} \triangle \omega_t &\propto \frac{\partial }{\partial \omega}{\color{magenta} \left [ \hat{Q}^{\pi} (s_t,a_t)-f_{\omega}(s_t,a_t) \right ]^2}\\ &\propto {\color{Red} \left [ \hat{Q}^{\pi} (s_t,a_t)-f_{\omega}(s_t,a_t) \right ]\frac{\partial }{\partial \omega}f_{\omega}(s_t,a_t)} \end{align*}$

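That is, the update is proportional to the derivative with respect to ω of the squared difference between the approximation ${\color{Magenta} f_w}$ and the estimate of the true value Q^π, which is the red term above. When this process has converged to a local optimum, the expected update is zero, which gives (2a).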

———————-

and is compatible with the policy parameterization in the sense that

${\color{Red} \frac{\partial f_{\omega}(s,a)}{\partial \omega}=\frac{\partial \pi(s,a)}{\partial \theta}\cdot \frac{1}{\pi(s,a)}}$ (2b)

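This compatibility condition is important: it acts as the "bridge" between (2a) and Theorem 1, and was probably obtained by the authors by working backwards from the desired conclusion.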

then

${\color{Blue} \frac{\partial \rho}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}f_{\omega}(s,a)}$ (2c)

Proof:

Combining (2a) and (2b),

$\sum_s d^{\pi}(s)\sum_a \pi (s,a){\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] \frac{\partial \pi(s,a)}{\partial \theta}\cdot \frac{1}{\pi(s,a)}}=0$

$\sum_s d^{\pi}(s)\sum_a {\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] \frac{\partial \pi(s,a)}{\partial \theta}}=0$

${\color{Golden} \sum_s d^{\pi}(s)\sum_a \frac{\partial \pi(s,a)}{\partial \theta}{\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] }}=0$ (2d)

Since the left-hand side of (2d) is zero, we can subtract it from the policy gradient expression of Theorem 1,

$\frac{\partial \rho}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)$

to get

$\begin{align*} \frac{\partial \rho}{\partial \theta}&=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a) \\ &=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)-0 \\ &=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)-{\color{Golden} \sum_s d^{\pi}(s)\sum_a \frac{\partial \pi(s,a)}{\partial \theta}{\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] }} \\ &=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}{\color{Magenta} f_{\omega}(s,a)} \end{align*}$ (Q.E.D)
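Theorem 2 can also be illustrated numerically. The sketch below assumes a hypothetical random MDP, random features `phi` and a Gibbs policy in those features (start-state formulation); it fits ω by solving the normal equations that (2a) amounts to when f_ω is linear in the compatible features, then compares the gradient computed with the exact Q^π against the one computed with f_ω.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, L, gamma, s0 = 3, 3, 4, 0.9, 0

# Hypothetical MDP and features; none of these numbers come from the paper.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
phi = rng.normal(size=(nS, nA, L))       # feature vectors phi_sa
theta = rng.normal(size=L)

# Gibbs (softmax) policy in a linear combination of features.
logits = phi @ theta
pi = np.exp(logits - logits.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

# Exact Q^pi and discounted state weighting d^pi.
P_pi = np.einsum("sa,sat->st", pi, P)
A = np.eye(nS) - gamma * P_pi
V = np.linalg.solve(A, (pi * R).sum(axis=1))
Q = R + gamma * P @ V
d = np.linalg.solve(A.T, np.eye(nS)[s0])

# Compatible features (2b): psi_sa = (dpi/dtheta) / pi = phi_sa - sum_b pi(s,b) phi_sb.
psi = phi - np.einsum("sb,sbl->sl", pi, phi)[:, None, :]
weights = d[:, None] * pi                # d(s) pi(s,a)

# For f_w = w^T psi, condition (2a) is the normal equation of a weighted least-squares fit.
F = np.einsum("sa,sal,sam->lm", weights, psi, psi)
b = np.einsum("sa,sa,sal->l", weights, Q, psi)
w = np.linalg.lstsq(F, b, rcond=None)[0]
f = psi @ w                              # f_w(s, a)

dpi = pi[:, :, None] * psi               # dpi(s,a)/dtheta
grad_Q = np.einsum("s,sal,sa->l", d, dpi, Q)   # Theorem 1
grad_f = np.einsum("s,sal,sa->l", d, dpi, f)   # equation (2c)

print(grad_Q)
print(grad_f)
print(np.abs(Q - f).max())               # f_w itself is far from Q^pi
```

The two gradients should agree even though f_ω is a poor pointwise approximation of Q^π: only the component of Q^π along the compatible features matters for the gradient.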

3. Application to Deriving Algorithm and Advantages

Consider a policy that is a Gibbs distribution in a linear combination of features:

${\color{Golden} \pi} (s,a)=\frac{e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}}}{\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}}}$

$\begin{align*} \frac{\partial }{\partial \theta}{\color{Golden} \pi} (s,a)\cdot \frac{1}{\pi(s,a)} &=\frac{\partial }{\partial \theta}\frac{e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}}}{\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}}} \cdot \frac{1}{\pi(s,a)}\\ &=\frac{{\color{Red}\phi _{sa}}e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}}(\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}}) - e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}} (\sum_b {\color{Red} \phi_{sb}} e^{{\color{Blue} \theta^T}{\color{Red} \phi_{sb}}})} {(\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}})^2} \cdot \frac{\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}} } {e^{\color{Blue} \theta^T {\color{Red}\phi _{sa}}}}\\ &=\phi_{sa} -\frac{\sum_b\ \ e^{\theta^T\phi_{sb}}\ \ \phi_{sb}}{\sum_be^{\theta^T \phi_{sb}}} \\ &=\phi_{sa} - \sum_b {\color{Golden} \pi(s,b)}\phi_{sb} \end{align*}$

$f_{\omega}(s,a)=\omega^T\left [ \phi_{sa} - \sum_b {\color{Golden} \pi(s,b)} \phi_{sb}\right ]$

In other words, ${\color{Magenta} f_w}$ must be linear in the same features as the policy, except normalized to be mean zero for each state. The mean is zero because $\sum_a \pi(s,a)=1$ for every θ, so $\sum_a \pi(s,a) f_{\omega}(s,a)=\omega^T\sum_a \frac{\partial \pi(s,a)}{\partial \theta}=0$.
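A quick numeric check of the compatible features for a single state, as a sketch with hypothetical random features `phi_s` and parameters `theta`, `omega` (all illustrative): it confirms that f_ω averages to zero under π and that the closed form for the features matches a finite-difference estimate of $\frac{\partial \pi(s,a)}{\partial \theta}\cdot\frac{1}{\pi(s,a)}$.

```python
import numpy as np

rng = np.random.default_rng(3)
nA, L = 4, 5
phi_s = rng.normal(size=(nA, L))     # hypothetical features phi_sa for one state s
theta = rng.normal(size=L)
omega = rng.normal(size=L)

def gibbs(theta):
    """Gibbs policy pi(s, .) over the actions of this one state."""
    z = phi_s @ theta
    p = np.exp(z - z.max())
    return p / p.sum()

pi_s = gibbs(theta)
psi_s = phi_s - pi_s @ phi_s         # phi_sa - sum_b pi(s,b) phi_sb, one row per action
f_s = psi_s @ omega                  # f_w(s, a) for every action a
mean_f = pi_s @ f_s                  # sum_a pi(s,a) f_w(s,a)

# Central finite differences of pi with respect to each component of theta.
eps = 1e-6
fd = np.stack([(gibbs(theta + eps * e) - gibbs(theta - eps * e)) / (2 * eps)
               for e in np.eye(L)], axis=1)          # shape (nA, L)
score_fd = fd / pi_s[:, None]        # (dpi/dtheta) / pi, estimated numerically

print(mean_f)                             # should be close to zero
print(np.abs(score_fd - psi_s).max())     # should be close to zero
```

Because of this zero-mean property, the paper suggests thinking of f_ω as an approximation of the advantage function rather than of Q^π itself.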

4. Convergence of Policy Iteration with Function Approximation

Theorem 3 (Policy Iteration with Function Approximation)

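Let ${\color{Magenta} \pi}$ and ${\color{Magenta} f_w}$ be any differentiable function approximators for the policy and the value function respectively that satisfy the compatibility condition (2b). (The paper additionally requires bounded second derivatives of π with respect to θ, bounded rewards, and step sizes with $\lim_{k\to \infty}\alpha_k=0$ and $\sum_k \alpha_k=\infty$.) Then the sequence

$\{\rho (\pi_k)\}^{\infty}_{k=0}$

defined by any θ₀, $\pi_k=\pi(.\ ,.\ ,\theta_k)$ , and

$\omega_k=\omega \ such \ that$

$\sum_s d^{\pi_k}(s)\sum_a \pi_k (s,a){\color{Red} \left [ Q^{\pi_k} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] \frac{\partial {\color{Magenta} f_{\omega} (s,a)}}{\partial \omega}}=0$ (corresponding to (2a))

$\theta_{k+1}=\theta_{k}+\alpha_k{\color{Blue} \sum_s d^{\pi_k}(s)\sum_a\frac{\partial \pi_k(s,a)}{\partial \theta}f_{\omega_k}(s,a)}$ (corresponding to (2c))

converges such that

${\color{Blue} \lim_{k\to \infty} \frac{\partial \rho (\pi_k)}{\partial \theta}=0}$

which means the iteration converges to a locally optimal policy.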
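The iteration in Theorem 3 can be sketched on a toy problem. Everything specific below is an assumption made for illustration: a hypothetical random MDP, one-hot features (a tabular Gibbs policy), exact evaluation of Q^π and d^π instead of estimates from experience, and a fixed small step size `alpha` rather than the decaying sequence required by the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, s0 = 3, 2, 0.5, 0

# Hypothetical MDP with rewards in [0, 1).
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
phi = np.eye(nS * nA).reshape(nS, nA, nS * nA)   # one-hot features phi_sa

def step_quantities(theta):
    """Return rho(pi_k) and the update direction (2c) for the current theta."""
    logits = phi @ theta
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum("sa,sat->st", pi, P)
    A = np.eye(nS) - gamma * P_pi
    V = np.linalg.solve(A, (pi * R).sum(axis=1))
    Q = R + gamma * P @ V
    d = np.linalg.solve(A.T, np.eye(nS)[s0])
    psi = phi - np.einsum("sb,sbl->sl", pi, phi)[:, None, :]   # compatibility (2b)
    wts = d[:, None] * pi
    F = np.einsum("sa,sal,sam->lm", wts, psi, psi)
    b = np.einsum("sa,sa,sal->l", wts, Q, psi)
    w = np.linalg.lstsq(F, b, rcond=None)[0]                   # w_k satisfying (2a)
    f = psi @ w
    direction = np.einsum("sa,sal,sa->l", wts, psi, f)         # uses dpi/dtheta = pi * psi
    return V[s0], direction

theta = np.zeros(nS * nA)            # theta_0: uniform random policy
alpha = 0.01                         # fixed step size, an assumption for this toy problem
rho_history, grad_norms = [], []
for k in range(500):
    rho_k, direction = step_quantities(theta)
    rho_history.append(rho_k)
    grad_norms.append(np.linalg.norm(direction))
    theta = theta + alpha * direction

print(rho_history[0], rho_history[-1])
print(grad_norms[0], grad_norms[-1])
```

With a step size this small, ρ(π_k) should increase from one iteration to the next; how quickly the gradient norm shrinks depends on the problem, and with sampled estimates of Q^π the step-size conditions of the theorem would matter.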

Softmax function, softargmax, or normalized exponential function

${\color{Golden} \pi} (s,a)=\frac{e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}}}{\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}}}$

where ${\color{Red} \phi _{sa}}$ : an L-dimensional feature vector characterizing the state-action pair (s, a).