Policy Gradient Methods for Reinforcement Learning with Function Approximation

 



Math Analysis


Markov Decision Processes and Policy Gradient


So far in this book almost all the methods have been action-value methods; they learned the values of actions and then selected actions based on their estimated action values; their policies would not even exist without the action-value estimates. In this chapter we consider methods that instead learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but is not required for action selection.

Method | Value function | Policy
Action-value methods | learn the value of actions | would not even exist without the action-value estimates
Policy gradient methods | not consulted for action selection; may still be used to learn the policy parameter | learn a parameterized policy

\pi (a \mid s, \theta) = Pr\{A_t=a \mid S_t=s, \theta_t=\theta\}

(Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto)


In this paper, the policy is represented by its own function approximator (FA), independent of the value function, and is updated according to the gradient of the expected reward with respect to the policy parameters. The paper's main new result is that this gradient can be written in a form suitable for estimation from experience, aided by an approximate action-value or advantage function.

\frac{\partial {\color{Red} \rho}}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a,\theta)}{\partial \theta}Q^{\pi}(s,a)


Notation

P_{ss'}^a=Pr\{s_{t+1}=s'\ \mid s_t=s, a_t=a \} : state transition probabilities.

R_s^a=E\{ r_{t+1} \mid s_t=s,a_t=a \} : expected rewards, \forall s,s' \in S, a \in A.

\pi(s,a,{\color{Red} \theta})=Pr\{ a_t = a \mid s_t = s, {\color{Red} \theta} \} : the policy, i.e., the agent's decision-making procedure at each time step, \forall s \in S, a \in A, where {\color{Red} \theta} \in R^l, l \ll \left | S \right |, is a {\color{Red} parameter\ vector}. We assume \frac{\partial \pi(s,a)}{\partial {\color{Red} \theta}} exists; \pi(s,a) is shorthand for \pi(s,a,{\color{Red} \theta}).

{\color{Red} \rho }({\color{Magenta} \pi}) : the performance measure of the policy (in the average-reward formulation, the long-term expected reward per step; in the start-state formulation, the expected discounted return from the start state s_0). There are two ways of formulating the agent's objective: the average-reward formulation and the start-state formulation. {\color{Red} \rho }({\color{Magenta} \pi}) is independent of the state.

d^{\pi}(s)=lim_{t \to \infty}Pr\{ s_t=s\mid s_0, \pi \} : the stationary distribution of states under \pi, which is assumed to exist and to be independent of s_0 (average-reward formulation).

{\color{Red} \gamma} \in [0, 1] : a discount rate. In the start-state formulation, we define {\color{Blue} d^{\pi}(s)} as a discounted weighting of states encountered starting at s_0 and then following \pi : {\color{Blue} d^{\pi}(s)}=\sum_{t=0}^{\infty} {\color{Red} \gamma^t} Pr\{s_t=s\mid s_0,\pi\}.

Q^{\pi}(s,a) : the value of a state-action pair under policy \pi.

{\color{Magenta} \pi} : the policy's own differentiable function approximator.

{\color{Magenta} f_w} : S \times A \to R, our approximation to Q^{\pi}, with parameter \omega (a function approximator for the value function).

{\color{Red} \hat{Q}^{\pi} (s_t,a_t)} : some unbiased estimator of Q^{\pi}(s_t, a_t), perhaps R_t.


Key Steps in the Proof of Theorem 1 (Policy Gradient)

Define

{\color{Blue} V^{\pi}(s)}=\sum_a \pi (s,a){\color{magenta} Q^{\pi}(s,a)}   (1)

For the start-state formulation:

{\color{magenta} Q^{\pi}(s,a)}=R_s^a+\sum_{s'} \gamma P_{ss'}^a {\color{Purple} V^{\pi}(s')}   (2)

so

\begin{align*} \frac{\partial {\color{magenta}Q^{\pi}(s,a)}}{\partial \theta}&=\frac{\partial \left [ R_s^a + \sum_{s'}\gamma P_{ss'}^a {\color{Purple} V^{\pi}(s')} \right ]}{\partial \theta} \\ &=\sum_{s'}\gamma P_{ss'}^a {\color{Purple} \frac{\partial V^{\pi}(s')}{\partial \theta}} \end{align*}   (3)

Next, differentiate (1) with respect to \theta:

\begin{align*} \frac{\partial {\color{Blue} V^{\pi}(s)}}{\partial \theta}&=\frac{\partial [\sum_a \pi (s,a){\color{magenta} Q^{\pi}(s,a)}]}{\partial \theta} \\ &=\sum_{a} \left[ \frac{\partial \pi (s,a)}{\partial \theta} {\color{magenta} Q^{\pi}(s,a)} + \pi (s,a) \frac{\partial {\color{magenta} Q^{\pi}(s,a)}}{\partial \theta}\right ] \\ &=\sum_{a} \left[ \frac{\partial \pi (s,a)}{\partial \theta} {\color{magenta} Q^{\pi}(s,a)} + \pi (s,a) \sum_{s'} \gamma P_{ss'}^a {\color{Purple} \frac{\partial V^{\pi}(s')}{\partial \theta}} \right ] \\ \end{align*}(4)

so

\begin{align*} {\color{Purple}\frac{\partial V^{\pi}(s')}{\partial \theta}}&=\frac{\partial [\sum_{a'} \pi (s',a'){\color{magenta} Q^{\pi}(s',a')}]}{\partial \theta} \\ &=\sum_{a'} \left[ \frac{\partial \pi (s',a')}{\partial \theta} {\color{magenta} Q^{\pi}(s',a')} + \pi (s',a') \frac{\partial {\color{magenta} Q^{\pi}(s',a')}}{\partial \theta}\right ] \\ &=\sum_{a'} \left[ \frac{\partial \pi (s',a')}{\partial \theta} {\color{magenta} Q^{\pi}(s',a')} + \pi (s',a') \sum_{s''} \gamma P_{s's''}^{a'} {\color{Red} \frac{\partial V^{\pi}(s'')}{\partial \theta}} \right ] \\ \end{align*}(5)

so

\begin{align*} {\color{red} \frac{\partial V^{\pi}(s'')}{\partial \theta}} &=\frac{\partial [\sum_{a''} \pi (s'',a''){\color{magenta} Q^{\pi}(s'',a'')}]}{\partial \theta} \\ &=\sum_{a''} \left[ \frac{\partial \pi (s'',a'')}{\partial \theta} {\color{magenta} Q^{\pi}(s'',a'')} + \pi (s'',a'') \frac{\partial {\color{magenta} Q^{\pi}(s'',a'')}}{\partial \theta}\right ] \\ &=\sum_{a''} \left[ \frac{\partial \pi (s'',a'')}{\partial \theta} {\color{magenta} Q^{\pi}(s'',a'')} + \pi (s'',a'') \sum_{s'''} \gamma P_{s''s'''}^{a''} {\color{DarkRed}\frac{\partial V^{\pi}(s''')}{\partial \theta}} \right ] \\ \end{align*}     (6)

so

\begin{align*} {\color{DarkRed} \frac{\partial V^{\pi}(s''')}{\partial \theta}} &=\frac{\partial [\sum_{a'''} \pi (s''',a'''){\color{magenta} Q^{\pi}(s''',a''')}]}{\partial \theta} \\ &=\sum_{a'''} \left[ \frac{\partial \pi (s''',a''')}{\partial \theta} {\color{magenta} Q^{\pi}(s''',a''')} + \pi (s''',a''') \frac{\partial {\color{magenta} Q^{\pi}(s''',a''')}}{\partial \theta}\right ] \\ &=\sum_{a'''} \left[ \frac{\partial \pi (s''',a''')}{\partial \theta} {\color{magenta} Q^{\pi}(s''',a''')} + \pi (s''',a''') \sum_{s''''} \gamma P_{s'''s''''}^{a'''} {\color{Golden}\frac{\partial V^{\pi}(s'''')}{\partial \theta}} \right ] \\ \end{align*}(7)

. . .

Substituting (7) into (6), the result into (5), and then into (4), and continuing to unroll the recursion over all future time steps, we obtain:

\begin{align*} \frac{\partial {\color{Blue} V^{\pi}(s)}}{\partial \theta} &=\frac{\partial [\sum_a\pi (s,a){\color{magenta} Q^{\pi}(s,a)}]}{\partial \theta} \\ &=\sum_{a} \left[ \frac{\partial \pi (s,a)}{\partial \theta} {\color{magenta} Q^{\pi}(s,a)} + \pi (s,a) \frac{\partial {\color{magenta} Q^{\pi}(s,a)}}{\partial \theta}\right ] \\ &=\sum_{\color{Red} x} \sum_{k=0}^{\infty} \gamma ^{\color{Magenta} k} Pr(s \to {\color{Red} x},{\color{Magenta} k},\pi)\sum_a \frac{\partial \pi({\color{Red} x},a)}{\partial \theta} Q^{\pi}({\color{Red} x},a) \end{align*}

where {\color{Red} x} is a state (such as s', s'', ...), {\color{Magenta} k} is the number of steps from state s to state {\color{Red} x}, \pi is the policy, and Pr(s \to {\color{Red} x}, {\color{Magenta} k}, \pi) is the probability of going from state s to state {\color{Red} x} in {\color{Magenta} k} steps under policy \pi.

 

so

\begin{align*} \frac{\partial \rho}{\partial \theta} &=\frac{\partial }{\partial \theta}E\left \{ \sum_{t=1}^{\infty} \gamma^{t-1}r_t\mid s_0,\pi\right \}=\frac{\partial {\color{Blue} V^{\pi}}}{\partial \theta}({\color{Red} s_0})\\ &=\sum_{\color{magenta} s}\sum_{k=0}^{\infty}\gamma^kPr({\color{Red} s_0} \to {\color{Magenta} s},k,\pi)\sum_a\frac{\partial \pi({\color{Magenta} s},a)}{\partial \theta}Q^{\pi}({\color{Magenta} s},a) \end{align*}

Recalling the definition of d^{\pi}(s) in the start-state formulation,

{\color{Golden} d^{\pi}(s)=\sum_{k=0}^{\infty}{\color{Red} \gamma^k} Pr(s_0 \to s,k,\pi)}

so

\begin{align*} \frac{\partial \rho}{\partial \theta} &=\frac{\partial }{\partial \theta}E\left \{ \sum_{t=1}^{\infty} \gamma^{t-1}r_t\mid s_0,\pi\right \}=\frac{\partial {\color{Blue} V^{\pi}}}{\partial \theta}({\color{Red} s_0})\\ &=\sum_{\color{magenta} s}\sum_{k=0}^{\infty}\gamma^kPr({\color{Red} s_0} \to {\color{Magenta} s},k,\pi)\sum_a\frac{\partial \pi({\color{Magenta} s},a)}{\partial \theta}Q^{\pi}({\color{Magenta} s},a) \\ &=\sum_{\color{magenta} s}{\color{Golden} \sum_{k=0}^{\infty}\gamma^kPr({\color{Red} s_0} \to {\color{Magenta} s},k,\pi)}\sum_a\frac{\partial \pi({\color{Magenta} s},a)}{\partial \theta}Q^{\pi}({\color{Magenta} s},a) \\ &=\sum_{\color{Magenta} s}{\color{Golden} d^{\pi}(s)}\sum_a\frac{\partial \pi({\color{Magenta} s},a)}{\partial \theta}Q^{\pi}({\color{Magenta} s},a) \end{align*}(Q.E.D)
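As a sanity check on this result (not part of the paper), the start-state formula can be verified numerically on a tiny made-up MDP: compute d^π, Q^π, and the analytic gradient exactly, then compare against a finite-difference estimate of ∂ρ/∂θ. The MDP, the one-hot features, and the Gibbs (softmax) policy below are illustrative assumptions; the softmax form anticipates the example used later in Section 3.

```python
import numpy as np

# Numerical sanity check of Theorem 1, start-state formulation, on a made-up MDP.
# Transition probabilities, rewards, features, and the softmax policy are all
# illustrative assumptions, not anything taken from the paper.
nS, nA, gamma, s0 = 2, 2, 0.9, 0

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = Pr{s_{t+1}=s' | s_t=s, a_t=a}
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                 # R[s, a] = expected immediate reward
              [0.5, 2.0]])
phi = np.eye(nS * nA)                     # feature vector phi_sa (one-hot here)

def policy(theta):
    """Gibbs (softmax) policy: pi(s, a) proportional to exp(theta^T phi_sa)."""
    prefs = (phi @ theta).reshape(nS, nA)
    e = np.exp(prefs - prefs.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def rho(theta):
    """rho(pi) = E{ sum_t gamma^{t-1} r_t | s0, pi } = V^pi(s0)."""
    pi = policy(theta)
    P_pi = np.einsum('sa,saz->sz', pi, P)                 # Pr{s' | s} under pi
    r_pi = (pi * R).sum(axis=1)                           # expected reward per state
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)  # V^pi
    return V[s0]

def analytic_grad(theta):
    """Theorem 1: sum_s d^pi(s) sum_a dpi(s,a)/dtheta * Q^pi(s,a)."""
    pi = policy(theta)
    P_pi = np.einsum('sa,saz->sz', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('saz,z->sa', P, V)          # Q^pi(s, a)
    # d^pi(s) = sum_k gamma^k Pr(s0 -> s, k, pi) = [(I - gamma P_pi)^{-1}]_{s0, s}
    d = np.linalg.solve((np.eye(nS) - gamma * P_pi).T, np.eye(nS)[s0])
    grad = np.zeros_like(theta)
    for s in range(nS):
        mean_feat = sum(pi[s, b] * phi[s * nA + b] for b in range(nA))
        for a in range(nA):
            dpi_dtheta = pi[s, a] * (phi[s * nA + a] - mean_feat)  # softmax score
            grad += d[s] * dpi_dtheta * Q[s, a]
    return grad

theta = np.array([0.3, -0.2, 0.5, 0.1])
h = 1e-6
fd = np.array([(rho(theta + h * e) - rho(theta - h * e)) / (2 * h) for e in np.eye(theta.size)])
print(np.allclose(analytic_grad(theta), fd, atol=1e-5))   # True: the two gradients agree
```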

—————————-

Stationary Distribution

No matter where the Markov chain starts, in the long run it settles into the same distribution over states.

\pi = \pi P^n

or

{\color{Magenta} \pi} = {\color{Magenta} \pi} {\color{Red} P}

where

P : the one-step transition probability matrix

π : the stationary probability distribution (a row vector; not to be confused with the policy π elsewhere in this post)

Example: consider a Markov chain with state space S = {0, 1, 2} and one-step transition probability matrix

{\color{Red} P}=\begin{bmatrix} 0.5 & 0.4 & 0.1\\ 0.3 & 0.4 & 0.3\\ 0.2 & 0.3 & 0.5 \end{bmatrix}

Do its limiting distribution and stationary distribution exist? If so, compute them.

Solution: this chain is clearly irreducible and ergodic.

Hence the limiting distribution exists, the stationary distribution exists and is unique, and the stationary distribution equals the limiting distribution.

\left\{\begin{matrix} {\color{Red} \pi = \pi P}\\ {\color{Magenta} \pi_0 +\pi_1+\pi_2 =1} \end{matrix}\right.  \Rightarrow  \begin{align*} \pi_0 =\frac{21}{62} \\ \pi_1=\frac{23}{62} \\ \pi_2=\frac{18}{62} \\ \end{align*}

\Rightarrow \pi=(\pi_0,\pi_1,\pi_2)=\left ( \frac{21}{62},\ \frac{23}{62},\ \frac{18}{62}\right )

Verify the result:

\begin{align*} {\color{Red} \pi P} &= \left ( \frac{21}{62},\ \frac{23}{62},\ \frac{18}{62}\right )\begin{bmatrix} 0.5 & 0.4 & 0.1\\ 0.3 & 0.4 & 0.3\\ 0.2 & 0.3 & 0.5 \end{bmatrix}\\ &=\left( \frac{21}{62} \cdot 0.5 + \frac{23}{62} \cdot 0.3 + \frac{18}{62} \cdot 0.2 ,\ \frac{21}{62} \cdot 0.4 + \frac{23}{62} \cdot 0.4 + \frac{18}{62} \cdot 0.3,\ \frac{21}{62} \cdot 0.1 + \frac{23}{62} \cdot 0.3 + \frac{18}{62} \cdot 0.5 \right ) \\ &=\left ( \frac{21}{62},\ \frac{23}{62},\ \frac{18}{62}\right ) \\&= {\color{Red}\pi} \end{align*}

——————

{\color{Magenta} \pi} = {\color{Magenta}\pi }P

In other words, finding the stationary distribution can be tied to finding an eigenvector.

\begin{align*} {\color{Magenta} \pi} &= {\color{Magenta}\pi }P\\ \pi ^T &= {\color{Red} P^T} \pi ^T\\ \lambda x &= {\color{Red} A}x \\ {\color{Magenta} x}&={\color{Red} A} {\color{Magenta} x} \\ \end{align*} {\color{Blue} \Leftrightarrow} \ {\color{Magenta} Stationary\ Distribution \xleftarrow[eigenvalue=1]{eigenvector} State\ Transition\ Probability\ Matrix\ transpose}

(A : an n-by-n square matrix, here the transpose of the one-step transition probability matrix over the state space S;

λ : an eigenvalue of A, here equal to 1;

x : a nonzero vector, the eigenvector of the transposed transition probability matrix associated with eigenvalue 1, i.e., the stationary distribution.)

Conclusion: finding the stationary distribution of a Markov chain over a state space S amounts to finding the eigenvector of P^T (the transpose of its one-step transition probability matrix) associated with eigenvalue 1, normalized to sum to 1.
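A quick numerical check of this conclusion on the 3-state example above (a minimal NumPy sketch):

```python
import numpy as np

# The stationary distribution is the eigenvector of P^T for eigenvalue 1,
# normalized to sum to 1, for the 3-state example above.
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))      # index of the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                        # normalize so the entries sum to 1

print(pi)                                 # approx [0.3387, 0.3710, 0.2903]
print(np.array([21, 23, 18]) / 62)        # the hand-computed answer above
print(np.allclose(pi @ P, pi))            # True: pi P = pi
```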

————————

In the state space S, accounting for every action a that leads to a next state s', the stationary distribution in this paper is d^{\pi}. By the discussion of stationary distributions above ({\color{Magenta} \pi} = {\color{Magenta} \pi} {\color{Red} P}), we obtain the following balance relation:

For the average-reward formulation:

\sum_s {\color{Red} d^{\pi}}(s)\sum_a \pi (s,a)P_{s{\color{Red} s'}}^a={\color{Red} d^{\pi}}({\color{Red} s'}), \ \ \forall {\color{Red} s'} \in S

—————————

Since d^{\pi} is a probability distribution over states, it sums to 1:

{\color{Blue} \sum_s d^{\pi}(s)}=1

\frac{\partial \rho (\pi)}{\partial \theta} is independent of s, so multiplying it by this sum changes nothing:

{\color{Blue} \sum_s d^{\pi}(s)} \frac{\partial \rho}{\partial \theta}={\color{Blue} 1}\cdot \frac{\partial \rho}{\partial \theta}

so, using the average-reward Bellman equation Q^{\pi}(s,a)=R_s^a-\rho(\pi)+\sum_{s'}P_{ss'}^a V^{\pi}(s') and differentiating V^{\pi}(s)=\sum_a \pi(s,a)Q^{\pi}(s,a) with respect to \theta, we get, for every state s,

\frac{\partial \rho}{\partial \theta}=\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\sum_a\pi(s,a)\sum_{s'}P_{ss'}^a\frac{\partial V^{\pi}(s')}{\partial \theta}-\frac{\partial V^{\pi}(s)}{\partial \theta}

Weighting by d^{\pi}(s), summing over s, and applying the stationarity relation above, the middle term becomes \sum_{s'}d^{\pi}(s')\frac{\partial V^{\pi}(s')}{\partial \theta} and cancels with the last term:

\begin{align*} {\color{Blue} \sum_sd^{\pi}(s)}\frac{\partial \rho}{\partial \theta}&=\sum_sd^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a){\color{Orange} + \sum_{s'}d^{\pi}(s')\frac{\partial V^{\pi}(s')}{\partial \theta}} {\color{DarkOrange} -\sum_s d^{\pi}(s)\frac{\partial V^{\pi}(s)}{\partial \theta}}\\ \frac{\partial \rho}{\partial \theta}&=\sum_sd^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a) \end{align*}

(Q.E.D)
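The average-reward version can be checked numerically in the same way as the start-state version above: compute the stationary distribution d^π and the differential values Q^π exactly, and compare the analytic gradient with a finite-difference estimate of ρ(θ). As before, the MDP, features, and softmax policy are made-up assumptions for illustration.

```python
import numpy as np

# Sanity check of Theorem 1, average-reward formulation, on the same made-up MDP.
nS, nA = 2, 2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
phi = np.eye(nS * nA)

def policy(theta):
    prefs = (phi @ theta).reshape(nS, nA)
    e = np.exp(prefs - prefs.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(P_pi):
    """d^pi: left eigenvector of P_pi for eigenvalue 1, normalized to sum to 1."""
    w, v = np.linalg.eig(P_pi.T)
    d = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return d / d.sum()

def rho(theta):
    pi = policy(theta)
    d = stationary(np.einsum('sa,saz->sz', pi, P))
    return d @ (pi * R).sum(axis=1)        # rho(pi) = sum_s d(s) sum_a pi(s,a) R_s^a

def analytic_grad(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,saz->sz', pi, P)
    r_pi = (pi * R).sum(axis=1)
    d = stationary(P_pi)
    avg = d @ r_pi
    # Differential value V^pi: any particular solution of (I - P_pi) V = r_pi - rho.
    V = np.linalg.lstsq(np.eye(nS) - P_pi, r_pi - avg, rcond=None)[0]
    Q = R - avg + np.einsum('saz,z->sa', P, V)   # Q^pi(s,a) = R_s^a - rho + sum_s' P V(s')
    grad = np.zeros_like(theta)
    for s in range(nS):
        mean_feat = sum(pi[s, b] * phi[s * nA + b] for b in range(nA))
        for a in range(nA):
            grad += d[s] * pi[s, a] * (phi[s * nA + a] - mean_feat) * Q[s, a]
    return grad

theta = np.array([0.3, -0.2, 0.5, 0.1])
h = 1e-6
fd = np.array([(rho(theta + h * e) - rho(theta - h * e)) / (2 * h) for e in np.eye(theta.size)])
print(np.allclose(analytic_grad(theta), fd, atol=1e-5))   # True
```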


Methods that learn approximations to both policy and value functions are often called actor-critic methods, where 'actor' is a reference to the learned policy, and 'critic' refers to the learned value function, usually a state-value function.


1. Policy Gradient Theorem

Theorem 1 (Policy Gradient)

For any MDP, in either the average-reward or start-state formulations,

\frac{\partial \rho}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)

In both formulations the policy is parameterized as \pi(s,a,{\color{Red} \theta})=Pr\{ a_t = a \mid s_t = s, {\color{Red} \theta} \}; the two formulations differ in how \rho(\pi), d^{\pi}(s), and Q^{\pi}(s,a) are defined.

Average-reward formulation:

\begin{align*} \rho(\pi)&=lim_{n \to \infty}\frac{1}{n}E\{r_1+r_2+...+r_n\mid \pi \}\\ &=\sum_s {\color{Red} d^{\pi}(s)} \sum_a\pi(s,a)R_s^a \end{align*}

d^{\pi}(s)=lim_{t \to \infty}Pr\{ s_t=s\mid s_0, \pi \}

Q^{\pi}(s,a)=\sum_{t=1}^{\infty}E\{r_t-\rho(\pi) \mid s_0=s, a_0=a,\pi\}, \forall s\in S, a\in A

Start-state formulation:

\rho(\pi)=E\left \{\sum_{t=1}^{\infty} \gamma ^{t-1} r_t\mid s_0, \pi \right \}

{\color{Blue} d^{\pi}(s)}=\sum_{t=0}^{\infty}{\color{Red} \gamma ^t}Pr\{ s_t=s\mid {\color{DarkGreen} s_0}, {\color{Magenta} \pi} \} (a discounted weighting of states encountered starting at s_0 and then following \pi)

Q^{\pi}(s,a)=E \left \{ \sum_{k=1}^{\infty} \gamma ^{k-1} r_{t+k}\mid s_t=s,a_t=a, \pi \right \}

The policy gradient has the same form in both formulations.

 

In any event, the key aspect of both expressions for the gradient is that there are no terms of the form {\color{blue} \frac{\partial d^{\pi}(s)}{\partial \theta}} :
the effect of policy changes on the distribution of states does not appear.

In other words, although changing the policy does change the distribution of states, that effect does not have to be estimated: no such term appears in the gradient expression.

2. Policy Gradient with Approximation

Theorem 2 (Policy Gradient with Function Approximation)

If {\color{Magenta} f_w} satisfies

\sum_s d^{\pi}(s)\sum_a \pi (s,a){\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] \frac{\partial {\color{Magenta} f_{\omega} (s,a)}}{\partial \omega}}=0   (2a)

———————-

A brief note on where condition (2a) comes from: the learned approximation {\color{Magenta} f_w} (of the true action values Q^{\pi} under policy \pi) is updated by the rule

\begin{align*} \triangle \omega_t &\propto \frac{\partial }{\partial \omega}{\color{magenta} \left [ \hat{Q}^{\pi} (s_t,a_t)-f_{\omega}(s_t,a_t) \right ]^2}\\ &\propto {\color{Red} \left [ \hat{Q}^{\pi} (s_t,a_t)-f_{\omega}(s_t,a_t) \right ]\frac{\partial }{\partial \omega}f_{\omega}(s_t,a_t)} \end{align*}

that is, \Delta\omega is proportional to the gradient, with respect to \omega, of the squared error between the approximation {\color{Magenta} f_w} and an estimate of the true value Q^{\pi} (the red part above). When this process converges to a local optimum, the expected update is zero, which gives (2a).

———————-

and is compatible with the policy parameterization in the sense that

{\color{Red} \frac{\partial f_{\omega}(s,a)}{\partial \omega}=\frac{\partial \pi(s,a)}{\partial \theta}\cdot \frac{1}{\pi(s,a)}}    (2b)

This compatibility condition is important: it acts as a 'bridge' between the two function approximators, and was probably obtained by working backwards from the desired conclusion.

then

{\color{Blue} \frac{\partial \rho}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}f_{\omega}(s,a)}   (2c)

Proof:

Combining (2a) and (2b),

\sum_s d^{\pi}(s)\sum_a \pi (s,a){\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] \frac{\partial \pi(s,a)}{\partial \theta}\cdot \frac{1}{\pi(s,a)}}=0

so

\sum_s d^{\pi}(s)\sum_a {\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] \frac{\partial \pi(s,a)}{\partial \theta}}=0

{\color{Golden} \sum_s d^{\pi}(s)\sum_a \frac{\partial \pi(s,a)}{\partial \theta}{\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] }}=0  (2d)

Since (2d) equals zero, we can subtract it from the Theorem 1 expression

\frac{\partial \rho}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)

to get

\begin{align*} \frac{\partial \rho}{\partial \theta}&=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a) \\ &=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)-0 \\ &=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)-{\color{Golden} \sum_s d^{\pi}(s)\sum_a \frac{\partial \pi(s,a)}{\partial \theta}{\color{Red} \left [ Q^{\pi} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] }} \\ &=\sum_s d^{\pi}(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}{\color{Magenta} f_{\omega}(s,a)} \end{align*}(Q.E.D)
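A small numerical sketch of Theorem 2, under the same made-up MDP and softmax-policy assumptions as in the earlier checks: fit ω by the weighted least-squares problem whose first-order condition is (2a), then check that the approximate gradient (2c) coincides with the exact gradient of Theorem 1.

```python
import numpy as np

# Sketch of Theorem 2 on a made-up MDP: fit omega so that (2a) holds, then
# verify that the approximate gradient (2c) equals the exact Theorem 1 gradient.
nS, nA, gamma, s0 = 2, 2, 0.9, 0
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
phi = np.eye(nS * nA)
theta = np.array([0.3, -0.2, 0.5, 0.1])

prefs = (phi @ theta).reshape(nS, nA)
pi = np.exp(prefs) / np.exp(prefs).sum(axis=1, keepdims=True)       # Gibbs policy
P_pi = np.einsum('sa,saz->sz', pi, P)
r_pi = (pi * R).sum(axis=1)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
Q = R + gamma * np.einsum('saz,z->sa', P, V)                        # exact Q^pi
d = np.linalg.solve((np.eye(nS) - gamma * P_pi).T, np.eye(nS)[s0])  # discounted weighting

# Compatible features (2b): df_w/dw = (dpi/dtheta) / pi = phi_sa - sum_b pi(s,b) phi_sb
psi = np.array([[phi[s * nA + a] - pi[s] @ phi[s * nA:(s + 1) * nA]
                 for a in range(nA)] for s in range(nS)])

# (2a) is the first-order condition of a least-squares fit of Q^pi onto the
# compatible features, weighted by d^pi(s) pi(s,a).
sw = np.sqrt((d[:, None] * pi).reshape(-1))
X = psi.reshape(-1, theta.size)
w = np.linalg.lstsq(X * sw[:, None], Q.reshape(-1) * sw, rcond=None)[0]
f = (X @ w).reshape(nS, nA)                                         # f_w(s, a)

# For the Gibbs policy, dpi(s,a)/dtheta = pi(s,a) * psi_sa, so:
grad_true = np.einsum('s,sa,sak,sa->k', d, pi, psi, Q)   # Theorem 1
grad_fa   = np.einsum('s,sa,sak,sa->k', d, pi, psi, f)   # Theorem 2, equation (2c)
print(np.allclose(grad_true, grad_fa))                   # True
```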


3. Application to Deriving Algorithms and Advantages

Consider a policy that is a Gibbs distribution in a linear combination of features:

{\color{Golden} \pi} (s,a)=\frac{e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}}}{\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}}}

so

\begin{align*} \frac{\partial }{\partial \theta}{\color{Golden} \pi} (s,a)\cdot \frac{1}{\pi(s,a)} &=\frac{\partial }{\partial \theta}\frac{e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}}}{\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}}} \cdot \frac{1}{\pi(s,a)}\\ &=\frac{{\color{Red}\phi _{sa}}e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}}(\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}}) - e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}} (\sum_b {\color{Red} \phi_{sb}} e^{{\color{Blue} \theta^T}{\color{Red} \phi_{sb}}})} {(\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}})^2} \cdot \frac{\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}} } {e^{\color{Blue} \theta^T {\color{Red}\phi _{sa}}}}\\ &=\phi_{sa} -\frac{\sum_b\ \ e^{\theta^T\phi_{sb}}\ \ \phi_{sb}}{\sum_be^{\theta^T \phi_{sb}}} \\ &=\phi_{sa} - \sum_b {\color{Golden} \pi(s,b)}\phi_{sb} \end{align*}

so

f_{\omega}(s,a)=\omega^T\left [ \phi_{sa} - \sum_b {\color{Golden} \pi(s,b)} \phi_{sb}\right ]

In other words, f_w must be linear in the same features as the policy, except normalized to be mean zero for each state. (Why mean zero? Because \sum_a \pi(s,a)\left[\phi_{sa}-\sum_b \pi(s,b)\phi_{sb}\right]=0, so under the compatibility condition f_{\omega} necessarily averages to zero over the actions in each state. In this sense it is better to think of f_{\omega} as an approximation of the advantage function A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s) than of Q^{\pi}.)
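A minimal sketch (with made-up features) that checks the two facts just derived: the score function of the Gibbs policy equals φ_sa − Σ_b π(s,b)φ_sb, and the resulting compatible f_ω has mean zero over actions in every state.

```python
import numpy as np

# Check the Gibbs-policy score function and the per-state mean-zero property of f_w.
rng = np.random.default_rng(0)
nS, nA, L = 3, 4, 5
phi = rng.normal(size=(nS, nA, L))        # phi_sa: L-dimensional feature vectors (made up)
theta = rng.normal(size=L)
w = rng.normal(size=L)

def pi(theta):
    prefs = phi @ theta                   # theta^T phi_sa, shape (nS, nA)
    e = np.exp(prefs - prefs.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

p = pi(theta)
score = phi - np.einsum('sb,sbl->sl', p, phi)[:, None, :]   # phi_sa - sum_b pi(s,b) phi_sb

# Finite-difference check of d log pi(s,a) / d theta against the score function.
h = 1e-6
fd = np.stack([(np.log(pi(theta + h * e)) - np.log(pi(theta - h * e))) / (2 * h)
               for e in np.eye(L)], axis=-1)
print(np.allclose(fd, score, atol=1e-5))       # True

# Compatible f_w(s,a) = w^T [phi_sa - sum_b pi(s,b) phi_sb]; mean zero in every state.
f = score @ w
print(np.allclose((p * f).sum(axis=1), 0.0))   # True: sum_a pi(s,a) f_w(s,a) = 0
```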


4. Convergence of Policy Iteration with Function Approximation

Theorem 3 (Policy Iteration with Function Approximation)

Let {\color{Magenta} \pi} and {\color{Magenta} f_w} be any differentiable function approximators for the policy and the value function, respectively, that satisfy the compatibility condition (2b), and let the step sizes satisfy \alpha_k \to 0 and \sum_k \alpha_k = \infty. Then the sequence

\{\rho (\pi_k)\}^{\infty}_{k=0}

defined by any \theta_0, \pi_k=\pi(\cdot\ ,\cdot\ ,\theta_k), and

\omega_k=\omega \ such \ that

\sum_s d^{\pi_k}(s)\sum_a \pi_k (s,a){\color{Red} \left [ Q^{\pi_k} (s,a)-{\color{Magenta}f_{\omega }(s,a)} \right ] \frac{\partial {\color{Magenta} f_{\omega} (s,a)}}{\partial \omega}}=0   (corresponding to (2a))

\theta_{k+1}=\theta_{k}+\alpha_k{\color{Blue} \sum_s d^{\pi_k}(s)\sum_a\frac{\partial \pi_k(s,a)}{\partial \theta}f_{\omega_k}(s,a)}   (corresponding to (2c))

converges such that

{\color{Blue} lim_{k\to \infty} \frac{\partial \rho (\pi_k)}{\partial \theta}=0}

i.e., the gradient vanishes in the limit, so the policy converges to a locally optimal (stationary) point of \rho.
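The iteration of Theorem 3 can be sketched on the toy MDP used in the earlier checks, with exact quantities standing in for samples: at each step k the 'critic' parameter ω_k is obtained from (2a) as a weighted least-squares fit onto the compatible features, and the 'actor' parameter θ is updated with the approximate gradient (2c). The MDP, the features, and the step-size schedule below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Sketch of the Theorem 3 iteration (start-state formulation, exact quantities).
nS, nA, gamma, s0 = 2, 2, 0.9, 0
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
phi = np.eye(nS * nA)

def policy(theta):
    prefs = (phi @ theta).reshape(nS, nA)
    e = np.exp(prefs - prefs.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mdp_quantities(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,saz->sz', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('saz,z->sa', P, V)
    d = np.linalg.solve((np.eye(nS) - gamma * P_pi).T, np.eye(nS)[s0])
    psi = np.array([[phi[s * nA + a] - pi[s] @ phi[s * nA:(s + 1) * nA]
                     for a in range(nA)] for s in range(nS)])
    return pi, Q, d, psi, V[s0]

theta = np.zeros(nS * nA)
history = []
for k in range(2000):
    pi, Q, d, psi, rho_k = mdp_quantities(theta)
    history.append(rho_k)
    # Critic: omega_k solving (2a), a weighted least-squares fit of Q^pi_k onto psi.
    sw = np.sqrt((d[:, None] * pi).reshape(-1))
    X = psi.reshape(-1, theta.size)
    w = np.linalg.lstsq(X * sw[:, None], Q.reshape(-1) * sw, rcond=None)[0]
    f = (X @ w).reshape(nS, nA)
    # Actor: theta_{k+1} = theta_k + alpha_k sum_s d(s) sum_a dpi/dtheta f_w(s,a).
    alpha = 0.01 / (1 + 0.001 * k)        # alpha_k -> 0 and sum_k alpha_k diverges
    theta = theta + alpha * np.einsum('s,sa,sak,sa->k', d, pi, psi, f)

print(history[0], history[-1])            # rho(pi_k): the final value should exceed the initial one
print(policy(theta))                      # the improved policy
```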


Softmax Function, Softargmax, or Normalized Exponential Function

{\color{Golden} \pi} (s,a)=\frac{e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sa}}}}{\sum_{\color{Magenta} b} e^{ {\color{Blue} \theta^T} {\color{Red}\phi _{sb}}}}

where {\color{Red} \phi _{sa}} : an l-dimensional feature vector characterizing the state-action pair (s, a).

{\color{Blue} \theta ^T} {\color{Red} \phi_{sa}}: the inner product of  {\color{Blue} \theta } and {\color{Red} \phi_{sa}}.

Softmax Distribution Matlab Code@github Private Repository
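Since that repository is private, here is a minimal Python stand-in (not the Matlab code) that computes the exponential soft-max distribution over actions for a few made-up states and feature vectors:

```python
import numpy as np

# Minimal sketch of the exponential soft-max distribution pi(s, a) above,
# with made-up feature vectors and parameters.
rng = np.random.default_rng(1)
n_states, n_actions, L = 4, 3, 6
phi = rng.normal(size=(n_states, n_actions, L))   # phi_sa: L-dimensional features
theta = rng.normal(size=L)                        # policy parameter vector

prefs = phi @ theta                               # theta^T phi_sa for every (s, a)
prefs -= prefs.max(axis=1, keepdims=True)         # subtract the max for numerical stability
pi = np.exp(prefs) / np.exp(prefs).sum(axis=1, keepdims=True)

print(pi)                                         # row s is the distribution pi(s, .)
print(pi.sum(axis=1))                             # each row sums to 1
```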

[Figure: Exponential soft-max distribution. Column is the vector, state.]
