
Notation

$J(\theta)$: any policy objective function of the parameter vector $\theta$.

$\alpha$ : step-size parameter.

$\nabla_{\theta}J(\theta)=\begin{pmatrix} \frac{\partial J(\theta)}{\partial \theta_1}\\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{pmatrix}$

$\Delta \theta = \alpha \nabla_{\theta}J(\theta)$: the update step, ascending the gradient of the policy objective.

$\pi_{\theta}(a \mid s)$: the action policy.

Usually, the action probability takes one of the following forms:

$\pi_{\theta}(a\mid s, \theta)\doteq \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}=\frac{e^{\theta^T x(s,a)}}{\sum_b e^{\theta^T x(s,b)}}$ (a)

the soft-max policy, which weights actions by a linear combination of features $x(s,a)$,

or

$\pi_{\theta}(a\mid s, \theta)\doteq \frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\,e^{-\frac{[a-\mu(s,\theta)]^2}{2\sigma(s,\theta)^2}}$ (b)

\begin{align*} \mu(s,\theta) &\doteq \theta_{\mu}^T x_{\mu}(s)\\ \sigma(s,\theta) &\doteq e^{\theta_{\sigma}^T x_{\sigma}(s)} \end{align*}

the Gaussian policy for continuous actions, with mean and standard deviation parameterized as above.
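Both parameterizations can be sketched directly in NumPy. This is a minimal illustration under my own naming, not code from any library: the feature maps `x`, `x_mu`, and `x_sigma` are hypothetical placeholders for problem-specific features.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """(a): pi(a|s,theta) = exp(theta^T x(s,a)) / sum_b exp(theta^T x(s,b))."""
    prefs = np.array([theta @ x(s, b) for b in actions])
    prefs -= prefs.max()          # shift for numerical stability; probs unchanged
    e = np.exp(prefs)
    return e / e.sum()            # probability for each action in `actions`

def gaussian_policy(theta_mu, theta_sigma, x_mu, x_sigma, s, rng):
    """(b): a ~ N(mu(s,theta), sigma(s,theta)^2) for continuous actions."""
    mu = theta_mu @ x_mu(s)                    # mu = theta_mu^T x_mu(s)
    sigma = np.exp(theta_sigma @ x_sigma(s))   # exp keeps sigma > 0
    return rng.normal(mu, sigma), mu, sigma
```

The exponential in $\sigma(s,\theta)$ mirrors definition (b): it guarantees a positive standard deviation for any $\theta_\sigma$.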

so

\begin{align*} \nabla_{\theta} \pi_{\theta}(a \mid s,\theta) &= \pi_{\theta}(a \mid s, \theta)\frac{\nabla_\theta \pi_\theta (a\mid s,\theta)}{\pi_\theta(a \mid s, \theta)}\\ &=\pi_\theta(a \mid s, \theta)\, \nabla_\theta \ln \pi_\theta(a \mid s, \theta) \end{align*}

This yields the score function $\nabla_\theta \ln \pi_\theta(a \mid s, \theta)$.

For (a)

$\nabla_\theta \ln \pi_\theta(a \mid s, \theta)=x(s,a)-\mathbb{E}_{\pi_{\theta}}[x(s, \cdot)]$
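A quick numerical sanity check on this score function: for a hypothetical feature matrix `feats` (row `feats[a]` standing in for $x(s,a)$ at a fixed state), the analytic score $x(s,a)-\mathbb{E}_{\pi_\theta}[x(s,\cdot)]$ should agree with a finite-difference gradient of $\ln \pi_\theta$:

```python
import numpy as np

def softmax_probs(theta, feats):
    """pi(a) from action preferences feats[a] @ theta."""
    prefs = feats @ theta
    prefs -= prefs.max()          # stability shift; probabilities unchanged
    e = np.exp(prefs)
    return e / e.sum()

def analytic_score(theta, feats, a):
    """x(s,a) - E_pi[x(s,.)] -- the soft-max score function."""
    p = softmax_probs(theta, feats)
    return feats[a] - p @ feats

def numeric_score(theta, feats, a, eps=1e-6):
    """Central finite differences of ln pi(a) w.r.t. theta."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (np.log(softmax_probs(tp, feats)[a])
                - np.log(softmax_probs(tm, feats)[a])) / (2 * eps)
    return g
```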

For (b):

\begin{align*} \nabla_{\theta_{\mu}}\ln \pi_\theta(a \mid s, \theta) &=\nabla_{\theta_{\mu}} \ln \left[ \frac{1}{\sigma(s,\theta)\sqrt{2\pi}}e^{-\frac{[a-\mu(s,\theta)]^2}{2\sigma(s,\theta)^2}} \right] \\ &=\nabla_{\theta_{\mu}} \left[ \ln 1 -\ln \left[ \sigma(s,\theta)\sqrt{2\pi}\right] - \frac{[a-\mu(s,\theta)]^2}{2\sigma(s,\theta)^2}\right]\\ &=0-0-\nabla_{\theta_{\mu}}\frac{[a-\mu(s,\theta)]^2}{2\sigma(s,\theta)^2}\\ &=-\frac{2[a-\mu(s,\theta)]}{2\sigma(s,\theta)^2}\left[ -\frac{\partial \mu(s,\theta)}{\partial \theta_{\mu}} \right] \\ &=-\frac{a-\mu(s,\theta)}{\sigma(s,\theta)^2}\left[ -\frac{\partial \theta_\mu^T x_\mu(s)}{\partial \theta_{\mu}} \right]\\ &=\frac{a-\mu(s,\theta)}{\sigma(s,\theta)^2}\, x_\mu(s) \end{align*}

\begin{align*} \nabla_{\theta_{\sigma}}\ln \pi_\theta(a \mid s, \theta) &= \nabla_{\theta_{\sigma}} \ln \left[ \frac{1}{\sigma(s,\theta)\sqrt{2\pi}}e^{-\frac{[a-\mu(s,\theta)]^2}{2\sigma(s,\theta)^2}} \right] \\ &=\nabla_{\theta_{\sigma}} \left[ \ln 1- \ln \left[ \sigma(s,\theta)\sqrt{2\pi} \right] - \frac{[a-\mu(s,\theta)]^2}{2\sigma(s,\theta)^2}\right] \\ &=0- \frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\sqrt{2\pi}\,\frac{\partial \sigma(s,\theta)}{\partial \theta_\sigma}+\frac{[a-\mu(s,\theta)]^2}{\sigma(s,\theta)^3} \frac{\partial \sigma(s,\theta)}{\partial \theta_\sigma} \\ &=- \frac{1}{\sigma(s,\theta)}\frac{\partial e^{\theta_\sigma^T x_\sigma(s)}}{\partial \theta_\sigma}+\frac{[a-\mu(s,\theta)]^2}{\sigma(s,\theta)^3} \frac{\partial e^{\theta_\sigma^T x_\sigma(s)}}{\partial \theta_\sigma} \end{align*}

\begin{align*} \nabla_{\theta_{\sigma}}\ln \pi_\theta(a \mid s, \theta) &=-\frac{1}{\sigma(s,\theta)}\,\sigma(s,\theta)\,x_{\sigma}(s)+\frac{[a-\mu(s,\theta)]^2}{\sigma(s,\theta)^3}\,\sigma(s,\theta)\,x_\sigma(s)\\ &=\left[ \frac{[a-\mu(s,\theta)]^2}{\sigma(s,\theta)^2} - 1 \right] x_\sigma(s) \end{align*}
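The same finite-difference check works for the two Gaussian score functions just derived. A sketch with arbitrary made-up feature vectors (all names are mine):

```python
import numpy as np

def log_pi(a, th_mu, th_sig, xm, xs):
    """ln of the Gaussian policy density, mu = th_mu^T xm, sigma = exp(th_sig^T xs)."""
    mu = th_mu @ xm
    sig = np.exp(th_sig @ xs)
    return -np.log(sig * np.sqrt(2 * np.pi)) - (a - mu) ** 2 / (2 * sig ** 2)

def gaussian_scores(a, th_mu, th_sig, xm, xs):
    """The two score functions derived above."""
    mu = th_mu @ xm
    sig = np.exp(th_sig @ xs)
    g_mu = (a - mu) / sig ** 2 * xm                 # gradient w.r.t. theta_mu
    g_sig = ((a - mu) ** 2 / sig ** 2 - 1.0) * xs   # gradient w.r.t. theta_sigma
    return g_mu, g_sig
```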

These results match the equations on page 336 of Chapter 13 (Policy Gradient Methods) in Reinforcement Learning: An Introduction, 2nd ed., Richard S. Sutton and Andrew G. Barto.

For one-step MDPs:

$J(\theta)=\mathbb{E}[r]=\sum_{s\in \mathcal{S}}d(s)\sum_{a\in \mathcal{A}}\pi_{\theta}(a \mid s)\,\mathcal{R}_{s,a}$

\begin{align*} \nabla_{\theta}J(\theta) &=\sum_{s\in \mathcal{S}}d(s)\sum_{a\in \mathcal{A}}\pi_{\theta}(a \mid s)\,\nabla_{\theta}\left[ \ln \pi_{\theta}(a \mid s) \right] \mathcal{R}_{s,a} \\ &=\mathbb{E}_{\pi_\theta}\left[\nabla_{\theta} \ln \pi_{\theta}(a\mid s)\, r \right] \end{align*}

$r = \mathcal{R}_{s,a}$: the instantaneous reward.
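This expectation form is easy to verify numerically on a toy one-step MDP (one state, two actions, made-up rewards; everything here is an illustrative assumption): averaging score × reward over sampled actions recovers the analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = np.eye(2)                  # x(s,a): one-hot features, single state
R = np.array([1.0, 0.0])           # R_{s,a}: reward per action (made up)
theta = np.array([0.2, -0.1])

def probs(th):
    e = np.exp(feats @ th - (feats @ th).max())
    return e / e.sum()

p = probs(theta)

# Analytic gradient: sum_a pi(a) * score(a) * R(a)
analytic = sum(p[a] * (feats[a] - p @ feats) * R[a] for a in range(2))

# Monte Carlo estimate of E_pi[ grad ln pi(a|s) * r ]
actions = rng.choice(2, size=200_000, p=p)
mc = np.mean([(feats[a] - p @ feats) * R[a] for a in actions], axis=0)
```

The Monte Carlo estimate converges to the analytic value as the sample count grows, which is exactly what lets us update $\theta$ from sampled experience.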

For any of the policy objective functions:

$\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \left[\ln \pi_\theta (a \mid s)\right] q_\pi(s,a)\right]$

$q_\pi(s,a)$: the long-term value of the state–action pair.

so

$\Delta \theta_t=\alpha \nabla_\theta J(\theta)=\alpha \nabla_\theta \left[ \ln\pi_\theta(a_t\mid s_t,\theta_t) \right] v_t$

$v_t$: the return, an unbiased sample of $q_\pi(s_t,a_t)$.

So we get the algorithm for updating $\theta$:

$\theta \leftarrow \theta + \alpha \nabla_\theta \left[ \ln \pi_\theta (a_t \mid s_t) \right] v_t$
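For the soft-max policy (a), this update is only a few lines. A minimal sketch (the helper name is mine, and `feats[a]` again stands in for $x(s,a)$; this is not a full agent):

```python
import numpy as np

def reinforce_update(theta, feats, action, v_t, alpha=0.1):
    """theta <- theta + alpha * grad ln pi(a_t|s_t) * v_t for a soft-max policy."""
    prefs = feats @ theta
    prefs -= prefs.max()                # numerical stability
    p = np.exp(prefs)
    p /= p.sum()
    score = feats[action] - p @ feats   # grad_theta ln pi(a|s)
    return theta + alpha * v_t * score
```

A positive return $v_t$ pushes $\theta$ to make the sampled action more likely; a negative return pushes the other way.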

In summary, I think the reasoning is:
1. the policy (the action probability) has the form $\mathbf{e^{P(\theta)}}$, and
2. factoring $\boldsymbol{\pi(a\mid s, \theta)}$ out of the gradient of the objective function $J(\theta)$ (i.e., the value function $V_{\pi_\theta}(s_0)$) produces an expectation form for $\nabla J(\theta)$: $\sum \text{distribution(state)} \sum \text{policy(action} \mid \text{state)} \left[\text{the term for updating } \theta\right]$,
so we apply $\ln$ to the policy before taking the gradient, for analytical convenience.
