Notation

$J(\theta)$: any policy objective function of $\theta$ (a vector of policy parameters).

$\alpha$: step-size parameter.

$\Delta\theta = \alpha\, \nabla_\theta J(\theta)$: ascending the gradient of the policy objective.

$\pi_\theta(s, a)$: action policy, the probability of taking action $a$ in state $s$ under parameters $\theta$.

Usually, the probability of an action looks like one of the following:

(a)

$$\pi_\theta(s, a) = \frac{e^{x(s,a)^\top \theta}}{\sum_b e^{x(s,b)^\top \theta}}$$

a soft-max policy, which weights actions using a linear combination of features $x(s,a)$,

or

(b)

$$\pi_\theta(s, a) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(a - \mu(s))^2}{2\sigma^2}\right), \qquad \mu(s) = x(s)^\top \theta$$

a Gaussian policy for continuous actions, with mean given by a linear combination of state features and (fixed) variance $\sigma^2$.

so

$$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a)\, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}$$

so

$$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a)\, \nabla_\theta \ln \pi_\theta(s, a).$$

We get the score function $\nabla_\theta \ln \pi_\theta(s, a)$.

For (a), the score function is

$$\nabla_\theta \ln \pi_\theta(s, a) = x(s, a) - \sum_b \pi_\theta(s, b)\, x(s, b) = x(s, a) - \mathbb{E}_{\pi_\theta}[x(s, \cdot)].$$

For (b), it is

$$\nabla_\theta \ln \pi_\theta(s, a) = \frac{(a - \mu(s))\, x(s)}{\sigma^2}.$$

These results are the same as the equations on page 336 of Chapter 13, Policy Gradient Methods, in *Reinforcement Learning: An Introduction*, 2nd edition, by Richard S. Sutton and Andrew G. Barto.
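The two score functions above are easy to check numerically. The sketch below is my own illustration, not from the book or this post: the feature dimensions, test values, and the `finite_diff` helper are all made up. It computes the analytic score for the soft-max and Gaussian policies and compares each against a finite-difference gradient of $\ln \pi_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- (a) soft-max policy over discrete actions with linear features x(s,a) ---
def softmax_probs(theta, x_sa):
    """x_sa: (num_actions, d) feature matrix for one state; returns pi_theta(s, .)."""
    prefs = x_sa @ theta
    prefs -= prefs.max()                      # for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def softmax_score(theta, x_sa, a):
    """Analytic score: x(s,a) - sum_b pi_theta(s,b) x(s,b)."""
    pi = softmax_probs(theta, x_sa)
    return x_sa[a] - pi @ x_sa

# --- (b) Gaussian policy for a continuous action, mean mu(s) = x(s)^T theta ---
def gaussian_logpdf(theta, x_s, a, sigma):
    mu = x_s @ theta
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def gaussian_score(theta, x_s, a, sigma):
    """Analytic score: (a - mu(s)) x(s) / sigma^2."""
    mu = x_s @ theta
    return (a - mu) * x_s / sigma ** 2

# --- finite-difference check of grad_theta ln pi ---
def finite_diff(f, theta, eps=1e-6):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[i] += eps
        t_minus[i] -= eps
        g[i] = (f(t_plus) - f(t_minus)) / (2 * eps)
    return g

d, num_actions = 4, 3
theta = rng.normal(size=d)
x_sa = rng.normal(size=(num_actions, d))      # features for each action in state s
x_s = rng.normal(size=d)                      # state features for the Gaussian policy
a_disc, a_cont, sigma = 1, 0.7, 0.5

print(softmax_score(theta, x_sa, a_disc))
print(finite_diff(lambda t: np.log(softmax_probs(t, x_sa)[a_disc]), theta))
print(gaussian_score(theta, x_s, a_cont, sigma))
print(finite_diff(lambda t: gaussian_logpdf(t, x_s, a_cont, sigma), theta))
```

The two printed pairs should agree to several decimal places.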

For one-step MDPs (start in a state $s \sim d(s)$, take one action, receive reward $r = \mathcal{R}_{s,a}$, and terminate):

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}_{s,a}.$$

For any of the policy objective functions, the policy gradient theorem gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right],$$

so in the one-step case, where $Q^{\pi_\theta}(s, a) = \mathcal{R}_{s,a}$,

$$\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \nabla_\theta \ln \pi_\theta(s, a)\, \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(s, a)\, r\right],$$

so, we get an algorithm for updating $\theta$ by stochastic gradient ascent on samples:

$$\Delta\theta = \alpha\, \nabla_\theta \ln \pi_\theta(s, a)\, r.$$
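As a concrete (hypothetical) instance of that update, here is a minimal sketch of the $\Delta\theta = \alpha\, \nabla_\theta \ln \pi_\theta(s,a)\, r$ rule on a tiny one-step MDP: a single state, three actions, a soft-max policy with one-hot features, and made-up rewards. The constants (features, rewards, learning rate, step count) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

num_actions, d = 3, 3
x_sa = np.eye(num_actions)           # one-hot features x(s,a) for the single state
rewards = np.array([1.0, 0.0, 0.5])  # expected reward R_{s,a} of each action
alpha = 0.1
theta = np.zeros(d)

def policy(theta):
    prefs = x_sa @ theta
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

for step in range(5000):
    pi = policy(theta)
    a = rng.choice(num_actions, p=pi)        # sample a ~ pi_theta(s, .)
    r = rewards[a] + rng.normal(scale=0.1)   # noisy one-step reward
    score = x_sa[a] - pi @ x_sa              # grad_theta ln pi_theta(s, a)
    theta += alpha * score * r               # delta theta = alpha * score * r

print(policy(theta))  # probability mass should shift toward the best action (a = 0)
```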

In summary, I guess it is because
1. the policy (probability of action) has an exponential style, e.g. $\pi_\theta(s, a) \propto e^{x(s,a)^\top \theta}$ or a Gaussian density, so $\ln \pi_\theta$ is simple to differentiate, and
2. the 'math trick' $\nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \ln \pi_\theta$, applied in the gradient equation of the objective function (i.e., the value function), yields an 'Expectation' form $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \ln \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)]$ that can be estimated by sampling,
that we assign 'ln' to the policy before taking the gradient, for analysis convenience.
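To make point 2 concrete: because the gradient is an expectation under $\pi_\theta$, it can be estimated purely from sampled $(a, r)$ pairs via the score function. The sketch below (again an illustration with invented numbers, not from the post) compares such a Monte Carlo estimate of $\mathbb{E}_{\pi_\theta}[\nabla_\theta \ln \pi_\theta(s, a)\, r]$ with the exact gradient of $J(\theta) = \sum_a \pi_\theta(s, a)\, \mathcal{R}_{s,a}$ for a one-state soft-max policy.

```python
import numpy as np

rng = np.random.default_rng(2)

num_actions = 3
x_sa = np.eye(num_actions)            # one-hot features for a single state
rewards = np.array([1.0, -0.5, 0.2])  # R_{s,a}
theta = rng.normal(size=num_actions)

def policy(theta):
    prefs = x_sa @ theta
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def J(theta):
    # Objective for the one-step MDP: expected reward under pi_theta.
    return policy(theta) @ rewards

# Exact gradient of J(theta), here via central finite differences.
eps = 1e-6
exact = np.array([(J(theta + eps * np.eye(num_actions)[i]) -
                   J(theta - eps * np.eye(num_actions)[i])) / (2 * eps)
                  for i in range(num_actions)])

# Score-function (likelihood-ratio) estimate: average of score(s,a) * r over samples.
pi = policy(theta)
n = 200_000
actions = rng.choice(num_actions, size=n, p=pi)
scores = x_sa[actions] - pi @ x_sa            # grad ln pi for each sampled action
mc = (scores * rewards[actions, None]).mean(axis=0)

print("exact gradient:        ", exact)
print("score-function estimate:", mc)         # agrees up to sampling noise
```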
