
# Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation


The key point is that the agent follows the policy over goals πg that maximizes the expected discounted Q-value: if the goal sequence g1, g3, g2, ... has the highest Q-value among all possible goal sequences, the agent should pursue goal 1 first, then goal 3, then goal 2, and so on.

$Q_{2}^{*}(s,g)=\max_{\pi_g}\mathbb{E}\left[\sum_{t'=t}^{t+N} f_{t'}+\gamma \max_{g'} Q_{2}^{*}\left(s_{t+N},g'\right) \,\middle|\, s_t=s,\, g_t=g,\, \pi_g\right]$
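The Bellman equation above can be sketched as a tabular Q-learning update for the meta-controller: the target is the extrinsic reward accumulated over the N steps the controller spent pursuing the goal, plus the discounted best Q-value over the next goal. All names, the goal set, and the hyperparameters here are illustrative assumptions, not the paper's Atari setup.

```python
# Hedged sketch: one tabular Q-learning update for the meta-controller's
# Q2(s, g), following the Bellman equation above. States are opaque keys;
# GAMMA, ALPHA, and the goal set are assumed for illustration.
from collections import defaultdict

GAMMA = 0.99   # discount factor (assumed)
ALPHA = 0.1    # learning rate (assumed)
GOALS = ["key", "right_door", "middle_ladder"]  # example goal set

Q2 = defaultdict(float)  # Q2[(state, goal)] -> estimated value

def update_meta_q(state, goal, extrinsic_return, next_state):
    """Move Q2(s, g) toward: sum of extrinsic rewards f_t' over the N
    controller steps, plus gamma * max over next goals g'."""
    best_next = max(Q2[(next_state, g2)] for g2 in GOALS)
    target = extrinsic_return + GAMMA * best_next
    Q2[(state, goal)] += ALPHA * (target - Q2[(state, goal)])

# One update after the controller finished pursuing "key" and the
# environment paid out an extrinsic return of 1.0 along the way.
update_meta_q("start_room", "key", 1.0, "has_key_room")
```

In the paper this update is done with a deep Q-network and a replay buffer rather than a table, but the target has the same shape.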

Frames 1-3: there is only ever one agent. The meta-controller picks a goal: the key. Based on the current position and the key, the controller outputs actions: go down the ladder, move left, but the agent runs into the skull and dies, and the critic signals termination.

Frames 4-6: the meta-controller picks the next goal: the bottom-right ladder. The controller outputs actions, the agent walks to the bottom-right ladder, and the critic judges the goal reached.

Frames 7-9: the meta-controller picks the next goals: the key, then the top-right door. The controller outputs actions; the agent picks up the key and reaches the top-right door, completing the goals.
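The two-level loop in the frames above can be sketched as follows: the meta-controller picks a goal, the controller issues primitive actions toward it, and the internal critic ends the subtask when the goal is reached. The 1-D corridor environment, landmark positions, and greedy policies are stand-in assumptions, not the paper's Montezuma's Revenge setup.

```python
# Hedged sketch of the meta-controller / controller / critic loop.
# Everything here (Corridor, landmark positions, greedy controller)
# is an illustrative toy, not the paper's implementation.

class Corridor:
    """Toy 1-D world: the agent sits at an integer position and
    landmarks (goal entities) sit at fixed positions."""
    def __init__(self):
        self.pos = 0
        self.landmarks = {"ladder": 3, "key": 5, "door": 8}

    def step(self, action):  # action is -1 or +1
        self.pos += action

def goal_reached(pos, goal, landmarks):
    # Internal critic: binary check that the agent entity has
    # reached the goal entity.
    return pos == landmarks[goal]

def controller_action(pos, target):
    # Controller: greedy primitive action toward the current goal.
    return 1 if target > pos else -1

def run(goal_sequence):
    """Execute a fixed goal sequence chosen by the meta-controller;
    return the goals completed in order."""
    env = Corridor()
    completed = []
    for goal in goal_sequence:  # meta-controller's chosen goals
        while not goal_reached(env.pos, goal, env.landmarks):
            env.step(controller_action(env.pos, env.landmarks[goal]))
        completed.append(goal)  # critic signaled goal completion
    return completed

print(run(["ladder", "key", "door"]))  # → ['ladder', 'key', 'door']
```

In h-DQN both levels are learned Q-networks; here the goal sequence and the controller are hard-coded so the control flow is easy to see.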

The internal critic is defined in the form <entity 1, relation, entity 2>, e.g. the agent entity *reaches* another entity such as the door, and a binary reward is computed from the relative positions of the two entities.
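A minimal sketch of that binary intrinsic reward, assuming the "reaches" relation is judged by the Euclidean distance between the two entities falling under a threshold (the threshold value is an assumption):

```python
# Hedged sketch: internal critic for a <entity1, "reaches", entity2>
# predicate. The distance threshold is an illustrative assumption.

def intrinsic_reward(entity1_pos, entity2_pos, threshold=1.0):
    """Return 1 if entity1 'reaches' entity2 (relative position within
    the threshold), else 0 -- a binary intrinsic reward."""
    dx = entity1_pos[0] - entity2_pos[0]
    dy = entity1_pos[1] - entity2_pos[1]
    return 1 if (dx * dx + dy * dy) ** 0.5 <= threshold else 0

intrinsic_reward((4.0, 2.0), (4.5, 2.0))  # agent near the door -> 1
```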

The agent first learns the "easier" goals, such as reaching the right door or the middle ladder, and then gradually starts learning the "harder" goals such as the key and the bottom ladder, which open up paths to higher extrinsic reward.
