Reinforcement Learning Archives - Page 2 of 3 - Dr. Pei

Meta Learning Shared Hierarchies

Notation. S: state space. A: action space. MDP: transition function P(s', r | s, a), where (s', r) is the next state and reward and (s, a) is the current state and action. P_M: distribution over MDPs M with the same state-action space (S, A). Agent: a function mapping from a multi-episode history (s0, a0, r0, s1, a1, r1, …

RL Math

Neural-network-based decentralized control of continuous-time nonlinear interconnected systems with unknown dynamics Global Value vs. Sub-goals by Policy Gradient Neuro-Dynamic Programming Gradient Methods Framework Policy Gradient Method for Hierarchical RL Policy Gradient HRL Policy Gradient HRL and Neuro-Dynamic Programming Policy Gradient Method for HRL The scanned draft files above contain handwritten mathematical formulas or tools, including…

Decentralized Optimal Control of Distributed Interdependent Automata With Priority Structure

Data Flowchart. Notation: the subsystem model is the plant P_i, a deterministic finite-state automaton. (1) to (4) [equations not preserved]: P_i can be transitioned from one state into another state if the input l is applied. (5) [equation not preserved]: it encodes that the transition is possible with at least…

Neural-network-based decentralized control of continuous-time nonlinear interconnected systems with unknown dynamics

Math and Optimal Control. Problem formulation: consider a continuous-time nonlinear large-scale system Σ composed of N interconnected subsystems described by (1), where x_i(t) ∈ R^{n_i} is the state (the overall state of the large-scale system Σ is denoted by [symbol not preserved]) and u_i[x_i(t)] ∈ R^{m_i} is the control input vector of the ith…

Reinforcement Learning is Direct Adaptive Optimal Control

Reference: Stanford_cs229-notes12_Andrew_Ng, Reinforcement Learning and Control. How should reinforcement learning be viewed from a control systems perspective? Control problems can be divided into two classes: (1) regulation and tracking problems, in which the objective is to follow a reference trajectory; (2) optimal control problems, in which the objective is to extremize a…

Decentralized Stabilization for a Class of Continuous-Time Nonlinear Interconnected Systems Using Online Learning Optimal Control Approach

Neural-network-based Online Learning Optimal Control. Decentralized Control Strategy. Cost functions (critic neural networks) – local optimal controllers. Feedback gains to the optimal control policies – decentralized control strategy. Optimal Control Problem (Stabilization). Hamilton-Jacobi-Bellman (HJB) Equations. Apply Online Policy Iteration…

Hierarchical Policy Gradient Algorithms

Math notation:
M: the overall task MDP.
{M0, M1, M2, M3, ..., Mn}: a finite set of subtask MDPs.
Mi: a subtask in the hierarchy.
M0: the root task; solving it solves the entire MDP M.
i: non-primitive subtask, paper uses…

Hierarchical Actor-Critic

Download: Hierarchical_Actor-Critic
Flowchart
Terminology (artificial intelligence term: optimization/decision/control term)
a. Agent: Controller or decision maker
b. Action: Control
c. Environment: System
d. Reward of a stage: (Opposite of) Cost of a stage
e. State value: (Opposite of) Cost of a state
f. Value (or state-value) function: (Opposite of) Cost function
g. Maximizing the value function…

RL Other Useful Reference

Function Approximation (FA): http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/
AlphaGo_IJCAI
AlphaGo-Zurich

Policy Gradient and Q-learning

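What is the essential difference between the two main classes of RL algorithms (Policy Gradient and Q-learning)?

Q-learning is a reinforcement learning method based on value function estimation, while Policy Gradient is a policy search method. They are two different ways of solving the reinforcement learning problem. If you are familiar with supervised learning, the former is analogous to Naive Bayes, which makes predictions by estimating posterior probabilities, and the latter is analogous to an SVM, which optimizes the learning objective directly without estimating posterior probabilities.

Answers to the questions:

1. Are the two methods essentially the same (are their solution spaces equal)? For example, if both can converge to an optimal solution, will they necessarily converge to the same result on the same problem?

They are different solution methods, but the solution space (the policy space) is determined by the policy model, not by the solution method. Both can use the same model, for example neural networks of the same size, in which case their solution spaces are identical.

In a discrete state space, Q-learning can in theory converge to the optimal policy, although convergence may be extremely slow. Once function approximation is used (for example, a neural network policy model), this is no longer guaranteed. Policy Gradient applies gradient methods to a non-convex objective, so it can only be shown to converge to a stationary point, not to the optimal policy.

2. Karpathy's blog mentions that more people prefer Policy Gradient. What are the more detailed differences between the two methods?

Value-function-based methods (Q-learning, SARSA, and most of the algorithms studied in classical reinforcement learning) suffer from policy degradation: the value function estimate is already quite accurate, yet the policy derived from it is still not optimal. This is similar to classifying by posterior probability in supervised learning, where a highly accurate posterior estimate can still produce the wrong class. For example, suppose the true posterior probability of the positive class is 0.501. If it is estimated as 0.9, the error is about 0.4, yet the classification is correct; if it is estimated as 0.499, the error is only 0.002, yet the classification is wrong.

Policy degradation is especially common when reinforcement learning uses value function approximation. See the example in the Tutorial on Reinforcement Learning slides.

According to the author, Policy Gradient does not suffer from policy degradation, expresses its objective more directly, uses a more modern solution method, and can learn a stochastic policy directly, all of which make it more practical.

(3. It would be even better if someone could also compare Actor-Critic :)

Actor-Critic uses a value function to assist while solving for the policy: the estimated value function replaces the sampled reward, which improves sample efficiency.

Author: ForABiggerWorld. Source: CSDN. Original post: https://blog.csdn.net/zjucor/article/details/79200630. Copyright notice: this is an original article by the blogger; please include a link to the original post when reposting.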
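The policy degradation point in the answer above has a direct analogue for greedy policies, and a small numeric sketch may make it concrete. The action values below are made up for illustration (they are not taken from the original post or from any benchmark), and the helper names `greedy_action`, `max_abs_error`, and `compare` are hypothetical. The sketch only shows that a value estimate with a smaller error can still yield a worse greedy action than an estimate with a larger error.

```python
def greedy_action(q_values):
    """Return the index of the largest action value (first index wins ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def max_abs_error(estimate, truth):
    """Largest absolute difference between estimated and true action values."""
    return max(abs(e - t) for e, t in zip(estimate, truth))


# Hypothetical true action values for a single state with two actions.
TRUE_Q = [1.00, 0.99]
# Assumed estimate with a large error that keeps the action ranking intact.
COARSE_Q = [1.50, 0.60]
# Assumed estimate with a tiny error that flips the action ranking.
CLOSE_Q = [0.99, 1.00]


def compare(estimate, truth=TRUE_Q):
    """Summarize the estimation error and whether the greedy action is optimal."""
    return {
        "error": max_abs_error(estimate, truth),
        "greedy": greedy_action(estimate),
        "optimal": greedy_action(estimate) == greedy_action(truth),
    }


if __name__ == "__main__":
    for name, q in (("coarse", COARSE_Q), ("close", CLOSE_Q)):
        print(name, compare(q))
```

Here the coarse estimate is off by as much as 0.5 but still selects the best action, while the close estimate is off by roughly 0.01 and selects the wrong one. This mirrors the 0.501 versus 0.499 classification example in the post.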
