RL Archives - Dr. Pei

Protected: Resume

March 18, 2026

There is no excerpt because this is a protected post.

Club Elo

Matlab Code:

% Club identifiers as used in the api.clubelo.com URLs, and their display labels
club0 = {'barcelona','bayern','realmadrid','manunited','liverpool','mancity','inter','juventus'};
ll = {'Barcelona','Bayern','Real Madrid','Man United','Liverpool','Man City','Inter','Juventus'};
start = '01-Jan-2020';
D = './';
for i_club = 1:length(club0)
    club = club0{i_club};
    url = sprintf('http://api.clubelo.com/%s',club);
    % Download step (commented out in the original): save each club's data as data<i>.csv
    % filename = sprintf('%sdata%d.csv',D,i_club);
    % websave(filename, url);
end
% Read every previously saved data*.csv file in D into a table
S = dir(fullfile(D,'data*.csv'));
datatotal = cell(1,length(club0));
for k = 1:numel(S)
    F = fullfile(D,S(k).name);
    datatotal{k} = readtable(F);
end
… read more »

Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms

April 22, 2026


Solving H-horizon, Stationary Markov Decision Problems In Time Proportional To Log(H)

April 22, 2026

Paul Tseng, Operations Research Letters 9 (1990) 287-297.

Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Run Time

April 22, 2026

The nonlinear Bellman equation is equivalent to a linear programming problem. Primal-Dual LP: Primal LP (1), Dual LP (2), Minmax Problem (3).
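The three numbered formulas did not survive in this excerpt. As a rough guide, a standard textbook form of the LP formulation for a discounted MDP is sketched below. The symbols are assumptions and may not match the paper's notation: v is the value vector, q an initial state distribution, P_a and r_a the transition matrix and reward vector for action a, γ the discount factor, and μ_a the dual variables.

```latex
% Primal LP: the Bellman inequalities v >= r_a + gamma P_a v become linear constraints
\[
\min_{v}\; (1-\gamma)\, q^{\top} v
\quad \text{s.t.} \quad (I - \gamma P_a)\, v - r_a \ge 0 \quad \forall a
\]
% Dual LP: mu_a plays the role of a discounted state-action occupancy measure
\[
\max_{\mu \ge 0}\; \sum_{a} \mu_a^{\top} r_a
\quad \text{s.t.} \quad \sum_{a} (I - \gamma P_a^{\top})\, \mu_a = (1-\gamma)\, q
\]
% Minmax problem: the Lagrangian saddle point of the primal-dual pair
\[
\min_{v}\; \max_{\mu \ge 0}\;
(1-\gamma)\, q^{\top} v + \sum_{a} \mu_a^{\top} \bigl( (\gamma P_a - I)\, v + r_a \bigr)
\]
```

Setting the gradient of the minmax objective with respect to v to zero recovers the dual constraint, which is a quick way to check that the three forms agree.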

KL Divergence

July 14, 2019

In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution. Source: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence Sections: Information entropy; KL Divergence.
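For reference, the standard definitions for discrete distributions P and Q over the same set X are sketched below. The symbols p(x), q(x), and H are assumed notation, not taken from the post.

```latex
% Information entropy of P
\[
H(P) = -\sum_{x \in X} p(x) \log p(x)
\]
% KL divergence of P from the reference distribution Q
\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}
\]
% Equivalent form: cross-entropy minus entropy
\[
D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P),
\qquad
H(P, Q) = -\sum_{x \in X} p(x) \log q(x)
\]
```

The divergence is non-negative, equals zero only when P and Q coincide, and is not symmetric in its arguments.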

The Asymptotic Convergence-Rate of Q-learning

April 22, 2026

The asymptotic rate of convergence of Q-learning is O(1/t^{R(1-γ)}) if R(1-γ) < 0.5, where R = Pmin/Pmax and P is the state-action occupation frequency: |Q_t(x,a) − Q*(x,a)| < B/t^{R(1-γ)}. The convergence rate bounds the gap between the current estimate Q_t and the optimal value Q*; the faster this gap shrinks, the faster Q-learning converges. We hope the O(1/t^{R(1-γ)}) should… read more »
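To get a feel for the bound, here is a small worked example. The values γ = 0.9 and R = 0.5 are illustrative assumptions, not taken from the paper.

```latex
% Exponent of the rate under the assumed values gamma = 0.9, R = 0.5
\[
R(1-\gamma) = 0.5 \times (1 - 0.9) = 0.05 < 0.5,
\qquad
|Q_t(x,a) - Q^*(x,a)| < \frac{B}{t^{0.05}}
\]
% Factor k by which t must grow to halve the bound
\[
\frac{B}{(k t)^{0.05}} = \frac{1}{2} \cdot \frac{B}{t^{0.05}}
\;\Longleftrightarrow\;
k = 2^{1/0.05} = 2^{20} \approx 1.05 \times 10^{6}
\]
```

Under these assumed values, halving the error bound takes roughly a million times more iterations, which shows how slow the guaranteed rate becomes when γ is close to 1 or the occupation frequencies are unbalanced.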

Policy Gradient Methods

May 10, 2019

In summary, I guess the logarithm ('ln') is applied to the policy before taking the gradient, for convenience of analysis, because: 1. the policy (the probability of an action) has the form [formula missing in source]; 2. a 'math trick' in the gradient of the objective function (i.e., the value function) yields an expectation form for [formula missing in source]. Notation J(θ):… read more »
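The formulas are missing from this excerpt. The 'math trick' referred to is presumably the standard log-derivative identity, sketched below under assumed notation: π_θ(a|s) is the policy, J(θ) the objective, d^{π_θ} the state distribution under the policy, and Q^{π_θ} the action-value function.

```latex
% Log-derivative identity: the gradient of the policy written via its logarithm
\[
\nabla_\theta \pi_\theta(a \mid s)
= \pi_\theta(a \mid s)\, \nabla_\theta \ln \pi_\theta(a \mid s)
\]
% Substituting into the policy gradient turns the sum over actions into an expectation
\[
\nabla_\theta J(\theta)
\;\propto\; \sum_{s} d^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)
= \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
\bigl[ \nabla_\theta \ln \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \bigr]
\]
```

The expectation form matters in practice because it can be estimated from sampled trajectories without summing over every state and action.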

Actor-Critic Algorithms for Hierarchical Markov Decision Processes

April 22, 2026


Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation

April 22, 2026

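When the rewards given by the environment are sparse and delayed, the paper proposes a solution: there is only one agent throughout, but it works in two stages. (1) The meta-controller selects a goal. (2) The controller outputs an action based on the current state and the goal, and a critic judges whether the goal has been completed or a terminal state has been reached. Steps 1 and 2 repeat: the meta-controller selects a new goal, the controller outputs actions again, and so on. My understanding is that the method "splits" the environment into N small, temporally ordered sub-environments, each corresponding to one goal. In such an environment, the agent itself can be treated as a single point. The key is that the agent chooses the policy over goals πg that maximizes the expected discounted Q-value, i.e., if the goal sequence g1-g3-g2-… has the highest Q-value among all possible goal sequences, the agent should… read more »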
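The two levels described above can be written as two coupled Q-functions. The sketch below follows the usual h-DQN formulation as best recalled; the notation is an assumption and may differ from the paper: r is the intrinsic reward given by the critic, f the extrinsic reward from the environment, π_ag the controller's policy, π_g the policy over goals, and N the number of steps until the controller stops.

```latex
% Controller: value of action a in state s while pursuing goal g (intrinsic reward r)
\[
Q_1^*(s, a; g) = \max_{\pi_{ag}} \mathbb{E}\Bigl[ \sum_{t'=t}^{\infty} \gamma^{\,t'-t}\, r_{t'}
\;\Big|\; s_t = s,\ a_t = a,\ g_t = g,\ \pi_{ag} \Bigr]
\]
% Meta-controller: value of choosing goal g in state s (extrinsic reward f over N steps)
\[
Q_2^*(s, g) = \max_{\pi_g} \mathbb{E}\Bigl[ \sum_{t'=t}^{t+N} f_{t'}
+ \gamma \max_{g'} Q_2^*(s_{t+N}, g')
\;\Big|\; s_t = s,\ g_t = g,\ \pi_g \Bigr]
\]
```

The second equation is the one the text refers to: the meta-controller picks the goal sequence whose accumulated extrinsic reward, with discounting, is largest.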
