
Reinforcement Learning: Theory and Algorithms


Reinforcement Learning: Theory and Algorithms Alekh Agarwal Nan Jiang Sham M. Kakade Wen Sun November 11, 2021 WORKING DRAFT: We will be frequently updating the book this fall, 2021. Please email [email protected] with any typos or errors you find. We appreciate it!


Transcription of Reinforcement Learning: Theory and Algorithms

Reinforcement Learning: Theory and Algorithms

Alekh Agarwal, Nan Jiang, Sham M. Kakade, Wen Sun

January 31, 2022

WORKING DRAFT: Please email [email protected] with any typos or errors you find. We appreciate it!

Contents

Part 1: Fundamentals

1. Markov Decision Processes: Discounted (Infinite-Horizon) Markov Decision Processes; the objective, policies, and values; Bellman Consistency Equations for Stationary Policies; Bellman Optimality Equations; Finite-Horizon Markov Decision Processes; Computational Complexity; Value Iteration; Policy Iteration; Value Iteration for Finite Horizon MDPs; the Linear Programming Approach; Sample Complexity and Sampling Models; Advantages and the Performance Difference Lemma; Bibliographic Remarks and Further Reading

2. Sample Complexity with a Generative Model: a naive model-based approach; Sublinear Sample Complexity; Minimax Optimal Sample Complexity (and the Model-Based Approach); the Discounted Case; the Finite Horizon Setting; Variance Lemmas; completing the proof; Scalings and Effective Horizon Dependencies; Bibliographic Remarks and Further Readings

3. Linear Bellman Completeness: the Linear Bellman Completeness Condition; the LSVI Algorithm; LSVI with D-Optimal Design; Performance Guarantees; How Strong is Bellman Completion as a Modeling Assumption?; Offline Reinforcement Learning; Offline Learning; Offline Policy Evaluation; Bibliographic Remarks and Further Readings

4. Fitted Dynamic Programming Methods: Fitted Q-Iteration (FQI) and Offline RL; the FQI Algorithm; Performance Guarantees of FQI; Fitted Policy-Iteration (FPI); Failure Cases Without Assumption; Policy Evaluation; Bibliographic Remarks and Further Readings

5. Statistical Limits of Generalization: Agnostic Learning; Review: Binary Classification; Importance Sampling and a Reduction to Supervised Learning; Linear Realizability; Policy Evaluation with Linearly Realizable Values; Linearly Realizable Q*; Linearly Realizable π*; Discussion: Studying Generalization in RL; Bibliographic Remarks and Further Readings

Part 2: Strategic Exploration

6. Multi-Armed & Linear Bandits: the K-Armed Bandit Problem; the Upper Confidence Bound (UCB) Algorithm; Linear Bandits: Handling Large Action Spaces; the LinUCB Algorithm; Upper and Lower Bounds; Analysis; Bibliographic Remarks and Further Readings

7. Strategic Exploration in Tabular MDPs: the Need for Strategic Exploration; the UCB-VI Algorithm; Proof of Lemma; an Improved Regret Bound; Bibliographic Remarks and Further Readings

8. Linearly Parameterized MDPs: Low-Rank MDPs and Linear MDPs; Planning in Linear MDPs; Learning the Transition using Ridge Linear Regression; Uniform Convergence via Covering; Analysis of UCBVI for Linear MDPs; Optimism; Regret Decomposition; Concluding the Final Regret Bound; Bibliographic Remarks and Further Readings

9. Generalization with Bounded Bilinear Rank: Bellman Rank; examples that have small Q-Bellman rank; examples that have small V-Bellman rank; Bilinear Classes; algorithms with Bounded Bilinear Rank; Sample Complexity; Eluder Dimension; Bibliographic Remarks and Further Readings

10. Deterministic MDPs with Linearly Parameterized Q*

Part 3: Policy Optimization

11. Policy Gradient Methods and Non-Convex Optimization: Policy Gradient Expressions and the Likelihood Ratio Method; (Non-convex) Optimization; gradient ascent and convergence to stationary points; Monte Carlo estimation and stochastic gradient ascent; Bibliographic Remarks and Further Readings

12. Optimality: Vanishing Gradients and Saddle Points; Policy Gradient Ascent; Log Barrier Regularization; the Natural Policy Gradient; Bibliographic Remarks and Further Readings

13. Function Approximation and the NPG: compatible function approximation and the NPG; Examples: NPG and Q-NPG; Log-linear Policy Classes and Soft Policy Iteration; Neural Policy Classes; the NPG Regret Lemma; Performance Bounds for Log-Linear Policies; Analysis; Sample Complexity; Bibliographic Remarks and Further Readings

14. CPI, TRPO, and More: Conservative Policy Iteration; the CPI Algorithm; Trust Region Methods and Covariant Policy Search; Proximal Policy Optimization; Bibliographic Remarks and Further Readings

Part 4: Further Topics

15. Imitation Learning: Setting; Offline IL: Behavior Cloning; the Hybrid Setting: Statistical Benefit and Algorithm; Extension to the Agnostic Setting; Maximum Entropy Inverse Reinforcement Learning; MaxEnt IRL: Formulation and the Principle of Maximum Entropy; Algorithm; Maximum Entropy RL: Implementing the Planning Oracle; Interactive Imitation Learning: AggreVaTe and Its Statistical Benefit over the Offline IL Setting; Bibliographic Remarks and Further Readings

16. Linear Quadratic Regulators: the LQR Model; Bellman Optimality: Value Iteration & the Algebraic Riccati Equations; Planning and Finite Horizon LQRs; Planning and Infinite Horizon LQRs; Convex Programs to find P and K*; the Primal for Infinite Horizon LQR; the Dual; Policy Iteration, Gauss Newton, and NPG; Gradient Expressions; Convergence Rates; Gauss-Newton Analysis; System Level Synthesis for Linear Dynamical Systems; Bibliographic Remarks and Further Readings

17. Partially Observable Markov Decision Processes

Bibliography

Appendix A: Concentration

Notation

The reader might find it helpful to refer back to this notation section.

- We slightly abuse notation and let $[K]$ denote the set $\{0, 1, 2, \ldots, K-1\}$ for an integer $K$.

- We let $\Delta(X)$ denote the set of probability distributions over a set $X$.
- For a vector $v$, we let $(v)^2$, $\sqrt{v}$, and $|v|$ denote the component-wise square, square root, and absolute value operations. Inequalities between vectors are elementwise: for vectors $v, v'$, we say $v \le v'$ if the inequality holds elementwise. For a vector $v$, we refer to the $j$-th component of this vector by either $v(j)$ or $[v]_j$.
- Denote the variance of any real-valued $f$ under a distribution $D$ as $\mathrm{Var}_D(f) := E_{x\sim D}[f(x)^2] - (E_{x\sim D}[f(x)])^2$.
- We overload notation where, for a distribution $\rho$ over $S$, we write $V^\pi(\rho) = E_{s\sim\rho}[V^\pi(s)]$.
- It is helpful to overload notation and let $P$ also refer to a matrix of size $(S\cdot A)\times S$, where the entry $P_{(s,a),s'}$ is equal to $P(s'|s,a)$. We also define $P^\pi$ to be the transition matrix on state-action pairs induced by a deterministic policy $\pi$; in particular, $P^\pi_{(s,a),(s',a')} = P(s'|s,a)$ if $a' = \pi(s')$ and $P^\pi_{(s,a),(s',a')} = 0$ if $a' \neq \pi(s')$. With this notation,
  $$Q^\pi = r + \gamma P V^\pi, \qquad Q^\pi = r + \gamma P^\pi Q^\pi, \qquad Q^\pi = (I - \gamma P^\pi)^{-1} r$$
  (these identities are illustrated in a short numerical sketch following this notation list).
- For a vector $Q \in \mathbb{R}^{|S \times A|}$, denote the greedy policy and value as $\pi_Q(s) := \operatorname{argmax}_{a\in A} Q(s,a)$ and $V_Q(s) := \max_{a\in A} Q(s,a)$.

- For a vector $Q \in \mathbb{R}^{|S \times A|}$, the Bellman optimality operator $\mathcal{T} : \mathbb{R}^{|S \times A|} \to \mathbb{R}^{|S \times A|}$ is defined as $\mathcal{T}Q := r + \gamma P V_Q$.
- For a vector $x$ and a matrix $M$, unless explicitly stated otherwise, we let $\|x\|$ and $\|M\|$ denote the Euclidean and spectral norms, respectively. We use the notation $\|x\|_M = \sqrt{x^\top M x}$, where $x$ and $M$ are assumed to be of appropriate dimensions.
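The identities above lend themselves to a quick numerical check. The sketch below is not from the book: it builds a small, made-up tabular MDP (2 states, 2 actions, random transition probabilities), verifies $Q^\pi = (I - \gamma P^\pi)^{-1} r$ for a deterministic policy, and then iterates the Bellman optimality operator $\mathcal{T}Q := r + \gamma P V_Q$ to a fixed point (i.e., value iteration, covered later in the book).

```python
# Sketch only (not the book's code): a hypothetical 2-state, 2-action tabular MDP
# used to check the matrix identities from the notation section.
import numpy as np

S, A, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)

P = rng.random((S, A, S))                 # P[s, a, s'] = P(s'|s, a)
P /= P.sum(axis=2, keepdims=True)         # normalize rows into probability distributions
r = rng.random((S, A))                    # r(s, a) in [0, 1]
pi = np.array([1, 0])                     # a deterministic stationary policy pi(s)

# P^pi on state-action pairs: P^pi_{(s,a),(s',a')} = P(s'|s,a) if a' = pi(s'), else 0.
P_pi = np.zeros((S * A, S * A))
for s in range(S):
    for a in range(A):
        for s_next in range(S):
            P_pi[s * A + a, s_next * A + pi[s_next]] = P[s, a, s_next]

r_vec = r.reshape(S * A)
Q_pi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r_vec)   # Q^pi = (I - gamma P^pi)^{-1} r

# Greedy policy pi_Q and greedy value V_Q associated with a vector Q, as defined above.
Q = Q_pi.reshape(S, A)
pi_Q, V_Q = Q.argmax(axis=1), Q.max(axis=1)

# Bellman optimality operator T Q = r + gamma * P V_Q; iterating it is value iteration.
def bellman_opt(Q):
    V_Q = Q.max(axis=1)                   # V_Q(s) = max_a Q(s, a)
    return r + gamma * P @ V_Q            # shapes: (S, A, S) @ (S,) -> (S, A)

Q_star = np.zeros((S, A))
for _ in range(2000):                     # plenty of iterations for gamma = 0.9
    Q_star = bellman_opt(Q_star)

print(Q_pi, pi_Q, V_Q)
print(Q_star, Q_star.argmax(axis=1))      # approximate optimal Q and a greedy policy for it
```

Here `Q_pi` is the action-value of the fixed policy `pi`, while `Q_star` approximates the optimal action-value; both are only illustrations of the notation, not results from the text.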

Part 1: Fundamentals

Chapter 1: Markov Decision Processes

Discounted (Infinite-Horizon) Markov Decision Processes

In reinforcement learning, the interactions between the agent and the environment are often described by an infinite-horizon, discounted Markov Decision Process (MDP) $M = (S, A, P, r, \gamma, \mu)$, specified by:

- A state space $S$, which may be finite or infinite. For mathematical convenience, we will assume that $S$ is finite or countably infinite.
- An action space $A$, which also may be discrete or infinite. For mathematical convenience, we will assume that $A$ is finite.
- A transition function $P : S \times A \to \Delta(S)$, where $\Delta(S)$ is the space of probability distributions over $S$ (i.e., the probability simplex). $P(s'|s,a)$ is the probability of transitioning into state $s'$ upon taking action $a$ in state $s$. We use $P_{s,a}$ to denote the vector $P(\cdot\,|s,a)$.
- A reward function $r : S \times A \to [0,1]$. $r(s,a)$ is the immediate reward associated with taking action $a$ in state $s$. More generally, $r(s,a)$ could be a random variable (where the distribution depends on $s,a$). While we largely focus on the case where $r(s,a)$ is deterministic, the extension to methods with stochastic rewards is often straightforward.
- A discount factor $\gamma \in [0,1)$, which defines a horizon for the problem.
- An initial state distribution $\mu \in \Delta(S)$, which specifies how the initial state $s_0$ is generated. In many cases, we will assume that the initial state is fixed at $s_0$, i.e. $\mu$ is a distribution supported only on $s_0$.

The objective, policies, and values

In a given MDP $M = (S, A, P, r, \gamma, \mu)$, the agent interacts with the environment according to the following protocol: the agent starts at some state $s_0 \sim \mu$; at each time step $t = 0, 1, 2, \ldots$, the agent takes an action $a_t \in A$, obtains the immediate reward $r_t = r(s_t, a_t)$, and observes the next state $s_{t+1}$ sampled according to $s_{t+1} \sim P(\cdot\,|s_t, a_t)$. The interaction record at time $t$,
$$\tau_t = (s_0, a_0, r_0, s_1, \ldots, s_t, a_t, r_t, s_{t+1}),$$
is called a trajectory, which includes the observed state at time $t+1$.

In the most general setting, a policy specifies a decision-making strategy in which the agent chooses actions adaptively based on the history of observations; precisely, a policy is a (possibly randomized) mapping from a trajectory to an action, i.e. $\pi : H \to \Delta(A)$, where $H$ is the set of all possible trajectories (of all lengths) and $\Delta(A)$ is the space of probability distributions over $A$. A stationary policy $\pi : S \to \Delta(A)$ specifies a decision-making strategy in which the agent chooses actions based only on the current state, i.e. $a_t \sim \pi(\cdot\,|s_t)$. A deterministic, stationary policy is of the form $\pi : S \to A$.

We now define values for (general) policies. For a fixed policy $\pi$ and a starting state $s_0 = s$, we define the value function $V^\pi_M : S \to \mathbb{R}$ as the discounted sum of future rewards,
$$V^\pi_M(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \pi, s_0 = s\right].$$
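As a companion to the interaction protocol and the definition of $V^\pi_M$, here is a small simulation sketch. It is an illustration only, not the book's code: the stochastic policy array `pi_probs`, the episode count, and the truncation horizon are arbitrary choices, and it reuses the hypothetical arrays $P$, $r$, $\gamma$ from the previous snippet. It rolls out trajectories $(s_0, a_0, r_0, s_1, \ldots)$ under a stationary policy and averages truncated discounted returns to estimate $V^\pi(s_0)$; truncating at horizon $H$ incurs at most $\gamma^H/(1-\gamma)$ error since rewards lie in $[0,1]$.

```python
# Sketch only: simulate the agent-environment protocol for a stationary policy and
# Monte-Carlo estimate V^pi(s0). Assumes tabular arrays P (S x A x S), r (S x A),
# and a discount gamma, as in the previous (hypothetical) snippet.
import numpy as np

def rollout(P, r, pi_probs, s0, horizon, rng):
    """Return [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})] with a_t ~ pi(.|s_t)."""
    traj, s = [], s0
    for _ in range(horizon):
        a = rng.choice(P.shape[1], p=pi_probs[s])   # a_t ~ pi(.|s_t)
        traj.append((s, a, r[s, a]))                # record (s_t, a_t, r_t)
        s = rng.choice(P.shape[2], p=P[s, a])       # s_{t+1} ~ P(.|s_t, a_t)
    return traj

def mc_value(P, r, pi_probs, s0, gamma, n_episodes=5000, horizon=200, seed=1):
    """Average of truncated discounted returns, approximating V^pi(s0)."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_episodes):
        traj = rollout(P, r, pi_probs, s0, horizon, rng)
        returns.append(sum(gamma**t * rt for t, (_, _, rt) in enumerate(traj)))
    return float(np.mean(returns))

# Example usage with a uniformly random stationary policy (an arbitrary choice):
# pi_probs = np.full((S, A), 1.0 / A)
# print(mc_value(P, r, pi_probs, s0=0, gamma=gamma))
```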

