{"title": "Regret Bounds for Learning State Representations in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 12738, "page_last": 12748, "abstract": "We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent. At least one of these representations is assumed to induce a Markov decision process (MDP), and the performance of the agent is measured in terms of cumulative regret against the optimal policy giving the highest average reward in this MDP representation. We propose an algorithm (UCB-MS) with O(sqrt(T)) regret in any communicating Markov decision process. The regret bound shows that UCB-MS automatically adapts to the Markov model. This improves over the currently known best results in the literature that gave regret bounds of order O(T^(2/3)).", "full_text": "Regret Bounds for Learning State Representations in\n\nReinforcement Learning\n\nRonald Ortner\n\nMontanuniversit\u00e4t Leoben\n\nrortner@unileoben.ac.at\n\nMatteo Pirotta\n\nFacebook AI Research\n\npirotta@fb.com\n\nRonan Fruit\n\nSequel Team \u2013 INRIA Lille\nronan.fruit@inria.fr\n\nAlessandro Lazaric\nFacebook AI Research\n\nlazaric@fb.com\n\nOdalric-Ambrym Maillard\nSequel Team \u2013 INRIA Lille\n\nodalric.maillard@inria.fr\n\nAbstract\n\nWe consider the problem of online reinforcement learning when several state\nrepresentations (mapping histories to a discrete state space) are available to the\nlearning agent. At least one of these representations is assumed to induce a Markov\ndecision process (MDP), and the performance of the agent is measured in terms of\ncumulative regret against the optimal policy giving the highest average reward in\nT ) regret\nin any communicating MDP. 
The regret bound shows that UCB-MS automatically\nadapts to the Markov model and improves over the currently known best bound of\n\nthis MDP representation. We propose an algorithm (UCB-MS) with (cid:101)O(\norder (cid:101)O(T 2/3).\n\n\u221a\n\nIntroduction\n\n1\nIn reinforcement learning, an agent aims to learn a task while interacting with an unknown environ-\nment. We consider online learning (i.e., non-episodic) problems where the agent has to trade off the\nexploration needed to collect information about rewards and dynamics and the exploitation of the\ninformation gathered so far. In this setting, it is commonly assumed that the agent knows a suitable\nstate representation which makes the process on the state space Markovian. However, designing\nsuch a representation is often highly non-trivial since many \u201creasonable\u201d representations may lead to\nnon-Markovian models.\nThe task of selecting or designing a (suitable and compact) state representation of a dynamical\nsystem is a well-known problem in engineering, especially in robotics. This setting has received a\nlot of attention in recent years due to the growing number of applications using images or videos as\nobservations [e.g., 1, 2, 3]. It is possible to come up with different approaches for extracting features\nfrom such high-dimensional observation spaces, but not all of them describe the problem well or\nexhibit Markovian dynamics. Indeed, the Markovian assumption that transitions and rewards are\nindependent of history is frequently violated in real-world applications where there is often rather\na dependence on the last k > 1 observations. To deal with this scenario Markov models have been\nextended from \ufb01rst-order models to kth-order models. The problem of selecting the right order of the\nmodel is a special case of the selection of the correct state representation. 
In both cases, the goal is to perform as well as when the true order or compact features of the Markov model are known. For more details and further examples we refer to [4, 5, 6].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We consider the following setting that was introduced by Hutter [7], where it was called feature reinforcement learning. The agent is provided with a finite set Φ of representations mapping histories (sequences of actions, observations, and rewards) to a finite set of states, such that at least one model φ◦ ∈ Φ induces a Markov decision process (MDP) [8]. The goal of the agent is to learn to solve the task under an appropriate representation. The ability to test and quickly discard non-Markovian representations (not compatible with the observed dynamics) is fundamental for learning efficiently. This efficiency is measured in terms of cumulative regret, which compares the reward collected by the learner to that of an agent knowing the Markovian representation and playing the associated optimal policy (i.e., the policy giving the highest average reward).

This problem was initially studied by Maillard et al. [4]. Given a finite set of representations Φ, after T steps the regret of the Best Lower Bound (BLB) algorithm w.r.t. any optimal policy associated to a Markov model is upper bounded by $\tilde O(\sqrt{|\Phi|}\, T^{2/3})$. The BLB algorithm is based on random exploration of the models and uses properties of UCRL2 [9] (an efficient algorithm for exploration-exploitation in communicating MDPs) to control the amount of time spent in non-Markovian models. BLB requires to estimate the diameter [9] of the true MDP, which leads to an additional additive term in the regret bound that may be exponential in the true diameter. BLB was successively extended by Nguyen et al.
[6] to the case of countably infinite sets of models. The suggested IBLB algorithm removes the guessing of the diameter (thus avoiding the additional exponential term in the regret), but its regret bound is still of order $T^{2/3}$. For the Optimistic Model Selection (OMS) algorithm [5], a regret bound of $\tilde O(\sqrt{|\Phi| T})$ has been claimed that matches the optimal dependence in terms of T. However, algorithm and analysis were based on the REGAL.D algorithm [10], and recently it has been pointed out that the proof of the regret bound for REGAL.D contains a mistake that invalidates also the result for OMS, see App. A of [11]. Accordingly, it has remained an open question whether it is possible to achieve regret bounds of order $\sqrt{T}$ in the considered setting.

In this paper we introduce UCB-MS, an optimistic elimination algorithm that performs efficient exploration of the representations. For this algorithm we prove a regret bound of order $\tilde O(\sqrt{|\Phi| T})$. Our algorithm as well as our results are based on and generalize the regret bounds achieved for the UCRL2 algorithm in [9]. In particular, if Φ consists of a single Markov model, we obtain the same regret bound as for UCRL2. UCB-MS employs optimism to choose a model from Φ. To avoid suffering too large regret from choosing a non-Markov model, the collected rewards are compared to regret bounds that are known to hold for Markov models. If a model fails to give sufficiently high reward, it is eliminated. On the other hand, UCB-MS is happy to employ a non-Markov model as long as it gives as much reward as would be expected from a corresponding Markov model.

While UCB-MS shares some basic ideas with BLB and OMS, it is simpler than OMS, yet recovers the same regret bounds that have been claimed for OMS. Like BLB, UCB-MS has to guess the diameter; however, the guessing scheme we employ comes at little cost w.r.t. regret and in particular does not cause any additive constants in the bounds that are exponential in the diameter. We also show how to modify the guessing scheme to guess the diameter and the size of the state space of the Markov model φ◦ at the same time. Last but not least, we introduce the notion of the effective size $S_\Phi$ of the state space induced by the model set Φ and give regret bounds in terms of $S_\Phi$. This yields improvements depending on the structure of Φ (like for hierarchical models).

Overview. We start with a detailed description of the learning setting in the following section. In Section 3, we give some preliminaries concerning the UCRL2 algorithm. Our UCB-MS algorithm is introduced in Section 4, where we also give the regret analysis in case the diameter of the underlying Markov model is known. The following Section 5 shows how to guess the diameter otherwise. Section 6 gives some further improvements and also introduces the notion of effective state space.

2 Setting

The details of the considered learning setting are as follows. At each time step t = 1, 2, ..., the learner receives an observation $o_t$ and has to choose an action $a_t$ from a finite set of actions A. In return, the learner receives a reward $r_t$ taken from R = [0, 1] and the next observation $o_{t+1}$. We denote by O the set of observations and define the history $h_t$ up to step t as the sequence $o_1, a_1, r_1, o_2, \ldots, a_t, r_t, o_{t+1}$ of observations, actions, and rewards. The set $\mathcal H_t := O \times (A \times R \times O)^{t-1}$ contains all possible histories up to step t, and we set $\mathcal H := \bigcup_{t \ge 1} \mathcal H_t$ to be the set of all possible histories.

2.1 Models and MDPs

A state-representation model (in the following short: model) φ is a function that maps histories to states, that is, $\phi : \mathcal H \to S_\phi$. If a model φ induces a Markov decision process (MDP), we call it a
An MDP has the Markov property that any time t, the probability of reward rt and\nnext state st+1 = \u03c6(ht+1), given the past history ht, only depends on the current state st = \u03c6(ht)\nand the chosen action at, i.e., P (st+1, rt|ht, at) = P (st+1, rt|st, at). We assume that for MDPs this\nprobability is also independent of t.\nUsually, an MDP M with state space S and action space A is denoted as a tuple M = (S,A, r, p),\nwhere r(s, a) is the mean reward and p(s(cid:48)|s, a) the probability of a transition to state s(cid:48) \u2208 S when\nchoosing action a \u2208 A in state s \u2208 S. MDPs are called communicating if for any two states s, s(cid:48) it is\npossible to reach s(cid:48) from s with positive probability by choosing appropriate actions. The smallest\nexpected time it takes to connect any two states is called the diameter D of the MDP, cf. [9]. In\ncommunicating MDPs, the optimal average reward \u03c1\u2217 is independent of the initial state and will be\nachieved by a stationary deterministic policy \u03c0\u2217 \u2208 \u03a0SD that maps states to actions.\nFor a Markov model \u03c6, we write M (\u03c6) for the induced MDP. Its diameter and optimal average reward\nwill be denoted as D(\u03c6) and \u03c1\u2217(\u03c6), respectively. For all models \u03c6, we abbreviate S\u03c6 := |S\u03c6|.\n2.2 Problem setting\nThe learning setting we consider is the following. As already described before, the learner chooses\nactions at and obtains a reward rt and an observation ot+1 in return. We assume that the learner has\na \ufb01nite set \u03a6 of models at her disposal and at least one model \u03c6\u25e6 in \u03a6 is a Markov model. The goal is\nto provide algorithms that perform well with respect to the optimal policy \u03c0\u2217 in the MDP M (\u03c6\u25e6),\nthat is, the optimal strategy when the Markov model and the induced underlying MDP are completely\nknown. 
Accordingly, the performance of a learning algorithm will be measured by considering its regret after any T steps, defined as (cf. [9, 10, 4])

$$T \rho^*(\phi^\circ) - \sum_{t=1}^{T} r_t,$$

where $r_t$ is the reward received by the learning algorithm at step t.

3 UCRL2 Preliminaries

The algorithm we propose is based on the UCRL2 algorithm of [9]. Accordingly, we start with some respective preliminaries.

UCRL2 is an algorithm that generalizes the optimism-in-the-face-of-uncertainty idea of UCB [12] from the bandit setting to reinforcement learning in MDPs. In the following, we assume an underlying MDP M with S states, A actions, and diameter D. The UCRL2 algorithm uses confidence intervals to define a set of plausible MDPs $\mathcal M$. That is, acting in the unknown MDP M, UCRL2 maintains estimates $\hat r(s,a)$ and $\hat p(\cdot \mid s,a)$ of rewards and transition probabilities, respectively. The set $\mathcal M_t$ of plausible MDPs at step t contains all MDPs with rewards $\tilde r(s,a)$ and transition probabilities $\tilde p(\cdot \mid s,a)$ satisfying¹

$$\bigl|\hat r(s,a) - \tilde r(s,a)\bigr| \le \sqrt{\frac{7 \log(4SAt^3/\delta)}{2N(s,a)}}, \qquad (1)$$

$$\bigl\|\hat p(\cdot \mid s,a) - \tilde p(\cdot \mid s,a)\bigr\|_1 \le \sqrt{\frac{14 S \log(4At^3/\delta)}{2N(s,a)}}, \qquad (2)$$

where N(s,a) denotes the number of times a has been chosen in s (and is set to 1 if a has not been chosen in s so far). The true MDP M is in $\mathcal M_t$ with high probability.

Lemma 1 (Lemma 17 in the appendix of [9]²). With probability at least $1 - \frac{\delta}{30 t^8}$, at step t the true MDP M is contained in the set $\mathcal M_t$.

The UCRL2 algorithm proceeds in episodes k = 1, 2, .... In each episode k, starting at step $t_k$, the algorithm plays a fixed policy $\tilde\pi_k$, which is chosen to maximize the optimal average reward of an MDP in $\mathcal M_k := \mathcal M_{t_k}$.
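In code, the interval widths in (1) and (2) are simple functions of the visit counts. The following sketch (our own illustration, not from [9]) computes both widths exactly as displayed above:

```python
from math import log, sqrt

def confidence_widths(S, A, t, N, delta):
    """Widths of the confidence intervals (1) and (2).

    S, A: sizes of the state and action space of the considered model
    t: current time step; N: visit count N(s, a) (set to 1 if never visited)
    delta: confidence parameter
    """
    N = max(N, 1)
    reward_width = sqrt(7 * log(4 * S * A * t**3 / delta) / (2 * N))
    transition_width = sqrt(14 * S * log(4 * A * t**3 / delta) / (2 * N))
    return reward_width, transition_width

r_w, p_w = confidence_widths(S=10, A=2, t=100, N=25, delta=0.05)
```

Both widths shrink as $1/\sqrt{N(s,a)}$, so the set of plausible MDPs concentrates around the empirical model as visits accumulate.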
That is, writing $\rho(\pi, M)$ for the average reward of policy π in MDP M, we set $\tilde\rho_k := \max_{\pi, M \in \mathcal M_k} \{\rho(\pi, M)\} = \rho(\tilde\pi_k, \widetilde M_k)$, where $\widetilde M_k$ is an optimistic MDP chosen from $\mathcal M_k$ to maximize $\tilde\rho_k$. As the true MDP M is in $\mathcal M_k$ with high probability, we also have that $\tilde\rho_k \ge \rho^*$.

¹The confidence intervals shown here are the ones we use in the following and slightly differ from the confidence intervals given for UCRL2 in [9]. That is, the confidence δ of the original values is replaced by δ/2t² to guarantee smaller error probability, which is needed in our analysis.
²As noted before, the error probability δ has been changed from δ to δ/2t² here.

Let $v_k(s,a)$ denote the number of times a has been chosen in s in episode k, while $N_k(s,a)$ denotes the number of times a has been chosen in s before episode k (i.e., in episodes 1 to k−1). If there were no visits in (s,a) before episode k, then $N_k(s,a)$ is set to 1. Episode k is terminated by UCRL2 when a state s is reached in which $v_k(s, \tilde\pi_k(s)) = N_k(s, \tilde\pi_k(s))$.

One can show the following upper bound on the regret of UCRL2.

Theorem 2 (Theorem 2 of [9]). With probability 1 − δ, the regret of UCRL2 after any T steps is bounded by

$$34 D S \sqrt{A T \log\bigl(\tfrac{2T^3}{\delta}\bigr)}.$$

The bound is based on an episode-wise decomposition of the regret, which we will use for our algorithm. Let $T_k$ be the (current) length of episode k. In the following, we abuse notation for $T_k$ as well as for $v_k(s,a)$ by using the same notation for the number of steps in a terminated episode as well as for the current number of steps in an ongoing episode. Further, recall that $t_k$ denotes the time step when episode k starts. The regret of UCRL2 in any episode k is bounded as follows.

Lemma 3.
Consider an arbitrary episode k started at step $t_k$. With probability $1 - \frac{\delta}{2 t_k^2}$, the regret of UCRL2 at each step $T_k$ in this episode is bounded by

$$\Bigl(2D\sqrt{14 S \log\bigl(\tfrac{4A t_k^3}{\delta}\bigr)} + 2\Bigr) \sum_{s,a} \frac{v_k(s,a)}{\sqrt{N_k(s,a)}} + 4D\sqrt{T_k \log\bigl(\tfrac{16 t_k^2 T_k}{\delta}\bigr)} + D.$$

Proof. The bound in Lemma 3 is not explicitly stated for single episodes in [9] but easily follows from equations (8), (9), (15)–(17), and the argument given before equation (18), choosing confidence δ/t² instead of δ. For the sake of completeness, we give a brief proof sketch.

First, by replacing the random rewards by the mean rewards r(s,a), the regret $\Delta_k$ of episode k can be bounded by (cf. Eq. 8 in [9])

$$\Delta_k \le \sum_{s,a} v_k(s,a)\bigl(\rho^* - r(s,a)\bigr) + \sqrt{\tfrac{5}{8} T_k \log\bigl(\tfrac{16 t_k^2 T_k}{\delta}\bigr)}. \qquad (3)$$

Let $\tilde r$ and $\tilde p$ denote the rewards and transition probabilities in the optimistic MDP $\widetilde M_k$. Then the difference in the sum of (3) can be bounded and split up as

$$\rho^* - r(s,a) \le \tilde\rho_k - r(s,a) \le \bigl(\tilde\rho_k - \tilde r(s,a)\bigr) + \bigl(\tilde r(s,a) - r(s,a)\bigr), \qquad (4)$$

where the last term is controlled by the confidence intervals in (1), cf. Eq. (15) in [9]. The other term can be written as $\tilde\rho_k - \tilde r(s,a) = \sum_{s'} \tilde p(s'|s,a)\, w(s') - w(s)$ for the shifted value vector w (cf. p. 1576 of [9]), so that splitting up again one has (cf. Eq. 16 in [9])

$$\tilde\rho_k - \tilde r(s,a) = \sum_{s'} \bigl(\tilde p(s'|s,a) - p(s'|s,a)\bigr) w(s') + \Bigl(\sum_{s'} p(s'|s,a)\, w(s') - w(s)\Bigr). \qquad (5)$$

The first term is handled by the confidence intervals in (2) and the fact that w(s) ≤ D (cf. Eq. 17 in [9]), while the second term can be written as a martingale difference sequence and bounded by $D\sqrt{\tfrac{5}{2} T_k \log\bigl(\tfrac{16 t_k^2 T_k}{\delta}\bigr)} + D$ using Azuma-Hoeffding (cf. Eq. 18 in [9]). Finally, taking into account an additional regret term of $\sqrt{T_k}$ caused by failing confidence intervals (cf. Eq. 9 in [9]) and combining (3)–(5) gives the claimed bound, where the first term stems from the confidence intervals (1) and (2). □

4 The UCB-MS Algorithm

Now let us turn to the state representation learning setting introduced in Section 2. We start with the simpler case when an upper bound D̄ on the diameter D := D(φ◦) of the Markov model φ◦ is known (i.e., D̄ ≥ D). The case when no bound on the diameter is known is dealt with in Section 5.

Algorithm 1 UCB-Model Selection (UCB-MS)

Input: set of models Φ, confidence parameter δ ∈ (0, 1), upper bound D̄ on diameter
Initialization: Let t := 1 be the current time step.
for episodes k = 1, 2, . . .
do
    Let $t_k$ := t be the initial step of the current episode k, and let $\mathcal M_{t,\phi}$ be the set of plausible MDPs defined via the confidence intervals (1) and (2) for the estimates so far.
    ▶ For each φ ∈ Φ, use Extended Value Iteration (EVI) to compute an optimistic MDP $\widetilde M_{k,\phi}$ in $\mathcal M_{t,\phi}$ and a (near-)optimal policy $\tilde\pi_{k,\phi}$ on $\widetilde M_{k,\phi}$ with approximate average reward $\tilde\rho_{k,\phi}$.
    ▶ Choose model $\phi_k \in \Phi$ such that
        $$\phi_k = \operatorname*{argmax}_{\phi \in \Phi} \{\tilde\rho_{k,\phi}\}, \qquad (8)$$
      and set $\tilde\rho_k := \tilde\rho_{k,\phi_k}$, $\tilde\pi_k := \tilde\pi_{k,\phi_k}$, and $S_k := S_{\phi_k}$.
    ▶ Repeat till termination of the current episode k:
        ◦ Choose action $a_t := \tilde\pi_k(s_t)$, get reward $r_t$, and observe next state $s_{t+1} \in S_k$.
        ◦ Set t := t + 1.
        ◦ if $v_k(s_t, a_t) = N_{t_k}(s_t, a_t)$ then terminate current episode.
        ◦ if
            $$(t - t_k + 1)\,\tilde\rho_k - \sum_{\tau=t_k}^{t} r_\tau > \Gamma_t(\bar D) \qquad (9)$$
          then set Φ := Φ \ {$\phi_k$} and terminate current episode.
end for

The UCB-MS algorithm we propose (shown as Alg. 1) basically performs the policy computation of UCRL2 for each model φ. That is, in episodes k = 1, 2, ..., UCB-MS constructs for each state representation φ ∈ Φ a set of plausible MDPs $\mathcal M_{k,\phi}$ and computes the optimistic average reward

$$\tilde\rho_{k,\phi} = \max_{\pi \in \Pi^{SD},\, M \in \mathcal M_{k,\phi}} \{\rho(\pi, M)\}. \qquad (6)$$

This problem can be solved using Extended Value Iteration (EVI) [9] up to an arbitrary accuracy.³ Among all the models, UCB-MS selects the one with the highest average reward $\tilde\rho_k$, cf. Eq. (8). The associated optimistic policy $\tilde\pi_k$ is executed until the number of visits is doubled in at least one state-action pair (the UCRL2 stopping condition) or this policy does not provide sufficiently high average reward (see Eq. 9), in which case the model $\phi_k$ is eliminated.

The function $\Gamma_t$ in Eq. (9) defines the allowed deviation from the promised optimistic average reward $\tilde\rho_k$. We define $\Gamma_t$ according to Lemma 3 as

$$\Gamma_t(D) := \Bigl(2D\sqrt{14 S_{\phi_t} \log\bigl(\tfrac{4A t_{k(t)}^3}{\delta}\bigr)} + 2\Bigr) \sum_{s,a} \frac{v_{k(t)}(s,a)}{\sqrt{N_{k(t)}(s,a)}} + 4D\sqrt{T_{k(t)} \log\bigl(\tfrac{16 t_{k(t)}^2 T_{k(t)}}{\delta}\bigr)} + D, \qquad (7)$$

where k(t) denotes the episode in which step t occurs. In Eq. (9) we exploit the prior knowledge D̄ ≥ D in order to properly define the condition for model elimination. We will see below in Section 5 that it is easy to adapt the algorithm to the case of unknown diameter.

If the set Φ consists only of a single Markov model, UCB-MS basically coincides with UCRL2, with an additional checking step that will result in discarding the single model only with small probability. Note that UCB-MS shares the optimistic model selection and the idea of eliminating underachieving models with OMS; however, its structure is much simpler.

Concerning the computational complexity of our algorithm, note that the EVI subroutine we use for policy computation works just as ordinary value iteration with the same convergence properties and the same computational complexity, with an additional overhead of O(S²A) per iteration step, cf. [9]. Further, policy computation is performed for each model φ at most $|\Phi| + S_\phi A \log T$ times, cf. Lemma 5 (c) below.

³As for UCRL2, we set the accuracy in episode k to be $1/\sqrt{t_k}$.

Our first result is the following regret bound for UCB-MS.
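To make the selection-and-elimination step concrete, here is a minimal sketch (our own, not the authors' code) of one episode of the Algorithm 1 loop. `optimistic_value` stands in for the EVI computation of $\tilde\rho_{k,\phi}$, `Gamma` for the threshold $\Gamma_t(\bar D)$ in Eq. (9), and the fixed episode cap for the doubling criterion; all three are simplifying assumptions of this sketch.

```python
def run_episode(models, optimistic_value, step, Gamma):
    """One episode of the UCB-MS loop (simplified sketch).

    models: list of candidate representations phi
    optimistic_value(phi) -> optimistic average reward (EVI stand-in)
    step(phi) -> observed reward when acting under phi's optimistic policy
    Gamma(n) -> allowed deviation after n episode steps (Eq. 9 stand-in)
    """
    # Optimistic model selection, Eq. (8).
    phi_k = max(models, key=optimistic_value)
    rho_k = optimistic_value(phi_k)

    total_reward, n = 0.0, 0
    while True:
        total_reward += step(phi_k)
        n += 1
        # Elimination test, Eq. (9): promised minus collected reward.
        if n * rho_k - total_reward > Gamma(n):
            models.remove(phi_k)   # model underachieves: eliminate it
            return phi_k, n
        if n >= 8:                 # stand-in for the doubling criterion
            return None, n

# Toy usage: a model promising 0.9 but delivering 0.2 gets eliminated.
models = ["phi_good", "phi_bad"]
eliminated, n = run_episode(
    models,
    optimistic_value={"phi_good": 0.5, "phi_bad": 0.9}.get,
    step=lambda phi: 0.5 if phi == "phi_good" else 0.2,
    Gamma=lambda n: 1.0,
)
print(eliminated)  # phi_bad
```

Note how the over-optimistic non-Markov model is chosen first, exactly as Eq. (8) prescribes, and is then discarded as soon as the gap between promised and collected reward exceeds the allowance.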
Here $S_{\max} := \max_\phi S_\phi$ denotes the size of the state space of the largest model and $S_\Sigma := \sum_\phi S_\phi$ the size of the total state space over all models.

Theorem 4. With probability 1 − δ, the regret of UCB-MS using D̄ ≥ D is bounded by

$$\mathrm{const} \cdot \bar D \sqrt{S_{\max} S_\Sigma A T \log\bigl(\tfrac{T}{\delta}\bigr)}.$$

Note that the bound of Theorem 4 holds for any Markov model in Φ. Thus, in case there is a Markov model with smaller state space, the regret bound shows that UCB-MS automatically adapts to this preferable model. When Φ consists of a single Markov model we re-establish the bounds for UCRL2 (however for an algorithm that, unlike UCRL2, needs the diameter D as input). Most importantly, the bound of Theorem 4 improves over the currently best known bound for BLB, which is of order $\tilde O(T^{2/3})$. If all models induce a state space of equal size S, the bound in Theorem 4 is $\tilde O(DS\sqrt{|\Phi| A T})$, which also improves over the claimed regret bound of OMS, which is of order $\tilde O(DS^{3/2} A \sqrt{|\Phi| T})$. We note however that in other cases the state space dependence of the OMS bound may be better. In Section 6 below we show how to regain the OMS bound for our algorithm and how $S_\Sigma$ in the bounds can be replaced by the effective size of the state space, which in some cases (like for hierarchical models) can be considerably smaller.

While the $\sqrt{A}$-dependence is optimal, as for UCRL2, by using a refined analysis (see [13]) it is also possible to obtain an optimal $\sqrt{D}$-dependence. On the other hand, the optimality in S and |Φ| is still an open question. While the S-dependence can be reduced using Bernstein inequality, we are not aware of any lower bound for |Φ| in this setting. The closest result we know is for aggregation techniques with full information, where it is possible to obtain bounds of order log(|Φ|).
Obviously, in our setting we have less information and it is not clear if it is possible to obtain logarithmic dependence.

Note that while the regret is measured w.r.t. the true Markov model φ◦, it is actually not necessary to identify φ◦ to obtain the regret bound of Theorem 4. As long as a non-Markov model gives at least the same reward that would be expected from a Markov model, there is no need to discard it. Such a model could be, for example, a good (non-Markovian) approximation.

4.1 Analysis (Proof of Theorem 4)

The following lemma collects some basic facts about UCB-MS.

Lemma 5. With probability 1 − δ, all of the following statements hold:
(a) The confidence intervals (1) and (2) of the Markov model φ◦ hold for all time steps t = 1, ..., T.
(b) No Markov models are discarded in (9).
(c) The number of episodes of UCB-MS is bounded by $|\Phi| + S_\Sigma A \log T$.

Proof. (a) follows from Lemma 1 by summing over the error probabilities, giving a total error probability of $\sum_t \frac{\delta}{30 t^8} < \frac{\delta}{6}$.

For (b), if UCB-MS chooses a Markov model, then the regret in the respective episode is bounded according to Lemma 3. The sum over the respective error probabilities $\delta/2t_k^2$ over all episodes is bounded by $\frac{5\delta}{6}$, which proves (b).

If (b) holds, then there are at most |Φ| − 1 episodes in which a model is discarded. For episodes which are terminated by doubling the number of visits, we can use Proposition 18 of [9], as the episode termination criterion of UCB-MS is the same as for UCRL2. Since we have to take into account all states of all models, the size of the state space to be considered is the sum over the sizes of the state spaces of the individual models. □

The bound on the number of episodes in the worst case depends on $S_\Sigma$.
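The doubling criterion behind Lemma 5 (c) caps the number of episodes per state-action pair at roughly log T, since each such episode at least doubles the pair's visit count. A quick numeric check of this growth (our own toy calculation, for a single state-action pair):

```python
from math import log2

def episodes_until(T):
    """Count doubling-terminated episodes for a single state-action pair.

    Episode k ends when the in-episode visit count v_k reaches the prior
    count N_k (set to 1 if there were no visits yet), so counts double.
    """
    t, episodes, N = 0, 0, 0
    while t < T:
        v_target = max(N, 1)   # episode length under the doubling criterion
        t += v_target
        N += v_target
        episodes += 1
    return episodes

# Episodes grow like log T, matching the S*A*log T term of Lemma 5 (c).
for T in (10, 1000, 100000):
    assert episodes_until(T) <= 2 * log2(T) + 2
print(episodes_until(1000))  # 11
```

With $S_\Sigma A$ independent state-action pairs, each contributing logarithmically many doubling episodes, the $S_\Sigma A \log T$ term follows.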
Under some assumptions on the given models in Φ (like having hierarchical models) this can be reduced, see Section 6 for details.

Proof of Theorem 4. We assume that the statements of Lemma 5 all hold, which is the case with probability 1 − δ. Let φ◦ be a Markov model in Φ and consider any episode k. By Lemma 5 (a), the optimistic estimate $\tilde\rho_{k,\phi^\circ} \ge \rho^*(\phi^\circ)$. By the optimism of the algorithm we further have that $\tilde\rho_k \ge \tilde\rho_{k,\phi^\circ}$. Hence, the regret $\Delta_k$ in each episode k is bounded by

$$\Delta_k := T_k \cdot \rho^*(\phi^\circ) - \sum_{\tau=t_k}^{t_k+T_k} r_\tau \le T_k \cdot \tilde\rho_k - \sum_{\tau=t_k}^{t_k+T_k} r_\tau.$$

By the definition of the algorithm, condition (9) does not hold at least up to the final step of the episode, so that we obtain (as rewards are upper bounded by 1)

$$\Delta_k \le \Gamma_{t_k}(\bar D) + 1.$$

Using the definition of $\Gamma_t(\bar D)$ in (7) and summing over all K episodes, we obtain a regret bound of

$$\sum_k \Delta_k \le \sum_k \bigl(\Gamma_{t_k}(\bar D) + 1\bigr) \le \Bigl(2\bar D\sqrt{14 S_{\max} \log\bigl(\tfrac{4AT^3}{\delta}\bigr)} + 2\Bigr) \sum_k \sum_{s,a} \frac{v_k(s,a)}{\sqrt{N_k(s,a)}} + 4\bar D\sqrt{\log\bigl(\tfrac{16T^3}{\delta}\bigr)} \sum_k \sqrt{T_k} + K\bar D.$$

Using that $\sum_k T_k = T$ together with Jensen's inequality, we have $\sum_k \sqrt{T_k} \le \sqrt{KT}$. Further, as for the analysis of UCRL2, we have (cf. Eq. 20 of [9])

$$\sum_k \sum_{s,a} \frac{v_k(s,a)}{\sqrt{N_k(s,a)}} \le \bigl(\sqrt{2} + 1\bigr)\sqrt{S_\Sigma A T}.$$

Summarizing, we obtain, using the bound on the number of episodes of Lemma 5 (c) and noting that $|\Phi| \le S_\Sigma$, after some simplifications a regret bound of

$$\mathrm{const}_1 \cdot \bar D\sqrt{S_{\max} S_\Sigma A T \log\bigl(\tfrac{T}{\delta}\bigr)} + \mathrm{const}_2 \cdot \bar D\sqrt{S_\Sigma A T (\log T) \log\bigl(\tfrac{T}{\delta}\bigr)} + \mathrm{const}_3 \cdot \bar D S_\Sigma A \log T,$$

which completes the proof of the theorem. □

5 Unknown Diameter

If the diameter is unknown, we suggest the following guessing scheme. We run UCB-MS with some initial value D̄ ≥ 1. If at some step all models have been eliminated, then double the value of D̄ and restart the algorithm, that is, start a new episode, now considering all models again.

One can show that the regret of this doubling scheme is basically bounded as before unless D is very large compared to T.

Theorem 6. With probability 1 − δ, the regret of UCB-MS guessing D by doubling is bounded by

$$\mathrm{const} \cdot D\sqrt{\bigl(S_{\max} S_\Sigma A + |\Phi| \log D\bigr) T \log\bigl(\tfrac{T}{\delta}\bigr)}.$$

Proof. Let $D_k$ denote the parameter D̄ used in episode k as an estimate for D. As in the proof of Theorem 4, we have that a Markov model will not be eliminated with high probability once $D_k \ge D$. Hence, in total there cannot be more than $\lceil |\Phi| \log_2 D \rceil$ episodes that are terminated by discarding a model.

Let $\Gamma_t(D)$ be defined as in (7).
Then the same argument as in the proof of Theorem 4 shows that the regret in each episode k is bounded by $\Gamma_{t_k}(D_k) + 1$. The rest of the proof can be rewritten from Theorem 4, using that $D_k < 2D$ for all k with high probability. The only difference is that the bound on the number of episodes has an additional term of $\lceil |\Phi| \log_2 D \rceil$, so that one obtains a regret bound of

$$\mathrm{const}_1 \cdot D\sqrt{S_{\max} S_\Sigma A T \log\bigl(\tfrac{T}{\delta}\bigr)} + \mathrm{const}_2 \cdot D\sqrt{\bigl(S_\Sigma A (\log T) + |\Phi| \log D\bigr) T \log\tfrac{T}{\delta}} + \mathrm{const}_3 \cdot \bigl(D S_\Sigma A + |\Phi| \log D\bigr) \log T.$$

Summarizing the terms gives the claimed bound. □

Theorem 6 shows that the cost of the guessing scheme w.r.t. the regret is small and, in particular, does not result in any additive constant in the bound that is exponential in the diameter (in contrast to the bound for BLB [4]). Thus, the improvements over OMS discussed after Theorem 4 hold also for UCB-MS with guessing the diameter.

6 Improving the Bounds

In this section, we consider further improvements of our bounds and introduce the notion of the effective size of the state space for a set of models Φ.

6.1 Improving on the Number of Episodes

The regret bounds we obtain for UCB-MS are basically of the same order as for standard reinforcement learning in MDPs (i.e., with a given Markov model) as achieved e.g. by [9]. However, the state space dependence seems not completely satisfactory, as the bounds do not only depend on the state space size of the Markov model, but on the total state space size $S_\Sigma$ over all models.

The appearance of the parameter $S_\Sigma$ in the bounds is due to the bound on the number of episodes in Lemma 5 (c). In the worst case, this bound cannot be improved. That is, without any further assumptions on the way models in Φ aggregate histories, one cannot say how visits in a state under some model φ translate into state visits under some other model φ'. For example, when under some model φ all states have been visited so far, the respective histories may be mapped to just a single state under some other model φ'. Consequently, one basically has to assume that the states of different models φ, φ' are completely independent of each other, which leads to the bound of Lemma 5 (c). However, if there is some particular structure on the set of given models Φ, the bound on the number of episodes can be improved to not depend on the total number of states $S_\Sigma$.

Definition 7. Let Φ be a set of state-representation models. We define the effective size $S_\Phi$ of the state space of Φ to be the number of states that are sufficient to cover all states under Φ, in the sense that visits in all $S_\Phi$ covering states induce visits in all other states.

A simple example is when models are hierarchical. That is, there is some model φ in Φ such that all other models φ' aggregate the states of φ, i.e., it holds that if φ(h) = φ(h') then φ'(h) = φ'(h') for all histories h, h' in $\mathcal H$. In this case, $S_\Phi = S_\phi$, while $S_\Sigma$ could be of order $2^{S_\phi}$, as each subset of $S_\phi$ may correspond to an aggregated state in some other model of Φ. Note that when considering different orders for an MDP, this also results in a hierarchical model set.

In general, we obviously have that $S_\Phi \le S_\Sigma$, and the bound on the number of episodes of Lemma 5 (c) can be improved to depend on $S_\Phi$ instead of $S_\Sigma$ (with the same proof).

Lemma 8.
The number of episodes of UCB-MS terminated by the doubling criterion is bounded by S_Φ A log T.

Accordingly, we can strengthen the results of Theorems 4 and 6 as follows.

Theorem 9. The regret bounds of Theorems 4 and 6 hold with S_Σ replaced by S_Φ.

6.2 Improving Further on the State Space Dependence

Even after replacing S_Σ by S_Φ, there is still room for improvement of the bounds with respect to the size of the state space. In principle, one would like to have a dependence on the size of the state space of the Markov model φ◦. As we have seen, with the current analysis the dependence on the effective number of states S_Φ is unavoidable. However, we can improve over the second appearing state space term S_max by guessing the right size of the state space (i.e., S_{φ◦}). We distinguish between two cases, depending on whether a bound on the diameter is known.

6.2.1 Diameter Known

If there is a known bound on the diameter, we can guess the size of the state space by the same scheme we have suggested for guessing the diameter in Section 5. That is, starting with S := 1 or S := min_φ S_φ, we compare the collected rewards to the optimistic average reward ρ̃_k of the current episode k, as before eliminating underachieving models. As comparison term we now choose (in accordance with the regret bound for UCRL2 in Theorem 2)

$$\Gamma_t(S) := 34\, D S \sqrt{A (t - t_k + 1) \log\big(\tfrac{2t^3}{\delta}\big)}. \qquad (10)$$

For this guessing scheme one can show the following regret bound.

Theorem 10. With probability 1 − δ, the regret of UCB-MS guessing S by doubling is bounded by

$$\mathrm{const} \cdot D S_{\varphi^\circ} \sqrt{\big(S_\Phi A \log T + |\Phi| \log S_{\varphi^\circ}\big)\, A T \log\big(\tfrac{T}{\delta}\big)}.$$

Proof. The proof is like that for Theorem 6, only that now S instead of D is guessed and the comparison term Γ_t is different.
That is, any Markov model φ◦ will not be discarded with high probability once S ≥ S_{φ◦}. Therefore, there will be at most ⌈|Φ| log₂ S_{φ◦}⌉ episodes that are terminated by discarding a model.

Let S_k be the guess for the size of the state space in episode k. Then, similar to the proofs of Theorems 4 and 6, the regret in each episode k is bounded by Γ_{t_k}(S_k) + 1. As S_k ≤ 2 S_{φ◦} w.h.p., summing over all ≤ ⌈|Φ| log₂ S_{φ◦}⌉ + S_Φ A log T episodes, Jensen's inequality gives the claimed regret bound.

We see that replacing S_max with S_{φ◦} comes at the cost of a worse dependence on the number of states and actions, as the summing over episodes in the proof has to be done differently. Still, if S_max is quite large, the bound of Theorem 10 can be an improvement over the previously presented bounds.

6.2.2 Unknown Diameter

If the diameter is not known, one can do the guessing for both D and S at the same time. More precisely, in the comparison term one does not guess D and S separately but the factor DS instead. That is, one starts with setting $\widetilde{DS} := 1$ or some other fixed value like $\widetilde{DS} := \min_\varphi S_\varphi$ and defines the comparison term as

$$\Gamma_t(\widetilde{DS}) := 34\, \widetilde{DS} \sqrt{A (t - t_k + 1) \log\big(\tfrac{2t^3}{\delta}\big)}. \qquad (11)$$

This leads to the following regret bound, which basically corresponds to the bound that has been claimed for OMS, only with S_Σ replaced by the potentially smaller S_Φ.

Theorem 11. With probability 1 − δ, the regret of UCB-MS guessing both D and S by doubling is bounded by

$$\mathrm{const} \cdot D S_{\varphi^\circ} \sqrt{\big(S_\Phi A \log T + |\Phi| \log(D S_{\varphi^\circ})\big)\, A T \log\big(\tfrac{T}{\delta}\big)}.$$

Proof. The proof is like that for Theorem 10. W.h.p.
there will be at most ⌈|Φ| log₂(D S_{φ◦})⌉ episodes that are terminated by eliminating a model, while the regret in each episode k is bounded by $\Gamma_{t_k}(\widetilde{DS}_k) + 1$, where $\widetilde{DS}_k \le 2 D S_{\varphi^\circ}$ is the guess for episode k. A sum over the episodes gives the claimed bound.

7 Discussion

While we have decided to use UCRL2 as the reference algorithm for the definition of our UCB-MS strategy, our approach can actually serve as a blueprint for adapting any optimistic algorithm with known regret bounds to the state representation setting considered in this paper. In particular, improved regret bounds (possibly w.r.t. the parameters S and D, cf. [9]) for UCRL2 or a variation of it (such as the recent [13]) automatically result in improved bounds for a corresponding variant of UCB-MS.

The OMS algorithm [5] employs some form of regularization so that models with a large state space are less appealing. However, this did not avoid the dependence of the claimed bounds of [5] on S_Σ. It is an interesting question whether some improved regularization approach can give bounds that depend only on S_{φ◦}. In general, the right dependence of regret bounds on the size of the model set Φ is also an open problem.

Another question that is still open also for the MDP setting is whether the diameter can be replaced by the bias span λ* of the optimal policy [10, 14]. With an upper bound on λ*, one could replace UCRL2 by the SCAL algorithm of [14]. However, the guessing scheme we employ for the diameter does not work for SCAL, as the chosen policies may no longer be optimistic if the guess for λ* is too small.

Another direction for future research is the generalization to infinite model sets, which for the case of discrete sets has already been done for the BLB algorithm [6]. Parametric sets of models would be an interesting next step from there.
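To summarize the algorithmic side, the doubling guess combined with model elimination used in Sections 5 and 6 can be sketched in a few lines. This is a schematic reading, not our actual implementation: the function and dictionary keys (`elimination_round`, `optimistic`, `collected`) are hypothetical names, and the shortfall test is a simplified stand-in for the comparison of collected rewards against the optimistic average reward ρ̃_k over full UCRL2 episodes.

```python
import math

def gamma(DS, A, episode_len, t, delta):
    # Comparison term of equation (11) (schematic):
    #   Gamma_t(DS) = 34 * DS * sqrt(A * (t - t_k + 1) * log(2 t^3 / delta))
    return 34 * DS * math.sqrt(A * episode_len * math.log(2 * t ** 3 / delta))

def elimination_round(models, DS, A, episode_len, t, delta):
    """One round of the guessing scheme (simplified): eliminate models whose
    collected reward falls short of their optimistic average reward by more
    than Gamma_t(DS); if no model survives, the guess DS was too small, so
    all models are restored and the guess is doubled."""
    threshold = gamma(DS, A, episode_len, t, delta)
    survivors = [m for m in models
                 if m["optimistic"] - m["collected"] <= threshold]
    if survivors:
        return survivors, DS
    return list(models), 2 * DS
```

Under this reading, the guess is doubled at most ⌈log₂(D S_{φ◦})⌉ times before the comparison term dominates the shortfall of a Markov model w.h.p., which is where the |Φ| log(D S_{φ◦}) term in Theorem 11 comes from.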
In the context of such generalizations, it also makes sense to consider approximate Markov models, that is, settings in which the assumption that there is a Markov model is dropped. The results given in [15] for this setting are also affected by the mentioned error in the proof of the OMS regret bound. We think that our approach can be adapted; however, the details still have to be worked out.

Acknowledgments

This work has been supported by the Austrian Science Fund (FWF): I 3437-N33 in the framework of the CHIST-ERA ERA-NET (DELTA project). Odalric-Ambrym Maillard was supported by CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020, the French Ministry of Higher Education and Research, Inria Lille – Nord Europe, CRIStAL, and the French Agence Nationale de la Recherche, under grant ANR-16-CE40-0002 (project BADASS).

References

[1] Rico Jonschkowski and Oliver Brock. State representation learning in robotics: Using prior knowledge about physical interaction. In Robotics: Science and Systems, 2014.

[2] Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. Neural Networks, 2018.

[3] Tim de Bruin, Jens Kober, Karl Tuyls, and Robert Babuska. Integrating state representation learning into deep reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1394–1401, 2018.

[4] Odalric-Ambrym Maillard, Rémi Munos, and Daniil Ryabko. Selecting the state-representation in reinforcement learning. In Advances in Neural Information Processing Systems 24, NIPS 2011, pages 2627–2635, 2012.

[5] Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, and Daniil Ryabko. Optimal regret bounds for selecting the state representation in reinforcement learning.
In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 543–551, 2013.

[6] Phuong Nguyen, Odalric-Ambrym Maillard, Daniil Ryabko, and Ronald Ortner. Competing with an infinite set of models in reinforcement learning. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, AISTATS 2013, volume 31 of JMLR Workshop and Conference Proceedings, pages 463–471, 2013.

[7] Marcus Hutter. Feature reinforcement learning: Part I. Unstructured MDPs. Journal of Artificial General Intelligence, 1(1):3–24, 2009.

[8] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.

[9] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[10] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI 2009, pages 25–42, 2009.

[11] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes. In Advances in Neural Information Processing Systems 31, NeurIPS 2018, pages 2998–3008, 2018.

[12] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.

[13] Ronan Fruit, Alessandro Lazaric, and Matteo Pirotta. Regret minimization in infinite-horizon finite Markov decision processes. Tutorial at ALT'19, 2019. URL https://rlgammazero.github.io/.

[14] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner.
Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of JMLR Workshop and Conference Proceedings, pages 1573–1581, 2018.

[15] Ronald Ortner, Odalric-Ambrym Maillard, and Daniil Ryabko. Selecting near-optimal approximate state representations in reinforcement learning. In Algorithmic Learning Theory - 25th International Conference, ALT 2014, pages 140–154, 2014.