A Tour of the Markov Decision Process

DP, POMDP, RL, and where you could learn more.

Devin Z
Dec 26, 2020
Unified View of Reinforcement Learning (image source: [1])

Computer programmers are obsessed with determinism. We like to think of the world as a state machine whose next state depends solely on its current state and the action we take. Such a world is enjoyably predictable, and everything can be perfectly planned to our advantage. Yet life is not that easy. We are surrounded by a huge amount of uncertainty, from fluctuations in stock prices to the outbreak of a global pandemic.

Such reality calls for the Markov decision process (MDP), a simple but powerful model to help us describe sequential decision making in a stochastic environment. Formally, an MDP consists of a state space, an action space, state transition probabilities p(s’|s, a), and expected immediate rewards r(s, a). State transitions of an MDP are nondeterministic, but the distribution of the next state is independent of the past given the current state and action. Taking an action from a state will result in a random immediate reward with a certain expectation, and whether an action is promising depends on the expected return in the long run.

Expected Return in an Infinite Horizon
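To pin down notation: with a discount factor γ ∈ [0, 1), the infinite-horizon discounted return from time t and the corresponding state value under a policy π are

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right].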

This post will offer you a whirlwind tour of three aspects of the Markov decision process: dynamic programming, partial observability and reinforcement learning. I will mostly focus on introducing concepts, ideas, intuitions and the connections among them, leaving out most of the mathematics and technical details. Readers seeking more rigorous treatments should find the resources listed in the reference section helpful.

Dynamic Programming

Model-based planning is the most fundamental problem concerning an MDP. Essentially, it is about finding optimal policies given knowledge of the model (i.e. state transition probabilities, immediate rewards and the long-term objective). A policy can be deterministic, mapping a state to a specific action, or stochastic, mapping a state to a distribution over actions. Given a policy, we can determine a state-value function V(s), i.e. the expected long-term return starting from a state, and an action-value function Q(s, a), i.e. the expected long-term return after taking an action in a state. A policy is optimal if at every state it achieves the maximum value among all possible policies, and for any MDP there exists an optimal deterministic policy.

Such a planning problem is an instance of dynamic programming (DP), as it satisfies the Bellman optimality principle: an optimal plan from the current state consists of a best immediate action followed by an optimal plan from the resulting next state. Formally, an optimal policy and its value functions should satisfy the following Bellman optimality equations:

Bellman Optimality Equations
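In the notation above (with discount factor γ), these equations take the standard form

V^*(s) = \max_a Q^*(s, a), \qquad Q^*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s').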

The Bellman optimality equations are more “declarative” than “imperative”. They give us a sense of what conditions an optimal policy must meet, but don’t tell us how to find one. As with many other computational problems, two general strategies can help us attack this: (1) recursively decompose the original problem into subproblems that are more tractable; (2) generate a naive solution and iteratively refine it in a principled way. In MDP planning, the first strategy leads to value iteration, and the second leads to policy iteration.

Value iteration solves a T-stage planning problem by first recursively solving a (T-1)-stage planning problem and then doing a single-step backup. Each backup updates the value function by applying a non-linear Bellman optimality operator. As the horizon T increases, the optimal value functions of these recursively-structured problems converge to the optimal value function of the infinite-horizon problem, because the Bellman optimality operator is a maximum-norm contraction². Concretely, this can be implemented by means of tabulation: we iteratively update a table of state values as we solve the finite-horizon problems in order. This generally requires two tables’ worth of space, and the time taken by each iteration depends on the total number of possible state-action-state transitions. To improve time and space efficiency, one may blur the stage boundaries and asynchronously back up individual states in-place (in some descending priority), as long as every state continues to be selected for backup.

Value Iteration Backup
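To make this concrete, here is a minimal tabular value-iteration sketch. The (S, A, S) array representation of the model and the stopping tolerance are my own choices for illustration, not the only way to set it up.

```python
import numpy as np

def value_iteration(p, r, gamma=0.95, tol=1e-6):
    """Tabular value iteration.

    p: transition probabilities, shape (S, A, S), p[s, a, s2] = P(s2 | s, a)
    r: expected immediate rewards, shape (S, A)
    Returns optimal state values and a greedy deterministic policy.
    """
    v = np.zeros(r.shape[0])                 # naive initial value table
    while True:
        q = r + gamma * p @ v                # Bellman optimality backup, shape (S, A)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:  # max-norm contraction guarantees convergence
            return v_new, q.argmax(axis=1)
        v = v_new
```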

Many problems in combinatorial optimization can be viewed as MDP planning with deterministic and acyclic state transitions. In such cases, we can evaluate all the states in one backward pass (i.e. in any reverse topological order). In real-time planning, where we only want to optimize a trajectory starting from a specific state, an alternative approach is memoization. Basically, we do a forward search from the starting state and explore all the descendants of a state before evaluating the state itself. Values are cached in a hash table to avoid repeated computation, and states are “lazily” evaluated on demand. By doing so, we can potentially avoid visiting many states that are irrelevant to the question of interest. Moreover, when states have a large branching factor (e.g. a game like Go), we may even want to trade optimality for speed by sampling actions and next states according to the dynamics and some rollout policy (i.e. sample backups) instead of enumerating all possibilities (i.e. full-width backups).
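Here is a sketch of that lazy, memoized evaluation for the deterministic, acyclic case. The successors and is_terminal callbacks are hypothetical problem-specific hooks I introduce just for illustration.

```python
def best_value(state, successors, is_terminal, gamma=1.0, cache=None):
    """Forward search with memoization over deterministic, acyclic transitions.

    successors(state) yields (action, reward, next_state) triples.
    Each reachable state is evaluated exactly once and cached on demand.
    """
    if cache is None:
        cache = {}
    if state in cache:
        return cache[state]
    if is_terminal(state):
        value = 0.0
    else:
        value = max(reward + gamma * best_value(nxt, successors, is_terminal, gamma, cache)
                    for _, reward, nxt in successors(state))
    cache[state] = value
    return value
```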

Unlike value iteration, which exploits the nicely recursive structure of the problem, policy iteration is more like a local search: it walks through the solution space of the same problem and heuristically updates its current choice. Nevertheless, policy iteration is more than a heuristic, for it is guaranteed to monotonically improve the value function. It is sort of like an “EM loop” alternating between policy evaluation and policy improvement. Policy evaluation determines the value function of the current policy, either by solving a set of linear equations (the Bellman expectation equations) or by repeatedly applying a linear operator (the Bellman expectation operator) to the value function until convergence. Policy improvement updates the policy by greedily selecting actions w.r.t. the current value function. These two steps might be reminiscent of the EM algorithm for maximizing incomplete-data likelihood, which iteratively infers the latent variable distribution based on the latest parameter estimates and then maximizes the ELBO over the parameters to obtain new estimates. But unlike the EM algorithm, policy iteration is guaranteed to converge to the global optimum. Compared with value iteration, policy iteration usually takes fewer iterations to converge, at the expense of more intensive computation in each policy evaluation step. In a sense, we can view value iteration as a variant of policy iteration that exits each policy evaluation step early, after applying the Bellman expectation operator exactly once.

Policy Evaluation and Policy Improvement
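A matching policy-iteration sketch, reusing the same hypothetical (S, A, S) model arrays; here policy evaluation solves the linear Bellman expectation equations exactly.

```python
import numpy as np

def policy_iteration(p, r, gamma=0.95):
    """Tabular policy iteration: exact evaluation alternating with greedy improvement."""
    num_states = r.shape[0]
    policy = np.zeros(num_states, dtype=int)       # naive initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        p_pi = p[np.arange(num_states), policy]    # shape (S, S)
        r_pi = r[np.arange(num_states), policy]    # shape (S,)
        v = np.linalg.solve(np.eye(num_states) - gamma * p_pi, r_pi)
        # Policy improvement: act greedily w.r.t. the current value function.
        new_policy = (r + gamma * p @ v).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```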

So far, we have assumed that the state space and the action space are finite. In general, for problems with infinite or even continuous states and actions, exact dynamic programming is computationally infeasible. One rare exception is the linear quadratic regulator (LQR), which has linear dynamics (with Gaussian noise) and negative quadratic rewards. During the backward value iteration, the optimal action is always linear in the state, and the optimal value function remains quadratic. An LQR thus has an exact analytical solution despite the continuous state and action spaces.
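Concretely, for dynamics x_{t+1} = A x_t + B u_t (plus zero-mean noise) and reward -(xᵀQx + uᵀRu), the value function has the form V_t(x) = -xᵀP_t x + const, and the backward value-iteration step is the discrete-time Riccati recursion (in one common sign convention):

P_t = Q + A^\top P_{t+1} A - A^\top P_{t+1} B \left(R + B^\top P_{t+1} B\right)^{-1} B^\top P_{t+1} A,
\qquad u_t^* = -\left(R + B^\top P_{t+1} B\right)^{-1} B^\top P_{t+1} A \, x_t.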

For continuous problems in general, we often resort to approximations. For example, we could discretize the state and action spaces, solve the resulting finite-world problem, and then interpolate the value of any state from its nearest discretized neighbors. Another technique is function approximation, which parameterizes the action-value function and iteratively updates the parameters to minimize a residual over certain samples. Unfortunately, such algorithms are generally not guaranteed to converge. Further, if we consider continuous-time problems, in which state transitions are described in terms of time derivatives of states, we step into the broader world of optimal control. [3] is a good introduction to this subject.

As mentioned above, in real-time planning we usually care about the immediate actions we should take given our current state (or our belief about the current state), rather than a policy governing all possible states. This is where trajectory optimization⁴ comes in. It transcribes an optimal control problem into a constrained optimization problem (ideally convex) over either the action sequence (a.k.a. shooting methods) or the state-action sequence (a.k.a. collocation methods). An example of a shooting method is the iterative LQR, which repeatedly refines an action sequence by locally approximating the model at each time step and applying LQR backups. To compensate for real-world noise and model error, trajectory optimization methods can be combined with the receding horizon strategy, which continuously replans over a look-ahead horizon but executes only the first command of each plan.
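The receding-horizon loop itself is only a few lines; in this sketch, observe, plan_trajectory and step are hypothetical placeholders for a sensor read-out, a trajectory optimizer (such as iterative LQR) and an actuator command.

```python
def receding_horizon_control(observe, plan_trajectory, step, horizon=20, num_steps=100):
    """Replan over a fixed look-ahead horizon at every step; execute only the first action."""
    for _ in range(num_steps):
        state = observe()                          # current state estimate
        actions = plan_trajectory(state, horizon)  # optimize a length-`horizon` action sequence
        step(actions[0])                           # execute only the first command, then replan
```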

Partial Observability

In a Markov decision process, the only uncertainty comes from the dynamics of the environment, and an agent has full perception of the current state. As an extension of the MDP, the partially observable Markov decision process (POMDP) takes into account both an environment’s inherent uncertainty and an agent’s limited observation. In a POMDP, the observation an agent obtains is not exactly a Markovian state, but a noisy projection of an underlying state. The best decisions are made by taking into account not only the latest observation, but the entire history of interactions.

Without exact knowledge of the current state, an agent needs to track a belief distribution over all possible states, b(s), and take actions based on it. Such belief tracking is often done by Bayes filtering⁵, which alternates between predicting the belief under a control and correcting it with a measurement. A classic example is the Kalman filter, which assumes a linear Gaussian system and maintains a Gaussian belief distribution. The particle filter, as another example, represents the belief distribution by samples and updates them by simulation.

Bayes Filtering
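For a finite state space, one predict-correct cycle of a discrete Bayes filter looks like this; the p and z arrays (transition probabilities and observation likelihoods) are assumed given.

```python
import numpy as np

def bayes_filter_step(belief, action, observation, p, z):
    """One predict-correct cycle of a discrete Bayes filter.

    belief: current belief b(s), shape (S,)
    p: transition probabilities p[s, a, s2], shape (S, A, S)
    z: observation likelihoods z[s2, o] = P(o | s2), shape (S, O)
    """
    predicted = belief @ p[:, action, :]      # prediction: push the belief through the dynamics
    updated = predicted * z[:, observation]   # correction: weight by the observation likelihood
    return updated / updated.sum()            # renormalize to a proper distribution
```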

The POMDP planning problem is to find an optimal action for each belief distribution. In principle, this can be reduced to an MDP planning problem, for each belief distribution is equivalent to a state in a derived information space⁶. But computationally, exact dynamic programming is only feasible for a finite-world POMDP, in which the optimal value function is the upper envelope of a finite set of linear functions (you can think of a belief distribution over n discrete states as a point in an (n-1)-dimensional simplex). Even in a finite-world problem, the exact method soon becomes intractable as the state space scales up.

POMDP Planning
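Writing Γ for the finite set of α-vectors (each corresponding to a conditional plan), that piecewise-linear, convex value function over beliefs is

V^*(b) = \max_{\alpha \in \Gamma} \sum_{s} b(s)\, \alpha(s).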

Once again, we resort to approximate dynamic programming. The QMDP algorithm, for example, backs up belief-state values by bootstrapping from the underlying MDP state values. Basically, it assumes full observability will be obtained after the next action, and approximates the value of a belief state by the belief-weighted sum of the underlying MDP values (an overestimate, in fact). The AMDP algorithm, as another approximation technique, compresses a belief state into a low-dimensional representation and uses sampling to estimate the dynamics and rewards of an approximate MDP. The book [5] covers approximate POMDP planning and Bayes filtering in much more depth.
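A QMDP sketch, assuming the underlying MDP action values q_mdp (shape (S, A)) have already been computed, e.g. with the value iteration sketch above:

```python
import numpy as np

def qmdp_action(belief, q_mdp):
    """QMDP: score each action by the belief-weighted action values of the underlying MDP.

    belief: b(s), shape (S,); q_mdp: Q(s, a) of the fully observable MDP, shape (S, A).
    Ignoring future observations makes these values an overestimate of the true POMDP values.
    """
    q_belief = belief @ q_mdp          # shape (A,)
    return int(np.argmax(q_belief))    # act greedily w.r.t. the belief-weighted values
```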

Reinforcement Learning

Up to now, we have assumed that we know the model (i.e. dynamics and rewards) of an MDP (or POMDP) and discussed how to optimize our policies by virtue of such knowledge. Now let’s reconsider the fully observable MDP planning problem, but this time we know neither the state transition probabilities nor the expected immediate rewards. The only thing given to us is either historical experience (i.e. state-action-reward sequences) or an interactive environment in which we can generate such experience. This brings us to the realm of reinforcement learning (RL). If you are looking for an introduction to this topic, I highly recommend David Silver’s slides¹.

Reinforcement learning can be either model-free or model-based. In model-free RL, we directly learn a value function or a policy from experience. In model-based RL, we first estimate the model of the MDP, and then use the estimated model to do model-based planning or to generate simulated experience for model-free RL. Estimating the model is more than ordinary supervised learning, in that samples drawn from experience are far from i.i.d., so one might need to iterate between training and data aggregation to combat distribution shift. Model-based RL is usually more sample-efficient than model-free RL, but is hampered by the error introduced by model estimation. An effective way to combine them is to bootstrap a model-free learner using a policy learned by a model-based learner.

One major task in model-free RL is model-free prediction, which is to evaluate a stationary policy given the experience generated by it. TD(λ) is a classic model-free prediction algorithm, and is also an excellent example of the bias-variance trade-off, a recurring theme in statistical learning. A model has high bias if it is too simple (or rigid) to fit even the training data well, and high variance if it is too complex (or flexible) to generalize to data outside the training set. Monte Carlo sampling on complete episodes gives an unbiased estimate of the expected return from a state, but is prone to high variance (which also afflicts the particle filter). Thus we introduce the n-step return to balance it against bootstrapping (a.k.a. shallow backups), which is biased but lowers the variance. We geometrically blend different n-step returns into a tunable λ-return, such that a larger λ leads to lower bias and higher variance. During training, we use an eligibility trace to balance the frequency and the recency of different samples, and update the value function by weighted temporal differences.

n-Step Return and λ-Return
TD(λ) Algorithm
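A tabular TD(λ) sketch with accumulating eligibility traces; here each episode is assumed to be a list of (state, reward, next_state, done) transitions generated by the policy being evaluated.

```python
import numpy as np

def td_lambda(episodes, num_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) policy evaluation with accumulating eligibility traces."""
    v = np.zeros(num_states)
    for episode in episodes:                 # episode: [(s, reward, s_next, done), ...]
        trace = np.zeros(num_states)         # eligibility trace, reset each episode
        for s, reward, s_next, done in episode:
            td_error = reward + (0.0 if done else gamma * v[s_next]) - v[s]
            trace *= gamma * lam             # decay the recency of previously visited states
            trace[s] += 1.0                  # accumulate frequency/recency of the current state
            v += alpha * td_error * trace    # weighted temporal-difference update
    return v
```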

Another major task in model-free RL is to find an optimal policy, also known as model-free control. A model-free control algorithm is on-policy if it can only improve a policy by learning from the experience generated by that policy (perhaps with Boltzmann exploration, etc.); otherwise it’s off-policy. Off-policy learning tends to have better sample efficiency, and can also break up sample correlations through experience replay. A typical example of on-policy learning is SARSA(λ), which is quite similar to TD(λ) except that it learns action values instead. Q-learning, on the other hand, is an example of off-policy learning. Both algorithms may suffer from maximization bias, which is introduced when the maximum of the estimates is used to estimate the maximum. The double learning technique can be employed to mitigate this. It maintains two value functions, and each time randomly selects one to determine the best action and updates its values using estimates from the other. Moreover, an approximate maximizer can be trained to select actions from a high-dimensional continuous space (e.g. DDPG).

SARSA(λ) Algorithm
Q-learning Algorithm
Online Q-learning with Function Approximation
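And a tabular Q-learning sketch with ε-greedy exploration; the env.reset()/env.step() interface below is a hypothetical, Gym-like convention assumed just for this example.

```python
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy tabular Q-learning with an epsilon-greedy behavior policy."""
    q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False             # hypothetical env: reset() -> state
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current estimates.
            a = rng.integers(num_actions) if rng.random() < epsilon else int(q[s].argmax())
            s_next, reward, done = env.step(a)   # hypothetical env: step(a) -> (s', r, done)
            # Target policy: greedy, hence the max over next-state action values.
            target = reward + (0.0 if done else gamma * q[s_next].max())
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q
```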

Both SARSA(λ) and Q-learning learn a value function, which implies a deterministic policy as long as there are no ties among actions. The problem with such a policy is that it can go sour if the environment suddenly changes adversarially. For example, over time your opponent may figure out your strategy in a board game and consistently beat you until your value function catches up again. The more varied your action pattern is, the less likely your opponent is to learn and exploit it. Policy gradient methods, which directly model and learn stochastic policies, can help with that. Such methods are similar to the policy iteration algorithm, except that each step performs one-step gradient ascent instead of seeking a global maximum. An example is the actor-critic algorithm, which consists of a critic, who iteratively estimates a parametric value function, and an actor, who iteratively does gradient ascent w.r.t. the policy parameters. The policy gradient contains an advantage function, which can be estimated by the temporal difference based on the value estimates. Detailed derivations of the policy gradient formula can be found in Sergey Levine’s lecture slides⁷.

Policy Gradient Formula
Actor Critic Algorithm
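For reference, with an advantage function the policy gradient takes the standard form

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a) \right],
\qquad A^{\pi_\theta}(s, a) \approx r + \gamma V_w(s') - V_w(s),

where V_w is the critic’s parametric value estimate.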

In general, policy gradient methods are on-policy, because the policy gradient depends on the joint distribution of states and actions under the current policy. Importance sampling can adapt them to off-policy learning so as to improve sample efficiency, but the joint distribution ratio is not easy to obtain. Fortunately, when the target policy is sufficiently close to the behavior policy (e.g. the KL-divergence is bounded), monotonic improvement of the objective is guaranteed even if the state distribution mismatch is ignored and the joint distribution ratio is approximated by the state-conditioned action probability ratio. This entails constraining the policy change itself (rather than a parameter-space step size) at each update, and it has been a key idea behind advanced RL algorithms such as TRPO and PPO.
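As a concrete instance, PPO restrains the state-conditioned action probability ratio r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) through a clipped surrogate objective:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\,\hat{A}_t \big) \right].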

The Anatomy of an RL Algorithm (image source: [7])

One limitation of RL is that it relies on reward functions that are easy to define and dense in time. Imitation learning, by contrast, learns policies from expert demonstrations without observing rewards. This can be done either via behavioral cloning, which uses supervised learning to predict expert actions from states, or via inverse reinforcement learning, which first infers a reward function that justifies the expert’s behavior (with or without knowing the dynamics) and then finds an optimal policy by RL. The first approach is simpler, but due to the distribution mismatch between the expert policy and the learned policy, an interactive demonstrator may be needed (see data aggregation and policy aggregation⁸). For the second, it is not obvious how to pick a reward function when the expert’s policy is optimal w.r.t. many of them; common choices include maximizing margins (analogous to SVM) and maximizing entropy (analogous to logistic regression).
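For instance, behavioral cloning reduces to maximum-likelihood supervised learning on expert state-action pairs:

\hat{\theta}_{\mathrm{BC}} = \arg\max_\theta \; \mathbb{E}_{(s, a) \sim \mathcal{D}_{\mathrm{expert}}}\big[ \log \pi_\theta(a \mid s) \big].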
