Control of Stochastic Systems
Serdar Yüksel
Queen’s University, Mathematics and Engineering and Mathematics and Statistics
February 3, 2015
This document is a set of supplemental lecture notes that has been used for Math 472/872: Control of Stochastic Systems,
at Queen’s University, since 2009. It has also been used at Bilkent University in Winter 2014. For non-commercial purposes
only, the notes can be freely used and downloaded. Please contact me if you find typos and errors/omissions.
Serdar Yüksel
Contents

1 Review of Probability
   1.1 Introduction
   1.2 Measurable Space
       1.2.1 Borel σ-field
       1.2.2 Measurable Function
       1.2.3 Measure
       1.2.4 The Extension Theorem
       1.2.5 Integration
       1.2.6 Fatou's Lemma, the Monotone Convergence Theorem and the Dominated Convergence Theorem
   1.3 Probability Space and Random Variables
       1.3.1 More on Random Variables and Probability Density Functions
       1.3.2 Independence and Conditional Probability
   1.4 Stochastic Processes and Markov Chains
       1.4.1 Markov Chains
   1.5 Exercises
2 Controlled Markov Chains
   2.1 Controlled Markov Models
       2.1.1 Fully Observed Markov Control Problem Model
       2.1.2 Classes of Control Policies
   2.2 Markov Chain Induced by a Markov Policy
   2.3 Performance of Policies
   2.4 Partially Observed Models and Reduction to a Fully Observed Model
   2.5 Exercises
3 Classification of Markov Chains
   3.1 Countable State Space Markov Chains
       3.1.1 Recurrence and Transience
       3.1.2 Stability and Invariant Distributions
       3.1.3 Invariant Measures via an Occupational Characterization
   3.2 Uncountable (Complete, Separable, Metric) State Spaces
       3.2.1 Invariant Distributions for Uncountable Spaces
       3.2.2 Existence of an Invariant Distribution
       3.2.3 Dobrushin's Ergodic Coefficient for Uncountable State Space Chains
       3.2.4 Ergodic Theorem for Positive Harris Recurrent Chains
   3.3 Further Conditions on the Existence of an Invariant Probability Measure
   3.4 Exercises
4 Martingales and Foster-Lyapunov Criteria for Stabilization of Markov Chains
   4.1 Martingales
       4.1.1 More on Expectations and Conditional Probability
       4.1.2 Some Properties of Conditional Expectation
       4.1.3 Discrete-Time Martingales
       4.1.4 Doob's Optional Sampling Theorem
       4.1.5 An Important Martingale Convergence Theorem
       4.1.6 Proof of the Birkhoff Individual Ergodic Theorem
       4.1.7 This section is optional: Further Martingale Theorems
       4.1.8 Azuma-Hoeffding Inequality for Martingales with Bounded Increments
   4.2 Stability of Markov Chains: Foster-Lyapunov Techniques
       4.2.1 Criterion for Positive Harris Recurrence
       4.2.2 Criterion for Finite Expectations
       4.2.3 Criterion for Recurrence
       4.2.4 On small and petite sets
       4.2.5 Criterion for Transience
       4.2.6 State Dependent Drift Criteria: Deterministic and Random-Time
   4.3 Convergence Rates to Equilibrium
       4.3.1 Lyapunov conditions: Geometric ergodicity
       4.3.2 Subgeometric ergodicity
   4.4 Conclusion
   4.5 Exercises
5 Dynamic Programming
   5.1 Bellman's Principle of Optimality
   5.2 Optimality of Deterministic Markov Policies
   5.3 Existence of Minimizing Selectors and Measurability
   5.4 Infinite Horizon Optimal Control Problems: Discounted Cost
   5.5 Linear Quadratic Gaussian Problem
   5.6 Exercises
6 Partially Observed Markov Decision Processes, Non-linear Filtering and the Kalman Filter
   6.1 Enlargement of the State-Space and the Construction of a Controlled Markov Chain
   6.2 Linear Quadratic Gaussian Case and Kalman Filtering
       6.2.1 Separation of Estimation and Control
   6.3 Estimation and Kalman Filtering
       6.3.1 Optimal Control of Partially Observed LQG Systems
   6.4 Partially Observed Markov Decision Processes in General Spaces
   6.5 Exercises
7 The Average Cost Problem
   7.1 Average Cost and the Average Cost Optimality Equation (ACOE) or Inequality (ACOI)
       7.1.1 The Discounted Cost Approach to the Average Cost Problem
       7.1.2 Polish State and Action Spaces, ACOE and ACOI [Optional]
   7.2 Linear Programming Approach to Average Cost Markov Decision Problems
       7.2.1 General State/Action Spaces
       7.2.2 Extreme Points and the Optimality of Deterministic Policies
       7.2.3 Sample-Path Optimality
   7.3 Constrained Markov Decision Processes
   7.4 Exercises
8 Learning and Approximation Methods
   8.1 Q-Learning
   8.2 Proof of Convergence
   8.3 Exercises
9 Decentralized Stochastic Control
   9.1 Witsenhausen's Intrinsic Model
       9.1.1 Static and dynamic information structures
       9.1.2 Classical, quasiclassical and nonclassical information structures
       9.1.3 Solutions to Static Teams
   9.2 Dynamic Teams and Quasi-Classical Information Patterns
       9.2.1 Non-classical information structures, signaling and its effect on lack of convexity
       9.2.2 Expansion of Information Structures: A recipe for identifying sufficient information
   9.3 Common Knowledge as Information State and the Dynamic Programming Approach to Team Decision Problems
   9.4 k-Stage Periodic Belief Sharing Pattern
       9.4.1 k-stage periodic belief sharing pattern
   9.5 Exercises
A Metric Spaces
   A.0.1 Banach Spaces
   A.0.2 Hilbert Spaces
   A.0.3 Separability
B On the Convergence of Random Variables
   B.1 Limit Events and Continuity of Probability Measures
   B.2 Borel-Cantelli Lemma
   B.3 Convergence of Random Variables
       B.3.1 Convergence almost surely (with probability 1)
       B.3.2 Convergence in Probability
       B.3.3 Convergence in Mean-square
       B.3.4 Convergence in Distribution
References
1 Review of Probability
1.1 Introduction
Before discussing controlled Markov chains, we first discuss some preliminaries about probability theory.
Many events in the physical world are uncertain; that is, given the knowledge available up to a certain time, the entire process is not accurately predictable. Probability theory attempts to develop an understanding of such uncertainty in a consistent way, given a number of properties to be satisfied. The description suggested by probability theory may not be physically correct, but it is mathematically precise and consistent.
Examples of stochastic processes include: a) the temperature in a city at noon on each day of October 2014 (this process takes values in R^31); b) the sequence of outputs of a communication channel modeled by additive scalar Gaussian noise when the input is given by x = {x_1, ..., x_n} (the output process lives in R^n); c) infinitely many copies of a coin flip process (living in {H, T}^∞); d) the trajectory of a plane flying from point A to point B (taking values in C([t_0, t_f]), the space of continuous paths in R^3 with x_{t_0} = A, x_{t_f} = B); e) the exchange rate between the Canadian dollar and the American dollar over a given time index set T.
Some of these processes take values in countable spaces, some do not. When the state space in which a random variable
takes values is uncountable, further technical intricacies arise. In particular, one needs to construct probability values by
first defining values for certain events and extending such probabilities to a larger class of events in a consistent fashion (in
particular, one does not first associate probability values to single points as we do in countable state spaces). These issues
are best addressed with a precise characterization of probability and random variables.
Hence, probability theory can be used to model uncertainty in the real world in a consistent way, according to some properties that we expect such measures should admit. In the following, we will develop a rigorous definition of probability.
1.2 Measurable Space
Let X be a collection of points, and let F be a collection of subsets of X satisfying the following properties, so that F is a σ-field:
• X ∈ F,
• if A ∈ F, then X \ A ∈ F,
• if A_k ∈ F, k = 1, 2, 3, ..., then ∪_{k=1}^∞ A_k ∈ F (that is, the collection is closed under countably many unions).
By De Morgan's laws, the collection then has to be closed under countable intersections as well.
If the third item above holds only for finitely many unions (equivalently, finitely many intersections), then the collection of subsets is said to be a field.
With the above, (X, F) is termed a measurable space (that is, we can associate a measure with this space, as we will discuss shortly). For example, the full power set of any set is a σ-field.
A σ−field J is generated by a collection of sets A, if J is the smallest σ−field containing the sets in A, and in this case,
we write J = σ(A).
A σ−field is to be considered as an abstract collection of sets. It is in general not possible to formulate the elements of a
σ-field through countably many operations of unions, intersections and complements, starting from a generating set.
Subsets of σ-fields can be interpreted to represent information about an underlying process. We will discuss this interpretation further and this will be a recurring theme in our discussions in the context of stochastic control problems.
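For a finite space, the σ-field generated by a collection can be computed explicitly by closing the collection under complements and (finite) unions; since the space is finite, the resulting field is automatically a σ-field. A minimal sketch (the space and the generating sets below are illustrative choices, not taken from the notes):

```python
from itertools import combinations

def generated_sigma_field(space, generators):
    """Close a collection of subsets of a finite space under complements and
    unions. For a finite space the resulting field is automatically a
    sigma-field, so this returns sigma(generators)."""
    space = frozenset(space)
    sets = {frozenset(), space} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for A in current:                      # closure under complements
            if space - A not in sets:
                sets.add(space - A)
                changed = True
        for A, B in combinations(current, 2):  # closure under (finite) unions
            if A | B not in sets:
                sets.add(A | B)
                changed = True
    return sets

X = {1, 2, 3, 4}
A = [{1}, {1, 2}]                 # generating collection (illustrative)
F = generated_sigma_field(X, A)
print(len(F))                     # 8: the atoms are {1}, {2}, {3, 4}
```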
1.2.1 Borel σ−field
An important class of σ-fields is the Borel σ−field on a metric (or more generally topological) space. Such a σ−field is the
one which is generated by open sets. The term open naturally depends on the space being considered. For this course, we
will mainly consider the space of real numbers (and spaces which are complete, separable and metric spaces; see Appendix
A). Recall that in a metric space with metric d, a set U is open if for every x ∈ U , there exists some ǫ > 0 such that
{y : d(x, y) < ǫ} ⊂ U . We note also that the empty set is a special open set.
The Borel σ−field on R is the one generated by sets of the form (a, b), that is ones generated by open intervals.
We will denote the Borel σ−field on a space X as B(X).
It is important to note that not all subsets of R are Borel sets, that is, elements of the Borel σ−field.
We can also define a Borel σ−field on a product space. Let X be a complete, separable, metric space (with some metric d), such as R. Let X^{Z_+} denote the infinite product of X, so that x = (x_0, x_1, x_2, ...) with x_k ∈ X. If this space is endowed with the product metric, defined as
d(x, y) = Σ_{i=0}^∞ 2^{−i} d(x_i, y_i) / (1 + d(x_i, y_i)),
then a base of open sets is given by sets of the form ∏_{i∈Z_+} A_i, where each A_i is open and only finitely many of the A_i differ from X. We define cylinder sets in this product space as
B_{[A_m, m∈I]} = {x ∈ X^{Z_+} : x_m ∈ A_m, A_m ∈ B(X)},
where I ⊂ Z_+ with |I| < ∞, that is, the index set has finitely many elements. Thus, in the above, if x ∈ B_{[A_m, m∈I]}, then x_m ∈ A_m for m ∈ I and the remaining coordinates are arbitrary. The σ−field generated by such cylinder sets forms the Borel σ−field on the product space. Such a construction is very important for stochastic processes.
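As a small numerical illustration of the product metric above (with the base metric d(a, b) = |a − b| and illustrative sequences), note that when two sequences agree in their first N coordinates the tail of the series is bounded by 2^{−(N−1)}, so a truncated sum already determines the distance to high accuracy:

```python
def product_metric(x, y, terms=60, d=lambda a, b: abs(a - b)):
    """Truncated product metric: sum_i 2^{-i} d(x_i, y_i) / (1 + d(x_i, y_i)).
    The neglected tail is at most 2^{-(terms - 1)}."""
    return sum(2.0 ** (-i) * d(x(i), y(i)) / (1.0 + d(x(i), y(i)))
               for i in range(terms))

# two sequences in R^{Z_+} that agree on their first 10 coordinates
x = lambda i: 1.0 / (i + 1)
y = lambda i: 1.0 / (i + 1) if i < 10 else 0.0
print(product_metric(x, y))   # small: bounded above by 2^{-9}
```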
1.2.2 Measurable Function
If (X, B(X)) and (Y, B(Y)) are measurable spaces; we say a mapping from h : X → Y is measurable if
h−1 (B) = {x : h(x) ∈ B} ∈ B(X),
∀B ∈ B(Y)
To show that a function is measurable, it is sufficient to check the measurability of the inverse images of sets that generate the σ-algebra on the image space. Therefore, for Borel measurability, it suffices to check the measurability of the inverse images of open sets.
1.2.3 Measure
A positive measure µ on (X, B(X)) is a map from B(X) to [0, ∞] which is countably additive, that is, for A_k ∈ B(X) with A_k ∩ A_j = ∅ for k ≠ j:
µ(∪_{k=1}^∞ A_k) = Σ_{k=1}^∞ µ(A_k).
Definition 1.2.1 µ is a probability measure if it is positive and µ(X) = 1.
Definition 1.2.2 A measure µ is finite if µ(X) < ∞, and σ−finite if there exists a collection of subsets A_k such that X = ∪_{k=1}^∞ A_k with µ(A_k) < ∞ for all k.
On the real line R, the Lebesgue measure is defined on the Borel σ−field (in fact, on a somewhat larger σ-field obtained by adding all subsets of Borel sets of measure zero) such that for A = (a, b), µ(A) = b − a. The Borel σ-field is a strict subset of the collection of Lebesgue measurable sets; that is, there exist Lebesgue measurable sets which are not Borel sets. For a definition and the construction of Lebesgue measurable sets, see [14].
1.2.4 The Extension Theorem
Theorem 1.2.1 (Carathéodory's Extension Theorem) Let M be a field, and suppose that there exists a measure P on M satisfying: (i) P(∅) = 0; (ii) there exist countably many sets A_n ∈ M such that X = ∪_n A_n, with P(A_n) < ∞; (iii) for pairwise disjoint sets A_n, if ∪_n A_n ∈ M, then P(∪_n A_n) = Σ_n P(A_n). Then, there exists a unique measure P′ on the σ−field generated by M, σ(M), which is consistent with P on M.
The above is useful since, when one wishes to verify that two measures are equal, it suffices to check that they agree on a collection of sets which generates the σ−algebra, and not necessarily on the entire σ−field.
Theorem 1.2.2 (Kolmogorov's Extension Theorem) Let X be a complete, separable metric space and let µ_n be a probability measure on X^n, the n-fold product of X, for each n = 1, 2, ..., such that
µ_n(A_1 × A_2 × ··· × A_n) = µ_{n+1}(A_1 × A_2 × ··· × A_n × X),
for every n and every sequence of Borel sets A_k. Then, there exists a unique probability measure µ on (X^∞, B(X^∞)) which is consistent with each of the µ_n's.
Thus, if the σ−field on a product space is generated by the collection of finite dimensional cylinder sets, one can define a
measure in the product space which is consistent with the finite dimensional distributions.
Likewise, we can construct the Lebesgue measure on B(R) by defining it on finitely many unions and intersections of
intervals of the form (a, b), [a, b), (a, b] and [a, b], and the empty set, thus forming a field, and extending this to the Borel
σ−field. Thus, the relation µ((a, b)) = b − a for b > a is sufficient to define the Lebesgue measure.
1.2.5 Integration
Let h be a non-negative measurable function from (X, B(X)) to (R, B(R)). The Lebesgue integral of h with respect to a measure µ can be defined in three steps.
First, for A ∈ B(X), define 1_{x∈A} (also written 1_{(x∈A)} or 1_A(x)) as the indicator function of the event x ∈ A, that is, the function takes the value 1 if x ∈ A, and 0 otherwise. In this case, ∫_X 1_{x∈A} µ(dx) =: µ(A).
Next, consider simple functions: h_n is a simple function if there exist A_1, A_2, ..., A_n, all in B(X), and positive numbers b_1, b_2, ..., b_n such that h_n(x) = Σ_{k=1}^{n} b_k 1_{x∈A_k}. For such a function, the integral is Σ_{k=1}^{n} b_k µ(A_k).
Now, for any given measurable h ≥ 0, there exists a sequence of simple functions h_n such that h_n(x) → h(x) monotonically, that is, h_{n+1}(x) ≥ h_n(x) (for a construction, partition the range into [0, n) and [n, ∞), partition [0, n) into n2^n intervals of width 2^{−n}, and define h_n(x) to be the lower endpoint of the interval whose inverse image contains x; an analogous construction handles the negative part for general h). We define the Lebesgue integral as the limit:
∫_X h(x) µ(dx) := lim_{n→∞} ∫_X h_n(x) µ(dx) = lim_{n→∞} Σ_k b_k µ(A_k),
where b_k and A_k are the values and level sets of h_n.
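To make the construction concrete, the following sketch approximates ∫_{[0,1]} h dµ for Lebesgue measure µ (itself approximated here by a fine grid on the domain) using exactly the truncate-and-floor simple functions h_n described above; the particular h, with ∫_0^1 x^{−1/2} dx = 2, is only an illustration.

```python
import math

def h_n(h, x, n):
    """The simple function h_n: truncate h at n, then floor to the 2^{-n} grid."""
    return math.floor(min(h(x), n) * 2 ** n) / 2 ** n

def integral_01(h, n, grid=2 ** 14):
    """Integrate h_n over [0, 1] against (a fine discretization of) Lebesgue measure."""
    return sum(h_n(h, (k + 0.5) / grid, n) for k in range(grid)) / grid

h = lambda x: 1.0 / math.sqrt(x)   # unbounded but integrable on (0, 1]; integral is 2
for n in (1, 2, 4, 8, 16):
    print(n, integral_01(h, n))    # increases monotonically toward 2
```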
There are three important convergence theorems which we will not discuss in detail, the statements of which will be given
in class.
1.2.6 Fatou’s Lemma, the Monotone Convergence Theorem and the Dominated Convergence Theorem
Theorem 1.2.3 (Monotone Convergence Theorem) If µ is a σ−finite positive measure on (X, B(X)) and {f_n, n ∈ Z_+} is a sequence of measurable functions from X to R which converge to f pointwise and monotonically, that is, 0 ≤ f_n(x) ≤ f_{n+1}(x) for all n and lim_{n→∞} f_n(x) = f(x) for µ−almost every x, then
∫_X f(x) µ(dx) = lim_{n→∞} ∫_X f_n(x) µ(dx).
Theorem 1.2.4 (Fatou's Lemma) If µ is a σ−finite positive measure on (X, B(X)) and {f_n, n ∈ Z_+} is a sequence of non-negative measurable functions from X to R, then
∫_X lim inf_{n→∞} f_n(x) µ(dx) ≤ lim inf_{n→∞} ∫_X f_n(x) µ(dx).
Theorem 1.2.5 (Dominated Convergence Theorem) If (i) µ is a σ−finite positive measure on (X, B(X)); (ii) g(x) ≥ 0 is a Borel measurable function such that ∫_X g(x) µ(dx) < ∞; and (iii) {f_n, n ∈ Z_+} is a sequence of measurable functions from X to R which satisfy |f_n(x)| ≤ g(x) for µ−almost every x and lim_{n→∞} f_n(x) = f(x), then
∫_X f(x) µ(dx) = lim_{n→∞} ∫_X f_n(x) µ(dx).
Note that for the monotone convergence theorem, there is no restriction on boundedness; whereas for the dominated
convergence theorem, there is a boundedness condition. On the other hand, for the dominated convergence theorem, the
pointwise convergence does not have to be monotone.
1.3 Probability Space and Random Variables
Let (Ω, F ) be a measurable space. If P is a probability measure, then the triple (Ω, F , P ) forms a probability space. Here
Ω is a set called the sample space. F is called the event space, and this is a σ−field of subsets of Ω.
Let (E, E) be another measurable space and X : (Ω, F, P) → (E, E) be a measurable map. We call X an E−valued random variable. The image of P under X is a probability measure on (E, E), called the law of X.
The σ-field generated by the events {{w : X(w) ∈ A}, A ∈ E} is called the σ−field generated by X and is denoted by
σ(X).
Consider a coin flip process, with possible outcomes {H, T}, heads or tails. We have a good intuitive understanding of the setting when someone tells us that a coin flip takes the value H with probability 1/2. Based on the definition of a random variable, we then view a coin flip process as a deterministic function from some space (Ω, F, P) to the binary space of a head and a tail event. Here, P denotes the uncertainty measure (you may think of the initial condition of the coin when it is being flipped, the flow of air, the surface the coin lands on, etc.; we encode all these aspects and all the uncertainty in the universe within the abstract space (Ω, F, P)).
A useful fact about measurable functions (and thus random variables) is the following result.
Theorem 1.3.1 Let f_n be a sequence of measurable functions from (Ω, F) to a complete separable metric space (X, B(X)). Then, lim sup_{n→∞} f_n(x) and lim inf_{n→∞} f_n(x) are measurable. Thus, if f(x) = lim_{n→∞} f_n(x) exists, then f is measurable.
This theorem implies that to verify whether a real-valued mapping f is a Borel measurable function, it suffices to check whether f^{−1}((a, b)) ∈ F for all a < b, since the open intervals generate B(R) and one can construct a sequence of simple functions converging to any measurable f, as discussed earlier. It in fact suffices to check whether f^{−1}((−∞, a)) ∈ F for all a ∈ R.
1.3.1 More on Random Variables and Probability Density Functions
Consider a probability space (X, B(X), P ) and consider an R−valued random variable U measurable with respect to
(X, B(X)).
This random variable induces a probability measure µ on B(R) such that for any (a, b] ∈ B(R):
µ((a, b]) = P(U ∈ (a, b]) = P({x : U(x) ∈ (a, b]}).
When U is R−valued, the expectation of U is defined as
E[U] = ∫_R x µ(dx).
We define F(x) = µ((−∞, x]) as the cumulative distribution function of U. If the derivative of F exists, we call this derivative the probability density function of U and denote it by p(x) (the distribution function is not always differentiable, for example when the random variable takes values only on the integers). If a density function exists, we can also write
E[U] = ∫_R x p(x) dx.
If a probability density function p exists, the induced measure is said to be absolutely continuous with respect to the Lebesgue measure. In particular, the density function p is the Radon-Nikodym derivative of the law of U with respect to the Lebesgue measure λ, in the sense that for all Borel A: ∫_A p(x) λ(dx) = P(U ∈ A).
A probability density does not always exist. In particular, whenever there is a probability mass at a given point, a density does not exist; hence in R, if for some x, P(U = x) > 0, we say there is a probability mass at x, and a density function does not exist.
If X is countable, we can write P ({x = m}) = p(m), where p is called the probability mass function.
Some examples of commonly encountered random variables are as follows:
• Gaussian (N(µ, σ²)): p(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.
• Exponential: F(x) = 1 − e^{−λx}, p(x) = λ e^{−λx}, for x ≥ 0.
• Uniform on [a, b] (U([a, b])): F(x) = (x − a)/(b − a), p(x) = 1/(b − a), for x ∈ [a, b].
• Binomial (B(n, p)): p(k) = (n choose k) p^k (1 − p)^{n−k}, for k = 0, 1, ..., n.
If n = 1, a binomial random variable is also called a Bernoulli random variable.
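As a quick numerical sanity check on these formulas, the following sketch samples from an exponential distribution and compares the empirical mean with ∫ x p(x) dx = 1/λ and the empirical CDF with F(x) = 1 − e^{−λx}; the parameter value and sample size are arbitrary choices.

```python
import math, random

random.seed(0)
lam, n = 2.0, 200_000
samples = [random.expovariate(lam) for _ in range(n)]

# E[U] = integral of x * lam * exp(-lam x) over [0, infinity) = 1 / lam
print(sum(samples) / n, 1.0 / lam)

# empirical CDF at x versus F(x) = 1 - exp(-lam x)
x = 0.7
print(sum(s <= x for s in samples) / n, 1.0 - math.exp(-lam * x))
```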
1.3.2 Independence and Conditional Probability
Consider A, B ∈ B(X) with P(B) > 0. The quantity
P(A|B) = P(A ∩ B) / P(B)
is called the conditional probability of the event A given B. The measure P(·|B) defined on B(X) is itself a probability measure. If
P(A|B) = P(A),
then A and B are said to be independent events.
A countable collection of events {A_n} is independent if for any finite sub-collection A_{i_1}, A_{i_2}, ..., A_{i_m}, we have that
P(A_{i_1}, A_{i_2}, ..., A_{i_m}) = P(A_{i_1}) P(A_{i_2}) ··· P(A_{i_m}).
Here, we use the notation P(A, B) = P(A ∩ B). A sequence of events is said to be pairwise independent if for any pair (A_m, A_n): P(A_m, A_n) = P(A_m) P(A_n). Pairwise independence is weaker than independence.
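The gap between pairwise and full independence can be seen on a four-point probability space: for two fair coin flips, the events "the first flip is H", "the second flip is H", and "the two flips agree" are pairwise independent but not independent. A small enumeration (illustrative only):

```python
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))     # two fair, independent coin flips
P = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"                 # first flip is heads
B = lambda w: w[1] == "H"                 # second flip is heads
C = lambda w: w[0] == w[1]                # the two flips agree

for E, F in [(A, B), (A, C), (B, C)]:
    print(P(lambda w: E(w) and F(w)) == P(E) * P(F))        # True for each pair

# not mutually independent: P(A, B, C) = 1/4 while P(A) P(B) P(C) = 1/8
print(P(lambda w: A(w) and B(w) and C(w)), P(A) * P(B) * P(C))
```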
1.4 Stochastic Processes and Markov Chains
One can define a sequence of random variables as a single random variable living in the product set, that is we can
consider {x1 , x2 , . . . , xN } as an individual random variable X which is an (E × E × . . . E)−valued random variable in
the measurable space (E × E × . . . E, B(E × E × . . . E)), where now the fields are to be defined on the product set. In this
case, the probability measure induced by a random sequence is defined on the σ−field B(E × E × . . . E). We can extend
the above to a case when N = Z+ , that is, when the process is infinite dimensional.
Let X be a complete, separable, metric space and let B(X) denote the Borel sigma-field of subsets of X. Let Σ = X^T denote the sequence space of all one-sided or two-sided infinite sequences drawn from X, where the index set T may be countable or uncountable. Thus, if T = Z and x ∈ Σ, then x = {..., x_{−1}, x_0, x_1, ...} with x_i ∈ X. Let X_n : Σ → X denote the coordinate function such that X_n(x) = x_n. Let B(Σ) denote the smallest sigma-field containing all cylinder sets of the form {x : x_i ∈ B_i, m ≤ i ≤ n} where B_i ∈ B(X), for all integers m, n. We can define a probability measure on Σ through a characterization on these finite-dimensional cylinder sets, by (the extension) Theorem 1.2.2.
A similar characterization also applies for continuous-time stochastic processes, where T is uncountable. The extension requires more delicate arguments, since finite-dimensional characterizations are too weak to uniquely define a σ-field on a space of continuous-time paths which is consistent with such distributions. Such technicalities arise in the discussion of continuous-time Markov chains and controlled processes, typically requiring a construction on a separable product space. In this course, our focus will be on discrete-time processes; the analysis for continuous-time processes essentially follows from similar constructions, with further structure that one needs to impose on continuous-time processes.
1.4.1 Markov Chains
As noted in the previous section, a sequence of random variables can be viewed as a single random variable taking values in the product space, with the induced probability measure defined on the product σ-field. If the probability measure for the sequence is such that
P(x_{k+1} ∈ A_{k+1} | x_k, x_{k−1}, ..., x_0) = P(x_{k+1} ∈ A_{k+1} | x_k)   P-a.s.,
then {x_k} is said to be a Markov chain. Thus, for a Markov chain, the current state is sufficient to predict the future; past variables are not needed.
One way to construct a Markov chain is via the following: Let {xt , t ≥ 0} be a random sequence with state space
(X, B(X)), and defined on a probability space (Ω, F , P ), where B(X) denotes the Borel σ−field on X, Ω is the sample space, F a sigma field of subsets of Ω, and P a probability measure. For x ∈ X and D ∈ B(X), we let
P (x, D) := P (xt+1 ∈ D|xt = x) denote the transition probability from x to D, that is the probability of the event
{xt+1 ∈ D} given that xt = x. Thus, the Markov chain is completely determined by the transition probability and the
distribution of the initial state, P(x_0) = p_0. Hence, the probability of the event {x_{t+1} ∈ D} for any t can be computed recursively, starting at t = 0 with P(x_1 ∈ D) = Σ_x P(x_1 ∈ D | x_0 = x) P(x_0 = x), and iterating with a similar formula for t = 1, 2, ....
Hence, if the probability of an event given the history of the past and the present depends only on the present state, the chain is a Markov chain. As an example, consider the following linear system:
x_{t+1} = a x_t + w_t,
where {w_t} is a sequence of independent random variables (independent of x_0). The sequence {x_t} is then Markov.
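A quick simulation sketch of this linear model (with the illustrative choices a = 0.8 and standard Gaussian noise, neither of which is specified in the notes) is given below; the Markov property is reflected in the code by the fact that each update uses only the current state and a fresh noise sample.

```python
import random

random.seed(1)

def simulate_linear_chain(a=0.8, x0=0.0, horizon=20):
    """Simulate x_{t+1} = a x_t + w_t with i.i.d. standard Gaussian noise w_t."""
    x, path = x0, [x0]
    for _ in range(horizon):
        w = random.gauss(0.0, 1.0)   # fresh noise, independent of the past
        x = a * x + w                # the update uses only the current state
        path.append(x)
    return path

print(simulate_linear_chain())
```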
We will continue our discussion on Markov chains after discussing controlled Markov chains.
1.5 Exercises
Exercise 1.5.1 a) Let F_β be a σ−field of subsets of some space X for all β ∈ Γ, where Γ is a family of indices. Let
F = ∩_{β∈Γ} F_β.
Show that F is also a σ−field.
For a space X, on which a distance metric is defined, the Borel σ−field is generated by the collection of open sets. This
means that, the Borel σ−field is the smallest σ−field containing open sets, and as such it is the intersection of all σ-fields
containing open sets. On R, the Borel σ−field is the smallest σ−field containing open intervals.
b) Is the set of rational numbers an element of the Borel σ-field on R? Is the set of irrational numbers an element?
c) Let X be a countable set. On this set, let us define a metric as follows:
d(x, y) = 0 if x = y, and d(x, y) = 1 if x ≠ y.
Show that, the Borel σ−field on X is generated by the individual sets {x}, x ∈ X.
d) Consider the distance function d(x, y) defined as above, now on R. Is the σ−field generated by the open sets of this metric the same as the Borel σ−field on R?
Exercise 1.5.2 Investigate the following limits in view of the convergence theorems.
a) Check whether lim_{n→∞} ∫_0^1 x^n dx = ∫_0^1 lim_{n→∞} x^n dx.
b) Check whether lim_{n→∞} ∫_0^1 n x^n dx = ∫_0^1 lim_{n→∞} n x^n dx.
c) Define f_n(x) = n 1_{{0 ≤ x ≤ 1/n}}. Find lim_{n→∞} ∫ f_n(x) dx and ∫ lim_{n→∞} f_n(x) dx. Are these equal?
Exercise 1.5.3 a) Let X and Y be real-valued random variables defined on a given probability space. Show that X 2 and
X + Y are also random variables.
b) Let F be a σ−field of subsets over a set X and let A ∈ F . Prove that {A ∩ B, B ∈ F } is a σ−field over A (that is a
σ−field of subsets of A).
Exercise 1.5.4 Consider a random variable X which takes values in [0, 1] and is uniformly distributed. We know that for every set S = [a, b), 0 ≤ a ≤ b ≤ 1, the probability of the random variable X taking values in S is P(X ∈ [a, b)) = U([a, b]) = b − a. Consider now the following question: does every subset S ⊂ [0, 1] admit a probability in the form above? In the following, we provide a counterexample, known as the Vitali set.
We can show that a set picked as follows does not admit such a measure. Define an equivalence relation such that x ≡ y if x − y ∈ Q. This equivalence relation divides [0, 1] into disjoint equivalence classes, each containing countably many points. Let A be a subset which picks exactly one element from each equivalence class (here, we adopt what is known as the Axiom of Choice [13]). Since A contains an element from each equivalence class, each point of [0, 1] is contained in the union ∪_{q∈Q} (A ⊕ q), where ⊕ denotes addition modulo 1; otherwise there would be a point x not contained in any equivalence class, but x is equivalent to itself at least. Furthermore, since A contains only one point from each equivalence class, the sets A ⊕ q are disjoint: if A ⊕ q and A ⊕ q′ included a common point x, then x − q and x − q′ (modulo 1) would both be in A, a contradiction, since these two points lie in the same equivalence class and A contains at most one point from each class. One can show that the uniform distribution is shift-invariant, therefore P(A) = P(A ⊕ q). But [0, 1] = ∪_q (A ⊕ q), and a countable sum of identical non-negative numbers can only be 0 or ∞; the contradiction follows. We cannot associate a number with this set, and as a result this set is not a Lebesgue measurable set.
2 Controlled Markov Chains
In the following, we discuss controlled Markov models.
2.1 Controlled Markov Models
Consider the following model.
x_{t+1} = f(x_t, u_t, w_t),   (2.1)
where x_t is an X-valued state variable, u_t is a U-valued control action variable, w_t is a W-valued i.i.d. noise process, and f is a measurable function. We assume that X, U, W are subsets of complete, separable, metric (Polish) spaces. The model above in (2.1) contains (see [18]) the class of all stochastic processes which satisfy the following for all Borel sets B ∈ B(X), t ≥ 0, and all realizations x_{[0,t]}, u_{[0,t]}:
P (xt+1 ∈ B|x[0,t] = a[0,t] , u[0,t] = b[0,t] ) = T (xt+1 ∈ B|at , bt )
(2.2)
where T (·|x, u) is a stochastic kernel from X × U to X. A stochastic process which satisfies (2.2) is called a controlled
Markov chain.
2.1.1 Fully Observed Markov Control Problem Model
A Fully Observed Markov Control Problem is a five tuple
(X, U, {U (x), x ∈ X}, T, c),
where
• X is the state space, a subset of a Polish space.
• U is the action space, a subset of a Polish space.
• K = {(x, u) : u ∈ U(x), x ∈ X}, where U(x) ∈ B(U) for each x, is the set of feasible state-control pairs; different control actions may be available at different states.
• T is a state transition kernel, that is, T(A | x_t, u_t) = P(x_{t+1} ∈ A | x_t, u_t).
• c : K → R is the cost function.
Consider for now that the objective to be minimized is given by J(x_0, Π) := E^Π_{ν_0}[ Σ_{t=0}^{T−1} c(x_t, u_t) ], where ν_0 is the initial probability measure, that is, x_0 ∼ ν_0. The goal is to find a policy Π* (to be defined next) so that
J(x0 , Π ∗ ) ≤ J(x0 , Π), ∀Π,
such that Π is an admissible policy to be defined below. Such a Π ∗ is called an optimal policy. Here Π can also be called
a strategy, or a law.
2.1.2 Classes of Control Policies
Admissible Control Policies
Let H_0 := X and H_t = H_{t−1} × K for t = 1, 2, .... We let h_t denote an element of H_t, where h_t = {x_{[0,t]}, u_{[0,t−1]}}. A deterministic admissible control policy Π is a sequence of functions {γ_t} from H_t to U; in this case u_t = γ_t(h_t). A randomized control policy is a sequence Π = {Π_t, t ≥ 0} such that Π_t : H_t → P(U) (with P(U) being the space of probability measures on U) and
Π_t(u_t ∈ U(x_t) | h_t) = 1, ∀h_t ∈ H_t.
Markov Control Policies
A policy is randomized Markov if
PxΠ0 (ut ∈ C|ht ) = Πt (ut ∈ C|xt ),
C ∈ B(U),
for all t. Hence, the control action only depends on the state and the time, and not the past history. If the control strategy is
deterministic, that is if
Πt (ut = ft (xt )|xt ) = 1.
for some function ft , the control policy is said to be deterministic Markov.
Stationary Control Policies
A policy is randomized stationary if
PxΠ0 (ut ∈ C|ht ) = Π(ut ∈ C|xt ),
C ∈ B(U).
Hence, the control action only depends on the state, and not the past history or on time. If the control strategy is deterministic, that is if Π(ut = f (xt )|xt ) = 1 for some function f , the control policy is said to be deterministic stationary.
2.2 Markov Chain Induced by a Markov Policy
The following is an important result:
Theorem 2.2.1 Let the control policy be randomized Markov. Then, the controlled Markov chain becomes a Markov chain
in X, that is, the state process itself becomes a Markov chain:
P^Π_{x_0}(x_{t+1} ∈ B | x_t, x_{t−1}, ..., x_0) = Q^Π_t(x_{t+1} ∈ B | x_t),   B ∈ B(X), t ≥ 1, P-a.s.
If the control policy is a stationary policy, the induced Markov chain is time-homogeneous, that is, the transition kernel for the Markov chain does not depend on time.
Proof: Let us consider the case where U is countable. Let B ∈ B(X). It follows that
P^Π_{x_0}(x_{t+1} ∈ B | x_t, x_{t−1}, ..., x_0)
 = Σ_{u_t} P^Π_{x_0}(x_{t+1} ∈ B, u_t | x_t, x_{t−1}, ..., x_0)
 = Σ_{u_t} P^Π_{x_0}(x_{t+1} ∈ B | u_t, x_t, x_{t−1}, ..., x_0) P^Π_{x_0}(u_t | x_t, x_{t−1}, ..., x_0)
 = Σ_{u_t} Q^Π_{x_0}(x_{t+1} ∈ B | u_t, x_t) π_t(u_t | x_t)
 = Σ_{u_t} Q^Π_{x_0}(x_{t+1} ∈ B, u_t | x_t)
 = Q^Π_t(x_{t+1} ∈ B | x_t).    (2.3)
The essential point here is that the control depends only on x_t, and since x_{t+1} depends stochastically only on x_t and u_t, the desired result follows. In the above, we used the law of total probability and properties of conditioning. If π_t(u_t | x_t) = π(u_t | x_t), that is, π_t = π for all t so that the policy is stationary, the resulting chain satisfies
P^Π_{x_0}(x_{t+1} ∈ B | x_t, x_{t−1}, ..., x_0) = Q^Π(x_{t+1} ∈ B | x_t),
for some Q^Π. Thus, the transition kernel does not depend on time and the chain is time-homogeneous.
⋄
2.3 Performance of Policies
Consider a Markov control problem with an objective given as the minimization of
J(ν_0, Π) = E^Π_{ν_0}[ Σ_{t=0}^{T−1} c(x_t, u_t) ],
where ν_0 denotes the distribution of x_0. If x_0 = x, we then write
J(x_0, Π) = E^Π_{x_0}[ Σ_{t=0}^{T−1} c(x_t, u_t) ] = E^Π[ Σ_{t=0}^{T−1} c(x_t, u_t) | x_0 ].
Such a cost problem is known as a Finite Horizon Optimal Control problem.
We will also consider costs of the following form:
J(ν_0, Π) = E^Π_{ν_0}[ Σ_{t=0}^{T−1} β^t c(x_t, u_t) ],
for some β ∈ (0, 1). This is called a Discounted Optimal Control Problem.
Finally, we will study costs of the following form:
J(ν_0, Π) = lim sup_{T→∞} (1/T) E^Π_{ν_0}[ Σ_{t=0}^{T−1} c(x_t, u_t) ].
Such a problem is known as the Average Cost Optimal Control Problem.
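For a concrete illustration of the finite horizon criterion, the sketch below estimates J(x_0, Π) by Monte Carlo for a hypothetical two-state, two-action model under a deterministic stationary policy; the kernel, cost, policy, and horizon are arbitrary choices, not taken from the notes.

```python
import random

random.seed(2)

# hypothetical two-state, two-action model: T[(x, u)] = [P(0 | x, u), P(1 | x, u)]
T = {(0, 0): [0.9, 0.1], (0, 1): [0.6, 0.4],
     (1, 0): [0.2, 0.8], (1, 1): [0.7, 0.3]}
c = lambda x, u: x + 0.5 * u              # stage cost c(x, u)
policy = lambda x: 1 if x == 1 else 0     # a deterministic stationary policy

def finite_horizon_cost(x0, horizon=10, runs=50_000):
    """Monte Carlo estimate of J(x0, policy) = E[sum_{t < horizon} c(x_t, u_t)]."""
    total = 0.0
    for _ in range(runs):
        x = x0
        for _ in range(horizon):
            u = policy(x)
            total += c(x, u)
            x = random.choices([0, 1], weights=T[(x, u)])[0]
    return total / runs

print(finite_horizon_cost(x0=0))
```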
Let Π_A denote the class of admissible policies, Π_M the class of Markov policies, and Π_S the class of stationary policies. These policies can be either randomized or deterministic.
In a general setting, we have the following:
inf_{Π∈Π_A} J(ν_0, Π) ≤ inf_{Π∈Π_M} J(ν_0, Π) ≤ inf_{Π∈Π_S} J(ν_0, Π),
since the sets of policies are progressively shrinking:
Π_S ⊂ Π_M ⊂ Π_A.
We will show, however, that for the optimal control of a Markov chain, under mild conditions, Markov policies are always
optimal (that is there is no loss in optimality in restricting the policies to be Markov); that is, it is sufficient to consider
Markov policies. This is an important result in stochastic control. That is,
inf_{Π∈Π_A} J(ν_0, Π) = inf_{Π∈Π_M} J(ν_0, Π).
We will also show that, under somewhat more restrictive conditions, stationary policies are optimal (that is, there is no loss
in optimality in restricting the policies to be Stationary). This will typically require T → ∞.
inf_{Π∈Π_A} J(ν_0, Π) = inf_{Π∈Π_S} J(ν_0, Π).
Furthermore, we will show that, under some conditions,
inf_{Π∈Π_S} J(ν_0, Π)
is independent of the initial distribution (or initial condition) on x0 .
The last two results are computationally very important, as there are powerful computational algorithms that allow one to
develop such stationary policies.
In the following set of notes, we will first consider further properties of Markov chains, since under a Markov control policy,
the controlled state becomes a Markov chain. We will then get back to the controlled Markov chains and the development
of optimal control policies.
The classification of Markov chains in the next chapter will implicitly characterize the set of problems for which the class of stationary policies contains an optimal admissible policy.
2.4 Partially Observed Models and Reduction to a Fully Observed Model
Consider the following model.
xt+1 = f (xt , ut , wt ),
yt = g(xt , vt )
Here, as before, x_t is the state, u_t ∈ U is the control, and (w_t, v_t) ∈ W × V are second-order, zero-mean, i.i.d. noise processes with w_t independent of v_t. In addition to the previous fully observed model, y_t denotes an observation variable taking values in Y, a subset of R^n in the context of this review. The controller only has causal access to the observation process {y_t}. Under an admissible policy Π, the control u_t is measurable with respect to σ({y_s, s ≤ t}). We denote the observed history space as H_0 := Y, H_t = H_{t−1} × Y × U. Hence, the set of (wide-sense) causal control policies is such that
P(u(h_t) ∈ U | h_t) = 1, ∀h_t ∈ H_t.
One could transform a partially observable Markov Decision Problem to a Fully Observed Markov Decision Problem via
an enlargement of the state space. In particular, we obtain via the properties of total probability the following dynamical
recursion (here, we assume that the state space is countable; the extension to more general spaces will be considered in
Chapter 6):
π_t(A) := P(x_t ∈ A | y_{[0,t]}, u_{[0,t−1]})
 = [ Σ_{x_t ∈ A} Σ_{x_{t−1} ∈ X} π_{t−1}(x_{t−1}) P(u_{t−1} | y_{[0,t−1]}, u_{[0,t−2]}) P(y_t | x_t) P(x_t | x_{t−1}, u_{t−1}) ]
   / [ Σ_{x_t ∈ X} Σ_{x_{t−1} ∈ X} π_{t−1}(x_{t−1}) P(y_t | x_t) P(u_{t−1} | y_{[0,t−1]}, u_{[0,t−2]}) P(x_t | x_{t−1}, u_{t−1}) ]
 = [ Σ_{x_t ∈ A} Σ_{x_{t−1} ∈ X} π_{t−1}(x_{t−1}) P(y_t | x_t) P(x_t | x_{t−1}, u_{t−1}) ]
   / [ Σ_{x_t ∈ X} Σ_{x_{t−1} ∈ X} π_{t−1}(x_{t−1}) P(y_t | x_t) P(x_t | x_{t−1}, u_{t−1}) ]
 =: F(π_{t−1}, u_{t−1}, y_t)(A),    (2.4)
for some F . It follows that F : P(X) × U × Y → P(X) is a Borel measurable function, as we will discuss in further detail
in Chapter 6. Thus, the conditional measure process becomes a controlled Markov chain in P(X) (where P(X) denotes the
set of probability measures on X):
Theorem 2.4.1 The process {πt , ut } is a controlled Markov chain. That is, under any admissible control policy, given the
action at time t ≥ 0 and πt , πt+1 is conditionally independent from {πs , us , s ≤ t − 1}.
This follows from the observation that for any B ∈ B(P(X)),
P(π_{t+1} ∈ B | π_s, u_s, s ≤ t) = Σ_y 1_{{F(π_t, u_t, y) ∈ B}} Σ_{x_t} P(y | x_t) π_t(x_t).
Let the cost function to be minimized be
E^Π_{x_0}[ Σ_{t=0}^{T−1} c(x_t, u_t) ],
where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state given by x_0 under policy Π.
We transform the system into a fully observed Markov model as follows. Define the new cost as
c̃(π, u) = Σ_{x∈X} c(x, u) π(x),   π ∈ P(X).
The stochastic transition kernel q is given by
q(x, y | π, u) = Σ_{x′∈X} P(x, y | x′, u) π(x′),
and this kernel can be decomposed as q(x, y | π, u) = P(y | π, u) P(x | π, u, y).
The second term here is the filtering equation, mapping (π, u, y) ∈ P(X) × U × Y to P(X). It follows that (P(X), U, K, c̃) defines a completely observable controlled Markov process. Here, we have
K(B | π, u) = Σ_{y∈Y} 1_{{P(· | π, u, y) ∈ B}} P(y | π, u),   ∀B ∈ B(P(X)),
with 1_{·} denoting the indicator function.
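For a finite state space, the filtering map F above amounts to multiplying the one-step prediction by the observation likelihood and normalizing, the normalizer being exactly P(y | π, u). A minimal sketch with a hypothetical two-state model (all numbers illustrative, not from the notes):

```python
def belief_update(pi, u, y, trans, obs):
    """F(pi, u, y): one filter step for a finite state space.

    pi:    current belief over states, pi[x] = P(x_t = x | information so far)
    trans: trans[u][x][x1] = P(x_{t+1} = x1 | x_t = x, u_t = u)
    obs:   obs[x1][y]      = P(y_{t+1} = y | x_{t+1} = x1)
    """
    n = len(pi)
    unnormalized = [obs[x1][y] * sum(pi[x0] * trans[u][x0][x1] for x0 in range(n))
                    for x1 in range(n)]
    z = sum(unnormalized)             # equals P(y_{t+1} = y | pi, u)
    return [w / z for w in unnormalized]

# hypothetical two-state, two-action, two-observation model
trans = {0: [[0.9, 0.1], [0.3, 0.7]],
         1: [[0.5, 0.5], [0.1, 0.9]]}
obs = [[0.8, 0.2],    # P(y | x = 0)
       [0.25, 0.75]]  # P(y | x = 1)

print(belief_update([0.5, 0.5], u=0, y=1))
```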
2.5 Exercises
Exercise 2.5.1 Suppose that there are two decision makers DM^1 and DM^2. Suppose that the information available to DM^1 is a random variable y^1 and the information available to DM^2 is y^2, where these random variables are defined on a probability space (Ω, F, P). Suppose that y^i is Y^i-valued and these spaces admit Borel σ-fields B(Y^i).
Suppose that the sigma-field generated by y 1 is a subset of the sigma-field generated by y 2 , that is σ(y 1 ) ⊂ σ(y 2 ).
Further, suppose that the decision makers wish to minimize the following cost function:
E[c(ω, u)],
where c : Ω × U → R+ is a measurable cost function (on F × B(U)), where B(U) is a σ-field over U. Here, for i = 1, 2,
ui = γ i (y i ) is generated by a measurable function γ i on the sigma-field generated by the random variable y i . Let Γ i
denote the space of all such policies.
Prove that
inf_{γ^1∈Γ^1} E[c(ω, u^1)] ≥ inf_{γ^2∈Γ^2} E[c(ω, u^2)].
Hint: Argue that every policy u^1 = γ^1(y^1) can be expressed as u^2 = γ^2(y^2) for some γ^2 ∈ Γ^2. In particular, for any B ∈ B(U), (γ^1)^{−1}(B) ∈ B(Y^1). Furthermore, (y^1)^{−1}((γ^1)^{−1}(B)) ∈ σ(y^1) ⊂ σ(y^2). Following such reasoning, the result can be obtained.
Exercise 2.5.2 An investor’s wealth dynamics is given by the following:
xt+1 = ut wt ,
where {wt } is an i.i.d. R+ -valued stochastic process with E[wt ] = 1. The investor has access to the past and current
wealth information and his previous actions. The goal is to maximize:
J(x_0, Π) = E^Π_{x_0}[ Σ_{t=0}^{T−1} √(x_t − u_t) ].
The investor’s action set for any given x is: U(x) = [0, x].
Formulate the problem as an optimal stochastic control problem by clearly identifying the state, the control action spaces,
the information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.
Exercise 2.5.3 Consider an investment problem in a stock market. Suppose there are N possible goods with prices {p^i_t, i = 1, ..., N} at time t taking values in a countable set P. Further, suppose that the prices evolve stochastically according to a Markov transition such that for every 1 ≤ i ≤ N,
P(p^i_{t+1} = m | p^i_t = n) = P^i(n, m).
Suppose that there is a transaction cost of c for every buying and selling for each good. Suppose that the investor can only
hold on to at most M units of stock. At time t, the investor has access to his current and past price information and his
investment allocations up to time t.
When a transaction is made, the following occurs. If the investor sells one unit of good i, he pays c dollars for the transaction and adds p^i_t dollars to his account. If he buys one unit of good j, he takes p^j_t dollars out of his account and also pays c dollars. For every such transaction per unit, there is a cost of c dollars. If he does nothing, there is no cost. Assume that the investor can make at most one transaction per time stage.
Suppose the investor initially has an unlimited amount of money and a given stock allocation.
The goal is to maximize the total worth of the stock at the final time stage t = T ∈ Z+ minus the total transaction costs
plus the earned money in between.
Formulate the problem as a stochastic control problem by clearly identifying the state, the control actions, the information
available at the controller, the transition kernel and a cost functional mapping the actions and states to R.
Exercise 2.5.4 Consider a special case of the above problem, where the objective is to maximize the final worth of the investment and minimize the transaction costs. Furthermore, suppose there is only one stock, with price transition kernel given by
P(p_{t+1} = m | p_t = n) = P(n, m).
The investor can again hold at most M units of the good, but might wish to purchase more, or sell some.
Formulate the problem as a stochastic control problem by clearly identifying the state, the control actions, the transition
kernel and a cost functional mapping the actions and states to R.
Exercise 2.5.5 A fishery manager annually has x_t units of fish and sells u_t x_t of these, where u_t ∈ [0, 1]. With the remaining fish, the next year's production is given by the following model:
x_{t+1} = w_t x_t (1 − u_t) + w_t,
where x_0 is given and {w_t} is an independent, identically distributed sequence of random variables with w_t ≥ 0 for all t, and hence E[w_t] = w̃ ≥ 0.
The goal is to maximize the profit over the time horizon 0 ≤ t ≤ T − 1. At time T , he sells all of the fish.
Formulate the problem as an optimal stochastic control problem by clearly identifying the state, the control actions, the
information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.
Exercise 2.5.6 Consider an unemployed person who will work during years t = 1, 2, ..., 10 if she takes a job at any given t.
Suppose that in each year in which she remains unemployed, she may be offered a good job that pays 10 dollars per year (with probability 1/4); she may be offered a bad job that pays 4 dollars per year (with probability 1/4); or she may not be offered a job (with probability 1/2). These job offers are independent from year to year (that is, the job market is represented by an independent sequence of random variables for each year).
Once she accepts a job, she will remain in that job for the rest of the ten years. That is, for example, she cannot switch from
the bad job to the good job.
Suppose the goal is to maximize the expected total earnings over the ten years, starting from year 1 up to and including year 10.
State the problem as a Markov Decision Problem, identify the state space, the action space and the transition kernel.
3 Classification of Markov Chains
3.1 Countable State Space Markov Chains
In this chapter, we first review Markov chains where the state takes values in a finite set or a countable space.
We assume that ν_0 is the distribution of the initial state x_0. Thus, the process Φ = {x_0, x_1, ..., x_n, ...} is a (time-homogeneous) Markov chain with the probability measure on the sequence space satisfying:
Pν0 (x0 = a0 , x1 = a1 , x2 = a2 , . . . , xn = an )
= ν0 (x0 = a0 )P (x1 = a1 |x0 = a0 )P (x2 = a2 |x1 = a1 ) . . . P (xn = an |xn−1 = an−1 )
(3.1)
If the initial condition is known to be x_0, we use P_{x_0}(···) in place of P_{δ_{x_0}}(···). The initial distribution and the transition kernel uniquely identify the probability measure on the product space X^N, by the extension theorem. We can represent the probabilistic evolution in terms of a matrix:
P(i, j) = P(x_{t+1} = j | x_t = i) ≥ 0, ∀i, j ∈ X.
Here P(·, ·) is also a probability transition kernel, that is, for every i ∈ X, P(i, ·) is a probability measure on X; thus ∑_j P(i, j) = 1. By induction, we can verify that
P^k(i, j) := P(x_{t+k} = j | x_t = i) = ∑_{m∈X} P(i, m) P^{k−1}(m, j).
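As a quick numerical illustration of the matrix representation and the recursion above, the following sketch in Python (the three-state kernel is hypothetical, chosen only for illustration) computes k-step transition probabilities.

import numpy as np

# A hypothetical 3-state transition matrix; each row sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

k = 5
Pk = np.linalg.matrix_power(P, k)                # P^k(i, j) as a matrix power
Pk_rec = P @ np.linalg.matrix_power(P, k - 1)    # sum_m P(i, m) P^{k-1}(m, j)

print(np.allclose(Pk, Pk_rec))   # True: the recursion holds
print(Pk.sum(axis=1))            # each row of P^k still sums to 1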
In the following, we characterize Markov Chains based on transience, recurrence and communication. We then consider the
issue of the existence of an invariant distribution. Later, we will extend the analysis to uncountable space Markov chains.
Let us again consider a discrete-time, time-homogeneous Markov process living in a countable state space X.
Communication
If there exists an integer k such that P (xt+k = j|xt = i) = P k (i, j) > 0, and another integer l such that P (xt+l = i|xt =
j) = P l (j, i) > 0 then state i communicates with state j.
A set C ⊂ X is said to be communicating if every two elements (states) of C communicate with each other.
If every state of X communicates with every other state, the chain is said to be irreducible.
The period of a state i is defined to be the greatest common divisor of {k > 0 : P^k(i, i) > 0}.
A Markov chain is called aperiodic if the period of all states is 1.
In the following, throughout the notes, we will assume that the Markov process under consideration is aperiodic unless
mentioned otherwise.
Absorbing Set
A set C is called absorbing if P (i, C) = 1 for all i ∈ C. That is, if the state is in C, then the state cannot get out of the set
C.
A Markov chain on a finite state space X is irreducible if the smallest absorbing set is the entire space X itself.
A Markov chain on X is indecomposable if X does not contain two disjoint absorbing sets.
Occupation, Hitting and Stopping Times
For any set A ⊂ X, the occupation time η_A is the number of visits of the chain to the set A:
η_A = ∑_{t=1}^∞ 1_{{x_t ∈ A}},
where 1_E denotes the indicator function for an event E, that is, it takes the value 1 when E takes place, and is 0 otherwise.
Define
τ_A = min{k > 0 : x_k ∈ A},
the first time after time zero that the state visits the set A, known as the first return time to A. We also define the hitting time
σ_A = min{k ≥ 0 : x_k ∈ A}.
A Brief Discussion on Stopping Times
The variable τ_A defined above is an example of a stopping time:
Definition 3.1.1 A function τ from a measurable space (X ∞ , B(X ∞ )) to (N+ , B(N+ )) is a stopping time if for all n ∈
N+ , the event {τ = n} ∈ σ(x0 , x1 , x2 , . . . , xn ), that is the event is in the sigma-field generated by the random variables
up to time n.
Any realistic decision takes place at a time which is measurable. For example, if an investor wants to stop investing when he stops profiting, he can stop at the time when the investment first loses value; this is a stopping time. But if an investor claims to stop investing when the investment is at its peak, this is not a stopping time: to find out whether the investment is at its peak, the next state value would need to be known, and this is not measurable in a causal fashion.
One important property of Markov chains is the so-called Strong Markov Property. This says the following: Consider a
Markov chain which evolves on a countable set. If we sample this chain according to a stopping time rule, the sampled
Markov chain starts from the sampled instant as a Markov chain:
Proposition 3.1.1 For a (time-homogeneous) Markov chain on a countable state space X, the strong Markov property holds, that is, the sampled chain is itself a Markov chain: if τ is a stopping time, then, conditioned on the event that τ < ∞, for any m ∈ N+:
P(x_{τ+m} = a | x_τ = b_0, x_{τ−1} = b_1, . . .) = P(x_{τ+m} = a | x_τ = b_0) = P^m(b_0, a).
3.1.1 Recurrence and Transience
Let us define
U(x, A) := E_x[∑_{t=1}^∞ 1_{(x_t ∈ A)}] = ∑_{t=1}^∞ P^t(x, A),
and let us define
L(x, A) := Px (τA < ∞),
which is the probability of the chain visiting set A, once the process starts at state x.
These two are important characterizations for Markov chains.
Definition 3.1.2 A set A ⊂ X is recurrent if the Markov chain visits A infinitely often (in expectation), when the process
starts in A. This is equivalent to
E_x[η_A] = ∞, ∀x ∈ A.   (3.2)
That is, if the chain starts at a given state x ∈ A, it comes back to the set A, and does so infinitely often. If a state is not
recurrent, it is transient.
Definition 3.1.3 A set A ⊂ X is positive recurrent if the Markov chain visits A infinitely often, when the process starts in
A and in addition:
Ex [τA ] < ∞, ∀x ∈ A
Definition 3.1.4 A state α is transient if
U(α, α) = E_α[η_α] < ∞.   (3.3)
The above is equivalent to the condition that
∑_{i=1}^∞ P^i(α, α) < ∞,
which in turn is implied by P_α(τ_α < ∞) < 1.
While discussing recurrence, there is an equivalent, and often easier to check, condition: if E_i[τ_i] < ∞, then the state {i} is positive recurrent. If L(i, i) = P_i(τ_i < ∞) = 1 but E_i[τ_i] = ∞, then i is recurrent (also called null recurrent).
The reader should connect the above with the strong Markov property: once the process hits a state, it starts afresh from that state as if the past never happened; the process regenerates.
We have the following. The proof is presented later in the chapter, see Theorem 3.7.
Theorem 3.1. If Pi (τi < ∞) = 1, then Pi (ηi = ∞) = 1.
There is a more general notion of recurrence, named Harris recurrence, which is exactly the condition that P_i(η_i = ∞) = 1. We will investigate this further below while studying uncountable state space Markov chains; however, one should note that even for countable state space chains, Harris recurrence is stronger than recurrence (as defined in Definition 3.1.2).
For a finite state space Markov chain it is not difficult to verify that (3.3) is equivalent to L(i, i) < 1, since P_i(τ_i(k) < ∞) = P_i(τ_i(k − 1) < ∞) P_i(τ_i(1) < ∞) and E[η] = ∑_k P(η ≥ k).
If every state in X is recurrent, then the chain is recurrent; if all states are transient, then the chain is transient.
3.1.2 Stability and Invariant Distributions
Stability is an important concept, but it has different meanings in different contexts. This notion will be made more precise
in the following two chapters.
If a Markov chain starts at a given time, in the long-run, the chain may forget its initial condition, that is, the probability
distribution at a later time will be less and less dependent on the initial distribution. Given an initial state distribution, the
probability distribution on the state at time 1 is given by:
π1 = π0 P
and, for t ≥ 1,
π_{t+1} = π_t P = π_0 P^{t+1}.
One important property of Markov chains is whether the above iteration leads to a fixed point in the set of probability
measures. Such a fixed point π is called an invariant distribution. A distribution in a countable state Markov chain is
invariant if
π = πP.
This is equivalent to
π(j) = ∑_{i∈X} π(i) P(i, j), ∀j ∈ X.
We note that, if such a π exists and lim_{t→∞} π_0 P^t exists, then π can be written as π = π_0 lim_{t→∞} P^t for some π_0. Clearly, π_0 can be π itself, but often π_0 can be any initial distribution under irreducibility conditions which will be discussed further. Invariant
distributions are especially important in networking problems and stochastic control, due to the Ergodic Theorem (which
shows that temporal averages converge to statistical averages), which we will discuss later in the semester.
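As a simple numerical illustration of the fixed-point iteration π_{t+1} = π_t P (a minimal sketch in Python; the three-state kernel below is hypothetical), one can iterate the map from an arbitrary initial distribution and check that the limit satisfies π = πP.

import numpy as np

# Hypothetical irreducible, aperiodic transition matrix.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Iterate pi_{t+1} = pi_t P from an arbitrary initial distribution.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ P

print(pi)                         # approximate invariant distribution
print(np.allclose(pi, pi @ P))    # True: pi is (numerically) a fixed point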
3.1.3 Invariant Measures via an Occupational Characterization
Theorem 3.2. For a Markov chain, if there exists an element i such that E_i[τ_i] < ∞, the following is an invariant measure:
µ(j) = E[∑_{k=0}^{τ_i−1} 1_{{x_k=j}} | x_0 = i] / E_i[τ_i], ∀j ∈ X.
The invariant measure is unique if the chain is irreducible.
Proof:
We show that
E[∑_{k=0}^{τ_i−1} 1_{{x_k=j}} | x_0 = i] / E[τ_i] = ∑_s P(s, j) E[∑_{k=0}^{τ_i−1} 1_{{x_k=s}} | x_0 = i] / E[τ_i].
Note that E[1_{{x_{t+1}=j}}] = P(x_{t+1} = j). Hence,
∑_s P(s, j) E[∑_{k=0}^{τ_i−1} 1_{{x_k=s}} | x_0 = i] / E_i[τ_i]
= E[∑_{k=0}^{τ_i−1} ∑_s P(s, j) 1_{{x_k=s}} | x_0 = i] / E_i[τ_i]
= E[∑_{k=0}^{τ_i−1} ∑_s 1_{{x_k=s}} E[1_{{x_{k+1}=j}} | x_k = s, x_0 = i] | x_0 = i] / E_i[τ_i]
= E[∑_{k=0}^{τ_i−1} E[∑_s 1_{{x_k=s}} 1_{{x_{k+1}=j}} | x_k = s, x_0 = i] | x_0 = i] / E_i[τ_i]   (3.4)
= E[∑_{k=0}^{τ_i−1} ∑_s 1_{{x_k=s}} 1_{{x_{k+1}=j}} | x_0 = i] / E_i[τ_i]
= E_i[∑_{k=0}^{τ_i−1} 1_{{x_{k+1}=j}}] / E_i[τ_i]
= E_i[∑_{k=1}^{τ_i} 1_{{x_k=j}}] / E_i[τ_i]
= E_i[∑_{k=0}^{τ_i−1} 1_{{x_k=j}}] / E_i[τ_i] = µ(j),
where we use the fact that the number of visits to a given state j does not change whether we include or exclude time t = 0 or time τ_i, since any state j ≠ i is not visited at these times. Here, (3.4) follows from the law of iterated expectations, see Theorem 4.1.3. This concludes the proof.
⋄
Remark: If E_i[τ_i] < ∞, then the above measure becomes a probability measure, as it follows that
∑_j µ(j) = E[∑_{k=0}^{τ_i−1} ∑_j 1_{{x_k=j}} | x_0 = i] / E_i[τ_i] = 1.
⋄
For example, for the simple random walk E_i[τ_i] = ∞, hence there does not exist an invariant probability measure. The chain does, however, have an invariant measure: µ(i) = K for a constant K. The reader can verify this.
Theorem 3.3. For an irreducible countable state space Markov chain, positive recurrence implies the existence of an invariant probability measure with π(i) = 1 / E_i[τ_i]. Furthermore, the invariant probability measure is unique.
Theorem 3.4. Every finite state space Markov chain admits an invariant probability measure.
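The relation π(i) = 1/E_i[τ_i] in Theorem 3.3 can also be checked numerically by simulating return times; the following is a minimal sketch in Python for a hypothetical three-state chain (the kernel and the chosen state are only illustrative).

import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Invariant distribution: left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

# Estimate E_i[tau_i] for state i = 0 by averaging successive return times.
i, n_returns = 0, 20000
state, steps_since_visit, total, returns = i, 0, 0, 0
while returns < n_returns:
    state = rng.choice(3, p=P[state])
    steps_since_visit += 1
    if state == i:
        total += steps_since_visit
        steps_since_visit = 0
        returns += 1

print(pi[0], 1.0 / (total / n_returns))   # the two values should be close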
Invariant Distribution via Dobrushin’s Ergodic Coefficient
Consider the iteration πt+1 = πt P . We would like to know when this iteration converges to a limit. We first review some
notions from analysis.
Review of Vector Spaces
Definition 3.1.5 A linear space is a space which is closed under addition and multiplication by a scalar.
Definition 3.1.6 A normed linear space X is a linear vector space on which a functional (a mapping from X to R, that is a
member of RX ) called norm is defined such that:
• ||x|| ≥ 0 for all x ∈ X, and ||x|| = 0 if and only if x is the null element (under addition and multiplication by a scalar) of X.
• ||x + y|| ≤ ||x|| + ||y|| for all x, y ∈ X.
• ||αx|| = |α| ||x|| for all α ∈ R and x ∈ X.
Definition 3.1.7 In a normed linear space X, an infinite sequence of elements {xn } converges to an element x if the
sequence {||xn − x||} converges to zero.
Definition 3.1.8 A sequence {x_n} in a normed space X is Cauchy if for every ǫ > 0 there exists an N such that ||x_n − x_m|| ≤ ǫ for all n, m ≥ N.
The important observation on Cauchy sequences is that every convergent sequence is Cauchy; however, not all Cauchy sequences are convergent, because the limit might not live in the original space where the sequence lives. This brings up the issue of completeness:
Definition 3.1.9 A normed linear space X is complete, if every Cauchy sequence in X has a limit in X. A complete normed
linear space is called Banach.
A map T from one complete normed linear space X to itself is called a contraction if for some 0 ≤ ρ < 1
||T (x) − T (y)|| ≤ ρ||x − y||, ∀x, y ∈ X.
Theorem 3.5. A contraction map in a Banach space has a unique fixed point.
Proof: {T n (x)} forms a Cauchy sequence, and by completeness, the Cauchy sequence has a limit.
⋄
Contraction Mapping via Dobrushin’s Ergodic Coefficient
Consider a countable state space Markov chain. Define
δ(P) = min_{i,k} ∑_j min(P(i, j), P(k, j)).
Observe that for two scalars a, b,
|a − b| = a + b − 2 min(a, b).
Let us define, for a vector v, the l1 norm
||v||_1 = ∑_i |v_i|.
It is known that the set of all countable-index real-valued vectors (that is, functions which map Z → R) with a finite l1 norm, {v : ||v||_1 < ∞}, is a complete normed linear space, and as such is a Banach space.
With these observations, we state the following:
Theorem 3.6 (Dobrushin). For any probability measures π, π ′ , it follows that
||πP − π ′ P ||1 ≤ (1 − δ(P ))||π − π ′ ||1
Proof: Let ψ(i) = π(i) − min(π(i), π′(i)) for all i. Further, let ψ′(i) = π′(i) − min(π(i), π′(i)). Then,
||π − π′||_1 = ||ψ − ψ′||_1 = 2||ψ||_1 = 2||ψ′||_1,
since the sum ∑_i (π(i) − π′(i)) = ∑_i (ψ(i) − ψ′(i)) is zero, and ∑_i |π(i) − π′(i)| = ||ψ||_1 + ||ψ′||_1.
Now,
||πP − π′P||_1 = ||ψP − ψ′P||_1
= ∑_j | ∑_i ψ(i) P(i, j) − ∑_k ψ′(k) P(k, j) |
= (1/||ψ′||_1) ∑_j | ∑_i ∑_k ψ(i) ψ′(k) P(i, j) − ψ(i) ψ′(k) P(k, j) |   (3.5)
≤ (1/||ψ′||_1) ∑_j ∑_i ∑_k ψ(i) ψ′(k) |P(i, j) − P(k, j)|   (3.6)
= (1/||ψ′||_1) ∑_i ∑_k |ψ(i)| |ψ′(k)| { ∑_j P(i, j) + P(k, j) − 2 min(P(i, j), P(k, j)) }   (3.7)
≤ (1/||ψ′||_1) ∑_i ∑_k |ψ(i)| |ψ′(k)| (2 − 2δ(P))   (3.8)
= ||ψ′||_1 (2 − 2δ(P))   (3.9)
= ||π − π′||_1 (1 − δ(P)).   (3.10)
In the above, (3.5) follows from adding terms in the summation (using ∑_k ψ′(k) = ||ψ′||_1), (3.6) from taking the absolute value inside the sums, (3.7) from the relation |a − b| = a + b − 2 min(a, b), (3.8) from the definition of δ(P), and finally (3.9) from the l1 norms of ψ, ψ′.
As such, the map π ↦ πP is a contraction if δ(P) > 0. In essence, one shows that the sequence {π_0 P^m} is Cauchy, and as every Cauchy sequence in a Banach space has a limit, this sequence has a limit; the limit is the invariant distribution.
⋄
It should also be noted that Dobrushin’s theorem tells us how fast the sequence of probability distributions {π0 P n } converges to the invariant distribution for any arbitrary π0 .
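As a numerical check of this contraction property (a minimal sketch in Python; the three-state kernel is hypothetical), one can compute δ(P) directly from its definition and compare the l1 distance between two iterated distributions with the bound 2(1 − δ(P))^t.

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
n = P.shape[0]

# Dobrushin coefficient: delta(P) = min_{i,k} sum_j min(P(i,j), P(k,j)).
delta = min(np.minimum(P[i], P[k]).sum() for i in range(n) for k in range(n))
print("delta(P) =", delta)

# Geometric contraction of the l1 distance between two initial distributions.
pi, pi_prime = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
for t in range(8):
    dist = np.abs(pi - pi_prime).sum()
    print(t, dist, "<=", 2 * (1 - delta) ** t)
    pi, pi_prime = pi @ P, pi_prime @ P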
Ergodic Theorem for Countable State Space Chains
For a Markov chain which has a unique invariant distribution µ(i), we have that, almost surely,
lim_{T→∞} (1/T) ∑_{t=1}^T f(x_t) = ∑_i f(i) µ(i)
for every f : X → R with ∑_i |f(i)| µ(i) < ∞.
This is called the ergodic theorem, due to Birkhoff, and is a very powerful result: in essence, this property is what makes the connection between stochastic control and Markov chains over a long time horizon.
In particular, for a stationary control policy leading to a unique invariant distribution with bounded costs, it follows that,
almost surely,
lim_{T→∞} (1/T) ∑_{t=1}^T c(x_t, u_t) = ∑_{x,u} c(x, u) µ(x, u), ∀ bounded c.
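A quick simulation can illustrate the ergodic theorem (a minimal sketch in Python; the kernel and the function f below are hypothetical): the time average of f along one long trajectory should be close to ∑_i f(i)µ(i).

import numpy as np

rng = np.random.default_rng(1)

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
f = np.array([1.0, 5.0, -2.0])   # an arbitrary bounded function on X = {0, 1, 2}

# Invariant distribution (left eigenvector of P for eigenvalue 1).
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

# One long trajectory: compare the time average with the statistical average.
T, x, total = 200000, 0, 0.0
for _ in range(T):
    total += f[x]
    x = rng.choice(3, p=P[x])
print(total / T, f @ mu)   # the two values should be close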
3.2 Uncountable (Complete, Separable, Metric) State Spaces
We now briefly extend the above definitions and discussions to an uncountable state space setting.
Let {x_t, t ∈ Z+} be a Markov chain with a complete, separable, metric state space (X, B(X)), defined on a probability space (Ω, F, P), where B(X) denotes the Borel σ-field on X, Ω is the sample space, F a sigma-field of subsets of Ω, and P a probability measure. Let P(x, D) := P(x_{t+1} ∈ D | x_t = x) denote the transition probability from x to D, that is, the probability of the event {x_{t+1} ∈ D} given that x_t = x.
Consider a Markov chain with transition probability given by P (x, D), that is
P (xt+1 ∈ D|xt = x) = P (x, D)
We can compute P(x_{t+k} ∈ D | x_t = x) inductively as follows:
P(x_{t+k} ∈ D | x_t = x) = ∫ · · · ∫ P(x_t, dx_{t+1}) · · · P(x_{t+k−2}, dx_{t+k−1}) P(x_{t+k−1}, D).
As such, we have, for all n ≥ 1, P^n(x, A) = P(x_{t+n} ∈ A | x_t = x) = ∫_X P^{n−1}(x, dy) P(y, A).
Definition 3.2.1 A Markov chain is µ-irreducible, if for any set B ∈ B(X), such that µ(B) > 0, and ∀x ∈ X, there exists
some integer n > 0, possibly depending on B and x, such that P n (x, B) > 0, where P n (x, B) is the transition probability
in n stages, that is P (xt+n ∈ B|xt = x).
A maximal irreducibility measure ψ is an irreducibility measure such that for all other irreducibility measures φ, we have
ψ(B) = 0 ⇒ φ(B) = 0 for any B ∈ B(X ) (that is, all other irreducibility measures are absolutely continuous with
respect to ψ). In the text, whenever a chain is said to be irreducible, a maximal irreducibility measure is implied. We also define B+(X) = {A ∈ B(X) : ψ(A) > 0}, where ψ is a maximal irreducibility measure.
Example: Linear system with a drift. Consider the following linear system:
x_{t+1} = a x_t + w_t.
This chain is Lebesgue-irreducible if w_t is a Gaussian random variable.
A ψ-irreducible Markov chain is periodic with period d > 1 if there exists a partition X = X_1 ∪ · · · ∪ X_d ∪ D with ψ(D) = 0 such that P(x, X_{i+1}) = 1 for all x ∈ X_i, i = 1, . . . , d − 1, and P(x, X_1) = 1 for all x ∈ X_d.
The definitions for recurrence and transience follow those in the countable state space setting.
Definition 3.2.2 A Markov chain is called recurrent if it is µ-irreducible and
E_x[∑_{t=1}^∞ 1_{{x_t∈A}}] = ∑_{t=1}^∞ P^t(x, A) = ∞, ∀x ∈ X,
whenever µ(A) > 0 and A ∈ B(X).
While studying chains in an infinite state space, we use a stronger recurrence definition: A set A ∈ B(X) is Harris
recurrent if the Markov chain visits A infinitely often with probability 1, when the process starts in A:
Definition 3.2.3 A set A ∈ B(X) is Harris recurrent if
P_x(η_A = ∞) = 1, ∀x ∈ A.   (3.11)
A Markov chain is Harris recurrent if the chain is ψ-irreducible and
P_x(η_A = ∞) = 1, ∀x ∈ X,
for every A ∈ B(X) with ψ(A) > 0.
Theorem 3.7. Harris recurrence of a set A is equivalent to
Px [τA < ∞] = 1, ∀x ∈ A.
Proof: Let τA (1) be the first time the state hits A. By the Strong Markov Property, the Markov chain sampled at successive
intervals τA (1), τA (2) and so on is also a Markov chain. Let Q be the transition kernel for this sampled Markov Chain.
Now, the probability of τ_A(2) < ∞ can be computed as
P(τ_A(2) < ∞) = ∫_A Q(x_{τ_A(1)}, dy) P_y(τ_A(1) < ∞) = ∫_A Q(x_{τ_A(1)}, dy) = 1,   (3.12)
since P_y(τ_A(1) < ∞) = 1. By induction, for every n ∈ Z+,
P(τ_A(n+1) < ∞) = ∫_A Q(x_{τ_A(1)}, dy) P_y(τ_A(n) < ∞) = 1.   (3.13)
Now,
P_x(η_A ≥ k) = P_x(τ_A(k) < ∞),
since visiting a set k times requires returning to the set k times when the initial state x is in the set. As such,
P_x(η_A ≥ k) = 1, ∀k ∈ Z+.
Define B_k = {ω ∈ Ω : η_A(ω) ≥ k}; it follows that B_{k+1} ⊂ B_k. By the continuity of probability, P(∩_k B_k) = lim_{k→∞} P(B_k), and it follows that P_x(η_A = ∞) = 1.
The proof of the other direction for showing equivalence is left as an exercise to the reader.
⋄
Definition 3.2.4 A Markov Chain is Harris recurrent if the chain is µ−irreducible and every set A ⊂ X is Harris recurrent
whenever µ(A) > 0. If the chain admits an invariant probability measure, then the chain is called positive Harris recurrent.
Remark: Harris recurrence is stronger than recurrence: in one, an expectation is considered; in the other, a probability. Consider the following example: let P(1, 1) = 1, and P(x, x + 1) = 1 − 1/x², P(x, 1) = 1/x² for x ≥ 2. Then P_x(τ_1 = ∞) = ∏_{t≥x} (1 − 1/t²) > 0, so this chain is not Harris recurrent. It is π-irreducible for the measure π(A) = 1 if 1 ∈ A. Hence, this chain is recurrent but not Harris recurrent. See [48].
⋄
If a set is not recurrent, it is transient. A set A is transient if
U(x, A) = E_x[η_A] < ∞, ∀x ∈ A.   (3.14)
Note that this is equivalent to
∑_{i=1}^∞ P^i(x, A) < ∞, ∀x ∈ A.
3.2.1 Invariant Distributions for Uncountable Spaces
Definition 3.2.5 For a Markov chain with transition probability defined as before, a probability measure π is invariant on
the Borel space (X, B(X)) if
π(D) = ∫_X P(x, D) π(dx), ∀D ∈ B(X).
Theorem 3.2.1 Let {xt } be a ψ-irreducible Markov chain which admits an invariant probability measure. The invariant
measure is unique.
Proof. Suppose there are two distinct invariant probability measures µ_1 and µ_2. Then there exist two mutually singular invariant probability measures ν_1 and ν_2, that is, there exist disjoint sets B_1, B_2 with ν_1(B_1) = 1, ν_2(B_2) = 1, and P^n(x, B_1^C) = 0 for all x ∈ B_1 and n ∈ Z+, and likewise P^n(z, B_2^C) = 0 for all z ∈ B_2 and n ∈ Z+. This then implies that the irreducibility measure has zero support on B_1^C and on B_2^C, and hence on X (since B_1^C ∪ B_2^C = X), leading to a contradiction. ⋄
Uncountable-space chains act like countable ones when there is an atom α which satisfies a finite mean return time property, to be discussed below.
Definition 3.2.6 A set α is called an atom if there exists a probability measure ν such that
P (x, A) = ν(A),
∀x ∈ α, ∀A ∈ B(X).
If the chain is µ−irreducible and µ(α) > 0, then α is called an accessible atom.
In case there is an accessible atom α, we have the following:
Theorem 3.8. For a µ-irreducible Markov chain for which E_α[τ_α] < ∞, the following is the invariant probability measure:
π(A) = E_α[∑_{k=0}^{τ_α−1} 1_{{x_k∈A}} | x_0 = α] / E_α[τ_α], ∀A ∈ B(X) with µ(A) > 0.
Small Sets and Nummelin and Athreya-Ney’s Splitting Technique
In case an atom is not present, one may construct an artificial atom:
Definition 3.2.7 A set A ⊂ X is n-small on (X, B(X)) if for some positive measure µ
P n (x, B) ≥ µ(B),
∀x ∈ A, and B ∈ B(X),
where B(X) denotes the (Borel) sigma-field on X.
Definition 3.2.8 (Meyn-Tweedie) A set A ⊂ X is µ-petite on (X, B(X)) if for some distribution T on N (the set of natural numbers) and some measure µ,
∑_{n=0}^∞ P^n(x, B) T(n) ≥ µ(B), ∀x ∈ A and B ∈ B(X),
where B(X) denotes the (Borel) sigma-field on X.
Theorem 3.9. [Theorem 5.5.3 of [40]] For an aperiodic and irreducible Markov chain {xt } every petite set is small.
Theorem 3.10. For an aperiodic and irreducible Markov chain every petite set is petite with a maximal irreducibility
measure for a distribution with finite mean.
The results on recurrence apply to uncountable chains with no atom provided there is a small set or a petite set. In the
following, we construct an artificial atom through what is commonly known as the splitting technique, see [45] [46] (see
also [6]).
Suppose a set A is 1-small, with P(x, ·) ≥ ν(·) for all x ∈ A. Define a process z_t = (x_t, a_t), z_t ∈ X × {0, 1}; that is, we enlarge the probability space. Suppose that when x_t ∉ A, a_t and x_t evolve independently from each other. However, when x_t ∈ A, we pick a Bernoulli random variable: with probability δ the state visits A × {1} and with probability 1 − δ it visits A × {0}. From A × {1}, the transition for the next time stage is given by ν(dx_{t+1})/δ, and from A × {0}, the next state is distributed according to
(P(dx_{t+1} | x_t) − ν(dx_{t+1})) / (1 − δ).
Now pick δ = ν(X). In this case, A × {1} is an accessible atom, and one can verify that the marginal distribution of the original Markov process {x_t} has not been altered.
The following can be established using the above construction.
Proposition 3.2.1 If
sup_{x∈A} E[min(t > 0 : x_t ∈ A) | x_0 = x] < ∞,
then
sup_{z∈A×{1}} E[min(t > 0 : z_t ∈ A × {1}) | z_0 = z] < ∞.
Now suppose that a set A is m-small. Then, we can construct a split chain for the sampled process xmn , n ∈ N. Note that
this sampled chain has a transition kernel as P m . We replace the discussion for the 1-small case with the sampled chain
(also known as the m-skeleton of the original chain). If one can show that the sampled chain has an invariant measure π_m, then (see Theorem 10.4.5 of [38])
π(B) := (1/m) ∑_{k=0}^{m−1} ∫ π_m(dx) P^k(x, B)   (3.15)
is invariant for P . Furthermore, π is also invariant for the sampled chain with kernel P m . Hence if P m leads to a unique
invariant probability measure, π = πm .
3.2.2 Existence of an Invariant Distribution
We state the following very useful result on the existence of invariant distributions for Markov chains.
Theorem 3.11. (Meyn-Tweedie) Consider a Harris recurrent Markov process {xt } taking values in X. If there exists a set
A which is also a µ-petite set for some positive measure µ, and if the set satisfies
sup_{x∈A} E[min(t > 0 : x_t ∈ A) | x_0 = x] < ∞,
then the Markov chain is positive Harris recurrent and admits a unique invariant distribution.
See Remark 4.1 on a positive Harris recurrence discussion for an m-skeleton and split chains: When a Markov chain has an
invariant probability measure, the sampled chain (m-skeleton) also satisfies a drift condition, which then leads to the result
that an atom constructed through an m-skeleton has a finite return property, which can be used to establish the existence of
an invariant probability measure.
In this case, the invariant measure satisfies the following, which is a generalization of Kac’s Lemma [28]:
Theorem 3.12. For a µ-irreducible Markov chain with a unique invariant probability measure π, the invariant probability measure satisfies
π(A) = ∫_C π(dx) E_x[∑_{k=0}^{τ_C−1} 1_{{x_k∈A}}], ∀A ∈ B(X) with µ(A) > 0, for any C with π(C) > 0.
The above can also be extended to compute the expected value of a function of the Markov state. The formula can be verified along the same lines as in the countable state space case (see Theorem 3.2).
Example 3.13. We will show later that the linear system
x_{t+1} = a x_t + w_t
is positive Harris recurrent if |a| < 1, null-recurrent if |a| = 1, and transient if |a| > 1.
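The three regimes of Example 3.13 can be observed in simulation; the following is a minimal sketch in Python, assuming standard Gaussian noise (for |a| < 1 the invariant law is then N(0, 1/(1 − a²))).

import numpy as np

rng = np.random.default_rng(2)

def simulate(a, T=5000, x0=0.0):
    # x_{t+1} = a x_t + w_t with w_t i.i.d. standard Gaussian.
    x = np.empty(T)
    x[0] = x0
    for t in range(T - 1):
        x[t + 1] = a * x[t] + rng.standard_normal()
    return x

for a in (0.5, 1.0, 1.1):
    x = simulate(a)
    print(a, np.abs(x[-100:]).mean())
# For a = 0.5 the trajectory fluctuates around a stationary regime,
# for a = 1.0 it behaves like a null-recurrent random walk,
# and for a = 1.1 it typically diverges geometrically.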
3.2.3 Dobrushin’s Ergodic Coefficient for Uncountable State Space Chains
We can extend Dobrushin's contraction result to the uncountable state space case. By the identity |a − b| = a + b − 2 min(a, b), Dobrushin's coefficient can also be written (for the countable state space case) as
δ(P) = 1 − (1/2) max_{i,k} ∑_j |P(i, j) − P(k, j)|.
In case P(x, dy) is the stochastic transition kernel of a real-valued Markov process whose transition kernels admit densities (that is, P(x, A) = ∫_A P(x, y) dy for a density P(x, ·)), the expression
δ(P) = 1 − (1/2) sup_{x,z} ∫_R |P(x, y) − P(z, y)| dy
is Dobrushin's ergodic coefficient for R-valued Markov processes.
As such, if δ(P) > 0, then the iterations π_t(·) = ∫ π_{t−1}(z) P(z, ·) dz converge to a unique fixed point.
3.2.4 Ergodic Theorem for Positive Harris Recurrent Chains
If µ is the invariant probability measure for a positive Harris recurrent Markov chain, it follows that, almost surely,
lim_{T→∞} (1/T) ∑_{t=1}^T c(x_t) = ∫ c(x) µ(dx), ∀c ∈ L_1(µ) := {f : ∫ |f(x)| µ(dx) < ∞}.
3.3 Further Conditions on the Existence of an Invariant Probability Measure
This section uses certain properties of the space of probability measures, reviewed briefly in Section 6.4.
A Markov chain is weak Feller if ∫_X P(dz|x) v(z) is continuous in x for every continuous and bounded v on X.
Theorem 3.14. Let {x_t} be a weak Feller Markov process living in a compact subset of a complete, separable metric space. Then {x_t} admits an invariant distribution.
Proof. The proof follows from the observation that any family of probability measures on a compact set is tight (and hence weakly sequentially pre-compact). Consider the sequence µ_T = (1/T) ∑_{t=0}^{T−1} µ_0 P^t, T ≥ 1. There exists a subsequence µ_{T_k} which converges weakly to some µ*. It follows that for every continuous and bounded function f,
⟨µ_{T_k}, f⟩ := ∫ µ_{T_k}(dx) f(x) → ⟨µ*, f⟩.
Likewise,
⟨µ_{T_k}, Pf⟩ := ∫ µ_{T_k}(dx) ( ∫ P(dy|x) f(y) ) → ⟨µ*, Pf⟩.
Now,
(µ_{T_k} − µ_{T_k}P)(f) = (1/T_k) E_{µ_0}[ ∑_{t=0}^{T_k−1} ( f(x_t) − f(x_{t+1}) ) ] = (1/T_k) E_{µ_0}[ f(x_0) − f(x_{T_k}) ] → 0.   (3.16)
Thus,
(µ_{T_k} − µ_{T_k}P)(f) = ⟨µ_{T_k}, f⟩ − ⟨µ_{T_k}P, f⟩ = ⟨µ_{T_k}, f⟩ − ⟨µ_{T_k}, Pf⟩ → ⟨µ* − µ*P, f⟩ = 0.
Thus, µ* is an invariant probability measure.
⋄
Lasserre [35] gives the following example to emphasize the importance of the Feller property: consider a Markov chain evolving in [0, 1] given by P(x, x/2) = 1 for all x ≠ 0 and P(0, 1) = 1. This chain does not admit an invariant probability measure.
Remark 3.15. One can relax the weak Feller condition and instead consider spaces of probability measures which are setwise sequentially pre-compact. The proof of this result follows from a similar observation as (3.16), but with weak convergence replaced by setwise convergence. It can be shown (as in the proof of Theorem 3.14) that a (sub)sequence of occupation measures which converges setwise converges to an invariant probability measure. As an example, consider a system of the form
x_{t+1} = f(x_t) + w_t,   (3.17)
where w_t admits a distribution with a bounded density function which is positive everywhere, and f is bounded. This system admits an invariant probability measure which is unique.
3.4 Exercises
Exercise 3.4.1 For a countable state space Markov chain, prove that if {xt } is irreducible, then all states have the same
period.
Exercise 3.4.2 Prove that
P_x(τ_A = 1) = P(x, A),
and, for n > 1,
P_x(τ_A = n) = ∑_{i∉A} P(x, i) P_i(τ_A = n − 1).
Exercise 3.4.3 Consider a line of customers at a service station (such as at an airport, a grocery store, or a communication
network where customers are packets). Let Lt be the length of the line, that is the total number of customers waiting in the
line.
Let there be Mt servers, serving the customers at time t. Let there be a manager (controller) who decides on the number of
servers to be present. Let each of the servers be able to serve N customers for every time-stage. The dynamics of the line
can be expressed as follows:
Lt+1 = Lt + At − Mt N 1(Lt ≥N Mt ) ,
where 1(E) is the indicator function for event E, i.e., it is equal to zero if E does not occur and is equal to 1 otherwise. In
the equation above, A_t is the number of customers that have just arrived at time t. We assume {A_t} to be an independent process with a Poisson distribution with mean λ, that is,
P(A_t = k) = λ^k e^{−λ} / k!, k ∈ {0, 1, 2, . . .}.
The manager has only access to the information vector
It = {L0 , L1 , . . . , Lt ; A0 , A1 , . . . , At−1 },
while implementing his policies. A consultant proposes a number of possible policies to be adopted by the manager:
According to Policy A, the number of servers is given by
M_t = (L_t + L_{t−1} + L_{t−3}) / 2.
According to Policy B, the number of servers is
M_t = (L_{t+1} + L_t + L_{t−1}) / 2.
According to Policy C,
M_t = ⌈λ + 0.1⌉ 1_{{t≥10}}.
Finally, according to Policy D, the update is
M_t = ⌈λ + 0.1⌉.
a) Which of these policies are admissible, that is measurable with respect to the σ−field generated by It ? Which are
Markov, or stationary?
b) Consider the model above. Further, suppose policy D is adopted by the manager. Consider the induced Markov chain
by the policy, that is, consider a Markov chain with the following dynamics:
L_{t+1} = L_t + A_t − ⌈λ + 0.1⌉ N 1_{(L_t ≥ ⌈λ+0.1⌉N)}.
Is this Markov chain irreducible?
Is there an absorbing set in this Markov chain? If so, what is it?
Exercise 3.4.4 Show that irreducibility of a Markov chain in a finite state space implies that every set A and every x
satisfies U (x, A) = ∞.
Exercise 3.4.5 Show that for an irreducible Markov chain, either the entire chain is transient, or recurrent.
Exercise 3.4.6 If P_a(τ_a < ∞) < 1, show that E_a[∑_{k=1}^∞ 1_{{x_k=a}}] < ∞.
Exercise 3.4.7 If P_a(τ_a = ∞) = 0, show that P_a(∑_{k=1}^∞ 1_{{x_k=a}} = ∞) = 1.
Exercise 3.4.8 (Gambler’s Ruin) Consider an asymmetric random walk defined as follows: P (xt+1 = x+1|xt = x) = p
and P (xt+1 = x − 1|xt = x) = 1 − p for any integer x. Suppose that x0 = x is an integer between 0 and N . Let
τ = min(k > 0 : x_k ∉ [1, N − 1]). Compute P_x(x_τ = N) (you may use Matlab for your solution).
Hint: Observe that one can obtain the recursion P_x(x_τ = N) = p P_{x+1}(x_τ = N) + (1 − p) P_{x−1}(x_τ = N) for 1 ≤ x ≤ N − 1, with boundary conditions P_N(x_τ = N) = 1 and P_0(x_τ = N) = 0. One observes that
P_{x+1}(x_τ = N) − P_x(x_τ = N) = ((1 − p)/p) (P_x(x_τ = N) − P_{x−1}(x_τ = N)),
and in particular
P_N(x_τ = N) − P_{N−1}(x_τ = N) = ((1 − p)/p)^{N−1} (P_1(x_τ = N) − P_0(x_τ = N)).
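As a numerical alternative to Matlab, the boundary-value recursion in the hint can also be solved directly as a linear system; the following is a minimal sketch in Python with hypothetical values p = 0.45 and N = 10.

import numpy as np

p, N = 0.45, 10   # hypothetical parameter values

# Solve h(x) = p h(x+1) + (1-p) h(x-1) for 1 <= x <= N-1,
# with boundary conditions h(0) = 0 and h(N) = 1,
# where h(x) = P_x(x_tau = N).
A = np.zeros((N + 1, N + 1))
b = np.zeros(N + 1)
A[0, 0], A[N, N], b[N] = 1.0, 1.0, 1.0
for x in range(1, N):
    A[x, x - 1], A[x, x], A[x, x + 1] = -(1 - p), 1.0, -p
h = np.linalg.solve(A, b)
print(h)   # h[x] is the probability of reaching N before 0, starting from x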
4
Martingales and Foster-Lyapunov Criteria for Stabilization of Markov Chains
4.1 Martingales
In this chapter, we first discuss some martingale theorems. Only a few of these will be important within the scope of our
coverage, some others are presented for the sake of completeness.
These are very important for us to understand stabilization of controlled stochastic systems. These also will pave the way
to optimization of dynamical systems. The second half of this chapter is on the stabilization of Markov Chains.
4.1.1 More on Expectations and Conditional Probability
Let (Ω, F, P) be a probability space and let G be a sub-σ-field of F. Let X be an R-valued random variable, measurable with respect to (Ω, F), with a finite absolute expectation, that is,
E[|X|] = ∫_Ω |X(ω)| P(dω) < ∞.
We call such random variables integrable.
We say that Ξ is the conditional expectation random variable (also called a version of the conditional expectation) E[X|G] of X given G if
i) Ξ is G-measurable; and
ii) for every A ∈ G,
E[1_A Ξ] = E[1_A X],
where E[1_A X] = ∫_Ω X(ω) 1_{{ω∈A}} P(dω).
For example, if the information that we know about a process is whether an event A ∈ F happened or not, then
X_A := E[X|A] = (1/P(A)) ∫_A X(ω) P(dω).
If the information we have is that A did not take place,
X_{A^C} := E[X|A^C] = (1/P(Ω \ A)) ∫_{Ω\A} X(ω) P(dω).
Thus, the conditional expectation given the sigma-field generated by A (which is F_A = {∅, Ω, A, Ω \ A}) is given by
E[X|F_A] = X_A 1_{{ω∈A}} + X_{A^C} 1_{{ω∉A}}.
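A small Monte Carlo experiment can make the two-valued random variable E[X|F_A] concrete; the following is a minimal sketch in Python where, purely for illustration, Ω corresponds to rolls of a fair die, X is the face value, and A is the event that the face is even.

import numpy as np

rng = np.random.default_rng(3)

omega = rng.integers(1, 7, size=100000)   # die rolls
X = omega.astype(float)                   # X(omega) = face value
A = (omega % 2 == 0)                      # the event A = {face is even}

X_A = X[A].mean()      # approximately E[X | A]   = (2 + 4 + 6) / 3 = 4
X_Ac = X[~A].mean()    # approximately E[X | A^C] = (1 + 3 + 5) / 3 = 3
cond_exp = np.where(A, X_A, X_Ac)   # the random variable E[X | F_A]

# Check the defining property E[1_B E[X|F_A]] = E[1_B X] for B = A.
print((A * cond_exp).mean(), (A * X).mean())   # the two values should be close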
It follows from the above that conditional probability can be expressed as
P(A|G) = E[1_A | G];
hence, conditional probability is a special case of conditional expectation.
The notion of conditional expectation is key for the development of stochastic processes which evolve according to a
transition kernel. It is also useful for optimal decision making when partial information is available with regard to a random variable.
The following discussion is optional until the next subsection.
Theorem 4.1.1 (Radon-Nikodym) Let µ and ν be two σ−finite positive measures on (Ω, F ) such that ν(A) = 0 implies
that µ(A) = 0 (that is µ is absolutely continuous with respect to ν). Then, there exists a measurable function f : Ω → R+
such that for every A ∈ F:
µ(A) = ∫_A f(ω) ν(dω).
The representation above is unique, up to sets of measure zero. With the above discussion, the conditional expectation E[X|F′] exists for any sub-σ-field F′ of F.
It is a useful exercise to now consider the σ-field generated by an observation variable, and what a conditional expectation
means in this case.
Theorem 4.1.2 Let X be an X-valued random variable, where X is a complete, separable, metric space, and let Y be another Y-valued random variable. Then, the random variable X is F_Y-measurable (where F_Y is the σ-field generated by Y) if and only if there exists a measurable function f : Y → X such that X = f(Y(ω)).
Proof: We prove the more difficult direction. Suppose X takes values in a countable space {a1 , a2 , . . . , an , . . . }. We can
write A_n = X^{−1}(a_n). Since X is measurable with respect to F_Y, A_n ∈ F_Y. Let us now make these sets disjoint: define C_1 = A_1 and C_n = A_n \ ∪_{i=1}^{n−1} A_i. These sets are disjoint, and they all are in F_Y. Thus, we can construct a measurable function defined as f(ω) = a_n if ω ∈ C_n.
The issue is now what happens when X takes values in a more general complete, separable, metric space such as R. We can define countably many sets which cover the entire state space (such as intervals with rational endpoints covering R). Let us construct a sequence of partitions Γ_n = {B_{i,n}}, with ∪_i B_{i,n} = X, which become finer and finer (for R, for example, intervals of the form [k/n, (k + 1)/n)). For every such partition we can define a measurable function f_n. Furthermore, f_n(ω) converges to a function f(ω) pointwise, and the limit is measurable.
⋄
With the above, the expectation E[X|Y = y_0] can be defined; this expectation is a measurable function of Y.
4.1.2 Some Properties of Conditional Expectation:
One very important property is given by the following.
Iterated expectations:
Theorem 4.1.3 If H ⊂ G ⊂ F , and X is F −measurable, then it follows that:
E[E[X|G]|H] = E[X|H]
Proof: The proof follows by taking a set A ∈ H, which is also in G and F. Let η be the conditional expectation of X with respect to H. Then it follows that
E[1_A η] = E[1_A X].
Now let E[X|G] be η ′ . Then, it must be that E[1A η ′ ] = E[1A X] for all A ∈ G and hence for all A ∈ H. Thus, the two
expectations are the same.
⋄
4.1.3 Discrete-Time Martingales
Let (Ω, F , P ) be a probability space. An increasing family {Fn } of sub-σ−fields of F is called a filtration.
A sequence of random variables {X_n} on (Ω, F, P) is said to be adapted to F_n if X_n is F_n-measurable, that is, X_n^{−1}(D) = {ω ∈ Ω : X_n(ω) ∈ D} ∈ F_n for all Borel D. This holds, for example, if F_n = σ(X_m, m ≤ n), n ≥ 0.
Given a filtration Fn and a sequence of real random variables adapted to it, (Xn , Fn ) is said to be a martingale if
E[|Xn |] < ∞
and
E[Xn+1 |Fn ] = Xn .
We will occasionally take the sigma-fields to be Fn = σ(X1 , X2 , . . . , Xn ).
Let n > m with n, m ∈ Z+. Then, since F_m ⊂ F_n, any A ∈ F_m is also in F_n. Thus, if {X_n} is a martingale sequence,
E[1_A X_n] = E[1_A X_{n−1}] = · · · = E[1_A X_m].
Thus, E[X_n | F_m] = X_m.
If we have that
E[Xn |Fm ] ≥ Xm
then {Xn } is called a submartingale.
And, if
E[Xn |Fm ] ≤ Xm
then {Xn } is called a supermartingale.
A useful concept related to filtrations is that of a stopping time, which we discussed while studying Markov chains. A
stopping time is a random time, whose occurrence is measurable with respect to the filtration in the sense that for each
n ∈ N, {T ≤ n} ∈ Fn .
4.1.4 Doob’s Optional Sampling Theorem
Theorem 4.1.4 Suppose (Xn , Fn ) is a martingale sequence, and ρ, τ < n are bounded stopping times with ρ ≤ τ . Then,
E[Xτ |Fρ ] = Xρ
Proof: We observe that
E[X_τ − X_ρ | F_ρ] = E[∑_{k=ρ}^{τ−1} (X_{k+1} − X_k) | F_ρ] = E[∑_{k=ρ}^{τ−1} E[X_{k+1} − X_k | F_k] | F_ρ] = E[∑_{k=ρ}^{τ−1} 0 | F_ρ] = 0.   (4.1)
⋄
Theorem 4.1.5 Let {Xn } be a sequence of Fn -adapted integrable real random variables. Then, the following are equivalent: (i) (Xn , Fn ) is a sub-martingale. (ii) If T, S are bounded stopping times with T ≥ S (almost surely), then
E[XS ] ≤ E[XT ]
Proof: Let S ≤ T ≤ n. Now, note that
E[X_T − X_S | F_S] = ∑_{k=1}^n E[1_{(T≥k)} 1_{(S<k)} (X_k − X_{k−1}) | F_S]
= ∑_{k=1}^n E[ E[1_{(T≥k)} 1_{(S<k)} (X_k − X_{k−1}) | F_{k−1}] | F_S ] ≥ 0,   (4.2)
where we use the fact that 1(T ≥k) 1(S<k) ∈ Fk−1 (at time k − 1 it is known whether S or T are greater or not).
Now, taking the expectation with respect to a smaller σ−field F0 , the desired result follows.
⋄
In the above, the main properties we used were (i) the fact that the sub-fields are nested, (ii) n is bounded (this is important
while considering a martingale, since the summation of an infinite number of terms might not be well-defined) and (iii) the
random variables are integrable.
Let us try to see why boundedness of the stopping times is important.
Consider the following game. One tosses a fair coin, with equal probabilities of heads and tails. If we get a tail, we win a dollar; a head makes us lose a dollar. Suppose we have 0 dollars at time 0 and we decide to stop when we have 5 dollars, that is, at time τ = min(n > 0 : X_n = 5). In this case, clearly E[X_τ] = 5, as we stop only when we have 5 dollars. But the optional sampling theorem would suggest E[X_τ] = X_0 = 0. There is something inconsistent in this example.
For this example, X_n = X_{n−1} + W_n, where W_n is either −1 or 1 with equal probabilities and X_n is the amount of money we have. Clearly {X_n} is a martingale sequence for all finite n ∈ Z+. The problem is that one might have to wait for an unbounded amount of time to reach the 5 dollars; the stopping time τ is not bounded, and the martingale optional sampling theorem does not apply.
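The fair-coin example can also be simulated; the following minimal sketch in Python illustrates that the level 5 is indeed reached (almost surely), but the hitting time is unbounded with a very heavy tail, which is why the bounded-stopping-time hypothesis of the optional sampling theorem fails.

import numpy as np

rng = np.random.default_rng(4)

def hitting_time(level=5, max_steps=10**6):
    # X_n = X_{n-1} + W_n with W_n = +/-1 equally likely, X_0 = 0;
    # return the first n with X_n = level (or None if not reached in time).
    x = 0
    for n in range(1, max_steps + 1):
        x += 1 if rng.random() < 0.5 else -1
        if x == level:
            return n
    return None

times = [hitting_time() for _ in range(200)]
reached = [t for t in times if t is not None]
print(len(reached), "of 200 runs reached 5;",
      "average hitting time among those:", np.mean(reached))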
4.1.5 An Important Martingale Convergence Theorem
We first discuss Doob’s upcrossing lemma. Let (a, b) be a non-empty interval. Let X0 ∈ (a, b). Define a sequence of
stopping times
T0 = min{N, min(0 ≤ n ≤ N, Xn ∈ (a, b))},
T1 = min{N, min(T0 ≤ n ≤ N, Xn ≤ a)}
T2 = min{N, min(T1 ≤ n ≤ N, Xn ≥ b)}
T3 = min{N, min(T2 ≤ n ≤ N, Xn ≤ a)}
T4 = min{N, min(T3 ≤ n ≤ N, Xn ≥ b)}
and for m ≥ 1:
T2m−1 = min{N, min(T2m−2 ≤ n ≤ N, Xn ≤ a)}
T2m = min{N, min(T2m−1 ≤ n ≤ N, Xn ≥ b)}
The number of upcrossings of (a, b) up to time N is the random variable ζ_N(a, b), defined as the number of times that {X_n} crosses the strip (a, b) from below a to above b between times 0 and N.
Note that XT2 − XT1 has the expectation zero, if the sequence is a martingale!
Theorem 4.1.6 Let {X_n} be a supermartingale sequence. Then,
E[ζ_N(a, b)] ≤ E[max(0, a − X_N)] / (b − a) ≤ (E[|X_N|] + |a|) / (b − a).
Proof:
There are three possibilities at time N: the process can end below a, between a and b, or above b. If it crosses above b, then an upcrossing has been completed. In view of this, we may proceed as follows. Let β_N := max(m : T_{2m−1} ≤ N), that is, T_{2β_N} = N or T_{2β_N−1} = N. Here, β_N is measurable on σ(X_1, X_2, . . . , X_N).
Then, since the sequence is a supermartingale,
0 ≥ E[∑_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}})]
= E[∑_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}}) 1_{{T_{2β_N−1} ≠ N}} 1_{{T_{2β_N} = N}}] + E[∑_{i=1}^{β_N} (X_{T_{2i}} − X_{T_{2i−1}}) 1_{{T_{2β_N−1} = N}}]
= E[( ∑_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) + X_N − X_{T_{2β_N−1}} ) 1_{{T_{2β_N} = N}}] + E[∑_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}}) 1_{{T_{2β_N−1} = N}}]
= E[∑_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}})] + E[(X_N − X_{T_{2β_N−1}}) 1_{{T_{2β_N} = N}}].   (4.3)
Thus,
E[∑_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}})] ≤ −E[(X_N − X_{T_{2β_N−1}}) 1_{{T_{2β_N} = N}}] ≤ E[max(0, a − X_N) 1_{{T_{2β_N} = N}}] ≤ E[max(0, a − X_N)].   (4.4)
Since E[∑_{i=1}^{β_N−1} (X_{T_{2i}} − X_{T_{2i−1}})] ≥ E[β_N − 1](b − a), it follows that ζ_N(a, b) = β_N − 1 satisfies
E[ζ_N(a, b)](b − a) ≤ E[max(0, a − X_N)] ≤ |a| + E[|X_N|],
and the result follows.
⋄
Recall that a sequence of random variables {X_n} defined on a probability space (Ω, F, P) converges to X almost surely (a.s.) if
P(ω : lim_{n→∞} X_n(ω) = X(ω)) = 1.
Theorem 4.1.7 Suppose X_n is a supermartingale and sup_{n≥0} E[max(0, −X_n)] < ∞. Then lim_{n→∞} X_n = X exists (almost surely). Note that the same result applies for submartingales, by regarding −X_n as a supermartingale, under the condition sup_{n≥0} E[max(0, X_n)] < ∞. A sufficient condition for both cases is that
sup_{n≥0} E[|X_n|] < ∞.
Proof: The proof follows from Doob's upcrossing lemma. Suppose X_n does not converge to a variable almost surely. This means that the set of sample paths ω for which lim sup X_n(ω) ≠ lim inf X_n(ω) has a positive measure. In particular, there exist reals a, b (depending on ω) such that lim sup X_n(ω) ≥ b(ω), lim inf X_n(ω) ≤ a(ω) and b(ω) > a(ω).
Now, for any fixed a, b (independent of ω), by the upcrossing lemma we have that
E[ζ_N(a, b)] ≤ E[max(0, a − X_N)] / (b − a) ≤ (E[|X_N|] + |a|) / (b − a).
This is true for every N. Since ζ_N(a, b) is a monotonically increasing sequence in N, by the monotone convergence theorem it follows that
lim_{N→∞} E[ζ_N(a, b)] = E[lim_{N→∞} ζ_N(a, b)] < ∞.
Thus, for every fixed a, b, the number of up-crossings has to be finite and
P (ω : | lim sup Xn (ω) − lim inf Xn (ω)| > (b − a)) = 0
for otherwise the expectation would be unbounded. A continuity of probability argument then leads to
P (ω : | lim sup Xn (ω) − lim inf Xn (ω)| > 0) = 0.
⋄
We can also show that the limit variable has finite absolute expectation.
Theorem 4.1.8 (Submartingale Convergence Theorem) Suppose Xn is a submartingale and supn≥0 E[|Xn |] < ∞.
Then X := limn→∞ Xn exists (almost surely) and E[|X|] < ∞.
Proof: Note that, supn≥0 E[|Xn |] < ∞, is a sufficient condition both for a submartingale and a supermartingale in
Theorem 4.1.7. Hence Xn → X almost surely. For finiteness, suppose E[|X|] = ∞. By Fatou’s lemma,
lim sup E[|Xn |] ≥ E[lim inf |Xn |] = ∞.
But this is a contradiction as we had assumed that supn E[|Xn |] < ∞.
4.1.6 Proof of the Birkhoff Individual Ergodic Theorem
This will be discussed in class.
4.1.7 This section is optional: Further Martingale Theorems
This section is optional. If you wish not to read it, please proceed to the discussion on stabilization of Markov Chains.
Theorem 4.1.9 Let Xn be a martingale such that Xn converges to X in L1 that is E[|Xn − X|] → 0. Then,
Xn = E[X|Fn ],
n∈N
We will use the following while studying the convex analytic method. Let us define uniform integrability:
Definition 4.1.1 A sequence of random variables {X_n} is uniformly integrable if
lim_{K→∞} sup_n ∫_{{|X_n|≥K}} |X_n| P(dX_n) = 0.
This implies that
sup_n E[|X_n|] < ∞.
⋄
Let, for some ǫ > 0,
sup_n E[|X_n|^{1+ǫ}] < ∞.
This implies that the sequence is uniformly integrable, as
sup_n ∫_{{|X_n|≥K}} |X_n| P(dX_n) ≤ sup_n ∫_{{|X_n|≥K}} (|X_n|/K)^ǫ |X_n| P(dX_n) ≤ (1/K^ǫ) sup_n E[|X_n|^{1+ǫ}] → 0 as K → ∞.
The following is a very important result:
Theorem 4.1.10 If Xn is a uniformly integrable martingale, then X = limn→∞ Xn exists almost surely (for all sequences
with probability 1) and in L1 , and Xn = E[X|Fn ].
Optional Sampling Theorem For Uniformly Integrable Martingales
Theorem 4.1.11 Let (Xn , Fn ) be a uniformly integrable martingale sequence, and ρ, τ are stopping times with ρ ≤ τ .
Then,
E[Xτ |Fρ ] = Xρ
Proof: By Uniform Integrability, it follows that {Xt } has a limit. Let this limit be X∞ . It follows that E[X∞ |Fτ ] = Xτ
and
E[E[X∞ |Fτ ]|Fρ ] = Xρ
which is also equal to E[Xτ |Fρ ] = Xρ .
⋄
4.1.8 Azuma-Hoeffding Inequality for Martingales with Bounded Increments
The following is an important concentration result:
Theorem 4.1.12 Let X_t be a martingale sequence such that |X_t − X_{t−1}| ≤ c for every t, almost surely. Then for any x > 0,
P(|X_t − X_0|/t ≥ x) ≤ 2 e^{−t x²/(2c²)}.
As a result, X_t/t → 0 almost surely.
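The bound can be checked empirically for a ±1 random walk (a martingale with increments bounded by c = 1); the following is a minimal sketch in Python.

import numpy as np

rng = np.random.default_rng(5)

# Empirical check of the Azuma-Hoeffding bound for a +/-1 random walk.
t, runs, x = 1000, 20000, 0.1
increments = rng.choice([-1.0, 1.0], size=(runs, t))
X_t = increments.sum(axis=1)          # X_t - X_0 for each run

empirical = np.mean(np.abs(X_t) / t >= x)
bound = 2 * np.exp(-t * x**2 / 2)     # with c = 1
print(empirical, "<=", bound)         # the empirical frequency lies below the bound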
4.2 Stability of Markov Chains: Foster-Lyapunov Techniques
A Markov chain’s stability can be characterized by drift conditions, as we discuss below in detail.
4.2.1 Criterion for Positive Harris Recurrence
Theorem 4.2.1 (Foster-Lyapunov for Positive Recurrence) [38] Let S be a petite set, b ∈ R and V (.) be a mapping
from X to R+ . Let {xn } be an irreducible Markov chain on X. If the following is satisfied for all x ∈ X:
E[V(x_{t+1}) | x_t = x] = ∫_X P(x, dy) V(y) ≤ V(x) − 1 + b 1_{{x∈S}},   (4.5)
then the chain is positive Harris recurrent and a unique invariant probability measure exists for the Markov chain.
Proof: We will first assume that S is such that sup_{x∈S} V(x) < ∞. Define M̄_0 := V(x_0), and for t ≥ 1,
M̄_t := V(x_t) − ∑_{i=0}^{t−1} (−1 + b 1_{(x_i∈S)}).
It follows that
E[M̄_{t+1} | x_s, s ≤ t] ≤ M̄_t, ∀t ≥ 0.
Define a stopping time τ^N = min(τ, min{k > 0 : V(x_k) + k ≥ N}), where τ = min{i > 0 : x_i ∈ S}. The stopping time τ^N is bounded, and V(x_k) is also bounded until τ^N. Define then M_t = M̄_{min(t, τ^N)}. Then E[|M_t|] < ∞ for all t and it follows that {M_t} is a supermartingale. Hence, we have, by the martingale optional sampling theorem,
E[M_{τ^N}] ≤ M_0.
Hence, we obtain
E_{x_0}[∑_{i=0}^{τ^N−1} 1] ≤ V(x_0) + b E_{x_0}[∑_{i=0}^{τ^N−1} 1_{(x_i∈S)}].
Thus, E_{x_0}[τ^N] ≤ V(x_0) + b, and by the monotone convergence theorem,
lim_{N→∞} E_{x_0}[τ^N] = E_{x_0}[τ] ≤ V(x_0) + b.
Now, if we had that
sup_{x∈S} V(x) < ∞,   (4.6)
the proof would be complete in view of Theorem 3.11 (and the fact that E_x[τ] < ∞ for any x ∈ X, leading to the Harris recurrence of S), and the chain would be positive Harris recurrent.
Typically, condition (4.6) is satisfied. However, in case it is not easy to verify directly, we may need additional steps to construct a petite set. In the following, we consider this. Following [38], Chapter 11, define for some l ∈ Z+
V_S(l) = {x ∈ S : V(x) ≤ l}.
We will show that B := V_S(l) is itself a petite set, which is recurrent and satisfies the uniform finite-mean-return property. Since S is petite for some measure ν and some distribution a on N, we have that
K_a(x, B) ≥ 1_{{x∈S}} ν(B), x ∈ X,
where K_a(x, B) = ∑_i a(i) P^i(x, B), and hence
1_{{x∈S}} ≤ (1/ν(B)) K_a(x, B).
Now, for x ∈ B,
E_x[τ_B] ≤ V(x) + b E_x[∑_{k=0}^{τ_B−1} 1_{{x_k∈S}}] ≤ V(x) + b E_x[∑_{k=0}^{τ_B−1} (1/ν(B)) K_a(x_k, B)]
= V(x) + b (1/ν(B)) E_x[∑_{k=0}^{τ_B−1} ∑_i a(i) P^i(x_k, B)]
= V(x) + b (1/ν(B)) ∑_i a(i) E_x[∑_{k=0}^{τ_B−1} 1_{{x_{k+i}∈B}}]
≤ V(x) + b (1/ν(B)) ∑_i a(i)(1 + i),
where the last inequality follows since the process can hit B at most once between times 0 and τ_B − 1. Now, the petiteness measure can be adjusted such that ∑_i a(i) i < ∞ (by Proposition 5.5.6 of [38]), leading to the result that
sup_{x∈B} E_x[τ_B] ≤ sup_{x∈B} V(x) + b (1/ν(B)) ∑_i a(i)(1 + i) < ∞.
Finally, since S is petite, so is B, and it can be shown that P_x(τ_B < ∞) = 1 for all x ∈ X. This concludes the proof.
⋄
Remark 4.1. We note that if x_t is irreducible and such that, for some small set A, sup_{x∈A} E[min(t > 0 : x_t ∈ A) | x_0 = x] < ∞, then the sampled chain {x_{km}} is such that sup_{x∈A} E[min(km > 0 : x_{km} ∈ A) | x_0 = x] < ∞, and the split chain discussion in Section 3.2.1 applies (see Chapter 11 in [38]). The argument for this builds on the fact that, with σ_C = min(k ≥ 0 : x_k ∈ C) and V(x) := 1 + E_x[σ_C], it follows that E[V(x_{t+1}) | x_t = x] ≤ V(x) − 1 + b 1_{{x∈C}}, and iterating the expectation m times we obtain that
E[V(x_{t+m}) | x_t = x] ≤ V(x) − mǫ + b E_x[∑_{k=0}^{m−1} 1_{{x_k∈C}}].
By [38], it follows that E_x[∑_{k=0}^{m−1} 1_{{x_k∈C}}] ≤ m 1_{{x∈C_ǫ}} + ǫ for some petite set C_ǫ and ǫ > 0. As a result, we have a drift condition for the m-skeleton, the return time for an artificial atom constructed through the split chain is finite, and hence an invariant probability measure for the m-skeleton, and thus by (3.15) an invariant probability measure for the original chain, exists.
⋄
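As a quick illustration of how the drift condition (4.5) is verified in practice (a sketch under the stated assumptions, consistent with Example 3.13), consider the linear model x_{t+1} = a x_t + w_t with |a| < 1, where {w_t} is i.i.d. with E[|w_t|] < ∞ and a positive density (e.g., Gaussian), and take V(x) = |x|. Then
E[V(x_{t+1}) | x_t = x] ≤ |a| |x| + E[|w_t|] ≤ V(x) − 1 + b 1_{{x∈S}},
with S = {x : |x| ≤ (1 + E[|w_t|]) / (1 − |a|)} and b = 1 + E[|w_t|]: for x ∉ S the middle expression is at most |x| − 1, and for x ∈ S it is at most |x| − 1 + b. Since the noise has a positive density, the chain is Lebesgue-irreducible and the compact set S can be shown to be petite, so Theorem 4.2.1 yields positive Harris recurrence.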
There are other versions of Foster-Lyapunov criteria.
4.2.2 Criterion for Finite Expectations
Theorem 4.2.2 [Comparison Theorem] Let V : X → R+ , f, g : X → R+ such that E[f (x)] < ∞, E[g(x)] < ∞. Let
{xn } be a Markov chain on X. If the following is satisfied:
∫_X P(x, dy) V(y) ≤ V(x) − f(x) + g(x), ∀x ∈ X,
then, for any stopping time τ:
E[∑_{t=0}^{τ−1} f(x_t)] ≤ V(x_0) + E[∑_{t=0}^{τ−1} g(x_t)].
The proof of the above follows from the construction of a supermartingale sequence and the monotone convergence theorem. The above also allows the computation of useful bounds. For example, if g(x) = b 1_{{x∈A}}, then one obtains that E[∑_{t=0}^{τ−1} f(x_t)] ≤ V(x_0) + b. In view of the invariant distribution properties, if f(x) ≥ 1, this provides a bound on ∫ π(dx) f(x).
Theorem 4.2.3 [Criterion for finite expectations] [38] Let S be a petite set, b ∈ R and V (.) : X → R+ , f (.) : X → [1, ∞).
Let {xn } be a Markov chain on X. If the following is satisfied:
∫_X P(x, dy) V(y) ≤ V(x) − f(x) + b 1_{{x∈S}}, ∀x ∈ X,   (4.7)
then,
lim_{T→∞} (1/T) ∑_{t=0}^{T−1} f(x_t) = lim_{t→∞} E[f(x_t)] = ∫ µ(dx) f(x) < ∞,
almost surely, where µ is the invariant probability measure on X.
Theorem 4.2.4 Let (4.7) hold. Under every invariant probability measure π,
∫ π(dx) f(x) ≤ b.
Proof.
By Theorem 4.2.2, taking T to be a deterministic stopping time,
lim sup_T (1/T) E_{x_0}[∑_{k=0}^T f(x_k)] ≤ lim sup_T (1/T) (V(x_0) + bT) = b.
Now, suppose that π is any invariant probability measure. Fix N < ∞, let f_N = min(N, f), and apply Fatou's Lemma as follows:
π(f_N) = lim sup_{n→∞} π( (1/n) ∑_{t=0}^{n−1} P^t f_N ) ≤ π( lim sup_{n→∞} (1/n) ∑_{t=0}^{n−1} P^t f_N ) ≤ b.
Fatou's Lemma is justified to obtain the first inequality because f_N is bounded. The second inequality holds by the bound above and since f_N ≤ f. The monotone convergence theorem then gives π(f) ≤ b.
⋄
We need to ensure that there exists an invariant probability measure, however. This is why we require that f : X → [1, ∞),
where 1 can be replaced with any positive number. If the existence of an invariant probability measure is known, then one
may take the range of f to be R+ .
Remark 4.2. The result above can be strengthened to the following: ∫ |P^n(x, dz) − π(dz)| f(z) → 0. This does not imply, however, that ∫ |(π_0 P^n)(dz) − π(dz)| f(z) → 0 for a random initial condition. A sufficient condition for the latter to occur is that ∫ π_0(dz) V(z) < ∞, provided that Theorem 4.2.3 holds (see Theorem 14.3.5 in [38]).
4.2.3 Criterion for Recurrence
Theorem 4.2.5 (Foster-Lyapunov for Recurrence) Let S be a compact set, b < ∞ and V be an inf-compact functional on X, that is, for all α ∈ R+ the set {x : V(x) ≤ α} is compact. Let {x_n} be an irreducible Markov chain on X with a positive irreducibility measure on S. If the following is satisfied:
∫_X P(x, dy) V(y) ≤ V(x) + b 1_{{x∈S}}, ∀x ∈ X,   (4.8)
then, with τ_S = min(t > 0 : x_t ∈ S), P_x(τ_S < ∞) = 1 for all x ∈ X.
Proof: Let τ_S = min(t > 0 : x_t ∈ S). Define two stopping times: τ_S and τ_{B_N}, where B_N = {x : V(x) ≥ N}. Note that the sequence defined by M_t = V(x_{min(t, τ_S, τ_{B_N})}) (which behaves as a supermartingale) is uniformly integrable until min(τ_S, τ_{B_N}), and a variation of the optional sampling theorem (Theorem 4.1.11) applies. Note that, due to irreducibility, min(τ_S, τ_{B_N}) < ∞ with probability 1. Now, it follows that for x ∉ S ∪ B_N, since when exiting into B_N the minimum value of the Lyapunov function is N,
V(x) ≥ E_x[V(x_{min(τ_S, τ_{B_N})})] ≥ P_x(τ_{B_N} < τ_S) N + P_x(τ_{B_N} ≥ τ_S) M,
for some finite positive M. Hence,
P_x(τ_{B_N} < τ_S) ≤ V(x)/N.
We also have that P(min(τ_S, τ_{B_N}) = ∞) = 0, since the chain is irreducible and it will escape any compact set in finite time. As a consequence, we have that
P_x(τ_S = ∞) ≤ P_x(τ_{B_N} < τ_S) ≤ V(x)/N,
and taking the limit as N → ∞, P_x(τ_S = ∞) = 0.
⋄
If S is further petite, then once the petite set is visited, any other set with a positive measure (under the irreducibility
measure) is visited with probability 1 infinitely often.
⋄
Exercise 4.2.1 Show that the random walk on Z is recurrent.
4.2.4 On small and petite sets
Petiteness may be difficult to verify directly. In the following, we present two conditions that may be used to establish petiteness.
By [38], p. 131: For a Markov chain with transition kernel P and K a probability measure on the natural numbers, if there exists, for every E ∈ B(X), a lower semi-continuous function N(·, E) such that ∑_{n=0}^∞ P^n(x, E) K(n) ≥ N(x, E) for a sub-stochastic kernel N(·, ·), the chain is called a T-chain.
Theorem 4.2.6 [38] For a T −chain which is irreducible, every compact set is petite.
For a countable state space, under irreducibility, every finite set S in (4.5) is petite.
The assumptions on petite sets and irreducibility can be relaxed for the existence of an invariant probability measure.
Tweedie [55] considers the following. If S is such that the uniform countable additivity condition
lim_{n→∞} sup_{x∈S} P(x, B_n) = 0   (4.9)
is satisfied for B_n ↓ ∅, then (4.5) implies the existence of an invariant probability measure, and there exist at most finitely many invariant probability measures. By Proposition 5.5.5 (iii) of Meyn-Tweedie [38], under irreducibility the Harris recurrent component of the space can be expressed as a countable union ∪_{n=1}^∞ C_n of petite sets C_n, with X \ ∪_{m=1}^M C_m ↓ ∅ as M → ∞. By Lemma 4 of Tweedie (2001), under uniform countable additivity, any set ∪_{i=1}^M C_i is uniformly accessible from S.
Therefore, if the Markov chain is irreducible, the condition (4.9) implies that the set S is petite. This may be easier to verify for a large class of applications than establishing the T-chain property. Under further conditions (such as when S is compact and the V used in a drift criterion has compact level sets), one can take B_n to be subsets of a compact set, making the verification even simpler.
4.2.5 Criterion for Transience
Criteria for transience are somewhat more difficult to establish. One convenient way is to construct a stopping time sequence and show that the state does not come back to some set infinitely often. We state the following.
Theorem 4.2.7 ([38], [30]) Let V : X → R+. If there exists a set A such that E[V(x_{t+1}) | x_t = x] ≤ V(x) for all x ∉ A, and there exists x̄ ∉ A such that V(x̄) < inf_{z∈A} V(z), then {x_t} is not recurrent, in the sense that P_{x̄}(τ_A < ∞) < 1.
Proof: Let x = x̄. The proof follows from observing that
V(x) ≥ ∫ V(y) P(x, dy) ≥ (inf_{z∈A} V(z)) P(x, A) + ∫_{y∉A} V(y) P(x, dy) ≥ (inf_{z∈A} V(z)) P(x, A).
It thus follows that
P(τ_A < 2) = P(x, A) ≤ V(x) / (inf_{z∈A} V(z)).
Likewise,
V(x̄) ≥ ∫ V(y) P(x̄, dy)
≥ (inf_{z∈A} V(z)) P(x̄, A) + ∫_{y∉A} ( ∫ V(s) P(y, ds) ) P(x̄, dy)
≥ (inf_{z∈A} V(z)) P(x̄, A) + ∫_{y∉A} P(x̄, dy) ( (inf_{s∈A} V(s)) P(y, A) + ∫_{s∉A} V(s) P(y, ds) )
≥ (inf_{z∈A} V(z)) P(x̄, A) + ∫_{y∉A} P(x̄, dy) (inf_{s∈A} V(s)) P(y, A)
= (inf_{z∈A} V(z)) ( P(x̄, A) + ∫_{y∉A} P(x̄, dy) P(y, A) ).   (4.10)
Thus, observing that P({ω : τ_A(ω) < 3}) = P(x̄, A) + ∫_{y∉A} P(x̄, dy) P(y, A), we obtain
P_{x̄}(τ_A < 3) ≤ V(x̄) / (inf_{z∈A} V(z)).
The same bound follows for any n: P_{x̄}(τ_A < n) ≤ V(x̄) / (inf_{z∈A} V(z)) < 1. Continuity of probability measures (defining B_n = {ω : τ_A < n}, observing B_n ⊂ B_{n+1}, and noting that lim_n P(τ_A < n) = P(∪_n B_n) = P(τ_A < ∞) < 1) now leads to the result.
⋄
Observe the difference between the inf-compactness condition leading to recurrence and the above condition, leading to non-recurrence.
We finally note that a convenient way to verify instability or transience is to construct an appropriate martingale sequence.
4.2.6 State Dependent Drift Criteria: Deterministic and Random-Time
It is also possible that, in many applications, the controllers act on a system only intermittently. In this case, we have the following results [63]. These extend the deterministic state-dependent results presented in [38], [39]. Let τ_z, z ≥ 0, be a sequence of stopping times, measurable on a filtration, possibly generated by the state process.
Theorem 4.2.8 [63] Suppose that x is a ϕ-irreducible Markov chain. Suppose moreover that there are functions V : X →
(0, ∞), δ : X → [1, ∞), f : X → [1, ∞), a petite set C on which V is bounded, and a constant b ∈ R, such that the
following hold:
E[V(x_{τ_{z+1}}) | F_{τ_z}] ≤ V(x_{τ_z}) − δ(x_{τ_z}) + b 1_{{x_{τ_z} ∈ C}},    (4.11)

E[ Σ_{k=τ_z}^{τ_{z+1}−1} f(x_k) | F_{τ_z} ] ≤ δ(x_{τ_z}),    z ≥ 0.

Then X is positive Harris recurrent, and moreover π(f) < ∞, with π being the invariant distribution.    ⋄
By taking f (x) = 1 for all x ∈ X, we obtain the following corollary to Theorem 4.2.8.
Corollary 4.2.1 [63] Suppose that X is a ϕ-irreducible Markov chain. Suppose moreover that there is a function V : X → (0, ∞), a petite set C on which V is bounded, and a constant b ∈ R, such that the following hold:

E[V(x_{τ_{z+1}}) | F_{τ_z}] ≤ V(x_{τ_z}) − 1 + b 1_{{x_{τ_z} ∈ C}},

sup_{z≥0} E[τ_{z+1} − τ_z | F_{τ_z}] < ∞.    (4.12)

Then X is positive Harris recurrent.    ⋄
More on invariant probability measures
Without the irreducibility condition, if the chain is weak Feller and (4.5) holds with S compact, then there exists at least one invariant probability measure, as discussed in Section 3.3.
Theorem 4.2.9 [63] Suppose that X is a Feller Markov chain, not necessarily ϕ-irreducible. If (4.11) holds with C compact, then there exists at least one invariant probability measure. Moreover, there exists c < ∞ such that, under any invariant probability measure π,

E_π[f(x)] = ∫_X π(dx) f(x) ≤ c.    (4.13)
4.3 Convergence Rates to Equilibrium
In addition to obtaining bounds on the rate of convergence through Dobrushin’s coefficient, one powerful approach is
through the Foster-Lyapunov drift conditions.
Regularity and ergodicity are concepts closely related through the work of Meyn and Tweedie [40], [41] and Tuominen
and Tweedie [54].
Definition 4.3.1 A set A ∈ B(X) is called (f, r)-regular if

sup_{x∈A} E_x[ Σ_{k=0}^{τ_B−1} r(k) f(x_k) ] < ∞

for all B ∈ B^+(X). A finite measure ν on B(X) is called (f, r)-regular if

E_ν[ Σ_{k=0}^{τ_B−1} r(k) f(x_k) ] < ∞

for all B ∈ B^+(X), and a point x is called (f, r)-regular if the measure δ_x is (f, r)-regular.
This leads to a lemma relating regular distributions to regular atoms.
Lemma 4.3. If a Markov chain {x_t} has an atom α ∈ B^+(X) and an (f, r)-regular distribution λ, then α is an (f, r)-regular set.
Definition 4.3.2 (f-norm) For a function f : X → [1, ∞) the f-norm of a measure µ defined on (X, B(X)) is given by

‖µ‖_f = sup_{g: |g|≤f} | ∫ µ(dx) g(x) |.

The total variation norm is the f-norm when f ≡ 1, denoted by ‖ · ‖_TV.
Coupling inequality and moments of return times to a small set

The main idea behind the coupling inequality is to bound the total variation distance between the distributions of two random variables by the probability that they are different. Let X and Y be two jointly distributed random variables on a space X with distributions µ_x, µ_y respectively. Then we can bound the total variation distance between the distributions by the probability that the two variables are not equal:
‖µ_x − µ_y‖_TV = sup_A |µ_x(A) − µ_y(A)|
= sup_A |P(X ∈ A, X = Y) + P(X ∈ A, X ≠ Y) − P(Y ∈ A, X = Y) − P(Y ∈ A, X ≠ Y)|
= sup_A |P(X ∈ A, X ≠ Y) − P(Y ∈ A, X ≠ Y)|
≤ P(X ≠ Y).
The coupling inequality is useful in discussions of ergodicity when used in conjunction with parallel Markov chains.
Later, we will see that the coupling inequality is also useful to establish the existence of optimal solutions to average cost
optimization problems.
One creates two Markov chains having the same one-step transition probabilities. Let {x_n} and {x'_n} be two Markov chains that have probability transition kernel P(x, ·), and let C be an (m, δ, ν)-small set. We use the coupling construction provided by Roberts and Rosenthal [48].
Let x_0 = x and x'_0 ∼ π(·), where π(·) is the invariant distribution of both Markov chains.
1. If x_n = x'_n, then x_{n+1} = x'_{n+1} ∼ P(x_n, ·).
2. Else, if (x_n, x'_n) ∈ C × C, then
   with probability δ: x_{n+m} = x'_{n+m} ∼ ν(·);
   with probability 1 − δ, independently,
   x_{n+m} ∼ (1/(1−δ)) (P^m(x_n, ·) − δν(·)) and x'_{n+m} ∼ (1/(1−δ)) (P^m(x'_n, ·) − δν(·)).
3. Else, independently x_{n+m} ∼ P^m(x_n, ·) and x'_{n+m} ∼ P^m(x'_n, ·).
The in-between states x_{n+1}, . . . , x_{n+m−1}, x'_{n+1}, . . . , x'_{n+m−1} are distributed conditionally given x_n, x_{n+m}, x'_n, x'_{n+m}.
By the coupling inequality and the previous discussion with Nummelin's splitting technique, we have ‖P^n(x, ·) − π(·)‖_TV ≤ P(x_n ≠ x'_n).
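The construction is easy to simulate. The Python sketch below (a hypothetical two-state example; the kernel, the minorization and all numbers are assumed for illustration) runs the Roberts-Rosenthal coupling with m = 1 and C equal to the whole space, and compares the coupling-inequality bound P(x_n ≠ x'_n) with the exact total variation distance.

import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                      # assumed two-state kernel
nu_raw = P.min(axis=0)                          # minorization: P(x, .) >= delta * nu(.)
delta = nu_raw.sum()
nu = nu_raw / delta
resid = (P - delta * nu) / (1 - delta)          # residual kernels, one per state
pi = np.array([2/3, 1/3])                       # invariant distribution of this P

def coupled_step(x, xp):
    if x == xp:                                  # already coupled: move together
        y = rng.choice(2, p=P[x]); return y, y
    if rng.random() < delta:                     # couple through the common component nu
        y = rng.choice(2, p=nu); return y, y
    return rng.choice(2, p=resid[x]), rng.choice(2, p=resid[xp])   # move independently

n, trials, uncoupled = 8, 20000, 0
for _ in range(trials):
    x, xp = 0, rng.choice(2, p=pi)               # x_0 = 0, x'_0 ~ pi
    for _ in range(n):
        x, xp = coupled_step(x, xp)
    uncoupled += (x != xp)

Pn = np.linalg.matrix_power(P, n)
tv = 0.5 * np.abs(Pn[0] - pi).sum()
print(tv, uncoupled / trials)                    # TV distance <= estimated P(x_n != x'_n)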
4.3.1 Lyapunov conditions: Geometric ergodicity
Roberts and Rosenthal [48] use this approach to establish geometric ergodicity.
An irreducible Markov chain satisfies the univariate drift condition if there are constants λ ∈ (0, 1) and b < ∞, along with a function V : X → [1, ∞) and a small set C, such that

PV(x) ≤ λV(x) + b 1_C(x).    (4.14)
Theorem 4.3.1 (Theorem 16.1.2 in [38]) Suppose that x_n is ψ-irreducible and aperiodic and (4.14) holds. Then, for some r > 1,

r^n sup_{x∈X} (1/V(x)) ‖P^n(x, ·) − π(·)‖_TV → 0.
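For a concrete case, the Python sketch below (an assumed example, not taken from [38]) verifies the univariate drift condition (4.14) for the reflected random walk x_{t+1} = max(x_t + w_t, 0) with P(w_t = +1) = 0.3 and P(w_t = −1) = 0.7, using V(x) = z^x with an assumed z > 1 and C = {0}.

# Sketch (assumed example): verify (4.14) for a reflected random walk with V(x) = z**x, C = {0}.
p_up, p_dn, z = 0.3, 0.7, 1.5
V = lambda x: z ** x

def PV(x):
    """(PV)(x) = E[V(x_{t+1}) | x_t = x]."""
    return p_up * V(x + 1) + p_dn * V(max(x - 1, 0))

lam = p_up * z + p_dn / z                 # PV(x) = lam * V(x) for every x >= 1
b = PV(0) - lam * V(0)                    # extra mass needed on the small set C = {0}
print(lam < 1, b)                         # lam ~ 0.917 < 1, b ~ 0.233
print(all(PV(x) <= lam * V(x) + (b if x == 0 else 0) + 1e-9 * V(x) for x in range(0, 60)))
# Theorem 4.3.1 then gives geometric convergence of P^n(x, .) to pi in total variation.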
One can now define a bivariate drift condition for two independent copies of a Markov chain with a small set C. This
condition requires that there exists a function h : X × X → [1, ∞) and α > 1 such that
P̄h(x, y) ≤ h(x, y)/α,    (x, y) ∉ C × C,
P̄h(x, y) < ∞,    (x, y) ∈ C × C,

where

P̄h(x, y) = ∫_X ∫_X h(z, w) P(x, dz) P(y, dw).
Now we explore a connection between the univariate and bivariate drift conditions.
Proposition 4.4 (Proposition 11 of [48]) Suppose the univariate drift condition (4.14) is satisfied for V : X → [1, ∞), constants λ ∈ (0, 1), b < ∞, and small set C. Letting d = inf_{x∈C^c} V(x), if d > b/(1 − λ) − 1, then the bivariate drift condition is satisfied for h(x, y) = (1/2)(V(x) + V(y)) and α^{−1} = λ + b/(d + 1) < 1.
The bivariate condition ensures that the pair process (x, x') hits the set C × C, from which the two chains may be coupled. The coupling inequality then leads to the desired conclusion.

Theorem 4.5 (Theorem 9 of [48]) Suppose {x_t} is an aperiodic, irreducible Markov chain with invariant distribution π(·). Suppose C is a (1, ǫ, ν)-small set and V : X → [1, ∞) satisfies the univariate drift condition (4.14) with constants λ ∈ (0, 1) and b < ∞, with V(x) < ∞ for some x ∈ X. Then {x_t} is geometrically ergodic.
As a side remark, we note the following observation. If the univariate drift condition holds, then the sequence of random variables

M_n = λ^{−n} V(x_n) − Σ_{k=0}^{n−1} b 1_C(x_k)

is a supermartingale, and thus we have the inequality

E_{x_0}[λ^{−n} V(x_n)] ≤ V(x_0) + E_{x_0}[ Σ_{k=0}^{n−1} b 1_C(x_k) ]    (4.15)

for all n and x_0 ∈ X, and with Theorem 3.10 we have

E_x[λ^{−τ_B}] ≤ V(x) + c(B)

for all B ∈ B^+(X).
We finally remark that a bound which explicitly controls the moments of the coupling times is useful in itself. In particular, it follows from [48] that

P(X_k ≠ X'_k) ≤ (1 − ǫ)^j + α^{−k} (B n_0)^{j−1} E[h(X_0, X'_0)],    (4.16)

where C is an n_0-small set, h is as before, and X_0, X'_0 are the initial random variables used in the coupling construction.
4.3.2 Subgeometric ergodicity
Here, we review the class of subgeometric rate functions (see section 4 of [30], section 5 of [23], [38], [26], [54]).
Let Λ_0 be the family of functions r : N → R_{>0} such that r is non-decreasing, r(1) ≥ 2, and

log r(n) / n ↓ 0 as n → ∞.

The second condition implies that for all r ∈ Λ_0, if n > m > 0, then

n log r(n + m) ≤ n log r(n) + m log r(n) ≤ n log r(n) + n log r(m),

so that

r(m + n) ≤ r(m) r(n)    for all m, n ∈ N.    (4.17)
The class of subgeometric rate functions Λ defined in [54] is the class of sequences r for which there exists a sequence r_0 ∈ Λ_0 such that

0 < lim inf_{n→∞} r(n)/r_0(n) ≤ lim sup_{n→∞} r(n)/r_0(n) < ∞.
The main theorem we cite on subgeometric rates of convergence is due to Tuominen and Tweedie [54].
Theorem 4.6 (Theorem 2.1 of [54]). Suppose that {xt }t∈N is an irreducible and aperiodic Markov chain on state space X
with stationary transition probabilities given by P . Let f : X → [1, ∞) and r ∈ Λ be given. The following are equivalent:
(i) there exists a petite set C ∈ B(X) such that

sup_{x∈C} E_x[ Σ_{k=0}^{τ_C−1} r(k) f(x_k) ] < ∞;

(ii) there exists a sequence (V_n) of functions V_n : X → [0, ∞], a petite set C ∈ B(X) and b ∈ R_+ such that V_0 is bounded on C,

V_0(x) = ∞ ⇒ V_1(x) = ∞,

and

P V_{n+1} ≤ V_n − r(n) f + b r(n) 1_C,    n ∈ N;

(iii) there exists an (f, r)-regular set A ∈ B^+(X);
(iv) there exists a full absorbing set S which can be covered by a countable number of (f, r)-regular sets.
Theorem 4.7 [54] If a Markov chain {x_t} satisfies the conditions of Theorem 4.6 for (f, r), then r(n) ‖P^n(x_0, ·) − π(·)‖_f → 0 as n increases.

Thus, we observe that the hitting times to a small set are very important quantities in characterizing not only the existence of an invariant probability measure, but also how fast a Markov chain converges to equilibrium. Further results exist in the literature to obtain more computable criteria for subgeometric rates of convergence; see e.g. [26].
4.4 Conclusion
This concludes our discussion of controlled Markov chains via martingale methods. We will revisit one more application of martingales while discussing the convex analytic approach to controlled Markov problems.
4.5 Exercises
Exercise 4.5.1 Let G = {Ω, ∅}. Then, show that E[X|G] = E[X] and if G = F , then E[X|F ] = X.
Exercise 4.5.2 Let X be a random variable such that P (|X| < K) = 1 for some K ∈ R, that is X takes values from a
compact set. Let Yn = X + Wn for n ∈ Z+ and Wn be an independent noise for every n.
Let Fn be the σ-field generated by Y0 , Y1 , . . . , Yn .
Does limn→∞ E[X|Fn ] exist almost surely? Prove your statement.
Can you replace the boundedness assumption on X with E[|X|] < ∞?
Exercise 4.5.3 Consider the following two-server system:
x^1_{t+1} = max(x^1_t + A^1_t − u^1_t, 0)
x^2_{t+1} = max(x^2_t + A^2_t + u^1_t 1_{(u^1_t ≤ x^1_t + A^1_t)} − u^2_t, 0),    (4.18)

where 1_{(·)} denotes the indicator function and A^1_t, A^2_t are independent and identically distributed (i.i.d.) random variables with geometric distributions, that is, for i = 1, 2,

P(A^i_t = k) = p_i (1 − p_i)^k,    k ∈ {0, 1, 2, . . .},

for some scalars p_1, p_2 such that E[A^1_t] = 1.5 and E[A^2_t] = 1.
a) Suppose the control actions u^1_t, u^2_t are such that u^1_t + u^2_t ≤ 5 for all t ∈ N_+ and u^1_t, u^2_t ∈ R_+. At any given time t, the controller has to decide on u^1_t and u^2_t knowing {x^1_s, x^2_s, s ≤ t} but not knowing A^1_t, A^2_t.
Is this server system stochastically stabilizable by some policy, that is, does there exist an invariant distribution under some
control policy?
If your answer is positive, provide a control policy and show that there exists a unique invariant distribution.
If your answer is negative, precisely state why.
b) Repeat part a) where we now restrict u^1_t, u^2_t ∈ Z_+, that is, the actions are to be integer valued.
Exercise 4.5.4 a) Consider a Controlled Markov Chain with the following dynamics:
xt+1 = axt + but + wt ,
where wt is a zero-mean Gaussian noise with a finite variance, a, b ∈ R are the system dynamics coefficients. One controller
policy which is admissible (that is, the policy at time t is measurable with respect to σ(x0 , x1 , . . . , xt ) and is a mapping to
R) is the following:
u_t = − ((a + 0.5)/b) x_t.
Show that {xt }, under this policy, has a unique invariant distribution.
b) Consider a similar setup to the one earlier, with b = 1:
xt+1 = axt + ut + wt ,
where wt is a zero-mean Gaussian noise with a finite variance, and a ∈ R is a known number.
This time, suppose, we would like to find a control policy such that
lim_{t→∞} E[x^2_t] < ∞.
Further, suppose we restrict the set of control policies to be linear, time-invariant; that is of the form u(xt ) = kxt for some
k ∈ R.
Find the set of all k values for which the state has a finite second moment, that is find
{k : k ∈ R, lim_{t→∞} E[x^2_t] < ∞},
as a function of a.
Hint: Use Foster-Lyapunov criteria.
Exercise 4.5.5 Prove Birkhoff’s Ergodic Theorem for a countable state space; that is the result that for a Markov chain
{xt } living in a countable space X, which has a unique invariant distribution µ, the following applies almost surely:
lim_{T→∞} (1/T) Σ_{t=1}^{T} f(x_t) = Σ_i f(i) µ(i),
for every f : X → R.
Hint: You may proceed as follows. Define a sequence of empirical occupation measures for t ∈ Z+ , A ∈ B(X):
v_T(A) = (1/T) Σ_{t=0}^{T−1} 1_{{x_t ∈ A}},    ∀A ∈ B(X).
Now, define:

F_t(A) = Σ_{s=1}^{t} 1_{{x_s ∈ A}} − t Σ_{x∈X} P(A|x) v_t({x})
       = Σ_{s=1}^{t} 1_{{x_s ∈ A}} − Σ_{s=0}^{t−1} Σ_{x∈X} P(A|x) 1_{{x_s = x}}.    (4.19)

Let F_t = σ(x_0, · · · , x_t). Verify that, for t ≥ 2,

E[F_t(A)|F_{t−1}]
= E[ Σ_{s=1}^{t} 1_{{x_s ∈ A}} − Σ_{s=0}^{t−1} Σ_{x∈X} P(A|x) 1_{{x_s = x}} | F_{t−1} ]
= E[ 1_{{x_t ∈ A}} − Σ_{x∈X} P(A|x) 1_{{x_{t−1} = x}} | F_{t−1} ] + E[ Σ_{s=1}^{t−1} 1_{{x_s ∈ A}} − Σ_{s=0}^{t−2} Σ_{x∈X} P(A|x) 1_{{x_s = x}} | F_{t−1} ]
= 0 + Σ_{s=1}^{t−1} 1_{{x_s ∈ A}} − Σ_{s=0}^{t−2} Σ_{x∈X} P(A|x) 1_{{x_s = x}}
= F_{t−1}(A),    (4.20)

where the last equality follows from the fact that E[1_{{x_t ∈ A}}|F_{t−1}] = P(x_t ∈ A|F_{t−1}) = Σ_{x∈X} P(A|x) 1_{{x_{t−1} = x}}. Furthermore,

|F_t(A) − F_{t−1}(A)| ≤ 1.    (4.21)
We thus have a martingale sequence with bounded increments. We will invoke a martingale convergence theorem which is applicable for martingales with bounded increments. By a version of the martingale stability theorem, it follows that

lim_{t→∞} (1/t) F_t(A) = 0.

You now need to complete the remaining steps.
Hint: You can use the Azuma-Hoeffding inequality [25] and the Borel-Cantelli Lemma to complete the steps.
Exercise 4.5.6 Consider a queueing process with i.i.d. Poisson arrivals and departures, with arrival mean λ and service mean µ, and suppose the process is such that when a customer leaves the queue, with probability p (independent of time) it comes back to the queue. That is, the dynamics of the system satisfy:

L_{t+1} = max(L_t + A_t − N_t + p_t N_t, 0),    t ∈ N,

where E[A_t] = λ, E[N_t] = µ and E[p_t] = p.
For what values of µ, λ is such a system stochastically stable? Prove your statement.
Exercise 4.5.7 Consider a network with two server stations, where a router routes the incoming traffic, as depicted in Figure 4.1.

Fig. 4.1: A router with action u directing incoming traffic to Station 1 or Station 2.
Let L^1_t, L^2_t denote the number of customers in stations 1 and 2 at time t. Let the dynamics be given by the following:

L^1_{t+1} = max(L^1_t + u_t A_t − N^1_t, 0),    t ∈ N,
L^2_{t+1} = max(L^2_t + (1 − u_t) A_t − N^2_t, 0),    t ∈ N.

Customers arrive according to an independent Bernoulli process A_t with mean λ; that is, P(A_t = 1) = λ and P(A_t = 0) = 1 − λ. Here u_t ∈ [0, 1] is the router action. Station 1 has a Bernoulli service process N^1_t with mean n_1, and Station 2 has a Bernoulli service process N^2_t with mean n_2.
Suppose that a router decides to follow the following algorithm to decide on ut : If a customer arrives, the router simply
sends the incoming customer to the shortest queue.
Find necessary and sufficient conditions (on λ, n_1, n_2) for this algorithm to lead to a stochastically stable system which satisfies lim_{t→∞} E[L^1_t + L^2_t] < ∞.
5
Dynamic Programming
In this chapter, we introduce the method of dynamic programming for controlled stochastic systems. Let us first recall a
few notions discussed earlier.
Recall that a Markov control model is a five-tuple

(X, U, {U_t(x), x ∈ X}, Q, c_t)

such that X is a Polish metric space, U is the Polish control action space, and U_t(x) ⊂ U is the set of admissible control actions when the state is x at time t, so that

K_t = {(x, u) : x ∈ X, u ∈ U_t(x)}

is the subset of X × U of feasible state-control pairs; Q is a stochastic kernel on X given K_t; and c_t : K_t → R is the cost function.
In case Ut (x) and ct do not depend on t, we drop the time index. Then, the Markov model is called a stationary Markov
model.
If Π = {Π_t} is a randomized Markov policy, we define the cost function as

c_t(x_t, Π) = ∫_{U(x_t)} c(x_t, u) Π_t(du|x_t)

and the transition function

Q_t(dz|x_t, Π) = ∫_{U(x_t)} Q(dz|x_t, u) Π_t(du|x_t).
Let, as in Chapter 2, ΠA denote the set of all admissible policies.
Let Π = {Π_t, 0 ≤ t ≤ N − 1} ∈ Π_A be a policy. Consider the following expected cost:

J(Π, x) = E_x^Π[ Σ_{t=0}^{N−1} c(x_t, u_t) + c_N(x_N) ],

where c_N(·) is the terminal cost function. Define

J*(x) := inf_{Π∈Π_A} J(Π, x).
As earlier, let ht = {x[0,t] , u[0,t−1] } denote the history or the information process.
The goal is to find, if there exists one, an admissible policy such that J ∗ (x) is attained. We note that the infimum value is
not always attained.
Before we proceed further, we note that we could express the cost as:

J(Π, x) = E_x^Π[ c(x_0, u_0) + E^Π[ c(x_1, u_1) + E^Π[ c(x_2, u_2) + · · · + E^Π[ c(x_{N−1}, u_{N−1}) + c_N(x_N) | h_{N−1} ] | h_{N−2} ] · · · | h_1 ] | h_0 ]
= E_{x_0}^{Π_0}[ c(x_0, u_0) + E_{x_1}^{Π_1}[ c(x_1, u_1) + E_{x_2}^{Π_2}[ c(x_2, u_2) + · · · + E_{x_{N−1}}^{Π_{N−1}}[ c(x_{N−1}, u_{N−1}) + c_N(x_N) | h_{N−1} ] | h_{N−2} ] · · · | h_1 ] | h_0 ].    (5.1)

The above follows from Theorem 4.1.3. Thus, one can inductively obtain:

inf_Π J(Π, x) ≥ inf_{Π_0} E_{x_0}^{Π_0}[ c(x_0, u_0) + inf_{Π_1} E_{x_1}^{Π_1}[ c(x_1, u_1) + inf_{Π_2} E_{x_2}^{Π_2}[ c(x_2, u_2) + · · · + inf_{Π_{N−1}} E_{x_{N−1}}^{Π_{N−1}}[ c(x_{N−1}, u_{N−1}) + c_N(x_N) | h_{N−1} ] | h_{N−2} ] · · · | h_1 ] | h_0 ].    (5.2)

As a result, it can be shown that

inf_Π J(Π, x) ≥ inf_{Π_0} E_{x_0}^{Π_0}[ c(x_0, u_0) + inf_{Π_1} E_{x_1}^{Π_1}[ c(x_1, u_1) + · · · + inf_{Π_{N−1}} E_{x_{N−1}}^{Π_{N−1}}[ c(x_{N−1}, u_{N−1}) + c_N(x_N) | h_{N−1} ] · · · | h_1 ] | h_0 ].    (5.3)
Also, observe that for all t and all measurable functions g_t,

E[g_t(x_{t+1})|h_t, u_t] = E[g_t(x_{t+1})|x_t, u_t].

This follows from the controlled Markov property. This last step is crucial in identifying a dependence only on the most recent state for an optimal control policy, as we see in the next section.
5.1 Bellman’s Principle of Optimality
Let {J_t(x)} be a sequence of functions on X defined by

J_N(x) = c_N(x)

and, for 0 ≤ t ≤ N − 1,

J_t(x) = min_{u∈U_t(x)} { c_t(x, u) + ∫_X J_{t+1}(z) Q_t(dz|x, u) }.

Suppose these functions admit a minimum and are measurable. We will discuss a number of sufficiency conditions for when this is possible. Let there be minimizing selectors which are deterministic, denoted by {f_t(x)}, so that the minimum is attained:

J_t(x) = c_t(x, f_t(x)) + ∫_X J_{t+1}(z) Q_t(dz|x, f_t(x)).

Then we have the following:

Theorem 5.1.1 The policy Π* = {f_0, f_1, . . . , f_{N−1}} is optimal and the optimal cost is given by

J*(x) = J_0(x).
We compare the cost attained by the above policy with the cost obtained by any other policy.
Proof: We provide the proof by a backwards induction method in view of (5.3). Consider the time stage t = N − 1. For this stage, the optimal cost-to-go is

J_{N−1}(x_{N−1}) = min_{u_{N−1}} { c_{N−1}(x_{N−1}, u_{N−1}) + ∫_X J_N(z) Q_{N−1}(dz|x_{N−1}, u_{N−1}) }.

Suppose there is a cost C*_{N−1}(x_{N−1}) achieved by some policy η = {η_k(x)}, which we take to be deterministic, but this is without any loss. Since

C*_{N−1}(x_{N−1}) = c_{N−1}(x_{N−1}, η_{N−1}(x_{N−1})) + ∫_X J_N(z) Q_{N−1}(dz|x_{N−1}, η_{N−1}(x_{N−1}))
≥ min_{u_{N−1}} { c_{N−1}(x_{N−1}, u_{N−1}) + ∫_X J_N(z) Q_{N−1}(dz|x_{N−1}, u_{N−1}) } = J_{N−1}(x_{N−1}),    (5.4)

it must be that C*_{N−1}(x_{N−1}) ≥ J_{N−1}(x_{N−1}). Now, we move to time stage N − 2. In this case, the cost to be minimized is given by

C*_{N−2}(x_{N−2}) = c_{N−2}(x_{N−2}, η(x_{N−2})) + ∫_X C*_{N−1}(z) Q_{N−2}(dz|x_{N−2}, η(x_{N−2}))
≥ min_{u_{N−2}} { c_{N−2}(x_{N−2}, u_{N−2}) + ∫_X J_{N−1}(z) Q_{N−2}(dz|x_{N−2}, u_{N−2}) } =: J_{N−2}(x_{N−2}),    (5.5)

where the inequality is due to the fact that J_{N−1}(x_{N−1}) ≤ C*_{N−1}(x_{N−1}) and the minimization. We can, by induction, show that the recursion holds for all 0 ≤ t ≤ N − 2.    ⋄
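To make the backward induction concrete, the following Python sketch (a hypothetical finite model; all numerical values are assumed for illustration) computes J_t and the minimizing selectors f_t for a small finite state and action space, exactly as in the recursion above.

import numpy as np

# Assumed finite model: 3 states, 2 actions, horizon N = 5.
nX, nU, N = 3, 2, 5
rng = np.random.default_rng(1)
Q = rng.random((nU, nX, nX)); Q /= Q.sum(axis=2, keepdims=True)   # Q[u, x, :] = Q(.|x, u)
c = rng.random((nX, nU))                                           # per-stage cost c(x, u)
cN = rng.random(nX)                                                # terminal cost c_N(x)

J = np.zeros((N + 1, nX)); J[N] = cN
f = np.zeros((N, nX), dtype=int)                                   # minimizing selectors f_t
for t in range(N - 1, -1, -1):
    # Bellman recursion: J_t(x) = min_u { c(x,u) + sum_z J_{t+1}(z) Q(z|x,u) }
    values = c + np.einsum('uxz,z->xu', Q, J[t + 1])
    J[t] = values.min(axis=1)
    f[t] = values.argmin(axis=1)

print(J[0])        # optimal cost-to-go J_0(x) = J*(x) for each initial state x
print(f[0])        # optimal first-stage actions f_0(x)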
5.2 Optimality of Deterministic Markov Policies
We observe that when there is an optimal solution, the optimal solution is Markov. Let us provide a more insightful way to
derive this property.
In the following, we will follow David Blackwell’s [15] and Hans Witsenhausen [57]’s ideas:
Theorem 5.2.1 (Blackwell’s Irrelevant Information Theorem) Let X, Y, U be complete, separable, metric spaces, and
let P be a probability measure on B(X × Y), and let c : X × U → R be a Borel measurable cost function. Then, for any
Borel measurable function γ : X × Y → U, there exists another Borel measurable function γ ∗ : X → U such that
∫_X c(x, γ*(x)) P(dx) ≤ ∫_{X×Y} c(x, γ(x, y)) P(dx, dy).

Furthermore, policies based only on x are optimal (almost surely).
Proof: We will construct a γ* given γ. Let us write

∫_{X×Y} c(x, γ(x, y)) P(dx, dy) = ∫_X ( ∫_U c(x, u) P^γ(du|x) ) P(dx),

where P^γ(u ∈ D|x) = ∫_Y 1_{{γ(x,y)∈D}} P(dy|x). Consider h_γ(x) := ∫_U c(x, u) P^γ(du|x). Let

D = {(x, u) ∈ X × U : c(x, u) ≤ h_γ(x)};

D is a Borel set since c(x, u) − h_γ(x) is a Borel measurable function.
a) Suppose the space U is countable. In this case, let us enumerate the elements of U as {u_k, k = 1, 2, . . .} and define

D_i = {x ∈ X : (x, u_i) ∈ D},    i = 1, 2, . . . ,

and

γ*(x) = u_k    if x ∈ D_k \ (∪_{i=1}^{k−1} D_i),    k = 1, 2, . . . .

Such a function is measurable by construction.
b) Suppose now that the space U is separable, and assume that c(x, u) is continuous in u. Suppose that the space U is partitioned into a countable collection of disjoint sets K_n = {K_n^1, K_n^2, . . . , K_n^m, . . .}, with elements u_n^k ∈ K_n^k chosen in the sets. In this case, we could define, by an enumeration of the control actions in {K_n}:

D_k = {x ∈ X : (x, u_n^k) ∈ D},

and define

γ_n(x) = u_n^k    if x ∈ D_k \ (∪_{i=1}^{k−1} D_i).

We can make K_n finer by replacing the above with K_{n+1} such that each K_{n+1}^i ⊂ K_n^m for some m (for example, if U = R, let K_n = {[ (1/2^n) ⌊u 2^n⌋, (1/2^n) ⌈u 2^n⌉ ), u ∈ R}).

Each of the γ_n functions is measurable. Furthermore, by a careful enumeration, since the image partitions become finer, lim_{n→∞} γ_n(x) exists pointwise. Being a pointwise limit of measurable functions, the limit is also measurable; define this limit as γ*. By continuity, it then follows that {(x, γ*(x))} ∈ D, since lim_{n→∞} c(x, γ_n(x)) = c(x, γ*(x)) ≤ h_γ(x) (note that, due to continuity, for every x and ǫ there exists an n such that c(x, γ_n(x)) ≤ c(x, u) + ǫ ≤ h_γ(x) + ǫ for some (x, u) ∈ D). This completes the proof for the case when c is continuous in u.
c) For the general case, when c is not necessarily continuous, we define Dx = {u : (x, u) ∈ D}. It follows that
P γ (Dx |x) > 0, for otherwise we would arrive at a contradiction. It then follows from Theorem 2 of [16] that we can
construct a measurable function γ ∗ whose graph {(x, γ ∗ (x))} ∈ D. We do not provide the detailed proof for this case. ⋄
We also have the following result.

Theorem 5.2.2 Let {(x_t, u_t)} be a controlled Markov chain. Consider the minimization of E[ Σ_{t=0}^{T−1} c(x_t, u_t) ] over all control policies which are sequences of causally measurable functions of {x_s, s ≤ t}, for all t ≥ 0. Suppose further that the cost function is measurable and the transition kernel is such that ∫ Φ(dx_{t+1}|x_t, u_t) f(x_{t+1}) is measurable on U for every measurable and bounded function f on X. If an optimal control policy exists, there is no loss in restricting policies to be Markov, that is, policies which only use the current state x_t and the time information t.
Proof: Let us define a history process {h_t}, where h_t = {x_0, . . . , x_t} for t ≥ 0. Let there be an optimal policy given by {η_t, t ≥ 0} such that u_t = η_t(h_t). It follows that

P(x_{t+1} ∈ B|x_t) = ∫_{X^t} Φ(x_{t+1} ∈ B|u_t = η_t(h_t), x_t) P(dh_t|x_t)
= ∫_{X^t} ∫_U Φ(x_{t+1} ∈ B|u, x_t) 1_{{η_t(h_t) ∈ du}} P(dh_t|x_t)
= ∫_U Φ(x_{t+1} ∈ B|u, x_t) P̃(du|x_t),    (5.6)

where

P̃(du|x_t) = ∫_{X^t} 1_{{η_t(h_t) ∈ du}} P(dh_t|x_t).

Thus, we may replace any history dependent policy with a possibly randomized Markov policy.
By hypothesis, there exists a solution almost surely. If the solution is randomized, then there exists some distribution on U such that P(u_t ∈ B|x_t) = κ_t(B), and the optimal cost can be expressed as a dynamic programming recursion for some sequence of functions {J_t, t ≥ 0} on X with:

J_t(x_t) = ∫_U κ_t(du) ( c(x_t, u) + ∫_X Φ(dx_{t+1}|x_t, u) J_{t+1}(x_{t+1}) ).

Define

M(x_t, u_t) = c(x_t, u_t) + ∫_X Φ(dx_{t+1}|x_t, u_t) J_{t+1}(x_{t+1}).

If inf_{u_t∈U} M(x_t, u_t) admits a solution, the distribution κ_t can be taken to be a Dirac-delta (δ) distribution at the minimizing value. By hypothesis, a solution exists P-a.s. Thus, an optimal policy depends on the state value and the time variable (the dependence on time is due to the time-dependent nature of J_t).    ⋄
5.3 Existence of Minimizing Selectors and Measurability
The above dynamic programming arguments hold when there exist minimizing control policies (selectors measurable with
respect to the Borel field on X).
Measurable Selection Hypothesis: The functions J_t : X → R given by

J_t(x) = min_{u∈U_t(x)} ( c(x, u) + ∫_X J_{t+1}(y) Q(dy|x, u) ),

for all x ∈ X and t ∈ {0, 1, 2, . . . , N − 1}, with

J_N(x) = c_N(x),

are well defined (that is, the minimum exists). Furthermore, there exist measurable functions f_t such that

J_t(x) = c(x, f_t(x)) + ∫_X J_{t+1}(y) Q(dy|x, f_t(x)).    ⋄
Recall that a set in a normed linear space is (sequentially) compact if every sequence in the set has a converging subsequence.

Condition 1: The cost function to be minimized c(x, u) is continuous on both U and X; U_t(x) = U is compact; and ∫_X Q(dy|x, u) v(y) is a (weakly) continuous function on X × U for every continuous and bounded v on X.    ⋄

Condition 2: For every x ∈ X, the cost function to be minimized c(x, u) is continuous on U; U_t(x) is compact; and ∫_X Q(dy|x, u) v(y) is a (strongly) continuous function on U for every bounded, measurable function v on X, for every fixed x.    ⋄
Theorem 5.3.1 Under Condition 1 or Condition 2, there exists an optimal solution, the Measurable Selection Hypothesis applies, and there exists a minimizing control policy f_t : X → U_t(x). Furthermore, under Condition 1, the function J_t(x) is continuous if c_N(x_N) is continuous.

The result follows from the two lemmas below:
Lemma 5.3.1 A continuous function f : X → R over a compact set A ⊂ X admits a minimum.
Proof: Let δ = inf x∈A f (x). Let {xi } be a sequence such that f (xi ) converges to δ. Since A is compact {xi } must have a
converging subsequence {xi(n) }. Let the limit of this subsequence be x0 . Then, it follows that, {xi(n) } → x0 and thus, by
continuity {f (xi(n) )} → f (x0 ). As such f (x0 ) = δ.
⋄
To see why compactness is important, consider

inf_{x∈A} 1/x

for A = [1, 2). How about for A = R? In both cases there does not exist an x value in the specified set which attains the infimum.
Lemma 5.3.2 Let U be compact and c(x, u) be continuous on X × U. Then, min_u c(x, u) is continuous on X.

Proof: Let x_n → x, let u_n be optimal for x_n and ū be optimal for x. Such optimal action values exist as a result of the compactness of U and continuity. Now,

| min_u c(x_n, u) − min_u c(x, u) | ≤ max( c(x_n, ū) − c(x, ū), c(x, u_n) − c(x_n, u_n) ).    (5.7)

The first term above converges to zero since c is continuous in (x, u). The second also converges to zero. Suppose otherwise. Then, for some ǫ > 0, there exists a subsequence such that

c(x, u_{k_n}) − c(x_{k_n}, u_{k_n}) ≥ ǫ.

Consider the sequence (x_{k_n}, u_{k_n}). There exists a further subsequence (x_{k_n'}, u_{k_n'}) which converges to (x, u') for some u', since U is compact. Hence, along this subsequence, both c(x_{k_n'}, u_{k_n'}) and c(x, u_{k_n'}) converge to c(x, u'), leading to a contradiction.    ⋄
Compactness is a crucial component of this result. The textbook by Hernandez-Lerma and Lasserre provides more general conditions. We did not address here the issue of the existence of measurable functions; we refer the reader again to Schäl [51]. We do, however, state the following result:
Lemma 5.3.3 Let c(x, u) be a continuous function on U for every x, and let U be a compact set. Then, there exists a Borel measurable function f : X → U such that

c(x, f(x)) = min_{u∈U} c(x, u).

Proof: We first observe that the function

c̃(x) := min_{u∈U} c(x, u)

is Borel measurable. This follows from the observation that it is sufficient to prove that {x : c̃(x) ≤ α} is Borel for every α ∈ R; by the continuity of c and the compactness of U, the result follows.

Define D = {(x, u) : c(x, u) ≤ c̃(x)}. This set is a Borel set. We can now construct a measurable function whose graph lives in this set, using the property that the control action space is a separable, complete metric space, as discussed in the proof of Theorem 5.2.1. Note that, due to continuity, lim_{n→∞} c(x, γ_n(x)) = c(x, γ*(x)) ≤ c̃(x).
⋄
We can replace the compactness condition with an inf-compactness condition, and modify Condition 1 as below:

Condition 3: For every x ∈ X, the cost function to be minimized c(x, u) is continuous on X × U and non-negative; {u : c(x, u) ≤ α} is compact for all α > 0 and all x ∈ X; and ∫_X Q(dy|x, u) v(y) is a continuous function on X × U for every continuous and bounded v.    ⋄
Theorem 5.3.2 Under Condition 3, the Measurable Selection Hypothesis applies.
Remark: We could relax the continuity condition and replace it with lower semi-continuity. A function f is lower semi-continuous at x_0 if lim inf_{x→x_0} f(x) ≥ f(x_0).    ⋄

Essentially, what is needed is the existence of a minimizing control function, that is, a selector which is optimal for every x_t ∈ X.
In class we solved the following problem with q > 0, r > 0, p_T > 0:

inf_Π E_x^Π[ Σ_{t=0}^{T−1} (q x_t^2 + r u_t^2) + p_T x_T^2 ]

for a linear system

x_{t+1} = a x_t + u_t + w_t,

where w_t is a zero-mean, i.i.d. Gaussian random variable with variance σ_w^2. We showed, by the method of completing the squares, that

J_t(x_t) = P_t x_t^2 + Σ_{k=t}^{T−1} P_{k+1} σ_w^2,

where

P_t = q + P_{t+1} a^2 − (P_{t+1} a)^2 / (P_{t+1} + r),    P_T = p_T,

and the optimal control policy is

u_t = − ( P_{t+1} a / (P_{t+1} + r) ) x_t.

Note that the optimal control policy is Markov (as it uses only the current state).
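A short Python sketch of this recursion follows; the numerical values of a, q, r, p_T, σ_w^2 and T below are assumed purely for illustration. It computes P_t backwards, the gains −P_{t+1} a/(P_{t+1} + r), and the resulting optimal cost J_0(x_0).

# Sketch with assumed parameters: scalar LQ problem x_{t+1} = a x_t + u_t + w_t.
a, q, r, pT, sigw2, T = 1.2, 1.0, 1.0, 1.0, 0.5, 20

P = [0.0] * (T + 1)
P[T] = pT
gain = [0.0] * T
for t in range(T - 1, -1, -1):
    P[t] = q + P[t + 1] * a**2 - (P[t + 1] * a)**2 / (P[t + 1] + r)   # Riccati recursion
    gain[t] = -P[t + 1] * a / (P[t + 1] + r)                          # u_t = gain[t] * x_t

x0 = 2.0
J0 = P[0] * x0**2 + sigw2 * sum(P[k + 1] for k in range(T))           # J_0(x_0)
print(P[0], gain[0], J0)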
5.4 Infinite Horizon Optimal Control Problems: Discounted Cost
When the time horizon becomes unbounded, we cannot directly invoke dynamic programming in the form considered earlier. In such settings, we look for conditions for optimality. We will investigate such conditions in this section.

The infinite horizon problems that we will consider belong to two classes: discounted cost and average cost problems. We first discuss the discounted cost problem; the average cost problem is discussed in Chapter 7.

One popular class of problems is the one in which future costs are discounted: the future is less important than today, in part due to the uncertainty in the future, as well as due to the economic understanding that the current value of a good is more important than its value in the future.
The discounted cost function is given as:

J(ν_0, Π) = E_{ν_0}^Π[ Σ_{t=0}^{T−1} β^t c(x_t, u_t) ],

for some β ∈ (0, 1). If there exists a policy Π* which minimizes this cost, the policy is said to be optimal.

We can consider an infinite horizon problem by taking the limit (when c is non-negative)

J(ν_0, Π) = lim_{T→∞} E_{ν_0}^Π[ Σ_{t=0}^{T−1} β^t c(x_t, u_t) ],

and invoking the monotone convergence theorem:

J(ν_0, Π) = E_{ν_0}^Π[ Σ_{t=0}^{∞} β^t c(x_t, u_t) ].
Note that

J(x_0, Π) = E_{x_0}^Π[ c(x_0, u_0) + E^Π[ Σ_{t=1}^{∞} β^t c(x_t, u_t) | x_1, x_0, u_0 ] ],

or

J(x_0, Π) = E_{x_0}^Π[ c(x_0, u_0) + β E^Π[ Σ_{t=1}^{∞} β^{t−1} c(x_t, u_t) | x_1, x_0, u_0 ] ],

or

J(x_0, Π) = E_{x_0}^Π[ c(x_0, u_0) + β E^Π[ J(x_1, Π) | x_1, x_0, u_0 ] ].
Now, let J(x) = inf_{Π∈Π_A} J(x, Π). We observe that the cost of starting at any position, at any time k, leads to a recursion where the future cost is multiplied by β. This can be observed by applying the dynamic programming equations discussed earlier: for a given finite horizon truncation, we work with the sequential updates

J_t(x_t) = min_u { c(x_t, u) + β ∫_X J_{t+1}(x_{t+1}) Q(dx_{t+1}|x_t, u) },

and hope that this converges to a limit.
Under measurable selection conditions and a boundedness condition on the cost function, there exists a solution to the discounted cost problem. In particular, the following holds:

Theorem 5.4.1 Suppose the cost function is bounded, non-negative, and one of the measurable selection conditions (Condition 1, Condition 2 or Condition 3) applies. Then, there exists a solution to the discounted cost problem. Furthermore, the optimal cost (value function) is obtained by the successive iterations

v_n(x) = min_u { c(x, u) + β ∫_X v_{n−1}(y) Q(dy|x, u) },    ∀x, n ≥ 1,

with v_0(x) = 0.
Before presenting the proof, we state the following two lemmas. The first one is on the exchange of the order of minima and limits.

Lemma 5.4.1 [Hernandez-Lerma and Lasserre] Let V_n(x, u) ↑ V(x, u) pointwise. Suppose that V_n and V are continuous in u and U(x) is compact for every x ∈ X. Then,

lim_{n→∞} min_{u∈U(x)} V_n(x, u) = min_{u∈U(x)} V(x, u).
Proof. The proof follows from essentially the same arguments as in the proof of Lemma 5.3.2. Let u_n* solve min_{u∈U(x)} V_n(x, u). Note that

| min_{u∈U(x)} V_n(x, u) − min_{u∈U(x)} V(x, u) | ≤ V(x, u_n*) − V_n(x, u_n*),    (5.8)

since V_n(x, u) ↑ V(x, u). Now, suppose that

V(x, u_n*) − V_n(x, u_n*) > ǫ    (5.9)

along a subsequence n_k. There exists a further subsequence n'_k such that u*_{n'_k} → ū for some ū. Now, for every n'_k ≥ n, since V_n is monotonically increasing,

V(x, u*_{n'_k}) − V_{n'_k}(x, u*_{n'_k}) ≤ V(x, u*_{n'_k}) − V_n(x, u*_{n'_k}).

However, V(x, u*_{n'_k}) and, for a fixed n, V_n(x, u*_{n'_k}) are continuous in the control variable; hence these two terms converge, and their difference converges to

V(x, ū) − V_n(x, ū).

For every fixed x and ū, and every ǫ > 0, we can find a sufficiently large n such that V(x, ū) − V_n(x, ū) ≤ ǫ/2. Hence (5.9) cannot hold.    ⋄
Lemma 5.4.2 (i) The space of measurable functions X → R endowed with the ‖ · ‖_∞ norm, that is,

L_∞(X) = {f : ‖f‖_∞ = sup_x |f(x)| < ∞},

is a Banach space.
(ii) The space of continuous and bounded functions from X → R, C_b(X), endowed with the ‖ · ‖_∞ norm is a Banach space.
Proof of Theorem 5.4.1. Suppose that Condition 2 holds. We obtain the optimal solution as a limit of discounted cost problems with a finite horizon T. By dynamic programming, we obtain the recursion for every T as:

J_T^T(x) = 0,
J_t^T(x) = T(J_{t+1}^T)(x) = min_u { c(x, u) + β ∫_X J_{t+1}^T(y) Q(dy|x, u) }.    (5.10)

This recursion leads to a solution for a T-stage discounted cost problem. Since J_t^T(x) ≤ J_t^{T+1}(x), if there exists some J_t^∞ such that J_t^T(x) ↑ J_t^∞(x), we could invoke Lemma 5.4.1 to argue that

J_t^∞(x) = T(J_{t+1}^∞)(x) = min_u { c(x, u) + β ∫_X J_{t+1}^∞(y) Q(dy|x, u) }.

Such a limit exists by the monotone convergence theorem, since J_t^∞(x) ≤ Σ_t β^t sup_{x,u} |c(x, u)| < ∞. Hence, a limit indeed exists.
We can also provide a more direct approach, which also yields a uniqueness result through a useful contraction property. First note that J_t^T(x) = J_0^{T−t}(x). To show that the limit exists, we show that lim_{T→∞} J_0^T(x) exists, which then implies that J_0^∞ exists. We can obtain the solution using an iteration known as the value iteration algorithm. We observe that the function J^∞ lives in L_∞(X) (since the cost is bounded, there is a uniform bound for every x). We show that the iteration given by

T(v)(x) = min_u { c(x, u) + β ∫_X v(y) Q(dy|x, u) }

is a contraction in L_∞(X). Indeed,
is a contraction in L∞ (X). Let
||T (v) − T (v ′ )||∞
= sup |T (v)(x) − T (v ′ )(x)|
x∈X
= sup |{min{c(x, u) + β
x∈X
u
X
u
= sup 1A1 min{c(x, u) + β
u
x∈X
+1A2
− min{c(x, u) + β
u
X
≤ sup 1A1 c(x, u∗ ) + β
x∈X
+1A2
− c(x, u∗∗ ) − β
= sup 1A1 {β
x∈X
X
X
X
u
u
v(y)Q(dy|x, u∗ ) − c(x, u∗ ) − β
v ′ (y)Q(dy|x, u)}
X
v ′ (y)Q(dy|x, u)}
v(y)Q(dy|x, u)} + min{c(x, u) + β
X
v ′ (y)Q(dy|x, u∗ )
X
v(y)Q(dy|x, u∗∗ ) + c(x, u∗∗ ) + β
X
v ′ (y)Q(dy|x, u∗∗ )
X
x∈X
Q(dy|x, u∗∗ ) + 1A2
X
X
v(y)Q(dy|x, u)} − min{c(x, u) + β
(v(y) − v(y ′ ))Q(dy|x, u∗ )} + sup 1A2 {β
≤ β||v − v ′ ||∞ {1A1
v ′ (y)Q(dy|x, u)}|
v(y)Q(dy|x, u)} − {min{c(x, u) + β
X
(v ′ (y) − v(y))Q(dy|x, u∗∗ )}
Q(dy|x, u∗ )}
X
= β||v − v ′ ||∞
(5.11)
Here A1 denotes the event that
min{c(x, u) + β
u
X
v(y)Q(dy|x, u)} ≥ min{c(x, u) + β
u
v ′ (y)Q(dy|x, u)},
X
5.5 Linear Quadratic Gaussian Problem
and A2 denotes the complementary event, u∗∗ is the minimizing control for {c(x, u) + β
minimizer for {c(x, u) + β X v ′ (y)Q(dy|x, u)}.
X
61
v(y)Q(dy|x, u)} and u∗ is the
As a result, T defines a contraction on the Banach space L_∞(X), and there exists a unique fixed point. Thus, the sequence of iterations (5.10) converges to this fixed point, J_0^∞(x). This iteration is known as the value iteration algorithm.    ⋄
Hence, a function v : X → R is a solution to the infinite horizon problem if it satisfies

v(x) = min_u { c(x, u) + β ∫_X v(y) Q(dy|x, u) }

for all x ∈ X. We call v (or J_0^∞) the value function.
Remark 5.1. The above discussion also applies by considering a contraction on the space C_b(X), if Condition 1 holds.

Remark 5.2. We note that if the cost is not bounded, one can define a weighted sup-norm ‖c‖_v = sup_x |c(x)/v(x)|, where v is a positive function. The contraction discussion above will apply in this context with the revised norm.
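The value iteration algorithm is straightforward to implement on a finite model. The Python sketch below (a hypothetical model; all numbers are assumed for illustration) iterates T until the sup-norm change falls below a tolerance, which the contraction property guarantees happens geometrically fast at rate β.

import numpy as np

# Assumed finite model for value iteration under the discounted cost criterion.
nX, nU, beta = 4, 3, 0.9
rng = np.random.default_rng(2)
Q = rng.random((nU, nX, nX)); Q /= Q.sum(axis=2, keepdims=True)    # Q[u, x, :] = Q(.|x, u)
c = rng.random((nX, nU))                                            # bounded cost c(x, u)

v = np.zeros(nX)                                                    # v_0 = 0
for n in range(10000):
    Tv = (c + beta * np.einsum('uxz,z->xu', Q, v)).min(axis=1)      # T(v)(x)
    if np.max(np.abs(Tv - v)) < 1e-10:                              # sup-norm stopping rule
        break
    v = Tv

policy = (c + beta * np.einsum('uxz,z->xu', Q, v)).argmin(axis=1)   # stationary selector
print(n, v, policy)                                                  # value function and policy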
5.5 Linear Quadratic Gaussian Problem
Consider the following linear system:

x_{t+1} = A x_t + B u_t + w_t,

where x ∈ R^n, u ∈ R^m and w ∈ R^n. Suppose {w_t} is i.i.d. Gaussian with a given covariance matrix E[w_t w_t^T] = W for all t ≥ 0.

The goal is to obtain

inf_Π J(Π, x),

where

J(Π, x) = E_x^Π[ Σ_{t=0}^{T−1} (x_t^T Q x_t + u_t^T R u_t) + x_T^T Q_T x_T ],

with R, Q, Q_T > 0 (that is, these matrices are positive definite).
In class, we obtained the dynamic programming recursion for the optimal control problem.

Theorem 5.5.1 The optimal control is linear and has the form

u_t = −(B^T P_{t+1} B + R)^{−1} B^T P_{t+1} A x_t,

where P_t solves the discrete-time Riccati equation

P_t = Q + A^T P_{t+1} A − A^T P_{t+1} B (B^T P_{t+1} B + R)^{−1} B^T P_{t+1} A,

with final condition P_T = Q_T. The optimal cost is given by J(x_0) = x_0^T P_0 x_0 + Σ_{t=0}^{T−1} E[w_t^T P_{t+1} w_t].
Theorem 5.5.2 As T → ∞ in the optimization problem above, if (A, B) is controllable and, with Q = C^T C, (A, C) is observable, the sequence of optimal policies converges to a stationary policy. Under the stated controllability and observability assumptions, the Riccati recursion

P_t = Q + A^T P_{t+1} A − A^T P_{t+1} B (B^T P_{t+1} B + R)^{−1} B^T P_{t+1} A

admits a unique fixed point solution P that satisfies

P = Q + A^T P A − A^T P B (B^T P B + R)^{−1} B^T P A.

Furthermore, P is positive definite. In addition, under the optimal stationary control policy, the closed-loop system is stable (lim_{t→∞} x_t = 0 in the absence of noise).
We observe that, if the system has noise and we consider an average cost optimization problem, the effect of the noise will stay bounded, and the same solution with P and the induced control policy will be optimal for the minimization of the expected average cost

lim_{T→∞} (1/T) E_x^Π[ Σ_{t=0}^{T−1} (x_t^T Q x_t + u_t^T R u_t) + x_T^T Q_T x_T ].
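A Python sketch of Theorem 5.5.2 follows; the matrices A, B, Q, R, W below are assumed purely for illustration. Iterating the Riccati recursion from P = Q converges to the fixed point, and the induced stationary gain places the closed-loop eigenvalues strictly inside the unit circle; the trace formula for the average cost is the standard steady-state LQG expression, used here as a check.

import numpy as np

# Assumed (A, B) controllable, Q = C^T C with (A, C) observable, R > 0.
A = np.array([[1.1, 0.3], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); W = 0.1 * np.eye(2)

P = Q.copy()
for _ in range(500):                                   # iterate the Riccati recursion
    K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)  # stationary gain: u_t = -K x_t
    P_new = Q + A.T @ P @ A - A.T @ P @ B @ K
    if np.max(np.abs(P_new - P)) < 1e-12:
        break
    P = P_new

eigs = np.linalg.eigvals(A - B @ K)
avg_cost = np.trace(P @ W)                             # steady-state expected average cost, tr(P W)
print(P)
print(np.abs(eigs))                                    # closed-loop spectral radius < 1
print(avg_cost)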
5.6 Exercises
Exercise 5.6.1 An investor’s wealth dynamics is given by the following:
x_{t+1} = u_t w_t,

where {w_t} is an i.i.d. R_+-valued stochastic process with E[√w_t] = 1 and u_t is the investment of the investor at time t. The investor has access to the past and current wealth information and his previous actions. The goal is to maximize

J(x_0, Π) = E_{x_0}^Π[ Σ_{t=0}^{T−1} √(x_t − u_t) ].

The investor's action set for any given x is U(x) = [0, x]. His initial wealth is given by x_0.
Formulate the problem as an optimal stochastic control problem by clearly identifying the state space, the control action
space, the information available at the controller at any time, the transition kernel and a cost functional mapping the
actions and states to R.
Find an optimal policy.
Hint: For α ≥ 0, √(x − u) + α√u is a concave function of u for 0 ≤ u ≤ x, and its maximum is computed by setting the derivative of √(x − u) + α√u to zero.
Exercise 5.6.2 Consider an investment problem in a stock market. Suppose there are N possible goods and suppose these
goods have prices {p^i_t, i = 1, . . . , N} at time t taking values in a countable set P. Further, suppose that the stocks evolve stochastically according to a Markov transition such that for every 1 ≤ i ≤ N,

P(p^i_{t+1} = m | p^i_t = n) = P^i(n, m).
Suppose that there is a transaction cost of c for every buying and selling for each good. Suppose that the investor can only
hold on to at most M units of stock. At time t, the investor has access to his current and past price information and his
investment allocations up to time t.
When a transaction is made, the following might occur. If the investor sells one unit of good i, the investor spends c dollars for the transaction and adds p^i_t to his account. If he buys one unit of good j, then he takes out p^j_t dollars per unit from his account and also takes out c dollars. For every such transaction per unit, there is a cost of c dollars. If he does not do anything, then there is no cost. Assume that the investor can make at most one transaction per time stage.
Suppose the investor initially has an unlimited amount of money and a given stock allocation.
The goal is to maximize the total worth of the stock at the final time stage t = T ∈ Z+ minus the total transaction costs
plus the earned money in between.
a) Formulate the problem as a stochastic control problem by clearly identifying the state, the control actions, the information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.
b) By Dynamic Programming, generate a recursion which will lead to an optimal control policy.
Exercise 5.6.3 Consider a special case of the above problem, where the objective is to maximize the final worth of the
investment and minimize the transaction cost. Furthermore, suppose there is only one uniform stock price with a transition
kernel given by
P (pt+1 = m|pt = n) = P (n, m)
The investor can again hold at most M units of goods, but might wish to purchase more or to sell some of them.
a) Formulate the problem as a stochastic control problem by clearly identifying the state, the control actions, the transition
kernel and a cost functional mapping the actions and states to R.
b) By Dynamic Programming generate a recursion which will lead to an optimal control policy.
c) What is the optimal control policy?
Exercise 5.6.4 A fishery manager annually has x_t units of fish and sells u_t x_t of these, where u_t ∈ [0, 1]. With the remaining fish, the next year's production is given by the following model:

x_{t+1} = w_t x_t (1 − u_t) + w_t,

where x_0 is given and {w_t} is an independent, identically distributed sequence of random variables with w_t ≥ 0 for all t, so that E[w_t] = w̃ ≥ 0.
The goal is to maximize the profit over the time horizon 0 ≤ t ≤ T − 1. At time T , he sells all of the fish.
a) Formulate the problem as an optimal stochastic control problem by clearly identifying the state, the control actions, the
information available at the controller, the transition kernel and a cost functional mapping the actions and states to R.
b) Does there exist an optimal policy? If it does, compute the optimal control policy as a dynamic programming recursion.
Exercise 5.6.5 Consider the following linear system:

x_{t+1} = A x_t + B u_t + w_t,

where x ∈ R^n, u ∈ R^m and w ∈ R^n. Suppose {w_t} is i.i.d. Gaussian with a given covariance matrix E[w_t w_t^T] = W for all t ≥ 0. The goal is to obtain

inf_Π J(Π, x),

where

J(Π, x) = E_x^Π[ Σ_{t=0}^{T−1} (x_t^T Q x_t + u_t^T R u_t) + x_T^T Q_T x_T ],

with R, Q, Q_T > 0 (that is, these matrices are positive definite).
a) Show that there exists an optimal policy.
b) Obtain the Dynamic Programming recursion for the optimal control problem. Is the optimal control policy Markov? Is
it stationary?
c) For T → ∞, if (A, B) is controllable and with Q = C T C and (A, C) is observable, prove that the optimal policy is
stationary.
Exercise 5.6.6 Consider the following linear system:
xt+1 = axt + pt but + wt ,
where the state, control and noise realizations satisfy xt ∈ R, ut ∈ R and wt ∈ R. Suppose {wt } is i.i.d. zero-mean
Gaussian with a given variance. Here {pt } is a sequence of independent, identically distributed random variables which
take values of 0 or 1, such that pt = 1 with probability p.
The goal is to obtain

inf_Π J(Π, x),

where

J(Π, x) = E_x^Π[ Σ_{t=0}^{T−1} (Q x_t^2 + p_t R u_t^2) + P_T x_T^2 ],

with R, Q, P_T > 0.
The controller has access to the information set It = {xs , ps , s ≤ t − 1} ∪ {xt } at time t ∈ Z+ and can pick an arbitrary
value in R.
Obtain the Dynamic Programming recursion for the optimal control problem. What is an optimal control policy?
Hint: Note that when p = 1, one obtains the Riccati equation.
Exercise 5.6.7 Consider an inventory-production system given by
xt+1 = xt + ut − wt ,
where xt is R-valued, with the one-stage cost
c(xt , ut , wt ) = but + h max(0, xt + ut − wt ) + p max(0, wt − xt − ut )
Here, b is the unit production cost, h is the unit holding (storage) cost and p is the unit shortage cost. At any given time,
the decision maker can take ut ∈ R+ . The demand variable wt is a R-valued i.i.d. process, independent of x0 , with a finite
mean. The goal is to minimize
J(Π, x) = E_x^Π[ Σ_{t=0}^{T−1} c(x_t, u_t, w_t) ].
Obtain a recursive form for the optimal solution. In particular, show that the solution is of threshold type: There exists a
sequence of real-numbers st so that the optimal solution is of the form: ut = 0 × 1{xt ≥st } + (st − xt ) × 1{xt <st } . See [9]
for a detailed analysis of this problem.
Exercise 5.6.8 ( [10]) Consider a burglar who is considering retirement. His goal is to maximize his earnings up to time
T . At any time, he can either continue his profession to steal an amount of wt which is an i.i.d. R+ -valued random process
(he adds this amount to his wealth), or retire.
However, each time he attempts burglary, there is a chance that he gets caught and he loses all of his savings; this happens
according to an i.i.d. Bernoulli process so that he gets caught with probability p at each time stage.
Assume that his initial wealth is x0 = 0. His goal is to maximize E[xT ]. Find his optimal policy for 0 ≤ t ≤ T − 1.
Note: Such problems where a decision maker can quit or stop a process are known as optimal stopping problems.
Exercise 5.6.9 Consider an unemployed person who will have to work for years t = 1, 2, ..., 10 if she takes a job at any
given t.
Suppose that in each year in which she remains unemployed, she may be offered a good job that pays 10 dollars per year (with probability 1/4); she may be offered a bad job that pays 4 dollars per year (with probability 1/4); or she may not be offered a job (with probability 1/2). These job-offer events are independent from year to year (that is, the job market is represented by an independent sequence of random variables for every year).
Once she accepts a job, she will remain in that job for the rest of the ten years. That is, for example, she cannot switch from
the bad job to the good job.
Suppose the goal is to maximize the expected total earnings over the ten years, starting from year 1 up to year 10 (including year 10).
Should this person accept good jobs, bad jobs at any given year? What is her best policy?
State the problem as a Markov Decision Problem, identify the state space, the action space and the transition kernel.
Obtain the optimal solution.
6
Partially Observed Markov Decision Processes, Non-linear Filtering and the
Kalman Filter
As discussed earlier in chapter 2, consider a system of the form:
xt+1 = f (xt , ut , wt ),
yt = g(xt , vt ).
Here, x_t is the state, u_t ∈ U is the control, and (w_t, v_t) ∈ W × V are second order, zero-mean, i.i.d. noise processes with w_t independent of v_t. We also assume that the state noise w_t either has a probability mass function or admits a probability measure which is absolutely continuous with respect to the Lebesgue measure; this ensures that the probability measure admits a density function. Hence, the notation p(x) will denote either the probability mass function for discrete-valued spaces or the probability density function for uncountable spaces. The controller only has causal access to the second component {y_t} of the process. A policy Π = {Π_t} is measurable with respect to σ(I_t), with I_t = {y_[0,t], u_[0,t−1]}. We denote the observed history space as H_0 := P, H_t = H_{t−1} × Y × U. Here P denotes the space of probability measures on X, which we assume to be a Borel space. Hence, the set of randomized causal control policies is such that P(u(h_t) ∈ U|I_t) = 1 for all I_t ∈ H_t.
6.1 Enlargement of the State-Space and the Construction of a Controlled Markov Chain
One could transform a partially observable Markov Decision Problem to a Fully Observed Markov Decision Problem via
an enlargement of the state space. In particular, we obtain via the properties of total probability the following dynamical
recursion
π_t(x) := P(x_t = x | y_[0,t], u_[0,t−1])

= ( Σ_{x_{t−1}∈X} π_{t−1}(x_{t−1}) P(y_t|x_t = x) P(x_t = x|x_{t−1}, u_{t−1}) P(u_{t−1}|y_[0,t−1], u_[0,t−2]) ) / ( Σ_{x_t∈X} Σ_{x_{t−1}∈X} π_{t−1}(x_{t−1}) P(y_t|x_t) P(x_t|x_{t−1}, u_{t−1}) P(u_{t−1}|y_[0,t−1], u_[0,t−2]) )

= ( Σ_{x_{t−1}∈X} π_{t−1}(x_{t−1}) P(y_t|x_t = x) P(x_t = x|x_{t−1}, u_{t−1}) ) / ( Σ_{x_t∈X} Σ_{x_{t−1}∈X} π_{t−1}(x_{t−1}) P(y_t|x_t) P(x_t|x_{t−1}, u_{t−1}) )

=: F(π_{t−1}, y_t, u_{t−1}).    (6.1)

Here, the summations need to be exchanged with integrals in case the variables live in an uncountable space. The conditional measure process becomes a controlled Markov chain in P.
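For a finite state space, the recursion (6.1) is only a few lines of code. The Python sketch below (a hypothetical two-state, two-observation, two-action model; all probabilities are assumed for illustration) implements the map π_t = F(π_{t−1}, y_t, u_{t−1}).

import numpy as np

# Assumed finite POMDP ingredients: P_trans[u, x_prev, x] = P(x | x_prev, u),
# P_obs[x, y] = P(y | x).  Everything below is an illustrative example.
P_trans = np.array([[[0.9, 0.1], [0.3, 0.7]],
                    [[0.6, 0.4], [0.1, 0.9]]])
P_obs = np.array([[0.8, 0.2],
                  [0.25, 0.75]])

def filter_update(pi_prev, y, u):
    """One step of (6.1): pi_t(x) proportional to P(y|x) * sum_x' pi_{t-1}(x') P(x|x', u)."""
    pred = pi_prev @ P_trans[u]          # prediction step: sum over x_{t-1}
    unnorm = P_obs[:, y] * pred          # measurement update
    return unnorm / unnorm.sum()         # normalization (the denominator in (6.1))

pi = np.array([0.5, 0.5])                # prior pi_0
for y, u in [(0, 0), (1, 1), (1, 0)]:    # an assumed observation/action sequence
    pi = filter_update(pi, y, u)
    print(pi)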
Theorem 6.1.1 The process {πt , ut } is a controlled Markov chain. That is, under any admissible control policy, given the
action at time t ≥ 0 and πt , πt+1 is conditionally independent from {πs , us , s ≤ t − 1}.
Proof: Suppose X is countable. Let D ∈ B(P(X)). Then

P(π_{t+1} ∈ D | π_s, u_s, s ≤ t)
= P( ( Σ_{x_t} π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) P(u_t|y_[0,t], u_[0,t−1]) ) / ( Σ_{x_{t+1}} Σ_{x_t} π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) P(u_t|y_[0,t], u_[0,t−1]) ) ∈ D | π_s, u_s, s ≤ t )
= P( ( Σ_{x_t} π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ) / ( Σ_{x_{t+1}} Σ_{x_t} π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ) ∈ D | π_s, u_s, s ≤ t )
= P( ( Σ_{x_t} π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ) / ( Σ_{x_{t+1}} Σ_{x_t} π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ) ∈ D | π_t, u_t ).

Here, the expression

( Σ_{x_t} π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) ) / ( Σ_{x_{t+1}} Σ_{x_t} π_t(x_t) P(y_{t+1}|x_{t+1}) P(x_{t+1}|x_t, u_t) )

is a measurable function of (π_t, y_{t+1}, u_t). The measurability follows from the following useful results; a proof of the first result can be found in [2] (see Theorem 15.13 in [2] or p. 215 in [17]).
Theorem 6.1.2 Let S be a Polish space and M be the set of all measurable and bounded functions f : S → R. Then, for any f ∈ M, the integral

∫ π(dx) f(x)

defines a measurable function of π on P(S) under the topology of weak convergence.
This is a useful result since it allows us to define measurable functions in integral forms on the space of probability measures
when we work with the topology of weak convergence. The second useful result follows from Theorem 6.4.1 and Theorem
2.1 of Dubins and Freedman [27] and Proposition 7.25 in Bertsekas and Shreve [11].
Theorem 6.1.3 Let S be a Polish space. A function F : P(S) → P(S) is measurable on B(P(S)) (under weak convergence), if for all B ∈ B(S) (F (·))(B) : P(S) → R is measurable under weak convergence on P(S), that is for every
B ∈ B(S), (F (π))(B) is a measurable function when viewed as a function from P(S) to R.
⋄
Let the cost function to be minimized be

E_{x_0}^Π[ Σ_{t=0}^{T−1} c(x_t, u_t) ],

where E_{x_0}^Π[·] denotes the expectation over all sample paths with initial state given by x_0 under policy Π. We can transform the system into a fully observed Markov model as follows. Since

E_{x_0}^Π[ Σ_{t=0}^{T−1} c(x_t, u_t) ] = E_{x_0}^Π[ Σ_{t=0}^{T−1} E[c(x_t, u_t)|I_t] ],

we can write the per-stage cost function as

c̃(π, u) = Σ_{x∈X} c(x, u) π(x),    π ∈ P.    (6.2)

The stochastic transition kernel q is given by

q(x, y|π, u) = Σ_{x'∈X} P(x, y|x', u) π(x'),    π ∈ P,

and this kernel can be decomposed as q(x, y|π, u) = P(y|π, u) P(x|π, u, y). The second term here is the filtering equation, mapping (π, u, y) ∈ (P × U × Y) to P. It follows that (P, U, K, c̃) defines a completely observable controlled Markov process. Here, we have

K(B|π, u) = Σ_{y∈Y} 1_{{P(·|π,u,y)∈B}} P(y|π, u),    ∀B ∈ B(P).
As such, one can obtain the optimal solution by using the filtering equation as a sufficient statistic in a centralized setting,
as Markov policies (policies that use the Markov state as their sufficient statistics) are optimal for control of Markov chains,
under well-studied sufficiency conditions for the existence of optimal selectors.
We call control policies which use π as their information to generate the control separated control policies. This will become clearer in the context of linear quadratic Gaussian control systems.
Since in our analysis earlier, we had assumed that the state belongs to a complete, separable, metric space, the analysis
here follows the analysis in the previous chapter, provided that the measurability, continuity, compactness conditions are
verified.
6.2 Linear Quadratic Gaussian Case and Kalman Filtering
Consider the following linear system:
xt+1 = Axt + But + wt ,
and
yt = Cxt + vt ,
where x ∈ Rn , u ∈ Rm and w ∈ Rn , y ∈ Rp , v ∈ Rp . Suppose {wt , vt } are i.i.d. random Gaussian vectors with given
covariance matrices E[wt wtT ] = W and E[vt vtT ] = V for all t ≥ 0.
The goal is to obtain

inf_Π J(Π, µ_0),

where

J(Π, µ_0) = E_{µ_0}^Π[ Σ_{t=0}^{T−1} (x_t^T Q x_t + u_t^T R u_t) + x_T^T Q_T x_T ],

with R > 0 and Q, Q_T ≥ 0 (that is, R is positive definite and Q, Q_T are positive semi-definite).
We observe that the optimal control is linear and has the form

u_t = −(B^T K_{t+1} B + R)^{−1} B^T K_{t+1} A E[x_t|I_t],

where K_t solves the discrete-time Riccati equation

K_t = Q + A^T K_{t+1} A − A^T K_{t+1} B (B^T K_{t+1} B + R)^{−1} B^T K_{t+1} A,

with final condition K_T = Q_T.
6.2.1 Separation of Estimation and Control
In the above problem, we observed that the optimal control has a separation structure: The controller first estimates the
state, and then applies its control action, by regarding the estimate as the state itself.
We say that control has a dual effect if the control affects the estimation quality. Typically, when the dual effect is absent, the separation principle observed above applies. In most general problems, however, the dual effect of the control is present.
That is, depending on the control, the estimation quality at the controller will be affected. As an example, consider a linear
system controlled over the Internet, where the controller applies a control, but does not know if the packet reaches the
destination or not. In this case, the control signal which was intended to be sent, does affect the estimation error quality.
6.3 Estimation and Kalman Filtering
In this section, we discuss the control-free setup and derive the celebrated Kalman Filter.
Lemma 6.3.1 Let x, y be random variables with finite second moments and let R > 0. The function which solves inf_g E[(x − g(y))^T R(x − g(y))], over measurable functions g, is G(y) = E[x|y], almost surely.
Proof: Let G(y) = E[x|y] + h(y) for some measurable h. Then, P-a.s., we have the following:
E[(x − E[x|y] − h(y))T R(x − E[x|y] − h(y))|y]
= E[(x − E[x|y])T R(x − E[x|y])|y] + E[hT (y)Rh(y)|y] + 2E[(x − E[x|y])T Rh(y)|y]
= E[(x − E[x|y])T R(x − E[x|y])|y] + E[hT (y)Rh(y)|y] + 2E[(x − E[x|y])T |y]Rh(y)
= E[(x − E[x|y])T R(x − E[x|y])|y] + E[hT (y)Rh(y)|y]
≥ E[(x − E[x|y])T R(x − E[x|y])|y]
(6.3)
⋄
Remark 6.1. We note that the above admits a Hilbert space interpretation or formulation. Let H denote the space of random variables on which the inner product ⟨x, y⟩ = E[x^T R y] is defined. Let M_H be a subspace of H, the closed space of random variables that are measurable on σ(y) and have finite second moments. Then, the projection theorem leads to the observation that an optimal estimate g(y) minimizing ‖x − g(y)‖^2, denoted here by G(y), is one which satisfies

⟨x − G(y), h(y)⟩ = E[(x − G(y))^T R h(y)] = 0,    ∀h ∈ M_H.

The conditional expectation satisfies this since

E[(x − E[x|y])^T R h(y)] = E[ E[(x − E[x|y])^T R h(y)|y] ] = E[ E[(x − E[x|y])^T |y] R h(y) ] = 0,

since, P-a.s., E[(x − E[x|y])^T |y] = 0.
For a linear Gaussian system, the state process has a Gaussian probability measure. A Gaussian probability measure can be
uniquely identified by knowing the mean and the covariance of the Gaussian random variable. This makes the analysis for
estimating a Gaussian random variable particularly simple to perform, since the conditional estimate of a partially observed
(through an additive Gaussian noise) Gaussian random variable is a linear function of the observed variable.
Recall that a Gaussian measure with mean µ and covariance matrix K_XX has the following density:

p(x) = (1 / ((2π)^{n/2} |K_XX|^{1/2})) e^{−(1/2)(x−µ)^T K_XX^{−1}(x−µ)}.
Lemma 6.3.2 Let x, y be zero-mean, jointly Gaussian random vectors. Then E[x|y] is linear in y (if the means are not zero, then the expectation is affine). Let Σ_XY = E[x y^T]. Then, E[x|y] = Σ_XY Σ_YY^{−1} y and E[(x − E[x|y])(x − E[x|y])^T] = Σ_XX − Σ_XY Σ_YY^{−1} Σ_XY^T =: D.

Proof: By Bayes' rule and the fact that the processes admit densities, p(x|y) = p(x, y)/p(y). Let

K_XY := [ Σ_XX  Σ_XY ; Σ_YX  Σ_YY ].

Its inverse K_XY^{−1} is also symmetric (since the eigenvectors are the same and the eigenvalues are inverted) and is written as

K_XY^{−1} = [ Ψ_XX  Ψ_XY ; Ψ_YX  Ψ_YY ].

Thus, for some normalization constant C,

p(x|y) = p(x, y)/p(y) = C e^{−(1/2)( x^T Ψ_XX x + 2 x^T Ψ_XY y + y^T Ψ_YY y )} / e^{−(1/2) y^T Σ_YY^{−1} y}.

By the method of completing the squares applied to the expression

x^T Ψ_XX x + 2 x^T Ψ_XY y + y^T Ψ_YY y − y^T Σ_YY^{−1} y = (x − Hy)^T D^{−1} (x − Hy) + q(y),

it follows that H = −Ψ_XX^{−1} Ψ_XY; since K_XY K_XY^{−1} = I, this is also equal to Σ_XY Σ_YY^{−1}. Here q(y) is a quadratic expression in y. As a result, one obtains

p(x|y) = C(y) e^{−(1/2)(x − Hy)^T D^{−1}(x − Hy)}.

Since ∫ p(x|y) dx = 1, it follows that C(y) = 1/((2π)^{n/2} |D|^{1/2}), which is in fact independent of y. Note also that

E[(x − E[x|y])(x − E[x|y])^T] = E[x x^T] − E[ E[x|y] E[x|y]^T ] = Σ_XX − Σ_XY Σ_YY^{−1} Σ_XY^T.

⋄
Remark 6.2. Even if the random variables x, y are not Gaussian (but zero-mean), through another Hilbert space formulation and an application of the Projection Theorem, the expression Σ_{XY} Σ_{YY}^{−1} y is the best linear estimate, that is, the solution to inf_K E[(x − Ky)^T (x − Ky)]. One can naturally generalize this to random variables with non-zero mean.
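As a quick numerical illustration of Lemma 6.3.2 and Remark 6.2, the following sketch (Python with numpy; the covariance values are hypothetical choices of my own and not taken from the notes) computes the linear estimator gain Σ_{XY} Σ_{YY}^{−1} and compares the resulting error covariance with Σ_{XX} − Σ_{XY} Σ_{YY}^{−1} Σ_{XY}^T.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative jointly Gaussian model: x in R^2, y = C x + v (values are hypothetical)
C = np.array([[1.0, 0.5]])
Sxx = np.array([[2.0, 0.3], [0.3, 1.0]])
Svv = np.array([[0.5]])

Sxy = Sxx @ C.T                              # Σ_XY = E[x y^T]
Syy = C @ Sxx @ C.T + Svv                    # Σ_YY
K = Sxy @ np.linalg.inv(Syy)                 # best linear estimator gain: E[x|y] = K y
D = Sxx - Sxy @ np.linalg.inv(Syy) @ Sxy.T   # error covariance from Lemma 6.3.2

# Monte Carlo check of the error covariance
n = 200000
x = rng.multivariate_normal(np.zeros(2), Sxx, size=n)
v = rng.multivariate_normal(np.zeros(1), Svv, size=n)
y = x @ C.T + v
err = x - y @ K.T
print("theoretical D:\n", D)
print("empirical error covariance:\n", np.cov(err.T))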
We will derive the Kalman filter in the following. The following two lemmas are instrumental.

Lemma 6.3.3 If E[x] = 0 and z_1, z_2 are independent Gaussian random vectors, then E[x|z_1, z_2] = E[x|z_1] + E[x|z_2].

Proof: The proof follows by writing z = [z_1, z_2], noting that Σ_{ZZ} is block diagonal and E[x|z] = Σ_{XZ} Σ_{ZZ}^{−1} z. ⋄
Lemma 6.3.4 E[(x − E[x|y])(x − E[x|y])T ] is given by D above.
Now, we can move on to the derivation of the Kalman filter. Consider the linear system

x_{t+1} = A x_t + w_t,
y_t = C x_t + v_t,

where {w_t} and {v_t} are zero-mean, independent Gaussian noise processes with E[w_t w_t^T] = W and E[v_t v_t^T] = V.
Define:
mt = E[xt |y[0,t−1] ]
Σt|t−1 = E[(xt − E[xt |y[0,t−1] ])(xt − E[xt |y[0,t−1] ])T |y[0,t−1] ]
Theorem 6.3.1 The following holds:
mt+1 = Amt + AΣt|t−1 C T (CΣt|t−1 C T + V )−1 (yt − Cmt )
Σt+1|t = AΣt|t−1 AT + W − (AΣt|t−1 C T )(CΣt|t−1 C T + V )−1 (CΣt|t−1 AT )
with
m0 = E[x0 ]
and
Σ0|−1 = E[x0 xT0 ]
Proof:
Since x_{t+1} = A x_t + w_t,

m_{t+1} = E[A x_t + w_t | y_{[0,t]}] = E[A x_t + w_t | y_{[0,t−1]}, y_t − E[y_t | y_{[0,t−1]}]]
        = A m_t + E[A(x_t − m_t) | y_{[0,t−1]}, y_t − E[y_t | y_{[0,t−1]}]] + E[w_t | y_{[0,t−1]}, y_t − E[y_t | y_{[0,t−1]}]]
        = A m_t + E[A(x_t − m_t) | y_{[0,t−1]}] + E[A(x_t − m_t) | y_t − E[y_t | y_{[0,t−1]}]] + E[w_t | y_t − E[y_t | y_{[0,t−1]}]]
        = A m_t + E[A(x_t − m_t) | y_t − E[y_t | y_{[0,t−1]}]] + E[w_t | y_t − E[y_t | y_{[0,t−1]}]]
        = A m_t + E[A(x_t − m_t) + w_t | y_t − E[y_t | y_{[0,t−1]}]]          (6.4)

Here, we use Lemma 6.3.3, the fact that w_t is independent of y_{[0,t−1]}, and that E[A(x_t − m_t) | y_{[0,t−1]}] = 0. Let now X = A(x_t − m_t) + w_t and Y = y_t − C m_t. Then, by Lemma 6.3.2, E[X|Y] = Σ_{XY} Σ_{YY}^{−1} Y, leading to:
mt+1 = Amt + AΣt|t−1 C T (CΣt|t−1 C T + V )−1 (yt − Cmt )
Likewise,
xt+1 − mt+1 = A(xt − mt ) + wt − AΣt|t−1 C T (CΣt|t−1 C T + V )−1 (yt − Cmt ),
leads to, after a few lines of calculations:
Σt+1|t = AΣt|t−1 AT + W − (AΣt|t−1 C T )(CΣt|t−1 C T + V )−1 (CΣt|t−1 AT )
⋄
The above is the celebrated Kalman filter.
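The recursions in Theorem 6.3.1 can be implemented directly; the sketch below (Python/numpy, with system matrices chosen arbitrarily for illustration and not taken from the notes) propagates m_t and Σ_{t|t−1} along a simulated trajectory.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative system x_{t+1} = A x_t + w_t, y_t = C x_t + v_t (matrices are hypothetical)
A = np.array([[1.0, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
W = 0.05 * np.eye(2)          # E[w_t w_t^T]
V = np.array([[0.2]])         # E[v_t v_t^T]

T = 50
x = rng.multivariate_normal(np.zeros(2), np.eye(2))
m = np.zeros(2)               # m_0 = E[x_0]
Sigma = np.eye(2)             # Σ_{0|-1} = E[x_0 x_0^T]

for t in range(T):
    y = C @ x + rng.multivariate_normal(np.zeros(1), V)
    S = C @ Sigma @ C.T + V                                # innovation covariance
    K = A @ Sigma @ C.T @ np.linalg.inv(S)                 # gain in Theorem 6.3.1
    m = A @ m + K @ (y - C @ m)                            # m_{t+1}
    Sigma = A @ Sigma @ A.T + W - K @ (C @ Sigma @ A.T)    # Σ_{t+1|t}
    x = A @ x + rng.multivariate_normal(np.zeros(2), W)    # true state update

print("steady-state prediction error covariance (approx.):\n", Sigma)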
Define now

m̃_t = E[x_t | y_{[0,t]}] = m_t + E[x_t − m_t | y_{[0,t]}] = m_t + E[x_t − m_t | y_t − E[y_t | y_{[0,t−1]}]].

It follows from m_t = A m̃_{t−1} that

m̃_t = A m̃_{t−1} + Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{−1} (y_t − C A m̃_{t−1}).
We observe that the zero-mean variable x_t − m̃_t is orthogonal to I_t, in the sense that the error is independent of the information available at the controller; since the information available is Gaussian, independence and orthogonality are equivalent.
We observe that the recursions in Theorem 6.3.1 remind us of the recursions in Theorem 5.5.2. This leads to the following result.
Theorem 6.3.2 Suppose (A, C) is observable and V > 0. Then, the recursions for the covariance matrices Σ_{·|·} in Theorem 6.3.1 converge to a unique solution. If, in addition, W = BB^T and (A^T, B^T) is observable (that is, (A, B) is controllable), then the solution is positive definite.
Remark 6.3. The above suggests that if the observations are sufficiently informative, then the Kalman filter recursions converge to a unique solution, even in the absence of an irreducibility condition on the original state process x_t. This observation has been extended in the non-linear filtering literature [?].
6.3.1 Optimal Control of Partially Observed LQG Systems
With this observation, in class we reformulated the quadratic optimization problem in terms of m̃_t, u_t and x_t − m̃_t; the optimal control problem is then equivalent to the control of the fully observed state m̃_t, driven by an additive, time-varying, independent Gaussian noise process.
In particular, the cost

J(Π, µ_0) = E^Π_{µ_0}[ ∑_{t=0}^{T−1} (x_t^T Q x_t + u_t^T R u_t) + x_T^T Q_T x_T ]

writes as

E^Π_{µ_0}[ ∑_{t=0}^{T−1} (m̃_t^T Q m̃_t + u_t^T R u_t) + m̃_T^T Q_T m̃_T ] + E^Π_{µ_0}[ ∑_{t=0}^{T} (x_t − m̃_t)^T Q (x_t − m̃_t) ]
for the fully observed system

m̃_t = A m̃_{t−1} + B u_t + w̄_t,

with

w̄_t = Σ_{t|t−1} C^T (C Σ_{t|t−1} C^T + V)^{−1} (y_t − C A m̃_{t−1}).
That we can take the error term (x_t − m̃_t) out of the summation is a consequence of what is known as the separation of estimation and control; a more special version is known as the certainty equivalence principle. In class, we observed that the absence of the dual effect plays a key part in this analysis leading to the separation of estimation and control principle, in taking E[(x_t − m̃_t)^T Q (x_t − m̃_t)] out of the conditioning on I_t.
Thus, the optimal control has the following form for all time stages:

u_t = −(B^T P_{t+1} B + R)^{−1} B^T P_{t+1} A E[x_t | I_t],

and the optimal cost is

E[x_0^T P_0 x_0] + ∑_{k=0}^{T−1} E[ w̄_k^T P_{k+1} w̄_k ] + ∑_{t=0}^{T} E[ (x_t − m̃_t)^T Q (x_t − m̃_t) ].
6.4 Partially Observed Markov Decision Processes in General Spaces
Topological Properties of Space of Probability Measures
In class we discussed three convergence notions; weak convergence, setwise convergence and convergence under total
variation:
Let P(R^N) denote the family of all probability measures on (R^N, B(R^N)) for some N ∈ N. Let {µ_n, n ∈ N} be a sequence in P(R^N).
A sequence {µ_n} is said to converge to µ ∈ P(R^N) weakly if

∫_{R^N} c(x) µ_n(dx) → ∫_{R^N} c(x) µ(dx)

for every continuous and bounded c : R^N → R. On the other hand, {µ_n} is said to converge to µ ∈ P(R^N) setwise if

∫_{R^N} c(x) µ_n(dx) → ∫_{R^N} c(x) µ(dx)

for every measurable and bounded c : R^N → R. Setwise convergence can also be defined through pointwise convergence on Borel subsets of R^N (see, e.g., [32]), that is,

µ_n(A) → µ(A),   for all A ∈ B(R^N),

since the space of simple functions is dense in the space of bounded and measurable functions under the supremum norm.
For two probability measures µ, ν ∈ P(R^N), the total variation metric is given by

‖µ − ν‖_{TV} := 2 sup_{B ∈ B(R^N)} |µ(B) − ν(B)| = sup_{f : ‖f‖_∞ ≤ 1} | ∫ f(x) µ(dx) − ∫ f(x) ν(dx) |,      (6.5)

where the supremum is over all measurable real f such that ‖f‖_∞ = sup_{x ∈ R^N} |f(x)| ≤ 1. A sequence {µ_n} is said to converge to µ ∈ P(R^N) in total variation if ‖µ_n − µ‖_{TV} → 0.
Setwise convergence is equivalent to pointwise convergence on Borel sets, whereas total variation requires uniform convergence on Borel sets. Thus, these three convergence notions are listed in increasing order of strength: convergence in total variation implies setwise convergence, which in turn implies weak convergence.
Weak convergence may not be strong enough for certain measurable selection conditions. One could use other metrics,
such as total variation, or other topologies, such as the one generated by setwise convergence.
Total variation is a stringent notion of convergence. For example, a sequence of discrete probability measures never converges in total variation to a probability measure which admits a density function with respect to the Lebesgue measure, and the space of probability measures under this metric is not separable. Setwise convergence also induces a topology on the space of probability measures (and channels) which is not easy to work with, since the space under this convergence is not metrizable ([29, p. 59]).
However, the space of probability measures on a complete, separable, metric (Polish) space endowed with the topology
of weak convergence is itself a complete, separable, metric space [12]. The Prohorov metric, for example, can be used to
metrize this space.
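As a simple numerical illustration of the difference between these notions (an example of my own choosing, not from the notes), the sketch below compares a discretized standard Gaussian with the Gaussian itself: integrals of bounded continuous functions converge as the grid is refined, while the total variation distance stays at its maximal value of 2, because the discrete measures put all their mass on a Lebesgue-null set.

import numpy as np
from scipy.stats import norm

def discretized_gaussian(n, lim=5.0):
    """Probability masses of N(0,1) collapsed onto n grid points in [-lim, lim]."""
    edges = np.linspace(-lim, lim, n + 1)
    points = 0.5 * (edges[:-1] + edges[1:])
    masses = np.diff(norm.cdf(edges))
    return points, masses / masses.sum()

f = np.cos  # a continuous and bounded test function

for n in (5, 50, 500):
    pts, p = discretized_gaussian(n)
    weak_gap = abs(np.sum(f(pts) * p) - norm.expect(f))   # |∫ f dµ_n − ∫ f dµ|
    # µ_n is purely atomic and µ has a density, so ||µ_n − µ||_TV = 2 for every n
    print(f"n = {n:4d}: |∫f dµ_n − ∫f dµ| = {weak_gap:.4f}, TV distance = 2")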
In the following, we will verify the sufficiency of weak convergence.
The discussion in the previous chapter for countable spaces applies also to Polish spaces. In particular, with π_t denoting the conditional probability measure π_t(A) = P(x_t ∈ A | I_t), we can define a new cost as c̃(π, u) = ∫_X c(x, u) π(dx), π ∈ P(X).

Let Γ(S) be the set of all Borel measurable and bounded functions from S to R. We first wish to find a topology on P(S) under which functions of the form

Θ := { π ↦ ∫_S π(dx) f(x),  f ∈ Γ(S) }

are measurable on P(S).

By the above definitions, it is evident that both setwise convergence and total variation are sufficient for measurability of cost functions of the form ∫ π(dx) f(x), since these are (sequentially) continuous on P(S) for every f ∈ Γ(S). However, as we state in the following, the cost function c̃ : P(X) × U → R defined in (6.2) is measurable also under the weak convergence topology (for a proof, see Bogachev (p. 215, [17])).
Theorem 6.4.1 Let S be a metric space and let M be the set of all measurable and bounded functions f : S → R for which

π ↦ ∫ π(dx) f(x)

defines a measurable function on P(S) under the weak convergence topology. Then, M coincides with the set Γ(S) of all bounded and measurable functions.
This discussion is useful to establish that, under the weak convergence topology, (π_t, u_t) forms a controlled Markov chain for dynamic programming purposes.
6.5 Exercises
Exercise 6.5.1 Consider a linear system with the following dynamics:
xt+1 = axt + ut + wt ,
and let the controller have access to the observations given by:
yt = pt (xt + vt ).
Here {w_t, v_t, t ∈ Z} are independent, zero-mean, Gaussian random variables, with variances E[w^2] and E[v^2]. The controller at time t has access to I_t = {y_s, u_s, p_s, s ≤ t − 1} ∪ {y_t}. Here p_t is an i.i.d. Bernoulli process such that p_t = 1 with probability p.

The initial state has a Gaussian distribution, with zero mean and variance E[x_0^2], which we denote by ν_0. We wish to find, for some r > 0,

inf_Π J(x_0, Π) = E^Π_{ν_0}[ ∑_{t=0}^{3} (x_t^2 + r u_t^2) ].

Compute the optimal control policy and the optimal cost. It suffices to provide a recursive form.

Hint: Show that the optimal control has a separation structure. Compute the conditional estimate through a revised Kalman filter due to the presence of p_t.
Exercise 6.5.2 Let X, Y be Rn and Rm valued zero-mean random vectors defined on a common probability space, which
have finite covariance matrices. Suppose that their probability measures are given by PX and PY respectively.
Find

inf_K E[(X − KY)^T (X − KY)],

that is, find the best linear estimator of X given Y and the resulting estimation error.
Hint: You may pose the problem as a Projection Theorem problem.
Exercise 6.5.3 Consider a Markov Decision Problem set up as follows. Let there be two possible links S^1, S^2 that a decision maker can use to send information packets to a destination. Each link is Markovian and for each of the links we have S^i ∈ {0, 1}, i = 1, 2. Here, 0 denotes a good link and 1 denotes a bad link. If the decision maker picks one link, he can find the state of that link. Suppose the goal is to maximize E[ ∑_{t=0}^{T−1} ( 1_{{U_t = 1}} 1_{{S^1_t = 1}} + 1_{{U_t = 2}} 1_{{S^2_t = 1}} ) ]. Each of the links evolves according to a Markov model, independent of the other link.

The only information the decision maker has is the initial probability measure on the links and the actions it has taken. Obtain a dynamic programming recursion by clearly identifying the controlled Markov chain and the cost function. Is the value function convex? Can you deduce any structural properties of the optimal policies?
7
The Average Cost Problem
Consider the following average cost problem:

J(x, Π) = lim sup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ]
This is an important problem in applications where one is concerned about the long-term behavior, unlike the discounted
cost setup where the primary interest is in the short-term time stages.
7.1 Average Cost and the Average Cost Optimality Equation (ACOE) or Inequality (ACOI)
To study the average cost problem, one approach is to establish the existence of an Average Cost Optimality Equation.
Definition 7.1.1 The collection of functions (g, h, f) is a canonical triplet if for all x ∈ X,

g(x) = inf_{u ∈ U} ∫ g(x′) P(dx′|x, u)

g(x) + h(x) = inf_{u ∈ U} [ c(x, u) + ∫ h(x′) P(dx′|x, u) ]

with

g(x) = ∫ g(x′) P(dx′|x, f(x))

g(x) + h(x) = c(x, f(x)) + ∫ h(x′) P(dx′|x, f(x))
Theorem 7.1.1 Let (g, h, f) be a canonical triplet.

a) If g is a constant and lim sup_{n→∞} (1/n) E^Π_x[h(x_n)] = 0 for all x and under every policy Π, then the stationary deterministic policy Π* = {f} is optimal, so that

g = J(x, Π*) = inf_{Π ∈ Π_A} J(x, Π),

where

J(x, Π) = lim sup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ].

b) If g, considered above, is not a constant and depends on x, then

lim sup_{T→∞} (1/T) E^{Π*}_x[ ∑_{t=0}^{T−1} g(x_t) ] ≤ inf_Π lim sup_{N→∞} (1/N) E^Π_x[ ∑_{t=0}^{N−1} c(x_t, u_t) ].

Furthermore, Π* = {f} is optimal.
Proof: We prove (a); (b) follows from a similar reasoning. For any policy Π,

E^Π_x[ ∑_{t=1}^{n} ( h(x_t) − E^Π[h(x_t) | x_{[0,t−1]}, u_{[0,t−1]}] ) ] = 0.

Now,

E^Π[h(x_t) | x_{[0,t−1]}, u_{[0,t−1]}] = ∫_y h(y) P(x_t ∈ dy | x_{t−1}, u_{t−1})                                        (7.1)
  = c(x_{t−1}, u_{t−1}) + ∫_y h(y) P(dy | x_{t−1}, u_{t−1}) − c(x_{t−1}, u_{t−1})                                      (7.2)
  ≥ min_{u_{t−1} ∈ U} [ c(x_{t−1}, u_{t−1}) + ∫_y h(y) P(dy | x_{t−1}, u_{t−1}) ] − c(x_{t−1}, u_{t−1})                (7.3)
  = g + h(x_{t−1}) − c(x_{t−1}, u_{t−1})                                                                               (7.4)

The above hold with equality if Π* = {f} is adopted, since Π* provides the pointwise minimum. Hence,

0 ≤ (1/n) E^Π_x[ ∑_{t=1}^{n} ( h(x_t) − g − h(x_{t−1}) + c(x_{t−1}, u_{t−1}) ) ]

and

g ≤ E^Π_x[h(x_n)]/n − E^Π_x[h(x_0)]/n + (1/n) E^Π_x[ ∑_{t=1}^{n} c(x_{t−1}, u_{t−1}) ],

with equality under Π*.
⋄
7.1.1 The Discounted Cost Approach to the Average Cost Problem
The average cost emphasizes the asymptotic behavior of the cost, whereas the discounted cost emphasizes the short-term stage-wise costs. However, under technical restrictions, one can show that in the limit as the discount factor converges to 1, one obtains a solution for the average cost optimization. We now state one such condition.
Theorem 7.1.2 Consider a controlled Markov chain where the state and action spaces are finite, and for every deterministic stationary policy the entire state space is a recurrent set. Let

J_β(x) = inf_{Π ∈ Π_A} J_β(x, Π) = inf_{Π ∈ Π_A} E^Π_x[ ∑_{t=0}^{∞} β^t c(x_t, u_t) ]

and suppose that Π*_n is an optimal deterministic policy for J_{β_n}(x). Then, there exists some Π* which is optimal for every β sufficiently close to 1, and is also optimal for the average cost

J(x) = inf_{Π ∈ Π_A} lim sup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ].
Proof. Let βn ↑ 1. For each βn , Jβn is achieved by a stationary and deterministic policy. Since there are only finitely many
such policies, there exists at least one policy f ∗ which is optimal for infinitely many βn ; call this sequence βnk . We will
show that this policy is optimal for the average cost problem also.
It follows that J_{β_{n_k}}(x, f*) ≤ J_{β_{n_k}}(x, Π) for all Π. Furthermore, it can be shown that J_β(x) is continuous in β on (0, 1); as a result, it follows that for some β* < 1, J_β(x, f*) ≤ J_β(x, Π) for all β ∈ (β*, 1).
Now,

(1 − β_{n_k}) J_{β_{n_k}}(x, f*) ≤ (1 − β_{n_k}) J_{β_{n_k}}(x, Π)      (7.5)

and

lim_{n_k→∞} (1 − β_{n_k}) J_{β_{n_k}}(x, f*) = lim sup_{n_k→∞} (1 − β_{n_k}) J_{β_{n_k}}(x, f*)
  ≤ lim sup_{n_k→∞} (1 − β_{n_k}) J_{β_{n_k}}(x, Π) ≤ lim sup_{T→∞} (1/T) E^Π_x[ ∑_{k=0}^{T−1} c(x_k, u_k) ]      (7.6)
Here, the sequence of inequalities follows from the following Abelian inequalities (see Lemma 5.3.1 in [31]): Let {a_m} be a sequence of non-negative numbers and β ∈ (0, 1). Then,

lim inf_{N→∞} (1/N) ∑_{m=0}^{N−1} a_m ≤ lim inf_{β↑1} (1 − β) ∑_{m=0}^{∞} β^m a_m
  ≤ lim sup_{β↑1} (1 − β) ∑_{m=0}^{∞} β^m a_m ≤ lim sup_{N→∞} (1/N) ∑_{m=0}^{N−1} a_m      (7.7)
As a result, f ∗ is optimal.
⋄
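To make the vanishing-discount argument concrete, the sketch below (Python/numpy, with a small randomly generated MDP of my own construction) computes discounted-optimal deterministic policies by value iteration for discount factors approaching 1 and reports the long-run average cost each policy attains.

import numpy as np

rng = np.random.default_rng(3)
nX, nU = 4, 2
P = rng.dirichlet(np.ones(nX), size=(nX, nU))   # P[x, u, x'] : transition kernel
c = rng.uniform(0.0, 1.0, size=(nX, nU))        # stage cost c(x, u)

def discounted_policy(beta, iters=5000):
    """Value iteration for the beta-discounted problem; returns a deterministic policy."""
    J = np.zeros(nX)
    for _ in range(iters):
        Q = c + beta * np.einsum('xuy,y->xu', P, J)
        J = Q.min(axis=1)
    return Q.argmin(axis=1)

def average_cost(policy, T=200000):
    """Long-run average cost of a stationary deterministic policy along one sample path."""
    x, total = 0, 0.0
    for _ in range(T):
        u = policy[x]
        total += c[x, u]
        x = rng.choice(nX, p=P[x, u])
    return total / T

for beta in (0.9, 0.99, 0.999):
    pol = discounted_policy(beta)
    print(f"beta = {beta}: policy = {pol}, empirical average cost = {average_cost(pol):.4f}")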
7.1.2 Polish State and Action Spaces, ACOE and ACOI [Optional]
In the following, we assume that c is bounded, that the measurable selection hypothesis holds, that X is a Polish space, and that U is a compact subset of a Polish space.
Consider the value function for a discounted cost problem as discussed in Section 5.4:
J_β(x) = min_{u ∈ U} { c(x, u) + β ∫_X J_β(y) Q(dy|x, u) },   ∀x.
Let x_0 be an arbitrary state and for all x ∈ X consider

J_β(x) − J_β(x_0) = min_{u ∈ U} [ c(x, u) + β ∫ P(dx′|x, u) (J_β(x′) − J_β(x_0)) ] − (1 − β) J_β(x_0)      (7.8)
As argued in Section 5.4, this has a solution for every β ∈ (0, 1).
A family of functions F mapping a separable metric space S to R is said to be equicontinuous at a point x_0 ∈ S if, for every ε > 0, there exists a δ > 0 such that d(x, x_0) ≤ δ =⇒ |f(x) − f(x_0)| ≤ ε for all f ∈ F. The family F is said to be equicontinuous if it is equicontinuous at each x ∈ S.
Now suppose that J_β(x) − J_β(x_0) is equicontinuous (over β) and X is compact (or, in the non-compact case, that the difference is uniformly upper and lower bounded). By the Arzelà-Ascoli Theorem, taking β ↑ 1 along some sequence, for some subsequence J_{β_{n_k}}(x) − J_{β_{n_k}}(x_0) → η(x), and by the Abelian inequality (7.7), (1 − β_{n_k}) J_{β_{n_k}}(x_0) → ζ*, possibly along a further subsequence.
Further analysis involving the exchange of minimum and the limit shows that one obtains the Average Cost Optimality
Equation (ACOE):
η(x) = min_{u ∈ U} [ c(x, u) + ∫ P(dx′|x, u) η(x′) ] − ζ*      (7.9)
We now make this observation formal.
Assumption 7.1.1
(a) The one stage cost function c is in Cb (X × U).
(b) The stochastic kernel η( · |x, u) is weakly continuous in (x, u) ∈ X × U, i.e., if (xk , uk ) → (x, u), then η( · |xk , uk ) →
η( · |x, u) weakly.
(c) X and U are compact.
Note that since the one stage cost function c is bounded by some L ≥ 0, we must have
(d) (1 − β)Jβ∗ (z) ≤ L for all β ∈ (0, 1) and z ∈ X.
In addition to Assumption 7.1.1, we impose the following assumption in this section.
Assumption 7.1.2
There exist α ∈ (0, 1), N ≥ 0, a nonnegative function b on X, and a state z_0 ∈ X such that:

(e) −N ≤ h_β(z) ≤ b(z) for all z ∈ X and β ∈ [α, 1), where h_β(z) = J*_β(z) − J*_β(z_0).

(f) The sequence {h_{β(k)}} is equicontinuous, where {β(k)} is a sequence of discount factors converging to 1 which satisfies lim_{k→∞} (1 − β(k)) J*_{β(k)}(z) = ρ* for all z ∈ X, for some ρ* ∈ [0, L].
Theorem 7.1. [50] Under Assumptions 7.1.1 and 7.1.2, there exist a constant ρ* ≥ 0, a continuous and bounded h from X to R with −N ≤ h(·) ≤ b(·), and {f*} ∈ S such that (ρ*, h, f*) satisfies the ACOE; that is,

ρ* + h(z) = min_{a ∈ U} [ c(z, a) + ∫ h(y) η(dy|z, a) ] = c(z, f*(z)) + ∫ h(y) η(dy|z, f*(z)),

for all z ∈ X. Moreover, {f*} is optimal and ρ* is the value, i.e.,

inf_ϕ J(ϕ, z) =: J*(z) = J({f*}, z) = ρ*,

for all z ∈ X.
Proof. By Assumption 7.1.2-(f) and the Arzelà-Ascoli theorem, there exists a subsequence {h_{β(k_l)}} of {h_{β(k)}} which converges uniformly to a continuous and bounded function h. Take the limit in (7.8) along this subsequence, i.e., consider

ρ* + h(z) = lim_l min_{a ∈ U} [ c(z, a) + β(k_l) ∫ h_{β(k_l)}(y) η(dy|z, a) ]
          = min_{a ∈ U} lim_l [ c(z, a) + β(k_l) ∫ h_{β(k_l)}(y) η(dy|z, a) ]
          = min_{a ∈ U} [ c(z, a) + ∫ h(y) η(dy|z, a) ],

where the exchange of limit and minimum follows from the compactness of U and since

c(z, a) + β(k_l) ∫ h_{β(k_l)}(y) η(dy|z, a) → c(z, a) + ∫ h(y) η(dy|z, a)
uniformly as l → ∞.
⋄
Remark 7.2. One can relax the compactness conditions and use instead Assumptions 4.2.1 and 5.5.1 of [31], but one then needs strong continuity; see, e.g., Theorem 5.5.4 in [31]. Further conditions also appear in the literature; see [24] for such results and a detailed literature review.
In addition, if one cannot verify the equi-continuity assumption, the following holds; note that the condition of strong
continuity is required here.
Theorem 7.1.3 (Theorem 5.4.3 in [31]) Suppose that for every measurable and bounded g, the integral ∫ g(x_{t+1}) P(dx_{t+1} | x_t = x, u_t = u) is continuous in u for every x, and that there exist N < ∞ and a function b(x) with

−N ≤ h_β(x) ≤ b(x),    β ∈ (0, 1), x ∈ X.      (7.10)

Then, the Average Cost Optimality Inequality (ACOI) holds:

η(x) ≥ min_{u ∈ U} [ c(x, u) + ∫ P(dx′|x, u) η(x′) ] − ζ*      (7.11)
The optimality result holds even with an inequality, in that a policy attaining the minimum on the right-hand side of (7.11) is an optimal policy. Since under the hypotheses of the setting we have that lim_{T→∞} sup_Π (1/T) E_x[c(x_T, u_T)] = 0 for every admissible x, Theorem 7.1.1 leads to the result that both the ACOI and the ACOE imply that ζ* is the optimal cost and η is the value function.
Remark 7.4. If under an optimal policy, there exists an atom α (or a pseudo-atom in view of the splitting technique discussed
in Chapter 3) so that in some finite expected time, for any two initial conditions x, x′ , the processes are coupled in the sense
described in Chapter 3, then the condition (7.10) holds since after hitting α the processes will be coupled and equal, leading
to a difference in the costs only until hitting the atom. One sufficient condition is through the coupling construction; see
(4.16). Therefore, the univariate drift condition (4.14) is sufficient (by Theorem 15.0.1 of [38], geometric ergodicity is equivalent to the satisfaction of this drift condition). If this difference is further continuous in x, x′ so that the Arzelà-Ascoli Theorem applies, the ACOE holds (this time, without the strong continuity condition). The assumption that {h_{β(k)}} is equicontinuous is restrictive in general.
We also note that for the above arguments to hold, there does not need to be a single invariant distribution. Here in (7.8),
the pair x and x0 should be picked as a function of the reachable set under a given sequence of policies. The analysis for
such a condition is tedious in general since for every β a different optimal policy will typically be adopted; however, for
certain applications the reachable set from a given point may be independent of the control policy applied.
7.2 Linear Programming Approach to Average Cost Markov Decision Problems
The convex analytic approach of Manne [36] and Borkar [20] (later generalized by Hernández-Lerma and Lasserre [31]) is a powerful approach to the optimization of infinite-horizon problems, leading to an infinite-dimensional linear program. It is particularly effective in proving results on the optimality of stationary policies, for constrained optimization problems, and for infinite-horizon average cost optimization problems. It avoids the use of dynamic programming.
Recall that we are interested in the minimization
inf_{Π ∈ Π_A} lim sup_{T→∞} (1/T) E^Π_{x_0}[ ∑_{t=1}^{T} c(x_t, u_t) ],      (7.12)

where E^Π_{x_0}[·] denotes the expectation over all sample paths with initial state given by x_0 under the admissible policy Π.
We first consider the finite space setting where both X and U are finite sets. We study the limit distribution of the following
occupation measures.
v_T(D) = (1/T) ∑_{t=1}^{T} 1_{{(x_t, u_t) ∈ D}},   D ∈ B(X × U).
Define an F_t-measurable process, for A ∈ B(X),

F_t(A) = ∑_{s=1}^{t} 1_{{x_s ∈ A}} − t ∑_{X×U} P(A|x, u) v_t(x, u).
We have that
E[Ft (A)|Ft−1 ] = Ft−1 (A)
∀t ≥ 0,
and {Ft (A)} is a martingale sequence.
Furthermore, F_t(A) is a bounded-increment martingale since |F_t(A) − F_{t−1}(A)| ≤ 1. Hence, for every T > 2, {F_1(A), . . . , F_T(A)} forms a martingale sequence with uniformly bounded increments, and we can invoke the Azuma-Hoeffding inequality to show that for all x > 0,

P( |F_t(A)/t| ≥ x ) ≤ 2 e^{−2 x^2 t}.
Finally, invoking the Borel-Cantelli Lemma [19] for the summability of the estimate above, that is,

∑_{n=1}^{∞} 2 e^{−2 x^2 n} < ∞,   ∀x > 0,

we deduce that

lim_{t→∞} F_t(A)/t = 0   a.s.
Thus,

lim_{T→∞} [ v_T(A) − ∑_{X×U} P(A|x, u) v_T(x, u) ] = 0,   A ⊂ X.
Define

G_X = { v ∈ P(X × U) : v(B × U) = ∑_{x,u} P(x_{t+1} ∈ B | x, u) v(x, u),  B ∈ B(X) }.

Further define

G = { v ∈ P(X × U) : ∃ Π ∈ Π_S,  v(A) = ∑_{x,u} P^Π((x_{t+1}, u_{t+1}) ∈ A | x) v(x, u),  A ∈ B(X × U) }.
It is evident that G ⊂ G_X since there are fewer restrictions for G_X. We can show that these two sets are indeed equal: for v ∈ G_X, if we write v(x, u) = π(x) P(u|x), then we can construct a consistent v ∈ G via v(B × C) = ∑_{x ∈ B} π(x) P(C|x).

Thus, every converging sequence v_T will satisfy the above equation (note that we do not claim that every sequence converges); hence, any converging sequence of occupation measures v_T will asymptotically live in the set G.

This is where finiteness is helpful: if the state space were countable, we would also have to consider the sequences which possibly escape to infinity.
Lemma 7.2.1 Under any admissible policy, with probability 1, any converging sequence {vt } will converge to the set G.
Let us define

γ* = inf_{v ∈ G} ∑ v(x, u) c(x, u).
The above optimality argument holds also in a stronger sample-path sense. Consider the following:

inf_Π lim sup_{T→∞} (1/T) ∑_{t=1}^{T} c(x_t, u_t),      (7.13)

where there is no expectation. The above is known as the sample-path cost. Let ⟨v, c⟩ := ∑ v(x, u) c(x, u).
Now, we have that

lim inf_{T→∞} ⟨v_T, c⟩ ≥ γ*

since for any sequence v_{T_k} which converges to the liminf value, there exists a further subsequence (due to the (weak) compactness of the space of occupation measures) which has a weak limit, and this weak limit is in G. By Fatou's Lemma,

lim_{T_k→∞} ⟨v_{T_k}, c⟩ = ⟨ lim_{T_k→∞} v_{T_k}, c ⟩ ≥ γ*.

Likewise, for the average cost problem,

lim inf_{T→∞} E[⟨v_T, c⟩] ≥ E[ lim inf_{T→∞} ⟨v_T, c⟩ ] ≥ γ*.
Lemma 7.2.2 [Proposition 9.2.5 in [37]] The set G is convex and closed. Let G_e denote the set of extreme points of G. A measure is in G_e if and only if two conditions are satisfied: i) the control policy inducing it is deterministic, and ii) under this deterministic policy, the induced Markov chain is irreducible on the support set of the invariant measure (there cannot be two invariant measures under this deterministic policy).
Proof:
Consider η, an occupation measure in G which is non-extremal. Then, this measure can be written as a convex combination of two measures:

η(x, u) = Θ η^1(x, u) + (1 − Θ) η^2(x, u),
π(x) = Θ π^1(x) + (1 − Θ) π^2(x).

Let φ^1(u|x), φ^2(u|x) be two policies leading to the ergodic occupation measures η^1 and η^2 and invariant measures π^1 and π^2, respectively. Then, there exists γ(·) such that

φ(u|x) = γ(x) φ^1(u|x) + (1 − γ(x)) φ^2(u|x),

where under φ the occupation measure η is attained.

There are two ways this can be satisfied: i) φ is randomized, or ii) φ is not randomized, with φ^1 = φ^2 deterministic for all x; in the latter case it must be that π^1 and π^2 are different measures with different support sets, and then the measures η^1 and η^2 do not correspond to ergodic occupation measures.
We can show that for the countable state space case a converse also holds, that is, if a policy is randomized it cannot lead to an extreme point unless it has the same invariant measure under some deterministic policy; that is, extreme points are achieved by deterministic policies. Let there be a policy P which randomizes between two deterministic policies P^1 and P^2 at a state α such that with probability θ, P^1 is chosen, and with probability 1 − θ, P^2 is chosen. Let v, v^1, v^2 be the occupation measures corresponding to P, P^1, P^2. Here, α is an accessible atom if E^{P^i}_α[τ_α] < ∞ and P^{P^i}(x, B) = P^{P^i}(y, B) for all x, y ∈ α, where τ_α = min{k > 0 : x_k = α} is the return time to α. In this case, the expected occupation measure can be shown to be equal to

v(x, u) = [ E^{P^1}_α[τ_α] θ v^1(x, u) + E^{P^2}_α[τ_α] (1 − θ) v^2(x, u) ] / [ E^{P^1}_α[τ_α] θ + E^{P^2}_α[τ_α] (1 − θ) ],      (7.14)
which is a convex combination of v 1 and v 2 .
⋄
Hence, a randomized policy cannot lead to an extreme point in the space of invariant occupation measures. We note that,
even if one randomizes at a single state, under irreducibility assumptions, the above argument applies.
The above does not easily extend to uncountable settings. In Section 3.2 of [20], there is a discussion for a specialized
setting.
We can deduce that, for such a finite state space, an optimal policy is deterministic:

Theorem 7.2.1 Let X, U be finite spaces. Furthermore, suppose that for every stationary policy there is a unique invariant distribution supported on a single communicating class. Then a sample-path optimal policy and an average cost optimal policy exist. These policies are deterministic and stationary.

Proof:
G is a compact set, so there exists an optimal v. Furthermore, by some further study, one realizes that the set G is closed and convex, and its extreme points are obtained by deterministic policies.
⋄
Let µ* be the optimal occupation measure (this exists, as discussed above, since the state space is finite, and thus G is compact, and ∑_{X×U} µ(x, u) c(x, u) is continuous in µ(·, ·)). This induces an optimal policy π(u|x) as

π(u|x) = µ*(x, u) / ∑_{u′ ∈ U} µ*(x, u′).

Thus, we can find the optimal policy conveniently, without using dynamic programming.
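For a finite model, the above reduces to a finite-dimensional linear program over occupation measures. The sketch below (Python with numpy and scipy.optimize.linprog; the MDP is a small randomly generated example of my own) minimizes ∑ v(x,u)c(x,u) subject to the invariance and normalization constraints defining G_X, and then recovers a stationary policy by the decomposition above.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
nX, nU = 4, 3
P = rng.dirichlet(np.ones(nX), size=(nX, nU))   # P[x, u, x']
c = rng.uniform(0.0, 1.0, size=(nX, nU))

# Decision variable: v(x, u) flattened to a vector of length nX*nU
# Invariance: sum_{x,u} P(x'|x,u) v(x,u) = sum_u v(x',u)  for each x'
A_eq = np.zeros((nX + 1, nX * nU))
b_eq = np.zeros(nX + 1)
for xp in range(nX):
    for x in range(nX):
        for u in range(nU):
            A_eq[xp, x * nU + u] += P[x, u, xp]
            if x == xp:
                A_eq[xp, x * nU + u] -= 1.0
A_eq[nX, :] = 1.0          # normalization: total mass one
b_eq[nX] = 1.0

res = linprog(c.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
v = res.x.reshape(nX, nU)
print("optimal average cost gamma* =", res.fun)

# Recover a stationary (here deterministic) policy from the occupation measure;
# transient states (zero marginal) are assigned an arbitrary action
marg = v.sum(axis=1)
policy = np.where(marg > 1e-9, v.argmax(axis=1), 0)
print("optimal policy (one action per state):", policy)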
7.2.1 General State/Action Spaces
We now discuss a more general setting where the state and action spaces are Polish. Let φ : X → R be a continuous and bounded function. Define

v_T(φ) = (1/T) ∑_{t=1}^{T} φ(x_t).

Define an F_t-measurable process, with π an admissible control policy (not necessarily stationary or Markov),

F_t(φ) = ∑_{s=1}^{t} φ(x_s) − t ∫_{X×U} ( ∫_X φ(x′) P(dx′|x, u) ) v_t(dx, du).      (7.15)
We define G_X to be the following set in this case:

G_X = { η ∈ P(X × U) : η(D × U) = ∫_{X×U} P(D|z) η(dz),   ∀D ∈ B(X) }.
The above holds under a class of technical conditions:

1. (A) The state process takes values in a compact set X. The control space U is also compact.
2. (A') The cost function satisfies lim_{K_n ↑ X} inf_{u, x ∉ K_n} c(x, u) = ∞ (here, we assume that the space X is locally compact, or σ-compact).
3. (B) There exists a policy leading to a finite cost.
4. (C) The cost function is continuous in x and u.
5. (D) The transition kernel is weakly continuous in the sense that ∫ Q(dz|x, u) v(z) is continuous in both x and u, for every continuous and bounded function v.
Theorem 7.2.2 Under the above Assumptions A, B, C, and D (or A’, B, C, and D) there exists a solution to the optimal
control problem given in (7.13) for almost every initial condition under the optimal invariant probability measure.
The key step is the observation that every expected occupation measure sequence has a weakly converging subsequence. If this exists, then the limit of such a converging sequence will be in G and the analysis presented for the finite
state space case will be applicable.
Let v(A) = E[v_T(A)]. For (C), without the boundedness assumption, the argument follows from the observation that, for ν_n → ν weakly and c lower semi-continuous (see [31]),

lim inf_{n→∞} ∫ ν_n(dx, du) c(x, u) ≥ ∫ ν(dx, du) c(x, u).

Moreover, by sequential compactness, every sequence has a weakly converging subsequence and the result follows.
The issue here is that a sequence of measures on the space of invariant measures vn may have a limiting probability
measure, but this limit may not correspond to an invariant measure under a stationary policy. The weak continuity of the
kernel, and the separability of the space of continuous functions on a compact set, allow for this generalization.
This ensures that every sequence has a converging subsequence weakly. In particular, there exists an optimal occupation
measure.
Lemma 7.2.3 There exists an optimal occupation measure in G under the Assumptions above.
Proof: The problem has now been reduced to

min_µ ∫ µ(dx, du) c(x, u),   s.t.  µ ∈ G.

The set of invariant occupation measures is weakly sequentially compact and the integral is lower semi-continuous on this set of measures.
⋄
Following the Lemma above, there exists an optimal occupation measure, say v(dx, du). This defines the optimal stationary control policy by the decomposition

µ(du|x) = v(dx, du) / ∫_U v(dx, du′),

for all x such that ∫_U v(dx, du′) > 0.
There is a final consideration of reachability; that is, whether from any initial state, or an initial occupation set, the region on which the optimal policy is defined is reached (see [4]).
If the Markov chain is stable and irreducible, then the optimal cost is independent of where the chain starts from, since in
finite time, the states on which the optimal occupation measure has support, can be reached.
If this Markov Chain is not irreducible, then, the optimal stationary policy is only optimal when the state process starts at
the locations where in finite time the optimal stationary policy is applicable. This is particularly useful for problems where
one has control over from which state to start the system.
Remark 7.5. Finally, similar results are possible when the weak continuity assumption on the transition kernel and continuity assumption on c(x, u) are eliminated, provided that the set G is setwise sequentially compact and c is measurable and
bounded. A theorem from [52] leads to the following.
Theorem 7.6. Let µ, µ_n (n ≥ 1) be probability measures. Suppose µ_n → µ setwise, lim_{n→∞} h_n(x) = h(x) for all x ∈ X, and h, h_n (n ≥ 1) are uniformly bounded. Then, lim_{n→∞} ∫ h_n dµ_n = ∫ h dµ.

As a result of this theorem, it can be deduced that G is a closed and compact set, and ∫ c(x, u) ν(dx, du) is continuous in ν, leading to the existence of an optimal policy.
⋄
7.2.2 Extreme Points and the Optimality of Deterministic Policies
Let µ be an invariant measure in G which is not extreme. This means that there exist κ ∈ (0, 1) and invariant occupation measures µ^1, µ^2 ∈ G such that

µ(dx, du) = κ µ^1(dx, du) + (1 − κ) µ^2(dx, du).

Let µ(dx, du) = P(du|x) µ(dx) and, for i = 1, 2, µ^i(dx, du) = P^i(du|x) µ^i(dx). Then,

µ(dx) = κ µ^1(dx) + (1 − κ) µ^2(dx).

Note that µ(dx) = 0 =⇒ µ^i(dx) = 0 for i = 1, 2. As a consequence, the Radon-Nikodym derivative of µ^i with respect to µ exists. Let (dµ^i/dµ)(x) = f^i(x). Then

P(du|x) = P^1(du|x) (dµ^1/dµ)(x) κ + P^2(du|x) (dµ^2/dµ)(x) (1 − κ)

is well-defined, that is,

P(du|x) = P^1(du|x) f^1(x) κ + P^2(du|x) (1 − κ) f^2(x),

with f^1(x) κ + (1 − κ) f^2(x) = 1.
P (du|x) = P 1 (du|x)η(x) + P 2 (du|x)(1 − η(x)),
with the randomization kernel η(x) = f 1 (x)κ. As in Meyn [37], there are two ways where such a representation is possible:
Either P (du|x) is randomized, or P 1 (du|x) = P 2 (du|x) and deterministic but there are multiple invariant probability
measures under P 1 .
The converse direction can also be obtained for the countable state space case: if a measure in G is extreme, then it corresponds to a deterministic policy. The result also holds for the uncountable state space case under further technical conditions; see Section 3.2 of Borkar [20] for further discussion.
Remark 7.7. If there is an atom (or a pseudo-atom constructed through Nummelin’s splitting technique) which is visited
under every stationary policy, then as in the arguments leading to (7.14), one can deduce that on this atom there cannot be
a randomized policy which leads to an extreme point on the set of invariant occupation measures.
7.2.3 Sample-Path Optimality
To apply the convex analytic approach, we require that under any admissible policy the set of sample-path occupation measures be tight, for almost every sample path realization. If this can be established, then the result goes through not only for the expected cost, but also for the sample-path average cost, as discussed for the finite state-action setup.
7.3 Constrained Markov Decision Processes
Consider the following average cost problem:

inf_Π J(x, Π) = inf_Π lim sup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ]      (7.16)
subject to the constraints

lim sup_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} d_i(x_t, u_t) ] ≤ D_i      (7.17)

for i = 1, 2, · · · , m, where m ∈ N.
A linear programming formulation leads to the following result.
Theorem 7.3.1 [49] [3] Let X, U be countable. Consider (7.16-7.17). An optimal policy will randomize between at most
m + 1 deterministic policies.
Ross also discusses a setup with one constraint where a non-stationary history-dependent policy may be used instead of
randomized stationary policies.
Finally, the theory of constrained Markov Decision Processes is also applicable to Polish state and action spaces, but this
requires further technicalities. If there is an accessible atom (or an artificial atom as considered earlier in Chapter 3) under
any of the policies considered, then the randomizations can be made at the atom.
7.4 Exercises
Exercise 7.4.1 Consider the convex analytic method we discussed in class. Let X, U be countable sets and consider the occupation measure

v_T(A) = (1/T) ∑_{t=0}^{T−1} 1_{{x_t ∈ A}},   ∀A ∈ B(X).

While proving that the limit of such a measure process lives in a specific set, we used the following result. Let F_t be the σ-field generated by {x_s, u_s, s ≤ t}. Define an F_t-measurable process

F_t(A) = ∑_{s=1}^{t} 1_{{x_s ∈ A}} − t ∑_{X×U} P(A|x, u) v_t(x, u),

where Π is some arbitrary but admissible control policy. Show that {F_t(A), t ∈ {1, 2, . . . , T}} is a martingale sequence for every T ≥ 1.
Exercise 7.4.2 Let, for a Markov control problem, xt ∈ X, ut ∈ U, where X and U are finite sets denoting the state space
and the action space, respectively. Consider the optimal control problem of the minimization of
lim_{T→∞} (1/T) E^Π_x[ ∑_{t=0}^{T−1} c(x_t, u_t) ],
where c is a bounded function. Further assume that under any stationary control policy, the state transition kernel
P (xt+1 |xt , ut ) leads to an irreducible Markov Chain.
Does there exist an optimal control policy? Can you propose a way to find the optimal policy?
Is the optimal policy also sample-path optimal?
Exercise 7.4.3 Let Π_0 = {γ_0(x)} be some stationary and deterministic policy for a finite state-action space setup. Suppose that we update the policy sequentially as follows:

γ_1(x) = argmin_u { c(x, u) + E[J^0(x_{t+1}) | x_t = x, u_t = u] },   x ∈ X,

where J^0(x) is the value function at state x under Π_0, that is, J^0(x) = c(x, γ_0(x)) + E[J^{−1}(x_{t+1}) | x_t = x, u_t = γ_0(x)] for a given initial J^{−1}. Thus, for n ≥ 1, let

γ_n(x) = argmin_u { c(x, u) + E[J^{n−1}(x_{t+1}) | x_t = x, u_t = u] },   x ∈ X,

and

J^n(x) = c(x, γ_n(x)) + E[J^{n−1}(x_{t+1}) | x_t = x, u_t = γ_n(x)].

Observe that

J^1(x) = c(x, γ_1(x)) + E[J^0(x_{t+1}) | x_t = x, u_t = γ_1(x)]
       ≥ c(x, γ_1(x)) + E[J^1(x_{t+1}) | x_t = x, u_t = γ_1(x)]
       = J^2(x)      (7.18)

and that J^0(x) ≥ J^1(x) ≥ J^2(x) ≥ · · · ↓ J^*(x). From here, conclude that for a finite state and action problem, the sequence of policies constructed above leads to a limit and an optimal policy. This algorithm is known as the Policy Iteration Algorithm.
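For comparison, a sketch of the standard (discounted-cost) policy iteration algorithm, a close relative of the scheme in this exercise that uses exact policy evaluation at each step, is given below (Python/numpy; the small MDP and the discount factor are illustrative choices of my own).

import numpy as np

rng = np.random.default_rng(5)
nX, nU, beta = 5, 3, 0.9
P = rng.dirichlet(np.ones(nX), size=(nX, nU))   # P[x, u, x']
c = rng.uniform(0.0, 1.0, size=(nX, nU))

def evaluate(policy):
    """Solve J = c_pi + beta * P_pi J exactly for a deterministic policy."""
    P_pi = P[np.arange(nX), policy]              # (nX, nX)
    c_pi = c[np.arange(nX), policy]              # (nX,)
    return np.linalg.solve(np.eye(nX) - beta * P_pi, c_pi)

policy = np.zeros(nX, dtype=int)
for it in range(100):
    J = evaluate(policy)                                   # policy evaluation
    Q = c + beta * np.einsum('xuy,y->xu', P, J)            # one-step lookahead
    new_policy = Q.argmin(axis=1)                          # policy improvement
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal policy:", policy)
print("optimal discounted values:", np.round(J, 3))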
Exercise 7.4.4 Consider a two-state, controlled Markov Chain with state space X = {0, 1}, and transition kernel for
t ∈ Z+ :
P (xt+1 = 0|xt = 0) = u0t
P (xt+1 = 1|xt = 0) = 1 − u0t
P (xt+1 = 1|xt = 1) = u1t
P (xt+1 = 0|xt = 1) = 1 − u1t .
Here u^0_t ∈ [0.2, 1] and u^1_t ∈ [0, 1] are the control variables. Suppose the goal is to minimize the quantity

lim sup_{T→∞} (1/T) E^Π_0[ ∑_{t=0}^{T−1} c(x_t, u_t) ],

where

c(0, u^0) = 1 + u^0,    c(1, u^1) = 1.5,  ∀u^1 ∈ [0, 1],

with given α, β ∈ R_+.
Find an optimal policy and find the optimal cost.
Hint: Consider deterministic and stationary policies and analyze the costs corresponding to such policies.
8
Learning and Approximation Methods
In some Markov Decision Problems, one does not know the true transition kernel or the cost function, and may wish to use past data to obtain an asymptotically optimal solution. In some problems, this may also be used as an efficient numerical method to obtain approximately optimal solutions.
There may also be algorithms where a prior probabilistic knowledge on the system dynamics may be used to learn the true
system. One may apply Bayesian or non-Bayesian methods.
Stochastic approximation methods are used extensively in many application areas. A typical stochastic approximation has
the following form:
xt+1 = xt + αt (F (xt ) − xt + wt )
where wt is a noise variable. The goal is to arrive at a point x∗ which satisfies x∗ = F (x∗ ).
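A minimal sketch of such a stochastic approximation iteration (Python/numpy; the map F and the step sizes are illustrative choices of my own) is given below; with decreasing step sizes the iterate settles near the fixed point x* = F(x*).

import numpy as np

rng = np.random.default_rng(6)

def F(x):
    # A hypothetical contraction with fixed point x* = 2 (F(2) = 2)
    return 0.5 * x + 1.0

x = 0.0
for t in range(1, 20001):
    alpha = 1.0 / t                    # step size satisfying the usual conditions
    w = rng.normal(scale=0.5)          # zero-mean noise
    x = x + alpha * (F(x) - x + w)     # x_{t+1} = x_t + alpha_t (F(x_t) - x_t + w_t)

print("iterate after 20000 steps:", round(x, 3), "(fixed point is 2.0)")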
8.1 Q-Learning
Q−learning [56], [53], [8] is a stochastic approximation algorithm that does not require the knowledge of the transition
kernel, or even the cost (or reward) function for its implementation. In this algorithm, the incurred per-stage cost variable is
observed through simulation of a single sample path. When the state and the action spaces are finite, under mild conditions
regarding infinitely often hits for all state-action pairs, this algorithm is known to converge to the optimal cost. We now
discuss this algorithm.
Consider a Markov Decision Problem with finite state and action sets, with the objective of finding

inf_Π E^Π_{x_0}[ ∑_{t=0}^{∞} β^t c(x_t, u_t) ]

for some β ∈ (0, 1).
Let Q : X × U → R denote the Q-factor of a decision maker. Let us assume that the decision maker uses a stationary
random policy Π : X → P(U) and updates its Q-factors as: for t ≥ 0,
Q_{t+1}(x, u) = Q_t(x, u) + α_t(x, u) [ c(x, u) + β min_v Q_t(x_{t+1}, v) − Q_t(x, u) ]      (8.1)
where the initial condition Q_0 is given, α_t(x, u) is the step-size for (x, u) at time t, u_t is chosen according to some policy Π, and the state x_t evolves according to P(· | x_t, u_t) starting at x_0.

Assumption 8.1.1 For all (x, u) and t ≥ 0,

a) α_t(x, u) ∈ [0, 1];
b) α_t(x, u) = 0 unless (x, u) = (x_t, u_t);
c) α_t(x, u) is a (deterministic) function of (x_0, u_0), . . . , (x_t, u_t) (it can also be made a function only of t, x and u);
d) ∑_{t≥0} α_t(x, u) = ∞, w.p. 1;
e) ∑_{t≥0} α_t^2(x, u) ≤ C, w.p. 1, for some (deterministic) constant C < ∞.

In the above, one may take α_t(x, u) to be deterministic for every (x, u) pair.
Let F be an operator on the Q-factors defined by

F(Q)(x, u) = c(x, u) + β ∑_{x′} P(x′|x, u) min_v Q(x′, v).      (8.2)

Consider the following fixed point equation:

Q*(x, u) = F(Q*)(x, u) = c(x, u) + β ∑_{x′} P(x′|x, u) min_v Q*(x′, v),      (8.3)

whose solution exists by the contraction arguments as in Chapter 5.

Theorem 8.1.1 Under Assumption 8.1.1, the algorithm (8.1) converges almost surely to Q*. A policy f* which satisfies min_u Q*(x, u) = Q*(x, f*(x)) is an optimal policy.
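A sketch of the Q-learning iteration (8.1) in Python/numpy, for a small randomly generated MDP of my own construction, is given below; actions are explored uniformly at random so that all state-action pairs are visited infinitely often, and the step sizes are taken as α_t(x,u) = 1/(number of visits to (x,u)), consistent with Assumption 8.1.1.

import numpy as np

rng = np.random.default_rng(7)
nX, nU, beta = 4, 2, 0.9
P = rng.dirichlet(np.ones(nX), size=(nX, nU))   # P[x, u, x']
c = rng.uniform(0.0, 1.0, size=(nX, nU))

Q = np.zeros((nX, nU))
visits = np.zeros((nX, nU))
x = 0
for t in range(500000):
    u = rng.integers(nU)                          # uniform exploration policy
    x_next = rng.choice(nX, p=P[x, u])
    visits[x, u] += 1
    alpha = 1.0 / visits[x, u]                    # step size for the visited pair only
    Q[x, u] += alpha * (c[x, u] + beta * Q[x_next].min() - Q[x, u])
    x = x_next

# Compare with the fixed point of (8.3), computed by iterating F on the Q-factors
Q_star = np.zeros((nX, nU))
for _ in range(2000):
    Q_star = c + beta * np.einsum('xuy,y->xu', P, Q_star.min(axis=1))
print("max |Q - Q*| =", np.abs(Q - Q_star).max())
print("greedy policy from Q-learning:", Q.argmin(axis=1))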
8.2 Proof of Convergence
8.3 Exercises
Exercise 8.3.1 Simulate through Matlab a Q-learning algorithm for a finite state controlled Markov chain: Let X = {1, 2, 3, 4}, U = {a, b}, with P(x_{t+1} = j | x_t = k, u_t = a) = P^a(k, j) and P(x_{t+1} = j | x_t = k, u_t = b) = P^b(k, j), c : X × U → R_+, and the objective function lim_{T→∞} E^Π_x[ ∑_{t=0}^{T−1} β^t c(x_t, u_t) ] for some β ∈ (0, 1).
9
Decentralized Stochastic Control
In the following, we primarily follow Chapters 2, 3, 4 and 12 of [62] for this topic. For a more complete coverage, the
reader may follow [62].
We will consider a collection of decision makers (DMs) where each has access to some local information variable. Such a collection of decision makers who wish to minimize a common cost function and who have an agreement on the system (that is, the probability space on which the system is defined, and the policy and action spaces) is said to be a team. To study such team problems in a systematic fashion, we will obtain classifications of such teams.
9.1 Witsenhausen’s Intrinsic Model
In the following, we consider a decentralized stochastic control model with N stations.
A decentralized control system may either be sequential or non-sequential. In a sequential system, the decision makers
(DMs) act according to an order that is specified before the system starts running; while in a non-sequential system the
DMs act in an order that depends on the realization of the system uncertainty and the control actions of other DMs.
According to the intrinsic model, any (finite horizon) sequential team problem can be characterized by a tuple (Ω, F ),
N , {(Ui , U i ), i = 1, . . . , N }, {J i , i = 1, . . . , N } or equivalently by a tuple (Ω, F ), N , {(Ui , U i ), i = 1, . . . , N },
{(Ii , I i ), i = 1, . . . , N } where
• (Ω, F) is a measurable space representing all the uncertainty in the system. The realization of this uncertainty is called the primitive variable of the system. Ω denotes all possible realizations of the primitive random variable and F is a sigma-algebra over Ω.
• N denotes the number of decision makers (DMs) in the system. Each DM takes only one action. If the system has a control station that takes multiple actions over time, it is modeled as a collection of DMs, one for each time instant.
• {(Ui, U i), i = 1, . . . , N} is a collection of measurable spaces representing the action space of each DM. The control action ui of DM i takes values in Ui, and U i is a sigma-algebra over Ui.
• {J i, i = 1, . . . , N} is a collection of sub-sigma-fields of F and represents the information available to a DM to take an action. Sometimes it is useful to assume that the information is available in terms of an explicit observation that takes values in a measurable space (Ii, I i). Such an observation is generated by a measurable observation function from Ω × U1 × · · · × Ui−1 to Ii. The collection {J i, i = 1, . . . , N} or {(Ii, I i), i = 1, . . . , N} is called the information structure of the system.
• A control strategy (also called a control policy or design) of a decentralized control system is given by a collection {γ i, i = 1, . . . , N} of measurable functions γ i : (Ii, I i) → (Ui, U i) (or equivalently, γ i : (Ω, J i) → (Ui, U i)). Let Γ denote the set of all such measurable policies.
Although there are different ways to define a control objective for a decentralized system, we focus on minimizing a loss function. Other performance measures include minimizing regret, minimizing risk, ensuring safety, and ensuring stability.

We will assume that we are given a probability measure P on (Ω, F) and a real-valued loss function ℓ on (Ω × U1 × · · · × UN, F ⊗ U 1 ⊗ · · · ⊗ U N) =: (H, H). Any choice γ = (γ 1, . . . , γ N) of the control strategy induces a probability measure P γ on (H, H). We define the performance J(γ) of a strategy as the expected loss (under the probability measure P γ), i.e.,

J(γ) = E γ[ℓ(ω, u1, . . . , uN)]

where ω is the primitive variable (or the primitive random variable, since a measure is specified) and ui is the control action of DM i.
As an example, consider the following model of a system with two decision makers which is taken from [62]. Let Ω =
{ω1 , ω2 , ω3 }, F be the power set of Ω. Let the action space be U1 = {U (up), D(down)}, U2 = {L(left), R(right)},
and U 1 and U 2 be the power sets of U1 and U 2 respectively. Let the information fields J 1 = {∅, {ω1}, {ω2 , ω3 }, Ω}
and J 2 = {∅, {ω1, ω2 }, {ω3 }, Ω}. (This information corresponds to the non-identical imperfect (quantized) measurement
setting considered in [62]).
Suppose the probability measure P is given by P (ωi ) = pi , i = 1, 2, 3 and p1 = p2 = 0.3, p3 = 0.4, and the loss function
ℓ(ω, u1 , u2 ) is given by
u2
u2
u2
L R
L R
L R
u1 U 1 0 U 2 3 U 1 2
D 3 1 D 2 1 D 0 2
ω : ω1 ↔ 0.3 ω2 ↔ 0.3 ω3 ↔ 0.4
For the above model, the unique optimal control strategy is given by

γ^{1,*}(y^1) = U if y^1 = {ω1}, and D otherwise;      γ^{2,*}(y^2) = R if y^2 = {ω1, ω2}, and L otherwise.
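Since each DM here has only finitely many measurable policies (two information atoms and two actions each), the optimal strategy can be verified by brute force. The sketch below (Python; the loss values are transcribed from the tables as reconstructed above) enumerates all 4 × 4 policy pairs and prints the minimizing one together with its expected loss.

from itertools import product

# Probabilities of the primitive variable and the loss tables l(w, u1, u2)
probs = {"w1": 0.3, "w2": 0.3, "w3": 0.4}
loss = {
    "w1": {("U", "L"): 1, ("U", "R"): 0, ("D", "L"): 3, ("D", "R"): 1},
    "w2": {("U", "L"): 2, ("U", "R"): 3, ("D", "L"): 2, ("D", "R"): 1},
    "w3": {("U", "L"): 1, ("U", "R"): 2, ("D", "L"): 0, ("D", "R"): 2},
}

# Information atoms: DM1 sees {w1} vs {w2, w3}; DM2 sees {w1, w2} vs {w3}
atom1 = {"w1": 0, "w2": 1, "w3": 1}
atom2 = {"w1": 0, "w2": 0, "w3": 1}

best = None
for g1 in product("UD", repeat=2):          # DM1: one action per information atom
    for g2 in product("LR", repeat=2):      # DM2: one action per information atom
        J = sum(probs[w] * loss[w][(g1[atom1[w]], g2[atom2[w]])] for w in probs)
        if best is None or J < best[0]:
            best = (J, g1, g2)

print("optimal expected loss:", best[0])    # 0.3 with the table values above
print("gamma^1 on atoms ({w1}, {w2,w3}):", best[1])
print("gamma^2 on atoms ({w1,w2}, {w3}):", best[2])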
The development of a systematic solution procedure to a generalized sequential decentralized stochastic control problem is
a difficult task. Most of the work in the literature has concentrated on identifying solution techniques for specific subclasses.
Typically, these subclasses are characterized on the basis of the information structure of the system.
9.1.1 Static and dynamic information structures
An information structure is called static if the observation of each DM depends only on the primitive random variable (and not on the control actions of others). Systems that do not have a static information structure are said to have a dynamic information structure. In such systems, some DMs influence the observations of others through their actions.
Witsenhausen [58] showed that any dynamic decentralized control system can be converted to a static decentralized control
system by an appropriate change of measures. However, very little is known regarding the solution of a non-LQG static
system; hence, the above transformation is not practically useful.
9.1.2 Classical, quasiclassical and nonclassical information structures
Centralized control systems are a special case of decentralized control systems; their characterizing feature is centralization
of information, i.e., any DM knows the information available to all the DMs that acted before it, or formally, J i ⊆ J i+1
for all i. Such information structures are called classical.
A decentralized system is called quasiclassical or partially nested if the following condition holds: whenever DM i can
influence DM j (that is the information at DM j is dependent on the action of DM i), then DM j must know the observations
of DM i, or more formally, J i ⊆ J j . We will use the notation DM i → DM j to represent this relationship.
Any information structure that is not classical or quasiclassical is called nonclassical.
In a state space model, we assume that the decentralized control system has a state xt that is evolving with time. The
evolution of the state is controlled by the actions of the control stations. We assume that the system has N control stations
where each control station i chooses a control action uit at time t. The system runs in discrete time, either for finite or
infinite horizon.
Let X denote the space of realizations of the state xt , and Ui denote the space of realization of control actions uit . Let T
denote the set of time for which the system runs.
The initial state x_1 is a random variable and the state of the system evolves as

x_{t+1} = f_t(x_t, u^1_t, . . . , u^N_t; w^0_t),   t ∈ T,      (9.1)
where {wt0 , t ∈ T } is an independent noise process that is also independent of x1 .
We assume that each control station i observes the following at time t
yti = gti (xt , wti ),
(9.2)
where {wti , t ∈ T } are measurement noise processes that are independent across time, independent of each other, and
independent of {wt0 , t ∈ T } and x1 .
The above evolution does not completely describe the dynamic control system, because we have not specified the data
available at each control station. In general, the data Iti available at control station i at time t will be a function of all the
past system variables {x[1,t] , y[1,t] , u[1,t−1] , w[1,t] }, i.e.,
Iti = ηti (x[1,t] , y[1,t] , u[1,t−1] , w[1,t] ),
(9.3)
where we use the notation u = {u1 , . . . , uN } and x[1,t] = {x1 , . . . , xt }. The collection {Iti , i = 1, . . . , N , t ∈ T } is
called the information structure of the system.
When T is finite, say equal to {1, . . . , T }, the above model is a special case of the sequential intrinsic model presented
above. The set {x1 , wt0 , wt1 , . . . , wtN , t ∈ T } denotes the primitive random variable with probability measure given by the
product measure of the marginal probabilities; the system has N × T DMs, one for each control station at each time. DM
(i, t) observes Iti and chooses uit . The information sub-fields J k are determined by {ηti , i = 1, . . . , N , t ∈ T }.
Some important information structures are
1. Complete information sharing: In complete information sharing, each DM has access to present and past measurements
and past actions of all DMs. Such a system is equivalent to a centralized system.
Iti = {y[1,t] , u[1,t−1] }, t ∈ T .
2. Complete measurement sharing: In complete measurement sharing, each DM has access to the present and past measurements of all DMs. Note that past control actions are not shared.
Iti = {y[1,t] },
t∈T.
3. Delayed information sharing: In delayed information sharing, each DM has access to n-step delayed measurements and control actions of all DMs.

   I^i_t = { y^i_{[t−n+1,t]}, u^i_{[t−n+1,t−1]}, y_{[1,t−n]}, u_{[1,t−n]} } for t > n,  and  I^i_t = { y^i_{[1,t]}, u^i_{[1,t−1]} } for t ≤ n.      (9.4)
4. Delayed measurement sharing: In delayed measurement sharing, each DM has access to n-step delayed measurements of all DMs. Note that control actions are not shared.

   I^i_t = { y^i_{[t−n+1,t]}, u^i_{[1,t−1]}, y_{[1,t−n]} } for t > n,  and  I^i_t = { y^i_{[1,t]}, u^i_{[1,t−1]} } for t ≤ n.
5. Delayed control sharing: In delayed control sharing, each DM has access to n-step delayed control actions of all DMs. Note that measurements are not shared.

   I^i_t = { y^i_{[1,t]}, u^i_{[t−n+1,t−1]}, u_{[1,t−n]} } for t > n,  and  I^i_t = { y^i_{[1,t]}, u^i_{[1,t−1]} } for t ≤ n.
6. Periodic information sharing: In periodic information sharing, the DMs share their measurements and control actions periodically, after every k time steps. No information is shared at other time instants.

   I^i_t = { y^i_{[⌊t/k⌋k,t]}, u^i_{[⌊t/k⌋k,t]}, y_{[1,⌊t/k⌋k]}, u_{[1,⌊t/k⌋k]} } for t ≥ k,  and  I^i_t = { y^i_{[1,t]}, u^i_{[1,t−1]} } for t < k.
7. Completely decentralized information: In a completely decentralized system, no data is shared between the DMs.

   I^i_t = { y^i_{[1,t]}, u^i_{[1,t−1]} },   t ∈ T.
In all the information structures given above, each DM has perfect recall (PR), that is, each DM has full memory of its
past information. In general, a DM need not have perfect recall. For example, a DM may only have access to its current
observation, in which case the information structure is
Iti = {yti },
t∈T.
(9.5)
To complete the description of the team problem, we have to specify the loss function. We will assume that the loss function
is of an additive form:
ℓ(x_{[1,T]}, u_{[1,T]}) = ∑_{t ∈ T} c(x_t, u_t)      (9.6)
where each term in the summation is known as the incremental (or stagewise) loss.
The objective is to choose control laws γti such that uit = γti (Iti ) so as to minimize the expected loss (9.6). In the sequel,
we will denote the set of all measurable control laws γti under the given information structure by Γti .
9.1.3 Solutions to Static Teams
Definition 9.1.1 Let J(γ) := E[c(ω, γ^1(η^1(ω)), . . . , γ^N(η^N(ω)))]. We say that a policy γ* is person-by-person optimal if

J(γ*) ≤ J(γ^{*1}, . . . , γ^{*(k−1)}, β, γ^{*(k+1)}, . . . , γ^{*N}),   ∀β ∈ Γ^k,  k = 1, 2, . . . , N.      (9.7)

Definition 9.1.2 A policy γ* is optimal if

J(γ*) ≤ J(γ),   ∀γ ∈ Γ.
Definition 9.1.3 A team decision rule γ* is stationary if J(γ*) < ∞ and

(d/du^i) E[ c(ω, γ^{*−i}(y), u^i) | y^i ] |_{u^i = γ^{*i}(y^i)} = 0,

for i = 1, · · · , N. Here, γ^{*−i} = {γ^{*k}, k ∈ {1, · · · , N}, k ≠ i}.
Definition 9.1.4 A cost function is locally finite at γ ∈ Γ if (i) J(γ) < ∞ and (ii) for any δ ∈ Γ , there exist k1 , k2 , · · · , kN
so that
J(γ 1 + h1 δ 1 , · · · , γ N + hN δ N ) < ∞,
for all |h1 | ≤ k1 , |h2 | ≤ k2 , · · · , |hN | ≤ kN .
Theorem 9.1.1 (Radner [47]) Let c(ω, u) be measurable on Ω × ∏_{k=1}^{N} U^k, where each U^i is a real finite-dimensional vector space, let

inf_{γ ∈ Γ} J(γ) < ∞,

let γ* be stationary, and let J be locally finite at γ*. Then γ* is optimal.
Krainak et al. [34] relax some of the conditions in Radner's theorem above as follows.

Theorem 9.1.2 Let, for every ω ∈ Ω, c(ω, u) be convex and continuously differentiable in u, and suppose that inf_{γ ∈ Γ} J(γ) > −∞. Let γ* be stationary and suppose that the following expectations exist for all γ ∈ Γ such that J(γ) < ∞:

E[ ∑_{i=1}^{N} c_{u^i}(ω, γ^*(y))^T ( γ^i(y^i) − γ^{*i}(y^i) ) ],

where c_{u^i} denotes the partial derivative of c with respect to u^i. Then, γ* is optimal. If c(ω, u) is strictly convex, then γ* is the unique optimal solution (almost everywhere).
Quadratic-Gaussian teams
Given a probability space (Ω, F, P_Ω) and an associated vector-valued random variable ξ, let {J; Γ^i, i ∈ N} be a static stochastic team problem with the following specifications, where N = {1, . . . , N} denotes the set of DMs:

(i) U^i ≡ R^{m_i}, i ∈ N; i.e., the action spaces are unconstrained Euclidean spaces.

(ii) The loss function is a quadratic function of u for every ξ:

L(ξ; u) = ∑_{i,j ∈ N} u_i′ R_{ij}(ξ) u_j + 2 ∑_{i ∈ N} u_i′ r_i(ξ) + c(ξ)      (9.8)

where R_{ij}(ξ) is a matrix-valued random variable (with R_{ij} ≡ R_{ji}′), r_i(ξ) is a vector-valued random variable, and c(ξ) is a random variable, all generated by measurable mappings of the random state of nature ξ.

(iii) L(ξ; u) is strictly (and uniformly) convex in u a.s., i.e., there exists a positive scalar α such that, with R(ξ) defined as a matrix comprised of N × N blocks, the ij'th block given by R_{ij}(ξ), the matrix R(ξ) − αI is positive definite a.s., where I is the identity matrix of the appropriate dimension.

(iv) R(ξ) is uniformly bounded above, i.e., there exists a positive scalar β such that the matrix βI − R(ξ) is positive definite a.s.

(v) Y^i ≡ R^{r_i}, i ∈ N, i.e., the measurement spaces are Euclidean spaces.
(vi) y^i = η^i(ξ), i ∈ N, for some appropriate Borel measurable functions η^i, i ∈ N.

(vii) Γ^i is the (Hilbert) space of all Borel measurable mappings γ^i : R^{r_i} → R^{m_i} which have bounded second moments, i.e., E_{y^i}[γ^i(y^i)′ γ^i(y^i)] < ∞.

(viii) E_ξ[r_i′(ξ) r_i(ξ)] < ∞, i ∈ N;  E_ξ[c(ξ)] < ∞.
Definition 9.1.5 A static stochastic team is quadratic if it satisfies (i)–(viii) above. It is a standard quadratic team if, furthermore, the matrix R is constant for all ξ (i.e., it is deterministic). If, in addition, ξ is a Gaussian random vector, and r_i(ξ) = Q^i ξ, η^i(ξ) = H^i ξ, i ∈ N, for some deterministic matrices Q^i, H^i, the decision problem is a quadratic-Gaussian team (more widely known as a linear-quadratic-Gaussian (LQG) team under some further structure on Q^i and H^i).
⋄
One class of quadratic teams for which the team-optimal solution can be obtained in closed form is the one where the random state of nature ξ is a Gaussian random vector. Let us decompose ξ into N + 1 block vectors
ξ = (x′, y^{1′}, y^{2′}, . . . , y^{N′})′   (9.9)
of dimensions r_0, r_1, r_2, . . . , r_N, respectively. Being a Gaussian random vector, ξ is completely described in terms of its mean value and covariance matrix, which we specify below:
E[ξ] =: (x̄′, ȳ^{1′}, . . . , ȳ^{N′})′   (9.10)
cov(ξ) =: Σ, with [Σ]_{ij} =: Σ_{ij},   i, j = 0, 1, . . . , N.   (9.11)
Here [Σ]_{ij} denotes the ij'th block of the matrix Σ, of dimension r_i × r_j, which stands for the cross-covariance between the i'th and j'th block components of ξ. We further assume (in addition to the natural condition Σ ≥ 0) that Σ_{ii} > 0 for i ∈ N, which means that the measurement vectors y^i have nonsingular distributions. To complete the description of the quadratic-Gaussian team, we finally take the linear terms r_i(ξ) in the loss function (9.8) to be linear in x, which makes x the "payoff relevant" part of the state of nature:
r_i(ξ) = D^i x,   i ∈ N,   (9.12)
where D^i is an (m_i × r_0) dimensional constant matrix.
In the characterization of the team-optimal solution for the quadratic-Gaussian team we will need the following important
result on the conditional distributions of Gaussian random vectors.
Lemma 9.1.1 Let z and y be jointly Gaussian distributed random vectors with mean values z̄, ȳ, and covariance
cov(z, y) = ( Σ_zz   Σ_zy
             Σ_zy′  Σ_yy ) ≥ 0,   Σ_yy > 0.   (9.13)
Then, the conditional distribution of z given y is Gaussian, with mean
E[z|y] = z̄ + Σ_zy Σ_yy^{−1} (y − ȳ)   (9.14)
and covariance
cov(z|y) = Σ_zz − Σ_zy Σ_yy^{−1} Σ_zy′.   (9.15)
⋄
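Equations (9.14)–(9.15) are straightforward to evaluate numerically. The following is a minimal sketch added for illustration; the covariance blocks below are arbitrary placeholders, not quantities from the text.

import numpy as np

# Blocks of cov(z, y) as partitioned in (9.13); placeholder values.
Sigma_zz = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_zy = np.array([[0.8], [0.3]])
Sigma_yy = np.array([[1.5]])
z_bar = np.array([0.0, 0.0])
y_bar = np.array([0.0])

def conditional_gaussian(y):
    # E[z|y] = z_bar + Sigma_zy Sigma_yy^{-1} (y - y_bar)        (9.14)
    gain = Sigma_zy @ np.linalg.inv(Sigma_yy)
    mean = z_bar + gain @ (y - y_bar)
    # cov(z|y) = Sigma_zz - Sigma_zy Sigma_yy^{-1} Sigma_zy'      (9.15)
    cov = Sigma_zz - gain @ Sigma_zy.T
    return mean, cov

mean, cov = conditional_gaussian(np.array([1.0]))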
The complete solution to the quadratic-Gaussian team is now given in the following Theorem.
Theorem 9.1.3 [62] The quadratic-Gaussian team decision problem as formulated above admits a unique team-optimal solution, which is affine in the measurement of each agent:
γ^{i*}(y^i) = Π^i (y^i − ȳ^i) + M^i x̄,   i ∈ N.   (9.16)
Here, Π^i is an (m_i × r_i) matrix (i ∈ N), uniquely solving the set of linear matrix equations
R_ii Π^i Σ_ii + ∑_{j∈N, j≠i} R_ij Π^j Σ_ji + D^i Σ_{0i} = 0,   i ∈ N,   (9.17)
and M^i is an (m_i × r_0) matrix for each i ∈ N, obtained as the unique solution of
∑_{j∈N} R_ij M^j + D^i = 0,   i ∈ N.   (9.18)
Remark 9.1. The proof of this result follows immediately from Theorem 9.1.2. However, a Projection Theorem based
concise proof can also be provided exploiting the quadratic nature of the problem [?]. If one can show orthogonality of
the estimation error to the space of linear policies, the Gaussianity of the random variables implies orthogonality to the
subspace of measurable functions with finite second moment, leading to the desired conclusion.
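To make (9.17)–(9.18) concrete, the following minimal sketch solves them for a two-agent team with scalar actions and measurements. All numerical values are placeholders chosen only for illustration (R positive definite, Σ positive definite); they are not data from the text.

import numpy as np

R = np.array([[2.0, 0.5],
              [0.5, 3.0]])            # R_{ij}; positive definite (strict convexity)
D = np.array([1.0, -1.0])             # D^i (scalars here)
# Covariance of (x, y^1, y^2); row/column 0 corresponds to x.
Sigma = np.array([[1.0, 0.6, 0.4],
                  [0.6, 1.2, 0.3],
                  [0.4, 0.3, 1.1]])

# (9.17): R_ii Pi^i Sigma_ii + sum_{j != i} R_ij Pi^j Sigma_ji + D^i Sigma_0i = 0.
A = np.array([[R[0, 0] * Sigma[1, 1], R[0, 1] * Sigma[2, 1]],
              [R[1, 0] * Sigma[1, 2], R[1, 1] * Sigma[2, 2]]])
b = -np.array([D[0] * Sigma[0, 1], D[1] * Sigma[0, 2]])
Pi = np.linalg.solve(A, b)

# (9.18): sum_j R_ij M^j + D^i = 0.
M = np.linalg.solve(R, -D)

# Optimal policies (9.16): gamma^i(y^i) = Pi[i] * (y^i - ybar^i) + M[i] * xbar.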
An important application of the above result is the following static Linear Quadratic Gaussian problem. Consider a two-controller system evolving in R^n with the following description: let x_1 be Gaussian and
x_2 = A x_1 + B^1 u^1_1 + B^2 u^2_1 + w_1,
y^1_1 = C^1 x_1 + v^1_1,
y^2_1 = C^2 x_1 + v^2_1,
with w, v^1, v^2 zero-mean, i.i.d. disturbances. For ρ_1, ρ_2 > 0, let the goal be the minimization of
J(γ^1, γ^2) = E[ ||x_1||_2^2 + ρ_1 ||u^1_1||_2^2 + ρ_2 ||u^2_1||_2^2 + ||x_2||_2^2 ]   (9.19)
over the control policies of the form
u^i_1 = μ^i_1(y^i_1),   i = 1, 2.
For such a setting, optimal policies are linear.
9.2 Dynamic Teams with Quasi-Classical Information Patterns
Under quasi-classical information, LQG stochastic team problems are tractable by conversion into equivalent static team problems. Consider the following dynamic team with N agents, where each agent acts only once, with agent Ak, k ∈ N, having the measurement
y^k = C^k ξ + ∑_{i: i→k} D^{ik} u^i,   (9.20)
where ξ is an exogenous random variable picked by nature, i → k denotes the precedence relation that the action of Ai affects the information of Ak, and u^i is the action of Ai.
If the information structure is quasi-classical, then
I^k = {y^k, {I^i, i → k}}.
That is, Ak has access to the information available to all the signaling agents. Such an IS is equivalent to the IS I^k = {ỹ^k}, where ỹ^k is a static measurement given by
ỹ^k = ( C^k ξ, {C^i ξ, i → k} ).   (9.21)
Such a conversion can be done provided that the policies adopted by the agents are deterministic, with the equivalence to be interpreted in the sense that any deterministic policy that is measurable under the original IS is also measurable under the new (static) IS, and vice versa, since the actions are determined by the measurements. The restriction to deterministic policies is, however, without any loss of optimality: with the policies of all other agents fixed (possibly randomized), no agent can benefit from randomized decisions in such team problems. We discussed this property of irrelevance of random information/actions in optimal stochastic control in Chapter 5 in view of Blackwell's Irrelevant Information Theorem.
This observation, made by Ho and Chu [33], leads to the following result.
Theorem 9.2.1 Consider an LQG system with a partially nested information structure. For such a system, optimal solutions
are affine (that is, linear plus a constant).
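As a minimal added illustration of the conversion behind this result, consider two agents with 1 → 2, measurements y^1 = C^1 ξ and y^2 = C^2 ξ + D^{12} u^1, and the partially nested information I^1 = {y^1}, I^2 = {y^2, I^1}. Since u^1 = γ^1(y^1) is determined by I^1 ⊂ I^2 (for a deterministic policy), agent 2 can subtract D^{12} γ^1(y^1) from y^2; hence the IS is equivalent to the static one ỹ^1 = C^1 ξ, ỹ^2 = (C^2 ξ, C^1 ξ), as in (9.21), and the static theory of Section 9.1 applies.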
Remark 9.2. Another class of dynamic team problems that can be converted into solvable dynamic optimization problems consists of those where, even though the information structure is nonclassical, there is no incentive for signaling: any signaling from, say, agent Ai to agent Aj conveys information to the latter which is "cost irrelevant," that is, it does not lead to any improvement in performance [61], [62].
Remark 9.3. Witsenhausen [59] has shown that any sequential dynamic team can be converted to a static decentralized
control system by an appropriate change of measures and cost functions, provided that a mild conditional density condition
holds (see Section 3.7 of [62]).
9.2.1 Non-classical information structures, signaling and its effect on lack of convexity
What makes a large number of problems possessing the nonclassical information structure difficult is the fact that signaling
is present: Signaling is the policy of communication through control actions. Under signaling, the decision makers apply
their actions to affect the information available at the other decision makers. In this case, the control policies induce a
probabilistic map (hence, a channel or a stochastic kernel) from the exogenous random variable space to the observation
space of the signaled decision makers. For the nonclassical case, the problem thus also features an information transmission
aspect, and the signaling decision maker’s objective also includes the design of an optimal measurement channel.
Consider the following example [61]. Consider a two-controller system evolving in Rn :
xt+1 = Axt + B 1 u1t + B 2 u2t + wt ,
yt1 = C 1 xt + vt1 ,
yt2 = C 2 xt + vt2 ,
where w, v^1, v^2 are zero-mean, i.i.d. disturbances, and A, B^1, B^2, C^1, C^2 are matrices of appropriate dimensions. For ρ_1, ρ_2 > 0, let the objective be the minimization of the cost functional
J = E[ ∑_{t=1}^{T} ( |x_t|^2 + ρ_1 |u^1_t|^2 + ρ_2 |u^2_t|^2 ) + |x_T|^2 ]
over control policies of the form
u^i_t = μ^i_t(y^i_{[0,t]}, u^i_{[0,t−1]}),   i = 1, 2;   t = 0, 1, . . . , T − 1.
For a multi-stage problem (say with T = 2), unlike the T = 1 case in (9.19), the cost is in general no longer convex in the action variables of the controllers acting in the first stage t = 0. This is because these actions might affect the estimation quality of the other controller in the future stages, if one DM can signal information to the other DM in one stage. We note that this condition is equivalent to C^1 A^l B^2 ≠ 0 or C^2 A^l B^1 ≠ 0, with l + 1 denoting the delay in signaling and l = 0 in the problem considered.
9.2.2 Expansion of information structures: A recipe for identifying sufficient information
We start with a general result on optimum-performance equivalence of two stochastic dynamic teams with different information structures. This result has a very simple proof, but it is quite effective, as we will see shortly.
Proposition 9.2.1 Let D1 and D2 be two stochastic dynamic teams with the same loss function, and differing only in
their information structures, η 1 and η 2 , respectively, with corresponding composite strategy spaces Γ1 and Γ2 , such that
Γ2 ⊆ Γ1 . Let D1 admit a team-optimal solution, denoted by γ ∗1 ∈ Γ1 , with the further property that γ ∗1 ∈ Γ2 . Then γ ∗1
also solves D2 .
A recipe for utilizing the result above would be [62]:
Given a team problem, say D2, with IS η^2, which is presumably difficult to solve, obtain a finer IS η^1, and solve the team problem under this expanded IS (assuming that this new team problem is easier to solve). Then, if the team-optimal solution obtained is adapted to the sigma-field generated by the original coarser IS, it also solves the original problem D2.
9.3 Common Knowledge as Information State and the Dynamic Programming Approach to
Team Decision Problems
In a team problem, if all the random information at any given decision maker is common knowledge between all decision
makers, then the system is essentially centralized. If only some of the system variables are common knowledge, the remaining unknowns may or may not lead to a computationally tractable program generating an optimal solution. A possible
approach toward establishing a tractable program is through the construction of a controlled Markov chain where the controlled Markov state may now live in a larger state space (for example, a space of probability measures) and the actions may be elements of function spaces. This controlled Markov construction may lead to a computation of optimal policies.
Such a dynamic programming approach has been adopted extensively in the literature (see for example, [5], [60], [22], [1],
[61], [42] and generalized in [43]) through the use of a team-policy which uses common information to generate partial
functions for each DM to generate their actions using local information. Thus, in the dynamic programming approach,
a separation of team decision policies in the form of a two-tier architecture, a higher-level controller and a lower-level
controller, can be established with the use of common knowledge.
In the following, we present the ingredients of such an approach, as formalized by Nayyar, Mahajan and Teneketzis [43]
and termed the common information approach:
1. Elimination of irrelevant information at the DMs: In this step, irrelevant local information at the DMs, say DM k, is identified as follows. By letting the policies at the other DMs be arbitrary, the policy of DM k can be optimized as a best-response function, and irrelevant data at DM k can be removed.
2. Construction of a coordinated system: This step identifies the common information and the local/private information at the DMs, after Step 1 above has been carried out. A fictitious coordinator (higher-level controller) uses the common information to generate team policies, which in turn dictate to the (lower-level) DMs what to do with their local information.
3. Formulation of the cost function as a Partially Observed Markov Decision Process (POMDP), in view of the coordinator's optimal control problem: A fundamental result in stochastic control is that the problem of optimal control of a partially observed Markov chain (with additive per-stage costs) can be solved by turning the problem into a fully observed one on a larger state space where the state is replaced by the "belief" on the state (a minimal belief-update sketch is given after this list).
4. Solution of the POMDP, which leads to the structural results for the coordinator to generate optimal team policies, which in turn dictate to the DMs what actions to take given their local information realizations.
5. Establishment of the equivalence between the solution obtained and the original problem, and translation of the optimal
policies. Any coordination strategy can be realized in the original system. Note that, even though there is no real
coordinator, such a coordination can be realized implicitly, due to the presence of common information.
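As a minimal added illustration of the belief recursion referred to in Step 3, the following sketch propagates the coordinator's belief for an arbitrary finite-state placeholder model (the matrices below are not from the text):

import numpy as np

# Placeholder finite model: P[u] is the transition matrix under the (team) action u,
# O[y] collects the observation likelihoods P(y | x) for each state x.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.6, 0.4], [0.5, 0.5]])}
O = {0: np.array([0.7, 0.2]),
     1: np.array([0.3, 0.8])}

def belief_update(pi, u, y):
    # Predict through the controlled transition, then condition on the
    # commonly observed y by Bayes' rule.
    predicted = pi @ P[u]
    unnormalized = predicted * O[y]
    return unnormalized / unnormalized.sum()

pi = np.array([0.5, 0.5])                  # common prior P(x_0)
pi = belief_update(pi, u=0, y=1)           # one step of the coordinator's POMDP filter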
We will provide a further explicit setting with such a recipe at work, in the context of the k-stage periodic belief sharing
pattern in the next section. In particular, Lemma 9.4.1 and Lemma 9.4.2 will highlight this approach. When a given
information structure does not allow for the construction of a controlled Markov chain even in a larger, but fixed for all
time stages, state space, one question that can be raised is what information requirements would lead to such a structure.
We will also investigate this problem in the context of the one-stage belief sharing pattern in the next section.
9.4 k-Stage Periodic Belief Sharing Pattern
In this section, we will use the term belief for a probability measure-valued random variable. This terminology has been
used particularly in the artificial intelligence and computer science communities, which we adopt here. We will, however,
make precise what we mean by such a belief process in the following.
9.4.1 k-stage periodic belief sharing pattern
As mentioned earlier and discussed in detail in Appendix ??, a fundamental result in stochastic control is that the problem
of optimal control of a partially observed Markov chain can be solved by turning the problem into a fully observed one on a
larger state space where the state is replaced by the belief on the state. Such an approach is very effective in the centralized
setting; in a decentralized setting, however, the notion of a state requires further specification. In the following, we illustrate
this approach under the k-step periodic belief sharing information pattern.
Consider a joint process {x_t, y_t, t ∈ Z_+}, where we assume for simplicity that the spaces in which x_t, y_t take values are finite dimensional real vector spaces or countable sets. They are generated by
x_{t+1} = f(x_t, u^1_t, . . . , u^L_t, w_t),
y^i_t = g(x_t, v^i_t),
where xt is the state, uit ∈ Ui is the control action, (wt , vti , 1 ≤ i ≤ L) are second order, zero-mean, mutually independent,
i.i.d. noise processes. We also assume that the state noise, wt , either has a probability mass function, or a probability
measure with a density function. To minimize the notational clutter, P (x) will denote the probability mass function for
discrete-valued spaces or probability density function for continuous spaces.
Suppose that there is a common information vector I^c_t at some time t, which is available to all the decision makers. At times ks − 1, with k > 0 fixed and s ∈ Z_+, the decision makers share all their information: I^c_{ks−1} = {y_{[0,ks−1]}, u_{[0,ks−1]}}, and I^c_0 = {P(x_0)}; that is, at time 0 the DMs have the same a priori belief on the initial state. Hence, at time t, DM i has access to {y^i_{[ks,t]}, I^c_{ks−1}}.
Until the next common observation instant t = k(s + 1) − 1, we can regard the individual decision functions specific to DM i as {u^i_t = γ^i_s(y^i_{[ks,t]}, I^c_{ks−1})}; we let γ_s denote the ensemble of such decision functions and let γ denote the team policy. It then suffices to generate γ_s for all s ≥ 0, as the decision outputs, conditioned on y^i_{[ks,t]} under γ^i_s(y^i_{[ks,t]}, I^c_{ks−1}), can be generated. In such a case, we can define γ_s(·, I^c_{ks−1}) to be the joint team decision rule mapping I^c_{ks−1} into a space of action vectors: {γ^i_s(y^i_{[ks,t]}, I^c_{ks−1}), i ∈ L = {1, 2, . . . , L}, t ∈ {ks, ks + 1, . . . , k(s + 1) − 1}}.
Let [0, T − 1] be the decision horizon, where T is divisible by k. Let the objective of the decision makers be the joint minimization of
E^{γ^1, γ^2, . . . , γ^L}_{x_0} [ ∑_{t=0}^{T−1} c(x_t, u^1_t, u^2_t, . . . , u^L_t) ]
over all policies γ^1, γ^2, . . . , γ^L, with the initial condition x_0 specified. The cost function
J_{x_0}(γ) = E^γ_{x_0} [ ∑_{t=0}^{T−1} c(x_t, u_t) ]
can be expressed as
J_{x_0}(γ) = E^γ_{x_0} [ ∑_{s=0}^{T/k − 1} c̄(γ_s(·, I^c_{ks−1}), x̄_s) ]
with
c̄(γ_s(·, I^c_{ks−1}), x̄_s) = E^γ_{x̄_s} [ ∑_{t=ks}^{k(s+1)−1} c(x_t, u_t) ].
Lemma 9.4.1 [61] Consider the decentralized system setup above. Let I^c_t be a common information vector supplied to the DMs regularly every k time stages, so that the DMs have common memory with a control policy generated as described above. Then, {x̄_s := x_{ks}, γ_s(·, I^c_{ks−1}), s ≥ 0} forms a controlled Markov chain.
In view of the above, we have the following separation result.
Lemma 9.4.2 [61] Let I^c_t be a common information vector supplied to the DMs regularly every k time steps. There is no loss in performance if I^c_{ks−1} is replaced by P(x̄_s | I^c_{ks−1}).
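Lemmas 9.4.1 and 9.4.2 suggest the following dynamic programming recursion over the common belief; this is an added sketch for orientation (see [61] for the precise statements). With π_s := P(x̄_s | I^c_{ks−1}) and terminal condition V_{T/k} ≡ 0,
V_s(π_s) = min_{γ_s} E[ c̄(γ_s, x̄_s) + V_{s+1}(π_{s+1}) | π_s ],
where the minimization is over team decision rules γ_s for the block {ks, . . . , k(s+1) − 1}, and π_{s+1} is obtained from (π_s, γ_s) and the information shared at time k(s+1) − 1 via Bayes' rule.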
An essential issue for a tractable solution is to ensure a common information vector which will act as a sufficient statistic for future control policies. This can be done via sharing information at every stage, or via some structure possibly requiring a larger but finite delay.
The above motivates us to introduce the following pattern.
Definition 9.4.1 (k-stage periodic belief sharing pattern [61]) An information pattern in which the decision makers share their posterior beliefs to reach a joint belief about the system state is called a belief sharing information pattern. If the belief sharing occurs periodically every k stages (k > 1), the DMs also share the control actions they applied in the last k − 1 stages, together with intermediate belief information. In this case, the information pattern is called the k-stage periodic belief sharing information pattern.
Remark 9.4. For k > 1, it should be noted that the exchange of the control actions is essential, as was also observed in the discussion of performance-irrelevant signaling and stochastic nestedness in Section ??. The decision makers also need to exchange information on intermediate beliefs. The following algorithmic discussion will make this clear.
⋄
9.5 Exercises
Exercise 9.5.1 Consider the following team decision problem with dynamics:
xt+1 = axt + b1 u1t + b2 u2t + wt ,
yt1 = xt + vt1 ,
yt2 = xt + vt2 ,
Here x0 , vt1 , vt2 , wt are mutually and temporally independent zero-mean Gaussian random variables.
Let {γ i } be the policies of the controllers so that uit = γti (y0i , y1i , · · · , yti ) for i = 1, 2.
Consider:
min_{γ^1, γ^2} E^{γ^1, γ^2}_{x_0} [ ∑_{t=0}^{T−1} ( x_t^2 + ρ_1 (u^1_t)^2 + ρ_2 (u^2_t)^2 ) + x_T^2 ],
where ρ1 , ρ2 > 0.
Explain if the following are correct or not:
a) For T = 1, the problem is a static team problem.
b) For T = 1, optimal policies are linear.
c) For T = 1, linear policies may be person-by-person-optimal. That is, if γ 1 is assumed to be linear, then γ 2 is linear;
and if γ 2 is assumed to be linear then γ 1 is linear.
d) For T = 2, optimal policies are linear.
e) For T = 2, linear policies may be person-by-person-optimal.
Exercise 9.5.2 Consider a common probability space on which the information available to two decision makers DM1 and DM2 is defined, such that I1 is available at DM1 and I2 is available at DM2.
R. J. Aumann [7] defines an event E to be common knowledge between two decision makers DM1 and DM2 if, whenever E happens, DM1 knows E, DM2 knows E, DM1 knows that DM2 knows E, DM2 knows that DM1 knows E, and so on.
Suppose that one claims that an event E is common knowledge if and only if E ∈ σ(I1 ) ∩ σ(I2 ), where σ(I1 ) denotes the
σ−field generated by information I1 and likewise for σ(I2 ).
Is this argument correct? Provide an answer with precise arguments. You may wish to consult [7], [44], [21] and Chapter
12 of [62].
Exercise 9.5.3 Consider the following static team decision problem with dynamics:
x1 = ax0 + b1 u10 + b2 u20 + w0 ,
y01 = x0 + v01 ,
y02 = x0 + v02 ,
Here v01 , v02 , w0 are independent, Gaussian, zero-mean with finite variances.
Let γ i : R → R be policies of the controllers: u10 = γ01 (y01 ), u20 = γ02 (y02 ).
Find
min_{γ^1, γ^2} E^{γ^1, γ^2}_{ν_0} [ x_1^2 + ρ_1 (u^1_0)^2 + ρ_2 (u^2_0)^2 ],
where ν0 is a zero-mean Gaussian distribution and ρ1 , ρ2 > 0.
Find an optimal team policy γ = {γ 1 , γ 2 }.
Exercise 9.5.4 Consider a linear Gaussian system with mutually independent and i.i.d. noises:
x_{t+1} = A x_t + ∑_{j=1}^{L} B^j u^j_t + w_t,
y^i_t = C^i x_t + v^i_t,   1 ≤ i ≤ L,   (9.22)
with the one-step delayed observation sharing pattern.
Construct a controlled Markov chain for the team decision problem: First show that one could have
{ y^1_t, y^2_t, . . . , y^L_t, P(dx_t | y^1_{[0,t−1]}, y^2_{[0,t−1]}, . . . , y^L_{[0,t−1]}) }
as the state of the controlled Markov chain.
Consider the following problem:
E^γ_{ν_0} [ ∑_{t=0}^{T−1} c(x_t, u^1_t, · · · , u^L_t) ].
For this problem, if at time t ≥ 0 each of the decision makers (say DM i) has access to P(dx_t | y^1_{[0,t−1]}, y^2_{[0,t−1]}, . . . , y^L_{[0,t−1]}) and their local observation y^i_{[0,t]}, show that they can obtain a solution where the optimal decision rules only use
{ P(dx_t | y^1_{[0,t−1]}, y^2_{[0,t−1]}, . . . , y^L_{[0,t−1]}), y^i_t }.
What if they do not have access to P(dx_t | y^1_{[0,t−1]}, y^2_{[0,t−1]}, . . . , y^L_{[0,t−1]}), and only have access to y^i_{[0,t]}? What would be a sufficient statistic for each decision maker for each time stage?
Exercise 9.5.5 Two decision makers, Alice and Bob, wish to control a system:
xt+1 = axt + uat + ubt + wt ,
yta = xt + vta ,
ytb = xt + vtb ,
where uat , yta are the control actions and the observations of Alice, ubt , ytb are those for Bob and vta , vtb , wt are independent
zero-mean Gaussian random variables with finite variance. Suppose the goal is to minimize, for some T ∈ Z_+,
E^{Π^a, Π^b}_{x_0} [ ∑_{t=0}^{T−1} ( x_t^2 + r_a (u^a_t)^2 + r_b (u^b_t)^2 ) ],
for ra , rb > 0, where Π a , Π b denote the policies adopted by Alice and Bob. Let the local information available to Alice be
Ita = {ysa , uas , s ≤ t − 1} ∪ {yta }, and Itb = {ysb , ubs , s ≤ t − 1} ∪ {ytb } is the information available at Bob at time t.
Consider an n−step delayed information pattern: In an n−step delayed information sharing pattern, the information at
Alice at time t is
I^a_t ∪ I^b_{t−n},
and the information available at Bob is
I^b_t ∪ I^a_{t−n}.
State if the following are true or false:
a) If Alice and Bob share all the information they have (with n = 0), it must be that the optimal controls are linear.
b) Typically, for such problems, Bob can try to send information to Alice to improve her estimate of the state through his actions. When is it the case that Alice cannot benefit from such information from Bob; that is, for what values of n is there no need for Bob to signal information this way?
c) If Alice and Bob share all information they have with a delay of 2, then their optimal control policies can be written as
u^a_t = f_a( E[x_t | I^a_{t−2}, I^b_{t−2}], y^a_{t−1}, y^a_t ),
u^b_t = f_b( E[x_t | I^a_{t−2}, I^b_{t−2}], y^b_{t−1}, y^b_t ),
for some functions fa , fb . Here, E[.|.] denotes the expectation.
d) If Alice and Bob share all information they have with a delay of 0, then their optimal control policies can be written as
u^a_t = f_a( E[x_t | I^a_t, I^b_t] ),
u^b_t = f_b( E[x_t | I^a_t, I^b_t] ),
for some functions fa , fb . Here, E[.|.] denotes the expectation.
Exercise 9.5.6 In Theorem 9.1.2, explain why continuous differentiability of the cost function (in addition to its convexity)
is needed for person-by-person-optimality of a team policy to imply global optimality.
A Metric Spaces
Definition A.0.1 A linear (vector) space is a set which is closed under addition and multiplication by a scalar.
Definition A.0.2 A normed linear space X is a linear vector space on which a functional (a mapping from X to R, that is a
member of RX ) called norm is defined such that:
• ||x|| ≥ 0 for all x ∈ X, and ||x|| = 0 if and only if x is the null element (under addition and multiplication) of X.
• ||x + y|| ≤ ||x|| + ||y|| for all x, y ∈ X.
• ||αx|| = |α| ||x|| for all α ∈ R and all x ∈ X.
Definition A.0.3 In a normed linear space X, an infinite sequence of elements {xn } converges to an element x if the
sequence {||xn − x||} converges to zero.
Definition A.0.4 A metric defined on a set X is a function d : X × X → R such that:
• d(x, y) ≥ 0 for all x, y ∈ X, and d(x, y) = 0 if and only if x = y.
• d(x, y) = d(y, x) for all x, y ∈ X.
• d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ X.
Definition A.0.5 A metric space (X, d) is a set equipped with a metric d.
A normed linear space is also a metric space, with metric
d(x, y) = ||x − y||.
An important class of normed spaces that are widely used in optimization and engineering problems are Banach spaces:
A.0.1 Banach Spaces
Definition A.0.6 A sequence {x_n} in a normed space X is Cauchy if for every ǫ > 0, there exists an N such that ||x_n − x_m|| ≤ ǫ for all n, m ≥ N.
The important observation on Cauchy sequences is that every convergent sequence is Cauchy; however, not all Cauchy sequences are convergent: the limit might not live in the original space where the sequence elements take values. This brings up the issue of completeness:
Definition A.0.7 A normed linear space X is complete, if every Cauchy sequence in X has a limit in X. A complete normed
linear space is called Banach.
Banach spaces are important for many reasons including the following one: In optimization problems, sometimes we
would like to see if a sequence converges, for example if a solution to a minimization problem exists, without knowing
what the limit of the sequence could be. Banach spaces allow us to use Cauchy sequence arguments to claim the existence
of optimal solutions. If time allows, we will discuss how this is used by using contraction and fixed point arguments for
transformations.
An example is the following: consider the solutions to the equation Ax = b for A a square matrix and b a vector. In class we will identify conditions under which an iteration of the form x_{k+1} = (I − A)x_k + b forms a Cauchy sequence and converges to a solution, through the contraction principle.
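A minimal numerical sketch of this iteration is given below (added for illustration; the matrix A is a placeholder chosen so that ||I − A||_∞ < 1, making the map a contraction):

import numpy as np

A = np.array([[1.2, 0.1],
              [0.0, 0.9]])
b = np.array([1.0, 2.0])

x = np.zeros(2)
for _ in range(200):
    x = (np.eye(2) - A) @ x + b        # contraction mapping x -> (I - A)x + b
# The fixed point satisfies x = (I - A)x + b, i.e., Ax = b.
print(np.allclose(A @ x, b))           # True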
In applications, we will also discuss completeness of a subset. A subset of a Banach space X is complete if and only if it is closed. If it is not closed, one can exhibit a Cauchy sequence in the subset whose limit lies outside the subset, so the sequence does not converge within it. If the set is closed, every Cauchy sequence in this set has a limit in X and this limit must be a member of the set; hence the set is complete.
Exercise A.0.1 The space of bounded functions {x : [0, 1] → R, sup_{t∈[0,1]} |x(t)| < ∞}, equipped with the norm ||x|| = sup_{t∈[0,1]} |x(t)|, is a Banach space.
The above space is also denoted by L∞([0, 1]; R) or L∞([0, 1]).
Theorem A.0.1 l_p(Z_+; R) := {x ∈ f(Z_+; R) : ||x||_p = ( ∑_{i∈Z_+} |x(i)|^p )^{1/p} < ∞} is a Banach space for all 1 ≤ p ≤ ∞.
Sketch of Proof: The proof is completed in three steps.
(i) Let {x_n} be Cauchy. This implies that for every ǫ > 0, there exists N such that ||x_n − x_m||_p ≤ ǫ for all n, m ≥ N. This also implies that ||x_n||_p ≤ ||x_N||_p + ǫ for all n ≥ N. Now write x_n as the vector (x_n(1), x_n(2), x_n(3), . . .). It follows that for every coordinate i the scalar sequence {x_n(i)} is also Cauchy. Since x_n(i) ∈ R and R is complete, x_n(i) → x*(i) for some x*(i). Thus, the sequence x_n converges pointwise (coordinate by coordinate) to some vector x*.
(ii) Is x* ∈ l_p(Z_+; R)? Define the truncation x_n^K = (x_n(1), x_n(2), . . . , x_n(K), 0, 0, . . .), that is, the vector which is set to zero after the Kth coordinate. Then
||x_n^K||_p ≤ ||x_n||_p ≤ ||x_N||_p + ǫ
for every n ≥ N and every K, and, since there are only finitely many terms in the summation,
lim_{n→∞} ||x_n^K||_p^p = lim_{n→∞} ∑_{i=1}^{K} |x_n(i)|^p = ∑_{i=1}^{K} |x*(i)|^p = ||x^{*,K}||_p^p ≤ (||x_N||_p + ǫ)^p.
Letting K → ∞ and applying the monotone convergence theorem (recall that this theorem says that a monotonically increasing sequence which is bounded has a limit),
lim_{K→∞} ||x^{*,K}||_p^p = ∑_{i=1}^{∞} |x*(i)|^p = ||x*||_p^p ≤ (||x_N||_p + ǫ)^p < ∞,
so x* ∈ l_p(Z_+; R).
(iii) The final question is: does ||x_n − x*||_p → 0? Since the sequence is Cauchy, for n, m ≥ N we have ||x_n − x_m||_p ≤ ǫ, and thus
||x_n^K − x_m^K||_p ≤ ǫ.
Since K is finite,
lim_{m→∞} ||x_n^K − x_m^K||_p = ||x_n^K − x^{*,K}||_p ≤ ǫ.
Letting K → ∞, by the monotone convergence theorem again,
lim_{K→∞} ||x_n^K − x^{*,K}||_p = ||x_n − x*||_p ≤ ǫ   for all n ≥ N.
Hence, ||x_n − x*||_p → 0.
⋄
The above spaces are also denoted lp (Z+ ), when the range space is clear from context.
The following is a useful result.
Theorem A.0.2 (Hölder's Inequality)
∑_t |x(t) y(t)| ≤ ||x||_p ||y||_q,
with 1/p + 1/q = 1 and 1 ≤ p, q ≤ ∞.
Remark: A brief remark on notation: When the range space is R, the notation l_p(Ω) denotes l_p(Ω; R) for a discrete-time index set Ω; likewise, for a continuous-time index set Ω, L_p(Ω) denotes L_p(Ω; R).
⋄
A.0.2 Hilbert Spaces
We first define pre-Hilbert spaces.
Definition A.0.8 A pre-Hilbert space X is a linear vector space on which an inner product is defined on X × X. Corresponding to each pair x, y ∈ X the inner product ⟨x, y⟩ is a scalar (that is, real-valued or complex-valued). The inner product satisfies the following axioms:
1. ⟨x, y⟩ = ⟨y, x⟩* (the superscript * denotes the complex conjugate; an overline on ⟨y, x⟩ will also be used for the complex conjugate)
2. ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩
3. ⟨αx, y⟩ = α⟨x, y⟩
4. ⟨x, x⟩ ≥ 0, with equality if and only if x is the null element.
The following is a crucial result in such a space, known as the Cauchy-Schwarz inequality, the proof of which was presented
in class:
Theorem A.0.3 For x, y ∈ X,
|⟨x, y⟩| ≤ ⟨x, x⟩^{1/2} ⟨y, y⟩^{1/2},
where equality occurs if and only if x = αy for some scalar α.
Exercise A.0.2 In a pre-Hilbert space, ⟨x, x⟩ defines a norm: ||x|| = ⟨x, x⟩^{1/2}.
The proof of this result requires one to show that ⟨x, x⟩^{1/2} satisfies the triangle inequality, that is,
||x + y|| ≤ ||x|| + ||y||,
which can be proven by an application of the Cauchy-Schwarz inequality.
Not all spaces admit an inner product. In particular, however, l_2(N_+; R) admits an inner product with ⟨x, y⟩ = ∑_{n∈N_+} x(n) y(n) for x, y ∈ l_2(N_+; R). Furthermore, ||x|| = ⟨x, x⟩^{1/2} defines a norm in l_2(N_+; R).
The inner product, in the special case of RN , is the usual inner vector product; hence RN is a pre-Hilbert space with the
usual inner-product.
Definition A.0.9 A complete pre-Hilbert space, is called a Hilbert space.
Hence, a Hilbert space is a Banach space, endowed with an inner product, which induces its norm.
Proposition A.0.1 The inner product is continuous: if x_n → x and y_n → y, then ⟨x_n, y_n⟩ → ⟨x, y⟩ for x_n, y_n in a Hilbert space.
Proposition A.0.2 In a Hilbert space X, two vectors x, y ∈ X are orthogonal if ⟨x, y⟩ = 0. A vector x is orthogonal to a set S ⊂ X if ⟨x, y⟩ = 0 for all y ∈ S.
Theorem A.0.4 (Projection Theorem) Let H be a Hilbert space and B a closed subspace of H. For any vector x ∈ H, there is a unique vector m ∈ B such that
||x − m|| ≤ ||x − y||   for all y ∈ B.
A necessary and sufficient condition for m ∈ B to be the minimizing element in B is that x − m is orthogonal to B.
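In R^n with the usual inner product, the Projection Theorem reduces to least squares: projecting x onto the column span of a matrix gives the minimizing element, and the error is orthogonal to that subspace. A minimal sketch with placeholder data (added for illustration):

import numpy as np

Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])           # B = column span of Phi, a closed subspace of R^3
x = np.array([1.0, 2.0, 3.0])

c_star, *_ = np.linalg.lstsq(Phi, x, rcond=None)   # minimize ||x - Phi c||
m = Phi @ c_star                                    # projection of x onto B

print(np.allclose(Phi.T @ (x - m), 0.0))            # orthogonality of the error: True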
A.0.3 Separability
Definition A.0.10 Given a normed linear space X, a subset D ⊂ X is dense in X, if for every x ∈ X, and each ǫ > 0,
there exists a member d ∈ D such that ||x − d|| ≤ ǫ.
Definition A.0.11 A set is countable if its elements can be placed in one-to-one correspondence with a subset of the natural numbers; that is, if the set can be enumerated.
Examples of countable sets are finite sets and the set Q of rational numbers. An example of an uncountable set is the set R of real numbers.
Theorem A.0.5 a) A countable union of countable sets is countable. b) Finite Cartesian products of countable sets are countable. c) Infinite Cartesian products of countable sets may not be countable. d) [0, 1] is not countable.
Cantor’s diagonal argument and the triangular enumeration are important steps in proving the theorem above.
Since a rational number is the ratio of two integers, one may view the rational numbers as (the image of) a subset of the Cartesian product of two countable sets; thus, the rational numbers are countable.
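As an added illustration of the triangular enumeration, the following sketch enumerates pairs of non-negative integers along anti-diagonals; composing it with (i, j) ↦ i/(j + 1) gives a surjection onto the non-negative rationals, which is the idea behind their countability:

def triangular_pairs():
    # Yields (0,0), (0,1), (1,0), (0,2), (1,1), (2,0), ...
    s = 0
    while True:
        for i in range(s + 1):
            yield (i, s - i)
        s += 1

gen = triangular_pairs()
first_ten = [next(gen) for _ in range(10)]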
Definition A.0.12 A space X is separable, if it contains a countable dense set.
Separability essentially tells us that it suffices to work with a countable dense subset, even when the space itself is uncountable. Examples of separable spaces are R, and the set of continuous and bounded functions on a compact set metrized with the maximum distance between the functions.
B On the Convergence of Random Variables
B.1 Limit Events and Continuity of Probability Measures
Given A_1, A_2, . . . ∈ F, define:
lim sup_n A_n = ∩_{n=1}^{∞} ∪_{k=n}^{∞} A_k
lim inf_n A_n = ∪_{n=1}^{∞} ∩_{k=n}^{∞} A_k
For the superior limit, an element is in this set if it is in infinitely many of the A_n's. For the inferior limit, an element is in this set if it is in all but finitely many of the A_n's. The limit of a sequence of sets exists if the above limits are equal.
We have the following result:
Theorem B.1.1 For a sequence of events A_n:
P(lim inf_n A_n) ≤ lim inf_n P(A_n) ≤ lim sup_n P(A_n) ≤ P(lim sup_n A_n).
We have the following regarding continuity of probability measures:
Theorem B.1.2 (i) For a sequence of events A_n with A_n ⊂ A_{n+1},
lim_{n→∞} P(A_n) = P(∪_{n=1}^{∞} A_n).
(ii) For a sequence of events A_n with A_{n+1} ⊂ A_n,
lim_{n→∞} P(A_n) = P(∩_{n=1}^{∞} A_n).
B.2 Borel-Cantelli Lemma
Theorem B.2.1 (i) If ∑_n P(A_n) converges, then P(lim sup_n A_n) = 0. (ii) If {A_n} are independent and if ∑_n P(A_n) = ∞, then P(lim sup_n A_n) = 1.
Exercise B.2.1 Let {An } be a sequence of independent events where An is the event that the nth coin flip is head. What is
the probability that there are infinitely many heads if P (An ) = 1/n2 ?
An important application of the above is the following:
Theorem B.2.2 Let Z_n, n ∈ N, and Z be random variables such that for every ǫ > 0,
∑_n P(|Z_n − Z| ≥ ǫ) < ∞.
Then,
P({ω : lim_{n→∞} Z_n(ω) = Z(ω)}) = 1.
That is, Z_n converges to Z with probability 1.
B.3 Convergence of Random Variables
B.3.1 Convergence almost surely (with probability 1)
Definition B.3.1 A sequence of random variables Xn converges almost surely to a random variable X if P ({ω :
limn→∞ Xn (ω) = X(ω)}) = 1.
B.3.2 Convergence in Probability
Definition B.3.2 A sequence of random variables X_n converges in probability to a random variable X if lim_{n→∞} P(|X_n − X| ≥ ǫ) = 0 for every ǫ > 0.
B.3.3 Convergence in Mean-square
Definition B.3.3 A sequence of random variables Xn converges in the mean-square sense to a random variable X if
limn→∞ E[|Xn − X|2 ] = 0.
B.3.4 Convergence in Distribution
Definition B.3.4 Let Xn be a random variable with cumulative distribution function Fn , and X be a random variable with
cumulative distribution function F . A sequence of random variables Xn converges in distribution (or weakly) to a random
variable X if limn→∞ Fn (x) = F (x) for all points of continuity of F .
Theorem B.3.1 a) Convergence in the almost sure sense implies convergence in probability. b) Convergence in the mean-square sense implies convergence in probability. c) If X_n → X in probability, then X_n → X in distribution.
We also have partial converses for the above results:
Theorem B.3.2 a) If P(|X_n| ≤ Y) = 1 for some random variable Y with E[Y^2] < ∞, and if X_n → X in probability, then X_n → X in mean-square. b) If X_n → X in probability, there exists a subsequence X_{n_k} which converges to X almost surely. c) If X_n → X and X_n → Y (in probability, in mean-square, or almost surely), then P(X = Y) = 1.
A sequence of random variables is uniformly integrable if
lim_{K→∞} sup_n E[ |X_n| 1_{{|X_n| ≥ K}} ] = 0.
Note that, if {X_n} is uniformly integrable, then sup_n E[|X_n|] < ∞.
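As an added illustration of why uniform integrability matters: let U be uniform on [0, 1] and X_n = n 1_{{U ≤ 1/n}}. Then X_n → 0 almost surely, yet E[X_n] = 1 for all n, so X_n does not converge to 0 in mean (or in mean-square); indeed E[X_n 1_{{|X_n| ≥ K}}] = 1 for every n ≥ K, so {X_n} is not uniformly integrable.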
Theorem B.3.3 If {|X_n|^2} is uniformly integrable, then convergence in the almost sure sense implies convergence in mean-square.
Theorem B.3.4 If Xn → X in probability, there exists some subsequence Xnk which converges to X almost surely.
Theorem B.3.5 (Skorohod’s representation theorem) Let Xn → X in distribution. Then, there exists a sequence of
random variables Yn and Y such that, Xn and Yn have the same cumulative distribution functions; X and Y have the
same cumulative distribution functions and Yn → Y almost surely.
With the above, we can prove the following result.
Theorem B.3.6 The following are equivalent: i) X_n converges to X in distribution. ii) E[f(X_n)] → E[f(X)] for all continuous and bounded functions f. iii) The characteristic functions Φ_n(u) := E[e^{iuX_n}] converge pointwise to Φ(u) := E[e^{iuX}] for every u ∈ R.
References
1. M. Aicardi, F. Davoli, and R. Minciardi. Decentralized optimal control of Markov chains with a common past information set. IEEE
Transactions on Automatic Control, 32:1028–1031, November 1987.
2. C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide, third edition. Springer-Verlag, Berlin, 2006.
3. E. Altman. Constrained Markov Decision Processes. Chapman Hall/CRC, Boca Raton, FL, 1999.
4. A. Arapostathis, V. S. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus. Discrete-time controlled Markov processes
with average cost criterion: A survey. SIAM J. Control and Optimization, 31:282–344, 1993.
5. M. Athans. Survey of decentralized control methods. Washington D.C, 1974. 3rd NBER/FRB Workshop on Stochastic Control.
6. K. B. Athreya and P. Ney. A new approach to the limit theory of recurrent Markov chains. Transactions of the American Mathematical Society, 245:493–501, 1978.
7. R. J. Aumann. Agreeing to disagree. Annals of Statistics, 4:1236 – 1239, 1976.
8. W. L. Baker. Learning via stochastic approximation in function space. PhD Dissertation, Harvard University, Cambridge, MA,
1997.
9. D. Bertsekas. Dynamic Programming and Optimal Control Vol. 1. Athena Scientific, 2000.
10. D. P. Bertsekas. Dynamic Programming and Stochastic Optimal Control. Academic Press, New York, New York, 1976.
11. D. P. Bertsekas and S. Shreve. Stochastic Optimal Control: The Discrete Time Case. Academic Press, New York, 1978.
12. P. Billingsley. Convergence of Probability Measures. New York, NY, John Wiley, 1968.
13. P. Billingsley. Probability and Measure. Wiley, New York, 2nd edition, 1986.
14. P. Billingsley. Probability and Measure. Wiley, 3rd edition, New York, 1995.
15. D. Blackwell. Memoryless strategies in finite-stage dynamic programming. Annals of Mathematical Statistics, 35:863–865, 1964.
16. D. Blackwell and C. Ryll-Nardzewski. Non-existence of everywhere proper conditional distributions. Annals of Mathematical
Statistics, pages 223–225, 1963.
17. V. I. Bogachev. Measure Theory. Springer-Verlag, Berlin, 2007.
18. V. S. Borkar. White-noise representations in stochastic realization theory. SIAM J. on Control and Optimization, 31:1093–1102,
1993.
19. V. S. Borkar. Probability Theory: An Advanced Course. Springer, New York, 1995.
20. V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov Decision Processes, E. A. Feinberg,
A. Shwartz (Eds.), pages 347–375. Kluwer, Boston, MA, 2001.
21. A. Brandenburger and E. Dekel. Common knowledge with probability 1. J. Mathematical Economics, 16:237–245, 1987.
22. C. Y. Chong and M. Athans. On the periodic coordination of linear stochastic systems. Automatica, 12:321–335, 1976.
23. S. B. Connor and G. Fort. State-dependent Foster-Lyapunov criteria for subgeometric convergence of Markov chains. Stoch. Process. Appl., 119:4176–4193, 2009.
24. O. L. V. Costa and F. Dufour. Average control of Markov decision processes with Feller transition probabilities and general action spaces. Journal of Mathematical Analysis and Applications, 396(1):58–69, 2012.
25. A. Dembo and O. Zeitouni. Large deviations techniques and applications, volume 38. Springer, 2010.
26. R. Douc, G. Fort, E. Moulines, and P. Soulier. Practical drift conditions for subgeometric rates of convergence. Ann. Appl. Probab,
14:1353–1377, 2004.
27. L. Dubins and D. Freedman. Measurable sets of measures. Pacific J. Math., 14:1211–1222, 1964.
28. R. Durrett. Probability: theory and examples, volume 3. Cambridge university press, 2010.
29. J. K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer, New York, 2003.
30. M. Hairer. Convergence of Markov processes. Lecture Notes, 2010.
31. O. Hernandez-Lerma and J. Lasserre. Discrete-time Markov control processes. Springer, 1996.
32. O. Hernandez-Lerma and J. B. Lasserre. Markov Chains and Invariant Probabilities. Birkhäuser, Basel, 2003.
33. Y. C. Ho and K. C. Chu. Team decision theory and information structures in optimal control problems - part I. IEEE Transactions
on Automatic Control, 17:15–22, February 1972.
34. J.C. Krainak, J.S. Speyer, and S.I. Marcus. Static team problems – part II: Affine control laws, projections, algorithms, and the
LEGT problem. IEEE Transactions Automatic Contr., 27:848–859, 1982.
35. J. B. Lasserre. Invariant probabilities for Markov chains on a metric space. Statistics and Probability Letters, 34:259–265, 1997.
36. A. S. Manne. Linear programming and sequential decision. Management Science, 6:259–267, April 1960.
37. S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
38. S. P. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. Springer Verlag, London, 1993.
39. S. P. Meyn and R. Tweedie. State-dependent criteria for convergence of Markov chains. Ann. Appl. Prob, 4:149–168, 1994.
40. S.P. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. Springer, 1993.
41. S.P. Meyn and R.L. Tweedie. State-dependent criteria for convergence of Markov chains. Annals Appl. Prob., 4(1):149–168, 1994.
42. A. Nayyar, A. Mahajan, and D. Teneketzis. Optimal control strategies in delayed sharing information structures. IEEE Transactions
Automatic Contr., 56:1606–1620, 2011.
43. A. Nayyar, A. Mahajan, and D. Teneketzis. The common-information approach to decentralized stochastic control. In Information
and Control in Networks, Editors: G. Como, B. Bernhardsson, A. Rantzer. Springer, 2013.
44. L. Nielsen. Common knowledge, communication and convergence of beliefs. Mathematical Social Sciences, 8:1–14, 1984.
45. E. Nummelin. A splitting technique for Harris recurrent Markov chains. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 43:309–318, 1978.
46. E. Nummelin. A splitting technique for Harris recurrent Markov chains. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 43:309–318, 1978.
47. R. Radner. Team decision problems. Annals of Mathematical Statistics, 33:857–881, 1962.
48. G.O. Roberts and J.S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.
49. K. W. Ross. Randomized and past-dependent policies for Markov decision processes with multiple constraints. Operations Research,
37:474–477, May 1989.
50. N. Saldi, S. Yüksel, and T. Linder. Near optimality of quantized policies in stochastic control under weak continuity conditions.
arXiv preprint arXiv:1410.6985, 2014.
51. M. Schäl. Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Z.
Wahrscheinlichkeitsth, 32:179–296, 1975.
52. R. Serfozo. Convergence of Lebesgue integrals with varying measures. Sankhyā: The Indian Journal of Statistics, Series A, pages
380–402, 1982.
53. J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
54. P. Tuominen and R.L. Tweedie. Subgeometric rates of convergence of f-ergodic Markov chains. Adv. Appl. Prob., 26(3):775–
798, September 1994.
55. R. L. Tweedie. Drift conditions and invariant measures for Markov chains. Stochastic Processes and Their Applications, 92:345–354,
2001.
56. C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
57. H. S. Witsenhausen. On the structure of real-time source coders. Bell Syst. Tech. J, 58:1437–1451, July/August 1979.
58. H. S. Witsenhausen. Equivalent stochastic control problems. Mathematics of Control, Signals and Systems (Springer Verlag),
1:3–11, January 1988.
59. H.S. Witsenhausen. Equivalent stochastic control problems. Math. Control, Signals and Systems, 1:3–11, 1988.
60. T. Yoshikawa. Dynamic programming approach to decentralized control problems. IEEE Transactions on Automatic Control,
20:796–797, 1975.
61. S. Yüksel. Stochastic nestedness and the belief sharing information pattern. IEEE Transactions on Automatic Control, 54:2773–
2786, December 2009.
62. S. Yüksel and T. Başar. Stochastic Networked Control Systems: Stabilization and Optimization under Information Constraints. Birkhäuser, Boston, MA, 2013.
63. S. Yüksel and S. P. Meyn. Random-time, state-dependent stochastic drift for Markov chains and application to stochastic stabilization
over erasure channels. IEEE Transactions on Automatic Control, 58:47 – 59, January 2013.