In this exposition, we develop a rigorous argument for the existence of an analytical solution for the value of a stationary Markov Reward Process (MRP) with finite state space, infinite horizon, bounded rewards, and a discount factor less than $1$.
Background
We consider MRP $M = (\mathcal{S},\mathrm{P},R,\gamma)$ where:
- $\mathcal{S}$ is the state space (here we assume finiteness: $\lvert \mathcal{S} \rvert < \infty$)
- $\mathrm{P}$ is the (one-step) transition probability matrix
- $R$ is the reward function
- $\gamma$ is the discount factor (here we assume $\gamma \in [0,1)$)
The process is also defined by another parameter: the time horizon $H$, which is the number of time steps in a trajectory (a sequence of observations along a sample path) of the process. Here we focus exclusively on the case in which the MRP has an infinite horizon ($H = \infty$).
The transition probability matrix $\mathrm{P}_t$, for each time step $0 \leqslant t < H$, is of size $\lvert \mathcal{S} \rvert \times \lvert \mathcal{S} \rvert$, with the $(i,j)$-th entry representing the probability of the agent entering state $s_j$ from state $s_i$ in one time step at time $t$ (from $t$ to $t+1$).
This matrix has a nice property: it is always row stochastic (or right stochastic). This follows from the fact that the transition probabilities at any time step $t$ satisfy the probability axioms. More precisely, the matrix $\mathrm{P}_t$ is row stochastic because it is a square matrix of nonnegative real numbers whose rows each sum to $1$.
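As a concrete illustration, here is a minimal numpy sketch with a hypothetical three-state transition matrix (the values are made up for this example), verifying row-stochasticity by checking nonnegativity and unit row sums:

```python
import numpy as np

# A hypothetical 3-state transition matrix; entry (i, j) is P(s_j | s_i).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])

# Row-stochasticity: nonnegative entries, and each row sums to 1.
assert np.all(P >= 0)
assert np.allclose(P.sum(axis=1), 1.0)
```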
The reward function $R_t: \mathbb{R} \times \mathcal{S} \to [0,1]$ gives a probability distribution characterizing the immediate reward received from being in state $s \in \mathcal{S}$ at time $t$:
$$R_t(r; s) = P(r_t = r \mid s_t = s)$$
We note that the reward function can also be defined as what accompanies the transition from $s_t$ to $s_{t+1}$, with a transition-based form $R: \mathbb{R} \times \mathcal{S} \times \mathcal{S} \to [0,1]$. In other words, in this alternative definition the reward occurs because of the transition. Some research studies the distinctions between these reward function definitions [1]. For our purposes, the rest of the exposition assumes the former (state-based) definition. One benefit of this definition is that $r_t$ naturally refers to the immediate reward received from being in state $s_t$, without any ambiguity [2].
We represent the expected immediate rewards at each time step $t$ with a vector $\vec{R_t} \in \mathbb{R}^{\lvert \mathcal{S} \rvert}$, where the $i$-th entry gives the expected immediate reward to be collected in state $s_i$ at time $t$:
$$[\vec{R_{t}}]_{i} = \mathbb{E}\big[r_t \mid s_t = s_i\big]$$
Since rewards are bounded, we have $\lVert \vec{R_{t}} \rVert_{\infty} < \infty$ for all time steps $t$.
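For illustration, a small sketch of how $\vec{R}$ could be formed, assuming (purely for this example, not required by the exposition) that each state's reward distribution has finite support:

```python
import numpy as np

# Hypothetical per-state reward distributions with finite support:
# state index -> (possible reward values, their probabilities).
reward_dist = {
    0: (np.array([0.0, 1.0]),  np.array([0.5, 0.5])),
    1: (np.array([2.0]),       np.array([1.0])),
    2: (np.array([-1.0, 3.0]), np.array([0.9, 0.1])),
}

# Expected immediate reward per state: [R]_i = E[r_t | s_t = s_i].
R = np.array([np.dot(vals, probs) for vals, probs in reward_dist.values()])
print(R)  # approximately [0.5, 2.0, -0.6]
```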
Stationarity Assumption
We assume the MRP $M$ is a stationary process. This means that for any arbitrary time steps $t, t' \in [0, \infty)$ and any number of transitions $k \in \mathbb{N}$:
$$ \begin{aligned} & P(s_t = x_0, r_t = y_0, \cdots, s_{t+k} = x_k, r_{t+k} = y_k) \\ =~& P(s_{t'} = x_0, r_{t'} = y_0, \cdots, s_{t'+k} = x_k, r_{t'+k} = y_k) \end{aligned} $$
Therefore, by the definition of conditional probability, the MRP $M$ exhibits time-homogeneity: for any arbitrary states $s_i, s_j \in \mathcal{S}$, arbitrary reward $r \in \mathbb{R}$, and arbitrary time steps $t, t' \in [0, \infty)$:
$$ P(s_{t+1} = s_j, r_t = r \mid s_t = s_i) = P(s_{t'+1} = s_j, r_{t'} = r \mid s_{t'} = s_i) $$
To see why this is true, we apply the stationarity condition with the number of transitions set to $k = 1$ (for the numerator below) and $k = 0$ (for the denominator). We then have:
$$ \begin{aligned} P(s_{t+1} = s_j, r_t = r \mid s_t = s_i) &= \frac{P(s_t = s_i, r_t = r, s_{t+1} = s_j)}{P(s_t = s_i)} \\ &= \frac{\int_{r'} P(s_t = s_i, r_t = r, s_{t+1} = s_j, r_{t+1} = r') ~dr'}{\int_{r''} P(s_t = s_i, r_t = r'') ~dr''} \\ &= \frac{\int_{r'} P(s_{t'} = s_i, r_{t'} = r, s_{t'+1} = s_j, r_{t'+1} = r') ~dr'}{\int_{r''} P(s_{t'} = s_i, r_{t'} = r'') ~dr''} \\ &= \frac{P(s_{t'} = s_i, r_{t'} = r, s_{t'+1} = s_j)}{P(s_{t'} = s_i)} \\ &= P(s_{t'+1} = s_j, r_{t'} = r \mid s_{t'} = s_i) \end{aligned} $$
With marginalization, we see both the transition probability and reward function are time-homogeneous:
$$ \begin{aligned} & \big[\mathrm{P}_{t}\big]_{i,j} = P(s_{t+1} = s_j \mid s_{t} = s_i) = P(s_{t'+1} = s_j \mid s_{t'} = s_i) = \big[\mathrm{P}_{t'}\big]_{i,j} \\ & R_{t}(r; s_i) = P(r_t = r \mid s_t = s_i) = P(r_{t'} = r \mid s_{t'} = s_i) = R_{t'}(r; s_i) \end{aligned} $$
Therefore we can drop the time subscript from the transition probability matrix and the (expected) reward vector, and simply write $\mathrm{P}$ and $\vec{R}$.
Value of MRP
We define the return $G_t$ to be the discounted sum of rewards from time $t$ to the horizon $H$:
$$ \begin{aligned} G_t &= r_t + \gamma^{1} \cdot r_{t+1} + \cdots + \gamma^{H-t} \cdot r_{H} \\ &= \sum_{i=t}^{H} \gamma^{i-t} \cdot r_{i} \end{aligned} $$
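As a quick sketch, the return can be computed numerically from a sequence of sampled rewards by truncating the sum; with $\gamma < 1$ and bounded rewards, the truncation error shrinks geometrically:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_{i=0}^{len(rewards)-1} gamma**i * rewards[i]."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Example: a constant reward of 1 with gamma = 0.9; the truncated return
# approaches 1 / (1 - 0.9) = 10 as the horizon grows.
print(discounted_return([1.0] * 200, gamma=0.9))  # ~10.0
```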
This definition of return allows us to describe the value of an MRP. Specifically, we use $V_t(\cdot)$ to denote the value function for the MRP at time $t$, and $V_t(s)$ is the expected return from starting in state $s \in \mathcal{S}$ at time $t$:
$$ V_t(s) = \mathbb{E}\big[G_t \mid s_t = s\big] $$
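As a sanity check on this definition, here is a Monte Carlo sketch that estimates $V(s)$ by averaging truncated returns over simulated trajectories. It reuses the hypothetical $\mathrm{P}$ and expected rewards from the earlier examples and, for simplicity, treats rewards as deterministic per state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state MRP; rewards are deterministic per state here.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])
R = np.array([0.5, 2.0, -0.6])
gamma = 0.9

def estimate_value(start_state, n_episodes=2000, horizon=200):
    """Average truncated discounted returns over simulated trajectories."""
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[s]        # collect reward for being in state s
            discount *= gamma
            s = rng.choice(len(R), p=P[s])  # sample the next state
        total += g
    return total / n_episodes

print(estimate_value(0))  # Monte Carlo estimate of V(s_0)
```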
By the infiniteness of the time horizon $H$ and the time-homogeneity of the transition and reward functions, we claim that the value function of the type of MRP we are considering is also time-homogeneous. Let $\tau_t = (a_0, b_0, a_{1}, b_{1}, \cdots)$ be a trajectory starting from time step $t$, where the $a_i$'s are states and the $b_i$'s are rewards. Then by the chain rule, we have:
$$ \begin{aligned} & P(\tau_t) \\ =~& P(s_t = a_0) \cdot P(r_t = b_0 \mid s_t = a_0) \cdot P(s_{t+1} = a_1 \mid s_t = a_0, r_t = b_0) \cdots \end{aligned} $$
Since MRP $M$ satisfies the Markov property, we have:
$$ \begin{aligned} & P(s_t = a_0) \cdot P(r_t = b_0 \mid s_t = a_0) \cdot P(s_{t+1} = a_1 \mid s_t = a_0, r_t = b_0) \cdots \\ =~& P(s_t = a_0) \cdot R_{t}(b_0; a_0) \cdot P(s_{t+1} = a_1 \mid s_t = a_0) \cdots \end{aligned} $$
We’ve established that both the transition probability and reward function are time-homogeneous. Hence:
$$ \begin{aligned} & P(s_t = a_0) \cdot R_{t}(b_0; a_0) \cdot P(s_{t+1} = a_1 \mid s_t = a_0) \cdots \\ =~& P(s_t = a_0) \cdot R(b_0; a_0) \cdot P(a_1 \mid a_0) \cdot R(b_1; a_1) \cdots \\ =~& P(s_t = a_0) \cdot \prod_{i=0}^{\infty} \Big[R(b_i; a_i) \cdot P(a_{i+1} \mid a_i)\Big] \end{aligned} $$
Now consider another trajectory that starts from time step $t'$: $\tau_{t'} = (a_0, b_0, a_{1}, b_{1}, \cdots)$. By a similar argument, we obtain:
$$ P(\tau_{t'}) = P(s_{t'} = a_0) \cdot \prod_{i=0}^{\infty} \Big[R(b_i; a_i) \cdot P(a_{i+1} \mid a_i)\Big] $$
We conclude that for any arbitrary state $s \in \mathcal{S}$, time steps $t, t' \in [0, \infty)$, and identical trajectories $\tau_t, \tau_{t'}$ (the same sequence of states and rewards) starting at time steps $t$ and $t'$ respectively:
$$ P(\tau_{t} \mid s_{t} = s) = P(\tau_{t'} \mid s_{t'} = s) $$
It then follows that:
$$ V_t(s) = \mathbb{E}\big[G_t \mid s_t = s\big] = \mathbb{E}\big[G_{t'} \mid s_{t'} = s\big] = V_{t'}(s) $$
Therefore the value function of MRP $M$ is time-homogeneous. We can also drop the time subscript and write the vector $\vec{V}$, where each entry denotes the value of the respective state.
Analytical Solution
Now we are poised to derive the solution. We start with the definition of the value function. Let state $s \in \mathcal{S}$ be arbitrary. The value of MRP $M$ in state $s$ is the immediate reward plus the discounted sum of future rewards. By the time-homogeneity of the value function, we can express the value function as follows:
$$ V(s) = \underbrace{\mathbb{E}[r \mid s]}_{\substack{\text{immediate}\\ \text{reward}}} + \underbrace{\gamma \cdot \sum_{s' \in \mathcal{S}} P(s' \mid s) ~ V(s')}_{\text{discounted sum of future rewards}} $$
Using the vector representation we previously defined, we have:
$$ \vec{V} = \vec{R} + \gamma ~ \mathrm{P} \vec{V} $$
which allows us to express value function $\vec{V}$ as:
$$ \vec{V} = \big(\mathrm{I} - \gamma ~ \mathrm{P}\big)^{-1} \vec{R} $$
where $\mathrm{I}$ is the identity matrix of size $\big\lvert \mathcal{S} \big\rvert \times \big\lvert \mathcal{S} \big\rvert$.
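Here is a minimal numpy sketch of this closed-form solution, reusing the hypothetical $\mathrm{P}$, $\vec{R}$, and $\gamma$ from the earlier examples, and cross-checked against iterating the fixed-point equation $\vec{V} \leftarrow \vec{R} + \gamma \mathrm{P} \vec{V}$:

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])
R = np.array([0.5, 2.0, -0.6])
gamma = 0.9
I = np.eye(len(R))

# Closed form: V = (I - gamma * P)^{-1} R, computed via a linear solve.
V_closed = np.linalg.solve(I - gamma * P, R)

# Cross-check: iterate the fixed-point equation V <- R + gamma * P V.
V_iter = np.zeros_like(R)
for _ in range(1000):
    V_iter = R + gamma * (P @ V_iter)

print(V_closed)
print(np.max(np.abs(V_closed - V_iter)))  # should be ~0
```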
Existence of Solution
For the solution of value function $\vec{V}$ to exist, we need to show $\big(\mathrm{I} - \gamma \mathrm{P}\big)$ is invertible.
We first show that the spectral radius of $\mathrm{P}$ is one; that is, $\rho(\mathrm{P}) = 1$. A more general result is known as the Perron–Frobenius theorem.
Let $\vec{v} \in \mathbb{R}^{\lvert \mathcal{S} \rvert}$ be a vector where all entries are $1$s. We claim $(1, \vec{v})$ is an eigenpair of $\mathrm{P}$. We have:
$$ \begin{aligned} \mathrm{P}\vec{v} &= \begin{bmatrix} ~ P(s_1 \mid s_1) & \cdots & P(s_{\lvert \mathcal{S} \rvert} \mid s_1) ~ \\ ~ \vdots & \ddots & \vdots ~ \\ ~ P(s_1 \mid s_{\lvert \mathcal{S} \rvert}) & \cdots & P(s_{\lvert \mathcal{S} \rvert} \mid s_{\lvert \mathcal{S} \rvert}) ~ \end{bmatrix} \begin{bmatrix} ~ 1 ~ \\ ~ \vdots ~ \\ ~ 1 ~ \end{bmatrix} \\ &= \begin{bmatrix} ~ \sum_{i=1}^{\lvert \mathcal{S} \rvert} P(s_i \mid s_1) ~ \\ ~ \vdots ~ \\ ~ \sum_{i=1}^{\lvert \mathcal{S} \rvert} P(s_i \mid s_{\lvert \mathcal{S} \rvert}) ~ \end{bmatrix} \\ &= \begin{bmatrix} ~ 1 ~ \\ ~ \vdots ~ \\ ~ 1 ~ \end{bmatrix} \\ &= 1 ~ \vec{v} \end{aligned} $$
The penultimate step follows from $\mathrm{P}$ being row stochastic. To show that $1$ is the spectral radius of $\mathrm{P}$, we consider the induced matrix infinity norm. Again by the row-stochasticity of $\mathrm{P}$, we have:
$$ \big\lVert \mathrm{P} \big\rVert_{\infty} = \max_{1 \leqslant i \leqslant \lvert \mathcal{S} \rvert} \bigg\{ \sum_{j=1}^{\lvert \mathcal{S} \rvert} \big\lvert \mathrm{P}_{i,j} \big\rvert \bigg\} = 1 $$
Since the spectral radius of a matrix never exceeds any induced matrix norm, $\rho(\mathrm{P}) \leqslant 1$. Combined with the eigenpair $(1, \vec{v})$ exhibited above, this gives $\rho(\mathrm{P}) = 1$. Alternatively, one can also apply the Gershgorin circle theorem to show that $\rho(\mathrm{P}) \leqslant 1$.
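We can numerically confirm these facts on the hypothetical $\mathrm{P}$ from the earlier examples (eigenvalues via numpy; the induced infinity norm is the largest absolute row sum):

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])

print(np.max(np.abs(np.linalg.eigvals(P))))  # spectral radius rho(P): 1 up to floating point
print(np.abs(P).sum(axis=1).max())           # induced infinity norm: 1
print(P @ np.ones(3))                        # the all-ones vector is an eigenvector: [1. 1. 1.]
```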
Next, since $\gamma \in [0,1)$, the spectral radius of $\gamma \mathrm{P}$ is less than one: $\rho(\gamma \mathrm{P}) = \gamma \, \rho(\mathrm{P}) = \gamma < 1$. All that is left to show is that $\big(\mathrm{I} - \gamma \mathrm{P}\big)$ is invertible.
Assume, for the sake of contradiction, that $\big(\mathrm{I} - \gamma \mathrm{P}\big)$ is not invertible. This means there exists a non-trivial vector $\vec{u} \neq \vec{0}$ such that:
$$ \big(\mathrm{I} - \gamma ~ \mathrm{P}\big) \vec{u} = \vec{0} $$
It follows that:
$$ \gamma ~ \mathrm{P}\vec{u} = 1 ~ \vec{u} $$
Hence $1$ is an eigenvalue of $\gamma \mathrm{P}$; that is, $1 \in \sigma(\gamma \mathrm{P})$. Therefore the spectral radius is no smaller than one: $\rho(\gamma \mathrm{P}) \geqslant 1$. However, we just showed that the spectral radius of $\gamma \mathrm{P}$ is less than one, so we have a contradiction, and our assumption must have been wrong. We conclude that $\big(\mathrm{I} - \gamma \mathrm{P}\big)$ is indeed invertible. $\blacksquare$
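A quick numerical check of the conclusion, again on the hypothetical $\mathrm{P}$ with $\gamma = 0.9$: every eigenvalue of $\gamma\mathrm{P}$ lies strictly inside the unit circle, so $\mathrm{I} - \gamma\mathrm{P}$ has no zero eigenvalue and its determinant is nonzero:

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])
gamma = 0.9

print(np.max(np.abs(np.linalg.eigvals(gamma * P))))  # rho(gamma * P) = gamma < 1
print(np.linalg.det(np.eye(3) - gamma * P))          # nonzero, so I - gamma * P is invertible
```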
[1] S. Ma and J. Y. Yu, "Transition-based versus state-based reward functions for MDPs with Value-at-Risk," 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2017, pp. 974-981, doi: 10.1109/ALLERTON.2017.8262843.
[2] With the alternative transition-based reward function, the immediate reward received as a result of transitioning from $s_t$ to $s_{t+1}$ could be labeled either $r_{t}$ or $r_{t+1}$. Unfortunately, there is no consensus on whether $r_t$ or $r_{t+1}$ is the more advantageous notation in this case.
Such notational disagreement also appears in the context of a Markov Decision Process (MDP). Stanford CS234 lectures (as well as AA228) adopt $r_{t}$, whereas the course reference text Reinforcement Learning: An Introduction prefers $r_{t+1}$, as the authors argue that $r_{t+1}$ helps highlight the fact that the next reward and next state are jointly determined.
A trajectory in an MDP under the former notational scheme begins like $(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \cdots)$, and under the latter like $(s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \cdots)$.