Author: Eiko

Tags: machine learning, mathematics, probability

Time: 2024-11-13 17:49:27 - 2024-11-13 17:49:27 (UTC)

Markov Reward Process

The states and rewards are given by random variables \(S_t, R_t\), with transition probabilities determined by

\[ p(s',r|s) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r | S_t = s) \]

i.e. the reward may depend on both \(s\) and \(s'\). The expected one-step reward from state \(s\), denoted \(\rho(s)\), is

\[\rho(s) = \mathbb{E}[R_{t+1} | S_t = s] = \sum_r r\, p(r|s), \qquad p(r|s) = \sum_{s'} p(s',r|s)\]
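
As a quick sanity check, here is a minimal sketch in NumPy, assuming a hypothetical two-state, two-reward MRP whose joint table \(p(s',r|s)\) is made up purely for illustration:

```python
import numpy as np

# Hypothetical toy MRP: 2 states, 2 possible reward values.
# p_joint[s, s_next, i] = p(s_next, rewards[i] | s); each p_joint[s] sums to 1.
rewards = np.array([0.0, 1.0])
p_joint = np.array([
    [[0.4, 0.1], [0.3, 0.2]],   # transitions out of state 0
    [[0.0, 0.5], [0.2, 0.3]],   # transitions out of state 1
])

# Marginalize over s' to get p(r|s), then take the expectation over r.
p_r_given_s = p_joint.sum(axis=1)   # shape (n_states, n_rewards)
rho = p_r_given_s @ rewards         # rho[s] = E[R_{t+1} | S_t = s]
print(rho)                          # [0.3 0.8]
```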

The return is a random variable that accumulates future rewards rather than just a single-step reward. It can take several forms, but the basic idea is the same:

\[ G_t = R_{t+1} + \dots + R_T \]

where \(T\) is some stopping time. Another common form is the geometrically discounted return

\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \]

where \(\gamma \in [0, 1]\) is the discount factor.
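
For completeness, a short sketch of how such a return could be accumulated from a finite sequence of sampled rewards, using the backward recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) (the function name and inputs are illustrative, not part of the text above):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... by iterating backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```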

The value function is the expected return starting from state \(s\); it extends \(\rho(s)\) from a one-step expectation to an expectation over all future rewards,

\[ v(s) = \mathbb{E}[G_t | S_t = s] \]

The Bellman equation is a recursive equation for the value function, obtained by writing \(G_t = R_{t+1} + \gamma G_{t+1}\) and using the Markov property:

\[\begin{align*} v(s) &= \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) | S_t = s] \\ &= \rho(s) + \gamma \sum_{s'} p(s'|s) v(s'). \end{align*}\]

In matrix form, \(v = \rho + \gamma P v\), where \(P\) is the transition matrix with entries \(p(s'|s)\), so you can solve it by \(v = (I - \gamma P)^{-1} \rho\). For \(\gamma < 1\) the matrix \(I - \gamma P\) is always invertible: \(P\) is a stochastic matrix with spectral radius \(1\), so the spectral radius of \(\gamma P\) is \(\gamma < 1\).
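
A minimal sketch of this direct solve, reusing the hypothetical \(\rho\) from the toy example above together with a made-up row-stochastic \(P\); it also verifies the Bellman equation numerically:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3],     # hypothetical p(s'|s), rows sum to 1
              [0.4, 0.6]])
rho = np.array([0.3, 0.8])    # expected one-step rewards from the toy example

# Solve (I - gamma P) v = rho directly (preferable to forming the explicit inverse).
v = np.linalg.solve(np.eye(2) - gamma * P, rho)

# Check the Bellman equation v = rho + gamma P v.
assert np.allclose(v, rho + gamma * P @ v)
print(v)
```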