Author: Eiko

Tags: machine learning, mathematics, probability

Time: 2024-11-13 17:49:27 - 2024-11-13 17:49:27 (UTC)

Markov Reward Process

The states and rewards are given by random variables $S_t, R_t$, with the transition probabilities determined by

$$p(s', r \mid s) = P(S_{t+1} = s', R_{t+1} = r \mid S_t = s)$$

i.e. the reward can depend on both $s$ and $s'$. You can compute the expected state reward, denoted by $\rho(s)$, as

$$\rho(s) = \mathbb{E}[R_{t+1} \mid S_t = s] = \sum_r r \, p(r \mid s)$$
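As a minimal sketch (the two-state chain, rewards, and probabilities below are made up for illustration), one can store $p(s', r \mid s)$ as a table of triples and recover $\rho(s)$ by summing reward times probability:

```python
# A toy two-state MRP, purely illustrative.
# p[s] lists (s_next, reward, probability) triples describing p(s', r | s).
p = {
    0: [(0, 0.0, 0.5), (1, 1.0, 0.5)],
    1: [(0, 2.0, 0.3), (1, 0.0, 0.7)],
}

def rho(s):
    """Expected one-step reward rho(s) = sum_r r * p(r | s)."""
    return sum(prob * r for (_, r, prob) in p[s])

print(rho(0), rho(1))  # 0.5 0.6
```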

The return is a random variable that accumulates future rewards, not just a single-step reward. It can take many forms, but the basic idea is the same.

$$G_t = R_{t+1} + \cdots + R_T$$

where $T$ is some stopping time. Another common form is the geometrically discounted return

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

where $\gamma \in [0,1]$ is the discount factor.
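As a quick sketch (the reward sequence and $\gamma$ below are arbitrary), both forms of the return can be evaluated directly from a finite reward trajectory:

```python
# Rewards R_{t+1}, ..., R_T from one trajectory; values are arbitrary.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

G_undiscounted = sum(rewards)                                    # R_{t+1} + ... + R_T
G_discounted = sum(gamma**k * r for k, r in enumerate(rewards))  # sum_k gamma^k R_{t+k+1}

print(G_undiscounted)  # 4.0
print(G_discounted)    # 1.0 + 0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```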

The value function is the expected return starting from state $s$; it generalizes $\rho(s)$ from a single step to the whole future

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$
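Read operationally, this expectation can be approximated by Monte Carlo: sample trajectories starting from $s$, compute each discounted return, and average. A rough sketch, reusing the toy table `p` from the first snippet (each trajectory is truncated at a fixed horizon, which is an approximation):

```python
import random

def sample_return(s, gamma=0.9, horizon=200):
    """Sample one (truncated) discounted return G_t starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        i = random.choices(range(len(p[s])), weights=[w for (_, _, w) in p[s]])[0]
        s_next, r, _ = p[s][i]
        g += discount * r
        discount *= gamma
        s = s_next
    return g

def v_estimate(s, n=10_000):
    """Monte Carlo estimate of v(s) = E[G_t | S_t = s]."""
    return sum(sample_return(s) for _ in range(n)) / n

print(v_estimate(0))  # roughly 5.5 for this toy chain with gamma = 0.9
```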

The Bellman equation is a recursive equation for the value function.

$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \rho(s) + \gamma \sum_{s'} p(s' \mid s) \, v(s').$$
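One way to use the Bellman equation computationally is as a fixed-point iteration: start from any $v$ and repeatedly apply the right-hand side until the values stop changing. A sketch with the same illustrative two-state chain as before ($\rho$ and $P$ are the marginals of the toy table, nothing canonical):

```python
import numpy as np

# Illustrative two-state MRP: rho_vec[s] = expected one-step reward,
# P[s, s'] = p(s' | s); these match the toy table used earlier.
rho_vec = np.array([0.5, 0.6])
P = np.array([[0.5, 0.5],
              [0.3, 0.7]])
gamma = 0.9

v = np.zeros(2)
for _ in range(10_000):
    v_new = rho_vec + gamma * P @ v          # Bellman update: v <- rho + gamma P v
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new
print(v)  # approximately [5.55, 5.67]
```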

In matrix form, $v = \rho + \gamma P v$, where $P$ is the transition matrix with entries $p(s' \mid s)$, so you can solve it by $v = (I - \gamma P)^{-1}\rho$. For $\gamma < 1$ the matrix $I - \gamma P$ is always invertible: $P$ is a stochastic matrix with spectral radius $1$, so $\gamma P$ has spectral radius $\gamma < 1$.
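The same values come out of a direct linear solve, a one-line check of the closed form (continuing with the illustrative $\rho$ and $P$ above):

```python
v_direct = np.linalg.solve(np.eye(2) - gamma * P, rho_vec)   # v = (I - gamma P)^{-1} rho
print(v_direct)  # matches the fixed-point iteration, approximately [5.55, 5.67]
```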