We define an infinite-horizon discounted MDP as follows. There are three states x, y1, y2 and a single action a. The MDP dynamics are independent of the action a, as shown below:

At state x, with probability 1 the state transits to y1, i.e.,

P(y1|x)=1.

Then at state y1, we have

P(y1|y1)=p, P(y2|y1)=1−p,

which says that with probability p we stay in y1 and with probability 1−p the state transits to y2. Finally, state y2 is an absorbing state, so that

P(y2|y2)=1.

The immediate reward is 1 for any transition starting in state y1 and 0 otherwise:

R(y1,a,y1)=1, R(y1,a,y2)=1, R(s,a,s′)=0 otherwise.

The discount factor is denoted by γ (0<γ<1).
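Because the chain is so small, the setup can be sanity-checked numerically. Below is a minimal Python sketch that encodes the transitions and rewards above and runs value iteration; the values p = 0.5 and gamma = 0.9 are illustrative assumptions (the problem leaves them symbolic), and with a single action the max over actions in the Bellman update is trivial.

    # Minimal sketch of the chain above; the values of p and gamma are
    # illustrative assumptions, not part of the problem statement.
    p, gamma = 0.5, 0.9

    states = ["x", "y1", "y2"]

    # Action-independent transition probabilities P[s][s'].
    P = {
        "x":  {"y1": 1.0},
        "y1": {"y1": p, "y2": 1.0 - p},
        "y2": {"y2": 1.0},
    }

    # Reward R(s, a, s'): 1 for any transition starting in y1, 0 otherwise.
    def reward(s, s_next):
        return 1.0 if s == "y1" else 0.0

    # Value iteration; with one action the max over actions is trivial.
    V = {s: 0.0 for s in states}
    for _ in range(1000):
        V = {s: sum(prob * (reward(s, s_next) + gamma * V[s_next])
                    for s_next, prob in P[s].items())
             for s in states}

    print(V)  # numerical estimates of V*(x), V*(y1), V*(y2)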

1
Define V∗(y1) as the optimal value function at state y1. Compute V∗(y1) via the Bellman equation. (The answer is a formula in terms of γ and p.)

(Enter gamma for γ.)

V∗(y1)=
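For reference, here is one way the Bellman equation specializes to this chain, using only the definitions above (a sketch to check your own derivation against; the single action makes the max over actions trivial):

V∗(y1) = p[R(y1,a,y1) + γV∗(y1)] + (1−p)[R(y1,a,y2) + γV∗(y2)]
       = 1 + γpV∗(y1) + γ(1−p)V∗(y2).

Since y2 is absorbing with zero reward, V∗(y2) = 0, and solving the resulting linear equation gives V∗(y1) = 1/(1−γp).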

3 answers

Is this a homework dump? Or a test dump?
They're review questions for the final.
OK, be sure to have patience and wait for a math tutor to come online.