Frozen-Lake as a Markov Decision Process

Winter is here. Can Dynamic Programming save us?

Sai Sasank
The Startup

--

This is the second post in the series on Reinforcement Learning. In the previous post, we looked at the simple environment of the k-Armed Bandit and learned the ideas of action-value methods and exploration. In this post, we will look at Frozen-Lake, an environment more complex than the previous one. We will use a Markov Decision Process to model this environment. We will then learn about value functions and policies and ways to make them optimal. The code discussed in this post can be found here.

Frozen Lake ft. Slippery Ice and Dreadful Holes

Frozen-Lake is a more involved environment than k-Armed Bandit. It is a 4x4 grid where each cell is either a starting cell (S), a frozen cell (F) indicating a safe position, a hole (H) which must be avoided, or the goal (G) which must be reached. The agent starts at S and needs to find its way to G across the frozen cells while avoiding the holes. In the simplest version of the environment, the agent’s actions result in deterministic transitions and work as intended. The reward is 1 if you reach G and 0 otherwise.

Read more about the environment.

To make our lives harder, the frozen surfaces are slippery making our movements uncertain. We will look at the exact details later.

What would happen once the agent reaches the goal cell? And what would happen if it falls through one of the holes? The agent is simply put into the starting position once again, as if it has been given a new life. It’s simply the laws of the environment. The agent’s activity eventually leads to termination, and we call such tasks episodic. In this case, an episode begins with the agent in the given start state and ends when the agent reaches the goal or falls into a hole. In contrast to episodic tasks, we can have environments where there’s really no end to the task and the agent lives forever maximizing expected reward. Such tasks are continuing tasks.

Frozen-Lake is more sophisticated than k-Armed Bandit: this is no longer a non-associative setting. The reward depends not just on the action but also on the state, which in the case of Frozen-Lake is the position of the agent.

Agent and Environment

We already have an informal understanding of agents and environments. Let’s make it a little more formal and explicit here.

The agent is the decision-making system and the environment is what the agent interacts with. Depending on the agent’s action, it may receive some reward from the environment. The agent observes the environment and leverages the observations (and rewards, if any) to take further actions. The goal of the agent is to maximize the expected cumulative reward.

A depiction of the Agent-Environment interaction.

Markov Decision Process (MDP)

It’s useful for any agent to have a model that represents the environment. The model, if accurate, can be used to make decisions so as to maximize the expected cumulative reward.

MDPs are used to model a wide range of environments involving sequential decision making. In an MDP, we have the notion of state, which describes some useful representation of the environment. There’s the notion of actions, which represent the decisions any agent is allowed to take. Actions have consequences and affect the state and may also give rise to a scalar reward. The transitions and rewards are dependent on the current state and action, which are captured by a dynamics function p. This function completely characterizes the environment’s dynamics and is also referred to as the model.

Given state s and action a, p defines the probability of transitioning into state s’ with reward r.
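In the notation used throughout this series (following Sutton and Barto), this dynamics function can be written as

p(s', r \mid s, a) \doteq \Pr\{ S_t = s', \, R_t = r \mid S_{t-1} = s, \, A_{t-1} = a \}

where the probabilities sum to 1 over all pairs (s', r) for every state-action pair (s, a).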

At time t=0, the agent is in state S₀ and takes an action A₀. This causes a transition in the environment’s state to S₁, and a scalar reward R₁ is given to the agent. The agent takes another action A₁, giving rise to a new state S₂ and reward R₂. In continuing tasks, this process goes on ad infinitum; in episodic tasks, it continues until the end of the episode.

A depiction of sequential decision making.

Any MDP is a finite MDP if the sets of states, actions, and rewards are finite. Below is a definition for a finite MDP.

A definition for a finite MDP, where the reward is assumed to be deterministic for any state.
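One common way to spell out such a definition (consistent with the deterministic-reward assumption above) is as a tuple

\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)

where \mathcal{S} is a finite set of states, \mathcal{A} is a finite set of actions, p(s' \mid s, a) gives the transition probabilities, r : \mathcal{S} \to \mathbb{R} is a deterministic reward function, and \gamma \in [0, 1] is a discount rate.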

Frozen-Lake as an MDP

There are 16 distinct states and 4 distinct actions in our Frozen-Lake. The rewards are either 0 or 1. We will represent the states with integers from 0 (top-left) to 15 (bottom-right). Actions are represented with integers from 0 (left) to 3 (up) ordered in anti-clockwise direction.

States and actions in Frozen-Lake.

Our movement in the lake is uncertain: we move in the chosen direction only a third of the time, and the remaining two thirds of the time we slip perpendicular to the intended direction, one third to its left and one third to its right.

Uncertain movement due to slippery ice.

And finally, the reward is 0 except when you enter the goal state G from a frozen state F, in which case the reward is 1. The episode ends when you reach G or fall into an H. The code below shows Frozen-Lake and an MDP that models it.

Frozen-Lake modelled as a finite Markov Decision Process.

Below is the output. Note that state 0 is the starting cell S, state 11 is the hole H in the third row, and state 15 is the goal state G. Each transition is defined as a 4-tuple: (probability_of_transition, next_state, reward, is_episode_end).

SFFF 
FHFH
FFFH
HFFG
Number of states 16
Number of actions 4
Transitions for state 0 and action LEFT are
[(0.33, 0, 0.0, False), (0.33, 0, 0.0, False), (0.33, 4, 0.0, False)]
Transitions for state 0 and action UP are
[(0.33, 1, 0.0, False), (0.33, 0, 0.0, False), (0.33, 0, 0.0, False)]
Transitions for state 11 and action LEFT are
[(1.0, 11, 0, True)]
Transitions for state 11 and action UP are
[(1.0, 11, 0, True)]
Transitions for state 15 and action LEFT are
[(1.0, 15, 0, True)]
Transitions for state 15 and action UP are
[(1.0, 15, 0, True)]
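For readers who just want to poke at the model, a minimal sketch along these lines can be built directly from gym’s bundled Frozen-Lake environment. This is an illustrative sketch, not the exact code from the linked notebook: it assumes a reasonably recent gym where the environment id is FrozenLake-v1 (older versions use FrozenLake-v0) and uses env.unwrapped.P, gym’s built-in transition table.

import gym

# is_slippery=True gives the stochastic transitions described above.
env = gym.make("FrozenLake-v1", is_slippery=True)

print("Number of states", env.observation_space.n)   # 16
print("Number of actions", env.action_space.n)       # 4

# env.unwrapped.P[s][a] is a list of (probability, next_state, reward, done)
# tuples, i.e. exactly the dynamics function p of the finite MDP.
P = env.unwrapped.P
LEFT, UP = 0, 3
for s in (0, 11, 15):
    for a, name in ((LEFT, "LEFT"), (UP, "UP")):
        print(f"Transitions for state {s} and action {name} are")
        print([(round(prob, 2), s2, r, done) for prob, s2, r, done in P[s][a]])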

Returns, Policies, and Value Functions

The goal has always been to maximize the expected sum of rewards. To characterize this more formally, we define the return G(t) at time step t as the sum of all the rewards received from that time step onward. We seek to maximize the expected value of G(t).

Defining return G(t) at time step t.

In the case of continuing tasks, the return could be infinite, so we introduce a discount rate γ that lies in [0, 1); with discounting, the return is finite as long as the reward sequence is bounded.

γ can be 1 in the case of episodic tasks.

Defining return G(t) at time step t with discounting.
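Written out, the undiscounted and discounted returns are

G_t \doteq R_{t+1} + R_{t+2} + \dots + R_T

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

where T is the final time step of an episode.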

We define a policy as a mapping from states to probability distributions over actions. A policy informs the agent of the action to take in any given state. Mathematically, π(a|s) is the probability that Aₜ = a given Sₜ = s. The agent is said to be following π if, at each time step, it selects its action according to the distribution π(·|Sₜ).

Policy π for a state s is simply a probability distribution over the set of actions.

The key idea behind many algorithms is to estimate value functions. These are functions of states (or state-action pairs) that give the expected return from those states (or state-action pairs) under a given policy. The state-value function v(s) under a policy π is defined as follows:

The expected return from a state s is a weighted combination over actions a, successor states s′, and rewards r.

The final equation is called the Bellman equation for v(s): the value of a state is expressed in terms of the values of its successor states. This gives a system of linear equations, and finding the exact solution is straightforward but expensive.
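For reference, the Bellman equation for the state-value function under a policy π is

v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]

which holds for every state s.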

Fun exercise: The state value function is defined as a function of the state. Can you come up with the equation for the action-value function q(s, a) under policy π that is defined as the expected return if you take an action a in a state s and follow π thereafter?

Optimal Policies and Optimal Value Functions

A policy π is said to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states. There is always at least one policy that is better than or equal to every other policy; this is an optimal policy. Optimal policies are denoted by π*. All optimal policies share the same state-value function v*(s). Below is the Bellman optimality equation for the state value function.

The Bellman Optimality Equation for a state-value function.
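In equation form:

v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]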

And using the optimal state-value function (or action-value function as shown below), we can get (one of) the optimal policies.

An optimal policy defined using the optimal action-value function and then using the optimal state-value function.
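In equation form, a greedy optimal policy can be written as

\pi_*(s) = \arg\max_a q_*(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]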

Improving Policies and Value Functions

Before we solve for optimal policies and value functions, let's look at the initial agent definition. The value function v(s) is initialized to 0s and the policy π is initialized to a random deterministic policy.

The initial agent definition.
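A minimal sketch of such an initialisation (the names and the discount rate parameter are illustrative, not necessarily those used in the linked notebook):

import numpy as np

class Agent:
    def __init__(self, n_states, n_actions, gamma=0.99, seed=0):
        self.n_states = n_states
        self.n_actions = n_actions
        self.gamma = gamma
        # State-value estimates start at zero for every state.
        self.v = np.zeros(n_states)
        # Random deterministic policy: a single action chosen per state.
        rng = np.random.default_rng(seed)
        self.policy = rng.integers(n_actions, size=n_states)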

Policy evaluation. Instead of solving for v(s) exactly by analytically solving the linear equations, we solve iteratively by starting with an arbitrary v(s) and nudging it closer to the exact solution at each step. In each iteration, the “nudge” is done using the Bellman equation itself. Below is a function that does this and we call this algorithm iterative policy evaluation.

Policy evaluation implementation.
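Here is a sketch of iterative policy evaluation for a deterministic policy, assuming the gym-style transition table P[s][a] of (probability, next_state, reward, done) tuples shown earlier; theta is the convergence threshold and the function names are illustrative.

def policy_evaluation(P, policy, v, gamma=0.99, theta=1e-8):
    # Sweep all states repeatedly, backing up v(s) with the Bellman equation,
    # until the largest change in a sweep falls below theta.
    while True:
        delta = 0.0
        for s in range(len(v)):
            a = policy[s]
            new_v = sum(prob * (r + gamma * v[s2] * (not done))
                        for prob, s2, r, done in P[s][a])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v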

Policy improvement. We have seen how to estimate a value function for a given policy π using the policy evaluation algorithm. Once, we evaluate the policy π we ask the following question: For some state s would it be better to take an action a that is not suggested by π? To find out we look at all such actions a and see if q(s, a) ≥ v(s) in which case we will update the policy π(s) to a. The resulting policy is at least as good as the previous as a consequence of policy improvement theorem. We repeat this step for all states and the resulting algorithm is called the policy improvement algorithm. If there’s no change in policy we say it’s stable.

Policy improvement implementation.
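A sketch of the corresponding greedy improvement step, under the same assumptions; it returns the updated policy along with a flag saying whether the policy was already stable.

def policy_improvement(P, policy, v, gamma=0.99):
    n_actions = len(P[0])
    stable = True
    for s in range(len(v)):
        old_action = policy[s]
        # One-step lookahead with the model gives q(s, a) for every action.
        q = [sum(prob * (r + gamma * v[s2] * (not done))
                 for prob, s2, r, done in P[s][a])
             for a in range(n_actions)]
        policy[s] = max(range(n_actions), key=lambda a: q[a])
        if policy[s] != old_action:
            stable = False
    return policy, stable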

Policy Iteration. As long as the policy is not stable we repeatedly use policy evaluation and policy improvement. Each policy is guaranteed to be a strict improvement over the previous one unless it is already optimal and therefore stable.

Policy iteration using policy evaluation and policy improvement.
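Put together, a sketch of policy iteration built on the two routines above, starting from an arbitrary deterministic policy and an all-zeros value function:

import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99):
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy
    while True:
        # Evaluate the current policy, then act greedily with respect to it.
        v = policy_evaluation(P, policy, v, gamma)
        policy, stable = policy_improvement(P, policy, v, gamma)
        if stable:
            return policy, v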

Further optimization by Value Iteration. In policy iteration, we perform policy evaluation and policy improvement repeatedly in succession until the policy is stable. Instead, we can defer updating the policy and find the optimal value function first. Using the optimal value function, we can then extract an optimal policy directly.

Value iteration implementation where we sidestep explicit policy improvement until the end.
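A sketch of value iteration under the same assumptions: sweep the states with the Bellman optimality backup until convergence, and only then extract a greedy policy.

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: keep the best one-step lookahead value.
            q = [sum(prob * (r + gamma * v[s2] * (not done))
                     for prob, s2, r, done in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # Extract a greedy policy from the converged value function.
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(prob * (r + gamma * v[s2] * (not done))
                 for prob, s2, r, done in P[s][a])
             for a in range(n_actions)]
        policy[s] = max(range(n_actions), key=lambda a: q[a])
    return policy, v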

Generalized Policy Iteration. The idea of using policy evaluation and improvement algorithms together (we have already seen two ways of doing it: policy iteration and value iteration) is a recurring idea in RL and is referred to as Generalized Policy Iteration. In fact, we can make an asynchronous combination of policy evaluation and improvement algorithms by updating the state-values in any order, even updating some state-values multiple times before updating others.

Experiments

In a small experiment, we will consider two agents where the first one uses policy iteration and the second one uses value iteration. The complete code can be found here. Below is the frozen-lake environment.

S F F F
F H F H
F F F H
H F F G

Comparing Optimal Value Functions. Let us compare the estimates of the optimal value function in the cases of policy iteration and value iteration.

Optimal value function calculated using policy and value iteration respectively. Darker colours indicate lower values and brighter colours indicate higher values.

There’s absolutely no difference between their estimates, as both algorithms eventually converge to the optimal value function. The brightest coloured cell is to the left of the goal state, as expected. The darkest cells are either holes or the goal state itself, which is also expected since we don’t get any returns from these cells. Let’s pay attention to the 3rd cell in the 2nd row, which is quite dark but not the darkest. This is because it is surrounded by two holes, and the uncertain movement can cause an unintended transition into one of them.

The cumulative reward of the agents. Now, let’s look at their performance over 2000 episodes. Both agents accumulate a total reward of about +1500, indicating they fail in roughly 500 episodes. This is mostly due to the uncertain movement in the Frozen-Lake.

The cumulative reward of agents trained using policy and value iteration respectively.

What’s the difference? So what exactly did we gain by using value iteration over policy iteration? Compare for yourself the execution times of policy iteration and value iteration below.

The benefit of using value iteration over policy iteration: Execution speed.

Value iteration converges to the same estimate of the optimal value function, and therefore the optimal policy, faster than policy iteration.
