Most of us are already familiar with the famous Pac-Man grid game. In this article we will look at a type of Markov decision process that can be used to make Pac-Man intelligent, so that it takes decisions on its own to maximise the win rate.

Markov Decision Process

Markov Decision Process (MDP) is a sequential decision process in a fully observable but stochastic environment, with a Markovian transition model and a reward for each state in the environment. It consists of a set of states, a set of available actions, a transition model and a reward function. Formally, it can be defined as the tuple:

⟨ S, A, T, R(s), γ ⟩

S = (s0, s1, …, sn): the set of states in the observable environment.
A = (a0, a1, …, an): the set of actions available to the agent.
T = P(s′|s, a): the probability of moving to state s′ from s using action a.
R(s): the reward for each state.
γ: the discount factor.

Example Grid

For example, an agent in the grid world above will have several paths to reach the target from the start point, but which path should it select to maximise the total utility? The aim of an MDP solver is to calculate the optimal policy for the current state. An optimal policy is the plan, or the action an agent should take in each state, based on the highest expected utility. Several algorithms have been proposed to calculate the optimal policy, but the key ones are:

  • Value Iteration
  • Policy Iteration
In this article, VALUE ITERATION is used to calculate the optimal policy. In this algorithm, the utility of each state is calculated in order to select the optimal action in each state. The algorithm exploits the fact that, in a fully observable environment, there is a direct relation between the utility of the current state and the utilities of its neighbouring state(s). The utility of a state is given by the Bellman equation (named after Richard Bellman).
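Written out in the notation defined above, the Bellman equation is:

```latex
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
```

That is, the utility of a state is its immediate reward plus the discounted expected utility of the best action available from it.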

For n possible states in a fully observable environment, there are n Bellman equations, one for each state. VALUE ITERATION uses an iterative approach to overcome the limitation of the ‘max’ operator (a non-linear operator) in the Bellman equation. This approach is called the Bellman update.

Bellman Update

This can be defined in a pseudo code

Value Iteration [1]
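The pseudo code above can be sketched as a small runnable program. This is a minimal Python 3 sketch of value iteration on a hypothetical 1-D, 4-state toy world (the state space, rewards and simplified noise model here are illustrative assumptions, not the Pac-Man env):

```python
# Minimal value-iteration sketch on a 1-D, 4-state toy world.
# States 0..3; state 3 is terminal with reward +10, the rest cost -0.04.
# Simplified noise model: the intended move succeeds with prob 0.8,
# otherwise the agent stays put (prob 0.2).

GAMMA = 0.9
STATES = [0, 1, 2, 3]
TERMINAL = 3
REWARD = {0: -0.04, 1: -0.04, 2: -0.04, 3: 10.0}

def step(s, a):
    """Deterministic result of action a from state s (edges clamp)."""
    if a == "right":
        return min(s + 1, 3)
    return max(s - 1, 0)

def value_iteration(tol=1e-6):
    """Repeat the Bellman update until utilities stop changing."""
    U = {s: 0.0 for s in STATES}
    while True:
        U_new = {}
        for s in STATES:
            if s == TERMINAL:
                U_new[s] = REWARD[s]
                continue
            # Bellman update: R(s) + gamma * max_a E[U(s')]
            eu = [0.8 * U[step(s, a)] + 0.2 * U[s]
                  for a in ("left", "right")]
            U_new[s] = REWARD[s] + GAMMA * max(eu)
        if max(abs(U_new[s] - U[s]) for s in STATES) < tol:
            return U_new
        U = U_new

U = value_iteration()
```

Because γ < 1, each sweep is a contraction, so the utilities converge; states closer to the terminal reward end up with higher utility.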

When the update is applied for a sufficient number of iterations, the Bellman update is guaranteed to reach an equilibrium utility value U(s) for every state s, as shown in the figure below.

Utility converges for all states in the environment


For this article, the framework used was developed by UC Berkeley as a platform to experiment with algorithms that help the Agent move around the environment. The framework uses Python 2.7. The game ends in either a loss (when a ghost kills the Agent) or a win (when the Agent eats all the food).


The Pac-Man game environment, referred to as the env in this article, is a grid system with i rows and j columns. North is the direction of increasing y. A coordinate (i, j) is referred to as an independent state in this environment. The states can be of the following types:

  • Current State: the current location of Pac-Man (the Agent).
  • Walls: define the boundaries of the env and are not accessible to the Agent.
  • Food: food that the Agent has to eat.
  • Ghosts: ghosts that kill the Agent when any one of them is at the same state as the Agent. They are dynamic and move around the grid.
  • Capsules: when eaten by the Agent, ghosts temporarily stop being a threat.

The set of actions available to the Agent is: NORTH, SOUTH, EAST, WEST, STOP. The game framework defines the legal actions available to the Agent in every state. For example, the Agent may have only two directions to move, NORTH and SOUTH, because the states in the EAST and WEST directions are occupied by walls. To win a game, the Agent must consume all the food in the env before the ghost(s) eat the Agent. To reduce the threat of ghosts, the Agent can consume capsules to nullify the threat.
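The wall-based filtering described above can be sketched in a few lines. This is not the framework's own API, just a hypothetical helper that derives legal moves by excluding directions blocked by walls:

```python
# Hedged sketch (not the Berkeley framework's API): derive the legal
# moves for a state by excluding directions whose neighbour is a wall.

MOVES = {"NORTH": (0, 1), "SOUTH": (0, -1), "EAST": (1, 0), "WEST": (-1, 0)}

def legal_actions(state, walls):
    """Return the directions whose neighbouring cell is not a wall."""
    legal = []
    for name, (dx, dy) in MOVES.items():
        nxt = (state[0] + dx, state[1] + dy)
        if nxt not in walls:
            legal.append(name)
    return legal

# Example: walls EAST and WEST of (7, 3) leave only NORTH and SOUTH.
walls = {(8, 3), (6, 3)}
```

(STOP is always legal in the real game, so a full version would append it unconditionally.)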

The VALUE ITERATION algorithm is used to calculate the maximum expected utility (MEU) over all states in the given env, which the Agent uses to select the best possible action a ∈ A(s) to move from the current state s to the next state s′.

Initialise the variables for the MDP process as below. It is important to note that the γ value is not changed at any time during the game. The action probability in this article is 0.8 for the intended direction and 0.1 each for the directions to the LEFT and RIGHT of the intended direction. For example, if the Agent intends to move NORTH, P(a = NORTH) = 0.8, P(a = WEST) = 0.1 and P(a = EAST) = 0.1.
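The 0.8/0.1/0.1 noise model can be captured in a small helper. This is a hypothetical sketch (the direction tables and function name are assumptions, not part of the framework):

```python
# Hypothetical helper mirroring the stochastic action model above:
# the intended direction with prob 0.8, and the directions to its
# left and right with prob 0.1 each.

LEFT_OF = {"NORTH": "WEST", "WEST": "SOUTH", "SOUTH": "EAST", "EAST": "NORTH"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def action_distribution(intended, p_intended=0.8):
    """Return {actual_direction: probability} for a noisy move."""
    side = (1.0 - p_intended) / 2.0
    return {intended: p_intended,
            LEFT_OF[intended]: side,
            RIGHT_OF[intended]: side}
```

For instance, `action_distribution("NORTH")` assigns 0.8 to NORTH and roughly 0.1 each to WEST and EAST, matching the example in the text.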

"""Define global parameters: static reward values for states in the env"""
REWARD_FOOD = 10

class MDPAgent(Agent):
    """Constructor: this gets run when we first invoke"""
    def __init__(self):
        print "Starting up MDPAgent!"
        name = "Pacman"
        # Initialise:
        # @gamma (discount factor)
        # @action probability
        # @utility dictionary for relevant states
        # @utilRecord dictionary keeping a record of the last utilities,
        #   used to check convergence of the Bellman update in value iteration
        self.gamma = 0.9
        self.actionProb = api.directionProb  # 0.8
        self.utilRecord = {}
        self.utility = {}
        self.valueMap = {}

Using the above reward values, the Agent is discouraged from staying near states with ghosts or remaining in its current state. The highest R(s) is given to food in the env. The smallGrid env has no capsules, so they are ignored. It is important to note that R(s) is not modified during the game and remains static. Each state is initialised with a utility value of 0.

e.g. valueMap = {(7, 3): 10,...,(10, 2): 0}
e.g. utility = {(7, 3): 0,...,(10, 2): 0}
e.g. for state (7,3) with neighbouring states (7,4), (7,2), (8,3) and (6,3), U(s′) for the Bellman update will be:
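Reconstructing the expected utilities from the transition model described above (with moves into the walls at (7,4) and (7,2) bouncing the Agent back to (7,3)):

```latex
\begin{aligned}
EU(\text{NORTH}) &= 0.8\,U(7,3) + 0.1\,U(6,3) + 0.1\,U(8,3)\\
EU(\text{EAST})  &= 0.8\,U(8,3) + 0.1\,U(7,3) + 0.1\,U(7,3)\\
EU(\text{SOUTH}) &= 0.8\,U(7,3) + 0.1\,U(8,3) + 0.1\,U(6,3)\\
EU(\text{WEST})  &= 0.8\,U(6,3) + 0.1\,U(7,3) + 0.1\,U(7,3)\\
U(7,3) &= R(7,3) + \gamma \max_{a} EU(a)
\end{aligned}
```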

Note that when the Agent tries to move NORTH or SOUTH from state (7,3), it bounces back to the current state because there is a wall at both (7,4) and (7,2). Hence, the utility of (7,3) itself is used, as shown in the equation above and in the code below:

# Define directions from the current state
_west = (state[0] - 1, state[1])   # WEST
_north = (state[0], state[1] + 1)  # NORTH
_south = (state[0], state[1] - 1)  # SOUTH
_east = (state[0] + 1, state[1])   # EAST
_dir = [_west, _east, _north, _south]
# Check for walls: a move into a wall bounces back to the current state
for d in _dir:
    if d in _map.wallsCoords:
        _dir[_dir.index(d)] = state
# If a neighbour is a terminal state, pin its utility
for terminal in _dir:
    if terminal in _map.terminalState:
        self.utilRecord[terminal] = TERMINAL_VALUE
"""Calculate the expected utility of each possible action from the
current state; utilRecord{} holds the utilities from the last sweep."""
EU = []
# NORTH: 0.8 intended + 0.1 each for WEST/EAST
EU.append(self.actionProb*self.utilRecord[_dir[2]] + 0.1*self.utilRecord[_dir[0]] + 0.1*self.utilRecord[_dir[1]])
# EAST: 0.8 intended + 0.1 each for NORTH/SOUTH
EU.append(self.actionProb*self.utilRecord[_dir[1]] + 0.1*self.utilRecord[_dir[2]] + 0.1*self.utilRecord[_dir[3]])
# SOUTH: 0.8 intended + 0.1 each for EAST/WEST
EU.append(self.actionProb*self.utilRecord[_dir[3]] + 0.1*self.utilRecord[_dir[1]] + 0.1*self.utilRecord[_dir[0]])
# WEST: 0.8 intended + 0.1 each for NORTH/SOUTH
EU.append(self.actionProb*self.utilRecord[_dir[0]] + 0.1*self.utilRecord[_dir[2]] + 0.1*self.utilRecord[_dir[3]])
"""Calculate the maximum expected utility of that state"""
MEU = max(EU)
"""Calculate the utility with the Bellman update and store it in the utility dict{}"""
self.utility[state] = self.valueMap[state] + (self.gamma * MEU)
"""Check convergence for every state and update the local _convergedList"""
if self.utility[state] - self.utilRecord[state] < 0.005:
    if state not in _convergedList:
        _convergedList.append(state)

For convergence, the code keeps utilRecord{} as a copy of utility{}.

self.utilRecord = dict(self.utility)
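The `dict(...)` constructor builds a new (shallow) copy, which is what makes this convergence check work: later Bellman updates to `self.utility` do not mutate the snapshot held in `self.utilRecord`. A tiny standalone illustration:

```python
# dict(utility) creates a new dictionary, so updating the original
# afterwards leaves the snapshot untouched.

utility = {(7, 3): 0.0, (8, 3): 0.0}
utilRecord = dict(utility)   # snapshot of the previous sweep
utility[(7, 3)] = 4.2        # the new sweep overwrites utility...
# ...but utilRecord still holds the old value for the comparison.
```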

Finally, to decide on an action, the Agent uses the utilities U(s′) calculated for the neighbouring states s′ and selects the best legal action.

"""Create a local list to store the utility of each legal action from the current state"""
_actionUtil = []
for action in legal:
    if action == "West":
        newState = (_map._currPacLoc[0] - 1, _map._currPacLoc[1])
    elif action == "East":
        newState = (_map._currPacLoc[0] + 1, _map._currPacLoc[1])
    elif action == "North":
        newState = (_map._currPacLoc[0], _map._currPacLoc[1] + 1)
    elif action == "South":
        newState = (_map._currPacLoc[0], _map._currPacLoc[1] - 1)
    else:  # STOP stays in the current state
        newState = _map._currPacLoc
    _actionUtil.append(self.utility[newState])

"""Identify the action with the maximum utility from the value iteration,
i.e. the index of the action with maximum utility (argmax)."""
_maxActionUtil = max(_actionUtil)
self.last = legal[_actionUtil.index(_maxActionUtil)]
return api.makeMove(self.last, legal)


For the smallGrid layout 1250 iterations were executed, and for mediumClassic 250, to find the optimal γ value for each layout. The figures below show the average time taken by the Agent to make a move in relation to γ. The Agent takes more time to decide the best policy in a given state as the value of γ increases.

Average time taken to execute a decision for 2 different grids (Small and Medium)

It also takes more time for MDP to converge as γ increases.

Iterations for convergence vs gamma

To find the optimal value of γ, it is important to look at the relation between the win rate and γ. The figures below show this correlation for the smallGrid and mediumClassic layouts.

Win rate

From the graphs it is evident that for the smallGrid layout a γ value of 0.5 gives the highest win rate, while for the mediumClassic layout a γ value of 0.9 gives the highest win rate. Given that in the smallGrid layout there is not much variation in win rate between 0.5 and 0.9, γ = 0.9 is taken as the optimal value for both.

A run of PacMan using MDP VALUE ITERATION


[1] Stuart Russell and Peter Norvig. "Artificial Intelligence: A Modern Approach". 2010.

[2] Namco. "Pac-Man". 1980.

[3] UC Berkeley. "Pac-Man software". 2017.

