Value and Policy Iteration provide an excellent means for agents in a nondeterministic environment to determine an optimal series of actions through the solving of a Markov decision process (MDP). However, solving an MDP requires that an agent have a great deal of knowledge about its environment: specifically, the rewards for each state and the transition probabilities between states.
When this knowledge is not available to the agent, it can be learned through experience. Reinforcement learning, specifically Q-learning, is a method for doing this. Q-learning is a form of model-free learning; a Q-learning agent can learn an optimal policy without any knowledge about its environment, given enough experience.
In this problem, students implement value iteration, policy iteration, and Q-learning to discover optimal policies for both a toy map and for a realistic campus map. Applying these two approaches to the same set of problems provides students with an understanding of how learning can be used to make up for a lack of available domain knowledge.

This project aims to provide students with hands-on experience implementing value iteration, policy iteration, and Q-learning to solve the same sets of problems.
The learning objectives of this project are:
  • An understanding of the concept of an optimal policy as a solution concept in nondeterministic environments
  • An understanding of the formulation of Markov decision processes
  • The ability to implement solvers for MDPs using value iteration and policy iteration.
  • The ability to implement programs that can learn optimal policies in nondeterministic environments.
  • The ability to articulate how learning can be used to acquire domain knowledge about an environment.
Students should have a basic knowledge of algebra and probability and utility, as well as exposure to basic data structures. Students should also have familiarity with the Python programming language.
The material presented in this project corresponds to chapters 16.1-16.3, 17.1-17.3, and 21.1-21.3 in Russell and Norvig.
The detailed project description is available in the PDF file Value Iteration, Policy Iteration, and Q-learning.. You will need the free Adobe Acrobat Reader to view this file.
You will also need the following Python files:
This project is customizable to accommodate different approaches to teaching and different implementations. Additional exercises are also included for students seeking more extended challenges.
A sample syllabus used at the University of San Francisco when this project was assigned is available at:
Syllabus for AI Course at the University of San Francisco

Additional readings are included in the Background section above.