Overview
Value and policy iteration provide an excellent means for agents in a
nondeterministic environment to determine an optimal series of actions
by solving a Markov decision process (MDP). However, solving an MDP
requires that an agent have a great deal of knowledge about its
environment: specifically, the reward for each state and the transition
probabilities between states. When this knowledge is not available to
the agent, it can be learned through experience. Reinforcement
learning, specifically Q-learning, is a method for doing this.
Q-learning is a form of model-free learning; given enough experience, a
Q-learning agent can learn an optimal policy without any prior
knowledge of its environment.
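To make the model-free idea concrete, here is a minimal sketch of tabular Q-learning on a toy chain MDP. The environment (a 5-state chain where moving right eventually reaches a rewarding terminal state) and all parameter values are illustrative assumptions, not part of any particular assignment specification; the key point is that the agent never consults transition probabilities or a reward table, only sampled experience.

```python
import random

# Illustrative toy MDP: states 0..4 on a chain; action 0 = left,
# action 1 = right; entering state 4 yields reward 1 and ends the
# episode. This environment is hypothetical, chosen only for the demo.
N_STATES, ACTIONS = 5, (0, 1)

def step(state, action):
    """Deterministic transition: move one step left or right."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the current Q estimates,
            # occasionally explore a random action.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[s][act])
            s2, r, done = step(s, a)
            # Q-learning update: bootstrap from the best next-state value.
            best_next = 0.0 if done else max(Q[s2])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s2
    return Q

Q = q_learning()
# Greedy policy extracted from the learned Q-values.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
```

After training, the greedy policy moves right from every non-terminal state, even though the agent was never told the transition or reward structure.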
In this problem, students implement value iteration, policy iteration,
and Q-learning to discover optimal policies for both a toy map and a
realistic campus map. Applying both approaches to the same set of
problems (planning with a known model, and model-free learning) gives
students an understanding of how learning can compensate for a lack of
available domain knowledge.
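For contrast with the model-free sketch above, here is a minimal sketch of the model-based side: value iteration with full knowledge of the MDP. The transition table, reward vector, and constants are illustrative assumptions for a hypothetical 4-state MDP, not the maps used in the assignment.

```python
# Illustrative 4-state MDP with known dynamics. P[s][a] is a list of
# (probability, next_state) pairs; R[s] is the reward for being in s.
# All of these values are hypothetical, chosen only for the demo.
GAMMA, THETA = 0.9, 1e-6

P = {
    s: {0: [(1.0, s)],                              # action 0: stay put
        1: [(0.8, min(s + 1, 3)), (0.2, s)]}        # action 1: advance w.p. 0.8
    for s in range(4)
}
R = [0.0, 0.0, 0.0, 1.0]  # only state 3 is rewarding

def value_iteration():
    """Sweep Bellman optimality backups until the values converge."""
    V = [0.0] * 4
    while True:
        delta = 0.0
        for s in range(4):
            # Best expected one-step return over the available actions.
            best = max(
                sum(p * (R[s2] + GAMMA * V[s2]) for p, s2 in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < THETA:  # largest change this sweep is negligible
            return V

V = value_iteration()
```

Note what the agent consumes here: the full tables P and R. That is exactly the domain knowledge a Q-learning agent does without, which is the contrast the assignment is built around.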
