|
Overview
Value iteration and policy iteration give an agent in a
nondeterministic environment a way to compute an
optimal policy by solving
a Markov decision process (MDP). However,
solving an MDP requires that an agent have a
great deal of knowledge about its environment:
specifically, the rewards for each state and
the transition probabilities between states.
When this knowledge is not available, the
agent can instead learn how to act from
experience. Reinforcement learning, and
Q-learning in particular, is a method for doing
this. Q-learning is a form of model-free
learning: given enough experience, a Q-learning
agent can learn an optimal policy without being
told the rewards or transition probabilities.
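
As an illustration of the model-free idea described above, here is a minimal sketch of tabular Q-learning in Python. It is not taken from the assignment materials: the four-state corridor environment, the function names `corridor_step` and `q_learning`, and all parameter values (learning rate, discount, exploration rate, episode count) are assumptions chosen only to keep the example small. The agent never reads the transition probabilities; it only observes the outcome of each step.

```python
import random

def corridor_step(state, action, rng):
    """Hypothetical toy environment: states 0..3 in a row, state 3 is the goal.
    A move succeeds with probability 0.8; otherwise the agent stays put."""
    move = 1 if action == "right" else -1
    if rng.random() < 0.8:
        next_state = min(3, max(0, state + move))
    else:
        next_state = state
    done = next_state == 3
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def q_learning(step, states, actions, start, episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: q[(s, act)])
            s2, r, done = step(s, a, rng)
            # Move Q(s, a) toward the observed reward plus the discounted
            # value of the best action in the next state.
            best_next = 0.0 if done else max(q[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
            if done:
                break
    return q

if __name__ == "__main__":
    actions = ["left", "right"]
    q = q_learning(corridor_step, [0, 1, 2, 3], actions, start=0)
    for s in (0, 1, 2):
        print(s, max(actions, key=lambda a: q[(s, a)]))
```

The update rule uses only sampled transitions, which is why no model of the environment is needed; the cost is that many episodes may be required before the greedy policy settles.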
In this problem, students implement value
iteration, policy iteration, and Q-learning to
discover optimal policies for both a toy map and
a realistic campus map. Applying model-based
planning and model-free learning to the same
set of problems shows students
how learning can be used to make up for a
lack of available domain knowledge.
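
For contrast, the following sketch shows what solving an MDP with full domain knowledge can look like, using value iteration on a hypothetical four-state corridor. It is an assumption-laden illustration, not the assignment's reference solution: the names `corridor_model` and `value_iteration`, the per-state reward table `R`, the transition table `P`, and the discount factor are all made up for this example. Note that the solver must be handed exactly the knowledge the overview mentions: a reward for each state and the transition probabilities between states.

```python
def corridor_model(n=4, p_success=0.8):
    """Hypothetical toy MDP: states 0..n-1 in a row, last state is terminal.
    P[(s, a)] is a list of (probability, next_state); R[s] is the state reward."""
    states = list(range(n))
    actions = ["left", "right"]
    P = {}
    for s in range(n - 1):  # the terminal state has no outgoing transitions
        for a, move in (("left", -1), ("right", 1)):
            target = min(n - 1, max(0, s + move))
            P[(s, a)] = [(p_success, target), (1 - p_success, s)]
    R = {s: 0.0 for s in states}
    R[n - 1] = 1.0
    return states, actions, P, R

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-9):
    """Repeat the Bellman backup V(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
    until the largest change in any state value falls below tol."""
    V = {s: 0.0 for s in states}

    def expected_value(s, a):
        return sum(p * V[s2] for p, s2 in P[(s, a)])

    def is_terminal(s):
        return (s, actions[0]) not in P

    while True:
        delta = 0.0
        for s in states:
            if is_terminal(s):
                new_v = R[s]
            else:
                new_v = R[s] + gamma * max(expected_value(s, a) for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # Extract a greedy policy from the converged values.
    policy = {s: max(actions, key=lambda a: expected_value(s, a))
              for s in states if not is_terminal(s)}
    return V, policy

if __name__ == "__main__":
    V, policy = value_iteration(*corridor_model())
    print(policy)
```

Because the model is known, no interaction with the environment is needed here; a learning agent would have to recover the same policy from sampled experience instead.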
|