Learning in a Labyrinth:  Learning from Model-Based Feedback
Jerker Denrell, Christina Fang, Daniel A. Levinthal
iibjd@hhs.se, fang@management.wharton.upenn.edu, levinthal@wharton.upenn.edu


     Current models of experiential learning suffer from an important limitation: the choice of a
specific action is assumed to be immediately followed by an observable outcome. However, in
many situations outcomes can be observed only after a series of actions has been performed. As
a result, our models of learning miss a fundamental challenge of learning tasks: actions and
payoffs are often separated in time.
      We create a formal computational model of learning in these situations in which the actor
develops a mental model of the value of intermediate states.  We model a particular method of
credit assignment, termed temporal differencing, that has recently been introduced in the literature
on adaptive systems in computer science. In this method, an actor's own mental model of the
environment is used to provide interim feedback regarding the value of actions in lieu of feedback
from the environment. We explore the evolution of an actor's mental model over time as a
complex problem-solving task is repeated.  While the problem structure is assumed to remain
fixed, the initial conditions for this task are varied with each iteration. As experience is gained
with the problem, stage-setting or antecedent actions begin to be recognized as valuable, and
distinct problem-solving routines develop. These routines become more elaborate with
experience, resulting in increased efficiency at problem-solving. The process, however, requires
repeated trials. The positional values of intermediary states are only gradually imputed. Thus,
while a valid cognitive map quickly emerges for states close to the solution, recognizing the
positional value of more distant states requires numerous trials. Although more extensive credit
assignment can produce faster learning, we show that it may lead to less intelligent associations
between the ultimate outcome and prior actions.  As a result, managing the appropriate degree of
credit assignment is a form of the familiar exploration/exploitation tradeoff.
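
      As an illustration of the credit-assignment mechanism described above, the following is a
minimal sketch of one-step temporal-difference learning, not the model used in the paper. It
assumes a one-dimensional corridor in place of a labyrinth, a payoff of 1 received only on
entering the final state, and a mostly greedy actor that consults its own value estimates to
choose moves. The function name run_td and the parameters alpha, gamma, and epsilon are
hypothetical choices made for this example.

```python
import random


def run_td(n_states=8, episodes=1, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular one-step TD learning on a corridor whose last state is the goal.

    All names and parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    values = [0.0] * n_states  # the actor's mental model of state values

    for _ in range(episodes):
        state = rng.randrange(goal)  # initial conditions vary with each trial
        while state != goal:
            left, right = max(state - 1, 0), state + 1
            # Move at random when exploring or when the model cannot discriminate;
            # otherwise move toward the neighbor the model values more highly.
            if rng.random() < epsilon or values[left] == values[right]:
                nxt = rng.choice((left, right))
            else:
                nxt = right if values[right] > values[left] else left
            reward = 1.0 if nxt == goal else 0.0
            # Interim feedback: the model's own estimate of the next state
            # stands in for the payoff that has not yet been observed.
            values[state] += alpha * (reward + gamma * values[nxt] - values[state])
            state = nxt
    return values


if __name__ == "__main__":
    print("after 1 trial:   ", [round(v, 3) for v in run_td(episodes=1)])
    print("after 500 trials:", [round(v, 3) for v in run_td(episodes=500)])
```

Under these assumptions, the first trial assigns value only to the state adjacent to the goal,
because every earlier update compares two estimates that are still zero. Value reaches more
distant states only over repeated trials, which is the gradual imputation of positional value
that the abstract describes.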