NetLogo User Community Models

Reinforcement Learning Maze

by Joe Roop (Submitted: 05/08/2006)

[screen shot]

Download Reinforcement Learning Maze

WHAT IS IT?

This model implements Q-learning (Watkins 1989), a one-step temporal-difference algorithm from reinforcement learning, a branch of artificial intelligence and machine learning.

HOW IT WORKS

The agent (an ant) moves to a high-valued patch, receives a reward, and updates the learned value of the previous patch with that reward using the following update rule:

Q(s,a) = Q(s,a) + step-size * [reward + discount * max(Q(s',a')) - Q(s,a)]

The agent keeps moving until it hits a blue patch, which carries a reward of -10 points, or the goal patch, which carries a reward of +10 points. Either event ends the episode and resets the agent to the starting position.
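
A minimal NetLogo sketch of this update, for illustration only: it assumes each patch stores its learned value in a patches-own variable q-value, that the agent remembers the patch it just left in a turtles-own variable previous-patch, and that step-size and discount come from sliders. The identifier names are illustrative, not necessarily those used in the model.

  patches-own [ q-value ]
  turtles-own [ previous-patch ]

  ;; turtle procedure, called after each move with the reward just received
  to update-previous-patch [ reward ]
    ;; best learned value reachable from the patch the agent now stands on
    let best-next max [q-value] of neighbors4
    ask previous-patch [
      set q-value q-value + step-size * (reward + discount * best-next - q-value)
    ]
  end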

HOW TO USE IT

The buttons and sliders control the setup and all of the algorithm's parameters. The graph shows the average reward obtained per episode. The step-size parameter sets how far old values are updated towards new estimates. The discount sets the present value of future rewards. Exploration-% is the percentage of moves in which the agent steps onto a non-optimal patch, which helps it explore more of the maze rather than getting stuck in local optima.
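
A minimal sketch of this kind of epsilon-greedy move selection, assuming the same q-value patch variable and an exploration-% slider running from 0 to 100; the procedure name is illustrative, not the model's actual identifier.

  ;; turtle procedure: pick the next patch to move to
  to choose-move
    ifelse random-float 100 < exploration-%
      [ move-to one-of neighbors4 ]                  ;; explore: random neighboring patch
      [ move-to max-one-of neighbors4 [ q-value ] ]  ;; exploit: highest-valued neighboring patch
  end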

THINGS TO NOTICE

The average reward in the graph increases with the number of episodes the agent has trained on, which shows the agent's learning progress.

THINGS TO TRY

Experiment with the algorithm's parameters, such as step-size, discount, and exploration-%.

EXTENDING THE MODEL

Implement different reward schemes that encourage more direct, optimal paths. For example, a -1 point reward for every move the agent makes forces the agent to find a shorter route to the goal square.
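
One way such a scheme might look in NetLogo, assuming the traps are the blue patches described above and the goal patch is green; the goal color and the reporter name are assumptions, not the model's actual identifiers.

  ;; hypothetical reward for stepping onto patch p
  to-report reward-for [ p ]
    if [pcolor] of p = blue  [ report -10 ]   ;; trap patch ends the episode
    if [pcolor] of p = green [ report  10 ]   ;; goal patch ends the episode
    report -1                                 ;; cost of an ordinary move rewards shorter paths
  end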

CREDITS AND REFERENCES

Written by Joe Roop (Spring 2006): Joseph.Roop@asdl.gatech.edu
Graduate Research Assistant
Aerospace Systems Design Laboratory (ASDL): http://www.asdl.gatech.edu/
Georgia Institute of Technology

References:
1. Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
2. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
