NetLogo User Community Models

Note: this model was made in a version prior to NetLogo 6.0, so it cannot be run in NetLogo Web.

WHAT IS IT?

This model implements Q-learning (Watkins 1989), a one-step temporal-difference algorithm from reinforcement learning, a branch of artificial intelligence and machine learning.

HOW IT WORKS

The agent (ant) moves to a high-value patch, receives a reward, and uses that reward to update the learned value of the patch it just left, according to the following rule:

Q(s,a) = Q(s,a) + step-size * [reward + discount * max_a' Q(s',a') - Q(s,a)]
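
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the model's NetLogo code: the table layout (a dictionary keyed by state and action), the action names, and the function name q_update are all assumptions made for the example.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, step_size, discount):
    """Apply one Q-learning update to q[(state, action)] and return the new value."""
    # Best learned value available from the next state (unseen pairs default to 0.0).
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference error: the bracketed term in the update rule.
    td_error = reward + discount * best_next - q[(state, action)]
    q[(state, action)] += step_size * td_error
    return q[(state, action)]

# Hypothetical example: one move east from patch (0, 0) that earns a reward of 10.
q = defaultdict(float)
actions = ["north", "south", "east", "west"]
new_value = q_update(q, (0, 0), "east", 10.0, (1, 0), actions,
                     step_size=0.5, discount=0.9)
```

With an empty table, the first update moves the value halfway from 0 toward the reward, because step-size is 0.5 and the next state has no learned value yet.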

The agent keeps moving until it reaches either a blue patch, which gives a reward of -10 points, or the goal patch, which gives a reward of +10 points. Either event ends the episode, and the agent is reset to the starting position for a new one.
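
The episode logic described above might look like the following Python sketch. The -10 and +10 rewards come from the description; everything else is assumed for illustration, including the coordinates, the zero reward for ordinary moves, the absence of walls, and the names step and run_episode.

```python
TRAP_REWARD = -10.0  # blue patch, as described above
GOAL_REWARD = 10.0   # goal patch, as described above

def step(position, move, traps, goal):
    """Return (new_position, reward, episode_done) for one move."""
    new_position = (position[0] + move[0], position[1] + move[1])
    if new_position in traps:
        return new_position, TRAP_REWARD, True
    if new_position == goal:
        return new_position, GOAL_REWARD, True
    # Assumption: ordinary moves earn no reward.
    return new_position, 0.0, False

def run_episode(moves, start, traps, goal):
    """Follow a fixed list of moves until a terminal patch; return the total reward."""
    position, total = start, 0.0
    for move in moves:
        position, reward, done = step(position, move, traps, goal)
        total += reward
        if done:
            break  # the next episode would begin again from `start`
    return total

# Hypothetical layout: one blue patch at (1, 1) and the goal at (2, 0).
traps = {(1, 1)}
goal = (2, 0)
reward_to_goal = run_episode([(1, 0), (1, 0)], (0, 0), traps, goal)
reward_to_trap = run_episode([(1, 0), (0, 1), (1, 0)], (0, 0), traps, goal)
```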

HOW TO USE IT

The buttons and sliders control the setup and all of the algorithm's parameters. The graph shows the average reward obtained per episode. The step-size parameter controls how far old values are moved toward new estimates on each update. Discount is the present value of future rewards. Exploration-% is the percentage of moves in which the agent heads for a non-optimal patch, which helps the agent explore more of the maze and avoid getting stuck in local optima.
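
One common way to realize an exploration percentage is epsilon-greedy selection, sketched below in Python. This is an assumption about the general technique, not a transcription of the model: here an exploratory move picks uniformly among all actions (so it may still land on the best one), which can differ from how the model chooses its non-optimal patch. The function name choose_action is hypothetical.

```python
import random

def choose_action(q_values, exploration_pct, rng=random):
    """Pick an action from a dict mapping action -> learned value."""
    if rng.random() * 100 < exploration_pct:
        # Explore: ignore the learned values and pick any action at random.
        return rng.choice(list(q_values))
    # Exploit: pick the action with the highest learned value.
    return max(q_values, key=q_values.get)

# Hypothetical learned values for the four moves out of one patch.
values = {"north": 1.0, "south": -2.0, "east": 3.0, "west": 0.0}
greedy_choice = choose_action(values, exploration_pct=0)
```

With exploration-% at 0 the agent always exploits, and with it at 100 every move is random, so intermediate settings trade off learning speed against coverage of the maze.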

THINGS TO NOTICE

The average reward in the graph increases over the number of episodes that the agent has trained on, which shows the learning process of the agent.
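
As an illustration of how such a curve can rise, the Python sketch below computes a running average of per-episode rewards. It assumes the graph plots the mean reward over all episodes so far, which the description does not state explicitly, and the reward sequence is made up for the example.

```python
def running_average(episode_rewards):
    """Return the mean reward over all episodes up to and including each one."""
    averages, total = [], 0.0
    for count, reward in enumerate(episode_rewards, start=1):
        total += reward
        averages.append(total / count)
    return averages

# Hypothetical run: two early failures followed by two successes.
averages = running_average([-10.0, -10.0, 10.0, 10.0])
```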

THINGS TO TRY

Experiment with the algorithm parameters such as step-size, discount, and exploration-%.

EXTENDING THE MODEL

Implement different reward schemes that encourage more direct paths. For example, a reward of -1 point for every move would push the agent to find a shorter route to the goal patch.
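
The effect of a per-move penalty can be checked with a little arithmetic, sketched here in Python. The function name path_return is hypothetical, and the sketch assumes the penalty also applies to the final move, where it is added to the +10 goal reward.

```python
def path_return(num_moves, step_penalty=-1.0, goal_reward=10.0, discount=1.0):
    """Discounted return of a path that reaches the goal on its last move."""
    total = 0.0
    for t in range(num_moves):
        reward = step_penalty
        if t == num_moves - 1:
            reward += goal_reward  # assumption: the last move also pays the penalty
        total += (discount ** t) * reward
    return total

short_path = path_return(4)   # 4 moves to the goal
long_path = path_return(8)    # 8 moves to the goal
no_penalty_short = path_return(4, step_penalty=0.0)
no_penalty_long = path_return(8, step_penalty=0.0)
```

Without discounting, the penalty-free scheme rates both paths equally, while the -1 penalty makes the shorter path worth more. A discount below 1 already favors shorter paths to some degree, so the two settings interact.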

CREDITS AND REFERENCES

Written by Joe Roop (Spring 2006): Joseph.Roop@asdl.gatech.edu
Graduate Research Assistant
Aerospace Systems Design Laboratory (ASDL): http://www.asdl.gatech.edu/
Georgia Institute of Technology

References:
1. Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
2. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
