
NetLogo User Community Models


Reinforcement Learning Wargame

by Joe Roop (Submitted: 05/08/2006)



WHAT IS IT?

This model implements Q-learning (Watkins 1989), a one-step temporal-difference algorithm from reinforcement learning.

HOW IT WORKS

The agent (a strike aircraft, shown in blue) senses the state of the game as health, distances, and number of weapons. After sensing the state and receiving a reward, the agent chooses one of 8 actions that change the state, such as evading left or right, flying towards a SAM, or firing a weapon at the SAM. The following Q-learning update is used:

Q(s,a) = Q(s,a) + step-size * [reward + discount * max(Q(s',a')) - Q(s,a)]

The agent keeps making moves until it runs out of weapons, dies, or kills the ‘target’ SAM site. The rewards are -2 pts for using a weapon, -200 pts for dying, and +1000 pts for killing the ‘target’ SAM. The agent can also turn on stealth technology, which prevents the SAM sites from seeing it.
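The update rule and rewards above can be sketched in Python as follows. This is a minimal illustration, not the model's NetLogo code: the state tuple, the action index for firing, the zero-initialised table, and the handling of terminal moves (no bootstrapping once the episode ends) are all assumptions made for the example.

```python
from collections import defaultdict

# Reward values taken from the model description.
REWARD_WEAPON_USED = -2
REWARD_DEATH = -200
REWARD_TARGET_KILLED = 1000

NUM_ACTIONS = 8  # the agent chooses among 8 actions


def make_q_table():
    """Q-table mapping a state to 8 action values (assumed to start at 0)."""
    return defaultdict(lambda: [0.0] * NUM_ACTIONS)


def q_update(q, state, action, reward, next_state,
             step_size, discount, terminal=False):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # Assumption: on a terminal move there is no future value to add.
    best_next = 0.0 if terminal else max(q[next_state])
    td_error = reward + discount * best_next - q[state][action]
    q[state][action] += step_size * td_error
    return q[state][action]


q = make_q_table()
# Hypothetical state encoding: (health, distance bucket, weapons left).
s, s_next = (100, 3, 2), (100, 3, 1)
fire = 7  # hypothetical index of the "fire weapon" action
new_value = q_update(q, s, fire, REWARD_TARGET_KILLED, s_next,
                     step_size=0.5, discount=0.9, terminal=True)
print(new_value)  # 500.0
```

With a step-size of 0.5, a single kill moves the stored value halfway from 0 to the +1000 reward; repeated episodes move it closer.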

HOW TO USE IT

The buttons and sliders control the setup and all the parameters of the algorithm. The graph shows the average reward obtained per episode. The step-size parameter sets how far old values are moved towards new values on each update. Discount is the present value of future rewards. Exploration-% is the percentage of moves the agent takes towards a non-optimal patch, which helps the agent explore more tactics and avoid getting stuck in local optima.
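One common way to realise an exploration percentage is epsilon-greedy selection, sketched below in Python. This is an assumption about the general technique rather than a transcription of the model: the function name `choose_action` is hypothetical, and the exploratory move here is drawn uniformly from all actions (so it can occasionally coincide with the best one).

```python
import random


def choose_action(action_values, exploration_pct, rng=random):
    """Return an action index: explore exploration_pct % of the time, else exploit."""
    if rng.random() * 100 < exploration_pct:
        # Exploratory move: any action, chosen uniformly at random.
        return rng.randrange(len(action_values))
    # Greedy move: the highest-valued action, ties broken at random.
    best = max(action_values)
    candidates = [i for i, v in enumerate(action_values) if v == best]
    return rng.choice(candidates)


values = [0.0, 5.0, 1.0]
print(choose_action(values, exploration_pct=0))  # 1
```

Setting the percentage to 0 makes the agent purely greedy; raising it trades short-term reward for a wider search of tactics.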

THINGS TO NOTICE

The average reward in the graph increases with the number of episodes the agent has trained on, which shows the agent learning. Does the agent use different tactics when the stealth technology is enabled?

THINGS TO TRY

Experiment with the algorithm parameters such as step-size, discount, and exploration-%. Also, investigate the environmental parameters.

EXTENDING THE MODEL

Implement different reward schemes that encourage more direct and optimal paths, such as -1 pt for every move the agent makes, which pushes the agent to find a more direct approach to the ‘target’ SAM. Add a more robust exploration routine. The model is set up for multi-agent learning; however, more advanced cooperation vs. self-interest algorithms need to be implemented to help deal with the unstable environment that multi-agent learning can cause.
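The per-move penalty suggested above could look like the Python sketch below. The function name `step_reward` and the choice to sum all rewards that apply on a move are assumptions for illustration; the point values are the ones given in this description.

```python
def step_reward(weapon_fired, agent_died, target_killed, step_penalty=-1):
    """Reward for one move, with a hypothetical per-move penalty added."""
    reward = step_penalty          # proposed extension: cost of every move
    if weapon_fired:
        reward += -2               # weapon use
    if agent_died:
        reward += -200             # dying
    if target_killed:
        reward += 1000             # killing the 'target' SAM
    return reward


# A move that fires a weapon and kills the target: -1 - 2 + 1000.
print(step_reward(weapon_fired=True, agent_died=False, target_killed=True))  # 997
```

Because every extra move now costs a point, longer routes to the target accumulate a lower total than shorter ones.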

TROUBLESHOOTING

This model requires an external file (“agent.rtf”) to store the learned tactics. If an error appears for “LOAD-STATE-ACTION-FILE”, click the “Clear/Create File” button to create the “agent.rtf” file. The file will work as long as you have permission to write in the directory where the model is stored.
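The create-if-missing pattern behind that button can be sketched in Python as follows. The layout of “agent.rtf” is not documented here, so this example stores a table as JSON in a hypothetical file named agent_q.json; the function names and the state key are likewise illustrative.

```python
import json
import os
import tempfile


def save_q_table(path, q_table):
    """Write the table to disk, replacing any existing file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(q_table, f)


def load_or_create_q_table(path):
    """Return the stored table, first creating an empty file if none exists."""
    if not os.path.exists(path):
        save_q_table(path, {})  # needs write permission in this directory
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Usage in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "agent_q.json")  # hypothetical file name
    table = load_or_create_q_table(path)         # file is created here
    table["100|3|2"] = [0.0] * 8                 # hypothetical state key
    save_q_table(path, table)
    reloaded = load_or_create_q_table(path)
    print(len(reloaded["100|3|2"]))  # 8
```

If the directory is read-only, the first call fails when it tries to create the file, which mirrors the permission requirement described above.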

CREDITS AND REFERENCES

Written by Joe Roop (Spring 2006): Joseph.Roop@asdl.gatech.edu
Graduate Research Assistant
Aerospace Systems Design Laboratory (ASDL): http://www.asdl.gatech.edu/
Georgia Institute of Technology

References:
1. Sutton, R. S., Barto, A. G. (1998) Reinforcement Learning: An Introduction. MIT Press.
2. Watkins, C. J. C. H. (1989) Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
