WHAT IS IT?
This model implements Q-learning (Watkins 1989), a one-step temporal-difference algorithm from reinforcement learning, a branch of artificial intelligence and machine learning.
HOW IT WORKS
The agent (an ant) moves to a high-value patch, receives a reward, and updates the previous patch's learned value with the received reward using the following update rule:
Q(s,a) = Q(s,a) + step-size * [reward + discount * max(Q(s',a')) - Q(s,a)]
The agent keeps moving until it hits a blue patch (a reward of -10 points) or the goal patch (a reward of +10 points); either event ends the episode and resets the agent to its starting position.
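A minimal NetLogo sketch of this update, assuming each patch stores its learned value in a patches-own variable q-value and the ant remembers the patch it just left in a turtles-own variable prev-patch (both names are hypothetical; step-size and discount are the sliders described under HOW TO USE IT):

  patches-own [ q-value ]     ; learned value of moving onto this patch
  turtles-own [ prev-patch ]  ; patch the ant occupied before its last move

  ; run by the ant after it has moved and received a reward
  to update-q [ reward ]
    let best max [ q-value ] of neighbors4  ; best value reachable from the new patch
    ask prev-patch [
      set q-value q-value + step-size * (reward + discount * best - q-value)
    ]
  end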
HOW TO USE IT
The buttons and sliders control the setup and all the parameters of the algorithm. The graph shows the average reward obtained per episode. The step-size parameter controls how far old values are updated toward new values. Discount is the present value of future rewards. Exploration-% is the fraction of moves in which the agent chooses a non-optimal patch, which helps the agent explore more of the maze and avoid getting stuck in local optima.
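As an illustration, the move rule can be written as an epsilon-greedy choice over the four neighboring patches; this is a sketch assuming the hypothetical q-value patch variable from above and the exploration-% slider:

  ; run by the ant: usually move to the best-valued neighbor,
  ; but with probability exploration-% move to a random neighbor instead
  to choose-move
    ifelse random-float 100 < exploration-%
      [ move-to one-of neighbors4 ]                  ; explore
      [ move-to max-one-of neighbors4 [ q-value ] ]  ; exploit
  end

Occasionally taking a random neighbor keeps every path in the maze reachable, which is what prevents the agent from settling on a locally good but globally poor route.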
THINGS TO NOTICE
The average reward in the graph increases over the episodes the agent has trained on, which shows that the agent is learning.
THINGS TO TRY
Experiment with the algorithm parameters such as step-size, discount, and exploration-%.
EXTENDING THE MODEL
Implement different reward schemes that encourage more direct, optimal paths, such as a reward of -1 point for every move the agent makes, forcing the agent to find a more direct route to the goal patch; see the sketch below.
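For instance, a -1 step cost could be added with a small change to the reward calculation, sketched here (reward-for and goal-patch are hypothetical names):

  ; reward scheme with a step cost: ordinary moves are penalized,
  ; so shorter routes to the goal accumulate more total reward
  to-report reward-for [ p ]  ; p is the patch the ant just moved onto
    if [ pcolor ] of p = blue [ report -10 ]  ; trap patch ends the episode
    if p = goal-patch         [ report 10 ]   ; goal patch ends the episode
    report -1                                 ; cost of an ordinary move
  end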
CREDITS AND REFERENCES
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge.