Lunar-Lander-Double-Deep-Q-Networks

An AI agent that uses Double Deep Q-Learning to learn by itself how to land a Lunar Lander in OpenAI Gym (LunarLander-v2)

AI Lunar Lander (LunarLander-v2), Keras with TF backend

A Reinforcement Learning agent that uses a Deep Q-Network to play Lunar Lander

Algorithm Details and Hyperparameters:

  • Implementation: Keras (TensorFlow backend)
  • Algorithm: Double Deep Q-Network (DDQN) with two fully connected neural networks: an online network and a target network
  • Each neural network has the same structure: 2 fully connected hidden layers with 128 nodes each (see the Keras sketch after this list)
  • Optimization algorithm: Adaptive Moment Estimation (Adam)
  • Learning rate: α = 0.0001
  • Discount factor: γ = 0.99
  • Minimum exploration rate: ε = 0.1
  • Replay memory size: 10^6
  • Mini-batch size: 2^6 = 64
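
The network described in the bullets above can be built with a few lines of Keras. This is a minimal sketch under the listed hyperparameters, not the repository's exact code; the names `build_q_network`, `online_net`, and `target_net` are illustrative.

```python
# Minimal Keras sketch of the Q-networks described above (illustrative, not the repo's exact code).
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

def build_q_network(state_dim=8, n_actions=4, lr=0.0001):
    """LunarLander-v2: 8-dimensional state, 4 discrete actions."""
    model = Sequential([
        Dense(128, activation='relu', input_dim=state_dim),  # 1st fully connected layer
        Dense(128, activation='relu'),                       # 2nd fully connected layer
        Dense(n_actions, activation='linear'),               # one Q-value per action
    ])
    model.compile(loss='mse', optimizer=Adam(lr=lr))
    return model

online_net = build_q_network()                   # selects actions and is trained every step
target_net = build_q_network()                   # periodically synced copy used for TD targets
target_net.set_weights(online_net.get_weights())
```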

**Complete evolution (training process): https://www.youtube.com/watch?v=XopVALk2xb4&t=286s**

Description of the problem

  • The agent has to learn how to land a lunar lander on the moon's surface safely, quickly, and accurately (a minimal environment sketch follows this list).
  • If the agent simply lets the lander fall freely, the crash is dangerous, and the agent receives a large negative reward from the environment.
  • If the agent does not land quickly enough (within 20 seconds), it fails its objective and receives a negative reward from the environment.
  • If the agent lands the lander safely but in the wrong position, it receives a small negative or small positive reward, depending on how far the lander is from the landing zone.
  • If the agent lands the lander in the landing zone quickly and safely, it succeeds and is awarded a large positive reward.
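
This reward structure comes directly from the environment. A minimal sketch of interacting with it, assuming OpenAI Gym's LunarLander-v2 and the classic (pre-0.26) Gym API:

```python
# Minimal sketch of the reward signal, assuming the classic (pre-0.26) OpenAI Gym API.
import gym

env = gym.make('LunarLander-v2')
state = env.reset()
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()             # random policy: the lander usually crashes
    state, reward, done, info = env.step(action)   # crash ~ -100, each leg contact +10, coming to rest +100
    total_reward += reward

print('Episode reward:', total_reward)             # random play scores far below the "solved" threshold of 200
```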

Double Deep Q Networks (DDQN):

  • Since the state space is continuous (effectively infinite), the traditional tabular Q-learning method does not work for this problem. As a result, we need to combine Q-learning with a neural network for value approximation. The action space, however, remains discrete.

Q-learning:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

The update above is based on the Bellman equation. You can try drawing a small example MDP to see intuitively why the Q-learning update converges to the optimal Q-values, and thus to the optimal policy.
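
For reference, this is what the tabular update looks like in code, on a hypothetical small MDP (a table cannot cover Lunar Lander's continuous states; the MDP size and learning rate here are only illustrative):

```python
# Tabular Q-learning update on a toy MDP (illustrative only).
import numpy as np

n_states, n_actions = 16, 4                  # hypothetical small MDP
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99                     # toy learning rate; the networks above use α = 0.0001

def q_learning_update(s, a, r, s_next, done):
    target = r if done else r + gamma * np.max(Q[s_next])   # Bellman target
    Q[s, a] += alpha * (target - Q[s, a])                    # move Q(s, a) toward the target
```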

  • For Deep Q-learning, we simply use a neural network to approximate the Q-value at each time step, and then update the network so that the estimate Q(s, a) approaches its target:

    y = r + γ · max_a′ Q(s′, a′; θ⁻),  where θ⁻ are the weights of the periodically updated target network
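
A sketch of how the mini-batch targets can be computed with the two networks from the earlier sketch. The Double DQN twist is that the online network selects the next action while the target network evaluates it; the function name and signature are illustrative, not the repository's API.

```python
# Sketch of the Double DQN target computation for one sampled mini-batch (illustrative).
import numpy as np

def double_dqn_train_step(online_net, target_net, states, actions,
                          rewards, next_states, dones, gamma=0.99):
    """states/next_states: (batch, 8) float arrays; actions: (batch,) int array;
    rewards/dones: (batch,) float arrays (dones is 1.0 for terminal transitions)."""
    q_online_next = online_net.predict(next_states)     # online net picks a' ...
    q_target_next = target_net.predict(next_states)     # ... target net evaluates Q(s', a')
    best_actions = np.argmax(q_online_next, axis=1)

    batch_idx = np.arange(len(states))
    td_targets = rewards + gamma * q_target_next[batch_idx, best_actions] * (1.0 - dones)

    targets = online_net.predict(states)                 # keep other actions' values unchanged
    targets[batch_idx, actions] = td_targets
    online_net.fit(states, targets, verbose=0)           # one supervised step toward the targets
```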

Difference between Q-learning and DQN:

  • Classic Q-learning stores Q(s, a) in a lookup table and updates one entry at a time; DQN replaces the table with a neural network Q(s, a; θ) that can generalize over a continuous state space.
  • To stabilize training, DQN learns from mini-batches sampled from a replay memory and computes its targets with a separate, periodically synchronized target network; Double DQN additionally selects the next action with the online network, which reduces overestimation of Q-values.

Tabular Q-learning is proven mathematically to converge to the optimal policy; with deep neural network approximation there is no general guarantee, but empirically the Deep Q-Network converges to a strong policy in a reasonable amount of time.
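
Putting the pieces together, here is a compressed sketch of the training loop with ε-greedy exploration (ε = 0.1), a replay memory of 10^6 transitions, and mini-batches of 64. `double_dqn_train_step` refers to the sketch above; all names are assumptions rather than the repository's code.

```python
# Compressed training-loop sketch (assumed structure, not the repo's exact code).
import random
from collections import deque
import numpy as np

replay_memory = deque(maxlen=10**6)    # replay memory size 10^6, as listed above
epsilon, batch_size = 0.1, 64          # exploration rate and mini-batch size

def select_action(online_net, state, n_actions=4):
    """Epsilon-greedy action selection."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                              # explore
    return int(np.argmax(online_net.predict(state[np.newaxis])[0]))     # exploit

def store_and_learn(online_net, target_net, transition):
    """Store one (s, a, r, s', done) transition, then learn from a random mini-batch."""
    replay_memory.append(transition)
    if len(replay_memory) < batch_size:
        return
    batch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    double_dqn_train_step(online_net, target_net,
                          states, actions, rewards, next_states, dones)
```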

Training Result:



Before training:

After 800 games:



Learning curve:


  • The blue curve shows the reward the agent earned in each episode.
  • The red curve shows, for each episode on the x-axis, the average reward over that episode and the 99 preceding ones; in other words, it is the running average of the 100 most recent episodes (reproduced in the sketch after this list).
  • The blue curve is much noisier because the exploration rate is kept at ε = 0.1 throughout training and because the value approximation is still poor during the first episodes.
  • Averaging the 100 most recent rewards, however, produces a much smoother curve.
  • From the curve, we can conclude that the agent has successfully learned a good policy for the Lunar Lander problem according to the OpenAI criterion: an average reward of at least 200 over 100 consecutive episodes.
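
The red curve can be reproduced from the per-episode rewards with a simple running average; a sketch with placeholder data (the variable `episode_rewards` is hypothetical):

```python
# Sketch: the 100-episode running average behind the red curve (episode_rewards is placeholder data).
import numpy as np
import matplotlib.pyplot as plt

episode_rewards = np.random.randn(800) * 100            # stand-in for the rewards logged during training
window = 100
running_avg = [np.mean(episode_rewards[max(0, i - window + 1):i + 1])
               for i in range(len(episode_rewards))]

plt.plot(episode_rewards, color='blue', label='Episode reward')
plt.plot(running_avg, color='red', label='Average of last 100 episodes')
plt.axhline(200, linestyle='--', color='gray', label='Solved threshold (200)')
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.legend()
plt.show()
```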


