The task of balancing a vertical pole on top of a movable cart can be achieved using deep Q-learning.
A pole (red) attached by an un-actuated joint to a cart (blue) which moves along a frictionless track (cyan)
The system above is controlled by applying a force to the center of the cart along the positive or negative direction of the world frame x-axis. The red pole starts upright with a small amount of random noise, and the goal is to prevent it from falling over.
A reward of +1 was given for every timestep (0.02 seconds) the pole remained upright. An episode ends when the pole tilts more than 15 degrees from vertical, the cart moves more than 0.6 m from the center, or the time limit (10 seconds) is reached. The episode is a success if the pole never leaves the allowable range for the whole episode. Since 10 seconds at 0.02 seconds per step gives 500 timesteps, the maximum total episode reward is 500.
In this setup, the state observed from the environment is a 4-dimensional vector [cart position (m), cart velocity (m/s), joint angle (rad), joint velocity (rad/s)]. The action space is discrete with three options {0, 1, 2}, corresponding to forces of -10 N, 0 N, and 10 N applied to the cart, respectively.
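The original environment code is not shown here, but the action-to-force mapping and the episode-ending conditions described above can be summarized in a short sketch. All names and constants below are illustrative stand-ins, not the original implementation.

```python
import numpy as np

FORCE_N = {0: -10.0, 1: 0.0, 2: 10.0}    # discrete action -> force applied to the cart (N)
DT = 0.02                                 # timestep length (s)
MAX_ANGLE_RAD = np.deg2rad(15.0)          # pole-angle failure threshold
MAX_CART_POS_M = 0.6                      # cart-position failure threshold
MAX_STEPS = 500                           # 10 s episode limit / 0.02 s per step

def episode_over(state, step):
    """Return True when any of the episode-ending conditions above is met."""
    cart_pos, cart_vel, pole_angle, pole_vel = state
    return (abs(pole_angle) > MAX_ANGLE_RAD
            or abs(cart_pos) > MAX_CART_POS_M
            or step >= MAX_STEPS)
```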
The network architecture used consisted of an input layer with as many neurons as the state vector has dimensions (4), two hidden layers of 120 and 84 neurons with ReLU activations, and an output layer with one output per action (3).
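The write-up does not state which framework was used; the following PyTorch sketch is one way such a network could look, using the layer sizes given above.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q-network: 4 state inputs -> 120 -> 84 -> 3 Q-values."""
    def __init__(self, state_dim=4, n_actions=3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, n_actions),   # one Q-value per discrete action
        )

    def forward(self, x):
        return self.layers(x)
```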
The implementation of deep Q-learning with experience replay required two networks to be initialized, an action network and a target network, in addition to a replay memory buffer of size 10,000. For each episode, the episode reward was set to zero, the environment was reset, and the first environment observation was obtained.
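A minimal sketch of this initialization step, building on the QNetwork sketch above; the optimizer choice and learning rate are assumptions, since neither is specified in the write-up.

```python
import random
from collections import deque

import torch

action_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(action_net.state_dict())   # start with identical weights
target_net.eval()                                      # target network is not trained directly

replay_buffer = deque(maxlen=10_000)                   # oldest transitions are dropped first
optimizer = torch.optim.Adam(action_net.parameters(), lr=1e-3)  # assumed optimizer and lr
```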
For each timestep in the episode, an action was selected and executed using the action network. The next state was observed and the episode reward was incremented. A transition tuple consisting of (current state, current action, reward, next state) was stored in the replay buffer.
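A hedged sketch of this step: the exploration strategy is not described, so epsilon-greedy action selection is assumed here, and a terminal flag is added to the stored transition so the later update step can use it. The `env.step` call is a stand-in for the actual environment interface.

```python
def select_action(state, epsilon=0.1):
    """Choose an action with the action network; epsilon-greedy exploration is assumed."""
    if random.random() < epsilon:
        return random.randrange(3)
    with torch.no_grad():
        q_values = action_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())

# Inside the timestep loop:
# action = select_action(state)
# next_state, reward, done = env.step(action)
# episode_reward += reward
# replay_buffer.append((state, action, reward, next_state, done))  # done flag added for training
```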
A minibatch of transitions from the replay buffer was then randomly sampled, and batch gradient descent was performed on the action network with an MSE loss computed between the Q-value predicted by the action network and the target Q-value derived from the target network (the observed reward plus the discounted maximum Q-value of the next state).
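One possible implementation of this update, again as a sketch rather than the original code; the discount factor and minibatch size are assumptions, and the target uses the standard DQN bootstrap from the target network.

```python
import random

import numpy as np
import torch
import torch.nn.functional as F

GAMMA = 0.99       # discount factor; not stated in the write-up
BATCH_SIZE = 64    # minibatch size; also an assumption

def train_step():
    """One gradient-descent update on a randomly sampled minibatch."""
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.as_tensor(np.array(x), dtype=torch.float32), zip(*batch))

    # Q-values the action network predicts for the actions actually taken.
    q_pred = action_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Targets computed from the (frozen) target network.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + GAMMA * q_next * (1.0 - dones)

    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```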
The current state was then set to the next state, and if that state was terminal, the timestep loop was exited.
Each episode's reward was reported, and a checkpoint of the action network was saved if the reward exceeded a threshold. The target network was also updated every four episodes.
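The end-of-episode bookkeeping might look roughly like the following; the threshold value and checkpoint file name are placeholders, not values from the original code.

```python
TARGET_UPDATE_EVERY = 4        # episodes, as stated above
REWARD_THRESHOLD = 400         # checkpoint threshold; the exact value used is not given

# At the end of each episode (episode_idx and episode_reward come from the outer loop):
# if episode_reward > REWARD_THRESHOLD:
#     torch.save(action_net.state_dict(), "action_net_checkpoint.pt")
# if episode_idx % TARGET_UPDATE_EVERY == 0:
#     target_net.load_state_dict(action_net.state_dict())  # copy action-network weights
```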
The video below shows the cart being controlled by an action network that achieved the maximum episode reward of 500.