Gym is an open source Python library for developing and comparing reinforcement learning algorithms. It provides a standard API for communication between learning algorithms and environments, as well as a standard set of environments compliant with that API. Since its release, Gym's API has become the de facto field standard for this purpose.
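As a minimal sketch of that API, the loop below instantiates an environment, steps it with randomly sampled actions, and accumulates the per-step rewards into an episode return. It assumes the classic Gym interface, in which reset() returns an observation and step() returns an (observation, reward, done, info) tuple; later releases revised these signatures, and CartPole-v1 is used here purely as an example.

```python
import gym

# Create an environment by id.
env = gym.make("CartPole-v1")

observation = env.reset()
episode_return = 0.0
done = False
while not done:
    # A random policy stands in for a learning algorithm.
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    episode_return += reward

env.close()
print("episode return:", episode_return)
```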
Control theory problems from the classic RL literature.
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity. A reward of +1 is provided for every timestep that the pole remains upright, and the maximum number of steps per episode is 500; hence, a perfect agent achieves a total reward of 500 every episode.
A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. A reward of -1 is provided for every timestep until the goal is reached or 200 timesteps have passed.
A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. Here, the reward is greater if you spend less energy to reach the goal
The acrobot system includes two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height. A reward of -1 is provided for every timestep until the goal is reached or 500 timesteps have passed.
The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so that it stays upright.
Continuous control tasks in the Box2D simulator.
Navigate the lander to its landing pad. The landing pad is always at coordinates (0,0), and these coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad with zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg's ground contact is worth +10 points. Firing the main engine costs 0.3 points each frame. The environment is considered solved at 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions are available: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine.
Navigate the lander to its landing pad. The landing pad is always at coordinates (0,0), and these coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad with zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg's ground contact is worth +10 points. Firing the main engine costs 0.3 points each frame. The environment is considered solved at 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. The action is a vector of two real values in [-1, +1]. The first value controls the main engine: -1..0 is off, and 0..+1 throttles the engine from 50% to 100% power (the engine cannot run below 50% power). The second value fires the left orientation engine in -1.0..-0.5, the right orientation engine in +0.5..+1.0, and is off in -0.5..0.5.
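The difference between the two lunar lander variants above is only the action space, which can be inspected directly. As a small sketch, the environment ids below follow the usual Gym naming ("LunarLander-v2" and "LunarLanderContinuous-v2") and assume the Box2D extras are installed.

```python
import gym

# Discrete variant: four actions (do nothing, left engine, main engine, right engine).
discrete_env = gym.make("LunarLander-v2")
print(discrete_env.action_space)      # e.g. Discrete(4)

# Continuous variant: a 2-vector in [-1, 1] for main-engine throttle and
# the left/right orientation engines, as described above.
continuous_env = gym.make("LunarLanderContinuous-v2")
print(continuous_env.action_space)    # e.g. Box(2,) with bounds -1..+1
```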
Train a bipedal robot to walk. Reward is given for moving forward, totalling 300+ points up to the far end of the course. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more optimal agent will get a better score. The state consists of the hull angle, angular velocity, horizontal speed, vertical speed, joint positions and joint angular speeds, leg contact with the ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.
Train a bipedal robot to run through an obstacle course. The hardcore version of the course contains ladders, stumps, and pitfalls, and the time limit is increased to account for the obstacles. Reward is given for moving forward, totalling 300+ points up to the far end of the course. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more optimal agent will get a better score. The state consists of the hull angle, angular velocity, horizontal speed, vertical speed, joint positions and joint angular speeds, leg contact with the ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.
Reward: 305.40 ± 21.35
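A quick way to see the state and action layout described for the bipedal walker tasks is to inspect the environment's spaces. In this sketch, the ids "BipedalWalker-v3" (and "BipedalWalkerHardcore-v3" for the obstacle course) are assumed.

```python
import gym

env = gym.make("BipedalWalker-v3")

# Observation: hull angle and velocities, joint positions/speeds,
# leg-ground contacts, and the 10 lidar readings (no world coordinates).
print(env.observation_space)

# Action: continuous torque commands for the leg motors.
print(env.action_space)

obs = env.reset()
print(obs.shape)
env.close()
```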
Continuous control tasks, running in a fast physics simulator.
An inverted pendulum that needs to be balanced by a cart. The agent gets a reward for every timestep that the pendulum has not fallen off the cart, with a maximum reward of +1000.
Reward: 920.71 ± 224.04
Balance a pole on a pole on a cart. The agent gets a reward for every timestep that the pendulum has not fallen off the cart.
Reward: 9359.88 ± 0.08
A 2D robot trying to reach a randomly located target. The robot gets a negative reward that grows with its distance from the target location.
Reward: -4.75 ± 1.67
Make a 2D cheetah robot run. The robot gets a positive reward based on how far forward it travels.
Reward: 10374.65 ± 202.81
A 2D robot that learns to hop. The agent gets a positive reward based on how far forward it travels.
Reward: 3625.53 ± 9.00
A 2D robot that learns to walk. The agent gets a positive reward based on how far forward it travels.
Reward: 5317.38 ± 15.86
Simulated goal-based tasks for the Fetch and ShadowHand robots.
Move Fetch to the goal position. A goal position is randomly chosen in 3D space. Control Fetch's end effector to reach that goal as quickly as possible. A negative reward is given at every timestep that the agent has not reached the goal position.
Reward: -1.78 ± 0.88
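The Fetch and ShadowHand tasks are goal-based, so observations are dictionaries carrying the achieved and desired goals, and the reward can be recomputed for an arbitrary goal pair via compute_reward(). The sketch below assumes the id "FetchReach-v1" and that the MuJoCo-based robotics extras are installed.

```python
import gym

env = gym.make("FetchReach-v1")
obs = env.reset()
print(obs.keys())  # typically: observation, achieved_goal, desired_goal

action = env.action_space.sample()
obs, reward, done, info = env.step(action)

# The per-step reward can be recomputed from the achieved/desired goal pair.
recomputed = env.compute_reward(obs["achieved_goal"], obs["desired_goal"], info)
print(reward, recomputed)
```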
Retro Atari video game environments.
Pong is a table tennis–themed twitch arcade sports video game. You control the right paddle and compete against the left paddle, which is controlled by the computer. Each player tries to deflect the ball away from their own goal and into the opponent's goal. You score a point when the ball passes the opponent's paddle and lose a point when the ball passes yours. An episode ends when either player reaches 21 points.
Reward: 21.00 ± 0.00
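Atari environments expose the raw game screen as the observation, so a minimal interaction looks like the sketch below. The id "Pong-v0" and the Atari emulator dependency are assumptions here rather than something stated above.

```python
import gym

env = gym.make("Pong-v0")
frame = env.reset()

print(env.action_space)   # a discrete set of joystick/fire actions
print(frame.shape)        # an RGB screen image, e.g. (210, 160, 3)

# Reward is +1 when the agent scores a point and -1 when the opponent does,
# so a perfect episode ends with a return of 21.
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```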