1. Domain Background
Of the three broad types of machine learning (unsupervised, supervised and
reinforcement), reinforcement learning stands apart. While the first two learn from data, RL
learns from experience. RL mimics the way living organisms interact with and learn
from their environment. The RL agent interacts with the environment by taking actions to
maximize the cumulative reward obtained over the long term. Positive rewards encourage the
agent to take desirable actions (in the direction of the goal state), while negative rewards act
as a deterrent against undesirable actions and states.
Fig 1: The reinforcement learning feedback loop (Sutton and Barto 2017)
The history, tools and techniques used to solve RL problems are well documented in the
book “Reinforcement Learning: An Introduction” by Sutton and Barto (1998/2017) [1].
The RL problem is well captured as a Markov Decision Process (MDP). Solutions to the MDP
problem range from those that require a model of the environment (dynamic programming) to
those that are model-free (Monte Carlo and temporal-difference learning) [1].
“Reinforcement learning offers to robotics a framework and set of tools for the design of
sophisticated and hard-to-engineer behaviors.” - J. Kober, J. Andrew (Drew) Bagnell, and
J. Peters in Reinforcement Learning in Robotics: A Survey [2]
1.1 Motivation
I am drawn to reinforcement learning by the range of problems it aims to solve and its
approach to doing so. It is widely regarded as foundational to work on Artificial General
Intelligence. Recent advances such as beating a world Go champion [6] and mastering Atari
games [7] have further energized the field. My interest lies in applying RL to training robots
to accomplish tasks, a class of problems the RL framework is well suited for. I am also
influenced by Pieter Abbeel’s [8] work and talks.
2. Problem Statement
For my MLND Capstone project, I propose to explore the domain of reinforcement learning
and apply it to the problem of training a robotic arm to accomplish a specific task.
Fig 2: A 4 DOF robotic arm with gripper in the V-REP environment. The blue cube
represents the maneuvering space.
In this project, the robotic arm will learn to grab an object (the cylinder in the above
diagram) and put it in a desired location (the bin in the above diagram).
The objective of the agent will be to grab the object and successfully put it in the bin using
the fewest possible actions.
The arm will have to learn to locate the object, grab it, lift it, maneuver it to the top of the bin
and release it. An appropriate reward strategy needs to be designed to enable the agent to do so.
3. Datasets and inputs
Being a reinforcement learning approach, this project does not need a specific dataset to start
with. The solution uses Q-learning to generate the optimal policy, so over time we
will build up a Q-table that captures the learned values for state-action pairs.
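As a rough illustration of how such a table might be represented, here is a minimal sketch; the state encoding (discretized gripper coordinates plus a gripper status flag) and the number of actions are illustrative assumptions, not the final design:

```python
from collections import defaultdict
import numpy as np

N_ACTIONS = 8   # assumed: small moves along +/- x, y, z plus gripper open/close

# Q-table: maps a discretized state to an array of action values, all zero initially.
q_table = defaultdict(lambda: np.zeros(N_ACTIONS))

# Hypothetical state key: gripper position on the discrete grid plus gripper status.
state = (0.010, 0.125, 0.060, "open")
print(q_table[state])   # eight zeros until learning starts updating them
```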
3.1 Inputs
The agent will be provided with the coordinates of the space where it can maneuver its arm
(the cubic space in figure 2).
Though in real life the positions of the arm and the objects lie in a continuous space, I will
use positions spaced out in discrete steps of 0.005 units. This creates discrete locations in
the coordinate system where the arm can move to, which is enough resolution for the gripper
to hold the object and maneuver it. It also makes the state space finite.
The arm can reach any point in this space within a resolution of 0.005 units. For example, along
the z-axis there are 24 points where the gripper can be positioned ((0.12 - 0)/0.005). This
will be fine-tuned if needed.
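A minimal sketch of this discretization follows; only the 0.005 step and the 0.12-unit z extent come from the text above, the rest is illustrative:

```python
import numpy as np

STEP = 0.005                 # discretization step, in scene units

def discretize_axis(lo, hi, step=STEP):
    """Discrete positions along one axis of the maneuvering space."""
    return np.arange(lo, hi, step)

z_points = discretize_axis(0.0, 0.12)
print(len(z_points))         # 24 points along z, matching (0.12 - 0) / 0.005

def snap(value, lo, step=STEP):
    """Snap a continuous coordinate onto the discrete grid (useful for building state keys)."""
    return lo + round((value - lo) / step) * step
```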
4. Solution statement
Classic Q-learning will be used for generating the optimal policy. It is an online, model-free
and off-policy approach.
Fig 3: The Q-learning update rule (from Wikipedia)
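In code, the standard tabular Q-learning update applied after every action looks roughly like the sketch below; the learning rate and discount factor are placeholder values, and q_table is assumed to be the state-to-action-values mapping sketched in section 3:

```python
import numpy as np

ALPHA = 0.1    # learning rate (placeholder value)
GAMMA = 0.9    # discount factor (placeholder value)

def q_update(q_table, state, action, reward, next_state):
    """Standard Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    best_next = np.max(q_table[next_state])
    td_target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (td_target - q_table[state][action])
```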
5. Benchmark model
While researching this project, I came across approaches to teaching robotic arms to grasp
objects (Google, Cornell, Google). These approaches use computer vision to learn the task
and focus on finer aspects such as grasping objects of various shapes, sizes and textures.
Since I am using sensory data (position, gripper state) to accomplish the task described in
section 2, the above-mentioned research would not make a good benchmark to compare
against.
I propose to use a uniformly random agent as the benchmark. Though it is highly unlikely
that the benchmark agent will be able to accomplish the task, it will be interesting to
compare the cumulative reward collected per episode and to see how much the actual
solution improves on the random agent's reward.
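A minimal sketch of the random benchmark, assuming a generic environment object with reset/step methods (this interface is a placeholder, not the actual RobotArm class):

```python
import random

def run_random_episode(env, n_actions, max_steps=200):
    """Benchmark: pick uniformly random actions and record the cumulative reward."""
    env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = random.randrange(n_actions)
        reward, done = env.step(action)   # assumed interface: returns (reward, done)
        total_reward += reward
        if done:
            break
    return total_reward
```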
6. Evaluation metrics
● Success/failure per episode: When an episode terminates, it will either have failed
or succeeded in its goal. Initially we will see failures and eventually successes. This
metric will tell us how many training episodes it took before the arm learnt
the task.
● Cumulative reward per episode: How much reward did the agent collect before the
episode ended? We should see a gradual increase in accumulated reward, moving
from negative to positive. Even when the reward is positive, we will initially see
failures if the episode ends in one of the fail states. Gradually the agent should start
succeeding.
● Actions per episode: The challenge for the arm is to accomplish the task in the
fewest possible actions. This metric will track that.
7. Project design
As shown in figure 2, a scene will be created in the V-REP environment with three objects:
1. Uarm with gripper.ttm (controlled by agent)
2. A cylinder (object)
3. A bin (target)
The V-REP Python remote API client library will be used to interact with this scene.
Apart from the V-REP framework, the other requirements are Python 3.5, matplotlib and numpy.
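A minimal sketch of how the scene might be queried through V-REP's legacy remote API; it assumes the vrep.py bindings shipped with V-REP are on the path, the remote API server listens on the default port 19997, and the scene contains an object named 'Cylinder' (the object name is an assumption):

```python
import vrep   # Python bindings shipped with V-REP (legacy remote API)

# Connect to the V-REP remote API server running inside the simulator.
client_id = vrep.simxStart('127.0.0.1', 19997, True, True, 5000, 5)
if client_id == -1:
    raise RuntimeError('Could not connect to the V-REP remote API server')

# Look up a handle for the object the agent needs to track and read its position.
# simx_opmode_blocking is an alias of simx_opmode_oneshot_wait in older releases.
_, cylinder = vrep.simxGetObjectHandle(client_id, 'Cylinder', vrep.simx_opmode_blocking)
_, pos = vrep.simxGetObjectPosition(client_id, cylinder, -1, vrep.simx_opmode_blocking)
print(pos)   # [x, y, z] of the object in world coordinates

vrep.simxFinish(client_id)
```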
The figures below show the class interfaces of RobotArm and RobotArmAgent.
Fig 5: The RobotArm class
Fig 6: The RobotArmAgent class
References:
1. http://incompleteideas.net/sutton/book/the-book.html
2. http://www.ias.tu-darmstadt.de/uploads/Publications/Kober_IJRR_2013.pdf
3. https://en.wikipedia.org/wiki/Reinforcement_learning
4. https://www2.informatik.uni-hamburg.de/wtm/ps/Yan_DAAAAM09.pdf
5. http://www.coppeliarobotics.com/
6. https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
7. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
8. https://people.eecs.berkeley.edu/~pabbeel/