Automated Construction of Robotic Manipulation Programs
Rosen Diankov
CMU-RI-TR-10-29
26 August 2010
Thesis Committee:
Takeo Kanade, Co-chair
James Kuffner, Co-chair
Paul Rybski
Kei Okada, University of Tokyo
Copyright © 2010 by Rosen Diankov. All rights reserved.
Keywords: robotics, autonomous manipulation, object recognition, planning, grasping,
simulation, inverse kinematics, motion control, camera calibration
Abstract
Society is becoming more automated with robots beginning to perform most tasks in facto-
ries and starting to help out in home and office environments. One of the most important
functions of robots is the ability to manipulate objects in their environment. Because the
space of possible robot designs, sensor modalities, and target tasks is huge, researchers end
up having to manually create many models, databases, and programs for their specific task,
an effort that is repeated whenever the task changes. Given a specification for a robot and
a task, the presented framework automatically constructs the necessary databases and pro-
grams required for the robot to reliably execute manipulation tasks. It includes contributions
in three major components that are critical for manipulation tasks.
The first is a geometric-based planning system that analyzes all necessary modalities of
manipulation planning and offers efficient algorithms to formulate and solve them. This
allows identification of the information needed from the task and robot specifica-
tions. Using this set of analyses, we build a planning knowledge-base that allows informative
geometric reasoning about the structure of the scene and the robot’s goals. We show how
to efficiently generate and query the information for planners.
The second is a set of efficient algorithms considering the visibility of objects in cameras
when choosing manipulation goals. We show results with several robot platforms using
gripper cameras to boost the accuracy of the detected objects and to reliably complete the
tasks. Furthermore, we use the presented planning and visibility infrastructure to develop
a completely automated extrinsic camera calibration method and a method for detecting
insufficient calibration data.
The third is a vision-centric database that can analyze a rigid object’s surface for stable
and discriminable features to be used in pose extraction programs. Furthermore, we show
work towards a new voting-based object pose extraction algorithm that does not rely on
2D/3D feature correspondences and thus reduces the early-commitment problem plaguing
the generality of traditional vision-based pose extraction algorithms.
In order to reinforce our theoretical contributions with a solid implementation basis, we discuss
the open-source planning environment OpenRAVE, which began and evolved as a result of
the work done in this thesis. We present an analysis of its architecture and provide insight
for successful robotics software environments.
Contents
2 Manipulation System 13
2.1 Problem Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Task Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Robot Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 System Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Component Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Execution Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 General Guidelines for Autonomy . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
8 Conclusion 203
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.2 Future of Robotics: Robot-Task Compilers . . . . . . . . . . . . . . . . . . . 207
References 229
List of Figures
1.1 A system that grasps cups from a bartender robot’s tray and puts them in a
dish rack for washing. The bottom shows the minimal set of components that
have to be considered to create a functioning autonomous bartender system.
Each component takes a lot of time to construct, making the entire system
development time on the order of a year. Our goal is to reduce this time
by automating the construction of the components related to manipulation
planning and target object recognition. . . . . . . . . . . . . . . . . . . . . . 3
1.2 Some of the robot platforms we have tested basic manipulation on. Each
robot’s internal world continuously models the geometric state of the environ-
ment so planners can make the most informed decisions. . . . . . . . . . . . 5
1.3 Given a set of robot and task specifications, we construct the databases nec-
essary for the robot to robustly execute the task. The specifications feed into
the planning and vision knowledge-bases. Each knowledge-base analyzes the
specifications and constructs models to help drive the manipulation plan-
ning, sensor visibility analysis, and pose recognition algorithms. This basic
set of algorithms is used in the run-time phase to calibrate the robot sensors
and execute the task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 The OpenRAVE Architecture is composed of four major layers and is designed
to be used in four different ways. . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 A basic task is composed of a set of manipulable objects, their goal criteria,
and the real-world training data used to measure sensor noise models and
recognition programs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 A robot specification should provide a CAD model with annotations of which
parts of the geometry serve what purpose. . . . . . . . . . . . . . . . . . . . 17
2.3 The modules forming a manipulation system that this thesis concentrates
on. The knowledge-bases and goal configuration generators are automatically
generated from the task and robot specifications. . . . . . . . . . . . . . . . 21
2.4 The robot and task knowledge database stores relationships between two or
more components. The relationships are color coded where green represents
task-related relations while blue represents robot-related relations. . . . . . . 24
2.5 The levels of an execution architecture and a description of their inputs and
outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 The steps necessary for executing the target tasks. . . . . . . . . . . . . . . . 26
2.7 Example of localization in a 3D map using a camera. . . . . . . . . . . . . . 28
3.1 Shows the connection between the robot databases when used for manipula-
tion and grasp planning. The most involved and important process is sampling
the goal configurations while taking into account all constraints and task pri-
orities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Shows the environment distances for every point on the object surface (blue
is close, red is far). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 An example of a breakdown of the manipulation configuration space of a task. 36
3.4 Grasp planning involving just the arm and gripper can sometimes require both
the grasp and release goals to be considered before a grasp is chosen.. . . . . 37
3.5 Shows examples of collision-free, stable grasps that are invalid in the environ-
ment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Divides the grasps that pass GraspValidator into the set of reachable (has
inverse kinematics solutions) and unreachable grasps for one target. . . . . . 44
3.7 Shows all the valid, reachable grasps that can be sampled from simultaneously.
The grasp set size is 341 and there are 6 target objects for a total of 136 valid
grasps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.8 Several robot arms performing grasp planning. . . . . . . . . . . . . . . . . . 46
3.9 A robot hand opening a cupboard by caging the handle. Space of all possible
caging grasps (blue) is sampled (red) along with the contact grasps (green). . 47
3.10 The basic framework used for planning discrete paths {q_1, ..., q_n} in robot config-
uration space to satisfy paths {q_target,1, ..., q_target,n} in object configuration space. . . . 51
3.11 Comparison of fixed feasibility regions (left) and relaxed feasibility regions
(right) for the Manus and Puma robot arms. . . . . . . . . . . . . . . . . . 53
3.12 Example simulations using the relaxed grasp-set planning formulation. . . . 55
3.13 WAM arm mounted on a mobile base autonomously opening a cupboard and
a fridge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.14 WAM arm autonomously opening a cupboard, putting in a cup, and closing
it. Wall clock times from start of planning are shown in each frame. . . . . . 57
3.15 Shows a gripper whose fingers need to open, but cannot due to the table.
Planning by setting the free joints to the arm can move the gripper to a safe
location to open its fingers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.16 The planner is also applied to moving the torso joints of the robot such that
the arms do not hit anything. . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.17 When a robot is far away from its goal, it must also plan for moving its body
along with its arm. The robot should first find the possible grasps from which
it can sample robot goal locations. The planning algorithms will then join the
two configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.18 Several configurations returned from GoalSampler BasePlacement con-
sidering both robot arms, the humanoid’s torso, and multiple objects. . . . . 62
3.19 Humanoid upper body structure used to extract arm chains. . . . . . . . . . 63
3.20 Having just a static navigation configuration q_navigation is not enough to safely
go into navigation mode; sometimes the target object might collide with the
robot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.21 Robot needs to plan to a configuration such that all its limbs are within the
base navigation footprint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.22 BiSpace Planning: A full configuration space tree is grown out from the robot’s
initial configuration (1). Simultaneously, a goal back tree is randomly grown
out from a set of goal space nodes (2). When a new node is created, the
configuration tree can choose to follow a goal space path leading to the goal
(3). Following can directly lead to the goal (4); if it does not, then repeat
starting at (1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.23 Comparison of how the distance metric can affect the exploration of the arm.
The top image shows the search trees (red/black) generated when the distance
metric and follow probability is weighted according to Equation 3.14. The
bottom image shows the trees when the distance metric stays uniform across
the space; note how it repeatedly explores areas. The goal space trees are
colored in blue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.24 Scenes used to compare BiSpace, RRT-JT, and BiRRTs. . . . . . . . . . . . 72
3.25 Hard scene for BiSpace. The forward space tree (red) does not explore the
space since it is falsely led over the table by the goal space tree (blue). . . . 73
4.1 Computational dependency graph of all the components extracted from the
relational graph in Figure 2.4. . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 An outline of the parameterizations ikfast can solve along with how the equa-
tions are decomposed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 The labeled joints (black) of the PR2 robot’s right arm. . . . . . . . . . . . . 85
4.4 The labeled joints (black) of the Kuka KR5 R850 industrial robot. . . . . . . 86
4.5 Two types of separable kinematic chains, which allow rotation joints (red) to
be solved separately from translation joints (blue). T0 is the base link frame,
Tee is the end effector link frame. . . . . . . . . . . . . . . . . . . . . . . . . 92
4.6 The parameterization of the ray (green) with the last two axes intersecting. . 93
4.7 The parameterization of the ray (green) using a sub-chain of the Barrett WAM
arm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.8 Several robot platforms whose inverse kinematics can be solved by ikfast. . . 96
4.9 The gripper first approaches the target object based on a preferred approach
direction of the gripper. Once close, the fingers close around the object until
everything collides, then contact points are extracted. . . . . . . . . . . . . . 100
4.10 A simple way of parameterizing the grasp search space. . . . . . . . . . . . . 100
4.11 Examples of the types of robot hands that can be handled by the grasp strategy. . . 101
4.12 Grasps for the HRP2 gripper with a wrist camera mounted. Grasps contacting
the wrist camera are rejected automatically. As the gripper approach varies
the grasps change from power grasps to pinch grasps. . . . . . . . . . . . . 102
4.13 Fragile grasps that have force closure in simulation (contacts shown). . . . . 102
4.14 Example caging grasp set generated for the Manus Hand. . . . . . . . . . . . 104
4.15 A fragile grasp that was rejected by the random perturbation in the grasp
exploration stage, even though it mathematically cages the handle. . . . . . 106
4.16 Goal is for both manipulator tips to contact the surface of the target object.
The left box shows failure cases that are pruned, the right box shows the final
grasp set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.17 Industrial robot using the magnet tip grasps to pick parts out of a bin. . . . 107
4.18 Barrett WAM and Yaskawa SDA-10 kinematic reachability spaces projected
into 3D by marginalization of the rotations at each point. . . . . . . . . . . . 109
4.19 The HRP2 reachability space changes when the chest joint is added (left vs
right). Also, the HRP2 wrist has a unique joint limits range, which can
increase the reachability density by 2x if handled correctly. . . . . . . . . . . 112
4.20 Shows one of the extracted equivalence sets for the Barrett WAM (869 sets),
HRP2 7 DOF arm (547 sets), and HRP2 8 DOF arm+chest (576 sets). Each
was generated with ∆θI = 0.15, ∆zI = 0.05. . . . . . . . . . . . . . . . . . . 114
4.21 Base placements distributions for achieving specific gripper locations; darker
colors indicate more in-plane rotations. . . . . . . . . . . . . . . . . . . . . . 117
4.22 The grasp reachability map using validated grasps to compute the base distri-
bution (overlayed as blue and red densities). The transparency is proportional
to the number of rotations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.23 Examples of convex decompositions for the geometry of the robots. . . . . . 121
4.24 Examples of convex decompositions for the geometry of the robots. . . . . . 122
4.25 Convex decomposition makes it possible to accurately prune laser points from
the known geometry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.26 The swept volumes of each joint, where the parent joint sweeps the volume of
its children’s joints and its link. . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.27 Example scenes used to test the effectiveness of the configuration metric. . . 126
4.28 Camera locations that can successfully extract the object pose are saved. . . 128
4.29 The probability density of the detection extents of several textured objects
using a SIFT-based 2D/3D point correspondence pose detection algorithm. . 129
5.6 When initially detecting objects, the robot could have wrong measurements.
If it fixes the grasp at the time of the wrong measurement, then when it gets
close to the object, the grasp could be infeasible. This shows the necessity to
perform grasp selection in the visual feedback phase. . . . . . . . . . . . . . 145
5.7 The scenes tested with the visibility framework. The images show the robots
at a configuration such that the visibility constraints are satisfied. . . . . . . 149
5.8 The calibration error from the head camera can be very inaccurate when
detecting over large distances. . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.9 Can efficiently prune out all the visibility candidates before sampling by using
reachability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.10 The full process of mobile manipulation using visibility, reachability, and base
placement models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.11 Manipulation with dynamic obstacles from range data (red boxes) using visi-
bility configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.12 The full process of mobile manipulation using visibility, reachability, and base
placement models. In order for the robot to grasp the object, it has to move
to the other side. But at the other side, the only way to see the object is with its
wrist camera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.13 When sampling base placements with visibility, the reachability of grasp sets
also has to be considered; otherwise the robot will need to move again before
it can grasp the object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.14 Industrial bin-picking scene has a ceiling camera looking at a bin of parts and
a gripper camera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.15 Collision obstacles around target part need to be created for modeling un-
known regions because the robot does not have the full state of the world. . 155
5.16 Grasp Planning is tested by simulating the unknown regions. . . . . . . . . . 156
7.1 A well established pose extraction method using 2D/3D point correspondences
to find the best pose. It works well when lighting- and affine-invariant
features can be extracted from the template image. . . . . . . . . . . . . . . 177
7.2 Database is built from a set of real-world images, a CAD model, and a set
of image feature detectors. It analyzes the features for stability and discrim-
inability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.3 A 3DOF motorized stage and a checkerboard pattern for automatically gath-
ering data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.4 Example rendering a blue mug from novel views using only one image in the
database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.5 Example CAD models (first row), the training images (second row), and the
extracted depth maps (third row) of several objects. . . . . . . . . . . . . . . 182
7.6 Orientations from image features are projected onto the object surface and
are clustered. By re-projecting the raw directions on the surface plane and
converting to angles (right graph), we can compute all identified angles (in
this case 0° and 180°) and the standard deviation of the particular mean
direction ±15.5°. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.7 Shows the stability computation for a hole detector and the stably detected
locations (red). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
7.8 Stability computation for a line detector (upper right is marginalized density
for surface). The lower images show the lines from the training images that
are stably detected (red) and those that are pruned (blue). . . . . . . . . . . 184
7.9 Shows the statistics of a hole detector on industrial metallic parts. Each part
was trained with 300 images around the sphere. The right side shows the final
filtered features, which are used for extracting geometric words. . . . . . . . 185
7.10 Shows the effect of poorly labeled data sets on the raw features (blue) and
the final filtered features (red). The filtered clusters for the correctly labeled
set become much more clear. . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.11 The histogram and labels of a feature’s descriptor distribution on the object
(in this case, it is the size of the holes). . . . . . . . . . . . . . . . . . . . . . 187
7.12 The major shape clusters extracted. The clusters on the left (simpler shapes)
are more frequent than the clusters on the right (more complex shapes). . . . 187
7.13 PostgreSQL database that stores the relationships and dependencies of all
offline processes used for pose extraction and object analysis. . . . . . . . . . 188
7.14 An example of the induced poses for a corner detector. The corner confidence
ψ(f) (top) is used to prune out observations with low confidence. The final
induced poses T(f) for a corner detector are stored as a set (bottom). . . . . 189
7.15 Several induced poses for a particular curve (red). . . . . . . . . . . . . . . . 190
7.16 Mean images of the induced poses for an edge detector of a paper cup (in
simulation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.17 The individual detector scores for good and bad poses for the object in Figure
7.3. Using Adaboost, the misclassified good poses and misclassified bad poses
are marked on the data points. . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.18 For an object that has four holes, there are many valid poses of the object in
the real image that can hit all four holes. In fact, using hole features is not
enough to extract a confident pose; 1D curve features are absolutely necessary. 194
7.19 The search process first matches image features to a database, then randomly
selects a set of features and tests for consistency by intersecting their induced
pose sets. Each pose candidate is validated based on positively supporting
and negatively supporting features lying inside the projected object region. . 195
7.20 Several examples of chosen image features (blue) and their pose hypotheses
(in white). The bottom row shows the positively (green) and negatively (red)
supporting features of the pose hypothesis (blue). . . . . . . . . . . . . . . . 196
7.21 Results of pose extraction using induced pose sets. . . . . . . . . . . . . . . . 198
7.22 Ground-truth accuracy results gathered for each DOF of the pose. . . . . . . 199
7.23 Some geometric words cluster very densely around specific points on the object
surface. These stable points can be used for alignment. . . . . . . . . . . . . 200
8.1 The construction process can be treated as an offline compiler that converts
the specifications into run-time programs while optimizing the speed of its
execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
A.1 The OpenRAVE Architecture is composed of four major layers and is designed
to be used in four different ways. . . . . . . . . . . . . . . . . . . . . . . . . 211
A.2 Distance queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
A.3 Physics simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
A.4 Several simulated sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
A.5 Simple functions a viewer should support. . . . . . . . . . . . . . . . . . . . 225
List of Tables
3.1 Statistics for the scenes tested showing average planning times (in seconds)
and size of the grasp sets used. . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Increase in Feasibility Space when using relaxed planning compared to fixed-
grasp planning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Average planning time in seconds for each scene for the scenes in Figure 3.24. 72
4.1 Average time for calling the ikfast solver with the success rate of returning
correct joint values for transformation queries. . . . . . . . . . . . . . . . . . 97
4.2 Analytic Inverse Kinematics Parameters . . . . . . . . . . . . . . . . . . . . 97
4.3 Force-closure Grasping Parameters . . . . . . . . . . . . . . . . . . . . . . . 104
4.4 Kinematic Reachability Parameters . . . . . . . . . . . . . . . . . . . . . . . 111
4.5 Inverse Reachability Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.6 Grasp Reachability Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.7 Statistics and final distance metric weights for the Barrett WAM joints. . . . 126
4.8 Statistics and final distance metric weights for the Barrett WAM joints. . . . 126
4.9 Convex Decomposition Parameters . . . . . . . . . . . . . . . . . . . . . . . 127
4.10 Detectability Extents Parameters . . . . . . . . . . . . . . . . . . . . . . . . 129
5.1 Average processing times for the first visibility stage for thousands of simula-
tion trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.2 Average planning times (with success rates) of the visual feedback stage. . . 149
6.1 Initial intrinsic calibration statistics showing the mean and standard deviation
of calibration results using 5 images. Note how the standard deviation is much
smaller when the principal point is fixed to the center of the image. . . . . 165
6.2 Reprojection and 3D errors associated with calibrating the extrinsic parame-
ters of the gripper camera. Both intrinsic error and reprojection errors are in
pixel distances, with reprojection error using only one pose for the checker-
board by computing it from the robot’s position. . . . . . . . . . . . . . . . . 169
Acknowledgments
First of all, I would like to thank my two advisors Takeo Kanade and James Kuffner. Thank
you both for the amount of patience with my work and the time you have put into explaining
the intricacies of research presentation and the power of language. Kanade-sensei, it has
been a great honor working with you for four years. I have learned more about doing solid
research and its meaning from you than anyone else. You have taught me skills that few
graduate students ever get a chance to experience during their time of study. James, from
the beginning of my graduate studies, I have felt a deep connection between us.
Both of us have come from a very mathematics- and graphics-heavy background, and we
eventually decided that our skills are better spent on solving serious robotics problems. Our
conversations about planning, AI, and software architectures have always been exhilarating.
Both of you have been very instrumental in shaping my thoughts about the robotics field and
the attitude I should approach all challenges with.
I would like to thank the rest of my committee members Kei Okada and Paul Rybski for
giving invaluable feedback throughout the thesis writing. It has been an amazing learning
experience and your comments and criticisms really helped clear up the content and final
thesis message.
Throughout my graduate studies, I had the honor of working with some of the best re-
searchers in the robotics field on many robot platforms. I would like to thank Satoshi Kagami
and Koichi Nishiwaki from AIST Japan who supported the initial research on autonomous
manipulation on the HRP2 humanoid robot. I owe many thanks to our initial robotics team
at Intel Research Pittsburgh: Sidd Srinivasa, Dave Ferguson, Mike vande Weghe, and Dmitry
Berenson. The numerous projects on autonomous manipulation, the constant demos, and the
great conversations will be always remembered. I would like to thank everyone at Willow
Garage for working hard to have the PR2 arms ready for testing the manipulation planners.
Special thanks to Morgan Quigley, Brian Gerkey, Ken Conley, Josh Faust, and the other
ROS (Robot Operating System) developers for making an awesome distributed robotics plat-
form. I would like to thank Ryohei Ueda, Yohei Kakiuchi, Manabu Saito, and Kei Okada
for the tremendous help with HRP2 humanoid experiments at University of Tokyo. I didn’t
think it was possible to stay and sleep in a lab without going home for a week straight until
I worked with you guys. I would like to specially thank Furukawa Makoto, who came from
industry. All the countless conversations we had on automating factory processes and the
current state of the art really helped put this thesis into perspective. You’ve been a great
friend and teacher, and it was always fun to bounce new ideas off of you. Finally, I would
like to thank my lab mates Jun-sik Kim and Jeronimo Rodrigues for the great work we have
done together.
I would like to thank my office mate Tomoaki Mashimo. It has been really fun discussing
with you the state of robotics and how it will impact society. You have taught me a lot
about Japan and were always there when I had something on my mind to say. I also thank
Aki Oyama for the great conversations we’ve had about robotics venture companies and how
we can meet the demands of today’s consumers. Your passion for starting communities that
achieve great things as a whole is contagious; keep it up.
Finally, I would like to thank the entire OpenRAVE community, which has rapidly grown
over the past years. All your great feedback has made OpenRAVE a very powerful tool for
planning research. Thank you for all the patience at times of major changes; it was all for
the best. Your help has let OpenRAVE become a success story for open-sourced research.
To my parents, who showed how to make dreams come true
Chapter 1
Today robots can be programmed to perform specific, repetitive tasks both in the industrial
setting and in the home. The continuous demand for new products and new services in our
society implies that there will be no shortage of new tasks that robots can perform without
human intervention. However, the current set of programming tools puts a limit on the rate
at which new programs can be developed to meet the changing task specifications. For most
companies today, it takes on average six months to a year to develop a new assembly line for
mass producing one product. Considering the current trends of automated production and
services using robots, the amount of human resources available will continue to limit growth
until we can find a process that can automatically construct all the necessary programs for
completing a designated task. Because of increasing lifestyle demands in the 21st century,
a more efficient automation technology has to step in to compensate for the world’s limited
resources and fill the growing labor gap [Ausubel (2009)].
Today more than 1,000,000 industrial robots are operational worldwide, but very few,
if any, of them actually employ complex planning systems and geometrically reason about
their environments. The problem is that working with robots in the real world is plagued
with uncertainties like positioning error, which prevents an autonomous robot from completely
trusting the output of the automatically constructed programs. A robot's behavior needs to
be confirmed by a person before running on the real assembly line, and this requires a lot
of engineering manpower and detailed adjustments of the programs. To be able to trust a
completely automated behavior, the errors due to robot manufacturing, sensor acquisition,
and simulation modeling need to be quantified by calibrating the robot, a crucial process
for robot automation. Calibration corrects the internal models of the robot sensors and its
geometry such that the robot can begin to compute an accurate measure of its surroundings
at all times.
A system that can automatically generate programs should handle common industrial
environments, reason about the kinematic and sensing capabilities of the robot, learn to
recognize the target objects, and minimize uncertainty due to calibration, perception, and
execution errors by using sensing and planning in a feedback loop. As part of the manipula-
tion process, we dedicate a significant part of this thesis to explicitly consider the information
camera sensors provide when planning robot movement.
Possibly the weakest link in the entire chain of components is the manipulation system
and how it interacts with the perception system to complete its tasks. Manipulation requires
the highest precision to localize and detect the object and is the source of most of the failures
in the final system. Even for the simple bartender system, the interplay between the per-
ception and planning systems has a very big impact on overall task performance. Therefore,
the automated construction process has to explicitly consider the quality of information each
sensor is capable of producing in order to make the best decisions about robot placement. For
more complex manipulation systems involving mobile manipulators, multiple arms, onboard
camera sensors, and a plethora of target object types, it becomes an even bigger challenge
to analyze all the pieces to achieve reliable, working robot behavior.
We target the manipulation framework for industry professionals. Although professionals
thoroughly understand their working environment and constraints, they are not experts in
motion planning, geometric analysis, and the intricacies of computational theory. Therefore
it is most advantageous to give them an input space that represents the minimal domain
knowledge necessary to define the task. This domain knowledge is separated from the rest of
the information like algorithmic parameters, which can be automatically generated by logical
reasoning and analysis. In the presented manipulation framework, professionals specify their
domain knowledge in the form of specifications that define the CAD models of the robot
and the task, semantic identification of the robot parts and environment locations, task
constraints, and training data. Using the specifications as inputs, the framework should use
its tool-set of generic algorithms to construct the knowledge-bases and execution programs
so that the robot can automatically calibrate itself, perceive its surroundings, plan in the
environment to pick-and-place objects, and reliably execute the plans for the task.
We combine the algorithms into a final working system. The areas we specifically focus on
are geometric-based analyses of the robot and the task, whose outcome leads to a more
automated construction process. With our framework, it is possible to construct programs
requiring minimal explicit user input, which we quantitatively define as the robot and task
specification. The framework has the ability to adapt to new tasks and robot kinematics
so even people with minimal planning and vision knowledge can quickly get a robot to
automatically complete a basic manipulation task.
The goal is a minimal set of algorithms such that the system components mentioned so far
can come together into a coherent system that achieves the domain of tasks we define. Figure
1.3 shows a breakdown of the construction process and the components that play a key
role in the culmination of the framework. The task and robot specifications are the domain
knowledge input into the system; they provide the CAD models and several basic semantic
definitions like where the robot arms are; their complexity has been kept to a minimum
so that non-experts can easily define their own configurations. We define a manipulation
planning knowledge-base that organizes all databases, models, and heuristics necessary for
manipulation tasks. The knowledge-base is generated from the task and robot specifications,
Figure 1.3: Given a set of robot and task specifications, we construct the databases necessary
for the robot to robustly execute the task. The specifications feed into the planning and vision
knowledge-bases. Each knowledge-base analyzes the specifications and constructs models to help
drive the manipulation planning, sensor visibility analysis, and pose recognition algorithms. This
basic set of algorithms is used in the run-time phase to calibrate the robot sensors and execute
the task.
and represents frequently used information that can be computed offline and is dependent on
robot and task structure. The object recognition programs are trained at this point using the
target object CAD model and a set of training images showing the appearance of the object.
Similar to the planning knowledge-base, a vision-based database can help in organizing
the object-specific information for pose extraction programs. Using the knowledge-bases,
the generic planning and sensor visibility algorithms responsible for robot movement are
selected depending on the task and instantiated with the specific knowledge-base models. We
motivate the need for a minimal set of motion planners and goal configuration samplers that
use this knowledge-base to complete all required manipulation tasks. After all databases and
planners are trained and before the robot can execute its plans, the system has to calibrate
the robot’s sensors so its internal representation matches the real world. This calibration
method should be completely automated when considering the manipulation system as a
whole so that it can be performed at any time. Once the calibration phase is done, the
task can be executed using a simple execution pipeline.
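To make this construction flow concrete, the following Python sketch strings the offline stages together. Every function and field name here is a hypothetical placeholder standing in for the analyses developed in later chapters; it is not the framework's actual API.

# Placeholder analyses standing in for the offline computations described later in the
# thesis (analytic IK, grasp simulation, reachability, recognition training, calibration).
def generate_ik_solver(robot_spec):              return 'ik-solver'
def generate_grasp_set(robot_spec, cad_model):   return ['grasp-0', 'grasp-1']
def compute_kinematic_reachability(robot_spec):  return 'reachability-map'
def train_pose_extractor(cad_model, images):     return 'pose-extractor'
def calibrate_cameras(robot_spec, planning_kb):  return 'calibration'
def execute_task(goal, planning_kb, vision_kb):  return 'done'

def construct_manipulation_programs(robot_spec, task_spec):
    """Offline construction sketch: specifications in, run-time programs out."""
    # Planning knowledge-base: information that depends only on the robot and task
    # structure, so it can be computed once offline and queried quickly at run-time.
    planning_kb = {
        'inverse_kinematics': generate_ik_solver(robot_spec),
        'grasp_sets': {obj['name']: generate_grasp_set(robot_spec, obj['cad_model'])
                       for obj in task_spec['objects']},
        'reachability': compute_kinematic_reachability(robot_spec),
    }
    # Vision knowledge-base: object-specific models trained from the CAD models and
    # the real-world training images supplied in the task specification.
    vision_kb = {obj['name']: train_pose_extractor(obj['cad_model'], obj['training_data'])
                 for obj in task_spec['objects']}
    # Run-time phase: calibrate the sensors first, then execute with the trained models.
    return {
        'calibrate': lambda: calibrate_cameras(robot_spec, planning_kb),
        'execute': lambda goal: execute_task(goal, planning_kb, vision_kb),
    }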
Although we pose no time constraints on the planner and sensing algorithm computation,
we loosely aim for the entire robot system to complete its target tasks on the order of minutes.
We stress here that the planning algorithms and metrics we analyze are meant to prioritize
reliable execution of the task rather than generating smooth, time-optimal, energy-optimal,
safety-optimal, or natural-looking robot motions. Most of these heuristics would require a
different set of algorithms that also incorporate velocities and time into the planning process,
which are not covered in this thesis. However, given the same inputs and outputs, replacing
algorithms with different priorities is simple. In the context of industrial robots, execution
failures come at the highest cost where an assembly line has to be completely stopped in
order to recover from the error. Given that most problems plaguing today’s autonomous
robots are failures due to perception and calibration errors, we dedicate all goals of the
thesis to achieving reliability.
1.4 OpenRAVE
To facilitate the development of this framework, we developed a planning environment
titled OpenRAVE, the Open Robotics Automation Virtual Environment. OpenRAVE forms
the foundation of all the results presented in this thesis, and its open-source nature has
allowed over a hundred researchers across the world to quickly run manipulation programs.
The OpenRAVE architecture shown in Figure 1.4 simplifies the implementation and testing
of planning algorithms and allows new modules to easily inter-operate with other modules
through a well-defined Application Programming Interface (API). The structure of operating
systems, computer languages, and debugging tools has motivated the division of OpenRAVE
into four major layers: a core layer for the API, a plugins layer for extending functionality,
a scripting layer for run-time analysis, and a database layer for generation and quick
retrieval of the knowledge-bases. These layers combine into a set of tools non-experts can
use in analyzing their problem without requiring an in-depth knowledge of software systems,
manipulation planning, and physics.
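As a rough illustration of how these layers are exercised, the short Python sketch below uses the openravepy bindings: the core layer provides the environment, the database layer generates and caches an ikfast solver, and a plugin planner is invoked through its scripting interface. Exact module, class, and scene names follow the openravepy releases from around this period and may differ in other versions; the database generation can take minutes the first time it runs.

from openravepy import Environment, IkParameterization, databases, interfaces

env = Environment()                       # core layer: environment, bodies, collision checking
env.Load('data/lab1.env.xml')             # an example scene distributed with OpenRAVE
robot = env.GetRobots()[0]

# database layer: generate (or load a cached copy of) an analytic IK solver for the arm
ikmodel = databases.inversekinematics.InverseKinematicsModel(
    robot, iktype=IkParameterization.Type.Transform6D)
if not ikmodel.load():
    ikmodel.autogenerate()

# plugins layer, driven from the scripting layer: ask a manipulation planner plugin to
# move the arm to a nearby collision-free joint-space goal
basemanip = interfaces.BaseManipulation(robot)
goal = robot.GetDOFValues(robot.GetActiveManipulator().GetArmIndices())
goal[0] += 0.3
basemanip.MoveManipulator(goal=goal)
robot.WaitForController(0)
env.Destroy()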
The OpenRAVE architecture began as a result of this thesis work and has evolved through
the testing of many robot platforms and the feedback of a rapidly growing OpenRAVE
community. Having a large user base composed of planning researchers and non-experts like
mechanical engineers has provided invaluable feedback into how a manipulation framework
Figure 1.4: The OpenRAVE Architecture is composed of four major layers and is designed to be
used in four different ways.
should be designed and how the user should interact with it. As a consequence, OpenRAVE
has gone through a re-design phase two times; each new design has automated more basic
manipulation functions while removing functions that spread the environment too thin and
don’t provide much value. Such a process has allowed OpenRAVE to concentrate only
on manipulation and geometric analyses, which has allowed new innovations beyond any
capabilities of existing planning environments.
• An explicit construction process that takes domain knowledge in the form of specifi-
cations, and generates a knowledge-base that allows quick runtime execution of plans.
• Several new manipulation planning algorithms that allow efficient planning
with goal spaces, mobile manipulators, and sensor visibility. We present a generalized
goal configuration sampler that encodes everything about the robot and task when
prioritizing goals.
• A planning knowledge-base that automates analyses for object grasping, inverse kine-
matics generation, kinematics reachability, and object detectability. The algorithms
are described in the context of previous work and their contributions lie in their non-
parametric formulations and flexibility in handling different tasks and objects.
• Methods that efficiently plan with the camera sensor to provide reliable execution.
We present a two-stage approach to manipulation planning and prove its necessity for
easily working with sensors without tight feedback loops.
• An automated camera calibration method that can compute the camera model and
camera placement on the robot surface with a single button click. The method uses
motion planning to handle environment collisions and guarantee visibility of the
calibration pattern. Furthermore, we contribute a way to measure the quality of the
calibration data.
• A method to analyze a feature detector's distribution on the object surface to
compute stable and discriminable features, which allows for extraction of geometric
and visual words for the specific object.
• The induced-set pose extraction algorithm that can accurately extract pose in very
cluttered scenes and makes no assumptions about the object surface or the feature
detectors. The algorithm is voting-based and uses a novel learning-based process for
evaluating pose hypotheses in the image.
Chapter 2
Manipulation System
The system architecture dictates how all the individual components in a system interact
with each other to form a collective system that can perform its target tasks. The most
basic systems center around a Sense→Plan→Execute loop where the information flow is
strictly defined between the three components; the formulation allows a divide-and-conquer
approach to solving robotics problems [Russell and Norvig (1995)]. Although several complex
and capable systems have been designed with this structure [Srinivasa et al (2009)], the robot
execution needs to consider feedback processes in several hierarchy levels in order to recover
from errors and changes in the environment. Each hierarchy level can represent sub-loops
of Sense→Plan→Execute depending on the time constraints of the solution and the semantic
level at which the planning is being done. Recent research in autonomous vehicles [Baker
et al (2008)] has shown that a critical component to robust execution is an execution monitor
that continually validates the current path the robot follows and takes a corresponding action
when an error is detected. The ALIS architecture [Goerick et al (2007)] further generalizes
the concept of hierarchies and monitoring across the system by mathematically modeling the
relationships between information flow. A common observable pattern is that all successful
robotics systems have many sensor feedback loops beyond low-level controllers that each
cooperate to move the robot.
The goal of this chapter is to set up the flow of the execution system and dictate the
structure of the rest of the thesis in explaining the manipulation processes. We begin by
defining the manipulation problem to be solved and covering all the system elements in-
volved in producing the final robot behavior and computational infrastructure. We cover
robot and task specifications that clearly state what domain knowledge is required in the
problem, which helps define the extensibility of the system. In the simplest form, the robot
specification defines the CAD model and kinematics of the robot along with several semantic
labels for what geometry corresponds to attached sensors and arms. We then provide the
system map of all the functional modules and how they pass information across each other.
Compared to other systems, there are two new modules in our system which play a key role:
the goal configuration generators and the knowledge-base of precomputed, planning-specific
information. Furthermore, we analyze the manipulation system from the angle of individual
components that play key roles in the planning algorithms and help us formulate and design
the planning knowledge-base. For example, the arm and gripper components are related
to each other through the kinematics. Using the basic module and component
definitions, we formulate an execution process that defines the general flow in which the
robot will move in order to complete its task. Towards the end, we briefly introduce a set
of autonomy guidelines to formally define the requirements of each of the modules in our
execution system.
• Place the object in designated locations with respect to another object or the map of
the environment.
• Act on the object by moving it a relative distance from its initial configuration, placing
it at a designated location, or pushing it like a button.
Figure 2.1 shows the required information the user should specify to the system. We
require that the CAD model of every manipulable object be specified. We are interested in
manipulating specific objects that have fixed geometric and material properties, and are
not interested in generalizing manipulation algorithms to semantic classes of objects. Because
there is a lot of research on how to automatically acquire the geometry of an object using
LIDAR and vision sensors [Curless et al (1995); Zhang et al (2002, 2003)], we can safely
Figure 2.1: A basic task is composed of a set of manipulable objects, their goal criteria, and the
real-world training data used to measure sensor noise models and recognition programs.
assume the geometry is known. Furthermore, the task specification should provide real world
images or other sensor readings in order to learn how the object’s appearance is associated
with its pose, which is used to build more robust object-detection programs. Because the
training data comes from sensors, it is possible to measure the uncertainty introduced by
the sensor acquisition process. Having fixed geometry and appearance greatly simplifies
the object pose extraction and grasping programs because the framework does not have to
generalize or learn this information. This allows the generated programs to be more accurate
and less prone to modeling error.
Other information in the task specification deals with defining the goal of the task and
any constraints on the target object or robot motion. For example, one of the simplest task
constraints would be to carry a bucket full of water while maintaining its orientation with
respect to the ground so as to avoid spilling the water. Finally, an optional specification is
a geometric map of the environment the robot moves in, which can be used for navigating
and collision avoidance. Although there exist many algorithms to acquire environment
semantic information during run-time [Rusu et al (2008)], knowing the map can avoid the
extra information-gathering step and make execution faster. Table 2.1 lists the parameters
and their descriptions.
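Since Table 2.1 itself is not reproduced here, the shape of such a task specification can be sketched as a plain Python dictionary; the keys and file paths below are illustrative placeholders, not the framework's actual schema.

# Illustrative task specification for a cup-into-dish-rack style task.
task_specification = {
    'objects': [                         # every manipulable object needs a fixed CAD model
        {'name': 'cup',
         'cad_model': 'models/cup.iv',
         'training_data': ['data/cup_000.png', 'data/cup_001.png']},  # real sensor readings
    ],
    'goal': {'cup': {'place_in': 'dish_rack'}},      # goal criteria per object
    'constraints': [                     # e.g. keep a filled container upright while moving it
        {'object': 'cup', 'type': 'maintain_orientation', 'axis': [0, 0, 1]},
    ],
    'environment_map': 'maps/kitchen.dae',   # optional: speeds up navigation and collision checks
}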
• The kinematic and geometric models of how the actuators affect the physical robot.
Figure 2.2: A robot specification should provide a CAD model with annotations of which parts
of the geometry serve what purpose.
• The current position and velocity of the robot’s joints at high communications rates.
• A service to control velocity at high communication rates.
• A service to follow a time-stamped trajectory of joint positions and velocities.
The rate at which communication occurs with the robot hardware greatly impacts the
robot's responsiveness to the environment and should be higher than the rate of the fastest
feedback loops in the execution architecture. In our case, the controller loops should be faster than
the visual servoing, environment sensing, and execution monitoring loops.
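The services listed above can be summarized as a small hardware-abstraction interface; the class and method names below are hypothetical, and a real controller (for example one exposed through ROS) provides equivalent services under different names and message types.

from abc import ABC, abstractmethod
from typing import List, Tuple

class RobotHardware(ABC):
    """Minimal controller services assumed by the execution architecture (hypothetical)."""

    @abstractmethod
    def get_joint_state(self) -> Tuple[List[float], List[float], float]:
        """Return (positions, velocities, timestamp), published at a high fixed rate."""

    @abstractmethod
    def command_velocity(self, joint_velocities: List[float]) -> None:
        """Stream velocity commands for tight feedback loops such as visual servoing."""

    @abstractmethod
    def follow_trajectory(self, samples: List[Tuple[float, List[float], List[float]]]) -> None:
        """Execute a time-stamped trajectory of (time, positions, velocities) samples."""

    @abstractmethod
    def stop(self) -> None:
        """Halt the current trajectory so a re-planned one can be started immediately."""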
The only perception sensor we examine deeply in this thesis is the camera sensor. Dense
information can be extracted from camera sensors, which allows more robust object detection
and object pose computation as compared to other sensors. Once a user specifies what robot
links the cameras are attached to, our framework can automatically calibrate the sensor
using the method in Chapter 6. We can also start computing the sensor visibility models of
objects (Section 4.7) necessary for reliable planning and execution.
Robot Kinematics
The robot kinematics should be divided into:
• a mobile base that controls the in-plane 2D translation and 1D orientation of the robot,
• a robot arm that controls the 3D translation and 3D orientation of the gripper,
• and a gripper that is used to make contact with the environment and manipulate
objects.
The mobile base is treated separately because navigation planners commonly plan for the
robot directly in the 2D workspace. Furthermore, the mobile base is responsible for moving
across the environment where there are uncertainties in the composition and shape of the
terrain. Such uncertainties mean that the mobile base can be very inaccurate in moving
to its desired goal position; therefore, it is treated as a separate component that receives
goal commands and roughly moves the robot to the destination. Depending on the mobility
method and target environments, the control inputs can become arbitrarily complex, so we
refer to the vast controls research literature on providing simple interfaces to complicated
humanoid and nonholonomic robots [Chestnutt et al (2006); Chestnutt (2007); Li and Canny
(1993); An et al (1988)].
For the robot arm, we plan directly in the joint configuration space and build the neces-
sary heuristics to speed up the planning process. Although not a requirement, the execution
performance will greatly increase for robot arms that lend themselves to easy computation of
the analytical inverse kinematics equations.¹ Most common robot arms today [Wyrobek et al
(2008); Kaneko et al (2004); Albu-Schaffer et al (2007); Barrett-Technologies (1990-present);
Exact-Dynamics-BV (1991-present)] do satisfy these properties, so the thesis concentrates
on making planning easier and faster for such kinematic structures. However, inverse kine-
matics solvers only speed up the planning process and are not required for robots with a low
number of joints such as the Katana arm [Neuronics (2001-present)].
For the gripper, the algorithms we cover are capable of handling most kinematic struc-
tures. Because we can simulate all possible gripper preshapes and directions of approach,
there are no particular restrictions on geometry or degrees of freedom. However, the entire
process will greatly speed up if the following information is specified for each gripper:
• A preferable set of approach directions for the gripper; typically close to the normal
vector of the gripper palm.
• The direction that joints should move when applying force on a grasped object. Like
the human hand, gripper mechanisms are optimized for applying forces in designated
directions and withstanding contact forces at them.
• Calibrating the robot sensors to predict the robot state in the real world.
• Finding the target objects and sensing the environment obstacles.
• Breaking down the task into robot goal configurations.
• Moving the robot throughout the environment while avoiding obstacles.
• Considering sensor visibility for better information extraction.
¹ This requires having six or more degrees of freedom.
• Recovering from errors due to perception and execution.
An execution monitor that constantly monitors the current robot state for possible ob-
stacle collisions or unexpected sensor readings is a necessary module for robust execution
[Okada et al (2006); Baker et al (2008)]. Each phase of the planning outputs a path of
robot positions that are then executed by the controller. The controller should have the
Figure 2.3: The modules forming a manipulation system that this thesis concentrates on. The
knowledge-bases and goal configuration generators are automatically generated from the task and
robot specifications.
capability to stop a trajectory at any point in time and start a new trajectory that avoids
the possible error condition. This ability allows us to implement the entire spectrum of error
recovery conditions.
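A minimal sketch of such a stop-and-replan loop, assuming hypothetical controller, planner, and world-model objects (the method names are placeholders, not any particular implementation):

import time

def monitored_execution(controller, planner, world_model, path, period=0.05):
    """Execute a trajectory while repeatedly re-validating it against fresh sensor data."""
    controller.follow_trajectory(path)
    while not controller.is_done():
        world = world_model.latest()                  # newest obstacle and object estimates
        if world.path_in_collision(path) or world.target_moved():
            controller.stop()                         # halt at the current configuration
            path = planner.replan(controller.current_configuration(), world)
            controller.follow_trajectory(path)        # restart with the corrected trajectory
        time.sleep(period)
    return path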
A fundamental module is the ability to safely move the robot around the environment
without hitting any obstacles. This requires that the robot maintain up-to-date knowledge of the
obstacles in the environment. Besides obstacles, the system also needs a module to find the
locations of all the target objects that must be manipulated by the robot. Combining this
information with sensor noise models, we can construct an accurate environment represen-
tation in simulation where the robot can geometrically start reasoning about how to avoid
obstacles and manipulate the target objects. Once a robot has found the target object, using
knowledge of physics and the object geometry, we can automatically build models of how
to stably grasp the object and manipulate it. Combining the grasping models and robot
kinematics with the task constraints defined by the problem specification, we can compute
the space of all possible robot goals and movements that will satisfy the task. Goal config-
uration selection should also consider sensor visibility so that the robot can guarantee the
quality of the target object information extracted from the sensor data.
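In the OpenRAVE implementation used throughout this thesis, this offline grasp analysis is exposed through the grasping database generator; a hedged sketch with the openravepy bindings follows. The scene and object names are examples shipped with OpenRAVE, option names may differ between versions, and the first autogenerate call can take a long time because it simulates many gripper closings.

from openravepy import Environment, databases

env = Environment()
env.Load('data/lab1.env.xml')             # example scene distributed with OpenRAVE
robot = env.GetRobots()[0]
target = env.GetKinBody('mug1')           # one of the manipulable objects in that scene

# Offline: close the gripper from many approach directions and preshapes in simulation,
# keep the force-closure grasps, and cache them as part of the planning knowledge-base.
gmodel = databases.grasping.GraspingModel(robot, target)
if not gmodel.load():
    gmodel.autogenerate()

# Online: query only the grasps that are collision-free and reachable in the current scene.
validgrasps, validindices = gmodel.computeValidGrasps(returnnum=10)
print('%d valid grasps for %s' % (len(validgrasps), target.GetName()))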
Robot movement is divided into planning for the base of the robot and planning for the
manipulators responsible for achieving contact with the target object. Base movement, also
called navigation, deals with moving the robot on a 2D map such that dynamics constraints,
real-time constraints, and collision-avoidance constraints are met [Ferguson (2006); Kuffner
et al (2003)]. Manipulator movement is slightly more complex because of its high number
of degrees of freedom, so randomized algorithms are introduced that trade-off optimality for
algorithm tractability. Once a global robot trajectory is computed from the planners, it is
sent to the low level robot controller that filters the movements through the robot dynamics
and commands the actuators. During execution of the trajectories, an execution monitor
must continually validate the robot target path with the new environment information to
guarantee the robot is still collision-free, the target object is still in the expected position,
and the goal requirements can still be satisfied. Some task motions require very careful
positioning of the robot gripper with respect to the target object. In this case, a different
class of planning algorithms is used that form a tight feedback loop with the vision sensor
measuring the object location.
• A pose recognition program taking camera images and producing a list of object loca-
tions within it, further discussed in Chapter 7.
• A map showing how the object detection relates with camera direction and distance,
discussed in Section 4.7.
The main purpose of the models we identify and build for the manipulation framework
is to make the planning and vision problems computationally tractable during runtime.
Furthermore, we show how gathering statistics about these relations can allow us to prioritize
the search spaces in the planners, and can lend themselves to optimizing the design and
placement of a workplace for industrial settings.
Figure 2.5: The levels of an execution architecture and a description of their inputs and outputs.
It is possible to view the execution process from a task-level perspective. Figure 2.6
shows a flow chart for completing a manipulation problem by defining four types of planning
situations that control the execution flow (a state-machine sketch of these phases follows the list):
1. Searching for the object. In the beginning, the robot first searches for the object
and once found, plans for a closer robot configuration where the object pose can be
better estimated. Methods can involve moving the robot randomly until an object is
found or moving to maximize information [Hollinger et al (2009); Saidi et al (2007)].
Regardless of the method, the output of the module should be a rough location of the
object.
2. Approaching the object. Once a rough location is found, the robot should plan
to move toward the object so that its sensors can observe the object unobstructed
(discussed in Chapter 5). As the robot gets closer to its target, the new environment
information could change the visibility of the target, so the planner should be called
again. This step requires knowing the properties of the object detector, the (inverse)
kinematics of the robot arm, and the location of the visual feedback sensor.
3. Grasping the object. Once a better estimate of the target object pose is achieved,
the system should start considering potential collision-free ways to grasp the object
(Section 3.3), since the robot is already close to the object and the arm is directly
looking at it. As an alternative to open-loop grasping, we present a stochastic gradient-
descent algorithm to approach the object while considering changing grasps (Section
5.3.1). The planner should simultaneously consider the possible grasps and maintain
visibility constraints necessary for continuing to track the object. If the goal position
is known at this time, each grasp can be also validated by checking for the existence
of a feasible goal robot configuration.
4. Moving the object. Once the object is successfully grasped, the planning framework
switches to maintaining any specific task constraints and moving the object to its goal
region. Task constraints can include not tipping the object [Stilman (2007)] or moving
the object along a hinge [Diankov et al (2008b)]. During robot movement, an execution
monitor validates the robot configuration with the current environment estimate.
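To make the control flow concrete, the following is a minimal sketch of how these four situations could be sequenced. The phase names and the robot methods (search_for_object, move_to_visible_configuration, and so on) are illustrative placeholders rather than the actual module interfaces of the system.

from enum import Enum, auto

class Phase(Enum):
    SEARCH = auto()
    APPROACH = auto()
    GRASP = auto()
    MOVE = auto()
    DONE = auto()

def run_manipulation_task(robot, task):
    # Cycle through the four planning situations, falling back to an earlier
    # phase whenever a step fails or the execution monitor invalidates it.
    phase, target_pose = Phase.SEARCH, None
    while phase != Phase.DONE:
        if phase == Phase.SEARCH:
            target_pose = robot.search_for_object(task.target)        # rough pose
            phase = Phase.APPROACH if target_pose is not None else Phase.SEARCH
        elif phase == Phase.APPROACH:
            ok = robot.move_to_visible_configuration(target_pose)     # Chapter 5
            target_pose = robot.reconfirm_object(task.target)         # refined pose
            phase = Phase.GRASP if ok and target_pose is not None else Phase.SEARCH
        elif phase == Phase.GRASP:
            grasp = robot.plan_and_execute_grasp(target_pose)         # Section 3.3
            phase = Phase.MOVE if grasp is not None else Phase.APPROACH
        elif phase == Phase.MOVE:
            ok = robot.move_object_to_goal(task.goal_region)          # monitored
            phase = Phase.DONE if ok else Phase.GRASP
    return True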
Each of these steps requires the perception modules to constantly run and generate a map
of the environment and target objects. In order to take a snapshot of the environment state
at a particular instance in time, each sensor needs to timestamp its data. All perception
algorithms maintain this timestamp when publishing their data out to other modules. For
every time instance, all sensor data are combined with the robot position at that time and
transformed into the world coordinate system. This allows sensors to be attached anywhere
on the robot without any coordinate system confusion. The perception modules should
constantly publish their data regardless of how it is being processed; this independence from
the system state simplifies many start and stop events and makes it much easier for users
and robots to work with. Most sensor algorithms publish at 10Hz or lower, so most of their
computation can be offloaded onto a cluster of server computers, which greatly reduces the
requirements of load balancing.
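A minimal sketch of this timestamping and world-frame transformation convention is shown below, assuming each perception module publishes its data with the sensor timestamp and that robot localization provides a pose history; the names SensorMessage and PoseHistory are illustrative, not part of the actual system.

import numpy as np
from dataclasses import dataclass

@dataclass
class SensorMessage:
    stamp: float          # seconds, attached by the sensor driver
    points: np.ndarray    # Nx3 points in the sensor coordinate system

class PoseHistory:
    """Stores timestamped 4x4 world-from-sensor transforms."""
    def __init__(self):
        self._stamps, self._poses = [], []

    def add(self, stamp, T_world_sensor):
        self._stamps.append(stamp)
        self._poses.append(T_world_sensor)

    def lookup(self, stamp):
        # nearest-neighbor lookup; a real system would interpolate poses
        i = int(np.argmin([abs(s - stamp) for s in self._stamps]))
        return self._poses[i]

def to_world(msg: SensorMessage, history: PoseHistory) -> np.ndarray:
    """Transform sensor points into the world frame at the message timestamp."""
    T = history.lookup(msg.stamp)
    pts_h = np.hstack([msg.points, np.ones((len(msg.points), 1))])
    return (T @ pts_h.T).T[:, :3]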
We rely on range data from stereo or laser range finders for sensing the environment
obstacles. The laser range data is used to accumulate a run-time obstacle map. This data
is then processed in the world coordinate system to fill holes and create collision obstacles
like boxes that are easier to use than point clouds [Rusu et al (2008)]. Processing the data
in the context of manipulation is discussed in Section 4.6.2.
Figure 2.7: Example of localization in a 3D map using a camera.
Very frequently in robotics, there is a map of the environment that the robot can use
to navigate from place to place. Sometimes generating the 3D geometry from raw sensor
data is too time consuming, so a 3D map is present that the robot can localize to. For
example, Figure 2.7 shows how a robot localizes with respect to the hand-crafted 3D kitchen
model using patterns randomly placed inside the real kitchen environment. The localization
module should publish the best expected robot position and its mean error.
For each different object type specified in the task, a pose recognition program will be
generated and will be continually running during the execution phase. Using this program,
we can gather in an offline phase all the directions from which the pose can be detected by
the camera (Section 4.7). This allows the planning process to prioritize robot configurations
that generate preferable views for the camera. In Chapter 7, we analyze object-specific
recognition programs and present several analyses for automatic detection of good features
along with a new pose extraction algorithm. During runtime, each pose recognition program
will continuously publish the detected object types, poses, and the expected error on the
extracted pose. Because there can be a lot of overlap in the low-level feature sets used for the
object detectors, we can separately run programs for each feature detector for each camera.
Navigation planning for indoor environments consists of creating an offline map with
SLAM and planning on the 2D plane considering velocities and the non-holonomic nature of
the robot base. Navigation and robot localization have been deeply studied in many other
works [Likhachev and Ferguson (2009); Ferguson (2006); Urmson et al (2008); Rybski et al
(2008)], therefore for this thesis we use already available navigation planning libraries from
Player/Stage [Gerkey et al (2001)] and ROS [Quigley et al (2009)]. On a Segway mobile base
using a standard 2D laser range finder, they can usually move the base within 5cm of the
intended goal. The robot localization is within 2cm when using range data and 0.5cm when
using vision-based localization (Figure 2.7). Even when treating navigation as a black box,
finding the correct goal to give the navigation planner so that it can handle errors of 5cm
is vital to robust mobile manipulation. Nevertheless, the target objects always have to be
re-detected after the base stops. In Chapter 5, we motivate the use of cameras attached to
the gripper for the final reconfirmation phase. Gripper cameras greatly improve the accuracy
of the entire process.
There are two main ways to achieve robust grasping, and we choose a combination of
the best of both worlds in the manipulation framework. The first way is to make the
grasps themselves as robust as possible to errors in perception. In Section 4.2 we discuss a
repeatability metric that allows us to measure if a grasp can slip or not based on simulated
noise. The second way requires careful positioning of the gripper with respect to the object
using visual servoing techniques, or otherwise the gripper might push the object out of
the way and fail the grasp [Prats et al (2008a); Jain and Kemp (2008)]. However, visual
servoing first requires the camera to view the object. In Chapter 5 we motivate the need for
sampling robot configurations that can achieve object visibility in the camera field of view
before we begin to grasp it. Our technique is just as effective as visual servoing, except
that it does not rely on fixed grasps or on maintaining visibility constraints as the robot moves
throughout the scene. In most cases the object is static in the environment, so viewing
the object once from as close as possible is more than enough to get an accurate pose estimate.
By combining the repeatability metrics in grasping with the visibility sampling, we can ensure
very reliable grasping without needing to analyze perfect sensor-less grasping algorithms or
to always maintain visibility.
Environment Complexity
We define the maximal complexity on the environment geometry and its changes.
1. The system should be able to handle most living and factory environments as long as
the obstacle regions can be confidently extracted and are well defined. Environments
where it is not possible for the robot to reach its target objects without having to
move or touch anything should not be considered. Interacting in such environments
requires the robot to have a higher level reasoning module, which is not part of the
task specification (Section 2.1.1).
2. The environment lighting should be similar to the training data given by the task
specification. The system should not have to adjust to lighting changes or unexpected
sensor data from difficult environments.
3. The robot should be able to compensate for changes in the environment within several
seconds of the change.
Sensor Uncertainty
We define the extent to which the robot should monitor and recover from execution and
calibration errors.
1. Robot should reason about target object observability and guarantee extents on pose
uncertainty.
2. The noise for sensors measuring distances to obstacles should be modeled. A robot
should treat any unexplored region as an obstacle and should give obstacles safe bound-
aries.
4. Robot should continually validate its current position with the expected goal position.
If goal position is defined with respect to an object’s coordinate system, that object
has to be continually detected.
Error Recovery
We define the extent to which a robot should be able to recover from errors. In all cases, the
robot hardware and computational framework should be fully functional at all times; we do
not expect the robot to perform introspection [Morris (2007)] and detect and recover from
internal problems.
1. If an obstacle or other object blocks another target object, wait for the target object
to become available or choose another target.
2. If an obstacle or other object blocks the robot path, restart the planning process,
locally modify the trajectory, or wait.
3. If the robot becomes stuck during a trajectory because of an obstacle collision, restart
the planning process.
4. If a target object moves before it is grasped, the robot should compensate for the change
or restart the planning process.
5. While the object is grasped, continually monitor the grasp and restart the planning
process if object is lost.
2.6 Discussion
In this chapter we covered the outline of the manipulation architecture and discussed how
all elements relate to each other and where our contributions lie within the framework. We
defined a set of robot and task specifications that represent all domain knowledge being
injected into the system. Then we presented a set of functional modules of a system that
can achieve all requirements. Unlike other systems, we specifically make the goal generation
process a module. It analyzes the scene and determines where a robot should go in order
to achieve its higher level goals. In previous systems, most of these goals came from a
person manually specifying handles or other workspace parametrizations. We also presented
a robot component-based view of the manipulation system. Each component like the robot
arm serves a particular purpose in the plan. Relations between these components allow us
to formalize what information a particular task relies on. This analysis helps us define the
planning knowledge-base in Chapter 4 that organizes and tracks all possible offline relations
between the robot components we defined.
We also discussed the rough execution outline and the steps the robot takes in order
to complete the manipulation plans. Each of these high level steps are decomposed into
a set of primitive planner calls whose theory is covered in Chapter 3. We discussed many
of the practical issues encountered when executing the manipulation pipeline. Through
this discussion, we revealed the requirements for each of the components and discussed the
trade-offs that have to be made in order to get them to work.
We closed this section with several general guidelines of robot autonomy. For many
different research groups, the definition of an autonomous robot differs, which causes a lot of
confusion when comparing results between frameworks. The guidelines on autonomy define
the basic assumptions about the scene and environment; all algorithms developed to support
this thesis are designed with them in mind. Eventually, we hope that such simple prerequisites
can serve as the basis for a formal definition of levels of autonomy. Such a taxonomy will
greatly help in comparing robotics architectures. However, it will only become possible once
the process of combining all components of a robot system becomes as simple as installing
an operating system and setting up a computer network on it. But even before such a time
comes, it is necessary to understand all the elements at the most basic and fundamental
level.
Chapter 3
Motion planning is responsible for finding a path in the robot configuration space from an
initial configuration to a goal-satisfying configuration while maintaining constraints on the
search space. The primary contributions of this chapter are to introduce a set of planning
algorithms specific to manipulation planning and define how a planning knowledge database
can be used within the framework of these planners. We divide a planner into two compo-
nents: a generator for feasible, goal-satisfying robot configurations, and a search algorithm
that explores the feasible space and composes the robot path. Even though the space of all
search algorithms is infinite, we only concentrate on the Rapidly Exploring Random Trees
[LaValle and Kuffner (2001); Kuffner et al (2003); Oriolo et al (2004)] and A* [Ferguson
and Stentz (2004, 2005); Diankov and Kuffner (2007)]. RRTs have been shown to explore
the feasible solution space quickly, so they are ideal for high dimensional search spaces like
manipulator arms. On the other hand, A* algorithms give more power to heuristics to guide
the search and are used for problems whose state space is much smaller. In order to spe-
cialize a planner to the task and robot specification, we present modifications to the goal
space definitions and state space samplers. The driving principle in the planning framework
is that giving a search algorithm informed configurations that get the robot closer to its
goal is a much more important capability than being able to efficiently search through many
uninformed configurations.
In this chapter we define the planning framework and specifically delve into the search
components of planning algorithms. The search components are the static code that stays the
same throughout different tasks and robots. We start with the overall algorithmic structure
and show the role of the robot knowledge database as discussed in Chapter 4. Furthermore,
we cover the general case of mobile manipulation when considering the complexities of grasp-
ing and visibility goals. A pattern that becomes quickly apparent in the planning algorithms
Figure 3.1: Shows the connection between the robot databases when used for manipulation and
grasp planning. The most involved and important process is sampling the goal configurations while
taking into account all constraints and task priorities.
we cover is that the way goals and heuristics are defined and used can have a bigger effect on
the planning process than search algorithms themselves. In fact, all task-specific information
is encoded in the databases, so it would be natural to expect databases to take a lead role
in influencing the search.
In order to make the most informed decisions, the following elements should be considered
in the manipulation planning process:
1. Reachability - Stores a density map measuring how close the gripper is to a position
that the robot arm can actually move to (Section 4.3).
2. Grasp Ranking - Ranks grasps depending on how probable they are to be reached by
the robot. The ranking involves evaluating the environment around the object (Figure
3.2) along with using the reachability density map.
3. Target at Goal Scene - If the task involves placing the object in a particular location,
then we must ensure that the gripper can actually fit in this location. By having a
second scene with the predicted object goal, we can validate whether a grasp is achievable.
4. Grasp Validation - Once a grasp is picked, it should be checked for collisions and
consistency of task constraints. Checking if any links on the gripper collide with the
environment is the quickest check. Next we perform the grasp strategy and check to
make sure that the gripper only collides with the target object. If the gripper collides
with anything else, we cannot guarantee the object is grasped as it was predicted, so
the grasp should be rejected.
6. Final Robot Configuration - Base placements are sampled according to their rank-
ing and we check for the existence of collision-free inverse kinematics solutions.
The configuration space is decomposed into base, arm, gripper, and target components
{qbase, qarm, qgripper, qtarget}, where the gripper is responsible for interacting with the object,
the arm is responsible for precisely moving the gripper to its desired orientation, and the base
is responsible for moving the arm and gripper around the environment (Figure 3.3). For robots
moving on a flat floor, qbase is 3 DOF and represents the translation and in-plane rotation. If
the target is attached to the environment and has a door, then qtarget is the angle of the door;
otherwise, qtarget represents the workspace of the target. qarm and qgripper are the joint values.
Usually the gripper is treated as an independent articulated body, whose transformation is
referred to as the end-effector of the arm. Even before defining the goal of the plan, the
decomposition of the space immediately raises questions about the relationships between these
components:
For example, the forward kinematics relationship between the arm and gripper is clear,
but it is common for planners to use inverse relationships: given the gripper position, com-
pute the configuration qarm placing the gripper there.
If planning for mobile manipulation, then the question can turn out to involve all com-
ponents: given a target location qtarget , what are all possible configurations qbase such that
the target is graspable.
Figure 3.4: Grasp planning involving just the arm and gripper can sometimes require both the
grasp and release goals to be considered before a grasp is chosen.
Because the search space is very big, it becomes very difficult to guarantee an optimal
solution; therefore samplers, heuristics, and hierarchical subdivision of the problem have
great influence over the quality and computability of a solution. For example when finding
a feasible path to a graspable location, should the robot base be sampled first, which affects
the grasps that can be achieved, or should the grasp be sampled first, which affects the
base placements? If the goal is also to place the grabbed target to a specified location
as shown in Figure 3.4, then it becomes necessary to consider the feasibility of the path
and goal configuration of the release process so that a robot doesn’t choose a bad grasp
to begin with only to discover that it cannot achieve its final goal. Instead of delving into
the search complexities of planning horizons as in [Stilman et al (2007b)], we perform most
of the computation inside the goal-configuration samplers, which use the inter-relations of
the configuration space to make the most informed decisions. The theory and generation of
these relations are covered in the robot knowledge database.
It is also possible to fit dynamics and time-sensitive information into this definition [Harada
et al (2008); Kuffner et al (2003); Hauser et al (2008); Stilman et al (2007a); Berenson et al
(2009a)].
Although several works have shown the possibility of planning without sampling explicit
goal configurations [Bertram et al (2006); Diankov and Kuffner (2007)] by using gradients
and cost functions, the planning process becomes much slower, making it a worthwhile
effort to develop explicit goal configuration generators. Traditionally, planning algorithms
have assumed Cgoal is either one specific goal configuration, or a set of configurations with a
distance threshold defining how close to the goal a robot should get in order to terminate
with success. However, many researchers cite the possibility of Cgoal being so complex that
the only way to accurately use it in a planner is by random sampling. For example, the goal
of manipulation planning is to have the gripper interact with the target object, meaning the
goals are specified in the workspace of the gripper and not directly in the configuration space
being searched. It is this type of discrepancy that motivates a periodic random sampling
of goals during the search process.
RRTConnect* (Algorithm 3.1) shows the modifications to the RRT-Connect algorithm
[Kuffner and LaValle (2000)] that samples goals. The algorithm maintains a tree of configu-
rations, with edges representing the ability for the robot to move from one configuration to
the other. In every step the RRT randomly grows trees simultaneously from both the initial
and goal configurations by first sampling a random configuration and then extending both
trees to it using a step size ε. After every extension operation, the planner checks if the two
trees can be connected.
Algorithm 3.2: SampleWithConstraints(C)
1 while true do
2 q ← ProjectConstraints(fC ,Sample(C))
3 if q ∈ C then
4 return q
5 end
• A sampler on the feasible configuration space Sample(Cfree ). The free space describes
where the system can safely go, but it doesn’t provide any information about its neigh-
boring locations. By defining a state space cost function, we can use it to prioritize
sampling of regions that will most likely contribute to finding the path.
• A distance metric δ(q0 , q1 ) for C used to methodically discretize and step through the
configuration space. Distance metrics are commonly used for two purposes: finding
nearest neighbors and finding the appropriate step resolution for RRT extensions and
line collision checking [Cameron (1985); Schwarzer et al (2003)]. We present an auto-
mated way of setting distance metrics in Section 4.6.3.
• A sampler on the goal space Sample(Cgoal ). Defining goals usually involves the
workspace and the reachability properties of the robot. We present a generalized
manipulation sampler in Section 3.1 involving both grasping locations and the robot’s
arm and base.
• A parameterized constraint function fC that allows the robot to continuously move
from one configuration to another. A ProjectConstraints(fC ,q) function modifies
the input configuration to the closest configuration that meets all the constraints.
Of the four, we primarily concentrate on goal samplers Sample(Cgoal ) for the target task
types described in Section 2.4.
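As a concrete illustration, the four components can be bundled into a single planner interface; the names below are illustrative and simply mirror the functions referenced in Algorithm 3.2, not an actual API of the system.

import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Config = Sequence[float]

@dataclass
class PlannerComponents:
    sample_free: Callable[[], Config]               # Sample(C_free)
    distance: Callable[[Config, Config], float]     # delta(q0, q1)
    sample_goal: Callable[[], Optional[Config]]     # Sample(C_goal)
    project: Callable[[Config], Optional[Config]]   # ProjectConstraints(f_C, q)
    is_valid: Callable[[Config], bool]              # membership test for C_free

def sample_with_constraints(c: PlannerComponents, max_tries: int = 1000):
    """Rejection-sample a constraint-satisfying free configuration (Algorithm 3.2)."""
    for _ in range(max_tries):
        q = c.project(c.sample_free())
        if q is not None and c.is_valid(q):
            return q
    return None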
We represent the space of all stable grasps as G, which includes the 6D pose of the
end-effector in SE(3), the grasp preshape, and the approach direction. From that space,
we extract a discretized ordered set Gstable that represents all stable grasps of the object.
Gstable is used for finding goal configurations of the end effector and only concerns itself with
how the gripper links interact with the target object. At any target configuration qtarget , we
denote the grasp set in the world frame by Tqtarget Gstable.
Algorithm 3.4: GraspValidator(T^world_gripper, q, v^world, δ)
 1 SetConfiguration(gripper, q)
   /* Check collision with gripper along negative direction. */
 2 GripperTransforms ← ∅
 3 for d ∈ [δ, δ + ε] do
 4   T ← [ I_3x3  −d·v^world ; 0  1 ] · T^world_gripper
 5   SetTransform(gripper, T)
 6   if CheckCollision(gripper) then
 7     return ∅
 8   Add(GripperTransforms, T)
 9 end
10 SetTransform(gripper, T^world_gripper)
11 qnew ← GraspStrategy()
12 SetConfiguration(gripper, qnew)
13 if AllCollisions(gripper) ⊆ {target} then
14   return GripperTransforms
15 return ∅
grasp strategy GraspStrategy(qgripper)), and it needs to have a small amount of room to
maneuver along its direction of approach. Figure 3.5 shows examples of collision-free grasps
that are not valid and must all be pruned.
All grasps in Gstable are validated using the GraspValidator function as described in
GraspValidator (Algorithm 3.4). First the gripper is set to its preshape and checked for
environment collisions as it moves along its negative direction of approach. The step ε is
proportional to the error expected on the object's pose and the robot execution, and is
usually set between 1cm and 2cm. Then the gripper is set
to the final transformation and the grasp strategy is executed in simulation. Because the
grasp strategy guarantees stability only if the gripper’s contacts are from the target object,
we check that all collisions with the final gripper configuration are only with the target
using AllCollisions and the space inclusion operator ⊆. It is possible to just check the
world
gripper collisions since we know its joint values qnew and its workspace transform Tgripper .
On success, GraspValidator returns the set of gripper transformations that are collision
free. When planning for just the arm, these transforms are later checked for the existence of
inverse kinematics solutions.
In order to maintain clarity, we assume that every function presented saves the state of
all objects it interacts with and restores it upon returning. The state in this case is each
link’s position, orientation, and collision-enabled values.
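A minimal sketch of this save-and-restore convention using a Python context manager; the getter and setter methods on the bodies are assumptions for illustration, not part of any specific API.

from contextlib import contextmanager

@contextmanager
def saved_state(*bodies):
    """Save each body's link poses and collision-enabled flags, restore them on exit."""
    saved = [(b, b.get_link_transforms(), b.get_collision_flags()) for b in bodies]
    try:
        yield
    finally:
        for body, transforms, flags in saved:
            body.set_link_transforms(transforms)
            body.set_collision_flags(flags)

# usage inside a validator:
#   with saved_state(gripper, target):
#       ...move the gripper, run the grasp strategy, check collisions...
#   # state is automatically restored here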
The simplest form of grasp planning is when the configuration space is {qarm , qgripper },
and inverse kinematics can be directly applied to find the goal configurations. The first
goal sampling algorithm using grasp sets is presented in GoalSampler Grasps Arm
(Algorithm 3.5). It first samples a grasp from Gstable and validates it with the current
environment. The set of transforms returned by GraspValidator are all checked for
inverse kinematics solutions and that the entire robot configuration is in the free space. If
there is a grasp transform that does not have a solution, then the robot will not be able to
complete the task and the grasp is skipped. Otherwise, the transform δ + ε distance away
from the original grasp is taken as the goal to feed to the RRT planner. This distance is
extremely important in making the planning robust to errors when executing on a real robot.
Given a target end-effector transformation T, we define the set of robot configurations qarm
that achieve it with IK(T) = { qarm | FK(qarm) = T }, where FK(qarm) is the forward
kinematics of the robot configuration. For each inverse kinematics solution produced by
IK(T), we check that it lies in the free space and return the
arm and gripper configuration to the planner. The next time GoalSampler Grasps Arm
is called, it should resume from the point it left off instead of starting from the same grasp
again. In practice, this programming construct can be easily implemented with coroutines
[Knuth (1973)] or Python generators without changing the function.
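For illustration, a goal sampler written as a Python generator resumes exactly where the previous call stopped; grasp_validator, ik_solver, and is_free below are stand-ins for GraspValidator, IK, and the free-space check, and the retreat transforms are assumed to be ordered by increasing distance as in Algorithm 3.4.

def goal_sampler_grasps_arm(grasp_set, grasp_validator, ik_solver, is_free):
    for grasp in grasp_set:                      # ordered set G_stable
        transforms = grasp_validator(grasp)      # Algorithm 3.4
        if not transforms:
            continue
        # require IK solutions for every retreat transform; otherwise skip the grasp
        solutions = [ik_solver(T) for T in transforms]
        if any(len(s) == 0 for s in solutions):
            continue
        # yield configurations at the delta + epsilon retreat transform (the last one)
        for q_arm in solutions[-1]:
            if is_free(q_arm):
                yield q_arm, grasp

# the planner simply pulls goals as needed and resumes later:
#   sampler = goal_sampler_grasps_arm(G_stable, validator, ik, in_free_space)
#   goal = next(sampler, None)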
Using the environment in Figure 3.4, the size of the Barrett Hand/mug grasp set is 341.
The time it takes to sample one valid goal configuration is 0.04 ± 0.01s. It is actually very
Figure 3.6: Divides the grasps that pass GraspValidator into the set of reachable (has inverse
kinematics solutions) and unreachable grasps for one target.
common for the GraspValidator to return grasps that are completely unreachable by the
robot. Figure 3.6 shows an object that is slightly far away from the robot, of the 120 grasps
that are accepted by the GraspValidator, only 11% have inverse kinematics solutions.
The entire planning phase to the initial pickup is 0.7 ± 0.1s.
As was mentioned previously, it is also necessary to consider the possible destinations
of the target during the grasp selection process. Every possible target destination has to
be checked for possible collisions with the target object, gripper, and arm. GraspValida-
tor WithDestinations (Algorithm 3.6) shows the entire grasp validation process. The
destination check requires the target to be moved temporarily to that location in order
to accurately simulate the expected collisions. Collisions with the robot are ignored since
Algorithm 3.6: GraspValidator WithDestinations(target, T^world_gripper, q, v^world, δ)
 1 GripperTransforms ← GraspValidator(T^world_gripper, q, v^world, δ)
 2 if GripperTransforms ≠ ∅ then
 3   for Ttarget,new ∈ TargetDestinations do
 4     SetTransform(target, Ttarget,new)
 5     if AllCollisions(target) ⊆ {robot} then
 6       T^world_gripper,new ← Ttarget,new (T^world_target)^−1 T^world_gripper
 7       if ∃ qarm s.t. qarm ∈ IK(T^world_gripper,new) ∧ (qarm, q) ∈ Cfree then
         /* IK exists and target is collision-free. */
 8         return GripperTransforms
 9 end
10 return ∅
Figure 3.7: Shows all the valid, reachable grasps that can be sampled from simultaneously. The
grasp set size is 341 and there are 6 target objects for a total of 136 valid grasps.
the correct robot configuration is not set at that point. It should be noted that the inverse
kinematics check is only valid if just planning for the arm. When planning for the base, it
becomes very difficult to consider robot reachability, so these types of checks are handled
elsewhere as discussed in Section 3.6. Using the environment in Figure 3.4 where there are
102 potential target destinations, the average time to sample a configuration is 0.2 ± 0.1s.
The goal sampler itself can easily consider multiple targets simultaneously by creating a
super-grasp set as shown in Figure 3.7.
Even with such a straightforward goal sampler, we can immediately start seeing potential
bottlenecks that could slow down planning. For example, the set of grasps might be too big
(more than 1000 grasps), so we need to prioritize or sample them well so that valid grasps are
tested first. We have to be careful that the prioritization function doesn’t require a lot of
Figure 3.8: Several robot arms performing grasp planning.
computation. Another common problem is that the GraspValidator does not consider
inverse kinematics, but some grasps are obviously not reachable as was shown in Figure 3.6.
In [Berenson et al (2007)], we used an ad hoc method to determine if a grasp is reachable,
however a much quicker and more formal method is using the reachability space of the robot
to prune grasps (Section 4.3).
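A rough sketch of such reachability-based pruning is given below, assuming a precomputed reachability map (Section 4.3) that returns a density score for an end-effector pose expressed in the robot base frame; the map interface and threshold are illustrative.

import numpy as np

def prune_grasps_by_reachability(grasps_world, T_world_base, reachability_map,
                                 min_density=0.1):
    """Keep only grasps whose gripper pose falls in a well-reachable region."""
    T_base_world = np.linalg.inv(T_world_base)
    ranked = []
    for T_world_gripper in grasps_world:
        T_base_gripper = T_base_world @ T_world_gripper
        score = reachability_map.density(T_base_gripper)   # assumed interface
        if score >= min_density:
            ranked.append((score, T_world_gripper))
    # test the most reachable grasps first
    ranked.sort(key=lambda sg: -sg[0])
    return [g for _, g in ranked]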
Figure 3.8 shows other robots performing the same grasp planning task with all param-
eters auto-generated from the robot knowledge-base.
the robot by not having to enforce rigid constraints between the end-effector and the target
object. We illustrate our approach with experimental results and examples running on two
robot platforms.
The expanded range of motion comes at the cost of algorithmic complexity. In the
absence of a rigid grasp, care must be taken to ensure that the object does not slip out of
the robot end-effector. Of greater concern is the fact that there no longer exists a one-to-one
mapping from robot motion to object motion: since the object is loosely caged, there can
exist end-effector motions that produce no object motion, and object motion without explicit
end-effector motion.
One of the relaxed planning assumptions is that the end-effector of any configuration of
the robot always lies within the grasp set G with respect to the task frame. Because this
couples the motion of both the object and the robot during manipulation, their configurations
need to be considered simultaneously. Therefore, we define the relaxed configuration space
C as the combined object and robot space C = {qtarget, qarm, qgripper}.
In this case, Cfree includes (qtarget, qarm, qgripper). Given these definitions, the relaxed task
constraint problem becomes:
Given start and goal configurations qtarget^start and qtarget^goal of the object, compute a
continuous path {qtarget(s), qarm(s)}, s ∈ [0, 1] such that

qtarget(0) = qtarget^start    (3.6)
qtarget(1) = qtarget^goal    (3.7)
{qtarget(s), qarm(s)} ∈ Cfree    (3.8)
FK(q(1)) ∈ Tqtarget(1) Gcontact    (3.9)
Equation 3.6 and Equation 3.7 ensure that the target object’s path starts and ends
at the desired configurations. Equation 3.8 forms the crux of the relaxed task constraint
planning problem. Because the caging criteria dictates that each grasp be in the grasp set
G, Equation 3.8 ensures that any robot configuration q(s) produces a grasp F K(q(s)) that
lies in the world transformed grasp set Tqtarget (s) G. Equation 3.9 constrains the final grasp
to be within a contact grasp set Gcontact ⊆ G. This set is formally defined in Section 4.2.
Informally, any grasp in this set is in contact with the object and guarantees that the object
will not move away from the goal.
While the above equations describe the geometry of the problem, we make several as-
sumptions about the physics of the problem. These assumptions constrain the automatically
generated grasps we use for planning as well as the motion of the robot and object during
manipulation. Our analysis is purely quasi-static. The robot moves slowly enough that its
dynamics are negligible. Furthermore, we assume that the object's motion is quasi-static as
well. This can be achieved in practice by adding a dashpot to the hinges, damping their
motion, or by a sufficient amount of friction in the case of an object being dragged across a
surface. We also assume that we have access to a compliant controller on the robot. Under
this assumption, we are guaranteed that the robot will not jam or exert very large forces on
the object being manipulated. We discuss the generation of the caging grasp set in Section
4.2.2.
In the following sections we discuss the planning algorithms and robot results assuming
a grasp set has been generated. We describe two planning algorithms to solve the relaxed
constraint problem: a discretized version and a randomized version. The randomized
algorithm is more flexible and makes fewer assumptions about the problem statement;
however, the discretized algorithm is simpler to implement and useful for explaining the
concepts behind relaxed planning (as well as the motivation for a randomized algorithm).
The underlying assumption of the discrete formulation is that a desired path of the target
object is specified. Specifying the path in the object’s configuration space as an input to
the planner is trivial for highly constrained objects like doors, handles, cabinets, and levers.
The configuration space of these objects is one dimensional, so specifying a path from a to
b is easily done by discretizing that path into n points. In the more general case where an
object’s configuration space can be more complex, we denote its desired path as {qtarget,i }|n1
where each of the configurations qtarget,i has to be visited by the object in that order.
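For a one-dimensional object such as a door, the desired path is simply a uniformly discretized interval; a trivial sketch of producing the {qtarget,i} input follows.

import numpy as np

def discretize_object_path(a, b, n):
    """Return n object configurations visiting the interval [a, b] in order."""
    return list(np.linspace(a, b, n))

# e.g. opening a door from 0 to 1.2 radians in 25 steps:
# q_target = discretize_object_path(0.0, 1.2, 25)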
The discrete relaxed constrained problem is then stated as: given a discretized object
configuration space path {qtarget,i }|n1 , find a corresponding robot configuration space path
Algorithm 3.7: Q ← DiscreteSearch()
 1 for i = 1 to n − 1 do
 2   Gi ← Tqtarget,i G
 3   for g ∈ Gi do
 4     if (IKi,g ← IK(g)) ≠ ∅ then
 5       break
 6     Gi.remove(g)
 7   end
 8   if Gi = ∅ then
 9     return ∅
10 end
11 for g ∈ Tqtarget,n Gcontact do
12   for qarm ∈ IK(g) do
13     Qnext ← DiscreteDepthFirst(qarm, n − 1)
14     if Qnext ≠ ∅ then
15       return { Qnext, qarm }
16   end
17 end
18 return ∅
where Equation 3.10 constrains the end-effector to lie in the current
grasp set defined for the object and Equation 3.11 guarantees the final grasp is in contact.
To satisfy the continuity constraint on the robot configuration space path, Equation 3.12 and
Equation 3.13 ensure that adjacent robot and grasp configurations are close to each other.
A straightforward discrete planning approach to solve this problem is provided in Dis-
creteSearch (Algorithm 3.7). We begin by first running a feasibility test through the
entire object trajectory. This step is also used to initialize the grasp and kinematics struc-
tures used for caching. We assume an inverse kinematics solver is present for every arm.
Furthermore, if the arm is redundant the solver will return all solutions within a discretiza-
Figure 3.10: The basic framework used for planning discrete paths {qi }|n1 in robot configuration
space to satisfy paths {qtarget,i }|n1 in object configuration space.
tion level. We compute the set of contact grasps that will keep the object in form-closure
at its desired final destination qtarget,n (line 3.7). For each grasp in this set we compute IK
solutions for the complete configuration of the robot, and for each IK solution we attempt to
plan a path through configuration space that tracks the object path {qtarget,i } using depth
first search (line 3.7).
Figure 3.10 provides a diagram of the discrete search framework. Given an object path {qtarget,i }|n1
we search for a robot path {qi }|n1 that consists of a sequence of robot configurations qi , 1 ≤
i < n such that F K(qi ) ∈ Tqtarget,i G and F K(qn ) ∈ Tqtarget,n Gcontact . Each of these configu-
rations qi is generated as an IK solution from one of the grasps in the grasp set Tqtarget,i G.
The depth first search process takes a robot configuration at a time step j and calculates
all the robot configurations that correspond to valid grasps at time j − 1 (i.e. are members
of set Tqtarget,j−1 G), then recursively processes each of these configurations until a solution is
found.
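A sketch of the depth-first recursion described above; ik_for_grasps(j) stands in for the cached IK solutions of the grasps in Tqtarget,j G, and close_enough encodes the continuity constraint between adjacent configurations. Both names are illustrative.

def discrete_depth_first(q_next, j, ik_for_grasps, close_enough):
    """Find configurations for steps 1..j that connect to q_next, the step j+1 configuration."""
    if j == 0:
        return []
    for q in ik_for_grasps(j):          # configurations grasping at step j
        if not close_enough(q, q_next):
            continue                    # continuity constraint violated
        prefix = discrete_depth_first(q, j - 1, ik_for_grasps, close_enough)
        if prefix is not None:
            return prefix + [q]
    return None

# used as in Algorithm 3.7:
#   Q_next = discrete_depth_first(q_arm, n - 1, ik_for_grasps, close_enough)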
The discrete algorithm has two limitations. First, as the degrees of freedom of the robot
increase or a mobile base is considered, the discretization required for IK(g) to reasonably
fill the null space grows exponentially. Second, the desired object trajectory is fixed, which
eliminates the possibility of moving the door in one direction and then another to accomplish
the task (see [Stilman et al (2007a)] for an example where this is required).
To overcome these limitations, we also applied a randomized planner to the problem.
We chose the Randomized A* algorithm [Diankov and Kuffner (2007)], which operates in a
similar fashion to A* except that it generates a random set of actions from each state visited
instead of using a fixed set. Randomized A* is well suited to our current problem because it
can use the target object distance to goal as a heuristic to focus its search, it can guarantee
each state is visited at most once, it does not need to generate all the IK solutions for a given
grasp, and it can return failure when no solution is possible. The key difference between
Randomized A* and regular A* is the sampling function used to generate neighbors during
the search. For our relaxed constraints problem the task of this sampling function is to select
a random configuration (qtarget,new , qnew ) and a random grasp gnew ∈ Tqtarget,new G such that
qnew ∈ IK(gnew ). Ideally, this should be done efficiently without wasting time considering
samples previously rejected for the same configuration. The A* criteria will ensure that
the same configuration isn’t re-visited and that there is progress made towards the goal, so
the sampling function needs only return a random configuration in Cfree around the current
configuration (qtarget , qarm ) as fast as possible.
SampleNN (Algorithm 3.8) provides the implementation of the sample function. It first
samples a target object configuration qobject,new close to the current configuration qtarget (line
Figure 3.11: Comparison of fixed feasibility regions (left) and relaxed feasibility regions (right)
for the Manus and Puma robot arms.
3), then searches for feasible grasps from the new grasp set Tqobject,new G0 (line 6), and then
samples a collision-free IK solution close to q (line 8). In order to guarantee we sample
the entire space, RandomCloseConfig should discretize the sampling space of the target
configuration so that the number of distinct qobject,new that are produced is small. This is
necessary to ensure that sampling without replacement is efficient. Each time a sample is
chosen (line 6), it is removed from Gqobject,new so it is never considered again, an operation
that takes constant time. If the target is close to its goal then G0 is the contact grasp set
Gcontact , otherwise G0 is the regular grasp set G. Once a grasp is found, SampleIK samples
the null space of the IK solver around q until a collision-free solution is found. If not, the
entire process repeats again. If all grasps are exhausted for a particular target configuration,
the sampler checks for termination conditions and returns false (line 9).
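A sketch of this neighbor sampler is given below; random_close_config, sample_ik_near, and the per-configuration grasp pools are illustrative stand-ins for the functions referenced above. The swap-remove makes sampling without replacement a constant-time operation.

import random

def sample_nn(q_target, q_arm, grasp_pools, random_close_config, sample_ik_near):
    # grasp_pools is keyed by the discretized (hashable) target configuration and
    # holds the grasps that have not yet been tried from that configuration.
    while True:
        q_target_new = random_close_config(q_target)   # discretized neighborhood
        pool = grasp_pools[q_target_new]
        if not pool:
            return None                                # grasps exhausted here
        i = random.randrange(len(pool))
        pool[i], pool[-1] = pool[-1], pool[i]
        g = pool.pop()                                 # without replacement, O(1)
        q_arm_new = sample_ik_near(g, q_arm)           # collision-free IK near q_arm
        if q_arm_new is not None:
            return q_target_new, q_arm_new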
Table 3.1: Statistics for the scenes tested showing average planning times (in seconds) and size
of the grasp sets used.
task constraints. The results shown in Figure 3.11 indicate that relaxing task constraints
through caging grasps provides a much greater motion envelope for the robot as well as
versatility in base placement. This expanded range of allowable motions of the robot directly
results in: 1) improvements in the efficiency and the success rate of planning for a variety of
constrained tasks; 2) greater success in executing the desired motion and achieving the final
object goal state.
                   Discrete   Randomized
  6DOF Manus Arm    441%       503%
  6DOF Puma Arm     130%       126%

Table 3.2: Increase in Feasibility Space when using relaxed planning compared to fixed-grasp
planning.
In each scene, the robot is randomly positioned and oriented on the floor, and then the
planners are executed. Thousands of random positions are tested in each scene to calculate
average running times (Table 3.1). The parameters for the randomized algorithm stayed
the same across all robots. To show that relaxed grasp sets really do increase the regions
the arm can achieve its task from, Figure 3.11 shows the feasibility regions produced with
the relaxed grasp set method and the fixed grasp method. The fixed grasp method uses a
single task-frame grasp throughout the entire search process. To make things fair, we try
every grasp in G before declaring that the fixed grasp method fails. Table 3.2 shows how
many times the feasibility region increased for the relaxed methods compared to the fixed
method. As expected, the lower dimensional manipulators benefit greatly from relaxed task
constraints. Furthermore, the door can be opened much further using the relaxed approach
than with the fixed grasp method.
Figure 3.12 shows experiments with several robots in simulation to show the generality of
the algorithm. Each grasp set was trained in a matter of minutes. Along with the analytic
Figure 3.12: Example simulations using the relaxed grasp-set planning formulation.
inverse kinematics solvers presented in Section 4.1, we can immediately get each of these
robots to open doors.
Figure 3.14 shows the grasp planning and caging grasp formulations combined to imple-
ment a more complex task. The task is to put a cup from the table on the right into the
cupboard on the upper left corner. The Barrett WAM first opens the cabinet to a specified
angle and then plans to grasp the cup while taking into account the geometric constraints of
the cup’s destination. Once the cup is placed inside the cupboard, the robot releases it while
making sure its fingers do not collide with the obstacles. Then it plans to a safe position to
move to its target preshape, and then closes the door.
Finally, it is possible to consider multiple preshapes and disjoint grasp sets to allow the
robot to change grasping strategies. For example, consider the top example in Figure 3.13
where a mobile robot is attempting to open a refrigerator as wide as possible. Because the
door handle makes a wide arc, it is impossible for the robot to open it all the way with just
a hook grasp. Therefore, we add push and pull grasps to the caging grasp definition: instead
of checking for caging in both directions, pushing and pulling requires that the object is
constrained in only one direction. The algorithm using disjoint sets is simple. First, we
randomly choose a disjoint grasp set and attempt to pull the door as wide as possible. When
Figure 3.13: WAM arm mounted on a mobile base autonomously opening a cupboard and a
fridge.
Figure 3.14: WAM arm autonomously opening a cupboard, putting in a cup, and closing it. Wall
clock times from start of planning are shown in each frame.
the door cannot be pulled any further and it still isn’t at its desired configuration, then we
choose another disjoint set and continue.
Figure 3.13 shows three different tasks that have become possible with pushing/pulling
and disjoint sets. Mobile bases usually have a ±5 cm accuracy in getting to their goal des-
tination; therefore, it becomes necessary to use the feasibility maps in Figure 3.11 to aim
the base to a dense probability region to avoid planning failures. Being able to explicitly
consider the feasibility regions can greatly help in designing the robot workspace for indus-
trial situations. Furthermore, allowing grasps to change while planning for tasks that do not
require great precision can greatly increase the free space of the task. This increased space
can be used to allow robots with a lower number of joints to complete the same tasks as more
complex and expensive robots.
The free-joints problem is unique in that the robot has a small number of joints qfixed
that need to safely move along a desired path τfixed(t), but collisions can prevent them from
moving there. In order to solve it, we treat the rest of the joints as free and the problem
reduces to finding a robot configuration of the free joints in which the fixed joints can safely
move along τfixed(t). This problem is also applicable to robot joints not directly in qarm and
qgripper, like moving the torso joints in humanoid robots (Figure 3.16). The search itself,
given in RRTExplore (Algorithm 3.9), just expands an RRT tree and checks if τfixed(t) is
achievable after every extension operation.
The more interesting part is in the termination condition function given by Check-
GoalCondition (Algorithm 3.10). The fixed joints are moved through all points in the
Figure 3.15: Shows a gripper whose fingers need to open, but cannot due to the table. Planning
by setting the free joints to the arm can move the gripper to a safe location to open its fingers.
Figure 3.16: The planner is also applied to moving the torso joints of the robot such that the
arms do not hit anything.
desired trajectory and the combined configuration with qfree is used to check for collisions
and other constraints. We would also like to guarantee that the links attached to the fixed
joints are not close to any obstacles at their goal configuration given by τfixed(1). The links
attached to the fixed joints are randomly moved by a small perturbation ∆ that defines a
ball around the workspace in which the fixed links should be collision-free. The attached
links for the fixed joints in Figure 3.15 are all the gripper links. However, the attached links for
the fixed joints in Figure 3.16 are all the upper-body links since the torso joints affect all
of them. Therefore, GetDependentLinks returns all the rigidly attached and dependent
links of the fixed joints.
As will be shown in later sections, we frequently extract dependent sub-parts of the robot
to make computations faster. Because the fixed joints are usually very few, we set τfixed(t)
as the linear segment that joins the initial configuration and the destination configuration;
in other words, no planners are involved in determining the fixed joint path. Using this
approximation, the total algorithm runtime is on the order of 0.1 s for the example in
Figure 3.15 and 1.0 s for the example in Figure 3.16. The time greatly depends on the correct
discretization levels and scene complexity, which is why the humanoid case is slower.
takes in the desired arm to move and the transformed world grasp set, and returns a base
placement sample along with all information about the grasp. The returned grasp is vali-
dated using GraspValidator (Algorithm 3.4) and therefore the gripper transforms are also
returned. In this section we cover its usage in the context of sampling base placements for
planners.
GoalSampler BasePlacement (Algorithm 3.11) shows the final goal sampler includ-
ing base placement. The information used for sampling is: the poses of all the target objects,
the set of arms for grasping, a set of grasps for each target/arm pair, the inverse reachability
of each arm, and the obstacles in the environment. An arm is first sampled and all the grasp
Figure 3.18: Several configurations returned from GoalSampler BasePlacement considering
both robot arms, the humanoid’s torso, and multiple objects.
sets belonging to the gripper of that arm are transformed into the world and stored into
G. The reachability sampler lazily calls GraspValidator before committing to a certain
grasp; therefore the returned grasp information is valid in the current environment up to the
gripper links. These grasps are then passed into the grasp reachability sampler, which re-
turns a potential base placement along with a chosen grasp. However, grasp reachability does
not encode obstacles in the environment, so any base placement it returns has to be validated
for collisions and inverse kinematics solutions. As an early pruning step, all links that are
dependent on the base, but not the arms are extracted with GetIndependentLinks(base)
and tested for collisions with the current environment. If links are not in collision, we start
iterating across all inverse kinematics solutions until a full base-arm-gripper configuration
inside Cfree is found.
Figure 3.18 shows the results of the goal sampler on a kitchen environment with a hu-
manoid robot and four target objects. The average time to produce one full configuration
is on the order of 0.5 s where six different arms are being sampled from. In order to prove
that grasp reachability can outperform randomized base placement sampling, we performed
several hundred experiments with the scenes in Figure 4.22. With grasp reachability, it takes
on the order of 0.1 − 0.4s to return the first base placement that has a collision-free inverse
Figure 3.19: Humanoid upper body structure used to extract arm chains.
kinematics solution to grasp the object. With randomized sampling, the time was about
1.2 times slower for the WAM scene, and on average 4-5 times slower for the HRP2 because
its reachability is more constrained.
Each arm consists of a chain of consecutive joints where different arms can share common
joints. The humanoid in this example has 2 torso joints before its kinematic structure
branches into the left and right 7DOF arms (Figure 3.19). Therefore, we can define six
different, semantically meaningful arms: three left arms and three right arms each of 7, 8,
and 9 joints. Unfortunately, analytical inverse kinematics solutions become very difficult
for a higher number of joints, and sometimes the torso joints are used for different tasks, so
we cannot just consider the 9 DOF versions of the left and right arms. Choosing the arm
in GoalSampler BasePlacement() takes into account the complexity of the inverse
kinematics solvers for that chain.
Because we’re dealing with a distribution on a 2D plane, it is possible to speed up
the sampling process by projecting the robot base and all collision obstacles onto the floor
and computing the Minkowski sum. The Minkowski sum could be multiplied with the grasp
reachability distribution to form a new distribution that further encodes collision obstacles.
Such a method would be beneficial for static industrial environments, but would not speed up
computation for quickly changing environments unless the Minkowski sum is progressively
computed as configurations get rejected.
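A sketch of this idea for convex footprints and obstacles: reflecting the base footprint about the origin and summing it with a projected obstacle yields the configuration-space region of base positions to reject. The brute-force Minkowski sum below is illustrative only.

import numpy as np
from scipy.spatial import ConvexHull

def minkowski_sum_convex(A, B):
    """Minkowski sum of two convex 2D point sets (Nx2 and Mx2 arrays)."""
    sums = (A[:, None, :] + B[None, :, :]).reshape(-1, 2)
    return sums[ConvexHull(sums).vertices]

def forbidden_base_region(obstacle_polygon, base_footprint):
    """Base positions whose footprint would overlap the projected obstacle."""
    # reflect the footprint before summing (configuration-space obstacle)
    return minkowski_sum_convex(np.asarray(obstacle_polygon),
                                -np.asarray(base_footprint))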
3.5. Once qbase has been validated and the robot moves to it using the navigation module,
the target object pose is re-confirmed and the grasp planner is executed as shown in Section
3.3.
Because navigation planners work on the 2D plane by assuming the robot has a 2D
footprint, it is necessary to move all the robot’s limbs in the convex prism defined by the 2D
footprint. If the robot is not grasping anything, this step is trivial since a qnavigation
configuration for the arms can be preset. However, if the robot has just grasped an object
and needs to fold its arms to navigate, there is a chance that the object might get into self
Figure 3.21: Robot needs to plan to a configuration such that all its limbs are within the base
navigation footprint.
In the following process we assume the robot base is holonomic; however, BiSpace can also be
applied to non-holonomic robots like humanoids: by setting a large base footprint, the resultant
goals produced by BiSpace have a high probability of being reachable by the base since a path
is produced. We specifically focus on using BiSpace's special characteristics to explore the work
and configuration spaces of the environment and robot. Furthermore, we use the reachability
space of the robot for constructing informed heuristics to informatively search through the
high-dimensional spaces involved with arms.
The core idea of the BiSpace algorithm is to grow two search trees in different configu-
ration spaces at the same time. It combines elements of both bidirectional RRTs and the
RRT-JT algorithm [Weghe et al (2007)]. One tree explores the full robot configuration space
starting from the initial configuration and guarantees feasible, executable, and collision-free
trajectories, while the other tree explores a goal space starting from the set of goal config-
urations and acts as an adaptive, well informed heuristic. The BiSpace algorithm proceeds
by extending RRTs in both spaces. Once certain conditions are met, the forward tree at-
tempts to follow the goal space tree path to the goal. Figure 3.22 shows how the robot path
eventually follows the workspace trees of the gripper.
For clarity, we denote a configuration with q and a goal space configuration with b;
usually a goal space configuration is the transformation of the gripper Tgripper. We assume
that there exists a mapping F (·) from the configuration space to the goal space such that
F (q) maps to exactly one goal space configuration. Using this notation, given a goal space
distance metric δb(F(q), b), the goal of planning is to find a path to a configuration q such
that δb(F(q), bgoals) < εgoal.
The flow is summarized by BiSpace (Algorithm 3.13). The forward variable is used
to keep track of which tree to grow. If forward is true, then the configuration space tree
Algorithm 3.13: BiSpace(qinit , bgoals )
/* ρ ∈ [0, 1] - uniform random variable */
1 forward ← false
2 Init(Tf , qinit ); Init(Tb , bgoals )
3 for iter = 1 to maxIter do
4 if forward then
5 for fiter = 1 to J do
6 q ← Extend(Tf )
7 if ρ < FollowProbability(q) then
8 bfollow ← NearestNeighbor(Tb , q)
9 {success, q′} ← FollowPath(q, bfollow)
10 if success then
11 return success
12 end
13 else
14 for biter = 1 to K do Extend(Tb )
15 forward ← not forward
16 end
17 return failure
is extended J times, using the standard RRT extension algorithm Extend [LaValle and
Kuffner (2001)]. Alternatively, if forward is false, then the goal space tree is extended
K times. After each iteration, the value of forward is flipped so that the opposite tree is
extended during the subsequent iteration. After a new node q is added to the configuration
space tree, a follow step is performed from q with probability FollowProbability(q). If
a follow step is performed, then q is extended toward bfollow and its parents.
The differences between BiSpace and BiRRTs become clear in the follow step. In the
BiRRT case, following consists of connecting the two trees along the straight line joining q
and bfollow; this is possible since bfollow is in the same configuration space as q. Because each
branch of both the forward and backward trees in the BiRRT algorithm represents a valid
collision free path in the configuration space, connecting the two trees immediately implies
a path can be found from the start configuration to the goal configuration. However, that is
not true with the BiSpace algorithm. Since the goal space is different from the configuration
space, the path suggested by the goal space tree must be validated in the configuration space.
Each unique path from a node in the goal space tree to a goal can be used by the forward
tree as a heuristic to informatively bias extension toward the goal. Starting from bfollow,
such a path can be extracted by recursively following its parents. The forward tree can use
the goal space path generated by bfollow as a bias to greedily follow it. If the forward tree
succeeds in reaching the goal, a solution is returned (Figure 3.22). Otherwise, the search
continues as before.
Path following is an integral part of the BiSpace algorithm. It generates a very powerful
bias as to where the configuration tree should grow by using the nodes in the goal space
tree. Each goal space node has already validated a subset of the conditions necessary for the
configuration tree to follow it.
We present a simple, but effective, implementation of path following using a stochastic
gradient approach as shown in FollowPath (Algorithm 3.14). The forward tree slowly
makes progress by randomly sampling configurations that get close to the target goal space
node b. Whenever the forward tree stops making progress, it checks if b has any parents. If
it does, b is set to its parent and the loop repeats. If there are no more parents, the goal
space distance from q to the final parent b.root is checked: if this distance is within the goal
threshold, the function returns success; otherwise it returns false.
FollowPath can require a lot of samples if SampleNeighborhood uniformly samples
the neighborhood of q. This is especially a problem for the high-dimensional configuration
spaces used in manipulation planning. Instead, we sample each of the dimensions one at a
time while leaving the rest fixed. This type of coordinate descent method has been shown to perform better than regular uniform sampling in optimization and machine learning algorithms [Luo and Tseng (1992)]. Because it is not always beneficial to be greedy due to many local minima, we introduce γ_inflation to relax the distance metric we are minimizing. The inflation has a similar effect to inflating the goal heuristic in A*; γ_inflation = 1.4 is used for all results.
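To make the coordinate-descent follow step concrete, the following Python sketch shows one FollowPath iteration under the assumptions above. The function dist stands in for the goal-space metric δ_b(F(·), ·), and the step size and sample count are illustrative values, not taken from the implementation used for the experiments.

import random

def sample_neighborhood_coordinate(q, step=0.05):
    """Return a copy of q with a single randomly chosen DOF perturbed."""
    qnew = list(q)
    i = random.randrange(len(q))
    qnew[i] += random.uniform(-step, step)
    return qnew

def follow_step(q, b, dist, gamma_inflation=1.4, num_samples=32):
    """One FollowPath iteration: accept the best sample that beats the
    inflated distance to the goal-space node b, or report no progress."""
    best, best_dist = None, gamma_inflation * dist(q, b)
    for _ in range(num_samples):
        qc = sample_neighborhood_coordinate(q)
        if dist(qc, b) < best_dist:
            best, best_dist = qc, dist(qc, b)
    return best   # None means no progress; the caller then moves b to its parent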
We use inverse kinematics solution validation as an optional addition to FollowPath.
After the forward tree terminates at a configuration q, an IK solution can be checked for
a subset of the DOFs of the configuration space. If there exists a solution, we can run a
bidirectional RRT using the subset of DOFs used for IK to find a path from q to the new
goal configuration. For example, if a 7 DOF arm is mounted on a mobile platform, its full
configuration space becomes 10 dimensional, however, the arm’s IK equations will still remain
7 dimensional. Having such a check greatly reduces planning times and is not prohibitively
expensive if the IK equations are in closed form. While some algorithms ignore IK solutions,
BiSpace can naturally use inverse kinematics to its advantage. Empirical results suggest
that BiSpace can experience a 40% decrease in planning time when exploiting available IK
solutions.
Algorithm 3.14: FollowPath(q, b)
/* ρ ∈ [0, 1] - uniform random variable */
1  success ← false
2  for iter = 1 to maxFollowIter do
3    best ← null
4    bestdist ← γ_inflation · δ_b(F(q), b)
5    for i = 1 to N do
6      q′ ← SampleNeighborhood(q)
7      if δ_b(F(q′), b) < bestdist then
8        bestdist ← δ_b(F(q′), b)
9        best ← q′
10   end
11   if best is null then
12     if b.parent is null then
13       break
14     b ← b.parent
15   else
16     q ← T_f.add(q, best)
17 end
18 success ← δ_b(F(q), b.root) < ε_goal
/* Optional IK test */
19 if not success and (q′ ← IKSolution(q, T_b.goals)) then
20   {success, q} ← BiRRT(T_f, q, q′)
21 return {success, q}
Reachability Heuristic
It is clear that successfully moving to a specific grasp requires the robot to move its base so that its reachability volume coincides with the particular grasp. Given a target grasp g, we would like to compute a probability distribution Pg(p) of completing the grasp as a function of the base placement p. In Section 4.4 we introduce a reachability map that can retrieve the set of all possible base placements for a given end-effector transformation. This set can be treated as a Mixture of Gaussians, from which Pg(p) can be computed directly.
where ω weights the importance of base exploration vs. arm exploration. When the robot base is far away from the goal, the weight ω should be small so that the robot takes bigger steps on average. This suggests a simple monotonic function for ω:
where σ is proportional to the length of the arm. Figure 3.23 demonstrates the behavior of BiSpace when using the modified distance metric, and empirical results show that planning times reduce by 20% when this metric is used.
Follow Probability
The farther the robot is from the goal, the less chance it has of reaching it through FollowPath. The reason is that FollowPath itself is not exploration-centric like RRTs; it is meant for greedily approaching the goal when the body and hand of the robot are relatively unobstructed by complex environment obstacles. We present two metrics to compute the follow probability: the kinematic reachability explained in Section 4.3 and the distance falloff ω(q) given by Equation 3.15. Both metrics monotonically decrease as the robot gets farther from the goal. The kinematic reachability is more informed since it is a 6D table reflecting the real arm kinematics, while ω(q) is much easier to compute and often very effective, as shown in Figure 3.23. The correct follow probability can have a dramatic effect on planning times, sometimes reducing them by 60-70%.
BiSpace Experiments
We compare BiSpace with RRT-JT [Weghe et al (2007)] and the two-stage navigation approach described in Section 3.6.2. A mobile base adds three degrees of freedom to the configuration space. Because randomized algorithms are known to have a long convergence tail, we terminate the search after 10-20 seconds and restart. This termination strategy produces much faster average times for all algorithms. Note, however, that every termination counts against the final planning time for that particular algorithm. Termination times were set individually for each algorithm in order to give it the fastest possible average time. The average planning
time is recorded in Table 3.3 where each algorithm is run on each scene 16-30 times. Other
parameters like RRT step size and goal thresholds were kept the same for all algorithms. To
demonstrate the generality of these algorithms, we evaluated scenes using both the HRP2
humanoid and a WAM arm mounted on a Segway (Figure 3.24).
Since BiRRTs operate only in the full configuration space, it would be unfair if they were seeded with the final solutions without any penalties. In order to make the comparison fair, we randomly sample full configuration solutions for a given target grasp until a collision-free, feasible configuration is generated. The recorded time is added to the final planning time. The sampling takes somewhere from 2-9 seconds for HRP2 and less than 1 second for the WAM on the Segway.
Figure 3.24: Scenes used to compare BiSpace, RRT-JT, and BiRRTs.
Table 3.3: Average planning time in seconds for each algorithm for the scenes in Figure 3.24.
When planning for the HRP2 robot, we make the assumption that its base can freely
travel on the floor and the legs do not need to move. Once BiSpace has planned a global
trajectory, later footstep planners can add the necessary leg movements and dynamics to
make the HRP2 move. In order to allow for leg space, an invisible cylinder is super-imposed
over the lower body. Thus the planning space for HRP2 is reduced to 11 degrees-of-freedom:
3 for the base, 1 for the waist, and 7 for the arm. As can be seen from Figure 4.19, most of the hand reachability lies at shoulder height to the side of the robot. This makes it hard for the robot to manipulate objects in front of it at waist height, which is why all the planners require significant planning time.
One of the hardest scenes for BiSpace is when the target object is on a table and HRP2 has to circle the table to get to it (Figure 3.25). Here, the goal space tree produces many false paths directly over the table, which the HRP2 cannot follow to the end. This process goes on until the rest of the configuration space tree finally explores the space on the other side of the table. This limitation is characteristic of bi-directional RRTs as well and provides a good example of why exploration is always a crucial ingredient in sampling-based planners.
Figure 3.25: Hard scene for BiSpace. The forward space tree (red) does not explore the space since it is falsely led over the table by the goal space tree (blue).
We tested two main scenes for the WAM: a living room scenario where the WAM is
mobile, and a scenario where the WAM has to put cups in a dishwasher. The WAM arm
has 7 degrees of freedom and very high reachability, making planning very fast. BiSpace compares relatively well with BiRRTs; however, it is a little slower due to the extra overhead of the FollowPath function.
The BiSpace algorithm can efficiently produce solutions to complex path planning prob-
lems involving a goal space that is different from the configuration space. We presented
several heuristics that exploit the kinematic structure of the robots to speed up planning.
3.7 Discussion
We have worked to find a reasonable and sufficient set of planning algorithms that can cover
the entire spectrum of manipulation planning problems proposed in Chapter 2. We have
shown their application in a number of different tasks with a number of different robot plat-
forms. Most importantly, all information that the planning algorithms use can be computed
by a geometric analysis of the environment. In fact, we showed how the planning knowledge-
base plays a key role in instantiating samplers and functions used throughout planning. It
is very difficult to measure the time-complexities of sampling-based planners because of the
high dependency on the environment and the sampling distribution. Furthermore, the RRT
family of planners are known to have long tails for planning time distributions, making it
difficult to prove theoretical time bounds for them. Instead, we have presented timing ex-
periments of thousands of simulations for environments that are representative of home and
industrial settings.
All the planning theory culminated in a generalized configuration sampler that can efficiently reason about which grasps to use in an environment and can sample informative base placements. Along with covering the basic grasp planner, we also covered other common planning situations like planning to open a gripper or planning to a navigation mode when used in conjunction with navigation planners. Planning with a mobile base requires not only efficient searching for paths, as shown in the BiSpace algorithm, but also careful selection of the goals. For example, choosing a grasp requires simulating the grasp in the environment to ensure that only the targeted object is in contact with the gripper. Quickly choosing a base placement requires computing the arm reachability simultaneously with the grasp sets for all target objects. All these planning algorithms have been deployed on real robot systems, and the computation times reported at the end of each respective sub-section have been verified over many trials.
Chapter 4
Manipulation Planning
Knowledge-base
The information a task relies on can be divided into information that defines the task and is independent of the current state of the environment, and information obtained at run-time, like obstacles and target positions, that the robot has no way of pre-computing beforehand.
In this chapter we analyze the relational structure of information pertaining to the robot
and task specifications and develop several offline algorithms that can pre-compute this
information into a form that makes online retrieval as quick as possible. We organize all this
information into a planning knowledge-base where all knowledge-base models are specifically
optimized to make the manipulation processes discussed in Chapters 3 and 5 as quick and
accurate as possible. We achieve this by identifying basic building blocks and explicitly
encoding their inter-dependencies. The goal is to quickly compute answers to questions about
the inter- and intra-relationships of these blocks by approximating the acquired models and
caching the information for quick retrieval.
Ideally, the planning knowledge-base should give a sense of how the robot operates within
its workspace and how the task definition can affect the choices the robot has to make. Most
of the building blocks are based on the geometry of planning problems where precisely de-
fined computations do not have a trivial solution to their inverse computation, or require
analyzing the behavior of the computation across the entire input space. Using the main
components from Figure 2.4, we can precisely define their relationships to produce the plan-
ning knowledge-base graph of Figure 4.1. This shows a computational dependency graph of
the components commonly used to solve manipulation tasks. At the left we start with the
robot and task specifications. Along with the kinematics and geometry, each robot has a
labeled set of chains that serve as the arm and gripper groups. Each gripper is attached to
Figure 4.1: Computational dependency graph of all the components extracted from the relational
graph in Figure 2.4.
an arm and has a specific 3D direction of approach. On the other hand, the target object
just contains the CAD model and a set of training images that allow vision algorithms to
create an object-specific pose recognition program for the target object.
We begin with a motivation for all the components in Figure 4.1. Analytic inverse kinematics equations for each arm are required for fast planning and for many of the geometric analyses that follow. In Section 4.1, we introduce an algorithmic approach named ikfast for
finding the most robust closed-form solutions. Inverse kinematics allows exact computation of
the arm reachability space that is used for many planning heuristics (Section 4.3). Inverting
the reachability and projecting it down to 2D planar movements allows us to start computing
base distributions (Section 4.4). From the target object perspective, we generate all the
grasps that handle the object according to the task specification. In Section 4.2 we cover
two algorithms for computing grasps: force closure grasps and caging grasps. We show the
usefulness of these grasps for a wide range of tasks. For each pose recognition program trained
from the object specification, we gather a map of all the camera locations that can robustly
detect the object (Section 4.7). This combined with the preshapes of the grasps allows
us to compute a sensor visibility map for the robot. By combining grasp sets and inverse
reachability, we can start reasoning about how the robot base gets distributed depending on
the location of the target object; we call this map grasp reachability and cover its generation
and usage in Section 4.5. We also discuss the advantages of using convex decompositions for
the collision geometry of a robot (Section 4.6). Convex decompositions give us much simpler
geometries to work with allowing us to pad the robot and to approximate the volume of
each of its links. Once we have an estimate of the volume a link takes, we can compute
swept volumes for each of the joints and from there compute much better configuration
space distance metrics (Section 4.6.3).
The knowledge-base should be optimized for accurate modeling of the underlying geome-
try and fast usage times during online planning. Because every model has several parameters,
we provide experimental data to give an intuition for the discretization factors and thresholds involved with each of the models. However, it should be noted that we are not
concerned with the computation time of the generation process for each component. We
assume that such databases can be computed offline in a cluster of computers and then
stored on a server for quick download and usage. In fact, when we cover the OpenRAVE robotics environment responsible for the experiments in this thesis, we describe the computation of unique hashes for the robot and target bodies so that the data can be indexed more consistently (Section A.1.5).
4.1 Inverse Kinematics
The relationship between the workspace movement of the robot's sensors and grippers and the configuration of the robot is expressed through kinematics. For planning algorithms and
other workspace analysis, the inverse problem of going from desired frames of reference on
the robot to joint angles is equally important. In this thesis we identify several problems
in inverse kinematics and provide an automated analysis and solutions for them. The most
popular inverse kinematics problem is to compute all the joint angles q_arm that move a link to a specified translation and orientation T = [R | t]. Pure mathematical attempts to analytically solve the general problem could not handle singularities, and the resulting programs are not always optimal [Low and Dubey (1987)]. Therefore, many flavors of numerical methods have been developed involving gradient descent using the Jacobian ∂T/∂q. In fact, most of the literature on generalized inverse kinematics solutions focuses on numerical methods, disregarding the analytical problem as too difficult to solve. Although
general analytical solutions to the problem have been proposed [Manocha and Zhu (1994)],
the methods can become numerically unstable due to the heavy mathematical machinery
like eigendecomposition.
In this section, we look at the automatic generation of a minimal, numerically stable
analytical inverse kinematics solver using an algorithmic search-based approach. We note
that it is difficult to prove existence of a solution with algorithmic approaches; however, we
will show that by sacrificing a little generality, the presented analysis can produce much more
stable closed-form solutions for most of the common robot arms available today. Closed-form
solutions are vital for motion planning for two reasons:
1. Numerical inverse kinematics solvers will always be much slower than closed form solu-
tions. But because planners require being able to process thousands of configurations
per second, it is critical to have fast IK solvers. The closed-form code generated by
our proposed method can produce solutions on the order of 6 microseconds. As a
comparison, most numerical solutions are on the order of 10 milliseconds and have to
worry about convergence.
2. The null space of the solution set can be explored because all solutions are computed.
This gives planning algorithms more freedom to move the robot around.
In this thesis, we develop the ikfast algorithm to perform this generation process. As a
program, ikfast generates a C++ file that directly compiles into a solver. Although inverse
kinematics can become arbitrarily complex with closed chains, we only consider kinematic chains employing one-degree-of-freedom hinge and prismatic joints.
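As a usage illustration, the sketch below loads a robot with the OpenRAVE Python bindings (openravepy, see Appendix A), generates or loads an ikfast solver for the 6D transform parameterization, and queries all closed-form solutions. The robot file and call sequence follow the public openravepy API and are assumptions for illustration, not excerpts from this chapter.

from openravepy import Environment, IkParameterization, IkFilterOptions, databases

env = Environment()
env.Load('robots/barrettwam.robot.xml')      # example robot assumed to ship with OpenRAVE
robot = env.GetRobots()[0]
manip = robot.GetActiveManipulator()

# Generate (or load a cached) ikfast solver for the 6D transform parameterization.
ikmodel = databases.inversekinematics.InverseKinematicsModel(
    robot, iktype=IkParameterization.Type.Transform6D)
if not ikmodel.load():
    ikmodel.autogenerate()                   # runs ikfast and compiles the generated C++ solver

# Query every closed-form solution for the current end-effector pose.
Tee = manip.GetTransform()                   # 4x4 end-effector transform
solutions = manip.FindIKSolutions(Tee, IkFilterOptions.CheckEnvCollisions)
print('%d IK solutions found' % len(solutions))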
Figure 4.2: An outline of the parameterizations ikfast can solve along with how the equations
are decomposed.
Figure 4.2 shows an outline of the four parameterizations and general approach to solving
each. In the following sections we discuss the ikfast implementation in detail and evaluate
its performance at the end.
We derive all constraints used for solving the IK by changing the frame of reference of
Equation 4.1:
(4.4)
T_0^{-1} T_{ee} T_n^{-1} = J_0 T_1 J_1 T_2 J_2 \cdots J_{n-1}    (i = 0)
J_0^{-1} T_0^{-1} T_{ee} T_n^{-1} = T_1 J_1 T_2 J_2 T_3 \cdots J_{n-1}    (i = 1)
T_1^{-1} J_0^{-1} T_0^{-1} T_{ee} T_n^{-1} = J_1 T_2 J_2 T_3 J_3 \cdots J_{n-1}    (i = 2)
...
T_{n-1}^{-1} J_{n-2}^{-1} \cdots T_0^{-1} T_{ee} T_n^{-1} = J_{n-1}    (i = 2n - 2)

For each of the equations above, we define F_{ee,i} = \begin{pmatrix} R_{ee,i} & t_{ee,i} \\ 0 & 1 \end{pmatrix} and F_i = \begin{pmatrix} R_i & t_i \\ 0 & 1 \end{pmatrix} such that F_{ee,i} = F_i. In order to simplify the equations that will be covered below for several real robots, we let c_d = cos j_d and s_d = sin j_d.
In the past, there have been four different approaches to solving the IK problem. The first is an algebraic approach where the problem resolves into a high-degree univariate polynomial. Such an approach can easily become ill-conditioned when searching for roots. A second approach is based on analyzing the structure of the solution sets of polynomial systems [Wampler and Morgan (1991)], but suffers from numerical ill-conditioning. The third approach is based on linear algebra, where the problem can be formulated as an eigendecomposition [Raghavan and B.Roth (1990); Manocha and Zhu (1994)]. Although the linear algebra approach is the fastest of all the proposed methods, it still suffers from numerical problems due to the eigendecomposition of 48x48 matrices. Our method differs from the past approaches in that we emphasize numerical stability over solution generality.
The constraints in Equation 4.4 correspond to equations of the form:

(4.5)  \sum_i a_i \prod_j c_j^{p_{i,j}} s_j^{q_{i,j}} j^{r_{i,j}} = 0

where p_{i,j} + q_{i,j} + r_{i,j} ≤ 1. Combining with the c_j^2 + s_j^2 - 1 = 0 trigonometric constraints
for all revolute joints, we have a system of up to 2n unknown variables. The challenge to
building a stable IK solver is to exploit as much structure of the problem as possible by first starting with all variables that can be solved with low-degree polynomials. In this thesis, we cover kinematic complexity up to solving intersections of conics. The problem can become more complex when pairs of joint variables are coupled, in which case intersections of quadrics are necessary [Dupont et al (2007)]. Although extremely rare in today's commercial robots, if more joint coupling is present, we fall back to a general solver like [Manocha and Zhu (1994)].
4.1.2 Evaluating Equation Complexity
The first challenge is using the forward kinematics equations to set up all possible constraints on the variables. The solver searches for solutions, prioritizing the simplest and most unambiguous solutions first. A variable can be solved in multiple different ways, each way having
its own set of degenerate cases. We consider two levels of complexity:
• Solution complexity deals with the number of discrete solutions a joint value can
take. Introducing square roots and inverse sin/cos methods introduces two possible
solutions. Sometimes this is unavoidable because the underlying kinematics dictate
several possible solutions, but most of the time one of the solutions can be inconsistent
with the rest of the kinematics.
When deciding which variable to solve for, the solution complexity is evaluated for each variable and the set of solutions having the same minimum complexity is chosen. Among these, the one with the minimum numerical complexity is chosen.
Given the current set of known variables that have been solved, we search for equations
with the following priority:
• cos jd and sin jd can be solved from two linearly independent equations polynomial in
cos jd and sin jd . Polynomials of degrees more than 1 are penalized.
• If no linearly independent equations exist, search for equations of the form a cos jd +
b sin jd = c. This will allow us to solve for the correct angle within π radians.
• Finally, resort to equations of the form (cos j_d)^2 = a, which can have up to four answers:

(4.6)  j_d ∈ { cos^{-1}\sqrt{a}, -cos^{-1}\sqrt{a}, cos^{-1}(-\sqrt{a}), -cos^{-1}(-\sqrt{a}) }
If there are degenerate cases like divides by zero, an explicit branch occurs in the generated code that sets all instances of the condition to zero and re-analyzes the equations. There are two categories of divides by zero. In one, the condition can be directly evaluated to a value of a joint variable, like cos j_d = 0; the other cannot be solved so easily or is not related to joint solutions, like p_x^2 + p_y^2. The former allows us to explicitly create branches for j_d = π/2 and j_d = -π/2, which we will show below is necessary to get correct solutions. The latter divide-by-zero category makes it more difficult to propagate changes at compile time, so it is penalized more.
For example, when presented with the following set of equations:
(4.7)  cos j_0 = a_0
(4.8)  a_1 sin j_1 + a_2 cos j_1 = a_3
(4.9)  a_4 sin j_1 + a_5 cos j_1 = a_6
• atan2 function is extremely robust to zeros and infinities and will return a solution in
the full circle [0, 2π]. It is available on all math libraries.
• The domain of cos−1 and sin−1 is [−1, 1] making the functions unstable due to numerical
imprecision. There are also two solutions, one of which might not be consistent with
the robot kinematics.
Therefore, solutions like Equation 4.11 should be prioritized over Equation 4.10; in fact, equations of the form sin^{-1} x, cos^{-1} x, and \sqrt{x} should only be used as a last resort when there is nothing else available.
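A tiny numeric illustration of why atan2 is preferred: an argument pushed slightly outside [-1, 1] by floating-point error breaks cos^{-1}, while atan2 of the sine/cosine pair remains well defined over the full circle.

import math

c = 1.0 + 1e-12            # "cos j" corrupted by round-off
s = 0.0
try:
    print(math.acos(c))    # raises ValueError (math domain error)
except ValueError as e:
    print('acos failed:', e)
print(math.atan2(s, c))    # 0.0, the correct angle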
In order to apply a solution in a real working IK solver, there needs to be guarantees that
the solution will always produce the correct answer. Developing an IK solver that works 100%
of the time is difficult because many degenerate cases arise. For example, it is a common mistake for researchers to apply Equation 4.11 as-is without considering the degenerate case a_1 a_5 - a_2 a_4 = 0 when solving for s_{j_1} and c_{j_1}. This divide by zero is a serious problem and completely changes the priority in which equations have to be considered. As we will see in the examples below, this can actually affect which variable is solved first. Unfortunately, we cannot just cancel out the denominators in the hope of eliminating the zero because

(4.14)  atan2(b_0/c, b_1/c) ≠ atan2(b_0, b_1).
The sign of c can affect the quadrant the angle is returned in, thus potentially returning an angle that is π radians from the correct result. Therefore, the correct approach is to first evaluate the potential divide-by-zero conditions and branch to a different set of solutions where the divide by zero is delayed and other variables are prioritized; eventually a different solution will be found for the original variable.
In the following sections we will be using the Kuka KR5 R850 industrial manipulator (Figure 4.4) and the PR2 personal service robot (Figure 4.3) for explaining inverse kinematics generation. The PR2 service robot has seven joints, which adds a redundancy to the kinematics. This issue will be discussed in further detail below, but for now we assume that the value of the first joint j_0 is set by the user.
(4.15)  t_{ee,i} - t_i = 0
(4.16)  ‖t_{ee,i}‖^2 - ‖t_i‖^2 = 0
We show how to solve a subset of the joints of the PR2 for translation. We treat the wrist
position where the last three joints intersect as the translation target. The PR2 kinematics
equations are
We treat j0 as the free parameter, so c0 and s0 are known. The first solution found is for
j3 where we get the constraint:
‖t_{ee,4}‖^2 - ‖t_4‖^2 = -0.066959 - 0.08 c_1 + 0.2 p_x c_0 + 0.2 p_y s_0 - 0.8 p_z s_1 + 0.8 p_x c_0 c_1 + 0.8 p_y c_1 s_0 - p_x^2 - p_y^2 - p_z^2 = 0
This equation has two solutions, and no other constraint on j_3 can be computed, so we add both as possible solutions. Geometrically, j_3 is the elbow of the robot, and there are usually two solutions that achieve a given distance from the base to the arm tip.
Treating j_3 as a known value, the next solvable variable is j_2:

⇒ s_2 = -3.115264797 (p_x s_0 - c_0 p_y) / s_3
There are no other constraints that can be formed, so we compute two solutions for j_2 from sin j_2. Geometrically this makes sense because two angles can achieve the same direction in two different ways; once one is set, the other angle is uniquely determined. However, note that there is a potential divide-by-zero problem with sin j_3. Because this is the only possible solution, we have to compute a new set of constraints given j_3 ∈ {0, π}. For example, we get the following new equations when setting j_3 = 0:
Figure 4.4: The labeled joints (black) of the Kuka KR5 R850 industrial robot.
t'_0 = \begin{pmatrix} 0.1 c_0 + 0.721 c_0 c_1 \\ 0.1 s_0 + 0.721 c_1 s_0 \\ -0.721 s_1 \end{pmatrix}, \quad t'_{ee,0} = \begin{pmatrix} p_x \\ p_y \\ p_z \end{pmatrix}

t'_1 = \begin{pmatrix} 0.1 + 0.721 c_1 \\ 0 \\ -0.721 s_1 \end{pmatrix}, \quad t'_{ee,1} = \begin{pmatrix} p_x c_0 + p_y s_0 \\ p_y c_0 - p_x s_0 \\ p_z \end{pmatrix}

t'_4 = \begin{pmatrix} 0.721 \\ 0 \\ 0 \end{pmatrix}, \quad t'_{ee,4} = \begin{pmatrix} -0.1 c_1 - p_z s_1 + p_x c_0 c_1 + p_y c_1 s_0 \\ p_y c_0 - p_x s_0 \\ -0.1 s_1 + p_z c_1 + p_x c_0 s_1 + p_y s_0 s_1 \end{pmatrix}
Going back to the original non-zero branch, with j2 and j3 computed, the final solution
to j1 becomes
(4.18)  j_1 = atan2\!\left( \frac{p_z(0.1 - c_0 p_x - p_y s_0) - 0.321 c_2 s_3 (0.4 + 0.321 c_3)}{-(0.4 + 0.321 c_3)(0.1 - c_0 p_x - p_y s_0) + 0.321 c_2 p_z s_3},\; \frac{(0.4 + 0.321 c_3)^2 - p_z^2}{-(0.4 + 0.321 c_3)(0.1 - c_0 p_x - p_y s_0) + 0.321 c_2 p_z s_3} \right)
which is unique. Therefore the translation component of the PR2 can have up to 4 unique
solutions.
We also provide the equations for the Kuka manipulator in order to show the similarities of the analyses. The first thing worth noting is that the joint axes for j_1 and j_2 are aligned, which hints that j_0 can be decoupled from the computations. The kinematics equations are:
t_0 = \begin{pmatrix} 0.075 c_0 + 0.365 c_0 s_1 + 0.405 c_0 c_1 c_2 + 0.09 c_0 c_1 s_2 + 0.09 c_0 c_2 s_1 - 0.405 c_0 s_1 s_2 \\ -0.075 s_0 - 0.365 s_0 s_1 + 0.405 s_0 s_1 s_2 - 0.405 c_1 c_2 s_0 - 0.09 c_1 s_0 s_2 - 0.09 c_2 s_0 s_1 \\ 0.335 + 0.365 c_1 + 0.09 c_1 c_2 - 0.405 c_1 s_2 - 0.405 c_2 s_1 - 0.09 s_1 s_2 \end{pmatrix}, \quad t_{ee,0} = \begin{pmatrix} p_x \\ p_y \\ p_z \end{pmatrix}

t_1 = \begin{pmatrix} 0.075 + 0.365 s_1 + 0.405 c_1 c_2 + 0.09 c_1 s_2 + 0.09 c_2 s_1 - 0.405 s_1 s_2 \\ 0 \\ 0.335 + 0.365 c_1 + 0.09 c_1 c_2 - 0.405 c_1 s_2 - 0.405 c_2 s_1 - 0.09 s_1 s_2 \end{pmatrix}, \quad t_{ee,1} = \begin{pmatrix} p_x c_0 - p_y s_0 \\ p_x s_0 + p_y c_0 \\ p_z \end{pmatrix}

t_4 = \begin{pmatrix} 0.405 c_2 + 0.09 s_2 \\ 0 \\ 0.09 c_2 - 0.405 s_2 \end{pmatrix}, \quad t_{ee,4} = \begin{pmatrix} 0.335 s_1 - 0.075 c_1 - p_z s_1 + p_x c_0 c_1 - p_y c_1 s_0 \\ p_x s_0 + p_y c_0 \\ -0.365 - 0.335 c_1 - 0.075 s_1 + p_z c_1 + p_x c_0 s_1 - p_y s_0 s_1 \end{pmatrix}
Conic Sections
The constraints in Equation 4.15 might not be so forgiving as to always guarantee that there is one joint variable singled out. It is very common for manipulators to have two joint variables coupled, which gives rise to the following equations:
(4.20) Aj,0 + Aj,1 c0 + Aj,2 s0 + Aj,3 c1 + Aj,4 s1 + Aj,5 c0 c1 + Aj,6 c0 s1 + Aj,7 s0 c1 + Aj,8 s0 s1 = 0
where Aj,k is a set of constants of the j th constraint computed from all frames of reference
of Equation 4.15. Treating c0 , s0 , c1 , and s1 as independent unknown variables, it is very
difficult to find a closed form solution of this family of equations. However, by taking
advantage of the property that c2k + s2k = 1, we can solve Equation 4.20. First we treat the
set of equations as linear with respect to the pairwise variables c0 c1 , c0 s1 , s0 c1 , and s0 s1 ,
and solve for them using basic linear algebra techniques. This allows us to single out the
pairwise variables like this:
(4.21) c0 c1 = B0,0 c0 + B0,1 s0 + B0,2 c1 + B0,3 s1 + B0,4
(4.22) c0 s1 = B1,0 c0 + B1,1 s0 + B1,2 c1 + B1,3 s1 + B1,4
(4.23) s0 c1 = B2,0 c0 + B2,1 s0 + B2,2 c1 + B2,3 s1 + B2,4
(4.24) s0 s1 = B3,0 c0 + B3,1 s0 + B3,2 c1 + B3,3 s1 + B3,4
(4.25) 0 = B4,0 c0 + B4,1 s0 + B4,2 c1 + B4,3 s1 + B4,4
Although there are four equations here, there are only two underlying degrees of freedom, meaning that two of the equations can be formulated as combinations of the other two. In order to remove the coupled variables from the left side, we pick two equations, square both sides, and add them together. For example, (Equation 4.21)^2 + (Equation 4.22)^2 yields c_0^2 for the left side. Applying the transformation for all valid combinations gives:
(4.26)  c_0^2 = D_{0,0} c_0^2 + D_{0,1} c_0 s_0 + D_{0,2} c_0 + D_{0,3} s_0 + D_{0,4}
        s_0^2 = D_{1,0} c_0^2 + D_{1,1} c_0 s_0 + D_{1,2} c_0 + D_{1,3} s_0 + D_{1,4}
        c_1^2 = D_{2,0} c_1^2 + D_{2,1} c_1 s_1 + D_{2,2} c_1 + D_{2,3} s_1 + D_{2,4}
        s_1^2 = D_{3,0} c_1^2 + D_{3,1} c_1 s_1 + D_{3,2} c_1 + D_{3,3} s_1 + D_{3,4}
where D_{j,k} is constant for k < 4 and D_{j,4} is the remainder. If one of the D_{j,4} is a constant value, we can solve for the cosine and sine of a single variable by formulating the problem as the intersection of two conic sections, setting x = cos j_k and y = sin j_k:
E_0 x^2 + E_1 xy + E_2 y^2 + E_3 x + E_4 y + E_5 = 0 \;\Rightarrow\; \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}^{T} \begin{pmatrix} E_0 & \tfrac{E_1}{2} & \tfrac{E_3}{2} \\ \tfrac{E_1}{2} & E_2 & \tfrac{E_4}{2} \\ \tfrac{E_3}{2} & \tfrac{E_4}{2} & E_5 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = 0 \;\Rightarrow\; \mathbf{x}^T C_0 \mathbf{x} = 0

(4.27)  x^2 + y^2 - 1 = 0 \;\Rightarrow\; \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}^{T} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = 0 \;\Rightarrow\; \mathbf{x}^T C_1 \mathbf{x} = 0
We first note that any solution x that satisfies both C0 and C1 , will also satisfy a linear
combination of them:
(4.28) xT (C0 + λC1 )x = 0
The goal is to produce a third conic C_2 = C_0 + λ C_1 such that C_2 is degenerate. A degenerate conic has solutions that are a point, a line, or two intersecting lines. Such geometric primitives are much easier to work with, and we can immediately find their intersection with the unit circle represented by C_1. C_2 is degenerate when its determinant is 0.
This amounts to solving a cubic equation in λ:
|C_0 + λ C_1| = 0 \;\Rightarrow\; λ^3 + F_2 λ^2 + F_1 λ + F_0 = 0

F_2 = E_0 + E_2 - E_5
F_1 = E_0 E_2 - E_0 E_5 - E_2 E_5 + \frac{E_3^2 + E_4^2 - E_1^2}{4}
F_0 = -E_0 E_2 E_5 + \frac{E_0 E_4^2 - E_1 E_3 E_4 + E_2 E_3^2 + E_5 E_1^2}{4}
which can be analytically solved using methods in [Weisstein (1999-present)].
For every real solution to λ, we compute the degenerate conic C_2 and find its null space NS(C_2), which we can use to compute the line:

(4.29)  ∀ v ∈ NS(C_2):  (x, y, 1) · v = G_0 x + G_1 y + G_2 = 0.

Intersecting with the unit circle gives us the following quadratic equation:

(4.30)  (G_1 y + G_2)^2 + G_0^2 y^2 = G_0^2
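As a cross-check of the conic analysis, the following numpy sketch intersects the conic E_0 x^2 + E_1 xy + E_2 y^2 + E_3 x + E_4 y + E_5 = 0 with the unit circle directly, by substituting x = cos j, y = sin j and solving the resulting quartic in t = tan(j/2). This is not the pencil-of-conics construction above, but it computes the same intersection angles and is useful for validating generated branches; all names are illustrative.

import numpy as np

def conic_unit_circle_angles(E, tol=1e-8):
    E0, E1, E2, E3, E4, E5 = E
    # Weierstrass substitution: cos j = (1-t^2)/(1+t^2), sin j = 2t/(1+t^2).
    quartic = [E0 - E3 + E5,
               2.0 * (E4 - E1),
               -2.0 * E0 + 4.0 * E2 + 2.0 * E5,
               2.0 * (E1 + E4),
               E0 + E3 + E5]
    angles = [2.0 * np.arctan(t.real) for t in np.roots(quartic)
              if abs(t.imag) < tol]
    # t = tan(j/2) cannot represent j = pi, so test that case explicitly.
    if abs(E0 - E3 + E5) < tol:
        angles.append(np.pi)
    return angles

# Example: the circle intersected with the vertical line x = 0.5
# (E3 = 1, E5 = -0.5, all others 0) gives j = +/- acos(0.5).
print(conic_unit_circle_angles([0.0, 0.0, 0.0, 1.0, 0.0, -0.5]))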
As an example of an application to these equations, we take the PR2 kinematics, but
this time choose j2 to be the free parameter. This gives us many coupled equations, with
the following being the simplest:
0.321 s_2 s_3 = c_0 p_y - p_x s_0
0.2568 c_3 = -0.253041 - 0.2 c_0 p_x - 0.2 p_y s_0 + p_x^2 + p_y^2 + p_z^2

Squaring all sides and adding the two equations gives a conic equation in c_0 and s_0:

E_0 c_0^2 + E_1 c_0 s_0 + E_3 c_0 + E_4 s_0 + E_5 = 0

E_0 = 9.704874759 \left( \frac{p_x^2}{s_2^2} - \frac{p_y^2}{s_2^2} \right) + 0.6065546724 (p_y^2 - p_x^2)
E_1 = -1.213109345 p_x p_y + 19.40974952 \frac{p_x p_y}{s_2^2}
E_2 = 0
E_3 = -1.534832009 p_x + 6.065546724 (p_x p_y^2 + p_x p_z^2 + p_x^3)
E_4 = -1.534832009 p_y + 6.065546724 (p_y p_x^2 + p_y p_z^2 + p_y^3)
E_5 = 0.02906143424 - 9.704874759 \frac{p_x^2}{s_2^2} + 7.067605371 p_y^2 + 7.674160043 (p_x^2 + p_z^2) - 15.16386681 (p_x^4 + p_y^4 + p_z^4 + 2 p_x^2 p_y^2 + 2 p_x^2 p_z^2 + 2 p_y^2 p_z^2)
Once we solve for the first joint variable, we can continue searching for the others starting
again at Equation 4.15. The divide by sin j2 is clear in these equations, so we specifically
test the IK when j2 ∈ {0, π}. For example, setting j2 = 0 and re-solving allows us to solve
for j0 immediately:
(4.31) j0 ∈ {−atan2(−py , px ), π − atan2(−py , px )}
(4.34)  R_8 = \begin{pmatrix} c_5 & s_5 s_6 & c_6 s_5 \\ s_4 s_5 & c_4 c_6 - c_5 s_4 s_6 & -c_4 s_6 - c_5 c_6 s_4 \\ -c_4 s_5 & c_6 s_4 + c_4 c_5 s_6 & -s_4 s_6 + c_4 c_5 c_6 \end{pmatrix}, \quad R_{ee,8} = \begin{pmatrix} r_{00} & r_{01} & r_{02} \\ r_{10} & r_{11} & r_{12} \\ r_{20} & r_{21} & r_{22} \end{pmatrix}
The most apparent equation here is c_5 = r_{00} because it almost allows computation of j_5. In fact, if r_{00} ∈ {±1}, we can deduce that s_5 = 0, so it is possible to compute j_5 directly. Because one of our constraints from Equation 4.32 involves dividing by elements of R, one very dangerous possibility is that any one of c_4, s_4, c_5, s_5, c_6, or s_6 can appear in the denominator and, if zero, will make the equation degenerate. Therefore, a rule of thumb when generating the equations is to take advantage of all computed constraints on the joint variables. From the above constraint, if r_{00} ∈ {±1}, then we know for a fact that s_5 = 0 and can solve for j_5; therefore, we create an extra branch in the generated code setting j_5 to 0 and π. This gives the following new matrix evaluated at j_5 = 0:
(4.35)  R'_8 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & c_4 c_6 - s_4 s_6 & -c_4 s_6 - c_6 s_4 \\ 0 & c_4 s_6 + c_6 s_4 & c_4 c_6 - s_4 s_6 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(j_4 + j_6) & -\sin(j_4 + j_6) \\ 0 & \sin(j_4 + j_6) & \cos(j_4 + j_6) \end{pmatrix}
which shows that j_4 and j_6 are aligned and can be solved together from j_4 + j_6 = atan2(r_{21}, r_{11}). This type of analysis shows that ikfast can also return an infinite number of solutions by explicitly storing the relationship between joint values. Testing j_5 = π would yield the opposite relationship j_4 - j_6 = atan2(r_{21}, r_{11}). It is worth noting here that these relationships would have been completely ignored if we did not pursue the c_5 = r_{00} constraint.
Back to the original problem, we get these possible solutions:

j_4 = atan2(r_{10} s_5, -r_{20} s_5)
j_6 = atan2(r_{01} s_5, r_{02} s_5)

(4.38)  \begin{pmatrix} R_{ee,6}(j_0, j_1, j_2) & t_{ee,6}(j_0, j_1, j_2) \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} R_6(j_3, j_4, j_5) & t_6 \\ 0 & 1 \end{pmatrix}
Figure 4.5: Two types of separable kinematic chains, which allow rotation joints (red) to be
solved separately from translation joints (blue). T0 is the base link frame, Tee is the end effector
link frame.
where t_6 is constant. This allows us to build up the constraints for just the j_0, j_1, j_2 variables by using Equation 4.15 for all i ≤ 6. After the first 3 joint values are solved, we can treat R_{ee,6}(j_0, j_1, j_2) as a known constant matrix and solve directly for R_6(j_3, j_4, j_5) by using constraints built from Equation 4.32.
The second type is when the first 3 joints intersect at a common point (Figure 4.5b),
which produces the following separation:
(4.39)  \begin{pmatrix} R_{ee,6}(j_0, j_1, j_2) & t_{ee,6} \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} R_6(j_3, j_4, j_5) & t_6(j_3, j_4, j_5) \\ 0 & 1 \end{pmatrix}.

We first solve for the translation component by building up all constraints in Equation 4.15 for all i ≥ 6. Then we treat R_6(j_3, j_4, j_5) as a known constant matrix and solve for R_{ee,6}(j_0, j_1, j_2).
where p is any position along the ray and d is a unit direction, for a total of four degrees of freedom. Given the target ray r_n(s) = p_n + s d_n in the coordinate system of the last link, the inverse kinematics of a ray is the problem of matching r_n to a globally specified ray r_ee(s) = p_ee + s d_ee (Figure 4.6):
Figure 4.6: The parameterization of the ray (green) with the last two axes intersecting.
(4.41)  ∃ s ∈ ℝ:  p_{ee} + s d_{ee} = T_0 J_0 T_1 J_1 T_2 J_2 \cdots T_n \begin{pmatrix} p_n \\ 1 \end{pmatrix}

(4.42)  d_{ee} = T_0 J_0 T_1 J_1 T_2 J_2 \cdots T_n \begin{pmatrix} d_n \\ 0 \end{pmatrix}
Here we present a solution to the inverse ray kinematics problem for the case that:
• The last 2 joints intersect at a common point.
• The problem is transformed so that the position of the ray in the last link's coordinate system is at the origin: T_n \begin{pmatrix} p_n \\ 1 \end{pmatrix} = 0. If this is not the case, then T_n can be modified without loss of generality to satisfy this condition.
These constraints allow us to separate ray position from ray direction:
We find a system of equations for j0 and j1 . Because there is coupling due to the cross
product, the system of constraints is a conic section solvable using the techniques in Section
4.1.3. For direction, the equations become
cos j_1 = \frac{1}{0.85} \left( p_z d_x^2 + p_z d_y^2 - p_x d_x d_z - p_y d_y d_z \pm \sqrt{0.85^2 - (p_x d_y - p_y d_x)^2 - (p_z d_x - p_x d_z)^2 - (p_z d_y - p_y d_z)^2} \right)

c_2 s_3 = d'_x
s_2 s_3 = d'_y
c_3 = d'_z

d'_x = d_z s_1 + c_0 c_1 d_x - c_1 d_y s_0
d'_y = c_0 d_y + d_x s_0
d'_z = c_1 d_z + d_y s_0 s_1 - c_0 d_x s_1
First d'_z is checked for being on the boundary {±1}; otherwise j_3 is evaluated as:

(4.49)  j_3 = atan2(d'_y / s_2, d'_z)
(4.50)  j_3 = atan2(-d'_x / c_2, d'_z)
the IK solution. In order to make the searching as quick as possible, the discretization value
should be as high as possible without sacrificing solution feasibility. A good rule of thumb
is that joints lying closer to the base are more important.
However, we cannot just get away with choosing the farthest joint from the base; some joints cannot be picked as free because they can destroy the independence between the translational and rotational components. For example, the 6D IK solution presented here requires the first 3 or last 3 active joints to intersect at a common point. The general heuristic we use for automatically choosing free joints is to start from the joints farthest from the base and continue choosing closer ones until a solution is found.
Even though free joints are set by the user, degenerate cases due to their values can still occur. For example, the Barrett WAM has 4 joints for solving the translation component, and choosing the free joint to be j_2, we can see that joint axes can align and create degenerate cases when j_2 ∈ {0, π/2, π, 3π/2}. We create a special case for each degenerate value and re-solve the IK equations from the start. Finding these degenerate values can be very challenging at times; however most, if not all, robots have joint axes orthogonal to each other, so it is safe to assume that configurations become degenerate on all four π/2 boundaries. In fact, such checks, even if not necessary, will only make the solution more robust at the cost of computation time.
Table 4.1: Average time for calling the ikfast solver with the success rate of returning correct
joint values for transformation queries.
ikfast never returns a wrong solution, so correctness is not an issue here. Looking at the table, the manipulators with 6 joints succeed all the time. Manipulators with 7 joints had the free joint discretized at 0.1 radians and so had failures because the discretization was not fine enough; setting a smaller discretization fixes the problems, but is not practical. The free joint of the WAM, PA10, and HRP2 manipulators was the 3rd joint. Both WAM and PA10 have very big joint ranges, so there is always a solution found. The HRP2 joint ranges are small, so there are times when a solution is not found. The free joint for the PR2 is the first joint, which greatly affects the solution space. Consequently, we see that the PR2 succeeds less often than its counterparts when the discretization value is the same.
Although each ikfast solution can compute the exact result, the number of operations involved in the computation can greatly influence the numerical imprecision introduced by floating-point operations. We have attempted to prioritize the searching of solutions so that the most stable and most confident solutions are selected. Even if the optimal solutions are not always found, we have shown that the solutions are very simple, do not require advanced math computations, and even minimize the number of divides that have to be done. Therefore, they are much more robust than the previous general linear algebra methods proposed for analytic inverse kinematics.
Table 4.2 shows the parameters of the ikfast algorithm. We’ve discussed how to auto-
matically set free joints and discretization values. As defaults, we compute all values to 10
significant digits and have a default zero rounding at 10−7 .
ikfast fundamentally changes the way researchers approach manipulation planning prob-
lems. Instead of relying on gradient-based methods plagued with numerical errors and slow
computation times, all researchers can safely integrate analytical IK into their planners and
begin searching with all solutions.
4.2 Grasping
Grasping is one of the fundamental differences between motion planning and manipulation
planning. Grasping places explicit workspace constraints on the motions of the robot gripper
with respect to the target objects. Many grasping models have been proposed that analyze
contact modes and parameterize the geometry and space of grasps. However, there are
many advantages of using grasp sets rather than parameterized models like Support Vector
Machines [Pelossof et al (2004)] or explicit goal regions [Berenson et al (2009b); Prats et al
(2007b)]. Although it might be more compact to explicitly parameterize the space of good
grasps, such methods can be susceptible to modeling error. A grasp can easily become invalid by moving the gripper just a few millimeters in the wrong direction; if the parameterizations try to fill in data holes, they could easily allow invalid grasps. Instead, we stress the need for representing all valid grasp spaces as sets. Planning and sampling from the sets is just as convenient as with parameterized models. In our analyses we attempt to maintain generality by
relying solely on sets of grasps, which can approximate a complex distribution really well,
and sampling the set is a simple matter of weighted selection. Furthermore, grasp sets can
be ordered based on priority and space exploration similar to how deterministic samplers
like the van der Corput sequence progressively cover a space [LaValle (2006)].
Each grasp is parameterized by the preshape of the gripper q_grasp, its start transformation with respect to the target object T_grasp^object, and the direction of approach to the object v^object.
This is the minimal information necessary for the grasp planners covered in Section 3.3 to
move the gripper to the correct location. The analyses that happen after this point can be
divided into three different components:
• Picking a grasping strategy. Once the gripper is at the desired start position with
respect to the object, the grasping strategy determines how the gripper and its joints
move in order to grasp the object. This could involve analyses with sensors in the loop.
• Evaluating Grasp Performance - Once contact with the object has been deter-
mined, the performance of that contact needs to be evaluated to determine if a grasp
is stable.
• Searching the Space of all Grasps - Even though the grasp computation is an offline
process, the space of all grasps satisfying the evaluation criteria must be efficiently
explored and the manifold must be represented by a method that can be easily used
in a planner like in [Goldfeder et al (2007)].
• Stable Grasps - We show how to find a set of physically stable grasps for the object and gripper pair by analyzing contact forces.
• Caging Grasps - We show how to open doors and drawers without imposing force
closure constraints, which greatly increases the robot’s feasibility space to complete
the task.
• Insertion Grasps - We show examples of how to work with grasps that insert a part of the gripper into objects.
distance that the gripper moves back when hitting the target and before closing its fingers
[Berenson et al (2007)]. Even if the preshape is held constant, efficiently exploring an 8DOF
space is very difficult [Ciocarlie and Allen (2009)].
Our method of parameterizing the space first defines a ray {p_g, d_g} in the gripper's coordinate system and a ray {p_t, d_t} on the target. Because each ray is 4 DOF and we have two rays in two coordinate systems, we can represent the 6 DOF of the grasp space; the other 2 DOF are manually set as the roll around the approach axis and the final standoff. We compute the gripper pose T_gripper^object by aligning the two rays together and making the directions point towards each other (Figure 4.9). Figure 4.10 shows how the rays {p_t, d_t} on
the target are sampled. We first start with sampling the surface of the object by casting rays
from a bounding box (or sphere) to the target object. Then for each hit point, we compute
the surface normal of the object. Using the surface normal, we can sample all directions that
are within a fixed angle between each normal. Depending on how dense the sampling is, it
can yield anywhere from 1,000 to 100,000 different locations to test grasps. This analysis is
very effective and can be applied to many different grippers as shown in Figure 4.11.
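For reference, the grasp-set generation described here is exposed through the OpenRAVE grasping database; a hedged usage sketch with the openravepy Python bindings follows. The scene file, target body name, and option values are assumptions for illustration rather than excerpts from this chapter.

from openravepy import Environment, databases

env = Environment()
env.Load('data/lab1.env.xml')                # example scene assumed to ship with OpenRAVE
robot = env.GetRobots()[0]
target = env.GetKinBody('mug1')              # hypothetical target body in that scene

gmodel = databases.grasping.GraspingModel(robot, target)
if not gmodel.load():
    gmodel.autogenerate()                    # samples approach rays, rolls, standoffs, preshapes

# Return a few grasps that are currently collision-free and reachable.
validgrasps, validindices = gmodel.computeValidGrasps(returnnum=5)
for grasp in validgrasps:
    print(gmodel.getGlobalGraspTransform(grasp, collisionfree=True))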
For the 4DOF Barrett Hand, we usually find that 30% of the tested grasps yield force
closure, so it is not necessary to explore the space much. In fact, we could get away with
setting just one approach ray {pg , dg } for the gripper. For the 1DOF HRP2 hand, the results
are much more grim with less than 3% of the explored grasps yielding force closure. We
attribute this with difficulties of 1DOF grippers with the grasp strategies. This requires
several approach rays {pg , dg } to be set for 1DOF grippers. Changing {pg , dg } and the
stand-off can have an immense effect on the type of grasps produced. Figure 4.12 shows
how changing directions can prioritize between pinch grasps and power grasps. Avoiding
certain parts of the gripper is another factor that should be considered when building grasp
sets. Sometimes sensors are attached to the gripper, and they need to be included in the gripper's geometry so that any contact with them can be detected and avoided.
One of the biggest criticisms against computing force closure on point contacts is that
fragile grasps can easily be classified as force closure stable grasps. Figure 4.13 shows several
fragile grasps with high force closure scores with the HRP2 hand. In order to prune such
grasps, many researchers have proposed adding noise to the grasp generation process and
computing force closure afterward. If the randomly displaced grasps fail with a certain
percentage, then they are rejected. Although such a method has been shown to work with
the Barrett Hand, where 30% of the grasps are in force closure [Berenson et al (2007)], it rejects all grasps for HRP2. The reason is that edge contacts, which occur when the gripper's surface is aligned with the target surface, are reduced to point contacts when even a small amount of noise is added.
Figure 4.12: Grasps for the HRP2 gripper with a wrist camera mounted. Grasps contacting the wrist camera are rejected automatically. As the gripper approach varies, the grasps change from power grasps to pinch grasps.
We present a new robustness measurement method that computes a grasp's repeatability rather than its stability when the target object experiences translational and rotational disturbances. A ∆T offset in the target will produce a ∆P offset for the final gripper position
along with a ∆Q joint offset. Using knowledge of the geometry of the object and kinematics
of the gripper, we can compute the standard deviations of the movement of each point on
the surface of the target σT and the gripper σG . The main intuition behind computing grasp
repeatability is that for fragile grasps, the gripper will slip and therefore move more than
the object has been disturbed:
σ_T ≪ σ_G.
For stable grasps, the object and gripper deviations should be the same:
σT ≈ σG .
Figure 4.13: Fragile grasps that have force closure in simulation (contacts shown).
Algorithm 4.1: ComputeGraspRepeatability(σT)
/* ρ ∈ [0, 1] - uniform random variable */
1  TG ← ∅; Q ← ∅
2  for iteration = 1 to N do
3    RotationContribution ← ρ σT
4    θ ← RotationContribution · MaxDistance(target)
     (the remaining steps sample a random pose perturbation of the target, re-execute the grasp, and record the resulting gripper transform and joint values)
The final gripper transformation of each perturbed grasp is stored along with its final joints into TG and Q. We compute the contribution from the gripper joints
by multiplying the standard deviation of the angle with the maximum distance of any child
link from the joint’s axis. These errors are then accumulated across the kinematic hierarchy
to yield σQ . We compute the final σG by adding σQ to the translational standard deviation
of the gripper positions. This value is compared with σT to yield the final decision for a
repeatable grasp. γ is set to 0.7 for all experiments and N is set to 40 when σT is set to 1cm.
When building grasp sets for the HRP2 hand, 228 out of 18,000 tests pass the force closure test. When applying the ComputeGraspRepeatability filter, about 171 grasps are left, with all fragile grasps pruned out. 75% of the force-closure grasps remain, showing that the method does not reject that many good grasps in comparison to previous robustness measurements. The computation time increases proportionally to N.
Table 4.3 shows the set of parameters necessary for generating force-closure grasp sets.
Parameter Name Parameter Description
Target Surface Sampling how to sample the target surface for approach rays
Robot Surface Sampling how to sample the robot surface for approach rays
Gripper Preshapes a set of initial joint values for the gripper
Rolls about Direction set of rolls of the robot around the approach axes
Standoffs set of standoffs from the target object
Friction friction between contact surfaces
Figure 4.14: Example caging grasp set generated for the Manus Hand.
for higher configuration spaces, the directions have to be sampled carefully. If the door can
move at least ∆θ along a specific direction without getting into collision with the gripper,
then there is no cage and the grasp is rejected. If M jittered configurations pass the
caging test, then we return success. We set M on the order of 20 and the number of trials
N for collision-free jittered configurations on the order of 100.
The initial seed grasp g caging the target object can be obtained by randomly exploring
the surface of the object until it passes the caging criteria. Efficiently searching the surface of an object requires that the samples be uniformly distributed and able to explore the space rapidly [LaValle (2006)].
By using caging grasps rather than grasps that fix the target object’s configuration
through contact force, we are able to provide the manipulator with significantly more flex-
ibility in accomplishing the task while still guaranteeing the object cannot escape from the
end effector’s control. One additional detail discussed in Section 3.4 is the contact grasp set
Gcontact . At the end of the plan, we need to localize the target object being caged by the
gripper. By always ending with a grasp that achieves the form closure condition, we can guarantee that the object is maintained at its desired final configuration.
The same analysis can be applied to a very simple example of generating and using grasp sets for a gripper with two magnetic tips.
magnetic tips. Figure 4.16 shows a manipulator with two magnets attached at its fingers
and the bracket that needs to be picked up. Since the magnets act as point contacts, grasps
are much more stable when both tips contact the part surface in a flat region such that noise
cannot upset the part’s relative transformation.
In order to generate the grasp tables, we first sample the target bracket surface for
possible approach directions similar to Figure 4.10. For every grasp tested, we gather the
contacts and test for these two things instead of force closure:
• the contact points' footprint is as big as the distance between the magnet tips,
• and all contact normals roughly point in the same direction.
We use ComputeGraspRepeatability to make sure that the magnet tips are contacting a wide area and not a small protrusion that is hard to hit when executing the grasp in the real world. Figure 4.16 shows the results of the grasp generation process. The red box on the left shows some of the bad grasps that are present in the tested sets. The green box on the right shows the final set of grasps, where each of the magnets contacts a wide, flat region. Using this grasp set, we can immediately apply the grasp planners to simulate a robot picking up parts from a bin and placing them at a destination (Figure 4.17).
Figure 4.17: Industrial robot using the magnet tip grasps to pick parts out of a bin.
• quickly pruning grasps when planning without base placement (Section 3.3),
• introducing biasing in configuration samplers to help explore regions where the arm is
more maneuverable (Section 3.6.3),
• computing the inverse reachability for base placement sampling (Section 4.4),
• and quickly pruning the visible sensor locations (Section 5.1).
We first need to sample a set of positions in a sphere in R3 around the arm. Because computation is proportional to the volume of the sphere, we use the following analysis to find the smallest sphere encompassing the end-effector's possible reach. Every joint has an anchor a_i and a joint axis d_i. We compute the closest point p_i on {a_i, d_i} to the next consecutive joint axis on the chain, {a_{i+1}, d_{i+1}}, using:
We set the last point p_n as the tip of the end effector used for inverse kinematics. Finally, we set the sphere center to p_0 and the radius to Σ_i |p_{i+1} − p_i|. For prismatic joints, we add the greatest joint limit to the radius.
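The closest-point formula referenced above is not reproduced here, so the following numpy sketch uses the standard line-line closest-point computation to build the sphere center p_0 and radius; helper names are illustrative.

import numpy as np

def closest_point_on_line(a0, d0, a1, d1, eps=1e-9):
    """Closest point on line {a0, d0} to line {a1, d1} (unit-length directions)."""
    w = a0 - a1
    b = np.dot(d0, d1)
    denom = 1.0 - b * b
    if denom < eps:                      # parallel axes: any point works, use a0
        return a0
    t = (b * np.dot(d1, w) - np.dot(d0, w)) / denom
    return a0 + t * d0

def arm_bounding_sphere(anchors, axes, ee_tip):
    pts = [closest_point_on_line(anchors[i], axes[i], anchors[i + 1], axes[i + 1])
           for i in range(len(anchors) - 1)]
    pts.append(np.asarray(ee_tip))       # last point is the end-effector tip
    radius = sum(np.linalg.norm(pts[i + 1] - pts[i]) for i in range(len(pts) - 1))
    return pts[0], radius                # sphere center p0 and radius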
Generate KinematicReachability() (Algorithm 4.3) takes the desired resolution of
the translation and rotations samples and produces a list of poses in SE(3) with their solution
density. Uniformly sampling SE(3) with low discrepancy and dispersion is very difficult;
furthermore, it is not desirable to introduce any random components with database building
since random sampling on average has much worse dispersion than informed deterministic
sampling [LaValle (2006)]. Fortunately, there exist good deterministic samplers for R3 and
SO(3), which are covered in Section 4.3.1. The cross product of the independent translation
Algorithm 4.3: Generate KinematicReachability(arm,∆pR , ∆θR )
1 {pi } ← ComputeJointIntersections()
2 Radius ← PrismaticLimits + |∆pR| + Σ_i |p_{i+1} − p_i|
3 {ti } ← {t | t ∈ Samples(R3 , ∆pR ), |t − p0 | ≤ Radius}
4 {ri } ← Samples(SO(3),∆θR )
5 Reachability ← ∅
6 for {r, t} ← {ri } × {ti } do
7 valid ← 0
8 for qarm ∈ IK({r, t}) do
9 SetConfiguration(arm,qarm )
10 if not IsLinkCollision(GetAttachedLinks(arm)) then
11 valid ← valid + 1
12 end
13 Append(Reachability,{r, t, valid})
14 end
and rotation samples yield a final sampling of SE(3). For every end-effector pose {r, t}, we
count the number of inverse kinematics solutions that are not colliding with the attached
links of the arm and create the final list. Figure 4.18 shows the projected reachability volume
generated for two different robots evaluated with ∆θR = 0.25 and ∆pR = (0.04, 0.04, 0.04). The number of SO(3) samples is 576 and the number of R3 samples is 4π Radius^3 / (3 ∆p_{R,x} ∆p_{R,y} ∆p_{R,z}).
Each of the operations using the reachability space requires efficient nearest neighbor
retrieval given an end effector pose. We store each rotation r ∈ SO(3) as a quaternion. If
two quaternions are close to each other and on the same hemisphere, then we can define a
simple distance metric between two poses:
(4.52)  δ({r_1, t_1}, {r_2, t_2}) = \sqrt{ \text{const} \cdot \min(|r_1 - r_2|^2, |r_1 + r_2|^2) + |t_1 - t_2|^2 }
where the Euclidean distance for close quaternions is close to the Haar measure on SO(3). Because most kd-tree implementations work in Euclidean space, we add both {r, t} and {−r, t} to simultaneously represent the two hemispheres. The nearest neighbors implementation requires a little book-keeping in order not to count a rotation twice. The final implementation of KR (Algorithm 4.4) uses kd-trees and is reminiscent of kernel density evaluation. Intersection between a set of poses and the reachability is performed by evaluating KR for each of the poses and thresholding the density of the valid solutions. Because the nearest neighbor query is a tree search, intersection with a set of M poses is on the order of O(M log |Reachability|) time.
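A minimal sketch of this nearest-neighbor scheme, assuming scipy's kd-tree: both {r, t} and {−r, t} are inserted so the hemisphere-aware metric of Equation 4.52 reduces to a plain Euclidean query, with book-keeping to avoid counting a rotation twice. The weighting constant is illustrative.

import numpy as np
from scipy.spatial import cKDTree

QUAT_WEIGHT = 0.2   # the "const" of Equation 4.52, trading rotation vs. translation distance

def build_pose_index(rotations, translations):
    """rotations: Nx4 unit quaternions, translations: Nx3."""
    pts = np.hstack([QUAT_WEIGHT * rotations, translations])
    # Insert both {r, t} and {-r, t} so min(|r1-r2|, |r1+r2|) becomes a Euclidean query;
    # remember the original index for deduplication.
    pts = np.vstack([pts, np.hstack([-QUAT_WEIGHT * rotations, translations])])
    index = np.tile(np.arange(len(rotations)), 2)
    return cKDTree(pts), index

def query_poses(tree, index, r, t, radius):
    q = np.concatenate([QUAT_WEIGHT * r, t])
    hits = tree.query_ball_point(q, radius)
    # A pose may be found through both hemispheres; deduplicate.
    return sorted(set(index[i] for i in hits))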
Algorithm 4.4: KR(r, t)
1  valid ← 0
2  ω ← 0
3  ∆R ← sqrt(c ∆θR^2 + |∆pR|^2)
4  for {r_i, t_i, valid_i} ∈ NearestNeighborsRadius(Reachability, {r, t}, 2∆R) do
5    ω_i ← exp(−δ({r_i, t_i}, {r, t})^2 / (0.5 ∆R^2))
Figure 4.19 shows a comparison of the reachability spaces of the left arm for a humanoid robot. If the arm includes one of the chest joints in its definition, then the reachability space becomes spatially much bigger and about 1.5 times more dense. The wrist has a differential drive mechanism that produces an unusual joint-limit range for the two motors. By considering the full range of the joint limits, the reachability density in R3 increases twofold. Given the small reachability volume of HRP2 compared to the robots in Figure 4.18, it is vital to use the entire robot free space.
Table 4.4 describes all the parameters necessary for generating the reachability.
fact that SO(3) ≅ S^1 ⊗ S^2 and represent each rotation with Hopf coordinates {θ, φ, ψ}, where {θ, φ} ∈ S^2 and ψ ∈ S^1:

quaternion = \begin{pmatrix} \cos\frac{\theta}{2}\cos\frac{\psi}{2} \\ \cos\frac{\theta}{2}\sin\frac{\psi}{2} \\ \sin\frac{\theta}{2}\cos\!\left(\phi + \frac{\psi}{2}\right) \\ \sin\frac{\theta}{2}\sin\!\left(\phi + \frac{\psi}{2}\right) \end{pmatrix}.
The problem reduces to uniformly sampling S¹ and S², and then taking their cross product
to form the full set of Hopf rotations. The HEALPix algorithm [Gorski et al (2005)] is used for
sampling a multi-resolution grid of the 2-sphere S². In order to maintain low discrepancy,
the method in [Yershova et al (2009)] subdivides each dimension by 2 from the previous
level. This yields a total of m₁ m₂ 8^L samples at resolution level L, where m₁ is the initial
number of subdivisions of S¹, set to 6, and m₂ is the initial number of subdivisions of S², set to 12.
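The conversion from Hopf coordinates to a quaternion can be illustrated with a short Python sketch; note that we substitute a naive latitude-longitude grid for HEALPix here, so the S² samples below are only for illustration and lack the low-discrepancy property of the actual database generation:

```python
import numpy as np

def hopf_to_quaternion(theta, phi, psi):
    """Convert Hopf coordinates {theta, phi} in S^2 and psi in S^1 to a unit quaternion."""
    return np.array([
        np.cos(theta / 2) * np.cos(psi / 2),
        np.cos(theta / 2) * np.sin(psi / 2),
        np.sin(theta / 2) * np.cos(phi + psi / 2),
        np.sin(theta / 2) * np.sin(phi + psi / 2),
    ])

def sample_so3_grid(n_psi=6, n_theta=4, n_phi=8):
    """Cross product of an S^1 grid and a (naive) S^2 grid; HEALPix would replace
    the theta/phi loops in the actual database generation."""
    samples = []
    for psi in np.linspace(0, 2 * np.pi, n_psi, endpoint=False):
        for theta in np.arccos(np.linspace(-1, 1, n_theta)):  # area-uniform latitudes
            for phi in np.linspace(0, 2 * np.pi, n_phi, endpoint=False):
                samples.append(hopf_to_quaternion(theta, phi, psi))
    return np.asarray(samples)
```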
To sample R3 , we use a lattice where the three axes are prime with respect to the field
of fractions on the integers Z [Matousek (1999)]. Assuming we need to generate N samples
inside a cube whose sides are in [0, 1], the ith sample is generated by:
(4.53)   ( i/N,  {0.5 + i √5 / 2},  {0.5 + i √13} )
where {x} denotes the fractional part of x. Using this sampler, the number of samples
required for the average distance between neighbors to be 0.04 is N=4733.
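A minimal sketch of this lattice sampler, following Equation 4.53, could look like the following (the scaling of the unit cube to the reachability sphere is left to the caller):

```python
import numpy as np

def lattice_samples_unit_cube(N):
    """Lattice of Equation 4.53: the i-th sample in the unit cube is
    (i/N, {0.5 + i*sqrt(5)/2}, {0.5 + i*sqrt(13)}), where {x} is the fractional part."""
    i = np.arange(N)
    return np.column_stack([
        i / N,
        np.mod(0.5 + i * np.sqrt(5) / 2, 1.0),
        np.mod(0.5 + i * np.sqrt(13), 1.0),
    ])

# For the reachability database, N is chosen (N = 4733 in the text) so that
# neighbors are about 0.04 apart after scaling the cube to the sphere of radius Radius.
samples = lattice_samples_unit_cube(4733)
```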
where ∗ denotes the action of applying one rotation after another. Similarly, the set of
planar transformations Pz can be defined as the poses composed of a rotation about the
z-axis and a translation in the XY plane.
For clarity, we denote by p(r, t) ∈ SE(3) the pose with rotation r ∈ SO(3) and translation
t ∈ R³. It is equivalent to the transformation notation T used so far, so the two are used
interchangeably. Given an arbitrary pose p ∈ SE(3), its equivalence class with respect to
2D plane movements becomes:
(4.57) p Pz = {pz ∗ p | pz ∈ Pz } .
We compute the equivalence classes for the entire set X by sampling a pose pi ∈ X , and
extracting all similar poses to form a discrete set Xi :
(4.58) Xi = {p | p ∈ X , p ∈ pi Pz } .
The entire process is repeated with the remaining poses X − Xi until we have a set of
equivalence sets {Xi}. Each extracted equivalence set Xi is essentially indexed by an out-of-plane
rotation rµ (modulo rotations about z) and a distance zµ from the 2D plane, which is
3 degrees of freedom in total. For each Xi, we extract {rµ, zµ} and convert each of the poses
in SE(3) into poses in SE(2). The 2D poses are then inverted and stored into the inverse
reachability map indexed by {rµ , zµ }.
Every equivalence class p Pz has a representative element that has no translation along
XY and no rotation about Z. ProjectPose(p) (Algorithm 4.5) shows how a pose gets projected
onto the representative element of its class: the in-plane rotation ψ is removed from the
quaternion to yield r0, the translation is reduced to t0 ← (0, 0, tz), and the algorithm
returns {r0, t0, ψ, tx, ty}.
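Since only a fragment of Algorithm 4.5 survives above, the following Python sketch reconstructs the intent using a swing-twist decomposition of the quaternion; the exact formulas may differ from the thesis version:

```python
import numpy as np

def project_pose(q, t):
    """Project a pose (unit quaternion q = (w, x, y, z), translation t) onto the
    representative element of its equivalence class under 2D plane motions:
    remove the rotation about z and the XY translation, and return them separately."""
    w, x, y, z = q
    psi = 2.0 * np.arctan2(z, w)                 # in-plane rotation (twist about z)
    # quaternion of the inverse z-rotation, applied on the left to cancel the twist
    cz, sz = np.cos(-psi / 2), np.sin(-psi / 2)
    r0 = np.array([cz * w - sz * z,
                   cz * x - sz * y,
                   cz * y + sz * x,
                   cz * z + sz * w])
    t0 = np.array([0.0, 0.0, t[2]])
    return r0, t0, psi, t[0], t[1]
```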
Figure 4.20 shows some of the 2D poses in single equivalence sets. The HRP2 8 DOF
reachability space has 200,000 poses and 576 equivalence sets were extracted from it using
∆θI = 0.15 and ∆zI = 0.05. As a comparison, the 7 DOF arm has 547 sets, which shows
that regardless of how complex the arm chain gets, the complexity of the inverse reachability
map is mostly governed by the thresholds for set membership. Figure 4.21 shows the base
placement distributions for several equivalence classes where the gripper of the robot is
grasping the target object.

Figure 4.21: Base placement distributions for achieving specific gripper locations; darker colors
indicate more in-plane rotations.
The inverse reachability maps can be efficiently sampled by first indexing into the equiva-
lence sets and then sampling the underlying probability distribution represented by a Mixture
of Gaussians. Index InverseReachability (Algorithm 4.7) uses the standard deviation
of the elements in each equivalence set to return the set with the highest probability of
including the pose. Because ProjectPose removes all 2D base contributions, it is possi-
ble to compute the distance to the equivalence sets with a weighted Euclidean norm. The
computation is further sped up by organizing Xall into a kd-tree.
with a valid solution. We use Gaussian kernels with bandwidths that take into account the
discretization of the reachability generation process: the in-plane rotation bandwidth is
∆θR · ∆zI/∆θI and the XY translation bandwidth is (∆pR,x, ∆pR,y). Intuitively the factors
should be proportional to the underlying reachability discretization; furthermore, the rotation
and translation have to be normalized with respect to each other with the ∆zI/∆θI factor.
The distribution of
poses in X2D assumes that the gripper is at the origin of the coordinate system. In order to
transform this into the world, we transform every sampled pose in X2D by the 2D translation
and in-plane rotation extracted by the gripper. Because we were working in the robot base’s
coordinate system, we have to transform back to the world coordinate system to yield the
final qbase configuration. ∗ is the transformation operator on 2D poses. Note that we are
not transforming by the full pose T^world_base since we have already indexed an equivalence
set that takes its height and out-of-plane rotations into account.
In the analyses above, each pose in the reachability set X is treated with equal weight,
which makes the explanations easier. In practice, however, the inverse kinematics solver and
self-collision checking allow certain end-effector poses to have many more configurations than
others. When building the inverse reachability maps, we keep track of the count of inverse
kinematics solutions of each original pose in X. This information propagates down into the
2D pose sets of each equivalence set X2D, and in the end we end up with a weighted
mixture-of-Gaussians distribution. When considering the full computational cost of sampling inverse
reachability, empirical results show that we can generate feasible base placements 2.5
times faster than uniform random sampling.
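A hedged sketch of how an equivalence set's weighted mixture of Gaussians can be sampled is given below; the function and parameter names are ours, and per the text the rotation bandwidth would be ∆θR·∆zI/∆θI and the translation bandwidth (∆pR,x, ∆pR,y):

```python
import numpy as np

def sample_base_pose(poses2d, weights, bw_rot, bw_xy, rng=None):
    """Sketch of sampling a 2D base pose (x, y, yaw) from the weighted
    mixture of Gaussians built over one equivalence set. `poses2d` is an (N, 3)
    array of kernel centers, `weights` the per-pose IK-solution counts."""
    rng = np.random.default_rng(rng)
    p = np.asarray(weights, float)
    p /= p.sum()
    center = poses2d[rng.choice(len(poses2d), p=p)]        # pick a kernel by weight
    xy = center[:2] + rng.normal(scale=bw_xy, size=2)      # translation bandwidth
    yaw = center[2] + rng.normal(scale=bw_rot)             # in-plane rotation bandwidth
    return np.array([xy[0], xy[1], yaw])

# e.g. bw_rot = dtheta_R * dz_I / dtheta_I, bw_xy = np.array([dp_Rx, dp_Ry])
```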
Table 4.5 shows the parameters necessary to generate the inverse reachability maps.
Figure 4.22: The grasp reachability map using validated grasps to compute the base distribution
(overlaid as blue and red densities). The transparency is proportional to the number of rotations.

The grasp reachability sampler validates the set of possible grasps and uses them to aggregate
the inverse reachability maps into a mixture of Gaussians. Sample GraspReachability
(Algorithm 4.9) shows a naive implementation of the process necessary to sample a
configuration. In the beginning, all the valid grasps are computed and each grasp's weight is
computed from the number of poses contained inside its associated 2D pose map. Because the
inverse reachability maps are mixtures of Gaussians, the weights are additive, so we can first
gather all the equivalence sets of all the different grasps and evaluate a weight for each grasp
depending on the number of basis vectors in its distribution. We can sample from the entire
distribution by first choosing a grasp based on its weight. Once a grasp is sampled, we just
call into the inverse reachability map sampler using Algorithm 4.8.
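The following Python sketch mirrors the structure described above (grasp validation, weighting by the size of each grasp's 2D pose map, weighted grasp selection, and deferral to the inverse reachability sampler); the inverse_reachability interface and the grasp_transform attribute are assumptions made for illustration:

```python
import numpy as np

def sample_grasp_reachability(grasps, inverse_reachability, target_pose, rng=None):
    """Sketch of Sample GraspReachability. `inverse_reachability(grasp_pose)` is assumed
    to return (equivalence_set, sampler), where sampler() draws a base pose from that
    grasp's mixture of Gaussians (Algorithm 4.8)."""
    rng = np.random.default_rng(rng)
    candidates = []
    for g in grasps:
        eq_set, sampler = inverse_reachability(target_pose @ g.grasp_transform)
        if eq_set is not None:
            # weight of the grasp is the number of 2D poses in its equivalence set
            candidates.append((len(eq_set), sampler))
    if not candidates:
        return None
    weights = np.array([w for w, _ in candidates], float)
    weights /= weights.sum()
    _, sampler = candidates[rng.choice(len(candidates), p=weights)]
    return sampler()
```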
In practice, we progressively build up the distribution and cache previous computations
as the sampler is called again, which allows a new sample to be computed on the order
of milliseconds while taking less than 0.2s to initialize the distribution. Besides sampling,
there are many other practical uses for grasp reachability like performing workspace analysis.
Whenever a task has to be performed by a robot in a factory, engineers have to carefully
consider the robot kinematics, robot placement, part trajectories, and surrounding obstacles
and safety measures. Engineers have to pick the cheapest and most robust solution. By
explicitly modeling the workspace of the robot in the context of manipulating parts, it
becomes possible to choose the cheapest robot that can achieve the task and to minimize
the workspace. Once a workspace heuristic based on the current task has been developed,
it is possible to turn the workspace design problem into a robot kinematics design problem
[III and Shimada (2009)].

Table 4.6 shows the parameters necessary to sample from grasp reachability.

Table 4.6: Parameters for sampling grasp reachability.
Parameter Name       Parameter Description
Validity Threshold   threshold of the distribution to accept grasps
of convex hulls can lead to many reductions in the complexity of the math. Figure 4.23
shows convex decompositions for three robots using the library from [Ratcliff (2006)].

Figure 4.23: Examples of convex decompositions for the geometry of the robots.
One operation commonly performed for planning is to pad the collision meshes by
5mm-10mm so that the robot can keep a safety margin around environment obstacles.
Adding padding to guarantee a distance δ introduces cylinders for every edge and spheres
for every vertex in the collision mesh, which can be slow to handle. In order to compute
a padded convex hull, we instead add a plane for every edge and a plane for every vertex.
The normal of each new plane is the average of the normals of the adjacent planes, with each
plane passing through its respective edge or vertex. Then all the planes are pushed back by
δ along their normals:
(4.60)   ∀ P ∈ Π ∪ Πedge ∪ Πvertex :   P · (x, 1)ᵀ ≥ −δ.
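As a simplified illustration of plane-based padding, the sketch below pushes every face plane of a convex hull outward by δ using SciPy's halfspace intersection; unlike Equation 4.60 it omits the extra edge and vertex planes, so it is only a looser approximation of the thesis construction:

```python
import numpy as np
from scipy.spatial import ConvexHull, HalfspaceIntersection

def pad_convex_hull(points, delta):
    """Push every face plane of a convex hull outward by `delta` and return the
    vertices of the padded hull."""
    hull = ConvexHull(points)
    # hull.equations rows are [n, b] with n.x + b <= 0 and |n| = 1 (outward normals)
    padded = hull.equations.copy()
    padded[:, -1] -= delta
    interior = points.mean(axis=0)              # a point strictly inside the hull
    hs = HalfspaceIntersection(padded, interior)
    return hs.intersections

# Example: pad the unit cube's hull by 5 mm
verts = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
padded_vertices = pad_convex_hull(verts, 0.005)
```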
Figure 4.25: Convex decomposition makes it possible to accurately prune laser points from the
known geometry.
Figure 4.24 shows the final padded convex hull. It should be noted that convex decomposition
meshes cannot be used for self-collision tests. Self-collisions are a special case and should
be checked with the original meshes; otherwise the robot will initially start in self-collision
and will never be able to move. Although hard self-collision checking with convex hulls is not
recommended, it is possible to generate a continuous gradient proximity distance [Escande
et al (2007)]. Such proximity distances are commonly used to apply safety torques to the
robot joints inside the controller's real-time loop. The proximity distance can also be used to
prioritize configurations when sampling the configuration space.
where the most difficult question is how to discretize the [0, 1] interval so that the least
number of checks are made. Many space decomposition methods have been proposed for
efficient line collision checking, but the simplest and fastest one is to discretize the interval
with a step size ∆s using a distance metric δ(·, ·). LineCollisionDetection (Algorithm
4.10) shows a naive implementation. First a ∆q is computed using a subtraction that respects
the 0 and 2π identification for continuous joints. For every step, we find a new configuration
along ∆q such that the distance from the current configuration is exactly ∆s. Since we
are assuming a triangle inequality, we can test whether we have surpassed the line segment and
terminate the collision check.
The problem is finding a distance metric that can reduce samples on line collisions while
helping RRTs sample the space of end-effector positions more uniformly. Even without
considering a particular family of distance metrics, we can scale each coordinate of the
configuration space before feeding it into the distance metric δ.
The most natural scaling should be proportional to the importance of each degree of
freedom in moving the geometry of the robot. For the same physical distances, more im-
portant DOFs should have a larger distance than less important DOFs. One measure of the
importance of a joint is the swept volume of all the robot links that depend on that DOF,
as shown in Figure 4.26. Exactly computing the swept volume is a very difficult
problem [Kim et al (2003)], so we use a discretization technique: first discretely sampling
a set of points on the convex decomposition, rotating them around each joint's axis over its
range, using kernel density estimation to get the point density of the volume they span, and
finally thresholding the density to get a volume. Figure 4.26 shows the volumes generated
using this simple but powerful technique.
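A minimal sketch of this discretization, assuming a full revolution about the joint axis and an untuned density threshold, is shown below:

```python
import numpy as np
from scipy.stats import gaussian_kde

def swept_volume_points(points, axis, anchor, n_steps=36):
    """Approximate a swept volume: rotate sampled surface points of the dependent
    links around a joint axis, estimate the point density with a KDE, and threshold
    the density over a coarse grid to get an occupancy volume."""
    axis = np.asarray(axis, float) / np.linalg.norm(axis)
    rel = np.asarray(points, float) - anchor
    swept = []
    for angle in np.linspace(0.0, 2 * np.pi, n_steps, endpoint=False):
        c, s = np.cos(angle), np.sin(angle)
        # Rodrigues rotation of every point about the joint axis
        rot = (rel * c + np.cross(axis, rel) * s
               + np.outer(rel @ axis, axis) * (1 - c))
        swept.append(rot + anchor)
    swept = np.vstack(swept)
    kde = gaussian_kde(swept.T)
    lo, hi = swept.min(axis=0), swept.max(axis=0)
    grid = np.stack(np.meshgrid(*[np.linspace(l, h, 24) for l, h in zip(lo, hi)],
                                indexing="ij"), axis=-1).reshape(-1, 3)
    density = kde(grid.T)
    occupied = density > 0.1 * density.max()    # threshold is a tunable parameter
    return swept, grid[occupied]
```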
Algorithm 4.10: LineCollisionDetection(q0, q1)
1  q ← q0
2  ∆q ← q1 ⊖ q0   /* subtraction respecting the 0, 2π identification */
3  if q ∉ Cfree then
4      return true
5  while true do
6      q ← q + t∆q   s.t.   t > 0 ∧ δ(q, q + t∆q) = ∆s
7      if δ(q0, q) ≥ δ(q0, q1) then
           /* Passed the end, so check the final point */
8          return q1 ∉ Cfree
9      if q ∉ Cfree then
10         return true
11 end
12 return false
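A runnable rendition of Algorithm 4.10 is sketched below; for brevity it assumes a translation-invariant metric and omits the 0/2π wraparound handling for continuous joints:

```python
import numpy as np

def line_collision(q0, q1, in_collision, dist, step=0.01):
    """Check a straight configuration-space segment for collisions.
    `in_collision(q)` tests one configuration, `dist(a, b)` is the weighted metric."""
    q0, q1 = np.asarray(q0, float), np.asarray(q1, float)
    if in_collision(q0):
        return True
    dq = q1 - q0
    total = dist(q0, q1)
    step_dir = dq / max(dist(q0, q0 + dq), 1e-12)   # unit step in metric units
    q = q0.copy()
    while True:
        q = q + step * step_dir
        if dist(q0, q) >= total:
            # passed the end of the segment, so check the final point
            return in_collision(q1)
        if in_collision(q):
            return True
```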
Figure 4.26: The swept volumes of each joint, where the parent joint sweeps the volume of its
children’s joints and its link.
Intuitively, the base joints should have more weight because of their larger impact on the
end effector than joints farther in the chain. We compute the distance metric weights for
the ith joint using:
(4.62)   ωi = ( CrossSection(SVi) · Σ_{j ∈ DependentLinks_i} LVj )^(1/3)
where SVi is the swept volume of the joint axis, and LVj are the individual link volumes.
Using just the area of the cross section is misleading since it measures all possible locations
of the link positions, and does not consider where the links currently are. This could greatly
skew the contribution of each joint, so we also multiply by the sum of the volumes of all
links that are dependent on the ith joint. Finally, we take the cube root since we are dealing
with a weight on a single axis but using values computed from sums over three dimensions.
Table 4.7 shows the volumes, cross sections, and final weights for each of the joints on the
7 DOF WAM and 6 DOF Puma robots.

                          j0       j1       j2       j3       j4        j5        j6
WAM Volume (m³)           4.346    2.536    0.781    0.395    0.143     0.110     0.043
WAM Cross Section (m²)    0.2668   0.0306   0.0297   0.0035   0.0023    0.0004    0.00039
WAM Σ LVj (m³)            0.0066   0.0052   0.0050   0.0039   0.00364   0.00361   0.00357
WAM Weights ωi            0.121    0.054    0.053    0.024    0.020     0.012     0.011
Puma Weights ωi           0.133    0.090    0.015    0.006    0.002     0.001     -
Table 4.7: Statistics and final distance metric weights for the Barrett WAM joints.

Figure 4.27: Example scenes used to test the effectiveness of the configuration metric.
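The weight computation of Equation 4.62 is simple enough to state directly in code; the sketch below reproduces the first WAM entry of Table 4.7 as a sanity check:

```python
import numpy as np

def joint_weights(cross_sections, link_volumes, dependents):
    """Equation 4.62: the weight of joint i is the cube root of its swept-volume
    cross section times the summed volumes of the links that depend on joint i.
    `dependents[i]` lists the link indices moved by joint i."""
    weights = []
    for i, cs in enumerate(cross_sections):
        lv = sum(link_volumes[j] for j in dependents[i])
        weights.append((cs * lv) ** (1.0 / 3.0))
    return np.asarray(weights)

# First WAM entry of Table 4.7: (0.2668 * 0.0066) ** (1/3) is approximately 0.121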
Table 4.9: Parameters related to using convex decompositions.
Parameter Name    Parameter Description
Collision δ       padding for the surfaces of each convex hull
Range Data δ      padding for removing colliding range data
Discretization    used for approximating the volume of a convex hull for swept volumes

In order to test the effectiveness of the weights, we compute the average time to compute
a path on two scenes (Figure 4.27) when using both uniform weights and ωi. The distance
metric is weighted euclidean distance with 0 and 2π identified for continuous rotation joints.
For purposes of fair comparison, we keep the planning step size the same for all tests:
∆s = 0.01. Table 4.8 shows that the planning times are 0.5-0.8 times the original planning
times. It should be noted that the ωi weights allow much larger step sizes and therefore
faster times, since the maximum end-effector movement for a given ∆s = 0.01 is smaller and
more consistent than with uniform weights.
Table 4.9 shows the parameters related to using convex decompositions.
(4.63)   δ( (T^robot_link(i))⁻¹ T^robot_link(0) ,  (T^camera_object(i))⁻¹ T^camera_object(0) ) = 0
where δ is a distance metric on poses. The error can be used to scale the confidence
values of the detection algorithm. Because no vision algorithm is perfect, there is always
a chance it makes a mistake in the detection process or computes an inaccurate pose. We
combine three different measures to quantify how much a detected pose can be trusted.
Figure 4.29: The probability density of the detection extents of several textured objects using a
SIFT-based 2D/3D point correspondence pose detection algorithm.

• The first is the vision algorithm's confidence γvision: every algorithm should be able to
return a value indicating how well its internal template fits the image.

• The second is the neighbor density around the extents map, γstability. Computing this is
a little tricky because it depends on the underlying pose sampling distribution. If
using a robot, we can control this density; otherwise it is very difficult. Therefore, for
every point we gather all the points within a ball of radius r and compute how well
the entire ball is filled (a small sketch of this measure is given below). Such an operation
is easily achieved using nearest-neighbor pruning and counting (Figure 4.28 middle). A
stable region means that the camera or object can move slightly and still be guaranteed
to be detected.

• The third is based on the expected error from Equation 4.63. It is converted into a
confidence by modeling the probability as a Gaussian.
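A minimal sketch of the neighbor-density measure from the second item is given below; the expected_density normalization used to turn raw neighbor counts into a [0, 1] confidence is our assumption:

```python
import numpy as np
from scipy.spatial import cKDTree

def stability_confidence(extent_points, radius, expected_density):
    """For every detection-extent sample, count the neighbors inside a ball of
    radius r and compare against how many samples a fully filled ball would contain.
    `expected_density` is samples per unit volume of the extent parameterization."""
    pts = np.asarray(extent_points, float)
    tree = cKDTree(pts)
    counts = np.array([len(tree.query_ball_point(p, radius)) - 1 for p in pts])
    full_ball = expected_density * (4.0 / 3.0) * np.pi * radius ** 3
    return np.clip(counts / max(full_ball, 1.0), 0.0, 1.0)
```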
We use γ to prioritize the sampling of the camera extents covered in Section 5.1. Figure 4.29
shows the detectability extents of several objects using a SIFT image-feature-based plane
detector. Just like with inverse reachability, we can use the detection extents to help with
the design of industrial bin-picking workspaces. For example, the maximum distance at which
an object can be detected can be used to optimize the robot location with respect to the bin
containing the target objects.
Table 4.10 shows the parameters related to generating detectability extents.
4.8 Discussion
The planning knowledge-base is an important concept for helping achieve automatic con-
struction of manipulation programs. In this chapter, we divided the knowledge required
beyond the robot and task specifications into seven different categories. Our formulation
of each of these categories rests upon a vast amount of fundamental research by previous
researchers; however, one of the most important contributions here is tying these categories
together and focusing on the automated generation of each component.
We presented details and algorithmic descriptions beyond the fundamental theories to help
understand these components in the context of applying them to real robot scenarios.
Computing fast and numerically stable analytical solutions to the inverse kinematics
problem has been a classic problem with many mathematical solutions proposed. However,
treating it as a search problem allows ikfast to explore the entire solution space to find
the quickest and most robust method. Although there exist robot kinematics that ikfast
cannot solve yet, the kinematics of most robots used in the world today can be solved for.
Assuming the existence of analytic inverse kinematics solutions can also help shift the focus
of planning algorithms to use them rather than to assume no knowledge of the problem. We
believe that ikfast will become a standard tool for future robotics research.
Grasping is one of the fundamental distinctions between manipulation planning and mo-
tion planning. In order to reason efficiently about grasps, we motivated a discrete grasp
set approach to modeling the space. We presented three different methods that fall into
this definition, and in each of them showed how to parameterize and discretize the space.
Furthermore, we presented several algorithms to remove grasps that can be prone to slipping.
Considering arm reachability is one effective way of relating the configuration space of
the robot to the workspace. Although the concept of reachability has been around for some
time in the motion planning community, the parameterizations and usages differ quite
radically. In this chapter we stuck with the most flexible representation of the reachability
by using 6D maps. We presented several algorithms for efficiently generating, sampling, and
inverting the reachability space. Inverse reachability opens up a lot of possibilities in the
context of base placement planning. From this analysis emerged a new concept called grasp
reachability that allows us to explicitly relate a probability distribution on the robot base
with a target object to be grasped. We believe such analyses will become very important in
automating the design of industrial scenarios.
We argue that convex decompositions are the most efficient way to represent body ge-
ometry. They can accurately model any geometry of the robot, and using them for collision
and proximity detection is very straightforward due to their SAT-nature. We showed how
convex decompositions are used for padding and processing range data. Furthermore, they
allow efficient swept volume computation of the robot links. We presented a distance metric
that computes weights for the joints based on the amount of geometry each joint can affect.
We discussed several real-world issues with using convex decompositions like using them only
for environment collisions and not self-collisions. Our hope is that all these issues convince
every planning library developer to natively support and use convex decompositions.
Explicitly modeling the detectability extents of target objects allows us to methodically
analyze the information space offered by a vision algorithm. We can determine where a
camera should be moved in the workspace without relying on a human to second-guess and
teach the robot these heuristics. Furthermore, we presented a confidence measure map that
helps prioritize the sampling of the camera locations. Usage of detectability extents is one of
the pillars and major contributions of this thesis and planning with them will be discussed
in Chapter 5.
Rather than individually treating these components in the planning knowledge-base, we
argue the necessity of the information dependency graph in Figure 4.1. By explicitly keeping
track of the information being used for generation of the model, we can split up real domain
knowledge specified by a person from knowledge that can be auto-generated. It is true that
some of the generation processes are controlled by thresholds and discretization values.
We offer some insight in how to automatically set these values, but this type of second-order
automation still requires a lot of research. Some database components take a really long time
to generate, so it is inefficient to have to rebuild the entire planning-knowledge base whenever
a small modification is made to the robot or target object. Fortunately, the dependencies
allow keeping track of changes and updating only those components that are affected. In
Section A.1.5, we cover one way of indexing changes. In the future, the dependency graph
can form the basis for a global database of robots and tasks.
We explicitly state the parameters each component uses for its generation in Tables 4.2,
4.3, 4.4, 4.5, 4.6, 4.9, and 4.10. Although we have discussed ways of automatically setting
each of the parameters, their optimal values can depend on the precision requirements of
the task, the repeatability of the robot, and the units the entire environment is defined in.
Because the specifications mentioned in Section 2.1 do not encode such information, the
setting of the values cannot be truly automated. For example, in this thesis, all robots are
defined in meters and are human-sized. Such knowledge would allow us to set the reachability
discretization to 0.04 meters without sacrificing too much precision, since we can argue that
a robot's behavior should not change much if the target object moved by 0.04 meters. Future
improvements to the robot and task specifications could define a few parameters that would
allow an automatic analysis of error propagation versus speed of retrieval. Such analyses would
allow a user to tweak the accuracy or responsiveness of the system without worrying about the
details.
Chapter 5
We present a system that introduces an efficient object information gathering stage into
the planning process. Specifically, we concentrate on how cameras attached to the robot
can be used to compute an accurate target object pose before attempting to plan a grasp
for the object. It is common for robot systems to treat grasp planning independent from
the sensing capabilities of the robot, which results in a separation of the perceptive and
planning capabilities that can lead to gross failures during robot execution. For example,
a robot that sees an object from far away should not attempt to plan a complete path
with grasps involved at that point; an early commitment of a global plan when environment
information is inaccurate can lead to failures in the later execution stages. Instead, the robot
should plan to get a better view of the object. Thus, it is important to set up the planning
process to prioritize gathering information about the target object before prematurely wasting
computation committing to a grasp plan.
There are many methods of handling sensor uncertainty in the planning phase depending
on the level of expected noise in the system. For example, really noisy sensors that frequently
return false positives require the planner to keep track of object measurements using EKF
filters [McMillen et al (2005)]. Furthermore, planning in dynamic environments requires the
robot to keep track of what regions are stale due to no sensor presence, what regions are
being actively updated, and where the robot is with respect to the map [Rybski and Veloso
(2009)]. The fewer assumptions made on the quality of sensing data and the predictability of
the environment, the tighter the planning and sensor feedback loops have to be,
which makes the manipulation problem very difficult. By focusing on industrial settings and
mostly static environments as outlined in Chapter 2, we can separate the problems due to
dynamic, unpredictable changes and those of sensor noise and calibration errors. In Section
4.7, we built a detectability extents model that captures the entire sensing process and its
Figure 5.1: Comparison of a commonly used grasping framework with the visibility-based frame-
work. Because the grasp selection phase is moved to the visual feedback step, our framework can
take into account a wider variety of errors during execution. The robot platforms used to test this
framework (bottom).
noise models. It encodes a probability directly proportional to the confidence of the detection
process, the noise introduced in it, and the density of neighboring regions. By using just the
detectability extents, we hope to find better and more confident measures of the locations of
objects before attempting to grasp them. Because the detectability extents encode a probability
distribution of where the camera should observe an object from, the planning process can
mostly set aside the direct output of the vision algorithm and concentrate only on sampling
the extents.
We focus on a framework that can effectively use cameras attached to the gripper. Unlike
cameras mounted to the base or the head of the robot, gripper cameras can be maneuvered
into really tight spaces. This allows the robot to view objects in high cupboards, microwaves,
or lower shelves as easily as grasping them. Furthermore, the detection results from the
gripper camera are transferred directly into the gripper coordinate system, which removes
any robot localization or encoder errors from the final pose of the target object.
In Section 5.1, we show how to quickly sample robot configurations using the detectability
extents models such that the object becomes detectable in the camera image. We use these
samplers as a basis for a two-stage visibility planning process shown in Figure 5.1. In Section
5.2, we present experimental results that show performance of this method is as good as
regular grasp planning, even though a second stage was added.
There are two steps to incorporating visibility capabilities:
1. A planning phase that moves the robot manipulator as close as safely possible to the
target object such that the target is easily detectable by the onboard sensors.
2. An execution phase responsible for continuously choosing and validating a grasp for
the target while updating the environment with more accurate information.
In Section 5.3, we present a visual feedback method that integrates grasp selection. This
allows us to naturally fill the gap between the motion planning stage that picks grasps
and the execution stage that attempts to reach those grasps. In most research, obstacle
avoidance over the entire environment is rarely considered because of the computational
complexity of gradient-descent approaches. It is also hard for gradient descent to handle
nonlinear constraints like properly maintaining sensor visibility and choosing grasps. To
solve these problems, we present a stochastic-gradient descent method of planning that does
not have the singularity problems frequently associated with gradient-descent visual servoing
methods. Although many proposed robot systems include planning and vision components,
these components are treated independently, which prevents the system from reaching its
full potential.
The visual feedback framework is different from past research [Prats et al (2007a); Kim
et al (2009)] in that it analyzes how the environment and the plan as a whole are affected
during the visual-feedback execution. By combining grasp planning and visual feedback
algorithms, and constantly considering sensor visibility, we can recover from sensor calibration
errors and unexpected changes in the environment, as shown in [Diankov et al (2009)].
The framework incorporates information in a data-driven way from both planning and vision
modalities during task execution, allowing the models it uses to be automatically computed
from real sensor data.
Case studies of a humanoid robot picking up kitchen objects and an industrial robot
picking up scattered parts in a bin are presented in the context of this theory. We also show
how to sample base placements guaranteeing a visible configuration for mobile manipulation.
• Detectability. The detectability extents map of the object used from the planning
knowledge-base (Section 4.7). The stored map includes the camera direction v to the
object, and the distance from the object λ.
• Reachability. Once a valid camera configuration can be sampled, the robot kinemat-
ics must allow the camera to be placed there.
• Occlusions. The target needs to be inside the camera view and no other obstacles
should be in front of its path.
Although the final goal is to grasp the object, sampling visible configurations should not
consider grasps.
Before beginning to sample configurations, we assume a rough location of the target
object. In the four-step manipulation process covered in Section 2.4, the first step is a search
phase that detects very rough locations of all objects of interest. This visibility phase can
be considered a reconfirmation of the initial detection. Furthermore, the initial
detection most likely occurs when the robot is far away from the object, so the error
on the pose can be huge.
Due to noise on the target object's estimated pose, a sampled camera location does not
necessarily guarantee the object will be visible once the robot moves there. Figure 5.2 shows a case where
the object pose is initialized with a ±4cm error, so the robot has to sample three different
configurations before finding the object. Fortunately, the confidence values of the target
detectability extents encode the camera neighborhood density, so camera samples can be
organized keeping in mind that the object pose could be inaccurate.
5.1.1 Sampling Valid Camera Poses
The first step is to sample 6D camera poses with respect to the target coordinate system.
The detection extents are first sampled using the confidence map to get the camera direction
v and distance λ. The remaining parameters needed for a full pose are the 2D image offset p2D
and the roll θ around the camera axis. If the camera image were infinite in size, then sampling
any offset and in-plane rotation would suffice. The camera rotation R(v, θ) and translation
t(v, θ) in the object coordinate system are defined as:

(5.1)   Robject(v, θ) = Rodrigues( (0, 0, 1)ᵀ × v,  cos⁻¹ vz ) · Rz(θ),
        where Rz(θ) = [ cos θ  −sin θ  0 ;  sin θ  cos θ  0 ;  0  0  1 ]
        tobject(v, θ) = −λ R(v, θ) K⁻¹ p2D
The usage of Rodrigues [Belongie and Serge (1999-present)] allows the camera z-axis
to be oriented towards the desired direction v by the minimum angle. In order to convert
the 2D image offset into a 3D translation, p2D is first converted into camera coordinates by
K −1 . A multiplication with R(v, θ) converts the point into object coordinates. Finally the
point rests on the z = 1 camera plane, so it is multiplied by the desired distance λ to get
the final offset.
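A short Python sketch of Equation 5.1, using SciPy's rotation utilities in place of an explicit Rodrigues implementation, could look like this (the guard for v nearly parallel to the camera z-axis is our addition):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def camera_pose_in_object(v, theta, lam, p2d, K):
    """Orient the camera z-axis toward direction v by the minimal rotation, roll by
    theta about the camera axis, and back the camera off by distance lam along the
    ray through image point p2d."""
    v = np.asarray(v, float) / np.linalg.norm(v)
    axis = np.cross([0.0, 0.0, 1.0], v)
    angle = np.arccos(np.clip(v[2], -1.0, 1.0))
    if np.linalg.norm(axis) > 1e-9:
        swing = Rotation.from_rotvec(axis / np.linalg.norm(axis) * angle)
    else:
        swing = Rotation.identity()
    roll = Rotation.from_euler("z", theta)
    R = (swing * roll).as_matrix()            # camera rotation in the object frame
    ray = np.linalg.inv(K) @ np.array([p2d[0], p2d[1], 1.0])
    t = -lam * R @ ray                        # camera translation in the object frame
    return R, t
```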
Unfortunately, the image has boundaries, and it is necessary to check that the entire
object lies completely inside the image. Instead of transforming the object into image space,
it is much easier to transform the image boundaries into object space. We start by defining
the camera space boundaries on the z = 1 camera plane:
(5.3)   Πx = (−fx, 0, width − cx)ᵀ     Π−x = (fx, 0, cx)ᵀ
        Πy = (0, −fy, height − cy)ᵀ    Π−y = (0, fy, cy)ᵀ

where (width, height) are the dimensions of the image. A 3D point p is inside if:

(5.4)   ∀i   Πi · p ≥ 0
Since the planes pass through the origin, the w component of the vectors is 0. Given a
pose of the camera [R | t], it is possible to transform a plane Π2D into a 3D plane in the
object coordinate system with:

(5.5)   Π′ = ( R Π2D ,  −t · (R Π2D) − ∆safety )
where ∆safety pushes the planes forward by a small safety margin since the object pose can
be uncertain. The set of 3D planes {Π′} defines the camera visibility volume in the object
coordinate frame Vobject as a pyramid originating at t. To keep track of coordinate frames,
for notational convenience we write

(5.6)   Vobject = T^object_camera ∗ Vcamera
Containment testing is just computing dot products. Setting aside environment occlusions,
as long as the object is inside Vobject and the detectability extents are used, all of the
sampled camera poses should yield good views. In order to reduce the search, p2D is set to
the center of the image convex hull (width/2, height/2).
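The boundary planes of Equation 5.3 and their transformation into the object frame (Equation 5.5), together with the dot-product containment test, can be sketched as follows; the sign convention assumes a point is inside when every dot product is non-negative:

```python
import numpy as np

def visibility_planes_in_object(K, R, t, width, height, safety=0.0):
    """Image-boundary planes on the z = 1 camera plane, transformed into the object
    frame and pushed forward by a safety margin."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    planes_cam = np.array([
        [-fx, 0.0, width - cx],
        [ fx, 0.0, cx],
        [0.0, -fy, height - cy],
        [0.0,  fy, cy],
    ])
    planes_obj = []
    for n in planes_cam:
        n_obj = R @ n
        planes_obj.append(np.r_[n_obj, -np.dot(t, n_obj) - safety])
    return np.asarray(planes_obj)               # rows are (nx, ny, nz, d)

def object_inside(planes_obj, points):
    """Every object point must lie on the non-negative side of every plane."""
    homog = np.c_[points, np.ones(len(points))]
    return bool(np.all(homog @ planes_obj.T >= 0.0))
```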
Gripper Masks
Because the camera is attached to a link on the robot, it could always be looking at a
constant static silhouette of the robot geometry. Figure 5.3 shows two examples of cameras
attached to the gripper links of the robots, with the robot fingers blocking the camera from
observing that part of the environment. It might not always be possible to design sensors
that have a complete unobstructed view; therefore, we go over the computations necessary
to handle these gripper masks as efficiently as possible. Clearly, any visible configuration
should be outside of the gripper mask and inside Vcamera .
Although Section 5.1.2 shows how to handle arbitrary environment occlusions using ray
casting, having a constant mask in the camera image lends itself to a unique reduction in the
equations without any extra computation cost. The idea is to compute the convex polygon
of the largest free space in the camera image that does not contain the gripper mask (blue
regions in Figure 5.3). A valid camera sample has to fully contain the projection of the
object in this convex polygon, which is effectively the same computation as performed with
the camera visibility volume Vcamera . In fact, approximating the region with convex hulls
allows us to define a new visibility volume Vstaticmask and replace Vcamera . We set the new
p2D offset as the center of this new convex hull.
Figure 5.3: The real robots with a close-up simulation of the wrist cameras are shown in the
top two rows. Given the real camera image, the silhouette of the gripper (bottom) is extracted
(black) and sampled (red), then the biggest convex polygon (blue) not intersecting the gripper
mask is computed.

Computing the largest collision-free convex hull is a hard problem [Chew and Kedem
(1993); Borgefors and Strand (2005)], so we use a randomized algorithm that samples
supporting points on the boundary of the gripper mask and checks whether the resulting convex
polygon lies entirely inside free space. Depending on the number of supporting points, running
this procedure for an hour should yield a good approximation of the largest volume.
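A hedged sketch of the randomized search is given below; for simplicity it samples support points from all free pixels rather than only the mask boundary, and containment is checked with matplotlib's Path rather than an image rasterization:

```python
import numpy as np
from scipy.spatial import ConvexHull
from matplotlib.path import Path

def largest_free_convex_polygon(mask, n_support=8, n_iters=2000, rng=None):
    """Randomized search for a large convex polygon avoiding masked (gripper) pixels.
    mask: 2D boolean array, True where the gripper occludes the image.
    Returns (N, 2) vertices of the best polygon found, in (x, y) pixel coordinates."""
    rng = np.random.default_rng(rng)
    ys, xs = np.nonzero(~mask)
    free_pts = np.column_stack([xs, ys])
    occ_pts = np.column_stack(np.nonzero(mask)[::-1])
    best_area, best_verts = 0.0, None
    for _ in range(n_iters):
        support = free_pts[rng.choice(len(free_pts), n_support, replace=False)]
        try:
            hull = ConvexHull(support)
        except Exception:                      # degenerate (collinear) support set
            continue
        verts = support[hull.vertices]
        # reject if any occluded pixel falls inside the candidate polygon
        if occ_pts.size and Path(verts).contains_points(occ_pts).any():
            continue
        if hull.volume > best_area:            # ConvexHull.volume is the 2D area
            best_area, best_verts = hull.volume, verts
    return best_verts
```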
Representing the free space with one convex hull might leave out big chunks of free space
where the object could still be visible. If this becomes a problem, then representing the free
space with a convex decomposition should still maintain the computational advantages of the
visibility volumes, except there will be more than one volume to handle.

The computational efficiency of the mask has one disadvantage: if the mask changes,
the convex hull has to be recomputed. The masks in Figure 5.3 depend on the preshape of the
gripper, so it becomes necessary to always use the same preshape when performing the
visibility testing. If this becomes a problem, caching multiple masks can be a viable option
since a convex hull with N planes requires just 3N numbers. If the gripper has too many DOF
or mimic joints, then it is recommended to cache only the static, unchanging regions of the
mask and leave the rest to the environment occlusion checking.
Figure 5.4: For every camera location, the object is projected onto the camera image and rays
are uniformly sampled from its convex polygon. If a ray hits an obstacle, the camera sample is
rejected. The current estimate of environment obstacles is used for the occlusion detection. As the
robot gets closer, new obstacles come into view and a better location of the target is estimated,
so the visibility process has to be repeated.
Then the directions of all the rays originating at the camera origin that should be tested
for collisions are defined by
Table 5.1: Average processing times for the first visibility stage for thousands of simulation trials.
• The first scenario is a PA-10 arm with a one degree gripper grasping a box.
• The Barrett WAM with the Barrett Hand inside a kitchen environment.
• The HRP3 humanoid robot with a 6 DOF hand inside a kitchen environment.
For the first planning stage, we record the sampling and planning times. The average
time it takes to sample the first good valid inverse kinematics solution that meets all the
visibility constraints is shown in Table 5.1. Even with the extra visibility constraints, it
surprisingly takes a short amount of time to sample a goal. Furthermore, we combine the
sampling with a planner and compute the average time it takes to move the robot from
an initial position to a sampled goal. Note that planning times include the time taken for
sampling the goals.

Figure 5.5: The two-stage visibility planning method is as efficient as the one-stage grasp planning
method because it divides and conquers the free space.
In fact, when performing grasp planning by moving first to a visibility goal before
attempting to grasp the object, total planning times are in general very similar to grasp
planning from the beginning. This phenomenon occurs because the visibility goals act as
a keyhole configuration by first getting the camera, and consequently the gripper, very close to
the target object. Figure 5.5 shows how the search spaces compare when using a one-shot
method versus the divide-and-conquer method that samples the visibility space. Because the
required precision of robot execution is greatly reduced for the first stage to a visibility goal,
planning becomes very easy and fast: only obstacles need to be avoided and the final
gripper position is usually nowhere near an obstacle. Once at a visibility configuration, the
gripper is already so close to the object that it becomes possible to start visual servoing
to the final grasp without invoking a planner. The next section shows results when the
second stage is replaced by a stochastic gradient descent method rather than a full search-based
planning algorithm.
Figure 5.6: When initially detecting objects, the robot could have wrong measurements. If it
fixes the grasp at the time of the wrong measurement, then when it gets close to the object, the
grasp could be infeasible. This shows the necessity of performing grasp selection in the visual feedback
phase.
constraints on the environment obstacles, but the reality is that the goal is not just one fixed
pose with respect to the task frame, it is all possible grasps that can manipulate the object.
If the feedback stage does not consider the grasp selection process during visual feedback,
it cannot quickly compensate for changes in environment and will have to restart the entire
planning process from the start. In fact, we stress the importance of performing grasp
selection for the target during visual-feedback execution because more precise information
about the target’s location and its surroundings is available (Figure 5.1).
where the cos−1 is the Haar measure on SO(3). Each grasp is checked for the existence
of a collision-free inverse kinematics solution. The second metric for prioritizing grasps is
the distance between the solution and the current robot configuration.
VisualFeedbackWithGrasps (Algorithm 5.3) shows how the grasps are ordered using
these two metrics before calling the stochastic gradient descent algorithm. Graw is the set of
all possible, validated grasps using GraspValidator (Algorithm 3.4), γ forces the closest
configuration solutions to be considered before the farthest ones, and δ(·, ·) is the distance
metric on the configuration space of the robot used for choosing closer inverse kinematics
solutions.
It is also possible to maintain the visibility of the target while computing a path. We
define Cvisible as the space of all collision-free robot configurations in Cf ree that maintain the
visibility constraints with the object:

(5.10)   Cvisible = { q | q ∈ Cfree ,   O ∈ T^object_camera ∗ Vcamera ,   ∀r ∈ R(FK^object_camera(q)) : OnObject(r, q, O) }
RandomDescent (Algorithm 5.4) shows the visual feedback algorithm. Given a grasp
transform, we first find the closest inverse kinematics solution in configuration space and
set that as the goal. Then we greedily move closer to the goal and validate with Cvisible .
SampleNeighborhood returns a sample in a ball around q. After the grasp frame Tg gets
within a certain distance τ of the goal grasp, we start validating with Cfree instead, since
it could be impossible for the object to be fully observable at close distances. Because the
gripper is already very close to the object and the object is not blocked by any obstacles
due to the visibility constraints, such a simple greedy method is sufficient for our scenario.
Another advantage of greedily descending is that an incomplete plan can be immediately
returned for robot execution when planning takes longer than expected.
It should be noted that it is not always possible to maintain the visibility constraints. In
fact, when the current configuration is close to qgoal, it is not important to sense the target
anymore. Furthermore, if the gripper pose is very far away from the currently chosen grasp,
then we can also relax the visibility constraint. It is very common for the robot to get stuck
moving in one direction due to joint limits, requiring it to completely reverse direction to get
into the correct reachability region; relaxing the constraints on Cvisible allows this to happen.
Algorithm 5.4: path ← RandomDescent(qstart, qgoal)
1  path ← {qstart}
2  q ← qstart
3  for i = 1 to N do
4      qbest ← ∞
5      for j = 1 to M do
6          q′ ← SampleNeighborhood(q)
7          if q′ ∈ Cfree and δ(q′, qgoal) < δ(qbest, qgoal) then
8              if δgripper(FKgripper(qgoal), FKgripper(q′)) > τ or q′ ∈ Cvisible then
9                  qbest ← q′
10     end
11     if qbest ≠ ∞ then
12         path.add(qbest)
13         q ← qbest
14         if δ(qbest, qgoal) < ε then
15             return path
16 end
17 return ∅
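A runnable sketch of Algorithm 5.4 follows; it uses an unweighted Euclidean configuration distance and a uniform-ball SampleNeighborhood, both simplifications of the thesis setup:

```python
import numpy as np

def random_descent(q_start, q_goal, in_cfree, in_cvisible, fk_gripper, gripper_dist,
                   tau=0.05, eps=1e-3, step=0.05, n_outer=1000, n_inner=30, rng=None):
    """Greedy stochastic descent toward q_goal; the visibility constraint is only
    enforced while the gripper is farther than tau from the goal grasp frame."""
    rng = np.random.default_rng(rng)
    q = np.asarray(q_start, float)
    goal = np.asarray(q_goal, float)
    path = [q.copy()]
    for _ in range(n_outer):
        q_best, d_best = None, np.linalg.norm(q - goal)
        for _ in range(n_inner):
            q_new = q + rng.uniform(-step, step, size=q.shape)   # SampleNeighborhood
            if not in_cfree(q_new) or np.linalg.norm(q_new - goal) >= d_best:
                continue
            if gripper_dist(fk_gripper(goal), fk_gripper(q_new)) > tau or in_cvisible(q_new):
                q_best, d_best = q_new, np.linalg.norm(q_new - goal)
        if q_best is None:
            return []                    # stuck: no improving, valid neighbor found
        path.append(q_best)
        q = q_best
        if d_best < eps:
            return path
    return []
```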
Using the same scenes as in Figure 5.7, we record the average time to complete a visual
feedback plan while maintaining visibility constraints (Table 5.2). In order to measure how
much the visibility constraints affect the times, we also record statistics for the feedback
stage while ignoring the camera. Although the times vary greatly depending on the situation,
the results show that the feedback algorithm can execute at 2-10 Hz. Looking at the planning
times, scenes with more obstacles usually finish faster than scenes with fewer obstacles. This
phenomenon most likely occurs because the obstacles constrain the feasible configuration space of
the robot and guide it toward the goal faster.

Just like all feedback algorithms, VisualFeedbackWithGrasps is constantly running
and re-validating the scene. When it returns, the robot starts executing the path and the
algorithm runs again, assuming that the robot will switch to the new trajectory ∆t seconds in
the future. The path to the object is always validated, which can sometimes stop the robot
[Ferguson and Stentz (2006)].
As discussed in Section 5.2, the combined planning times for the first planning stage
and the second visual feedback stage are comparable to the planning times of previous
manipulation systems that do not even consider visibility [Berenson et al (2007); Diankov
et al (2008a); Srinivasa et al (2008)]. The second visual feedback stage does not have to
consider complex planning scenarios since the target is right in front of the camera, which
allows it to finish quickly.

Figure 5.7: The scenes tested with the visibility framework. The images show the robots at a
configuration such that the visibility constraints are satisfied.

                                           PA-10           WAM
Few obstacles / Visibility Constraints     0.626s (93%)    1.586s (96%)
Many obstacles / Visibility Constraints    0.512s (83%)    0.773s (67%)
Few obstacles / No Visibility              0.117s (94%)    0.406s (97%)
Many obstacles / No Visibility             0.098s (86%)    0.201s (71%)
Table 5.2: Average planning times (with success rates) of the visual feedback stage.
ability spaces and camera visibility poses as shown in Figure 5.9. If all the data is cached
in kd-trees, the intersection operation is on the order of a few milliseconds, therefore a lot
of time is saved. Once the robot moves to a visible configuration, it recomputes the target
pose and starts grasp planning. Figure 5.10 shows the sequence of moves made. The grasp
plan always moves the gripper to a retreated grasp along the grasp direction, which allows for
padding the object boundaries so the robot does not hit the object on its way there (Section A.3.1).
The robot then moves along the grasp direction, grasps the target, and moves it to its
destination. Figure 5.11 shows several real-world experiments on similar scenes but with different
objects. After all the components were put together and padding was used for avoiding edge
collisions, the robot succeeded in grasping the objects 9 out of 10 times. All failures were
due to slipping of the gripper fingers while grabbing the object. There are two explanations
for this:
• There could be imperfections in modeling of the gripper geometry and thus getting a
bad grasp in the real world, while it is stable in simulation. The gripper fingers were
custom made, and no CAD model was readily available for them, therefore they were
measured by hand. All fragile grasps were pruned using the repeatability metric in
Section 4.2.1 with a noise of 0.01m. Manually checking every grasp in the grasp set
confirms that no fragile grasps were present.
• The pose of the object was a little off due to the vision algorithm. As we previously
mentioned, none of the camera positions were accurately calibrated because we wanted
to test the power of the gripper cameras. The fact that most of the grasps succeeded
is because the gripper camera usually detected objects from 0.1m-0.2m, which reduces
the calibration error to a few millimeters.

Figure 5.10: The full process of mobile manipulation using visibility, reachability, and base
placement models.
Figure 5.13: When sampling base placements with visibility, the reachability of the grasp sets also
has to be considered, otherwise the robot will need to move again before it can grasp the object.

can still be considered. Using the reachability multiplication trick in Figure 5.9, we can start
sampling base placements that prioritize visibility but also consider the grasp set. Of the few
experiments performed with mobile manipulation, all of them succeeded in grasping the object
on the first try.
Figure 5.14: The industrial bin-picking scene has a ceiling camera looking at a bin of parts and a
gripper camera.
Figure 5.15: Collision obstacles around the target part need to be created to model unknown
regions because the robot does not have the full state of the world.
Figure 5.16: Grasp Planning is tested by simulating the unknown regions.
5.6 Discussion
In this chapter we introduced a planning framework based on the visibility of the target ob-
jects. We showed how to efficiently apply the object’s detection extents to sample visibility
configurations that allow the robot to accurately recompute a target’s pose. By explicitly
considering the visible regions of objects, we can compute an upper-bound on the object
pose error, and therefore greatly reduce the uncertainty the grasping module has to deal
with. Furthermore, planning results suggest that inserting visibility configurations into the
planning loop does not increase computation times by much. This discovery suggests that
the visibility configuration space acts as a keyhole configuration by dividing and conquering
the free space of the robot. The visibility sampler we presented relies on the data-driven de-
tectability extents of the target object and vision algorithm pair. Using it allows us to sample
camera locations that are guaranteed to observe the object, even under pose uncertainties
as large as 4cm. Instead of relying on parallel computation and exotic hardware to speed
up the process, the sampler carefully arranges the criteria that must be tested and uses ray
testing as much as possible.
Once a visible configuration has been reached, we resume grasping either using the grasp
planning pipeline or a visual feedback approach. In order to make visual feedback robust to
large changes in the object pose as error is reduced, we presented an extension to a stochastic
gradient descent algorithm that can reason about grasps while approaching the object. The
gradient descent method is fast enough to be used as a visual feedback algorithm in case the
scene is prone to frequent changes.
Throughout the entire chapter, we have discussed the advantages and usage of cameras
attached to the gripper of the robot. An interesting analogy can be made to human ma-
nipulation to prove the necessity of sensors attached to a gripper. In human manipulation,
eyesight serves to provide context and a rough perception of the scene, objects, and their
locations. However, the final steps in grabbing objects by humans mostly rely on tactile
sensors from the hands rather than eyesight. A human can manipulate as dexterously in the
dark as in light; but as soon as the sense of touch goes away, the manipulation strategies are
significantly slowed down. Unfortunately, tactile sensing for robots is still far from becoming
a mainstream commodity used in research labs. Tactile sensors need contact with the
environment to make measurements, which makes them prone to wear-and-tear damage and
infeasible for industrial scenarios. Until tactile sensors become reliable and cheap, we argue
that gripper cameras will remain for a long time the best and most affordable alternative for
completing a fast loop with the target object.
A gripper camera configuration can provide highly accurate information because the
camera/gripper can fit into tight spaces that base- or head-mounted cameras cannot observe.
Furthermore, if the camera is attached to the same frame of reference that all grasps are
defined in, we can largely eliminate kinematic errors due to absolute joint encoder
errors. We showed results of a humanoid robot using a gripper camera to manipulate
objects inside a sink, which has low visibility. Even though the humanoid's cameras were
not accurately calibrated, the gripper camera allowed the robot to consistently pick up objects with
few failures. This suggests that visibility algorithms can reduce the required calibration on
robots. This idea is especially applicable in industrial scenarios where robots perform their
tasks without stopping for long periods of time. By relaxing the number of times calibration
needs to be done, we can assure that robots can be operational longer without maintenance.
Chapter 6
reference tends to be ignored; there are no known reliable calibration packages that solve
the automation complexities necessary for it. Although many formulations for extrinsic
calibration exist, its data gathering phase usually requires a user to hold a pattern in several
locations in front of the camera, along with providing a set of measurements revealing the
relationship between the camera frame and the pattern frame of reference. Because a user is
responsible for providing the data in previously proposed systems, the system only has to
worry about finding the most likely extrinsic model that fits the data; it does not concern
itself with scoring the data itself. It is the data-gathering phase that makes camera calibration
painful for users; therefore, such systems are not truly automated.
The unknowns that have to be solved through the calibration process are:
• Calibrating a gripper camera. The goal is to compute the relative camera transfor-
mation with respect to the link the camera is attached to. The pattern is stationary in
the environment. An inverse kinematics solver of the link to the robot base is required.
• Calibrating an environment camera. The goal is to compute the relative camera
transformation with respect to the robot base. The pattern is firmly attached to or
grasped by a robot link, which needs an inverse kinematics solver.
Compared to other systems, one new feature of this system is the usage of the visibility
samplers covered in Section 5.1 to determine the best views of the pattern with respect to
the camera coordinate system. This allows us to compute safe configurations for the robot to
move to in order to take the desired measurements. It should be noted that this does not
necessarily guarantee the pattern will be visible in all sampled views due to the unknowns in
the system and the progressive nature in which they are computed. The system should continue
gathering data until the calibration estimates are robust enough.
• T^link_camera - the relative transformation between the camera sensor and the robot link.
We denote the initial estimate from the CAD model by T̃^link_camera.

• T^world_pattern - the transformation of the pattern in the world coordinate system. This is
unknown.

• T^world_link(q) - the transformation of the link the camera sensor is attached to with respect
to the world coordinate system. q is the configuration of the robot.

• T^camera_pattern - the transformation of the pattern in the camera coordinate system.

Figure 6.1: Frames used in hand-eye camera calibration.
(6.2)   ∀i :   ei(ω) = proj( (T^link_camera)⁻¹ (T^world_link(qi))⁻¹ T^world_pattern Xj ) − fK(x̄ij) = 0
where θcone defines the half-angle of the cone, Lmaxdist is the maximum length of the cone,
θdir is the maximum angle the camera direction can deviate. The positions and directions
are uniformly sampled on R³ and S² (Section 4.3.1). For every position and direction pair,
we find the desired pattern transformation with respect to the original camera coordinate
system:

(6.4)   Tcamera(p, d) = [ Rodrigues( (dy, −dx, 0)ᵀ,  tan⁻¹( √(dx² + dy²) / dz ) )   p ;   0 0 0   1 ]

Figure 6.2: The automated process for calibration. Assuming the pattern is initially visible in the
camera image, the robot first gathers a small initial set of images inside a cone around the camera
(green) to estimate the pattern's location. Then the robot uses visibility to gather a bigger set of
training images used in the final optimization process.

checkerboard grid    fx            fx (fixed principal point)    standard deviation ratio
9x8                  2076 ± 197    1992 ± 152                    1.296
8x7                  2056 ± 23     2080 ± 17                     1.353
7x6                  2093 ± 19     2085 ± 16                     1.188
6x5                  2094 ± 51     2116 ± 23                     2.217
5x4                  2073 ± 33     2082 ± 22                     1.500
4x3                  2064 ± 118    2053 ± 31                     3.806
Table 6.1: Initial intrinsic calibration statistics showing the mean and standard deviation of
calibration results using 5 images. Note how the standard deviation is much smaller when the
principal point is fixed to the center of the image.
Using Equation 6.4, we can build up the camera samples Tpatterns and solve for the initial
intrinsic parameters of the camera. Once 5 − 8 images are gathered, the focal length and
radial distortion of the camera are estimated while setting the principal point to the center
of the image. The reason for this reduction in parameters is that the initial set of images
is not enough to create strong constraints for all parameters. In practice, if we also estimate
the principal point, the focal length can be an order of magnitude off from the real value,
which makes the later processes impossible to converge. Table 6.1 shows a comparison with
a fixed principal point when calibrating the gripper camera as shown in Figure 6.2. As can
be clearly seen, the standard deviation of the error is much lower when the principal point
is fixed, so the results can be trusted more. Using the initial intrinsic calibration and the
rough initial guess of the camera location, we can estimate the pattern's pose in the world,
which allows us to start the visibility-based data gathering that can assure we capture a good
set of images that constrain all parameters well.
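As an illustration of this reduced initial calibration, the sketch below uses OpenCV's calibrateCamera with the principal point fixed to the image center; the initial focal-length guess and the choice of which distortion terms to fix are our assumptions, not values from the thesis:

```python
import cv2
import numpy as np

def initial_intrinsics(object_points, image_points, image_size):
    """Estimate focal length and low-order radial distortion from only a few views,
    keeping the principal point pinned to the image center."""
    w, h = image_size
    guess = np.array([[1000.0, 0.0, w / 2.0],
                      [0.0, 1000.0, h / 2.0],
                      [0.0, 0.0, 1.0]])
    flags = (cv2.CALIB_USE_INTRINSIC_GUESS
             | cv2.CALIB_FIX_PRINCIPAL_POINT
             | cv2.CALIB_ZERO_TANGENT_DIST
             | cv2.CALIB_FIX_K3)
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        object_points, image_points, (w, h), guess, None, flags=flags)
    return rms, K, dist
```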
Ideally, we want to work with an ordered set of poses Tpatterns that represent a progressive
sampling of views for the camera such that the first N views of the set represent the best
N views. We define best views in terms of how well the data can constrain the calibration parameters in the optimization process.

Figure 6.3: The detectability extents of the checkerboard pattern (shown in black).

It has been shown that views differing only by a roll around the
camera axis or distance from the object do not contribute many new constraints. The most
interesting constraints come from out-of-plane rotations in S 2 , with the next most interesting
being large offsets in R2 on the camera plane that help in measuring radial distortion.
Therefore, we can progressively sample S 2 ⊗ R2 to get the ordered set. Furthermore, we
want some guarantees that the camera can actually detect the calibration pattern at the
sampled locations. Therefore, we build up detectability extents of the pattern (Figure 6.3)
and intersect it with the samples on S 2 ⊗ R2 . The extents provide the maximum angle
of incidence to the calibration pattern plane and the minimum and maximum distance from the pattern. Because the robot is going through a calibration phase and does not have an accurate position of the pattern or the camera, views on the edge of the detectability extents have a very high probability of being missed; therefore, positions in the densest regions should be prioritized first. It is possible to enforce such an ordering on $S^2$ according to [Gorski et al (2005)]. Even when taking into account position uncertainty and detectability, a
particular view might be blocked by the environment, so the actual sampling process requires
an algorithm to keep track of what is detected and another algorithm to determine when the
data gathering process should terminate.
The basic outline of data gathering is presented in GatherCalibrationData (Algo-
rithm 6.1) where the input is a set of possible desired pattern views Tpatterns in the camera
space. For every pattern $T_p \in T_{\text{patterns}}$, the world camera position is $T^{\text{world}}_{\text{link}}(q_{\text{cur}})\, \tilde{T}^{\text{link}}_{\text{camera}}\, T_p$, where $q_{\text{cur}}$ is the current configuration of the robot. Because the inverse kinematics equations work with the desired manipulator link, we right-multiply the camera transformations by $(\tilde{T}^{\text{link}}_{\text{camera}})^{-1}$ to convert to the link space. After all the desired robot link transformations
are computed in Tlinks , we start sampling from the set and taking measurements. In ev-
Algorithm 6.1: CalibrationSamples ← GatherCalibrationData(Tpatterns)
1  CalibrationSamples ← ∅
2  Tlinks ← { T^world_link(q_cur) T̃^link_camera T_p (T̃^link_camera)^{-1} | T_p ∈ Tpatterns }
3  while Tlinks ≠ ∅ do
4      T̄^world_link ← NextBestView(Tlinks)
5      q_target ← IK(T̄^world_link)
6      if q_target ≠ ∅ then
7          if PlanToConfiguration(q_target) then
8              if IsPatternDetected() then
9                  CalibrationSamples ← CalibrationSamples + (T^world_link(q_cur), {x̄^i_j})
                   /* Prune similar transformations */
10                 Tlinks ← { T ∈ Tlinks | δ(T, T^world_link(q_cur)) > τ }
11                 if CheckCalibrationValidity(CalibrationSamples) then
12                     return CalibrationSamples
13                 else
14                     continue
15     Tlinks ← Tlinks − { T̄^world_link }
16 end
17 return ∅
ery iteration, the next best view $\bar{T}^{\text{world}}_{\text{link}}$ is chosen and inverse kinematics are applied with collision-checking to yield a target configuration $q_{\text{target}}$. The robot attempts to plan a path and move to $q_{\text{target}}$. If successful and the pattern is detected at that configuration, then all similar views are removed from $T_{\text{links}}$, and the robot position $T^{\text{world}}_{\text{link}}(q_{\text{cur}})$ and the pattern feature points $\{\bar{x}^i_j\}$ are stored. Note that the pattern transformation $T^{\text{camera}}_{\text{pattern}}$ is not needed to compute the reprojection error as defined in Equation 6.2, so it is not stored. If any step fails, then only the currently tested link transformation $\bar{T}^{\text{world}}_{\text{link}}$ is removed from $T_{\text{links}}$. This process is continued
until the validation function CheckCalibrationValidity decides the data is enough. If
all samples are tested before valid data is found then the function fails.
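The same loop can be sketched in Python; every robot- or vision-specific operation (next_best_view, ik, plan_to_configuration, detect_pattern, check_validity, pose_distance) is a hypothetical callable supplied by the caller, so the sketch stays self-contained:

import numpy as np

def gather_calibration_data(T_patterns, T_world_link, T_link_camera_est,
                            next_best_view, ik, plan_to_configuration,
                            detect_pattern, check_validity, pose_distance, tau):
    # Sketch of Algorithm 6.1; the injected callables stand in for the robot/vision stack.
    inv = np.linalg.inv
    samples = []
    T_links = [T_world_link @ T_link_camera_est @ Tp @ inv(T_link_camera_est)
               for Tp in T_patterns]                       # line 2
    while T_links:
        T_goal = next_best_view(T_links)                   # line 4
        q_target = ik(T_goal)                              # line 5: collision-checked IK
        if q_target is not None and plan_to_configuration(q_target):
            corners = detect_pattern()                     # image feature points or None
            if corners is not None:
                samples.append((T_goal, corners))          # line 9
                # line 10: prune transformations similar to the one just measured
                T_links = [T for T in T_links if pose_distance(T, T_goal) > tau]
                if check_validity(samples):                # line 11
                    return samples
                continue
        T_links = [T for T in T_links if T is not T_goal]  # line 15
    return None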
Using the initial intrinsic parameters, the initial estimate of the world transformation of
the pattern is given by
(6.5)  $\tilde{T}^{\text{world}}_{\text{pattern}} = T^{\text{world}}_{\text{link}}\, \tilde{T}^{\text{link}}_{\text{camera}}\, T^{\text{camera}}_{\text{pattern}}$

where $T^{\text{camera}}_{\text{pattern}}$ can be easily solved using classic 2D/3D point correspondence techniques
[Hartley and Zisserman (2000); David et al (2004)]. Once we have an initial estimate of the
Figure 6.4: Several configurations of the robot moving to a visible pattern location.
pattern in the world (Figure 6.2 right), we use the visibility theory developed in Section 5.1
to sample a feasible and informed set of camera transformations around the pattern using
Equation 5.1. Figure 6.4 shows several examples of the robot using the visibility regions of
a checkerboard pattern to calibrate itself. In order to evaluate results, we intersect all the
rays from the image points together to find the 3D world point $\tilde{X}_j$ agreeing with all images, where the projection matrix is $\begin{bmatrix} R^i & t^i \\ 0 & 1 \end{bmatrix} = (T^{\text{world}}_{\text{link}}\, T^{\text{link}}_{\text{camera}})^{-1}$. Because $T^{\text{world}}_{\text{pattern}}$ is also computed from the optimization process, we compute the 3D error by

(6.6)  $\dfrac{1}{N} \sum_j \left| T^{\text{world}}_{\text{pattern}} X_j - \tilde{X}_j \right|$
Table 6.2 shows the reprojection and 3D errors associated with calibrating the gripper
camera using several checkerboard patterns. The reprojection errors are usually an order
of magnitude larger than the intrinsic errors because we are constraining to only one world
pattern whose transformation is determined through the robot's kinematics. The added error from the robot's kinematics and the reduction in degrees of freedom contribute to large reprojection errors, which is also a sign that the intrinsic error is over-fitting its data. Another interesting
result is that the smaller checkerboards yield much better performance in terms of intrinsic
and 3D errors. 3D error is not used in the optimization process because the gradients on the
error are not continuous, and take longer to compute.
checkerboard | square length (m) | intrinsic error | reprojection error | 3D error (m)
9x8 | 0.0062 | 0.448 ± 0.106 | 13.44 ± 6.48 | 0.00052 ± 0.00031
8x7 | 0.0069 | 0.236 ± 0.113 | 6.84 ± 3.02 | 0.00030 ± 0.00009
7x6 | 0.0077 | 0.249 ± 0.092 | 16.47 ± 8.66 | 0.00026 ± 0.00009
6x5 | 0.0088 | 0.313 ± 0.159 | 11.27 ± 8.34 | 0.00024 ± 0.00010
5x4 | 0.0103 | 0.217 ± 0.080 | 15.43 ± 8.76 | 0.00025 ± 0.00007
4x3 | 0.0124 | 0.213 ± 0.089 | 8.81 ± 4.08 | 0.00018 ± 0.00009
Table 6.2: Reprojection and 3D errors associated with calibrating the extrinsic parameters of the
gripper camera. Both intrinsic error and reprojection errors are in pixel distances, with reprojection
error using only one pose for the checkerboard by computing it from the robot’s position.
Figure 6.5: Example environment-camera calibration setup. The pattern is attached to the robot gripper and the robot moves it to gather data. Sampled configurations are shown on the right.
(6.8)  $T_{\text{links}} \leftarrow \left\{\, T^{\text{world}}_{\text{link}}(q_{\text{cur}})\, \tilde{T}^{\text{link}}_{\text{pattern}}\, T_p^{-1}\, (\tilde{T}^{\text{link}}_{\text{pattern}})^{-1} \;\big|\; T_p \in T_{\text{patterns}} \,\right\}$

where a rough initial value of $\tilde{T}^{\text{link}}_{\text{pattern}}$ has to be known. Every other step is equivalent to
the gripper camera case. Figure 6.5 shows several sampled configurations when the robot is
in front of the camera.
By assuming that all feature points lie on the $z = 0$ plane, the homography can also be represented as:

(6.10)  $\begin{bmatrix} h_{i,1} & h_{i,2} & h_{i,3} \end{bmatrix} = sK \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}$

where $r_1$ and $r_2$ are the first two (orthogonal) column vectors of the rotation of $T^{\text{camera}}_{\text{pattern}}$. Using the
orthogonality constraint, the following equations can be built
$h_1^T B h_2 = 0$
$h_1^T B h_1 - h_2^T B h_2 = 0$
where $B = K^{-T} K^{-1}$ is the image of the absolute conic. $B$ has 6 values and every image provides 2 constraints, which can be formulated as a classic $Ab = 0$ problem.

Figure 6.7: Graphs of the second smallest eigenvalue of the matrix used to estimate the intrinsic camera parameters. As the eigenvalue increases, the deviation of $f_x$ becomes smaller and more stable. The left graphs are for combinations of 5 images while the right graphs use more than 11 images. The pattern is more apparent for smaller numbers of observations. More observations decrease the standard deviation of the solution, but the stability is still dependent on the second eigenvalue.

Let $\{\lambda_i\}$ and $\{b_i\}$ respectively be the eigenvalues and eigenvectors of $A$ such that $\lambda_i \leq \lambda_{i+1}$, where each $b_i$ represents the upper triangle of $B$. Then $b_1$ is the best null-vector of $A$ and therefore the best solution for $B$. However, ill-conditioned data will lead $A$ to have a null space of more than one dimension, so there should be more than one eigenvalue $\lambda_i$ close to zero. Intuitively,
this amounts to checking $\lambda_2$ for ill-conditioning effects. Traditionally the ratio $\lambda_1/\lambda_2$ has been used to check for stable null spaces; however, image noise and the fact that we are ignoring distortion make $\lambda_1$ fluctuate a lot, causing the ratio to yield no information. Figure
6.7 shows how λ2 varies with respect to the estimated focal length fx for several checkerboard
patterns using different numbers of images. The eigenvalue itself is normalized with respect
to the number of equations used to estimate it. An interesting recurring pattern is that the deviation of $f_x$ from its true value grows smaller as $\lambda_2$ increases. The intuition is that the bigger $\lambda_2$ gets, the more stable the null space of $A$ becomes. An interesting analogy can
probably be drawn between the power behind the second smallest eigenvalue λ2 , and the
second smallest value used to solve the normalized cuts problem for image segmentation [Shi
and Malik (1997)].
This pattern suggests that the second eigenvalue can be used to validate the incom-
ing data. Once B is computed, the signs of all eigenvalues in B have to be the same;
otherwise, imaginary solutions would be computed for K, which is not correct. CheckCal-
ibrationValidity (Algorithm 6.2) describes the entire validity process for the calibration
data.
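A simplified sketch of such a validity test is shown below with numpy. The constraint rows follow Zhang's homography formulation of the $Ab = 0$ problem, the eigen-decomposition is taken over $A^T A$, and lambda2_threshold is a hypothetical tuning parameter rather than a value from the thesis:

import numpy as np

def _v(H, i, j):
    # v_ij vector from the homography constraints; H columns are h_1, h_2, h_3
    hi, hj = H[:, i], H[:, j]
    return np.array([hi[0]*hj[0],
                     hi[0]*hj[1] + hi[1]*hj[0],
                     hi[1]*hj[1],
                     hi[2]*hj[0] + hi[0]*hj[2],
                     hi[2]*hj[1] + hi[1]*hj[2],
                     hi[2]*hj[2]])

def check_calibration_validity(homographies, lambda2_threshold):
    # Build the A matrix, check the normalized second smallest eigenvalue for a stable
    # null space, then check that B is sign-definite so K has a real solution.
    A = []
    for H in homographies:
        A.append(_v(H, 0, 1))
        A.append(_v(H, 0, 0) - _v(H, 1, 1))
    A = np.asarray(A)
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)   # ascending eigenvalues of A^T A
    lambda2 = eigvals[1] / len(A)                # normalize by the number of equations
    if lambda2 < lambda2_threshold:
        return False                             # null space not stable enough yet
    b = eigvecs[:, 0]                            # best null vector -> upper triangle of B
    B = np.array([[b[0], b[1], b[3]],
                  [b[1], b[2], b[4]],
                  [b[3], b[4], b[5]]])
    s = np.sign(np.linalg.eigvalsh(B))           # all eigenvalues of B must share a sign
    return bool(np.all(s == s[0]))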
Another necessary property for the intrinsic calibration algorithm is that it does not lose
accuracy as more images are added. Surprisingly, [Datta et al (2009)] performed a set of experiments showing that the OpenCV [Bradski and Kaehler (2008)] and Camera Calibration Toolbox [Bouguet (2002)] implementations do not have this property. Instead, [Datta et al (2009)] motivates
the usage of ring-based patterns along with an iterative reprojection process.
6.5 Discussion
One prerequisite for true robot autonomy is well-calibrated sensors. Although this basic
prerequisite of robotics is known to any researcher in the field, rarely do researchers consider
automating the entire calibration phase, including the data gathering phase. Consequently, the most painful part of working with a real robot is maintaining its calibration and managing the printing of many calibration patterns. In response to this reality, we introduced a
completely automated method for calibrating the intrinsic and extrinsic parameters by using
the planning and visibility theories developed in Chapters 4 and 5. We focused on the data-gathering phase and provided an analysis of how to determine the quality of the gathered data. Once the planning and vision knowledge-bases are created, the method gives researchers a one-click way to calibrate a robot just by setting a calibration pattern in front of the camera.
The quality of data problem directly relates to how well the calibration pattern can
be detected in an image along with the constraints it provides. By taking advantage of
the detectability extents of the pattern, we can create a progressive best-view sampler set that describes where the camera should be placed. Furthermore, by using the visibility samplers, we can simultaneously search for possible locations the robot can move to such that the pattern is unobstructed from view. After data gathering finishes, the pattern position and camera calibration are computed simultaneously. Using the data and calibration quality metrics, we can determine whether to stop or to re-run the algorithm with the
improved estimates. Furthermore, we showed the flexibility of the method in calibrating an
environment camera.
We discovered several surprising results when correlating error metrics to the stability of
the calibration solutions they point to. The first discovery was that the standard deviation
and reprojection errors really do not hold much valuable information if the data is already
degenerate. The algorithms can overfit to the data and the resulting errors will be very
small, even though the calibration result is off. The second discovery was the power of the
second smallest eigenvalue of the matrix whose null space contains the solution to the intrinsic calibration. We showed a very clear correlation between the magnitude of the eigenvalue and
the deviation of the focal length from its true value. We can use this test to determine when
to stop the calibration sequence, or to continue. Thus, we can provide guarantees on how
well the returned solutions model the data.
Chapter 7
Recognizing specific rigid objects and extracting their pose from camera images is a fun-
damental problem for manipulation. Although many pose recognition algorithms exist, re-
searchers usually concentrate on performance of the run-time process and not on the com-
plexity of the work required to automate the building processing of the program for a new
object. It is important to differentiate object-specific pose recognition from the general ob-
ject recognition problem. In vision research literature, object recognition also deals with
detecting and extracting objects from images that belong to particular classes like faces [Gu
and Kanade (2008)], cars [Li et al (2009)], and other common object categories [Griffin et al
(2007)]. The challenge is not only to segment out the particular objects that algorithms were
trained with, but to find the key features that differentiate one object class from another so
that the algorithm can generalize to objects that were never seen. Because training data is
sparse compared to the space of all possible objects in a category, object recognition research
starts crossing the borders of machine learning, language semantics, unsupervised pattern
recognition, and general artificial intelligence. On the other hand, object-specific recognition
deals with the analysis and identification of one particular object, which is a well-formed
problem that can exercise geometric and photometric analyses that are not so consistent
when entire semantic classes of objects are considered.
There are many advantages to concentrating on object-specific pose recognition. In such
a formulation, the only variations independent of the object are: lighting and environment
changes, object pose changes, and occlusions. Furthermore, it becomes possible to give a
training set that covers viewing the object from all directions, which reduces the machine
learning complexity of the problem. Because an object-specific program can be specifically
tuned to the particular object features, it is possible to speed up the pose extraction process,
and increase the detection rate by over-fitting the models to the problem. Finally, considering
the low cost of today’s memory and computation power, thousands of object-specific models
can be easily indexed by a small robot system.
This chapter is divided into two main contributions. The first is creating an object-
specific database that holds meaningful statistical analyses of the object’s features (Section
7.2). Every pose extraction algorithm relies on a common set of statistics, which we
organize into a database. In this database, we classify stable and discriminable features
while pruning out features due to noise. Stable features are clustered into a handful of
representative classes. Using these classes, it is possible to determine whether an arbitrary
image feature is part of the object. The second main contribution discussed in Section 7.3 is
a new pose extraction algorithm that builds up its pose hypotheses by hallucinating poses given a set of image features. Because it does not rely on one-to-one feature correspondences, it is particularly powerful for non-textured, metallic objects. We term this algorithm induced-set pose extraction. The new algorithm uses feature clustering. Along with formulating a new search process, we show how to train an evaluation metric for pose hypotheses. The evaluation
metric uses a combination of Adaboost and Markov Chain Monte Carlo methods to train a
set of weak learners to only recognize the specific object (Section 7.3.4).
We begin with the basic theories of pose recognition systems and go over previous work.
We then introduce the object database and its organizational structure. Finally, we discuss
the induced-set pose extraction algorithm and show results in the context of recognizing
industrial parts.
quickly search for a matching that produces the most stable pose (Figure 7.1).
Some pose extraction systems store a set of views labeled with the object pose [Lepetit
et al (2003)] instead of explicitly extracting the geometry of the target object; during runtime,
epipolar geometry is used between the current image and database images to compute a pose consistent with all constraints. Furthermore, [David and DeMenthon (2005)] has shown
how to perform pose recognition when the image features are lines. Because matching lines
is more computationally intensive and poses generated by RANSAC are frequently wrong,
several heuristics have to be employed to ensure good performance.
A lot of object pose estimation research [Gordon and Lowe (2006); Lepetit et al (2003);
Chia et al (2002); David and DeMenthon (2005)] formulates the extraction problem as an
iterative algorithm that relies on a previous measure of the pose to seed the search for the
current image. At startup time, these algorithms typically rely on simple methods to seed
the search, but each method has its own shortcomings and offers no guarantees of picking the correct object.
Perhaps the set of algorithms closest to our formulation of induced-set pose extraction are
ones that analyze the object geometry and attempt to come up with Gestalt rules on how to
extract a pose estimate from the composition of low level features [Goad (1983); Bolles et al
(1983); Grimson and Lozano-Perez (1985)]. Specifically, [Wheeler and Ikeuchi (1992, 1995);
Gremban and Ikeuchi (1993, 1994)] have proposed ways to automatically extract features on
the object surface and match features to them using discrete search based on aspect graphs.
Our method differs from these methods in that we do not formulate the problem with respect
to 3D object features and aspect graphs, and we do not search for correspondences between
image features and object surface regions.
Point Features
For a feature f0 ∈ F0 , f0 is the local descriptor for the image at that region and Mf0 is the
local coordinate system including position, orientation, and scale. Point feature algorithms
are separated into estimating the set of interest points from which the coordinate system of
each feature can be computed, and describing the local region around the feature’s location.
A great summary of interest point detectors can be found in [Mikolajczyk et al (2005)] where
the following region detectors proposed in [Mikolajczyk and Schmid (2002, 2004); Tuytelaars
and Van Gool (2004); Kadir et al (2004)] are compared with each other.
An image descriptor is typically represented in a vector space with a kernel-based distance
metric. This formulation allows a feature to be easily used in machine learning and clustering
algorithms. Some of the most popular low-level feature descriptors like Gabor filters are
inspired from the sparse-coded nature of the visual cortex [Serre et al (2005)]. These features
can encode multiple frequencies and respond to spatial intensity changes, which makes them
robust to color and other lighting variations. In order to encode higher level information like
texture, [Lowe (2004); Bay et al (2008)] have proposed a scale-invariant set of features. As
the name implies, the feature's coordinate system encodes the orientation and scale of the region, making it slightly robust to affine transformations. Several new formulations of SIFT [Lowe (2004)] make affine invariance a priority for robust pose extraction [Morel and Yu
(2009); Wu et al (2008)]. SIFT features represent the local region as an oriented histogram of
spatial derivative responses. [Moreels and Perona (2005)] present a survey of the performance of feature descriptors for accurately recovering viewpoint changes.
Line Features
A line feature space F1 contains all 1D curves from the image with Mf1 defining the shape
of the curve. There are many algorithms that produce a black-and-white map detecting the edges of the image.

Figure 7.2: The database is built from a set of real-world images, a CAD model, and a set of image feature detectors. It analyzes the features for stability and discriminability.

By combining all edges into connected components and segmenting
them whenever an edge branches, we can compute a set of 1D curves. Although the shapes
of the curves encoded in Mf1 alone can be used for matching image curves, there also exist
edge descriptors [Mikolajczyk et al (2003)] robust enough to detect bicycles in images.
with all the neighboring features using a learned classifier. Since the framework uses the
CAD model of the object, it can compute the depth map of the object, measure inter-
occlusions due to complex objects, and perfectly segment out the region the object occupies
in the image. It is the accessibility of the geometric information that allows us to extract noise-free information and compute the stable geometric and visual words.
The generation process itself is divided into three main components as shown in Figure
7.2. The first four stages analyze all the feature detectors independently of each other.
Image-based Rendering
The searching processes for induced poses require gathering a dense set of poses that produce a specific feature. The images we can realistically gather on a robot stage are on the order of 300-1000; any more images would significantly slow down the analysis processes. Therefore, being able to render the object from novel views can greatly reduce the number of real images the database requires. Because we have a labeled set of training images, we can simulate rendering of the part from novel views given the lighting conditions the part was captured with. Assuming the lighting stays constant throughout the entire time the database is captured, we can extract the bi-
directional reflectance distribution function of every point on the surface of the object, which
will allow accurate simulations of the color. The most naive way to compute image-based
rendering is by extracting textures from every image and linearly interpolating the textures
based on the nearest neighbors of the novel view. When using such a naive method, stitching two different images can be difficult due to lighting variations; therefore, it is better to stick with the closest image. Figure 7.4 shows an example of rendering one image from a novel view by extracting each texture.
The nonlinear distortions due to the camera lens are already removed from the incoming images, so
we deal with ideal pin-hole cameras. Figure 7.5 shows several examples of objects and their
depth maps. Using depth information, we can compute the 3D location of the feature on
the object surface. Let x̄ ∈ Mf be an image point on the feature, then the corresponding
3D point is
(7.1)  $X = T^{-1} K^{-1} \begin{bmatrix} \bar{x} \\ 1 \end{bmatrix} \mathrm{Depth}(T, K)(\bar{x})$

where $T = \begin{bmatrix} R & t \end{bmatrix}$ is the object pose.
If a point feature has an image orientation $\theta$, we need to compute its orientation in the object's
coordinate system. By using the object’s surface normal n(X) at the 3D location of the
feature, we can extract a unique direction D that is orthogonal to n(X) and its projection
matches θ:
(7.2)  $D \propto \left( R^{-1} - n(X)\, n(X)^T R^T \right) K^{-1} \begin{bmatrix} \cos(\theta) \\ \sin(\theta) \\ 0 \end{bmatrix}$
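For concreteness, Equations 7.1 and 7.2 can be evaluated with a few lines of numpy. Here T is assumed to be a 4x4 homogeneous object pose, depth the value Depth(T, K)(x̄) looked up from the rendered depth map, and normal the surface normal n(X); the function name is illustrative:

import numpy as np

def backproject_feature(T, K, x_bar, depth, normal, theta):
    # Sketch of Eqs. 7.1-7.2: 3D surface point X of an image feature and the object-frame
    # direction D induced by its image orientation theta.
    R = T[:3, :3]
    ray = np.linalg.inv(K) @ np.array([x_bar[0], x_bar[1], 1.0])
    X_cam = ray * depth                            # 3D point in camera coordinates
    X = (np.linalg.inv(T) @ np.append(X_cam, 1.0))[:3]   # Eq. 7.1: into object frame
    img_dir = np.array([np.cos(theta), np.sin(theta), 0.0])
    n = normal / np.linalg.norm(normal)
    # Eq. 7.2: back-rotate the image direction and project it onto the tangent plane at X
    D = (np.linalg.inv(R) - np.outer(n, n) @ R.T) @ np.linalg.inv(K) @ img_dir
    return X, D / np.linalg.norm(D)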
Figure 7.6 shows the set of raw directions extracted from the entire training database
and the resulting mean directions extracted after filtering and clustering.

Figure 7.6: Orientations from image features are projected onto the object surface and are clustered (panels: raw directions, clusters, and the in-plane angle distribution on the object surface in radians). By re-projecting the raw directions onto the surface plane and converting them to angles (right graph), we can compute all identified angles (in this case 0° and 180°) and the standard deviation of the particular mean direction, ±15.5°.

During the pose extraction process, if a feature coming from the detector used in Figure 7.6 has a direction, we can use the clustered directions to narrow down the originating position to two possible surface positions within a $[-31°, 31°] \cup [149°, 211°]$ in-plane orientation offset.

(7.3)  $p_{F_0}(f \in F_0 \mid X)$
By pruning all positions that are less than a preset threshold, we can single out the
stable locations (Figure 7.7). Setting the correct threshold is tricky since different parts have important features at different distances from each other; the threshold should also be independent of part size. For 0D features, it is enough for the kernel to be just a function of
3D position. However, for 1D features, the direction of the extracted edges is very important
information and has to be included in the KDE:
(7.4)  $p_{F_1}(f \in F_1 \mid X, \vec{X})$.
Figure 7.7: The stability computation for a hole detector and the stably detected locations (red).
Because there is no ordering on the directions of the lines (the negative direction is just
as likely), it is recommended to train the KDE with both directions specified per sampled
point. Figure 7.8 shows the stability results of a line extractor for one part and how the line
features in one image get pruned.
Figure 7.8: Stability computation for a line detector (upper right is marginalized density for
surface). The lower images show the lines from the training images that are stably detected (red)
and those that are pruned (blue).
After many trials, we have found that the following works best: first evaluate the stability measures for all features, set the kernel bandwidth to the difference between the lower and upper quartiles of the data, and set the stability threshold to the median of the data.
Such parameters are independent of object scale, which is a prerequisite for robust thresholds.
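A minimal sketch of this stability filter for 0D features is shown below, assuming the feature locations have already been projected onto the object surface; the kernel density stands in for Equation 7.3, and the bandwidth choice following the quartile rule just described is left to the caller:

import numpy as np
from sklearn.neighbors import KernelDensity

def stable_feature_mask(points, bandwidth):
    # points: (N, 3) array of feature locations on the object surface over all training
    # images. Returns a boolean mask of stable features and the raw stability measures.
    points = np.asarray(points, dtype=float)
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(points)
    stability = np.exp(kde.score_samples(points))   # p_F0(f | X) evaluated at each feature
    threshold = np.median(stability)                # scale-independent threshold
    return stability >= threshold, stability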
Figure 7.9 shows the stability distribution and results of a hole feature detector. Surprisingly,
some of the holes marked in red on the CAD are determined to be unstable, and therefore
bad for detection. When asking a person to set the thresholds manually, it is likely to get a
Figure 7.9: Shows the statistics of a hole detector on industrial metallic parts. Each part was
trained with 300 images around the sphere. The right side shows the final filtered features, which
are used for extracting geometric words.
Figure 7.10: Shows the effect of poorly labeled data sets on the raw features (blue) and the final
filtered features (red). The filtered clusters for the correctly labeled set become much more clear.
Figure 7.11: The histogram and labels of a feature’s descriptor distribution on the object (in this
case, it is the size of the holes).
Figure 7.12: The major shape clusters extracted. The clusters on the left (simpler shapes) are
more frequent than the clusters on the right (more complex shapes).
Figure 7.13: PostgreSQL database that stores the relationships and dependencies of all offline
processes used for pose extraction and object analysis.
We use the SQL relational database query language for organizing the statistical data and
results for a specific object. Each set of algorithms and results is divided into its own table and has unique ids for instances of its parameters. It is very common for processes
to rely on the results of other processes. For example, the stability measure relies on the
feature extraction and 3D geometric reprojection results in order to build a distribution of
features. Each process also has its own generation parameters like thresholds that make
it unique. SQL databases allow such relations to be encoded, and thus we can build the information dependency graph of object data shown in Figure 7.13. The functionality of
the object database is very similar to the planning knowledge-base discussed in Chapter 4
(Figure 4.1). The database can store the computation of every process with multiple sets of
generation parameters, which allows us to easily test new parameters and keep track of stale
models because their dependencies were updated.
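The layout below is a hypothetical miniature of such a dependency-tracking schema; the thesis uses PostgreSQL (Figure 7.13), while sqlite3 is used here only to keep the sketch self-contained:

import sqlite3

# Each process table stores its generation parameters plus foreign keys to the results it
# depends on, so stale models can be found when an upstream result is regenerated.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE feature_extraction (
    id INTEGER PRIMARY KEY, detector TEXT, params TEXT, created REAL);
CREATE TABLE reprojection3d (
    id INTEGER PRIMARY KEY,
    feature_extraction_id INTEGER REFERENCES feature_extraction(id),
    params TEXT, created REAL);
CREATE TABLE stability (
    id INTEGER PRIMARY KEY,
    reprojection3d_id INTEGER REFERENCES reprojection3d(id),
    bandwidth REAL, threshold REAL, created REAL);
""")
# stale stability models: any dependency that was regenerated after the dependent result
stale = conn.execute("""
SELECT s.id FROM stability s
JOIN reprojection3d r ON s.reprojection3d_id = r.id
JOIN feature_extraction f ON r.feature_extraction_id = f.id
WHERE r.created < f.created OR s.created < r.created
""").fetchall()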
The biggest advantage of SQL is that it is a query language, which separates the database
storage from the actual code and OS used to generate the data. Creating unique ids for every
object allows us to create a global database of the appearance and statistics of objects that
users can easily query over the network. Although it would be difficult to apply this to home
scenarios where the environments are unstructured and the object types are on the order
of thousands, factories already have databases of every component that goes into their final
products. Augmenting them with the object appearance and statistics could go very far in
automating the vision processes of assembly lines handling the parts.
Figure 7.14: An example of the induced poses for a corner detector. The corner confidence ψ(f )
(top) is used to prune out observations with low confidence. The final induced poses T (f ) for a
corner detector are stored as a set (bottom).
part.
(7.5)  $T(f) = \{\, T \mid \exists g \in F(\mathrm{Render}(T)),\ \psi(g) > \gamma_{\mathrm{confidence}},\ d(f, g) < \epsilon_f,\ d(M_f, M_g) < \epsilon_M \,\}$
d denotes a distance metric corresponding to the respective inputs and Render(T ) is the
image of the object at pose T . Equation 7.5 represents the space of poses whose rendered
image Render(T) contains a feature g that matches the query feature f with
high confidence. In order to minimize space requirements, the poses themselves are nor-
malized with respect to the feature’s image position and orientation, which removes three
degrees of freedom. Figure 7.15 shows a set of poses for a specific curve. Figure 7.16 shows
the sets using mean images.
Figure 7.16: Mean images of the induced poses for an edge detector of a paper cup (in simulation).
The construction of induced poses first starts by seeding each of the visual and geometric
words with the pose information from the training data. In order to search other poses,
we use image-based rendering of novel views (Section 7.2.1). To quickly explore the pose
space around the seeds, we use the Rapidly-Exploring Random Trees algorithm [LaValle and
Kuffner (2000)], which can maintain all the constraints in Equation 7.5. It is necessary to
use RRTs for exploration of the space because the constraints really restrict the solution
space, and hitting a point in the solution space by just randomly sampling will take a very
long time.
Pairwise Poses
Unfortunately the induced pose sets for point features are very big. If a word does not have
an associated orientation, its induced set is 4 DOF. With reasonable discretization values, the set can reach 600,000 elements, which leads to memory and speed problems in the
pose search phase. In order to reduce the computation, we create a pairwise pose set by
storing the consistent poses of two point words. If none of the words have orientations, the
degree of freedom of the sets is two and the set is parameterized by two image positions.
If the first feature has an orientation, then the degree of freedom is one, and the set is
parameterized by two image positions and the relative orientation of the pairwise direction
with respect to the first image feature's orientation. If there are a total of $N$ visual words for all detectors used, then there are at most $\binom{N}{2}$ pairwise sets. Even though the number of
induced sets increases, adding pairwise poses greatly speeds up the search task and reduces
memory constraints making it a necessary step for this framework.
of induced pose sets. During run-time, we use this information to generate pose hypotheses
by picking a small random set of features and intersecting their pose sets. Due to this process,
a hypothesis is only guaranteed to be consistent with a small set of features; therefore, we
have to employ a separate pose evaluation metric that takes into account all the features that
lie in the projected object region. We first present an evaluation metric for the contribution
of individual 0D and 1D feature detectors, then we treat all the detector values as a vector
and use Adaboost to find a separating hyper-plane between correct and wrong poses.
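As a rough illustration of this run-time step (not the thesis implementation), assume every matched image feature already carries its induced pose set discretized into hashable pose bins; hypothesis generation then reduces to random sampling plus set intersection:

import random

def sample_pose_hypotheses(features, induced_pose_bins, group_size=3, trials=100):
    # Pick a small random group of image features and intersect their discretized induced
    # pose sets; induced_pose_bins maps a feature to a set of hashable pose-bin ids.
    hypotheses = []
    for _ in range(trials):
        group = random.sample(features, group_size)
        common = set.intersection(*(induced_pose_bins[f] for f in group))
        if common:                       # non-empty intersection -> candidates to evaluate
            hypotheses.append((group, common))
    return hypotheses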
Given a pose hypothesis $T = [R\ t]$, we first find all features inside the projected object region and
then compute a score using EvaluateDetector (Algorithm 7.1). Since we have the
geometry of the object, we can inverse-project each feature onto the object surface using RayCast(pos, dir) and then use the stability measures computed in Sections 7.2.3 and 7.2.4 to compute how well the feature location matches the statistical models.

Figure 7.17: The individual detector scores for good and bad poses for the object in Figure 7.3. Using Adaboost, the misclassified good poses and misclassified bad poses are marked on the data points.

If a feature f ∈ F
is on the object surface and it matches to a word in WF and the distance to the induced
pose set d(T, T (f )) is small, then it is positively supporting and its stability score is added
to the total; otherwise, it is negatively supporting and its stability score is subtracted. For
1D features, a certain percentage of the curve has to lie inside the object region for it to be
considered in the score. Once a feature is verified to lie on the surface, EvaluateDetector
first computes the matching words in WF ; if there is a match, the stability density pw of the
particular word is used, otherwise the density of the detector pF is used.
Classification
The final classifier uses EvaluateDetector to convert all the image features into a D-
dimensional detector space and then uses boosting [Pham and Cham (2007)] to separate the space of correct and wrong pose statistics. The detector space greatly depends on the $\alpha_F$
parameters in EvaluateDetector (Algorithm 7.1), which are dependent on the feature
detector. Because the final goal is to build a D-dimensional space that easily separates
the correct poses from the wrong, we learn the evaluation parameters while simultaneously
learning the separation criteria via boosting. The evaluation parameter space α is highly
nonlinear, so we use Markov Chain Monte Carlo (MCMC) optimization to search the solution
space. For every new state α queried by MCMC, we compute the best AdaBoost classifier
and set the probability of the Markov state as $e^{-\epsilon'}$, where $\epsilon'$ is the classification error of the new state. The MCMC state transitions to the newly sampled state with a probability of $e^{\gamma(\epsilon - \epsilon')}$.
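A compact sketch of this training loop is shown below, using scikit-learn's AdaBoost as the boosted classifier; compute_detector_space, the step size, gamma, and the iteration count are all hypothetical stand-ins:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def optimize_evaluation_params(compute_detector_space, labels, alpha0,
                               gamma=5.0, iters=200, step=0.05, rng=None):
    # MCMC over the evaluation parameters alpha: each proposal re-builds the D-dimensional
    # detector space, fits AdaBoost, and is accepted with probability exp(gamma*(err-err_new)).
    rng = rng or np.random.default_rng(0)

    def error(alpha):
        X = compute_detector_space(alpha)
        clf = AdaBoostClassifier(n_estimators=50).fit(X, labels)
        return 1.0 - clf.score(X, labels), clf

    alpha = np.asarray(alpha0, dtype=float)
    err, clf = error(alpha)
    best = (err, alpha.copy(), clf)
    for _ in range(iters):
        proposal = alpha + rng.normal(scale=step, size=alpha.shape)
        err_new, clf_new = error(proposal)
        # Metropolis-style acceptance: always keep improvements, sometimes keep worse states
        if rng.random() < np.exp(gamma * (err - err_new)):
            alpha, err = proposal, err_new
            if err < best[0]:
                best = (err, alpha.copy(), clf_new)
    return best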
Figure 7.18: For an object that has four holes, there are many valid poses of the object in the real
image that can hit all four holes. In fact, using hole features is not enough to extract a confident
pose; 1D curve features are absolutely necessary.
with the features. If the intersection is not empty, then we evaluate every valid pose T by
first getting the D-dimensional vector in the detection space, and then running the learned
Adaboost classifier h. It is important to note that the search process simultaneously considers
all the consistent pose hypotheses within a given discretization limit.
Figure 7.20 shows an example of transforming the induced pose sets into pose hypotheses
for particular image features. The first row shows the pose hypotheses for two point features
and the resulting set of poses when the hypotheses are intersected. Even though the number
of consistent poses is greatly reduced, it is still not enough to completely constrain to one
pose. The second row shows the same process except with curves; the pose hypotheses for one
curve are much smaller, therefore they constrain the pose more. Finally by choosing two
curves and two points that lie on the part, we can completely constrain the pose to the
correct one. Although the method is very effective when choosing a set of features that all
lie on the same part, RANSAC will more frequently choose image features that do not lie on the same part. Because the number of chosen features is three or four, the intersected pose sets
Figure 7.20: Several examples of chosen image features (blue) and their pose hypotheses (in
white). The bottom row shows the positively (green) and negatively (red) supporting features of
the pose hypothesis (blue).
Algorithm 7.2: T ← SearchWithSupport(support)
1  Neighs ← FeatureNeighborhood(support)
2  for F = Sample(C(Neighs, K)) do
3      for T ∈ ⋂_{f∈F} T(f) do
4          V ← [ EvaluateDetector(T, F^1), EvaluateDetector(T, F^2), ..., EvaluateDetector(T, F^D) ]
5          if h(V) ≥ 0 then
6              return T
7      end
8  end
9  return ∅
are very likely to be non-empty. In fact, most of the processing time of the search algorithm
is rejecting invalid pose hypotheses because the chosen features do not lie on the same part.
In order to speed up the search, the evaluation function is written so as to quickly reject a hypothesis based on the order of features it looks at, without having to evaluate all the features. The final row shows the positively and negatively supporting features determined from the pose evaluation phase. The correct pose has mostly positively supporting features and therefore passes the test. The incorrect poses pass through a lot of wild features that do not support them; therefore, most of the features they touch subtract from the final evaluation score.
7.3.4 Experiments
We have successfully tested the induced-set pose extraction method on metallic industrial parts (Figure 7.21). The training process is time consuming and currently takes on the order of a day for 300 training images. Using a hole and line feature detector, extracting the initial 3D geometry of the features takes about one hour. Computing the stability measure takes 10-20 minutes per feature. Clustering the filtered hole features into the visual and geometric words took minutes. Clustering the filtered curves takes 10-20 hours when using ICP softassign; we later reduced this to less than 30 minutes. Building the induced pose sets takes on the order of 5 hours per feature, where the average time used to search each word is 5 minutes.
The goals were to increase detection rates and quality of the result rather than increase
speed, so we set small discretization factors for most of the generation stages in the algorithm.
Figure 7.21: Results of pose extraction using induced pose sets.
The detection process for the part shown in Figure 7.20 takes 2 minutes on a multi-core
computer system running four threads. The bulk of the time is spent matching image
features to the database of words. After all the features are matched and pose hypotheses are built, it takes 10-20 s to detect the first pose in the image when RANSAC samples four features at a time; a sampled combination is never tested twice. If the four sampled features are
on the same part, then our method will immediately find the correct pose; however, most
combinations of features involve more than one part. Therefore, the evaluation function
(Algorithm 7.1) was optimized for early rejection of hypotheses, being able to evaluate a
pose in less than 0.05 seconds. For a full bin, the success rate is greater than 98%.
Because we are interested in picking up all parts from the bin, we put 100 parts in the
bin and took out a part after it was detected by the system. Because the average height and
number of candidates for detection slowly decreased as we removed parts, the detection rate
started dropping. Of the 100 parts, there were 5 times that the program could not detect a
part.
We measured the ground truth error in each of the six degrees of freedom of the pose
by mounting the part to high-precision linear and rotary stages. The camera is 2 meters
above the part, so we would expect very large errors in Z. The part first starts with its
Figure 7.22: Ground-truth accuracy results gathered for each DOF of the pose.
Figure 7.23: Some geometric words cluster very densely around specific points on the object
surface. These stable points can be used for alignment.
biggest face pointing towards the camera and the system records the first pose (Figure 7.22
top). For every delta movement the stage makes, the new detected part should have the
same delta movement. The deviations from all delta movements of all degrees of freedom
are recorded and shown in Figure 7.22. The in-plane movements of the camera were the
most accurate, with the average raw translation error being 2.1mm, and the average raw
rotation error being 6.1°. The out-of-plane rotation error is on average 17°; we believe this is due to the random choice of supporting features for building up the pose hypothesis. In
order to speed up the evaluation function, we train the evaluation function in the learning
phase to pass semi-good results also. Because the evaluation function directly looks at the
projected footprint of the part, the larger the change in footprint, the more chances it has to
intersect with a bad feature. Out-of-plane rotations and changes in depth do not change the footprint of the part that much; in fact, they just make it smaller or bigger without offsetting it much, and our ground-truth measurements are consistent with this observation. It is interesting to note that the rotations around the Y-axis have a slightly smaller error than those around the X-axis. The reason can again be explained with the footprints: since the part is twice as long along its X axis as along its Y axis, rotations around Y move the footprint more.
Pose Alignment
Because of the inherent discrete sampling nature of the supporting features and the fact that the evaluation function is trained with slight pose disturbances, the raw pose measurements returned from the algorithm are not going to match the projection of the part perfectly. Therefore, we introduce a post-processing step that aligns the part to the 0D features around it. Basically, some geometric words of the 0D point features cluster densely around specific points on the object surface, so the mean of such a cluster can be extracted with very high confidence as shown in Figure 7.23. If the detected part uses any of these geometric words in its matching, then we have the 2D image positions of their measurements as well as their 3D coordinates. Therefore, we can use a 2D/3D point correspondence algorithm to find a new pose that perfectly aligns the 3D geometric word points to the 2D image points. The degrees of freedom we can fix depend on the number of point correspondences.
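The thesis does not spell out the alignment equations; the sketch below shows one plausible handling of the single-correspondence case, which can only fix the in-plane translation (consistent with the XY improvement reported next). All names are illustrative:

import numpy as np

def align_translation_xy(T, K, X_obj, x_img):
    # Hypothetical single-correspondence alignment: shift the pose translation so the 3D
    # geometric-word point X_obj (object frame) projects exactly onto its detected image
    # point x_img, keeping the original depth. T is a 4x4 object-to-camera pose.
    X_cam = (T @ np.append(X_obj, 1.0))[:3]          # word point in camera coordinates
    depth = X_cam[2]
    target = depth * (np.linalg.inv(K) @ np.array([x_img[0], x_img[1], 1.0]))
    T_new = T.copy()
    T_new[:3, 3] += target - X_cam                   # translate so the projections coincide
    return T_new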
The even rows in Figure 7.22 show the improvement from using the alignment. We always have at least one geometric point to align, so the XY errors decrease to 1.1 mm average error, a 48% reduction. The next parameters to see a benefit are the Z depth, with an average error of 22.2 mm (30% reduction), and the in-plane angle, with an average error of 5.2° (15% reduction). The out-of-plane rotations were rarely affected, mostly because we rarely had more than 2 point correspondences to align. The alignment phase is very quick to implement and practical for increasing accuracy.
7.4 Discussion
Every pose extraction algorithm needs a set of features on the object to track and base its
estimates on. The more discriminable the feature is compared to other features in the scene,
the more easily the object can be detected. Furthermore, the stability of the feature on the
object's surface directly relates to how well it can be localized within the object coordinate system, which implies that the pose estimates will be more accurate. In the first half of this
chapter, we presented an object surface analysis method that can find stable and discrim-
inable features. The training data for the method requires attaching the object of interest
to an already known calibration pattern, which makes the method accessible to anyone.
The power of having accurately labeled training data allows for a plethora of feature-surface
statistics to be computed. In order to have the stability statistics be coordinate system and
scale independent, we motivated the usage of the median and difference of lower and upper
quartiles of the stability measures for filtering out bad features. The filtered features are
then clustered to yield a set of geometric and visual words specific to the appearance of the
object. The analysis is general enough to be applied to any point or curve feature, even if
it has a descriptor vector associated with it. Any rigid object of any material can be easily
inserted and analyzed by the system. Furthermore, the generality of the method allows it to
be used as a pre-processing step in any existing pose extraction algorithm.
We also performed a case study of organizing the computational process of the stability
using a modern relational database. The database has many practical advantages of tracking
data dependencies and can offer access to the data to other parties over the network. Using
this database, we could easily train new objects into the system without worrying about
tracking generation parameters and files. We believe that object-specific databases encoding
vision information will become mainstream in industrial settings where every part has to be
categorized and carefully tracked around a factory.
In the second half of the chapter we presented a novel pose extraction method that
takes advantage of the visual and geometric words computed from the stability database.
Because each word stores the poses of the object that were used to generate it, it gave rise to the formulation of induced poses, which inverts the feature generation process. Instead of the object's pose generating the feature, a feature can generate the possible poses it came from, a process very reminiscent of the inverse kinematics problem in motion planning. Induced-set pose extraction has the advantage of not relying on 2D/3D feature correspondences because it implicitly encodes the 3D feature information into a pose set estimate; this allows it to hypothesize all possible consistent poses given a set of image features. By being
able to explicitly encode all consistent poses, the induced-set algorithm is not forced to
commit early to a wrong decision anywhere in the process.
As long as the image features belong to the same object, induced pose sets can find
a correct pose that supports all the features. However, when multiple parts are scattered in a bin, the search process has a lot of difficulty picking features that belong to the same part. Therefore, we also presented a learning-based pose evaluation method using positively
and negatively supporting features. Because of the explicit handling of negative supporting
features, small occlusions from other objects can really decrease the evaluation score, which
guarantees that the detected parts are always completely visible and on the top of the pile.
This behavior is desirable for industrial bin-picking scenarios because a robot eventually has
to approach from the top and be able to pick up the part unhindered by other obstacles.
Furthermore, we can use this behavior in a second way: if the system is asked to evaluate a
pose of an already known part, the system can determine if the part is occluded and exactly
where the occlusions occur.
Chapter 8
Conclusion
Even today in 2010, robots are not freely roaming the streets and autonomously moving in
people-present environments because of frequent execution failures and safety fears. Espe-
cially in industrial robotics, execution failures come at the highest cost where an assembly
line has to be stopped in order to recover from errors. Although it is possible to design such
systems, it takes a lot of manpower and research to meet all the standards and handle all
the possible conditions. With today’s understanding of the problems, the mathematics and
low-level robot control issues of manipulation are well-understood and automated; however,
the less automated parts of manipulation continue to be the generation of more analysis-
heavy components like object recognition and motion planning. By clearly identifying and
showing how to automate several of the analysis-heavy components for pick-and-place ma-
nipulation tasks, we can enable many mass-production systems to be configured and put
into use quicker, which has far-reaching implications for today’s economy.
8.1 Contributions
As a small step towards these goals, the thesis presented a framework for automated con-
struction of the planning and vision processes required for reliable manipulation tasks. The
focus was specifically on the geometric and statistical analyses of sub-problems common in
manipulation. We sub-divided the problem into a planning knowledge-base and vision-based
database and offered many generic algorithms that help in the entire spectrum of manipula-
tion planning. Chapter 2 starts with the outline of the execution architecture for completing a simple manipulation task and maps out all the components necessary for reliable execution.
Components that play a major role in the discussed architecture are: the goal configuration
generators that analyze a scene and send informative configuration goals to the motion plan-
ners, the knowledge-bases that encode frequently queried information of the robot and task,
and planning with sensor visibility. Using the layout of the architecture, Chapter 3 delved
into the complexities of the planning algorithms necessary to connect the initial conditions
and the goals. It presented the structure of generalized goal configuration samplers that
form the basis for representing the goal space of a plan. Several types of planning algorithms
were presented: planners having explicit goal spaces that can be sampled from, planners
whose goal condition is to validate a known path, and planners that can change grasps
during a plan. Chapter 4 presented the structure of the planning knowledge-base and offered
algorithms for the generation and usage of more than seven types of components that are
critical for manipulation planning. By modeling the computational dependencies of each
component, we can be methodical about how information gets tracked throughout the system, and about what domain knowledge is necessary for the generalized planners. Chapter
5 presented algorithms that consider the visibility of the sensors and offered two methods
of combining the visibility planning with the goal-oriented grasp planning algorithms. It
showed the importance and efficiency of having a camera sensor verify any object position
before the system can begin planning to grasp it. Two experiments were presented: a mobile
dual-arm humanoid that picks up cups from a sink and places them on a counter, and an
industrial robot that picks up parts scattered in a bin. Chapter 6 presented a completely
automated extrinsic camera calibration system using the planning and visibility theories
presented thus far. An advantage of the calibration system is that it can work in any environ-
ment and does not have to be initialized with any accurate measurements of the parameters.
In order to verify the quality of the calibration results, we presented a measure based on how
well each of the parameters are constrained by the data. Chapter 7 discusses two topics of
interest in object-specific pose recognition: extracting stable and discriminable features, and
induced-set pose extraction using a novel voting-based method that maps image features to
pose hypotheses. It presented an automated data gathering method that allows a statistical
analysis of object feature distribution on the surface. The power of the induced-set method
is that pose hypotheses can be easily generated from any image. Furthermore, the pose
verification function can be learned from training data, providing a way to automatically set thresholds and weights. Results were shown on a very difficult industrial manipulation scene and we demonstrated millimeter accuracy of the pose. Appendix A discussed the OpenRAVE archi-
tecture and the design decisions that allowed it to become a stable and reliable platform.
OpenRAVE consists of a core that provides a safe environment for users, and an interface API
that allows users to expand on the functionality without having to recompile the base code.
OpenRAVE provides many key technologies that allow successful manipulation execution
with real robotics, some of the most critical ones like padding and jittering were discussed
in detail.
The specific contributions of the thesis are:
• A small set of manipulation planning algorithms based on the efficient search properties
of the Rapidly-exploring Random Trees formulation. These algorithms allow grasp
planning, mobile manipulation planning, and planning with a gripper camera. Chapter
3.
• A planning knowledge-base that identifies frequently used information in the planners
and caches its results in a database. The database allows components to track their
computational dependencies, which makes it easy to keep track of what needs to be
recomputed when the robot and task specifications change. We showed how to use all
this information to quickly plan for paths. Chapter 4.
• An algorithm named ikfast that solves the analytic inverse kinematics equations of
common kinematics. Unlike other proposed methods based on advanced mathematics,
ikfast searches for the most computationally efficient and numerically stable solution.
It can easily handle all degenerate cases. Section 4.1.
• Several grasping analyses for force-closure and caging grasps that can be used directly
in manipulation. For force-closure grasps, we presented a grasp space parameterization
method and a new repeatability measure evaluation for pruning fragile grasps. For
caging grasp sets, we showed how to expand the possible ways to grasp doors to allow
manipulators of low DOF to easily achieve their tasks. Section 4.2.
• A 6D kinematics reachability formulation that allows for informative samplers and an
inverse reachability map for computing distribution of base placements. This formula-
tion was the basis for a new map which we call grasp reachability that combines grasp
sets and base placement distribution for a simple way of determining where a robot
should be placed to grasp an object. Sections 4.3, 4.4, 4.5.
• Motivated the usage of convex decompositions for padding the robot for safely moving
across the environment, pruning range data, and computing a new distance metric
based on the swept volumes of the robot links. Section 4.6.
• A general goal configuration generator that uses the planning knowledge-base to quickly
analyze the scene and seed planners with possible goals that help achieve the task. We
showed that goal generators have the most impact in a planning algorithm, and have
turned the bulk of the manipulation planning research into quickly computing the goal
space. Section 3.1.
• Algorithms that can quickly sample robot configurations so attached sensors clearly see
the target objects of the task. We presented a data-driven object detectability extents
model that allows a robot to know what regions of the object are detectable. We also
presented a two-stage method of first viewing the object with a gripper camera before
attempting to grasp it; we argued that this method is as fast as regular grasp planning.
Sections 5.1, 5.2.
• A simple visual feedback method that uses the visibility of the object while simulta-
neously choosing the best grasp for that object. This allows the robot to compensate
for big rotations of the object. Section 5.3.
• A completely automated extrinsic camera calibration method using the planning and
visibility sampling theories developed in the manipulation framework. We also pre-
sented a new method of computing the confidence on the gathered data so a robot can
determine when to stop gathering data. Chapter 6.
• For vision-based object analysis, we presented a feature-surface statistical analysis method that extracts stable and discriminable features and generates a set of
visual and geometric words for the object based on the feature detectors used. These
features can be used in any pose-extraction algorithm when determining what to track.
Section 7.2.
• For vision-based pose recognition, we presented a novel pose extraction algorithm that
directly maps image features to a set of pose hypotheses of the object. These hypotheses
are used in a RANSAC algorithm to compute poses from a bin of scattered metallic
objects. Section 7.3.
• The OpenRAVE environment that integrates all the components discussed in this thesis. OpenRAVE allows rapid development of planning algorithms and provides many organizational structures for the database and robot information. Appendix A.
• A set of general guidelines for defining the lowest level of manipulation autonomy.
Section 2.5.
Figure 8.1: The construction process can be treated as an offline compiler that converts the
specifications into run-time programs while optimizing the speed of its execution.
• Have a plugin-based architecture that allows users to expand its functionality without
having to recompile the base code. Most functionality should be offered as plugins,
thus keeping the core as simple as possible.
• Offer many motion planning algorithm implementations that can be easily extended
to new tasks.
• Make it easy to debug components at run-time without having to recompile or restart the entire system, so that the in-memory environment state is not flushed.
• Allow the OpenRAVE core to be used as a simulation environment, as a high-level
scripting environment, as a kinematics and dynamics backend module for robot con-
trollers, or as a manipulation planning black box in a distributed robotics environment.
• Allow simple offline planning database generation, storage, and retrieval.
• Support a multi-threaded environment and allow easy parallelization of planners and
other functions with minimal synchronization required on the user side.
One of OpenRAVE’s strongest points when compared with other planning packages is that algorithms can be applied to any scenario with very little modification.
Users of OpenRAVE can concentrate on the development of planning and scripting aspects
of a problem without having to explicitly manage the details of robot kinematics, dynamics,
collision detection, world updates, sensor modeling, and robot control.
We start with the OpenRAVE core design and discuss the programming models that it makes possible. Within the context of programming paradigms, we present a new layer of functionality that all robotics architectures should implement, one that goes beyond the basic kinematics, collision detection, and graphics interface requirements of classic robotics libraries. OpenRAVE provides a set of interfaces that let users modify existing functions and expand OpenRAVE-enabled modules without having to recompile OpenRAVE or deal with messy monolithic code-bases. We go through each interface’s design and its usage within the entire system. Furthermore, we discuss how OpenRAVE is used with real robotics systems and motivate several key functions that make it possible for planning-enabled robots to work consistently in a continuously changing and unpredictable environment. We conclude with future work and other lessons learned in robotics architectures.
A.1 Architecture
Figure A.1 shows the interaction of the four major layers composing the architecture:
Figure A.1: The OpenRAVE Architecture is composed of four major layers and is designed to be used in four different ways.
• Core Layer. The core is composed of a set of interface classes defining how plugins share information, and it provides an environment interface that maintains a world state, which serves as the gateway to all functions offered through OpenRAVE. The global state manages the loaded plugins, multiple independent environments, and logging. On the other hand, the environment combines collision checkers, viewers, physics engines, the kinematic world, and all its interfaces into a coherent robotics world state.
• Plugins Layer. OpenRAVE is designed as a plugin-based architecture, which allows users to create new components that continuously improve its original specifications. Each plugin is an implementation of a standard interface that can be loaded dynamically without recompiling the core. Following this design, different kinds of plugins can be created, such as sensors, planners, controllers, or physics engines. The core layer communicates with the hardware through the plugins, using robotics packages better suited to that purpose such as Player [Gerkey et al (2001)] and the Robot Operating System (ROS) [Quigley et al (2009)]. Using plugins, any planning algorithm, robot controller, or sensing-based subsystem can be distributed and dynamically loaded at run-time; this distributed nature frees developers from struggling with monolithic code-bases.
• Scripting Layer. OpenRAVE provides scripting environments for Python and Octave/Matlab. Python communicates with the core layer directly through in-memory calls, making it extremely fast. The Octave/Matlab scripting protocol, on the other hand, sends commands through TCP/IP to a plugin offering a text server on the core side. Scripting allows real-time modifications to any aspect of the environment without requiring a shutdown, making it ideal for testing new algorithms. The Python scripting is so powerful that most of the examples and demo code are offered through it. In fact, users should treat the scripting language as an integral part of the entire system, not as a replacement for the C++ API.
• Robot Database Layer. Implements the planning knowledge-base covered in Chapter 4 and provides simple interfaces for accessing it and setting its generation parameters. The
database itself mostly consists of kinematic, quasi-static, dynamic, and geometric anal-
yses of the robot and the task. If the robot is defined properly, then all these functions
should work out of the box.
The main API is coded in C++ using the Boost C++ libraries [Dawes et al (1998-present)] as a solid basis for low-level management and storage structures. The Boost flavors of shared pointers allow object pointers to be safely reference counted in a heavily multi-threaded environment. Shared pointers also allow handles and interfaces to be passed to the user without ever having to worry about the user calling upon invalid objects or unloaded shared objects. Furthermore, OpenRAVE uses functors and other abstractions commonly seen in higher-level languages to specify function pointers for sampling distributions, event callbacks, setting the robot configuration state, etc. The Boost-enabled design makes the C++ API safe and reliable to use and saves users a great deal of bookkeeping on their end. Furthermore, it allows the Resource Acquisition Is Initialization (RAII) design pattern [Stroustrup (2001)] to be fully exploited, allowing users to ignore the complexities of multi-threaded resource management.
A.1.1 Environment
The OpenRAVE state is divided into a global state and an environment-dependent state. The
global state holds information common to all threads, all environments, and all interfaces.
The environment state holds information and settings just for the objects and interfaces
created in the environment. Environment states are independent of each other, whereas global states affect everything. At initialization, every interface is bound to one specific environment, which thus becomes the only way for it to create other interfaces and query objects. The environment manages:
Locking
Because OpenRAVE is highly multi-threaded, environment state such as bodies and loaded interfaces can be accessed simultaneously from several threads. In order to safely read or write this state, a user has to lock the environment, which prevents any other thread from modifying it while the user is working. Because the lock is recursive, it can be acquired as many times as needed within the same thread, greatly reducing lock management when one state-changing function calls another state-changing function. This safety measure helps users by guaranteeing that the environment is locked when calling global environment functions like creating new bodies or loading scenes, regardless of whether the user has already locked it. However, directly accessing bodies and robots without holding the environment lock is dangerous.
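As a minimal sketch of how this looks from the Python scripting layer, where the with statement on an environment acquires and releases the recursive lock (the scene file is one of the examples shipped with OpenRAVE):

from openravepy import Environment

env = Environment()
env.Load('data/lab1.env.xml')        # example scene from the distribution
robot = env.GetRobots()[0]

# "with env:" acquires the recursive environment lock, so the simulation
# thread cannot modify the world state while we read and write it.
with env:
    values = robot.GetDOFValues()    # safe read of the joint values
    robot.SetDOFValues(values)       # safe write
# the lock is released here; the simulation thread may update the state again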
Simulation Thread
Every environment has an internal time and a simulation thread directly attached to a physics engine. The thread is always running in the background and periodically steps the simulation time by a small delta for the physics engine and all the simulation-enabled interfaces. Because the thread is running by default, it can potentially modify the environment state at any moment; therefore, users always need to explicitly lock the environment whenever touching internal state, such as setting joint values or link transformations. If they are not careful, the controller or physics engine will overwrite these changes. By default, the simulation thread just sets the object positions depending on their controller inputs, but a physics engine can be attached to integrate velocities, accelerations, forces, and torques.
The simulation thread can feel like a nuisance at first, but its division of robot control into control-input computation and execution lets users concentrate on feeding commands to the robot without worrying about the simulation loop. It also allows a world update to happen in one discrete time step.
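A short sketch of controlling the simulation thread from Python; we assume the StopSimulation and StartSimulation environment methods behave as described above:

from openravepy import Environment

env = Environment()
env.Load('data/lab1.env.xml')

# Pause the background simulation thread while setting up the scene so that
# controllers and the physics engine cannot overwrite the state being edited.
env.StopSimulation()

# ... lock the environment and modify bodies here ...

# Resume stepping; each step advances the internal clock by the given delta
# (in seconds) and updates all simulation-enabled interfaces.
env.StartSimulation(0.001)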
Cloning
One of the strengths of OpenRAVE is in allowing multiple environments to work simul-
taneously in the same process. Environment cloning allows OpenRAVE to become truly
parallel by managing multiple environments and running simultaneous planners on top of
them. Because there is no shared state between the clone and the original environment, it is not possible to use an interface created in one environment in another. For example, if a planner is created in one environment, it should be used only by objects in that environment.
It is not possible to set a planner to plan for objects belonging to a different environment.
This is because a planner will lock the environment and expect the objects it controls to be
exclusively under its control.
Because the environment state is very complex, the cloning process can control how
much of it gets transferred to the new clone. For example, all existing bodies and robots
can be cloned, their attached controllers can be cloned, the attached viewer can be cloned,
the collision checker state can be cloned, and the simulation state can be cloned. Basically
the clone should be able to perform any operations that can be done with the original
environment without any modification in the input parameters.
When working with real robots, one extremely important feature that OpenRAVE cloning offers is the ability to maintain a real-time view of the world that sensors continuously update with new information. When a planner is instantiated, it should make a copy of the environment that it can exclusively control without interfering with the updating operations. Furthermore, the real-world environment may have robot controllers connected to real robots; setting simulation controllers in the clone guarantees robot safety while planning, because commands issued in the cloned environment are never accidentally sent to the real robot.
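A minimal sketch of this pattern through the Python bindings, assuming the CloneSelf call and CloningOptions flags described here; the planner-side code is elided:

from openravepy import Environment, CloningOptions

env = Environment()                  # the continuously updated world
env.Load('data/lab1.env.xml')

# Clone only the bodies and robots into an independent environment that a
# planner can lock and modify without touching the original world or its
# controllers.
with env:
    planning_env = env.CloneSelf(CloningOptions.Bodies)

planning_robot = planning_env.GetRobots()[0]
# ... run planners on planning_robot; the original environment keeps updating ...
planning_env.Destroy()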
All other information is independent of the environment and can be categorized into the
kinematics, geometry, and dynamics of the body. Furthermore, robots have categories for
attached sensors and manipulators. The planning knowledge-base stores all cached informa-
tion about a body and a robot, so it needs a consistent way of indexing this information. Indexing by robot names is not reliable because users cannot be expected to change the name every time the body structure changes. Therefore, OpenRAVE provides func-
tionality to serialize the different categories of a body and create a 128-bit MD5 hash. Each
of the models in the planning knowledge-base relies on different categories of the robot. For
example:
• inverse kinematics generation only uses the kinematics of a sub-chain of the robot
defined by the manipulator and the grasp coordinate system,
• kinematic reachability cares about the robot geometry of the manipulator because it
implicitly stores self-collision results,
• inverse reachability further uses the links connecting the base robot link to the base
manipulator link,
• grasping cares about the geometry of the target body and the kinematics and geometry
of the gripper,
• convex decompositions only care about the link geometry, and
• inverse dynamics cares only about the dynamics properties of each link and the kine-
matics.
There are several challenges to developing a consistent index across all operating systems
and compilers since floating point errors could creep in when normalizing floating-point
values. However, the idea of such an index could greatly help in developing a worldwide
robot database that anyone can use.
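A small sketch of querying these hashes through the Python bindings, assuming the hash accessors exposed by the KinBody, Robot, and Manipulator interfaces; the robot file is an example model from the distribution:

from openravepy import Environment

env = Environment()
env.Load('robots/barrettwam.robot.xml')
robot = env.GetRobots()[0]

# Hash over the kinematics and geometry of the body; it changes whenever the
# link and joint structure changes, so cached models are invalidated.
print(robot.GetKinematicsGeometryHash())
# Robot-level hash that additionally covers manipulator and sensor definitions.
print(robot.GetRobotStructureHash())
# Per-manipulator hash used when indexing inverse kinematics solvers.
print(robot.GetActiveManipulator().GetKinematicsStructureHash())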
A.2 Interfaces
An interface is the base element from which every extension to OpenRAVE functionality is derived. An interface always comes from a plugin and is owned by an environment. Each of the interface types attempts to package a closed set of functions that are commonly used in robotics architectures. However, it is impossible to predict what all users need and how technology will evolve, so each interface provides a flexible way to receive commands from users in the form of streams. Each stream starts with a command name and then specifies arguments, much like a function call or remote procedure call, except that the stream does not enforce any format requirements. Furthermore, each interface can contain extra annotations for holding parameters or other configuration-specific information. This allows users to set parameters in an XML file when initializing the interface. The interfaces themselves manage all the low-level library and plugin information, so users are guaranteed that an interface remains valid as long as its reference is held.
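As a hedged sketch of the command-stream idea, assuming the RaveCreateModule, AddModule, and SendCommand calls of the Python bindings, the BaseManipulation plugin, and an illustrative MoveManipulator command string (the exact argument format depends on the module):

from openravepy import Environment, RaveCreateModule

env = Environment()
env.Load('data/lab1.env.xml')
robot = env.GetRobots()[0]

# Create a module interface from a plugin and attach it to the environment.
basemanip = RaveCreateModule(env, 'BaseManipulation')
env.AddModule(basemanip, robot.GetName())

# A command is a free-form text stream: a command name followed by arguments.
# The goal values below are purely illustrative.
output = basemanip.SendCommand('MoveManipulator goal 0 0 0 0 0 0 0')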
A.2.1 Kinematics Body Interface
Each kinematics body can be thought of as the base geometric object in OpenRAVE. It is
composed of a collection of rigid-body links connected by joints. The kinematics form a graph of joints and links; there is no enforced hierarchy. The basic body provides:
• Manipulator - Every robot supports a list of manipulators that describe the links the
robot should use when manipulating parts of the environment. Usually manipulators
are serial chains with a Base link and an End Effector link. Each manipulator is also
decomposed into two parts: the arm and the hand. The hand usually makes contact
with the objects while the arm transfers the hand to its destination. The manipulator
class also has an optional pointer to an inverse kinematics solver providing inverse
kinematics functionality.
• Active Degrees of Freedom - When controlling and planning for a robot, it is possible to set the degrees of freedom that should be used. For example, consider planning with a humanoid robot. There should be an easy way to specify to the planners that the planning configuration space consists of only the right arm while keeping the rest of the joints the same. Or consider the case where we care about navigation of the humanoid robot. Here we would want to control the translation of the robot on the plane and its orientation. Perhaps we want to do footstep planning and also care about controlling the two legs. All this is possible with the active degrees of freedom feature provided by OpenRAVE (a short usage sketch follows this list).
Figure A.2: Distance queries.
• Grabbing bodies - It is possible for a robot to attach a kinematics body onto one
of its links so that when the link moves, the body also moves. Because collision
detection will stop being checked between the robot and the body, you could say that
the body becomes a part of the robot temporarily. This functionality is necessary for
manipulation planning. Whenever the robot is carrying a body, all collisions between
the robot and that item should be ignored once the body has been grasped.
• Attached Sensor - Any number of sensors can be attached to the robot’s links. The sensor transformations are completely owned by the robot: whenever a robot link is updated, the attached sensors are automatically moved to the new location.
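The following sketch shows both the active degrees of freedom and the grabbing functionality from Python; the body name 'mug1' is assumed to exist in the example scene:

from openravepy import Environment

env = Environment()
env.Load('data/lab1.env.xml')
robot = env.GetRobots()[0]
target = env.GetKinBody('mug1')      # assumed body name in the example scene

with env:
    # Restrict planning and control to the arm joints of the active
    # manipulator; planners will only see these degrees of freedom.
    manip = robot.GetActiveManipulator()
    robot.SetActiveDOFs(manip.GetArmIndices())

    # Attach the target to the gripper: it now moves with the end-effector
    # link and collisions between the robot and the target are ignored.
    robot.Grab(target)
    # ... transfer planning ...
    robot.Release(target)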
Because all collision checkers implement the same interface, comparing them is very simple. In fact, this allows bugs to be caught between collision checkers, because a user can quickly switch to another checker to confirm the behavior.
• 6D pose - translation and rotation defining the full coordinate system of the end-effector.
• 3D translation - the end-effector origin should reach a particular point.
• 3D rotation - the end-effector should achieve a particular rotation.
• 3D look-at direction - a defined direction on the end-effector coordinate system should look at a point.
• 4D ray - a defined ray on the end-effector should align with a destination ray.
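A minimal sketch of requesting one of these parameterizations through the inverse kinematics database generator (here the full 6D pose type), using the openravepy databases module described for the Robot Database Layer:

from openravepy import Environment, IkParameterization, IkFilterOptions, databases

env = Environment()
env.Load('data/lab1.env.xml')
robot = env.GetRobots()[0]
manip = robot.GetActiveManipulator()

# Generate (or load a cached) analytic ikfast solver for the 6D pose type.
ikmodel = databases.inversekinematics.InverseKinematicsModel(
    robot, iktype=IkParameterization.Type.Transform6D)
if not ikmodel.load():
    ikmodel.autogenerate()

# Query a collision-free solution for the current end-effector pose.
with env:
    solution = manip.FindIKSolution(manip.GetEndEffectorTransform(),
                                    IkFilterOptions.CheckEnvCollisions)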
2. Fill a planner parameters structure defining the instance of the problem. The struc-
ture has many fields for describing planning entities like start position, goal condition,
and the distance metric. Try to use these fields as much as possible. Later on, this will
allow users to easily swap planners without having to change the parameters structure
much.
3. Initialize the planner using the robot and planner parameters. This also resets any
previous information the planner had stored.
4. Plan for a path, passing in a trajectory pointer for the output. If the function returns true, then the trajectory will be filled with the geometric solution in the active DOF configuration space of the robot. It is possible to preserve the planner’s previous search space when only the goal conditions change, reusing the previous computations (a minimal end-to-end sketch follows).
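The four steps map to the Python bindings roughly as follows; the goal values are illustrative and may need to be replaced with a reachable, collision-free configuration:

from openravepy import (Environment, Planner, RaveCreatePlanner,
                        RaveCreateTrajectory)

env = Environment()
env.Load('data/lab1.env.xml')
robot = env.GetRobots()[0]

with env:
    robot.SetActiveDOFs(robot.GetActiveManipulator().GetArmIndices())

    # 1. Create a planner interface from a plugin (here the bidirectional RRT).
    planner = RaveCreatePlanner(env, 'birrt')

    # 2. Fill the planner parameters structure.
    params = Planner.PlannerParameters()
    params.SetRobotActiveJoints(robot)                  # configuration space
    params.SetGoalConfig([0.0] * robot.GetActiveDOF())  # illustrative goal

    # 3. Initialize the planner with the robot and the parameters.
    planner.InitPlan(robot, params)

    # 4. Plan, filling a trajectory in the robot's active DOF space.
    traj = RaveCreateTrajectory(env, '')
    success = planner.PlanPath(traj)

if success:
    robot.GetController().SetPath(traj)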
Planning Parameters
All the information defining a planning problem should be specified in a planner parame-
ters structure, which attempts to cover most of the common data mentioned in Chapter 3.
The fields to set are:
• Cost Function - takes in a configuration and outputs a single value showing the cost
of being in this region.
• Goal Function - takes in a configuration and outputs a value showing proximity to
the goal.
• Distance Metric - distance between two configurations.
• Constraint Function - takes in a previous and new configuration and projects the new
configuration to satisfy problem specific constraints. Returns false if new configuration
should be rejected.
• Sample Function - Samples a random configuration for exploring the configuration
space.
• Sample Neighbor Function - Samples the neighborhood of an input configuration; the distance metric controls how far the samples can be.
• Sample Goal Function - Samples a random goal configuration.
• Set/Get State Functions - Sets and gets the configuration state.
• State Difference Function - Used to find the difference between two states. This
function properly takes into account identifications for joints without any limits.
• Goal Configurations - Explicit seeding of the goal configurations when initializing
the planner.
• Lower/Upper Limits - Very rough limits of each DOF of the configuration space
used to normalize sampling and verify configurations.
• Resolution - The resolution to check line collisions and used for other discretization
factors dependent on the space.
• Step Length - The step length for discretizing the configuration space when searching.
• Max Iterations/Time - Controls the maximum number of iterations or the maximum time that the planner should spend before it gives up.
• Path Optimization Parameters - Special parameters used when post-processing the output paths to smooth out any irregularities.
• Check Self Collisions - Whether the planner should check self-collisions of the robot for every configuration. Some configuration spaces, like navigation, might not modify the internal joints of the robot, so skipping this check can speed up planning.
However, there are many different types of inputs to a planner, so it is impossible to cover everything with one class. Therefore, planner parameters has a flexible and safe way to extend its parameters without destroying compatibility with a particular planner or user of the planner. This is enabled by serialization to XML. Using XML as a medium, it is easy to exchange data across different derivations of planner parameters without much effort or incompatibility. Because of these serialization capabilities, it becomes possible to pass the planning problem across the network or to a different thread. Furthermore, a base planner can read in any XML structure and ignore the fields that it does not recognize, meaning that users can re-use old planners without having to tune their parameters structures.
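As a hedged illustration of this mechanism, assuming the SetExtraParameters helper of the Python PlannerParameters bindings and an <_nmaxiterations> field understood by the sampling-based planners:

from openravepy import Planner

params = Planner.PlannerParameters()
# Extra fields are embedded as raw XML; a planner that does not recognize a
# tag simply ignores it, so the same structure can be reused across planners.
params.SetExtraParameters('<_nmaxiterations>4000</_nmaxiterations>')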
Path Optimization
Path smoothing/optimization is regarded as a post-processing step to planners. Path opti-
mization algorithms take in an existing trajectory and filter it using the existing constraints
of the planner. In fact, in terms of functionality there is no difference between a path optimization planner and a regular planner, besides the fact that a trajectory is used as input. Because a
planner already has a trajectory as an argument to its plan path method, supporting path
optimization does not cause any major API changes to the infrastructure.
The planner parameters structure specifies which optimization algorithm to use for post-processing the trajectory. By default, this is the linear shortcut method; however, many different optimization algorithms exist, each with its own set of parameters.
Figure A.4: Several simulated sensors.
This type of planner post-processing actually allows users to chain planners in the same
way that filters are chained for sensing data. Of course, users can continue to smooth paths inside their planners without relying on this framework. However, explicit control of path smoothing
allows custom parameters to be easily specified.
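Continuing the planning sketch from earlier in this section (reusing env, robot, and the planned traj), a post-processing planner is driven through the same interface; the shortcut_linear planner name is assumed to be the linear shortcut method registered by the default plugins:

from openravepy import Planner, RaveCreatePlanner

# A path-optimization planner is used exactly like a regular planner, except
# that the input trajectory already contains a path to be filtered in place.
smoother = RaveCreatePlanner(env, 'shortcut_linear')
params = Planner.PlannerParameters()
params.SetRobotActiveJoints(robot)
smoother.InitPlan(robot, params)
smoother.PlanPath(traj)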
system would not be able to perceive the target object anymore and might decide to remove
it from the environment. The sensor system interface allows control of what objects should
be released from the control of the perception system, what objects should be locked so the
system cannot destroy them, and what objects should be destroyed. Every kinematics body
has a structure that specifically manages the state for the object and allows planners and
other systems to mark objects as being used, so they shouldn’t be deleted. Furthermore, as
a robot grasps an object, the perception system should completely release it or hand it over
to a new perception system because the object’s state is now controlled by the gripper.
A.3.1 Padding
Obstacles and the robot need to be padded with at least 5mm-10mm around the surface
before computing collision queries. Section 4.6.1 motivated the use of convex decompositions
for uniformly padding the surfaces. However, padding cannot be applied all the time in all
situations. Because joint encoders are very accurate, checking self-collisions with a padded
mesh is meaningless most of the time; in fact, it could report spurious collisions that affect the feasibility of the plan.
In this thesis we spent a great deal of effort showing why target objects should be treated
differently from obstacles the robot is trying to avoid; the robot actually needs to get close
and make contact with the targets. If grasp sets are made with a padded object, the robot could hallucinate contact points that do not exist in reality, which could yield very bad grasps. The objects should be their original size when creating grasp tables and
approaching the final grasp to the target object. Therefore, padded objects should be used
only when avoiding them is clearly the goal of the plan. The target object should be padded
only in the first stage of grasp planning (Section 3.3), where the gripper moves close to the target but does not execute the final approach. Sampling visibility should also work with the original objects. Any other situation requires obstacles to be padded.
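A sketch of generating padded convex decompositions through the database generators; the padding keyword is an assumption standing in for whatever uniform-padding parameter the convexdecomposition generator exposes:

from openravepy import Environment, databases

env = Environment()
env.Load('robots/barrettwam.robot.xml')
robot = env.GetRobots()[0]

# Build (or load a cached) convex decomposition of every robot link.
cdmodel = databases.convexdecomposition.ConvexDecompositionModel(robot)
if not cdmodel.load():
    # padding=0.005 (5mm) is an assumed keyword for uniformly growing each hull.
    cdmodel.generate(padding=0.005)
    cdmodel.save()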
A.3.2 Jittering
The most effective way of moving the robot out of grazing environment collisions or self-collisions is to randomly move the robot joints a small distance and check whether the robot got out of collision.
Although many papers have been written on computing proximal distance and finding the
shortest path to get the robot out of collision, they tend to be slow. Furthermore, the straight
line in configuration space between the initial colliding configuration and a close non-colliding
configuration is usually a very good approximation of the shortest path to take. Jittering
first starts searching within a small ball of the current configuration and slowly increases
that ball until a reasonable maximum limit on the configuration distance is reached.
The distance metric presented in Section 4.6.3 becomes very important in determining
what joints affect more volume and what joints do not. Without a well-calibrated distance metric, small steps in configuration space could move the base joint enough to produce 5mm-10mm jumps at the end effector. Such large jumps will most likely cross obstacles and get the robot in trouble.
Therefore, we have to rely on a distance metric that takes into account the average swept
volume of each joint.
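The idea can be sketched in a few lines of Python; this is a simplified stand-in for the jitter routine that ignores the swept-volume weighting discussed above, and it must be called with the environment already locked:

import numpy as np

def jitter_active_dof(robot, max_radius=0.1, num_tries=200):
    # Randomly perturb the active DOF inside a slowly growing ball until the
    # robot is free of environment and self collisions; robot is an openravepy
    # Robot whose active DOF have already been set.
    env = robot.GetEnv()
    original = robot.GetActiveDOFValues()
    lower, upper = robot.GetActiveDOFLimits()
    for i in range(num_tries):
        radius = max_radius * float(i + 1) / num_tries
        candidate = original + radius * (2.0 * np.random.rand(len(original)) - 1.0)
        candidate = np.minimum(np.maximum(candidate, lower), upper)
        robot.SetActiveDOFValues(candidate)
        if not env.CheckCollision(robot) and not robot.CheckSelfCollision():
            return candidate
    robot.SetActiveDOFValues(original)   # restore the original configuration
    return None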
A.4 Discussion
OpenRAVE provides an environment for testing, developing, and deploying manipulation
planning algorithms in real-world autonomous manipulation applications. The biggest chal-
lenge is developing an integrated architecture that allows for rapid development, powerful
scripting, and the combination of many modules that inter-operate with many libraries. At-
tempting to support everything could spread an architecture thin and make it lose its value;
therefore, we make it easy to connect OpenRAVE to other systems through plugins. The plu-
gin architecture allows OpenRAVE to focus solely on geometric and kinematic analyses. This focus allows it to be easily integrated into existing robotics systems that concentrate
on other tasks like low-level control, message protocols, perception, and higher-level intelli-
gence systems. We covered the final OpenRAVE architecture and a lot of the decisions that
went into its design. OpenRAVE already supports a plethora of functions like environment
cloning, geometry hashes, planner parameters, advanced exception handling, and grabbing
bodies that other robotics architectures are just beginning to realize the importance of.
One of the earliest and best decisions when starting to develop OpenRAVE was to make
it open-source. Its open nature has allowed a community to flourish within it. Further-
more, a lot of the community has helped uncover many issues and bottlenecks with previous
OpenRAVE versions. This has allowed the architecture to naturally evolve as user demand
increased. In fact, without such continuous feedback from a community of over a hundred researchers, OpenRAVE could not have grown so far or become as popular as it is today. Because we believe in supporting commercial ventures, we release the OpenRAVE code under the GNU Lesser General Public License and the Apache License, Version 2.0. The core is protected by the LGPL so we can keep track of changes, with the hope of maintaining a single OpenRAVE distribution that can satisfy everyone. The LGPL requires any changes to the core to be made public if used in products. All scripts and examples are released under the much less restrictive Apache License, so users have the freedom to modify them in any way they want without any worry.
Many people have argued that releasing research prematurely as open-source lets other parties take advantage of the work, to the detriment of the original author. Such
statements completely miss the point of research and sharing information, and our hope
is that the success of OpenRAVE can show to the robotics community that sharing code
can greatly benefit the original author’s goals and spur more growth in the community.
Making code open source allows research to be used in many more places and gives it a lot
of exposure.
OpenRAVE will continue to be a community effort to organize manipulation planning
research. Eventually we hope to setup official planning benchmarks to easily test planning,
collision, physics, and other algorithms in the context of manipulation tasks.
References
Albu-Schaffer A, Haddadin S, Ott C, Stemmer A, Wimbock T, Hirzinger G (2007) The dlr lightweight
robot: design and control concepts for robots in human environments. Industrial Robot: An International
Journal 34(5):376–385
An CH, Atkeson CG, Hollerbach JM (1988) Model-Based Control of a Robot Manipulator. MIT Press,
Cambridge, Massachusetts
Ausubel J (2009) The population delusion: Ingenuity wins every time. New Scientist 203(2727):38–39
Baker C, Ferguson D, Dolan J (2008) Robust mission execution for autonomous urban driving. In: 10th
International Conference on Intelligent Autonomous Systems
Bay H, Ess A, Tuytelaars T, Gool LV (2008) Surf: Speeded up robust features. Computer Vision and Image
Understanding (CVIU) 110(3):346–359
Berenson D, Diankov R, Nishiwaki K, Kagami S, Kuffner J (2007) Grasp planning in complex scenes. In:
Proceedings of IEEE-RAS International Conference on Humanoid Robots (Humanoids)
Berenson D, Srinivasa S, Ferguson D, Kuffner J (2009a) Manipulator path planning on constraint manifolds.
In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Berenson D, Srinivasa S, Ferguson D, Collet A, Kuffner J (2009b) Manipulator path planning with workspace
goal regions. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Bertram D, Kuffner J, Dillmann R, Asfour T (2006) An integrated approach to inverse kinematics and path
planning for redundant manipulators. In: Proceedings of the IEEE International Conference on Robotics
and Automation (ICRA)
Bolles RC, Horaud P, Hannah MJ (1983) 3dpo: A three-dimensional part orientation system. In: IJCAI, pp
1116–1120
Borgefors G, Strand R (2005) An approximation of the maximal inscribed convex set of a digital object. In:
Image Analysis and Processing ICIAP
Brock O, Kuffner J, Xiao J (2008) Motion for manipulation tasks. Handbook of Robotics (O Khatib and B
Siciliano, Eds)
Cameron S (1985) A study of the clash detection problem in robotics. In: Int. Conf. Robotics Automation,
pp 488–493
Chestnutt J (2007) Navigation planning for legged robots. PhD thesis, Robotics Institute, Carnegie Mellon
University
Chestnutt J, Michel P, Nishiwaki K, Kuffner J, Kagami S (2006) An intelligent joystick for biped control. In:
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Chew LP, Kedem K (1993) A convex polygon among polygonal obstacles: Placement and high-clearance motion.
Computational Geometry: Theory and Applications 3(2):59–89
Chia KW, Cheok AD, Prince S (2002) Online 6 dof augmented reality registration from natural features. In:
Proc. of the Intl. Symp. on Mixed and Augmented Reality (ISMAR)
Chum O (2005) Two-view geometry estimation by random sample and consensus. PhD thesis, Czech Technical
University
Ciocarlie M, Goldfeder C, Allen P (2007) Dimensionality reduction for hand-independent dexterous robotic
grasping. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS)
Ciocarlie MT, Allen PK (2009) Hand posture subspaces for dexterous robotic grasping. Int J Rob Res 28(7):851–
867
Curless B, Levoy M (1995) Better optical triangulation through spacetime analysis. In: ICCV, pp 987–994
Datta A, Kim J, Kanade T (2009) Accurate camera calibration using iterative refinement of control points. In:
Workshop on Visual Surveillance (VS), 2009 (held in conjunction with ICCV).
David P, DeMenthon D (2005) Object recognition in high clutter images using line features. In: International
Conference on Computer Vision (ICCV)
David P, Dementhon D, Duraiswami R, Samet H (2004) SoftPOSIT: Simultaneous pose and correspondence
determination. International Journal of Computer Vision 59(3):259–284
Diankov R, Kuffner J (2008) Openrave: A planning architecture for autonomous robotics. Tech. Rep. CMU-
RI-TR-08-34, Robotics Institute, URL http://openrave.programmingvision.com
Diankov R, Ratliff N, Ferguson D, Srinivasa S, Kuffner J (2008a) Bispace planning: Concurrent multi-space
exploration. In: Proceedings of Robotics: Science and Systems (RSS)
Diankov R, Srinivasa S, Ferguson D, Kuffner J (2008b) Manipulation planning with caging grasps. In: Pro-
ceedings of IEEE-RAS International Conference on Humanoid Robots (Humanoids)
Diankov R, Kanade T, Kuffner J (2009) Integrating grasp planning and visual feedback for reliable manipulation.
In: Proceedings of IEEE-RAS International Conference on Humanoid Robots (Humanoids)
Divvala SK, Hoiem D, Hays JH, Efros AA, Hebert M (2009) An empirical study of context in object detection.
In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Dupont L, Hemmer M, Petitjean S, Schömer E (2007) Complete, exact and efficient implementation for com-
puting the adjacency graph of an arrangement of quadrics. In: ESA’07: Proceedings of the 15th annual
European conference on Algorithms, pp 633–644
Eberly D (2001) Intersection of convex objects: The method of separating axes. Tech. rep., Geometric Tools,
LLC, URL http://www.geometrictools.com/
Escande A, Miossec S, Kheddar A (2007) Continuous gradient proximity distance for humanoids free-collision
optimized-postures. In: Proceedings of IEEE-RAS International Conference on Humanoid Robots (Hu-
manoids)
Ferguson D (2006) Single Agent and Multi-agent Path Planning in Unknown and Dynamic Environments. PhD
thesis, Carnegie Mellon University
Ferguson D, Stentz A (2004) Delayed D*: The Proofs. Tech. Rep. CMU-RI-TR-04-51, Carnegie Mellon Robotics
Institute
Ferguson D, Stentz A (2005) The Field D* Algorithm for Improved Path Planning and Replanning in Uniform
and Non-uniform Cost Environments. Tech. Rep. CMU-RI-TR-05-19, Carnegie Mellon School of Computer
Science
Ferguson D, Stentz A (2006) Anytime RRTs. In: Proceedings of the IEEE International Conference on Intelli-
gent Robots and Systems (IROS)
Fikes R, Nilsson N (1971) Strips: a new approach to the application of theorem proving to problem solving.
Artificial Intelligence 2:189–208
Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to
image analysis and automated cartography. Communications of the ACM 24(6):381–395
Gerkey B, Vaughan RT, Stoy K, Howard A, Sukhatme GS, Mataric MJ (2001) Most Valuable Player: A Robot
Device Server for Distributed Control. In: Proceedings of the IEEE International Conference on Intelligent
Robots and Systems (IROS)
Goad C (1983) Special purpose, automatic programming for 3d model-based vision. Proc DARPA Image Un-
derstanding Workshop
Gold S, Rangarajan A, Lu CP, Pappu S, Mjolsness E (1998) New algorithms for 2d and 3d point matching:
Pose estimation and correspondence. Pattern Recognition 31(8):1019–1031
Goldfeder C, Allen P, Lackner C, Pelossof R (2007) Grasp planning via decomposition trees. In: Proceedings
of the IEEE International Conference on Robotics and Automation (ICRA)
Gordon I, Lowe D (2006) What and where: 3d object recognition with accurate pose. In: Toward Category-Level
Object Recognition, pp 67–82
Gorski KM, Hivon E, Banday AJ, Wandelt BD, Hansen FK, Reinecke M, Bartelmann M (2005) HEALPix:
a framework for high-resolution discretization and fast analysis of data distributed on the sphere. The
Astrophysical Journal 622:759–771
Gravot F, Haneda A, Okada K, Inaba M (2006) Cooking for humanoid robot, a task that needs symbolic and
geometric reasonings. In: Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA)
Gremban KD, Ikeuchi K (1993) Appearance-based vision and the automatic generation of object recognition
programs. In: Three-Dimensional Object Recognition Systems, Elsevier Science Publishers, B.V.
Gremban KD, Ikeuchi K (1994) Planning multiple observations for object recognition. International Journal of
Computer Vision 12(1):137–172
Griffin G, Holub A, Perona P (2007) Caltech-256 object category dataset. Tech. Rep. 7694, California Institute
of Technology, URL http://authors.library.caltech.edu/7694
Grimson W, Lozano-Perez T (1985) Recognition and localization of overlapping parts from sparse data in two
and three dimensions. In: IEEE Conf. on Robotics and Automation, pp 61–66
Gu L, Kanade T (2008) A generative shape regularization model for robust face alignment. In: Proceedings of
The 10th European Conference on Computer Vision
Harada K, Morisawa M, Miura K, Nakaoka S, Fujiwara K, Kaneko K, Kajita S (2008) Kinodynamic gait
planning for full-body humanoid robots. In: Proceedings of the IEEE International Conference on Intelligent
Robots and Systems (IROS)
Haralick BM, Lee CN, Ottenberg K, Nolle M (1994) Review and analysis of solutions of the three point
perspective pose estimation problem. International Journal of Computer Vision 13(3):331–356
Hartley R, Zisserman A (2000) Multiple View Geometry in Computer Vision. Cambridge University Press
Hauser K, Ng-Thow-Hing V (2010) Fast smoothing of manipulator trajectories using optimal bounded-
acceleration shortcuts. In: Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA)
Hauser K, Bretl T, Latombe J (2008) Motion planning for legged robots on varied terrain. International Journal
of Robotics Research 27(11-12):1325–1349
Hollinger G, Ferguson D, Srinivasa S, Singh S (2009) Combining search and action for mobile robots. In:
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Hammond III FL, Shimada K (2009) Morphological optimization of kinematically redundant manipulators using weighted
isotropy measures. In: Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA)
Jain A, Kemp CC (2008) Behaviors for robust door opening and doorway traversal with a force-sensing mobile
manipulator. In: Proceedings of the Manipulation Workshop in Robotics Science And Systems
Kadir T, Zisserman A, Brady M (2004) An affine invariant salient region detector. In: Proc. of the 8th
European Conference on Computer Vision, pp 345–457
Kaneko K, Kanehiro F, Kajita S, Hirukawa H, Kawasaki T, Hirata M, Akachi K, Isozumi T (2004) Humanoid
robot hrp-2. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Kazemi M, Gupta K, Mehrandezh M (2009) Global path planning for robust visual servoing in complex envi-
ronments. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Kim DJ, Lovelett R, Behal A (2009) Eye-in-hand stereo visual servoing of an assistive robot arm in unstructured
environments. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Kim YJ, Varadhan G, Lin MC, Manocha D (2003) Fast swept volume approximation of complex polyhedral
models. In: ACM Symposium on Solid Modeling and Applications
Knuth DE (1973) Fundamental Algorithms, The Art of Computer Programming, vol 1, 2nd edn. Addison-
Wesley
Kojima M, Okada K, Inaba M (2008) Manipulation and recognition of objects incorporating joints by a hu-
manoid robot for daily assistive tasks. In: Proceedings of the IEEE International Conference on Intelligent
Robots and Systems (IROS)
Kragic D, Miller AT, Allen PK (2001) Real-time tracking meets online grasp planning. In: Proceedings of the
IEEE International Conference on Robotics and Automation (ICRA)
Kuffner J, LaValle S (2000) RRT-Connect: An Efficient Approach to Single-Query Path Planning. In: Pro-
ceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Kuffner J, Nishiwaki K, Kagami S, Inaba M, Inoue H (2003) Motion planning for humanoid robots. In: Pro-
ceedings of the International Symposium on Robotics Research (ISRR)
LaValle S, Kuffner J (2000) Rapidly-exploring random trees: Progress and prospects. In: Robotics: The
Algorithmic Perspective. 4th Int’l Workshop on the Algorithmic Foundations of Robotics (WAFR)
LaValle S, Kuffner J (2001) Randomized kinodynamic planning. International Journal of Robotics Research
20(5):378–400
Lepetit V, Vacchetti L, Thalmann D, Fua P (2003) Fully automated and stable registration for augmented
reality applications. In: Proc. of the Intl. Symp. on Mixed and Augmented Reality (ISMAR)
Li Y, Gu L, Kanade T (2009) A robust shape model for multi-view car alignment. In: IEEE Conference on
Computer Vision and Pattern Recognition
Likhachev M, Ferguson D (2009) Planning long dynamically feasible maneuvers for autonomous vehicles. In-
ternational Journal of Robotics Research 28(8):933–945
Low KH, Dubey RN (1987) A comparative study of generalized coordinates for solving the inverse-kinematics
problem of a 6r robot manipulator. International Journal of Robotics Research 5(4):69–88
Lowe D (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer
Vision 60(2):91–110
Luo Z, Tseng P (1992) On the convergence of coordinate descent method for convex differentiable minimization.
Journal of Optimization Theory and Applications 72(1):7–35
Malisiewicz T, Efros AA (2008) Recognition by association via learning per-exemplar distances. In: CVPR
Manocha D, Zhu Y (1994) A fast algorithm and system for the inverse kinematics of general serial manipulators.
In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Marthi B, Russell SJ, Wolfe J (2008) Angelic Hierarchical Planning: Optimal and Online Algorithms. In:
ICAPS
Michel P (2008) Integrating perception and planning for humanoid autonomy. PhD thesis, Robotics Institute,
Carnegie Mellon University
Mikolajczyk K, Schmid C (2002) An affine invariant interest point detector. In: ECCV ’02: Proceedings of the
7th European Conference on Computer Vision-Part I, pp 128–142
Mikolajczyk K, Schmid C (2004) Scale & affine invariant interest point detectors. Int J Comput Vision 60(1):63–
86
Mikolajczyk K, Zisserman A, Schmid C (2003) Shape recognition with edge-based features. In: Proceedings of
the British Machine Vision Conference, vol 2, pp 779–788
Miller AT (2001) Graspit!: A versatile simulator for robotic grasping. PhD thesis, Department of Computer
Science, Columbia University
Moreels P, Perona P (2005) Evaluation of features detectors and descriptors based on 3d objects. In: Interna-
tional Conference on Computer Vision (ICCV)
Morel JM, Yu G (2009) ASIFT: A new framework for fully affine invariant image comparison. SIAM Journal on
Imaging Sciences 2
Morris AC (2007) Robotic introspection for exploration and mapping of subterranean environments. PhD thesis,
Robotics Institute, Carnegie Mellon University
Niskanen S, Östergård PRJ (2003) Cliquer user's guide, version 1.0. Tech. Rep. T48, Communications Laboratory,
Helsinki University of Technology
Okada K, Ogura T, Haneda A, Fujimoto J, Gravot F, Inaba M (2004) Humanoid motion generation system on
hrp2-jsk for daily life environment. In: International Conference on Mechatronics and Automation (ICMA)
Okada K, Kojima M, Sagawa Y, Ichino T, Sato K, Inaba M (2006) Vision based behavior verification system
of humanoid robot for daily environment tasks. In: Proceedings of IEEE-RAS International Conference on
Humanoid Robots (Humanoids)
Okada K, Tokutsu S, Ogura T, Kojima M, Mori Y, Maki T, Inaba M (2008) Scenario controller for daily
assistive humanoid using visual verification. In: Proceedings of the International Conference on Intelligent
Autonomous Systems (IAS)
Oriolo G, Vendittelli M, Freda L, Troso G (2004) The SRT Method: Randomized strategies for exploration.
In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Pantofaru C, Hebert M (2007) A framework for learning to recognize and segment object classes using weakly
supervised training data. In: British Machine Vision Conference
Pelossof R, Miller A, Allen P, Jebara T (2004) An svm learning approach to robotic grasping. In: Proceedings
of the IEEE International Conference on Robotics and Automation (ICRA)
Pham MT, Cham TJ (2007) Online learning asymmetric boosted classifiers for object detection. In: Proc.
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’07)
Prats M, Martinet P, del Pobil AP, Lee S (2007a) Vision/force control in task-oriented grasping and manipu-
lation. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS)
Prats M, Sanz P, del Pobil A (2007b) Task-oriented grasping using hand preshapes and task frames. In:
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Prats M, Martinet P, del Pobil A, Lee S (2008a) Robotic execution of everyday tasks by means of external
vision/force control. Journal of Intelligent Service Robotics 1(3):253–266
Prats M, Sanz PJ, del Pobil AP (2008b) A sensor-based approach for physical interaction based on hand, grasp
and task frames. In: Proceedings of the Manipulation Workshop in Robotics Science And Systems
Quigley M, Gerkey B, Conley K, Faust J, Foote T, Leibs J, Berger E, Wheeler R, Ng A (2009) ROS: an open-source Robot Operating System. In: ICRA Workshop on Open Source Software in Robotics
Raghavan M, Roth B (1990) A general solution for the inverse kinematics of all series chains. In: Proc. of the
8th CISM-IFTOMM Symposium on Robots and Manipulators
Rimon E (1999) Caging planar bodies by one-parameter two-fingered gripping systems. The International
Journal of Robotics Research 18:299–318
Rimon E, Blake A (1986) Caging 2d bodies by one-parameter two-fingered gripping systems. In: Proceedings of the
IEEE International Conference on Robotics and Automation (ICRA)
Russell S, Norvig P (1995) Artificial Intelligence: A Modern Approach. Prentice Hall, Inc.
Rusu RB, Marton ZC, Blodow N, Dolha M, Beetz M (2008) Towards 3d point cloud based object maps for
household environments. Robotics and Autonomous Systems Journal (Special Issue on Semantic Knowledge)
Rybski P, Veloso MM (2009) Prioritized multihypothesis tracking by a robot with limited sensing. EURASIP
Journal on Advances in Signal Processing 2009:138–154
Rybski PE, Roumeliotis S, Gini M, Papanikopoulos N (2008) Appearance-based mapping using minimalistic
sensor models. Autonomous Robots 24(3):159–167
Saidi F, Stasse O, Yokoi K, Kanehiro F (2007) Online object search with a humanoid robot. In: Proceedings
of IEEE-RAS International Conference on Humanoid Robots (Humanoids)
Schaffalitzky F, Zisserman A (2002) Multi-view matching for unordered image sets, or how do i organize my
holiday snaps. In: Proc. of European Conference on Computer Vision (ECCV)
Schneiderman H, Kanade T (2004) Object detection using the statistics of parts. International Journal of
Computer Vision 56(3):151–157
Schwarzer F, Saha M, Latombe JC (2003) Exact collision checking of robot paths. Algorithmic Foundations
of Robotics V 7:25–41
Sentis L (2007) Synthesis and Control of Whole-Body Behaviors in Humanoid Systems. PhD thesis, Stanford
University, Stanford, California
Serre T, Wolf L, Poggio T (2005) Object recognition with features inspired by visual cortex. In: Proc of Conf.
on Computer Vision and Pattern Recognition
Shi J, Malik J (1997) Normalized cuts and image segmentation. In: Proc of the 1997 Conference on Computer
Vision and Pattern Recognition (CVPR ’97), IEEE Computer Society
Srinivasa S, Ferguson D, Weghe MV, Diankov R, Berenson D, Helfrich C, Strasdat H (2008) The robotic busboy:
Steps towards developing a mobile robotic home assistant. In: Proceedings of the International Conference
on Intelligent Autonomous Systems (IAS)
Stilman M (2007) Task constrained motion planning in robot joint space. In: Proceedings of the IEEE Inter-
national Conference on Intelligent Robots and Systems (IROS)
Stilman M, Nishiwaki K, Kagami S (2007a) Learning object models for humanoid manipulation. In: IEEE
International Conference on Humanoid Robotics
Stilman M, Schamburek J, Kuffner J, Asfour T (2007b) Manipulation planning among movable obstacles. In:
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)
Stolarz J, Rybski P (2007) An architecture for the rapid development of robot behaviors. Master’s thesis,
Robotics Institute, Carnegie Mellon University
Stoytchev A, Arkin R (2004) Incorporating motivation in a hybrid robot architecture. Journal of Advanced
Computational Intelligence and Intelligent Informatics 8(3):269–274
Stroustrup B (2001) Exception safety: Concepts and techniques. In: Advances in Exception Handling Tech-
niques, Lecture Notes in Computer Science, vol 2022, Springer Berlin / Heidelberg, pp 60–76, URL
http://dx.doi.org/10.1007/3-540-45407-1_4
Stulp F, Fedrizzi A, Beetz M (2009) Learning and performing place-based mobile manipulation. In: Pro-
ceedings of the 8th International Conference on Development and Learning (ICDL)
Sudsang A, Ponce J, Srinivasa N (1997) Algorithms for constructing immobilizing fixtures and grasps of three-
dimensional objects. In: J.-P. Laumond and M. Overmars, editors, Algorithmic Foundations of Robotics
II
Torralba A, Murphy KP, Freeman WT (2007) Sharing visual features for multiclass and multiview object
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5):854–869
Tuytelaars T, Van Gool L (2004) Matching widely separated views based on affine invariant regions. Int J
Comput Vision 59(1):61–85
Urmson C, Anhalt J, Bagnell D, Baker CR, Bittner R, Clark MN, Dolan JM, Duggins D, Galatali T, Geyer
C, Gittleman M, Harbaugh S, Hebert M, Howard TM, Kolski S, Kelly A, Likhachev M, McNaughton M,
Miller N, Peterson K, Pilnick B, Rajkumar R, Rybski PE, Salesky B, Seo YW, Singh S, Snider J, Stentz A,
Whittaker W, Wolkowicki Z, Ziglar J, Bae H, Brown T, Demitrish D, Litkouhi B, Nickolaou J, Sadekar V,
Zhang W, Struble J, Taylor M, Darms M, Ferguson D (2008) Autonomous driving in urban environments:
Boss and the urban challenge. J Field Robotics 25(8):425–466
Wampler C, Morgan A (1991) Solving the 6r inverse position problem using a generic-case solution methodology.
Mechanisms and Machine Theory 26(1):91–106
Weghe MV, Ferguson D, Srinivasa S (2007) Randomized path planning for redundant manipulators without in-
verse kinematics. In: Proceedings of IEEE-RAS International Conference on Humanoid Robots (Humanoids)
Wheeler M, Ikeuchi K (1992) Towards a vision algorithm compiler for recognition of partially occluded 3-d
objects. Tech. Rep. CMU-CS-TR-92-185, Robotics Institute
Wheeler M, Ikeuchi K (1995) Sensor modeling, probabilistic hypothesis generation, and robust localization for
object recognition. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 17
Wu C, Clipp B, Li X, Frahm JM, Pollefeys M (2008) 3d model matching with viewpoint-invariant patches
(VIP). In: Proc of Conf. on Computer Vision and Pattern Recognition
Wyrobek K, Berger E, Van der Loos HFM, Salisbury K (2008) Towards a personal robotics development platform:
Rationale and design of an intrinsically safe personal robot. In: Proceedings of the IEEE International
Conference on Robotics and Automation (ICRA)
Yershova A, Jain S, LaValle S, Mitchell J (2009) Generating uniform incremental grids on so(3) using the hopf
fibration. International Journal of Robotics Research
Zacharias F, Borst C, Hirzinger G (2007) Capturing robot workspace structure: Representing robot capabilities.
In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS)
Zacharias F, Sepp W, Borst C, Hirzinger G (2009) Using a model of the reachable workspace to position mobile
manipulators for 3-d trajectories. In: Proceedings of IEEE-RAS International Conference on Humanoid
Robots (Humanoids)
Zacharias F, Leidner D, Schmidt F, Borst C, Hirzinger G (2010) Exploiting structure in two-armed manipula-
tion tasks for humanoid robots. In: Proceedings of the IEEE International Conference on Intelligent Robots
and Systems (IROS)
Zhang L, Curless B, Seitz SM (2002) Rapid shape acquisition using color structured light and multi-pass
dynamic programming. In: In The 1st IEEE International Symposium on 3D Data Processing, Visualization,
and Transmission, pp 24–36
Zhang L, Curless B, Seitz SM (2003) Spacetime stereo: Shape recovery for dynamic scenes
Zhang Z (2000) A flexible new technique for camera calibration. In: IEEE Transactions on Pattern Analysis
and Machine Intelligence
Zhang Z, Deriche R, Faugeras O, Luong Q (1995) A robust technique for matching two uncalibrated images
through the recovery of the unknown epipolar geometry. Artificial Intelligence 78(1-2):87–119
Zheng Y, Qian WH (2006) An enhanced ray-shooting approach to force-closure problems. Journal of Manufac-
turing Science and Engineering 128(4):960–968