Zhang 2022 J. Phys. Conf. Ser. 2203 012065
Abstract. A robot generally has many inverse kinematics solutions. Deep reinforcement learning enables the robot to find the optimal inverse kinematics solution in a short time. To address the problem of sparse rewards in deep reinforcement learning, this paper proposes an improved PPO algorithm. Firstly, a simulation environment for the operation of the robotic arm is built. Secondly, a convolutional neural network processes the data read by the camera of the robotic arm and provides the Actor and Critic networks. Thirdly, based on the inverse kinematics of the robotic arm and the reward mechanism of deep reinforcement learning, a hierarchical reward function that incorporates motion accuracy is designed to promote the convergence of the PPO algorithm. Finally, the improved PPO algorithm is compared with the traditional PPO algorithm. The results show that the improved PPO algorithm improves both convergence speed and operating accuracy.
1. Introduction
Control of the robotic arm is an indispensable part of robot control. A robotic arm usually has many inverse kinematics solutions, and the motion of each joint differs from one solution to another. Through deep reinforcement learning algorithms, the robotic arm can quickly find a suitable trajectory in different working environments.
However, in many deep reinforcement learning algorithms, the sparse reward problem remains a key unsolved challenge [1]. In a sparse reward environment, the agent only receives a reward at the end of the sequential decision-making process. The lack of effective feedback in the intermediate steps leads to under-fitting of the policy network, slow training, and high cost. There are many ways to alleviate the sparse reward problem; they can be roughly divided into multi-goal guidance methods that add virtual goal rewards, hierarchical policy methods that add subtask rewards, imitation learning methods that add expert-similarity rewards, and curiosity methods that add state-novelty rewards.
Designing a reward function tailored to the environment can more directly reduce the impact of sparse rewards on deep reinforcement learning training. In the field of manipulator control, one important objective of the reward function is to drive the end of the manipulator to the target point. Li Heyu et al. [2] improved the reward function of the PPO algorithm and divided the grasping operation of the robotic arm into two stages: from the initial position to a point directly below the target object, and from directly below the target object to the specified grasping position. This method reduced the jitter of the robotic arm during motion. Zhao et al. [3] proposed a robotic grasping method that incorporates the concept of energy consumption from dynamics; the distance between the robotic arm and the real target is judged by calculating the energy consumption, thereby improving the accuracy of the reward function. Sun Kang et al. [4] designed a reward function based on the time taken to complete the capture task, the distance between the end-capture mechanism of the robotic arm and the target point,
and the magnitude of the observed joint driving torque.
In response to the problems and research discussed above, this paper modifies the reward function of the PPO algorithm and designs a hierarchical reward function based on the distance between the end of the robotic arm and the target object. This alleviates, to a certain extent, the low learning efficiency of the PPO algorithm.
2. Related works
The PPO algorithm turns the on-policy training of the PG algorithm into off-policy training: in each iteration it updates the policy so as to minimize the loss function while keeping the new policy close to the previous one. Using importance sampling, the objective function of PPO is written as follows, where $A^{\theta_k}(s_t,a_t)$ is the advantage function [5]:

$J^{\theta_k}(\theta) = \mathbb{E}_{(s_t,a_t)\sim p_{\theta_k}}\left[\frac{p_{\theta}(a_t\mid s_t)}{p_{\theta_k}(a_t\mid s_t)}\,A^{\theta_k}(s_t,a_t)\right]$    (1)
This article adopts the PPO-Clip method. The update strategy of the objective function is as follows:
$\theta_{k+1} = \arg\max_{\theta}\,\mathbb{E}_{s,a\sim p_{\theta_k}}\left[L(s,a,\theta_k,\theta)\right]$    (2)

$L(s,a,\theta_k,\theta) = \min\left(\frac{p_{\theta}(a_t\mid s_t)}{p_{\theta_k}(a_t\mid s_t)}\,A^{\theta_k}(s_t,a_t),\ g\!\left(\varepsilon, A^{\theta_k}(s_t,a_t)\right)\right)$    (3)

$g(\varepsilon,A) = \begin{cases}(1+\varepsilon)A, & A \geqslant 0\\ (1-\varepsilon)A, & A < 0\end{cases}$    (4)
When the advantage function is positive, the next update will tend to increase the probability of
taking the same action. When the advantage function is negative, the next update will tend to reduce the
probability of taking the same action.
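As an illustration of equations (2)-(4), the clipped surrogate objective can be sketched in a few lines of PyTorch. This is a minimal sketch rather than the paper's implementation; the tensor names (logp, logp_old, adv) and the default value of epsilon are illustrative assumptions.

import torch

def ppo_clip_loss(logp: torch.Tensor,
                  logp_old: torch.Tensor,
                  adv: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective of eqs. (2)-(4), to be minimized."""
    # Importance-sampling ratio p_theta(a|s) / p_theta_k(a|s), computed in log space.
    ratio = torch.exp(logp - logp_old)
    # g(eps, A) of eq. (4): (1+eps)*A when A >= 0 and (1-eps)*A when A < 0,
    # which is exactly what clamping the ratio to [1-eps, 1+eps] produces.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Element-wise minimum of the unclipped and clipped terms (eq. 3).
    surrogate = torch.min(ratio * adv, clipped)
    # Maximizing the surrogate corresponds to minimizing its negative mean.
    return -surrogate.mean()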
robot arm should rotate when the end of the robot arm reaches the next position $(x_r', y_r', z_r')$. Subsequently, the motor of each joint drives that joint to rotate, and an action A is output:

$A = [\Delta a_1, \Delta a_2, \Delta a_3, \Delta a_4, \Delta a_5, \Delta a_6]$    (5)
When the joint angles of the robotic arm change, the image data collected by the camera also change, so the state S is updated. This process is repeated until the end of the robotic arm reaches the target position $(x_d, y_d, z_d)$.
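The interaction step described above can be sketched roughly as follows with PyBullet [6]. This is only a sketch under assumed names: arm_id, joint_ids, ee_link, view_mat, proj_mat, and target_pos come from the environment setup, and the 84x84 image size is an illustrative choice, not a value reported in the paper.

import numpy as np
import pybullet as p

def apply_action(arm_id, joint_ids, delta_a, ee_link, view_mat, proj_mat, target_pos):
    # Eq. (5): the action is a vector of joint-angle increments [da1, ..., da6].
    current = np.array([s[0] for s in p.getJointStates(arm_id, joint_ids)])
    p.setJointMotorControlArray(arm_id, joint_ids, p.POSITION_CONTROL,
                                targetPositions=(current + np.asarray(delta_a)).tolist())
    p.stepSimulation()

    # The new state S is the camera image taken after the joints have moved.
    _, _, rgb, _, _ = p.getCameraImage(84, 84, view_mat, proj_mat)
    state = np.reshape(rgb, (84, 84, 4))[:, :, :3]

    # Distance between the arm end (x_r', y_r', z_r') and the target (x_d, y_d, z_d).
    end_pos = np.array(p.getLinkState(arm_id, ee_link)[0])
    dis = float(np.linalg.norm(end_pos - np.asarray(target_pos)))
    return state, dis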
4.1.1. Specify limits on the motion of the end of the robotic arm. When the end exceeds the allowed range of motion below, $r = -0.5$:
$\begin{cases}0.3 < x < 0.9\\ -0.3 < y < 0.3\\ 0 < z < 0.6\end{cases}$    (6)
4.1.2. Set the maximum number of motion steps. When the number of motion steps is exceeded and the
end of the robotic arm has not reached the target position, r = −0.5.
4.1.3. Hierarchical reward function. To give the system enough positive rewards, the reward for successfully reaching the target position is divided into stages according to the distance $dis$ between the end of the robotic arm and the target:
$\begin{cases}r = 0.1, & 0.05 < dis \leqslant 0.2\\ r = 5, & 0.01 < dis \leqslant 0.05\\ r = 7, & dis \leqslant 0.01\end{cases}$    (7)
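A minimal sketch of how the penalties of 4.1.1-4.1.2 and the staged rewards of equation (7) can be combined into a single reward function is given below. The names step_count and max_steps, and the zero reward outside the listed distance bands, are assumptions made for illustration rather than details stated in the paper.

def hierarchical_reward(end_pos, dis, step_count, max_steps):
    x, y, z = end_pos
    # 4.1.1: penalize leaving the workspace limits of eq. (6).
    if not (0.3 < x < 0.9 and -0.3 < y < 0.3 and 0 < z < 0.6):
        return -0.5
    # 4.1.2: penalize exceeding the maximum number of motion steps.
    if step_count > max_steps:
        return -0.5
    # 4.1.3: staged positive rewards of eq. (7), based on the distance dis
    # between the end of the robotic arm and the target.
    if dis <= 0.01:
        return 7.0
    if dis <= 0.05:
        return 5.0
    if dis <= 0.2:
        return 0.1
    return 0.0  # assumption: no reward outside the staged bands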
6. Conclusion
This paper constructs an improved PPO algorithm, designs a reasonable reward function, and uses it for neural network training. Verification shows that the proposed algorithm converges in a short time, improves efficiency, and achieves a stable control effect. After training, the robotic arm can continuously learn and update itself among the many inverse kinematics solutions and approach the target object as quickly as possible.
Acknowledgement
First and foremost, I would like to show my deepest gratitude to my supervisor, Zhang Shaolin, a
respectable, responsible and resourceful scholar, who has provided me with valuable guidance in every
stage of the writing of this thesis. Without his enlightening instruction, impressive kindness and patience,
I could not have completed my thesis.
I shall extend my thanks to Associate Professor Zheng Change for all her kindness and help. Her
keen and vigorous academic observation enlightens me not only in this thesis but also in my future study.
Last but not least, I would like to thank the School of Technology, Beijing Forestry University and
the Institute of Automation, Chinese Academy of Sciences for providing an experimental environment
for my thesis.
This thesis work was partially supported by the Beijing University Student Research and Career Creation Program (project number s20201002114).
References
[1] He Q 2021 Research on Multi-goal-conditioned Method in Reinforcement Learning with Sparse Rewards (Hefei: University of Science and Technology of China)
[2] Li H, Lin T and Zeng B 2020 Control Method of Space Manipulator by Using Reinforcement Learning Aerospace Control 38 p 6
[3] Zhao R and Tresp V 2018 Energy-based hindsight experience prioritization Conference on Robot Learning (PMLR) pp 113-122
[4] Sun K, Wang Y, Du D and Qi N 2020 Capture Control Strategy of Free-Floating Space Manipulator Based on Deep Reinforcement Learning Algorithm Manned Spaceflight 26 p 6
[5] Jian S 2021 Algorithm principle and implementation of proximal policy optimization [EB/OL] https://www.jianshu.com/p/9f113adc0c50
[6] Coumans E and Bai Y 2017 PyBullet Quickstart Guide [EB/OL] https://github.com/bulletphysics/bullet3
[7] Zhu G, Huo Y, Luan Q and Shi Y 2021 Research on IoT environment temperature prediction based on PPO algorithm optimization Transducer and Microsystem Technologies 40 p 4
[8] OpenAI 2018 Spinning Up [EB/OL] https://spinningup.openai.com/en/latest/