
Towards GPU-Accelerated PRM for Autonomous Navigation

2020, Advances in Intelligent Systems and Computing

Janelle Blankenburg, Richard Kelley, David Feil-Seifer, Rui Wu, Lee Barford, and Frederick C. Harris Jr.

Abstract  Sampling-based planning is an important step for long-range navigation for an autonomous vehicle. This work proposes a GPU-accelerated sampling-based path planning algorithm which can be used as a global planner in autonomous navigation tasks. A modified version of the generation portion of the Probabilistic Road Map (PRM) algorithm is presented which reorders some steps of the algorithm in order to allow for parallelization, and thus can benefit highly from utilization of a GPU. The GPU and CPU algorithms were compared using a simulated navigation environment with graph generation tasks of several different sizes. It was found that the GPU-accelerated version of the PRM algorithm had significant speedup over the CPU version (up to 78×). This result provides promising motivation towards implementation of a real-time autonomous navigation system in the future.

J. Blankenburg · D. Feil-Seifer · F. C. Harris Jr.
Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA
e-mail: jjblankenburg@nevada.unr.edu; dave@cse.unr.edu; fred.harris@cse.unr.edu

R. Kelley
Nevada Center for Applied Research, University of Nevada, Reno, Reno, NV, USA
e-mail: rkelley@unr.edu

R. Wu
Department of Computer Science, East Carolina University, Greenville, NC, USA
e-mail: wur18@ecu.edu

L. Barford
Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA
Keysight Laboratories, Keysight Technologies, Reno, NV, USA
e-mail: lee.barford@ieee.org

Keywords  Path planning · Autonomous vehicle · Probabilistic roadmap · Parallel computing · Speedup

74.1 Introduction

The primary motivation of this work is to develop an end-to-end navigation system for an autonomous vehicle.
Autonomous navigation can have a strong impact on society, enabling an industry that affects many people's daily lives. This booming industry involves both academic institutions and large enterprises competing to develop a feasible autonomous car. Three of the major components of autonomous navigation in uncertain environments are: the controller, roadway and obstacle detection, and path planning. While these systems work together, each component has its own set of engineering challenges in order for it to function in real time and in the real world. For the purposes of this work, we focus on the path planning component.

In order for an autonomous system to perform in real-world environments, it must be able to generate a path for navigation in real time. This quick pace is one of the main challenges of path planning algorithms for autonomous vehicles. In recent years, many autonomous cars have begun to include embedded GPUs in their systems. These GPUs can be utilized towards the effort of developing a real-time navigation system. One of the other big issues developers face when choosing a path planning algorithm is how to incorporate the dynamics of the vehicle into the plan.

© Springer Nature Switzerland AG 2020. S. Latifi (ed.), 17th International Conference on Information Technology–New Generations (ITNG 2020), Advances in Intelligent Systems and Computing 1134, https://doi.org/10.1007/978-3-030-43020-7_74

Various types of path planners and dynamics models have been used for autonomous navigation. This work focuses specifically on sampling-based motion planners. These methods tend to utilize both a global and a local planner. The local planner helps the car navigate between way points. The global planner is used to generate this set of way points from the start to the goal.
Since a local planner handles the small-scale navigation and vehicle dynamics, a simplified global sampling-based path planning algorithm is sufficient for generating a long-range path. For this reason, this paper explores the use of Probabilistic Road Maps (PRMs) [1] for global planning. To ensure this system can be incorporated into a real-time autonomous navigation system in the future, the paper proposes a GPU-accelerated PRM algorithm.

The remainder of the paper is structured as follows: Sect. 74.2 presents state-of-the-art approaches to GPU-accelerated sampling-based planning for autonomous navigation, Sect. 74.3 describes the details of the proposed approach, Sect. 74.4 describes the experimental setup and results, Sect. 74.5 presents new and related directions of research, and Sect. 74.6 gives a concluding summary of the presented work.

74.2 Related Work

The eventual goal of this work is to create a real-time planner that will be integrated into the autonomous navigation system on the University of Nevada, Reno's autonomous Lincoln MKZ vehicle. This section therefore explores two main areas of research: state-of-the-art methods for autonomous navigation using sampling-based planners, and state-of-the-art methods for accelerating sampling-based planners using a GPU.

74.2.1 Autonomous Navigation Via Sampling Based Planning

Sampling-based planning has been utilized as a means for path planning in autonomous systems for over a decade. The majority of these methods use a complicated path planning function, such as Rapidly-exploring Random Trees (RRT) [2–6]. However, the RRT algorithm requires most if not all of its steps to be implemented on the GPU all at once in order to avoid the high overhead cost of data transfer between the CPU and GPU. Therefore, this work uses a simpler algorithm (PRM), which allows us to incrementally determine the feasibility of a path with minimal overhead.
Significantly fewer works utilize a simple sampling-based planning method, such as PRM. The works in [7] and [8] used PRMs to perform navigation tasks for an autonomous unmanned aerial helicopter. Although that work performed planning in 3D, autonomous car navigation is similar, since obstacles are three-dimensional. The work proposed in [9] utilizes PRMs for multi-robot motion planning for car-like robots. The planning methods in that work are similar to planning for autonomous navigation tasks as well. One recent work combines PRMs and reinforcement learning for long-range robotic navigation [10]. The application of that work is most similar to our intended application: an RL agent learns short-range, point-to-point navigation policies that capture the robot dynamics, and this agent acts as the local planner for the autonomous navigation task. Similar to the proposed work, that paper uses PRM as the global planner for navigation.

Each of these papers is focused on creating a full autonomous navigation system. They focus not only on a global planner, but also utilize local planners to ensure the navigation adheres to the dynamics of the system. Therefore, these papers address a different problem than the proposed work, as none of these methods utilize GPUs to accelerate the global planning. Instead of ensuring correct dynamics through a local planner, the proposed work focuses on speeding up the global planner by utilizing the GPU. Creating an end-to-end autonomous navigation system is currently out of the scope of this project, but the methods proposed in this work may be utilized in such a system in the future.

74.2.2 Sampling Based Planning Using a GPU

Many works utilize the GPU to speed up various sampling-based methods. Some focus on complicated methods, such as RRT [11–13]. Others focus on simpler methods, such as PRMs [14–16].
However, all of these methods focus on a different problem formulation than that proposed in our future work on the autonomous car. The work in [17] is most similar to our proposed method. That work developed a GPU-based parallel collision detection algorithm for motion planning using PRMs. Although it develops a very thorough algorithm for GPU-accelerated PRMs, the applications it discusses are slightly different from those proposed in our work. Our proposed method is focused on speeding up the global planner for a long-range autonomous navigation system for an autonomous car. Therefore, the overhead of the work proposed in [17] is unnecessary for our application, as we are only focusing on a small portion of the planning problem. This focus on long-range navigation requires planning in very large spaces, which are not addressed in [17].

74.3 Methodology

The proposed work is focused on speeding up the generation portion of the PRM algorithm by utilizing the computational efficiency of the GPU. The steps of the generation portion of the PRM algorithm are shown in Algorithm 1. The general idea behind the PRM algorithm is to generate a set of random points in the configuration space (i.e., generate a configuration) that are not in collision with the obstacles in the space. Each point is then connected to its k-nearest neighbors, thus generating a graph of the configuration space which can later be used for planning during the query phase of PRM. For the purposes of this work, we do not discuss the query phase of PRM, as we only modify the generation phase.

In order to develop a GPU-accelerated PRM algorithm, a few minor changes have to be made to this algorithm. These changes reorder the steps in order to allow for parallelization in the code. The GPU-accelerated PRM algorithm is shown in Algorithm 2.
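As a concrete illustration of the reordered steps, the generate-then-filter portion of Algorithm 2 (steps 3–6) can be sketched on the CPU with std:: analogues of the thrust calls used in the implementation (std::count_if for thrust::count_if, std::copy_if for thrust::copy_if). This is a hypothetical sketch, not the paper's device code: the circular obstacle, its center and radius, and the function names are illustrative assumptions.

```cpp
#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

struct Config { double x, y; };

// Hypothetical collision test for a simulated scene with one circular
// obstacle (the center and radius here are assumptions for illustration).
inline bool collisionFree(const Config& q) {
    const double cx = 0.5, cy = 0.5, r = 0.2;
    const double dx = q.x - cx, dy = q.y - cy;
    return dx * dx + dy * dy > r * r;
}

// Steps 3-6 of Algorithm 2: generate N random configurations, count the
// collision-free ones first (mirroring thrust::count_if) so the target
// vector t can be allocated at exactly the right size, then filter the
// survivors into t (mirroring thrust::copy_if).
std::vector<Config> buildTargetVector(std::size_t N, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::vector<Config> configs(N);
    for (auto& q : configs) q = {unit(gen), unit(gen)};

    const auto keep = std::count_if(configs.begin(), configs.end(), collisionFree);
    std::vector<Config> t;
    t.reserve(static_cast<std::size_t>(keep));
    std::copy_if(configs.begin(), configs.end(), std::back_inserter(t), collisionFree);
    return t;
}
```

On the GPU, the same two calls run over thrust device vectors, so the filter happens in parallel without transferring the configurations back to the host.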
Several important remarks about the implementation of the GPU-accelerated PRM algorithm are discussed in the following subsections.

Algorithm 1 Main steps of the generation portion of the PRM algorithm
1: Graph G is empty
2: while number of vertices in G < N do
3:   generate random configuration q
4:   for each q, select k closest neighbors do
5:     for each neighbor q' do
6:       local planner connects q to its neighbors
7:       if connection is collision free then
8:         add edge (q, q') to G
9:   add vertex q to G

Algorithm 2 Modified version of the generation portion of the PRM algorithm to allow for parallelization
1: Graph G is empty
2: Target vector t is empty
3: generate N random configurations q
4: for each q, check for collisions do
5:   if point is collision free then
6:     save q into t
7: for each q in t, find k closest neighbors do
8:   if connection is collision free then
9:     add edge to G
10:  add vertex to G

74.3.1 Data Transfer Overhead

In the current version of the GPU-accelerated algorithm, the random configurations are generated on the CPU. They are then transferred over to the GPU and used there for the remainder of the algorithm. Since the rest of the algorithm runs on the GPU, there is almost no other data transfer overhead for generating the graph using this method. The only caveat is discussed in Sect. 74.3.3, where an additional minimal overhead is required to maintain the graph as it is generated.

74.3.2 Utilizing Thrust Functions

In order to gain the most efficiency from the GPU, several of the built-in thrust functions were used. To perform the initial collision check, the following steps were taken:

1. use thrust::count_if to count the number of configurations that are not in collision with the obstacle in the scene, and
2.
use thrust::copy_if to save the non-colliding configurations to the target vector t.

The count function is necessary to allocate a target vector of the correct size, minimizing the memory overhead required by the remainder of the algorithm. The thrust functions provide a simple and efficient way to filter out the configurations that are in collision with the obstacle in the scene, leaving only configurations in the target vector which are possible candidates for nodes in the generated graph.

In order to find the nearest neighbors of a given configuration q in the vector t, the following steps were taken:

1. use thrust::transform to calculate the distance from each configuration in t to q,
2. use thrust::sort_by_key to sort the configurations by their distance to q, and
3. get the first k sorted configurations, which correspond to the k-nearest neighbors.

By utilizing the transform function, we were able to get the distance from one configuration to the rest very quickly. By next applying a sorting function, we were able to quickly pull out the k-nearest neighbors by simply taking the portion of the data which corresponded to the closest configurations based on their distances to the given configuration. This manner of k-nearest neighbors search should be very efficient for large graphs, due to the parallelization of both the distance check and the sorting via the thrust functions.

This process is then repeated for every configuration q in t, in order to get the k-nearest neighbors for each configuration in t. This repetition is done by iterating through the configurations, calling the above process each time. This iterative portion is the slowest part of the proposed algorithm. With additional data storage, this portion might benefit from further parallelization in the future.

74.3.3 Storage of the Graph

The last important remark about this algorithm is that the graph G is stored on the CPU.
In order to do this, once we have the k-nearest neighbors for a given configuration q, we copy them from the GPU back to the CPU. This results in an additional data transfer overhead in our algorithm, as briefly mentioned in Sect. 74.3.1. However, for large graphs this transfer becomes minimal, since each configuration only transfers k neighbors. We then store the edges between these neighbors and the given configuration q in the graph, and store the given configuration q as a vertex. Thus, our graph is made up of a standard list of vertices and edges.

Storing the graph on the CPU is very useful for visualization purposes. Visualizing the generated graph is done on the CPU, so storing the graph directly on the CPU as it is generated incurs much less overhead, since there is minimal data transfer between the CPU and the GPU.

74.4 Experimental Validation

74.4.1 Experimental Setup

In order to evaluate the performance of the GPU-accelerated algorithm, we ran several test cases in which we compared a CPU version of the algorithm with the GPU-accelerated version. These were run on an MSI MS-16H2 laptop with a 4-core Intel i7 CPU (i7-4710HQ), 16 GB of RAM, and an NVIDIA GeForce GTX 860M with 4 GB of VRAM and 1152 cores.

In order to allow for a more direct comparison, the CPU version also uses the parallelized version of the PRM algorithm. Due to the use of thrust vectors and functions, the only major difference between the CPU and GPU versions is whether the thrust vectors are stored on the host or the device. In the CPU version, all of the thrust vectors are stored directly on the CPU and no data transfer to the GPU occurs. The details of the GPU version are discussed in Sect. 74.3.

The CPU and GPU-accelerated algorithms were compared in a simulated navigation environment. This environment consists of a single large obstacle in the center of the image.
The goal is to build a graph around the obstacle which can later be used to generate a navigation path through the space. A sample of the environment is shown in Fig. 74.1 with a generated graph of 1000 nodes.

Fig. 74.1 A sample image of the simulated navigation task. The black sphere in the center is the obstacle in the scene. The green dot in the lower left is the starting point and the red dot in the upper right is the ending point for the navigation task. The dots and lines represent a graph of 1000 nodes generated by the GPU-accelerated algorithm

In order to evaluate the performance of the algorithms, we used them to generate graphs with different numbers of nodes: 10, 100, 1000, 10,000, and 100,000. However, due to the incredibly long runtime of the CPU algorithm with 100,000 nodes (>5 h), we chose not to run this scenario on the CPU. Therefore, the 100,000-node case is only run and analyzed for the GPU-accelerated version of the algorithm; in all of the figures, the scenario for the CPU with 100,000 nodes is left blank, and in the tables it is marked as N/A. Within each case, we averaged the runtime over five trials. The base environment was consistent across all trials.

74.4.2 Results and Discussion

The runtimes of the CPU and GPU-accelerated PRM algorithms for the different scenarios are compared in Table 74.1 and Fig. 74.2. Additionally, we looked at a simple version of throughput across these cases. To compute throughput, we simply divided the number of nodes generated by the runtime, giving a rough estimate of how many nodes can be generated per second. The throughput is shown in Table 74.2 and Fig. 74.3.
Lastly, we looked at the speedup of the GPU version with respect to the CPU version. The speedup is shown in Table 74.3 and Fig. 74.4.

Table 74.1 Runtime in seconds, averaged over 5 trials, of both the CPU and GPU-accelerated versions of the modified PRM algorithm

N     10      100     1000   10,000    100,000
CPU   0.0042  0.0524  2.312  224.6414  N/A
GPU   0.1204  0.1326  0.396  2.8516    44.5028

The first row shows which scenario was run (N = 10, …, 100,000). The next two rows show the times for the CPU and GPU, respectively.

Fig. 74.2 A plot of the runtime in seconds of both the CPU and GPU-accelerated versions of the modified PRM algorithm

Table 74.2 Throughput of both the CPU and GPU-accelerated versions of the modified PRM algorithm

N     10       100      1000     10,000   100,000
CPU   2380.95  1908.40  432.53   44.51    N/A
GPU   83.06    754.15   2525.25  3506.80  2247.05

Throughput is calculated by dividing the number of nodes in the graph by the runtime. The first row shows which scenario was run (N = 10, …, 100,000). The next two rows show the throughput for the CPU and GPU, respectively.

Fig. 74.3 A plot of the throughput of both the CPU and GPU-accelerated versions of the modified PRM algorithm. Throughput is calculated by dividing the number of nodes in the graph by the runtime

Table 74.3 Speedup of the GPU-accelerated version of the modified PRM algorithm with respect to the CPU version

N        10     100    1000  10,000  100,000
Speedup  0.035  0.395  5.84  78.82   N/A

Speedup is calculated as the runtime of the CPU version divided by the runtime of the GPU version. The first row shows which scenario was run (N = 10, …, 100,000). The next row shows the speedup of the GPU algorithm.

From Table 74.1 we see that in large cases the GPU-accelerated version ran significantly faster, but in very small cases the CPU version greatly outperformed the GPU version.
This result is what we expected: as the graph grows in size, the GPU version should outperform the CPU version. This is further illustrated in Fig. 74.2, where the slope of the runtime line for the GPU version is much smaller than that of the CPU version. This indicates that as the number of nodes continues to grow, the difference in runtimes should continue to grow as well, implying that the GPU version performs better on larger graphs.

From Fig. 74.3 we see strong trends in the CPU and GPU throughput. For small graphs, the CPU version has high throughput, but it quickly diminishes. On the other hand, the GPU version starts with small throughput, which improves as the number of nodes increases up until the 100,000-node case, where the throughput decreases slightly. Based on this, the GPU algorithm works well for large graphs, but performance decreases if the graphs get too large. The exact values of the throughput can be seen in Table 74.2.

Fig. 74.4 A plot of the speedup of the GPU-accelerated version with respect to the CPU version of the modified PRM algorithm

To further analyze the performance of these methods, we calculate the speedup of the GPU-accelerated version with respect to the CPU version, shown for each scenario in Table 74.3. The results in this table reflect what we saw in the runtime and throughput tables and plots. For small graphs, the GPU performs worse due to the overhead of the initial data transfer. As the graphs get larger, this overhead becomes minimal, resulting in a significant speedup of the GPU version with respect to the CPU version. For large graphs (N = 10,000), the GPU-accelerated version of the algorithm achieves a speedup of 78×. From Fig. 74.4 we see nearly linear speedup on a log-log scale as the number of nodes in the graph increases.
This suggests that although the throughput decreased slightly on the largest graph, the speedup would likely still be significant for very large graphs. Unfortunately, due to the extremely long runtime, we were unable to run the CPU version for the 100,000-node case, so we cannot compute a speedup to confirm that the trend of linear speedup on a log-log scale continues.

Overall, the experiments show that the GPU-accelerated version is much faster and more efficient for large graphs than the CPU version. Therefore, for the future application of long-range autonomous navigation, this algorithm will be very beneficial, as the graphs necessary for path planning in these applications will be very large.

74.5 Future Work

This work presents several important directions for future work. In order to further increase the speedup of the GPU-accelerated PRM algorithm, the random configuration generation will be performed on the GPU instead of the CPU. This will eliminate the overhead of transferring the data from the CPU to the GPU in the first step of the algorithm.

In the future, this algorithm will be incorporated into an end-to-end autonomous navigation system as the global planner for long-range navigation tasks performed by the University of Nevada, Reno's autonomous Lincoln MKZ vehicle. This GPU-accelerated algorithm will help ensure that the autonomous navigation system performs in real time. If the speedup from this algorithm is not enough to perform navigation tasks in real time, several other parallelizations of the algorithm could be implemented later, such as those described in [17]. Additionally, the current version of the algorithm can be modified to include extra data storage that would allow the k-nearest neighbor search to run in parallel across all configurations at once, further speeding up the algorithm.
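For reference, the per-configuration neighbor query that such a batched version would replace — the transform/sort/take-k procedure of Sect. 74.3.2 — can be sketched with CPU std:: analogues of the thrust calls (std::transform for thrust::transform, an index sort standing in for thrust::sort_by_key). The names and types here are illustrative assumptions, not the paper's device code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct Config { double x, y; };

// Returns the indices in t of the k configurations nearest to q.
std::vector<std::size_t> kNearest(const std::vector<Config>& t,
                                  const Config& q, std::size_t k) {
    // Step 1: distance from every configuration in t to q
    // (the role played by thrust::transform on the device).
    std::vector<double> dist(t.size());
    std::transform(t.begin(), t.end(), dist.begin(), [&](const Config& p) {
        return std::hypot(p.x - q.x, p.y - q.y);
    });

    // Step 2: sort configuration indices by their distance to q
    // (the role played by thrust::sort_by_key on the device).
    std::vector<std::size_t> idx(t.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return dist[a] < dist[b]; });

    // Step 3: the first k sorted indices are the k-nearest neighbors.
    idx.resize(std::min(k, idx.size()));
    return idx;
}
```

The iterative version calls this once per configuration in t; the proposed batched version would evaluate all queries simultaneously at the cost of storing the full distance matrix on the device.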
74.6 Conclusion

This work is the first step towards a real-time autonomous navigation system. The focus is on speeding up a sampling-based path planning algorithm for long-range navigation. This work proposes a GPU-accelerated sampling-based planner which can be used as a global planner in autonomous navigation tasks. The sampling-based path planning algorithm explored in this work is the PRM algorithm, and a modified version of its generation portion is presented. Both a CPU and a GPU-accelerated version of this algorithm were evaluated using a simulated navigation environment with graph generation tasks of several different sizes. Based on these experiments, we found that the GPU-accelerated version of the PRM algorithm had significant speedup (up to 78×) over the CPU version. This result is the first step towards the implementation of a real-time autonomous navigation system in the future.

Acknowledgments This material is based in part upon work supported by the National Science Foundation under grant numbers IIA-1301726 and IIS-1719027. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

1. Kavraki, L.E., Svestka, P., Latombe, J.-C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 12(4), 566–580 (1996)
2. Ma, L., Xue, J., Kawabata, K., Zhu, J., Ma, C., Zheng, N.: Efficient sampling-based motion planning for on-road autonomous driving. IEEE Trans. Intell. Transp. Syst. 16(4), 1961–1976 (2015)
3. Jeon, J.h., Cowlagi, R.V., Peters, S.C., Karaman, S., Frazzoli, E., Tsiotras, P., Iagnemma, K.: Optimal motion planning with the half-car dynamical model for autonomous high-speed driving. In: 2013 American Control Conference, pp. 188–193. IEEE, Piscataway (2013)
4.
Braid, D., Broggi, A., Schmiedel, G.: The TerraMax autonomous vehicle. J. Field Rob. 23(9), 693–708 (2006)
5. Kuwata, Y., Teo, J., Fiore, G., Karaman, S., Frazzoli, E., How, J.P.: Real-time motion planning with applications to autonomous urban driving. IEEE Trans. Control Syst. Technol. 17(5), 1105–1118 (2009)
6. Ryu, J.-H., Ogay, D., Bulavintsev, S., Kim, H., Park, J.-S.: Development and experiences of an autonomous vehicle for high-speed navigation and obstacle avoidance. In: Frontiers of Intelligent Autonomous Systems, pp. 105–116. Springer, Berlin (2013)
7. Pettersson, P.O., Doherty, P.: Probabilistic roadmap based path planning for an autonomous unmanned aerial vehicle. In: Proceedings of the ICAPS-04 Workshop on Connecting Planning Theory with Practice (2004)
8. Pettersson, P.O., Doherty, P.: Probabilistic roadmap based path planning for an autonomous unmanned helicopter. J. Intell. Fuzzy Syst. 17(4), 395–405 (2006)
9. Yan, Z., Jouandeau, N., Cherif, A.A.: ACS-PRM: adaptive cross sampling based probabilistic roadmap for multi-robot motion planning. In: Intelligent Autonomous Systems, vol. 12, pp. 843–851. Springer, Berlin (2013)
10. Faust, A., Ramirez, O., Fiser, M., Oslund, K., Francis, A., Davidson, J., Tapia, L.: PRM-RL: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. CoRR, abs/1710.03937 (2017). http://arxiv.org/abs/1710.03937
11. Park, C., Pan, J., Manocha, D.: Realtime GPU-based motion planning for task executions. In: IEEE International Conference on Robotics and Automation Workshop on Combining Task and Motion Planning (May 2013). Citeseer, Princeton (2013)
12. Bialkowski, J., Karaman, S., Frazzoli, E.: Massively parallelizing the RRT and the RRT*. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3513–3518. IEEE, Piscataway (2011)
13.
Jacobs, S.A., Stradford, N., Rodriguez, C., Thomas, S., Amato, N.M.: A scalable distributed RRT for motion planning. In: 2013 IEEE International Conference on Robotics and Automation, pp. 5088–5095. IEEE, Piscataway (2013)
14. Bleiweiss, A.: GPU accelerated pathfinding. In: Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pp. 65–74. Eurographics Association, Genoa (2008)
15. Fischer, L.G., Silveira, R., Nedel, L.: GPU accelerated path planning for multi-agents in virtual environments. In: 2009 VIII Brazilian Symposium on Games and Digital Entertainment, pp. 101–110. IEEE, Piscataway (2009)
16. Kider, J.T., Henderson, M., Likhachev, M., Safonova, A.: High-dimensional planning on the GPU. In: 2010 IEEE International Conference on Robotics and Automation, pp. 2515–2522. IEEE, Piscataway (2010)
17. Pan, J., Manocha, D.: GPU-based parallel collision detection for fast motion planning. Int. J. Robot. Res. 31(2), 187–200 (2012)