Combination of Computer Vision Detection and Segmentation for Autonomous Driving
Yu-Ho Tseng, Shau-Shiun Jan, Department of Aeronautics and Astronautics, National Cheng Kung University, Taiwan
Unmanned technologies, on which we increasingly rely, are becoming more feasible and are gradually being taken seriously. For example, many research institutes and companies are developing self-driving cars to reduce human errors caused by fatigue, and many researchers have consequently turned to computer vision. More accurate predictions of the shapes of asphalt roads, cars and people give a self-driving car more reliable information about where to go and what to avoid. Current advanced deep learning networks for autonomous cars are based on semantic segmentation, which detects different objects by delineating the edge of each object. However, semantic segmentation has a critical weakness: the accuracy of the segmentation edge. Because semantic segmentation classifies the image pixel by pixel, the segmentation edge is often rough and blurred, and an inaccurate edge may cause serious misjudgments and, in turn, accidents. Besides semantic segmentation, object detection is also a focus of current work on multi-object recognition. Object detection is well suited to describing the objects a self-driving car must protect: because it outlines an object with a rectangular box, it provides an additional safety margin around the protected target. However, object detection is not suitable for curved objects, since a rectangular box would then include an excessive area that does not belong to the object. Therefore, to develop a more reliable computer vision system for autonomous cars, object detection and semantic segmentation need to be integrated.
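To make the trade-off between the two shape descriptions concrete, the following minimal sketch compares a pixel mask with the bounding box that encloses it. The toy masks are hypothetical: a compact, car-like blob and a diagonal road strip standing in for a road that bends across the image. The fraction of the box not covered by the object plays the role of the safety margin in the first case and of the excess classification area in the second.

```python
# Hypothetical toy masks on a 6x6 grid: 1 = object pixel, 0 = background.
# The shapes are illustrative only and are not taken from any dataset.

def bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max) enclosing all 1-pixels.

    Assumes the mask contains at least one object pixel.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return min(rows), min(cols), max(rows), max(cols)

def box_area(box):
    """Number of pixels inside an inclusive (r0, c0, r1, c1) box."""
    r0, c0, r1, c1 = box
    return (r1 - r0 + 1) * (c1 - c0 + 1)

def mask_area(mask):
    """Number of object pixels, i.e. what a pixel-wise segmentation labels."""
    return sum(sum(row) for row in mask)

def excess_ratio(mask):
    """Fraction of the bounding box that does NOT belong to the object."""
    return 1 - mask_area(mask) / box_area(bounding_box(mask))

car = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
road = [
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]

print(f"car: {excess_ratio(car):.2f}, road: {excess_ratio(road):.2f}")
# prints "car: 0.25, road: 0.69"
```

For the compact car the box adds a modest margin around the object, while for the diagonal strip most of the box is not road at all, which is why a box is a reasonable envelope for cars and people but a poor description of road area.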
Consequently, this paper designs a unified network architecture that exploits the advantages of both semantic segmentation and object detection to detect people, cars and asphalt roads simultaneously. As stated above, semantic segmentation is good at describing irregular shapes because it classifies images pixel by pixel; it is therefore used for asphalt road detection. Cars and people, on the other hand, are the objects that most need protection from collisions. Since a larger safety margin should be provided for them, object detection is the better choice. However, even once the appropriate tools have been chosen, designing a proper unified network that combines the two approaches is another concern. Although both techniques are based on the same underlying network architecture, networks trained for semantic segmentation and for object detection differ greatly: most semantic segmentation networks are trained to classify pixel by pixel, whereas modern object detection networks are trained to classify small regions of the image. Therefore, to integrate semantic segmentation with object detection appropriately, three main tasks must be addressed: (1) designing a network that integrates both techniques, (2) defining the loss function properly and (3) testing the accuracy and time consumption of the trained model. First, we select a current deep learning network for self-driving as a reference and transform its last layer into a detection decoder and a segmentation decoder. We use multiple feature layers to accelerate training and improve accuracy. Additionally, we use a relatively low-resolution input, which effectively reduces the computational load and increases detection speed. Second, since the definition of the loss function is the key factor in training quality and convergence rate, we define an appropriate loss function for training our detection and segmentation decoders.
Third, we compare the accuracy and time consumption of our proposed approach with those of segmentation-only and detection-only networks.
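The effect of a lower input resolution on computation can be estimated with simple arithmetic. The sketch below counts the multiply-accumulate operations of a single stride-1, same-padded 3x3 convolution; the resolutions and channel counts are hypothetical, since this section does not state the input size actually used. Because the count is proportional to the number of output pixels, halving both image dimensions cuts the work of such a layer by a factor of four.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulate count of one stride-1, same-padded 2D convolution.

    Each of the height * width output positions computes c_out dot products,
    each over a kernel x kernel x c_in input window.
    """
    return height * width * c_in * c_out * kernel * kernel

# Hypothetical input sizes (height, width): a full-size frame and the same
# frame downsampled by 2 in each dimension. Channel counts are illustrative.
full = conv_macs(384, 1248, 64, 64)
half = conv_macs(192, 624, 64, 64)

print(full, half, full / half)  # 17666408448 4416602112 4.0
```

The same scaling applies to every convolutional layer whose feature map shrinks with the input, which is why a reduced resolution speeds up inference; the trade-off is that small or distant objects then cover fewer pixels.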
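This section does not give the form of the loss function, so the following is only a sketch under stated assumptions. It assumes a joint objective that adds a pixel-wise binary cross-entropy for the road mask to a grid-cell detection loss (an objectness cross-entropy on every cell plus an L1 box-regression term on cells that contain a car or person), weighted by a hypothetical `det_weight`. This mirrors the distinction drawn above between pixel-by-pixel and small-area classification; the dictionary keys and the weighting are illustrative, and a real implementation would compute these terms on tensors in a deep learning framework.

```python
import math

EPS = 1e-7  # clamps probabilities so log() never receives 0

def binary_cross_entropy(p, y):
    """BCE for one predicted probability p against a label y in {0, 1}."""
    p = min(max(p, EPS), 1 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def segmentation_loss(road_probs, road_labels):
    """Mean per-pixel BCE over a flattened road mask (assumed form)."""
    losses = [binary_cross_entropy(p, y) for p, y in zip(road_probs, road_labels)]
    return sum(losses) / len(losses)

def detection_loss(cells):
    """Assumed grid-cell loss: objectness BCE on every cell, plus L1 box
    regression averaged over the cells that actually contain an object.

    Each cell is a dict with hypothetical keys 'obj_prob', 'has_obj',
    'pred_box' and 'true_box' (boxes as (x, y, w, h)).
    """
    conf, reg, n_pos = 0.0, 0.0, 0
    for c in cells:
        conf += binary_cross_entropy(c["obj_prob"], c["has_obj"])
        if c["has_obj"]:
            reg += sum(abs(p - t) for p, t in zip(c["pred_box"], c["true_box"]))
            n_pos += 1
    return conf / len(cells) + (reg / n_pos if n_pos else 0.0)

def total_loss(seg_loss, det_loss, det_weight=1.0):
    """Hypothetical joint objective: weighted sum of the two decoder losses."""
    return seg_loss + det_weight * det_loss

seg = segmentation_loss([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0])
cells = [
    {"obj_prob": 0.8, "has_obj": 1,
     "pred_box": (0.5, 0.5, 0.2, 0.3), "true_box": (0.5, 0.4, 0.2, 0.3)},
    {"obj_prob": 0.1, "has_obj": 0,
     "pred_box": (0.0, 0.0, 0.0, 0.0), "true_box": (0.0, 0.0, 0.0, 0.0)},
]
det = detection_loss(cells)
print(total_loss(seg, det))
```

Keeping the two terms separate makes it easy to monitor each decoder during training, and `det_weight` offers one knob for balancing them if one decoder converges faster than the other.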
Finally, this paper provides (1) an accuracy assessment of semantic segmentation and object detection, (2) an evaluation of the time consumption of every task in the inference process and (3) a comparison between our approach and segmentation-only and detection-only networks.
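The section does not name the metrics or the timing procedure. As a minimal sketch, assuming intersection over union (IoU) is used for both tasks, the functions below compute pixel-level IoU for the road mask and box-level IoU for detections, and time repeated calls to an inference function. `infer` is a hypothetical stand-in for a network's forward pass; the same timing helper could be applied to the unified, segmentation-only and detection-only models for the comparison.

```python
import time

def mask_iou(pred, truth):
    """Pixel-level IoU between two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return inter / union if union else 1.0  # two empty masks agree fully

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_latency_ms(infer, inputs, warmup=2):
    """Average wall-clock time per call of `infer`, in milliseconds."""
    for x in inputs[:warmup]:
        infer(x)  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    return (time.perf_counter() - start) * 1000 / len(inputs)

print(mask_iou([1, 1, 0, 0], [1, 0, 1, 0]))      # 1 shared pixel out of 3
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))       # overlap 1, union 7
latency = mean_latency_ms(lambda x: sum(x), [[1, 2, 3]] * 10)  # toy "model"
print(f"{latency:.4f} ms per call")
```

On a GPU, kernels run asynchronously, so the device must be synchronized before each clock reading for the measured time to reflect the actual inference cost.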