Session B4b: Sensor Fusion

A Semantic Segmentation-based Approach for Train Positioning
Sara Baldoni, Roma Tre University & Radiolabs, Italy; Federica Battisti, University of Padova, Italy; Michele Brizzi, Roma Tre University & Radiolabs, Italy; Michael Neri, Roma Tre University, Italy; Alessandro Neri, Roma Tre University & Radiolabs, Italy
Location: Beacon B

Peer Reviewed

The European rail sector is increasing its efforts to build a positioning architecture that includes GNSS in the European Rail Traffic Management System (ERTMS). The railway community has selected satellite-based positioning as one of the key game changers for the evolution of ERTMS beyond the current framework based on odometers and transponders. However, satellite-related assets must operate seamlessly alongside the current signalling standards in order to ensure full compatibility. Unfortunately, GNSS-based positioning techniques are vulnerable to degradations such as GNSS faults (e.g., satellite and constellation failures), signal deterioration (e.g., ionospheric scintillation, multipath), and external threats (e.g., jamming and spoofing). These vulnerabilities must be overcome in order to provide high-accuracy, high-integrity navigation solutions for safety-critical applications.
In this context, the objective of this work is to enhance positioning performance by fusing data from GNSS and visual sensors, thereby providing accurate position information in challenging railway environments. Although GNSS will still play a primary role in train localization, the integration of additional sensors will become compulsory. Different sensors can cooperate to increase the accuracy or to enhance the robustness of the overall system. Moreover, the redundancy gained through the multi-sensor fusion approach can be exploited for integrity purposes.
When designing a multi-sensor localization framework, the choice of sensors is of utmost importance. Different sensors provide different types of data (e.g., images, depth maps, point clouds) and can work in different operating conditions. Usually, cameras, stereo cameras, radars, and LIDARs are used as additional sensors. In this work, we explore the application of cameras. The reasons behind this choice are manifold. First, cameras were among the first additional sensors to be installed on cars, and thus represent a well-known and mature technology. In addition, they are low-cost and lightweight, and therefore suitable for installation on all classes of trains (e.g., regional trains, public transport trains, etc.). Moreover, as highlighted in [1], cameras provide higher-resolution images than LIDARs, enabling easier scene segmentation and understanding. Finally, deep learning tools for detection, segmentation, and classification have mainly been applied to images, paving the way for a deep-learning-based rail localization framework.
In the proposed scheme, on-board cameras are used to detect georeferenced landmarks. Optimal candidates are railway infrastructure elements, e.g., signs, gantries, and traffic lights, but any conspicuous, easily identified point can be used.
Moreover, we exploit the constrained motion of the train along its track to enhance robustness against noisy sensor data, errors in landmark detection, and ambiguities due to incorrect landmark association. To this aim, a track map describing detailed geometric features in combination with the topological track connections is used. Given the rapid development of Geographic Information Systems (GIS), we assume the availability of a digital map of the railway environment containing accurate coordinates of both the landmarks and the tracks.
During operation, landmarks are detected in the images acquired by the on-board camera using a neural network, and are compared with those potentially inside the camera's field of view. To reduce the number of candidates and the risk of incorrect landmark association, we limit the database search to the 2D confidence region, corresponding to the Hazardous Misleading Information (HMI) probability, surrounding the current position estimate. This also decreases the computational cost of the database search.
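A minimal sketch of this candidate-filtering step is given below, assuming a local 2D East-North frame, a Gaussian position estimate with covariance cov_2d, and an HMI probability p_hmi; the function and variable names are hypothetical, not taken from the authors' implementation.

```python
import numpy as np
from scipy.stats import chi2

def candidate_landmarks(db_positions, est_position, cov_2d, p_hmi=1e-7):
    """Keep only the database landmarks inside the 2D confidence ellipse
    around the estimated position. db_positions: (N, 2) georeferenced
    landmark coordinates in a local East-North frame; est_position: (2,)
    current estimate; cov_2d: 2x2 position covariance."""
    # Chi-square threshold for the ellipse that excludes the target
    # misleading-information probability (2 degrees of freedom).
    k2 = chi2.ppf(1.0 - p_hmi, df=2)
    diff = db_positions - est_position
    inv_cov = np.linalg.inv(cov_2d)
    # Squared Mahalanobis distance of each landmark from the estimate.
    d2 = np.einsum('ni,ij,nj->n', diff, inv_cov, diff)
    return db_positions[d2 <= k2]
```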
Then, the information extracted from the database is used to compute the train position based on the lines-of-sight of the landmarks, i.e., the angles at which they are seen from the camera. More specifically, a generalized triangulation algorithm localizes the train by intersecting the loci of points for which the difference of the viewing angles of each landmark pair remains constant. Each locus corresponds to a portion of the circle passing through the two landmarks and the train. The intersections between the circles can be computed by solving the nonlinear system given by the equations of those circumferences. Every time a new group of landmarks is encountered, this procedure is repeated in order to compensate for the drift accumulated by the odometry subsystem.
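As an illustration, the nonlinear system can equivalently be solved in a least-squares sense rather than by explicit circle intersection. The following sketch (function names and structure are assumptions, not the authors' implementation) estimates the 2D position from measured viewing-angle differences using SciPy, starting from an initial guess such as the GNSS/odometry estimate.

```python
import numpy as np
from scipy.optimize import least_squares

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return np.angle(np.exp(1j * a))

def angle_residuals(x, landmarks, pairs, angle_diffs):
    """Difference between measured and predicted viewing-angle differences
    for each landmark pair, as seen from candidate position x (2,).
    landmarks: (N, 2) georeferenced positions; pairs: list of (i, j)
    index pairs; angle_diffs: measured angle subtended by each pair."""
    res = []
    for (i, j), meas in zip(pairs, angle_diffs):
        bi = np.arctan2(landmarks[i, 1] - x[1], landmarks[i, 0] - x[0])
        bj = np.arctan2(landmarks[j, 1] - x[1], landmarks[j, 0] - x[0])
        res.append(wrap(wrap(bi - bj) - meas))
    return np.asarray(res)

def triangulate(landmarks, pairs, angle_diffs, x0):
    """Generalized triangulation: find the 2D position whose viewing-angle
    differences best match the measurements, i.e., the (approximate)
    intersection of the constant-angle circles."""
    sol = least_squares(angle_residuals, x0, args=(landmarks, pairs, angle_diffs))
    return sol.x
```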
In addition, when the computed position does not meet the stringent railway requirements, due for instance to an unfavorable landmark distribution, or when the number of detected visible landmarks is too small, we employ the track map by computing the intersection between the available circles and the track path.
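One hedged way to realize this fallback, reusing angle_residuals from the previous sketch, is to reduce the problem to a single degree of freedom by evaluating the residuals only at points sampled along the track polyline and keeping the best fit:

```python
import numpy as np

def position_on_track(track_points, landmarks, pairs, angle_diffs):
    """Track-constrained fallback: when free triangulation is
    ill-conditioned, restrict the solution to samples of the track
    polyline (track_points: (M, 2)) and keep the sample where the
    measured circles come closest to intersecting the track path."""
    costs = [np.sum(angle_residuals(p, landmarks, pairs, angle_diffs) ** 2)
             for p in track_points]
    return track_points[int(np.argmin(costs))]
```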
However, the initial surveying effort needed to build the track database for entire railway lines could be considered too onerous by rail stakeholders with respect to the current balise-based systems, and could therefore hinder the adoption of these innovative technologies. For this reason, we will investigate the potential of visual sensors for providing positioning information regardless of the availability of the track map. Moreover, the minimal configuration that enables the computation of the train's position, and the impact of the track map on the resulting accuracy, will be discussed. This way, the surveying effort could initially be limited to the areas in which few landmarks are present.
The proposed localization framework requires a precise perception of the surrounding environment in order to perform the positioning task. In this work, perception is achieved through machine learning and deep learning algorithms that transform sensor data into semantic information. Object detection and semantic segmentation are among the most critical tasks in an autonomous transport system. They combine regression and classification to recognize and classify relevant objects in images, videos, and point clouds. Convolutional Neural Networks (CNNs) have achieved tremendous success in different fields of computer vision, and are currently gaining attention for the tasks of image segmentation and object detection.
Regarding the object detection task in images, many state-of-the-art architectures (e.g., Fast R-CNN [2], Mask R-CNN [4]) adopt the "recognition using regions" paradigm in order to process different spatial locations and different aspect ratios within the image, while single-stage detectors such as YOLOv4 [3] predict bounding boxes directly. In the most recent region-based detectors, a Region Proposal Network (RPN) generates candidate bounding boxes, which are subsequently refined and classified.
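For context, a region-based detector of this family can be run off the shelf; the snippet below uses torchvision's pre-trained Mask R-CNN purely as an illustration of the region-proposal pipeline (the weights argument may differ across torchvision versions), not as the method adopted in this work.

```python
import torch
import torchvision

# Off-the-shelf region-based detector: Mask R-CNN with a ResNet-50 FPN
# backbone, pre-trained on COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 512, 512)        # placeholder for a camera frame in [0, 1]
with torch.no_grad():
    detections = model([frame])[0]     # RPN proposals -> refined boxes/masks
print(detections["boxes"].shape, detections["scores"][:5])
```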
On the other hand, semantic segmentation is the task of clustering together the parts of an image that belong to the same object class. It is a form of pixel-wise classification that identifies relevant regions in the camera's field of view. For instance, U-net [5] is a U-shaped semantic segmentation network with a contracting path and an expansive path of convolutional filters. Originally developed for biomedical purposes, U-net achieves a high level of segmentation accuracy in the automotive environment as well. Its output is a segmentation mask that labels each pixel of the input image. State-of-the-art models integrate the U-net architecture with attention modules that learn to suppress irrelevant regions while highlighting salient features.
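A minimal two-level sketch of a U-shaped network with additive attention gates is shown below in PyTorch; depths, channel counts, and the gating design are illustrative assumptions and do not reproduce the exact network trained in this work.

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    """Two 3x3 convolutions with batch norm and ReLU (basic U-net block)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class AttentionGate(nn.Module):
    """Additive attention gate: weights the encoder features x by a map
    computed from x and the coarser decoder (gating) signal g, so that
    irrelevant regions are suppressed before the skip connection."""
    def __init__(self, c_g, c_x, c_mid):
        super().__init__()
        self.wg = nn.Conv2d(c_g, c_mid, 1)
        self.wx = nn.Conv2d(c_x, c_mid, 1)
        self.psi = nn.Sequential(nn.Conv2d(c_mid, 1, 1), nn.Sigmoid())

    def forward(self, g, x):
        a = self.psi(torch.relu(self.wg(g) + self.wx(x)))  # (B, 1, H, W) in [0, 1]
        return x * a

class AttentionUNet(nn.Module):
    """Two-level U-shaped segmentation network with attention-gated skips.
    Input height and width must be divisible by 4."""
    def __init__(self, n_classes, base=32):
        super().__init__()
        self.enc1 = double_conv(3, base)
        self.enc2 = double_conv(base, 2 * base)
        self.pool = nn.MaxPool2d(2)
        self.bott = double_conv(2 * base, 4 * base)
        self.up2 = nn.ConvTranspose2d(4 * base, 2 * base, 2, stride=2)
        self.att2 = AttentionGate(2 * base, 2 * base, base)
        self.dec2 = double_conv(4 * base, 2 * base)
        self.up1 = nn.ConvTranspose2d(2 * base, base, 2, stride=2)
        self.att1 = AttentionGate(base, base, base // 2)
        self.dec1 = double_conv(2 * base, base)
        self.head = nn.Conv2d(base, n_classes, 1)  # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)                    # contracting path
        e2 = self.enc2(self.pool(e1))
        b = self.bott(self.pool(e2))
        d2 = self.up2(b)                     # expansive path
        d2 = self.dec2(torch.cat([self.att2(d2, e2), d2], dim=1))
        d1 = self.up1(d2)
        d1 = self.dec1(torch.cat([self.att1(d1, e1), d1], dim=1))
        return self.head(d1)                 # (B, n_classes, H, W)
```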
In order to compute the line-of-sight between the train and a landmark, the location of the pixels belonging to the landmark has to be accurately determined. For this reason, semantic segmentation is preferable to object detection. In addition, given the peculiarities of the railway environment, we focus on the detection of traffic lights, traffic signs, and, in general, the signalling equipment used for train operations.
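Assuming a calibrated pinhole camera (focal length fx and principal point abscissa cx in pixels, both hypothetical names here), the horizontal viewing angle of a segmented landmark can be recovered from its mask, for example via the mask centroid:

```python
import numpy as np

def line_of_sight(mask, fx, cx):
    """Horizontal viewing angle (rad) of a segmented landmark under a
    pinhole camera model. mask: boolean (H, W) segmentation mask of one
    landmark instance; fx, cx: focal length and principal point abscissa
    in pixels from the camera calibration."""
    ys, xs = np.nonzero(mask)
    u = xs.mean()                   # centroid column of the landmark pixels
    return np.arctan2(u - cx, fx)   # tan(theta) = (u - cx) / fx
```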
As a consequence, the realization of the semantic segmentation task is crucial for the deployment of the proposed localization framework. However, to the best of our knowledge, it has never been employed for train positioning purposes. In addition, only a few datasets containing images acquired by on-board visual sensors are publicly available for the railway scenario. For this reason, in this paper we explore the application of semantic segmentation by exploiting the RailSem19 dataset [6] to train a U-shaped network with an attention mechanism.
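A generic training loop for such a network is sketched below; RailSem19Dataset is a hypothetical wrapper (the actual dataset ships frames with label maps and JSON annotations that must be adapted into (image, mask) pairs), and all hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=20, lr=1e-3, device="cuda"):
    """Train a segmentation network with a per-pixel cross-entropy loss.
    dataset is assumed to yield (image, mask) pairs: image (3, H, W)
    float tensors and mask (H, W) integer class maps."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.to(device).train()
    for epoch in range(epochs):
        for images, masks in loader:
            opt.zero_grad()
            logits = model(images.to(device))        # (B, n_classes, H, W)
            loss = loss_fn(logits, masks.long().to(device))
            loss.backward()
            opt.step()

# Hypothetical usage:
# model = AttentionUNet(n_classes=19)
# train(model, RailSem19Dataset("path/to/railsem19"))
```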

REFERENCES
[1] L. Heng et al., "Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System," 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 4695-4702, doi: 10.1109/ICRA.2019.8793949.
[2] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448, doi: 10.1109/ICCV.2015.169.
[3] A. Bochkovskiy, C. Wang, and H. M. Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," arXiv preprint arXiv:2004.10934, 2020.
[4] K. He, G. Gkioxari, P. Dollár and R. Girshick, "Mask R-CNN," 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980-2988, doi: 10.1109/ICCV.2017.322.
[5] O. Ronneberger, P. Fischer, T. Brox, “U-net: Convolutional networks for biomedical image segmentation”, in: N. Navab, J. Hornegger, W. M. Wells, A. F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241.
[6] O. Zendel, M. Murschitz, M. Zeilinger, D. Steininger, S. Abbasi and C. Beleznai, "RailSem19: A Dataset for Semantic Rail Scene Understanding," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1221-1229, doi: 10.1109/CVPRW.2019.00161.


