Visual-Inertial SLAM by fusing Stereo and Inertial Measurement Units based on ORB-SLAM
Yi-Chieh Sun, Shau-Shiun Jan, National Cheng Kung University, Taiwan
Alternate Number 5
Simultaneous localization and mapping (SLAM) has been a popular research topic, and SLAM is widely used in robotic vision, virtual reality, and augmented reality. Depending on the sensors used, SLAM can be categorized into 2D/3D SLAM, RGB-D SLAM, and visual SLAM. Visual SLAM can be further classified into feature-based SLAM and direct SLAM according to how image information is used: feature-based SLAM tracks an image by comparing the characteristics of feature points, while direct SLAM operates on pixel intensities directly. However, visual SLAM has weaknesses when handling pure rotation, passing moving objects, or texture-less areas. These situations can degrade the accuracy of pose estimation and even lead to tracking loss. To address this problem, it is beneficial to fuse an inertial measurement unit (IMU) with SLAM. The gyroscope and accelerometer provide angular velocity and acceleration measurements. With this additional self-motion information from the IMU, the system is able to estimate the gravity direction and make the rotation observable. IMU observations are better suited to quick movements over short periods of time, which is precisely the weakness of a camera. On the other hand, the drift of the inertial sensor can be corrected by fusing its data with the camera, since the image information does not drift while the camera observes a fixed scene. Thus, camera and IMU measurements complement each other, yielding a more robust and accurate system.
In 2016, a method fusing a monocular camera and IMU with ORB-SLAM was presented. Following the approach of ORB-SLAM, that system is capable of closing loops and reusing its map to achieve zero-drift localization in already mapped areas. It overcomes the drift of the IMU, and accurate state estimates are obtained from the visual-inertial initialization. However, a monocular camera has the well-known problem of scale ambiguity, so the system requires scale estimation and scale refinement. At start-up, only pure monocular ORB-SLAM is run to provide measurements for initialization; once the initialization procedure obtains reliable estimates, the system starts fusing IMU information. The experimental results show that about 15 seconds are needed to obtain these reliable estimates. If the scale were known from the start, the initialization procedure could be simplified and the computation time reduced. Since a stereo camera provides the scale directly, and no work fusing stereo and IMU based on the ORB-SLAM algorithm has yet been presented, this research proposes a visual-inertial SLAM that fuses a stereo camera and IMU based on ORB-SLAM. Using a stereo instead of a monocular camera, the system obtains more information from the images, which enhances tracking; as a result, more accurate pose estimates are obtained and robustness is increased. The system is tested on the public EuRoC dataset, which includes sequences with different flight dynamics and lighting conditions.
Inspired by the work fusing monocular ORB-SLAM with an IMU, we follow the same tightly-coupled, optimization-based approach to achieve a visual-inertial SLAM, but use a stereo camera in order to acquire the scale directly. Because the IMU sampling rate is higher than that of the camera, the IMU measurements between two keyframes are pre-integrated into a single compound measurement. The initialization of a visual-inertial system is the central issue on which we focus. Several steps are needed to initialize the monocular visual-inertial ORB-SLAM. The system first runs pure monocular ORB-SLAM for a few seconds to estimate an initial solution for the structure and to generate several keyframe poses. Then, the gyroscope bias is computed from the known orientations of consecutive keyframes, so that the accelerometer measurements can be correctly rotated. Next, the scale and gravity vector are estimated without considering the accelerometer bias; to avoid solving for the velocities, the position relations between three consecutive keyframes are used in this step. Once the gravity direction is available, the known magnitude of gravity is used to solve for the accelerometer bias and to refine the scale and gravity direction. Finally, since the gravity, biases, and scale are known, the velocity of each keyframe can be computed from the position relation between two consecutive keyframes. When all of these variables are reliably estimated, the system starts fusing the IMU information. We expect that, since the stereo camera provides the scale, the equations solving for the scale and gravity vector can be simplified, speeding up the computation. The estimates of the gravity direction and the accelerometer bias can then be closer to their true values, improving the accuracy of the proposed method.
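The pre-integration of IMU samples between two keyframes can be sketched as follows. This is a minimal illustration in plain Python with simple Euler integration and a first-order rotation update; the actual system uses on-manifold pre-integration with noise propagation and bias Jacobians, and all function names here are hypothetical:

```python
import math

def mat_mul(A, B):
    # 3x3 matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat_vec(A, v):
    # 3x3 matrix times 3-vector.
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def so3_exp(w):
    # Rodrigues formula: rotation matrix for a rotation vector w.
    theta = math.sqrt(sum(x * x for x in w))
    K = [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]
    if theta < 1e-8:
        a, b = 1.0, 0.5        # small-angle limits of sin/theta, (1-cos)/theta^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    K2 = mat_mul(K, K)
    I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    return [[I[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
            for i in range(3)]

def preintegrate(samples, dt, bg, ba):
    """Compound the IMU measurements between two keyframes.

    samples: list of (gyro, accel) 3-vectors in the body frame.
    dt: IMU sampling period; bg, ba: current gyro/accel bias estimates.
    Returns the delta rotation dR, delta velocity dv, and delta position dp
    expressed in the first keyframe's body frame. Gravity is deliberately
    not subtracted here: its direction only becomes known later, during
    the visual-inertial initialization described above.
    """
    dR = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    dv = [0.0, 0.0, 0.0]
    dp = [0.0, 0.0, 0.0]
    for gyro, accel in samples:
        a = mat_vec(dR, [accel[i] - ba[i] for i in range(3)])
        for i in range(3):
            dp[i] += dv[i] * dt + 0.5 * a[i] * dt * dt
            dv[i] += a[i] * dt
        dR = mat_mul(dR, so3_exp([(gyro[i] - bg[i]) * dt for i in range(3)]))
    return dR, dv, dp
```

When a bias estimate changes during initialization, a real implementation corrects the stored deltas with pre-computed Jacobians rather than re-running this loop, which is what makes pre-integration efficient.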
To conclude, we present a tightly-coupled visual-inertial SLAM system. This ORB-SLAM-based system fuses a stereo camera and an IMU, providing complete localization and mapping in indoor areas with automatic initialization. We use the proposed method to process the public EuRoC dataset and compare its performance with that of the monocular visual-inertial ORB-SLAM system. First, we compare the computation time for the complete IMU initialization. Since the stereo camera provides the scale, we believe that the IMU initialization can be simplified, reducing the time needed to obtain an accurate initialization. Second, we examine the influence of the initialization on the accuracy of the subsequent optimization; because the scale factor is known, the remaining initialization values can be estimated more accurately. Then, to check whether a stereo camera provides more precise results than a monocular one, we calculate the root mean square error (RMSE) of the trajectory and compare it with that of the monocular visual-inertial ORB-SLAM. The robustness of our method is also evaluated: the EuRoC dataset includes 11 sequences with different flight dynamics and lighting conditions, and while the monocular visual-inertial ORB-SLAM loses track in one of the eleven sequences, we expect our system to successfully process every sequence. Overall, the proposed method fusing a stereo camera and IMU based on ORB-SLAM is expected to have a simpler initialization procedure, more accurate positioning results, and greater robustness across different conditions.
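As an illustration of the trajectory evaluation, the sketch below computes a translational RMSE between time-aligned estimated and ground-truth positions. It is a simplified, hypothetical version of the metric: a full absolute-trajectory-error evaluation would also align rotation (and, for monocular SLAM, scale) with a least-squares SE(3)/Sim(3) fit, whereas here only the mean translational offset is removed, which is reasonable because the stereo system already provides metric scale:

```python
import math

def ate_rmse(est, gt):
    """Translational trajectory RMSE after removing the mean offset.

    est, gt: equally long, time-aligned lists of 3-D positions
    (estimated trajectory and ground truth). Rotation/scale alignment
    is deliberately omitted in this simplified sketch.
    """
    n = len(est)
    # Best translational alignment in the least-squares sense is the
    # mean offset between the two trajectories.
    offset = [sum(g[i] - e[i] for e, g in zip(est, gt)) / n for i in range(3)]
    sq = 0.0
    for e, g in zip(est, gt):
        sq += sum((e[i] + offset[i] - g[i]) ** 2 for i in range(3))
    return math.sqrt(sq / n)
```

For example, a trajectory that is simply shifted by a constant vector from the ground truth yields an RMSE of zero, since the drift-free offset is absorbed by the alignment.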