First-Person Indoor Navigation via Vision-Inertial Data Fusion
Amirreza Farnoosh, Mohsen Nabian, Pau Closas, Sarah Ostadabbas, Electrical & Computer Engineering Department, Northeastern University
In this work, we aim to enhance the first-person indoor navigation experience by fusing IMU data collected from a smartphone carried by the user with the vision information obtained through the phone's camera. Besides first-person indoor navigation, the proposed data fusion approach is of interest in a variety of applications, ranging from autonomous robotic navigation and unmanned vehicle control to real-time 3D map reconstruction of a new (or unknown) location.
Humans naturally perceive and navigate their surrounding world using their sense of vision as the primary input modality, in combination with proprioceptive feedback. Their brains then reconstruct 3D models of the visualized scenes, allowing them to navigate through the environment or to explore a new location. However, to reach a destination, path information needs to be provided to the person via a map in conjunction with a positioning system, or through their prior knowledge of the location/path. Global Navigation Satellite System (GNSS) data collected over time provide localization and information about the path taken. However, GNSS signals are usually unavailable or very weak indoors, such as inside buildings or tunnels. To estimate accurate positions in areas where GNSS signals are unavailable, data from inertial measurement units (IMUs) have been used for relative odometry and orientation detection. Yet, IMU data are prone to extensive drift and distortion, typically caused by error accumulation. Even if we could build a drift-free system by applying computationally expensive filters and bias estimation models, IMUs do not give the subject's orientation with respect to the known indoor coordinates, which humans form by perceiving the locations of walls/corridors. Moreover, IMUs do not provide any information about the overall 3D structure of indoor places/paths, which is necessary for indoor navigation and localization. Unlike IMUs, vision data collected via cameras can provide drift-free, instantaneous information about the person's orientation relative to the indoor coordinates, as well as depth inference (distance to the walls), which is sufficient for indoor scene understanding. IMU data and vision-extracted information can therefore be used together to enhance or complement each other.
Vision-based orientation can be used to correct the drift of gyroscope measurements and thus enhance attitude estimation. Conversely, gyroscope data can improve the angle estimates from video when few or no visual cues are detected in the frames.
To this end, we employed the concept of vanishing directions, together with the orthogonality constraints of man-made environments (the Manhattan world assumption), in an expectation maximization (EM) framework to estimate the person's orientation with respect to the known indoor coordinates from video frames, as well as to detect hallways' depth and width information. In man-made environments, the dominant vanishing directions are aligned with the three orthogonal directions of the reference world coordinate frame. These orthogonality constraints can be used to estimate the relative orientation of the camera with respect to the scene. We formulated the problem such that we obtain the vanishing directions, in a Gaussian sphere representation, through a probabilistic framework from the straight edge lines detected in the frames. This framework simultaneously detects vanishing directions and groups parallel lines using the EM algorithm. Incorporating the orthogonality constraint of man-made environments into this framework results in more robust orientation estimation. We proposed two approaches to solve the objective function of the EM algorithm. In the first approach, we explicitly solved the objective function for the vanishing directions using eigenvalue decomposition. Once the vanishing directions are known, extracting the orientation angles is straightforward under the Manhattan world assumption. In the second approach, we directly solved the objective function of the EM algorithm for the orientation angles using the gradient descent (GD) method. Although this approach solves the objective function only approximately, it gives full control over the orientation angles and therefore leads to more accurate estimates (especially in crowded scenes) when we have prior knowledge about the camera rotation axis, or when we want to incorporate angle information from other sources (e.g., IMUs) into the framework.
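The eigenvalue-decomposition variant of the EM loop can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes each detected edge line is represented by the unit normal of its interpretation plane on the Gaussian sphere (so a line consistent with vanishing direction v satisfies n·v ≈ 0), the function name and the Gaussian soft-assignment width are our own, and the Manhattan constraint is imposed by projecting the three directions onto the nearest rotation via SVD.

```python
import numpy as np

def em_vanishing_directions(normals, V0=None, n_iters=100, sigma=0.1):
    """Estimate three orthogonal vanishing directions (illustrative sketch).

    normals: (N, 3) unit interpretation-plane normals of detected edge lines.
    V0: initial 3x3 direction matrix (columns); the paper initializes each
        frame with the previous frame's angles, here we default to identity.
    Returns a 3x3 matrix whose columns are orthonormal vanishing directions.
    """
    V = np.eye(3) if V0 is None else np.asarray(V0, dtype=float)
    for _ in range(n_iters):
        # E-step: soft-assign each line to the direction it fits best
        # (a small |n . v| means the line points toward v).
        resid = normals @ V                        # (N, 3) residuals
        w = np.exp(-0.5 * (resid / sigma) ** 2)    # Gaussian memberships
        w /= w.sum(axis=1, keepdims=True) + 1e-12
        # M-step: the v minimizing sum_i w_i (n_i . v)^2 is the eigenvector
        # of S = sum_i w_i n_i n_i^T with the smallest eigenvalue.
        cols = []
        for k in range(3):
            S = (normals * w[:, k:k + 1]).T @ normals
            _, eigvecs = np.linalg.eigh(S)         # ascending eigenvalues
            cols.append(eigvecs[:, 0])
        V = np.stack(cols, axis=1)
        # Manhattan-world constraint: snap the three directions to the
        # nearest orthonormal frame via SVD.
        U, _, Vt = np.linalg.svd(V)
        V = U @ Vt
    return V
```

The orientation angles then follow directly from the recovered rotation between the camera frame and the estimated Manhattan frame.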
We then fused, frame by frame, the orientation angle(s) obtained from video with the angles measured by the IMU using a Kalman filter, in order to remove gyroscope drift, obtain better orientation estimates, and enhance navigation. In the Kalman filter framework, we use the angle estimate from video as the measurement input and the angular velocity from the gyroscope as the input to the prediction model. This fusion also enhances the angle estimation from video when proper edge lines cannot be detected in the frames. In the EM method for orientation estimation from video, we provide the updated angles as the initial point for the next video frame, which helps the algorithm keep track of orientation changes with respect to the initial room coordinates. Since the IMU measurements cannot give relative orientation with respect to the scene coordinates, we use the angles obtained from the first video frame as the starting orientation.
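A minimal one-dimensional version of this fusion can be sketched as below. It is an illustration under simplifying assumptions (scalar angle state, constant noise variances); the class name and the values of the process/measurement noise parameters q and r are ours, not the paper's tuned values.

```python
class AngleKalman:
    """1-D Kalman filter: gyroscope rate drives the prediction,
    the drift-free vision angle drives the correction."""

    def __init__(self, angle0, q=1e-4, r=1e-2):
        self.x = angle0   # fused angle estimate (rad)
        self.p = 1.0      # estimate variance
        self.q = q        # process noise (gyro integration step)
        self.r = r        # measurement noise (vision angle)

    def predict(self, gyro_rate, dt):
        # Integrate the gyroscope angular velocity; uncertainty grows.
        self.x += gyro_rate * dt
        self.p += self.q
        return self.x

    def update(self, vision_angle):
        # Correct the drifting gyro estimate with the vision measurement.
        k = self.p / (self.p + self.r)      # Kalman gain
        self.x += k * (vision_angle - self.x)
        self.p *= (1.0 - k)
        return self.x
```

With a biased gyroscope, prediction alone drifts linearly in time, while the vision updates keep the fused angle bounded near the true value, which is exactly the complementarity described above.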
We then used the labeling of straight lines into the three principal directions (another output of the EM algorithm) to detect candidate ground lines in the image sequence, preferably at the corners of the scene, and to infer depth information (distance to the walls). The specific structure of indoor scenes allows us to detect proper ground lines (baselines) and depth lines using perspective properties, and to select a reference view line, called the horizon line, such that we can extract the depth, width, and height of the actual indoor 3D world given the camera height and focal length. Once the depth is known, we can use the camera equations to obtain the real-world coordinates of an image point detected at that depth. Once proper baseline and depth lines are selected, they can be used together with the vanishing points to detect ground planes/walls (and their dimensions) for 3D modeling of the scene.
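Under a pinhole model with a level camera, the depth recovery for a ground point reduces to similar triangles: a ground pixel appearing d pixels below the horizon line lies at depth Z = f·h/d, where f is the focal length in pixels and h is the camera height above the floor. The helper names below are illustrative; this is a sketch of the geometry, not the paper's implementation.

```python
def ground_depth(v_pixel, v_horizon, focal_px, cam_height_m):
    """Depth (m) of a ground point from its pixel row below the horizon."""
    d = v_pixel - v_horizon  # pixels below the horizon line
    if d <= 0:
        raise ValueError("point lies on or above the horizon; not on the ground")
    return focal_px * cam_height_m / d

def ground_point_3d(u_pixel, u_center, v_pixel, v_horizon, focal_px, cam_height_m):
    """Real-world lateral offset X and depth Z of a ground point,
    again by similar triangles from the pinhole camera equations."""
    z = ground_depth(v_pixel, v_horizon, focal_px, cam_height_m)
    x = (u_pixel - u_center) * z / focal_px
    return x, z
```

For example, with f = 1000 px and camera height 1.5 m, a ground pixel 150 px below the horizon lies 10 m away; hallway width then follows from the X-offsets of the two wall baselines at the same depth.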
Finally, we used the accelerometer data to estimate the person's displacement as they move through the environment. This is done by counting the number of steps the person takes, via peak detection in the accelerometer signal, and using their average gait measurements. It is worth noting that double integration of the accelerometer data (accounting for device orientation), although a straightforward method for displacement calculation, results in highly inaccurate or even wrong displacement measurements: the accelerometer gives unreliable values during slow, monotonous movements with small accelerations, such as walking, and the displacement estimate therefore drifts significantly over time due to large accumulated error.
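The step-counting approach can be sketched as below. The smoothing window, peak threshold, minimum inter-step gap, and average stride length are placeholder values for illustration, not the paper's tuned parameters.

```python
import numpy as np

def count_steps(acc_mag, fs=30.0, thresh=1.0, min_gap_s=0.3):
    """Count steps by peak detection in the accelerometer magnitude.

    acc_mag: 1-D array of accelerometer magnitude with gravity removed (m/s^2).
    fs: sampling rate (Hz); thresh: minimum peak height;
    min_gap_s: minimum time between consecutive steps.
    """
    # Moving-average smoothing (~0.15 s) to suppress high-frequency jitter.
    w = max(1, int(0.15 * fs))
    sm = np.convolve(acc_mag, np.ones(w) / w, mode="same")
    min_gap = int(min_gap_s * fs)
    steps, last = 0, -min_gap
    for i in range(1, len(sm) - 1):
        is_peak = sm[i] > thresh and sm[i] >= sm[i - 1] and sm[i] > sm[i + 1]
        if is_peak and i - last >= min_gap:
            steps += 1
            last = i
    return steps

def displacement(steps, stride_m=0.7):
    # Displacement from the step count and an average stride length.
    return steps * stride_m
```

This avoids the double-integration drift entirely: the error is bounded by the step-detection accuracy and the stride-length calibration, rather than growing quadratically in time.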
We evaluated the performance of our video-based algorithms for relative orientation estimation on videos recorded from various indoor scenes with an iPhone 7 camera at 30 fps. These videos are from a rotary hallway, in which the person completes a full 360-degree turn back to the starting line, and from a bedroom, a crowded laboratory, and a wide indoor scene, in which we rotated the camera in place. The results showed the capability and robustness of our approach for reliable orientation estimation from video frames in all scenes.
For our vision-inertial data fusion purposes, we developed an iPhone application to collect video and IMU data synchronously at a user-defined frequency. We used our iOS app to collect video-IMU data at 30 Hz from a rotary hallway scene as we completed a full lap. We applied our proposed framework to process this IMU-augmented video in order to obtain orientation angles, rotations, hallways' depth and width, and displacement in real time. Experimental results showed that the estimated hallway depths and widths are consistent with their tape measurements, and we obtained a closed-path map of the rotary hallway over a roughly 80-meter lap.