Wide Baseline Matching for Autonomous Approaches of MAVs
Karsten Mueller, Ruben Kleis, Institute of Systems Optimization (ITE), Karlsruhe Institute of Technology (KIT), Germany; Gert F. Trommer, ITE/KIT, Germany & ITMO University, Russia
The small size and maneuverability of micro aerial vehicles (MAVs) allow for the exploration of buildings, for example after an earthquake or fire. However, for an autonomous approach, the MAV needs to find an entry into the building. This work deals with the scenario in which an operator marks a window in a picture as the entry to the building. Since the perspective of this reference picture differs from that of the MAV's live images, a matching algorithm is needed to redetect the window.
The objective of our approach is to present an algorithm that provides accurate results for the problem of ultra wide baseline matching. At the same time, the algorithm must have a low processing time to be applicable in real time. One of the drawbacks of widely used keypoint descriptors such as SIFT, SURF and ORB is their limited invariance to out-of-plane perspective transformations: while these descriptors are invariant to rotation and scale, they provide limited accuracy in ultra wide baseline viewing conditions.
The algorithm presented in this paper is based on ORB keypoint matching because of its low processing time. However, the improvements described in this work can also be applied to other methods such as SIFT or SURF. First, keypoints and their descriptors are extracted from the reference image taken by the operator. To improve robustness and reliability, additional images are also used for keypoint extraction: the first additional image is the reference image with added Gaussian blur. The other additional images result from image transformations that simulate camera movements to the left and to the right. By using four reference images instead of only the original reference image, four sets of keypoints are extracted that are more robust to changes in perspective and image quality. As a result, higher accuracy is achieved in wide baseline scenarios. The generation of the additional reference images and the keypoint extraction can be performed offline, since the reference image is available before the MAV's mission starts.
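As a rough illustration of how the synthetic left and right views could be generated, the following sketch builds the homography induced by a pure camera rotation about the vertical axis, H = K R K^-1. This is an assumption: the abstract does not state which image transformation is used, and the function names, the pinhole parameters (f, cx, cy), the default angle and the sign convention for "left" and "right" are all hypothetical.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def yaw_homography(f, cx, cy, yaw_deg):
    """Homography H = K * R * K^-1 for a camera rotated about its vertical axis.

    f, cx, cy: hypothetical pinhole focal length and principal point in pixels.
    """
    t = math.radians(yaw_deg)
    K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
    K_inv = [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]
    R = [[math.cos(t), 0.0, math.sin(t)],
         [0.0, 1.0, 0.0],
         [-math.sin(t), 0.0, math.cos(t)]]
    return matmul3(matmul3(K, R), K_inv)

def warp_point(H, x, y):
    """Apply a homography to a pixel coordinate (with perspective division)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def synthetic_view_homographies(f, cx, cy, yaw_deg=20.0):
    """One homography per synthetic view; the angle is an assumed value."""
    return {
        "left": yaw_homography(f, cx, cy, -yaw_deg),
        "right": yaw_homography(f, cx, cy, yaw_deg),
    }
```

In practice, each matrix would be passed to an image warping routine (for example OpenCV's warpPerspective) to render the additional reference image before keypoints are extracted from it.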
The second step is to match the descriptors of the reference images with the current live image of the MAV. For this, an advanced filtering scheme is used to select the best matches. Based on this process, the homography matrices are estimated using a RANSAC algorithm. For determining the best homography matrix, the cross-correlation of the areas around the matched feature points is computed. Additionally, a color descriptor is used: The reference point in the window chosen by the operator is projected into the live image using the computed homography matrices. Color descriptors are calculated around the reference point and the projected points in the live image. Choosing the homography matrix that results in the smallest color descriptor dissimilarity between the two regions provides an accurate and robust selection of the correct homography matrix.
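The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the descriptor used in the paper: the color descriptor is taken to be a coarse, normalized RGB histogram of a square patch, the dissimilarity is an L1 distance, and all function names, the patch radius and the bin count are hypothetical. Images are plain nested lists of (r, g, b) tuples to keep the example self-contained.

```python
def project(H, x, y):
    """Project a point with homography H; returns None if the point maps to infinity."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return None
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def color_histogram(image, cx, cy, radius=2, bins=4):
    """Normalized coarse RGB histogram of the square patch around (cx, cy)."""
    height, width = len(image), len(image[0])
    hist = [0.0] * (bins ** 3)
    count = 0
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            r, g, b = image[y][x]
            idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins + (b * bins // 256)
            hist[idx] += 1.0
            count += 1
    return [v / count for v in hist] if count else hist

def dissimilarity(h1, h2):
    """L1 distance between two histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def select_homography(ref_image, live_image, ref_point, candidates):
    """Return (index, score) of the candidate with the smallest color dissimilarity."""
    rx, ry = ref_point
    ref_hist = color_histogram(ref_image, rx, ry)
    best_index, best_score = None, float("inf")
    for i, H in enumerate(candidates):
        projected = project(H, rx, ry)
        if projected is None:
            continue
        px, py = int(round(projected[0])), int(round(projected[1]))
        # Skip candidates that project the reference point outside the live image.
        if not (0 <= py < len(live_image) and 0 <= px < len(live_image[0])):
            continue
        score = dissimilarity(ref_hist, color_histogram(live_image, px, py))
        if score < best_score:
            best_index, best_score = i, score
    return best_index, best_score
```

A candidate that projects the operator's reference point onto a region with a very different color distribution receives a high score and is rejected, even if it was supported by many RANSAC inliers.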
For video sequences of an autonomous window approach, the algorithm described above is extended by tracking features between consecutive images. To avoid divergence, the feature tracker is reinitialized each time the homography estimation achieves a high rating. Moreover, to detect windows reliably, a rectangle detector is implemented and integrated into the rating.
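The interplay between detection and tracking can be sketched as a small control loop. This is an illustration only: the class name, the callback signatures, the rating scale and the threshold value are assumptions, and the abstract does not specify how the rating is computed or which tracker is used.

```python
class DetectThenTrack:
    """Reinitialize the tracker from the detector whenever the rating is high.

    detect(frame) -> (rating, estimate) and track(estimate, frame) -> estimate
    are hypothetical callbacks standing in for the homography-based detector
    and the frame-to-frame feature tracker.
    """

    def __init__(self, detect, track, reinit_threshold=0.8):
        self.detect = detect
        self.track = track
        self.reinit_threshold = reinit_threshold  # assumed value
        self.estimate = None

    def process(self, frame):
        rating, detected = self.detect(frame)
        if rating >= self.reinit_threshold:
            # Trustworthy detection: reset the tracker to avoid drift.
            self.estimate = detected
            return self.estimate, "detection"
        if self.estimate is not None:
            # Low rating: propagate the last estimate by tracking.
            self.estimate = self.track(self.estimate, frame)
            return self.estimate, "tracking"
        # No reliable detection has been seen yet.
        return None, "none"
```

Tracking bridges the frames in which matching is unreliable, while each high-rated detection bounds the error that the tracker can accumulate.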
An extensive evaluation of the algorithm is presented in the paper. This includes a comparison with standard keypoint descriptors such as SIFT, SURF and ORB. Moreover, the improvements resulting from the different parts of the algorithm are highlighted. The dataset used for evaluation is the Zurich Building Database, which contains images of 200 buildings. For each building, five images taken from different viewpoints are available. In addition to the differences in perspective, illumination conditions vary, and occlusions, for example by trees, occur in some pictures. Furthermore, a video database containing more than 2000 images from different places in Karlsruhe is used for evaluation. Due to large variations in scale and a lack of structure, the video dataset poses additional challenges beyond viewpoint variation. Using these videos, the performance is analyzed with respect to the angle from which the window is approached.
The results show a significant improvement and an accurate detection of the corresponding window in both databases while achieving the goal of a short computing time. In the challenging video database, the percentage of successful detections is raised from 30% for standard ORB to over 80% using our algorithm. It is shown that a reliable measure for the accuracy of the result can be found. Using this additional criterion, an input to an intelligent guidance system for the MAV can be developed. The tradeoff between availability and high accuracy is discussed. Moreover, it is demonstrated that the algorithm provides significantly improved robustness for window approaches from wide angles. Furthermore, the real-time capability of the approach is shown: while SIFT and SURF have a maximum processing time of more than 1.5 seconds per image, the algorithm presented in this paper never exceeds a processing time of 200 milliseconds.
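The tradeoff between availability and accuracy can be made concrete with a simple threshold on the reliability measure. The sketch below is an assumption about how such an analysis might be set up: the function name, the (rating, correct) input format and the thresholding rule are hypothetical and are not taken from the paper.

```python
def availability_and_accuracy(results, threshold):
    """Evaluate a reliability threshold on a list of (rating, correct) pairs.

    availability: fraction of frames whose rating passes the threshold.
    accuracy: fraction of correct detections among the accepted frames,
              or None if no frame is accepted.
    """
    accepted = [correct for rating, correct in results if rating >= threshold]
    availability = len(accepted) / len(results) if results else 0.0
    accuracy = sum(accepted) / len(accepted) if accepted else None
    return availability, accuracy
```

Raising the threshold discards more frames, which lowers availability, but the frames that remain are more likely to be correct; the guidance system can then act only on accepted results.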
The algorithm presented in this paper is an accurate and reliable method for redetecting a window from a reference image in a live image. By improving the ORB feature matching algorithm, robust detection becomes possible even in wide baseline scenarios. With the presented reliability criterion, a reliable input to the guidance of the MAV is available for autonomous approaches to buildings. Moreover, our approach is versatile: it can easily be adapted to detect and track other features, such as doors or simply distinctive parts of a wall, instead of windows. Compared to other state-of-the-art wide baseline matching algorithms, our algorithm has the advantage of being applicable in real time while showing high accuracy in the evaluation on extensive and demanding datasets.