
Session C4: Vision-based Navigation Systems

Landmark Selection and Recognition with Hopfield Attractor Networks
Kyle Volle, NRC & University of Florida; Prashant Ganesh, University of Florida; Kevin Brink, Air Force Research Lab
Location: Atrium Ballroom
Alternate Number 3

Loop closure detection is an important part of many simultaneous localization and mapping (SLAM) applications. If an autonomous agent can recognize when it revisits a location, it can correct errors in its position estimate and use maximum a posteriori (MAP) methods to rectify its map. There is an inherent tension in loop closure detection: false positives can be difficult for a system to recover from, which argues for a high acceptance threshold for a match, but loop closures are rare relative to the overall number of positions, so false negatives are costly missed opportunities. With a Hopfield network, the prior position that most closely matches the current position can be recalled and then either accepted or rejected, reducing the risk of false positives by eliminating spurious pairwise comparisons.
Hopfield networks are a type of artificial neural network that provides content-addressable memory. A Hopfield network trained to recognize certain patterns will make the correct association even when given distorted or partial input patterns, which makes it a robust pattern matcher. Hopfield networks also have the added benefit of a built-in heuristic metric for how suitable a given input is as a stable and identifiable learned pattern. Because of these features, Hopfield networks are well suited to selecting visual landmarks for navigation and recognizing them on subsequent encounters.
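As an illustration of this recall behavior, the following is a minimal sketch of a binary Hopfield network with Hebbian learning, asynchronous updates, and an energy function that can serve as the stability heuristic mentioned above; the class, parameter names, and toy data are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

class HopfieldNetwork:
    """Minimal binary Hopfield network sketch (for illustration only).

    Patterns are vectors of +/-1; weights are set with the Hebbian rule and
    recall runs asynchronous updates until a fixed point (attractor) is reached.
    """

    def __init__(self, num_nodes):
        self.n = num_nodes
        self.W = np.zeros((num_nodes, num_nodes))

    def store(self, patterns):
        # Hebbian outer-product rule; zero the diagonal so no node drives itself.
        for p in patterns:
            self.W += np.outer(p, p)
        np.fill_diagonal(self.W, 0.0)
        self.W /= self.n

    def energy(self, state):
        # Lower energy indicates a more stable (better matching) state.
        return -0.5 * state @ self.W @ state

    def recall(self, probe, max_sweeps=20):
        state = probe.copy()
        for _ in range(max_sweeps):
            changed = False
            for i in np.random.permutation(self.n):   # asynchronous updates
                new_val = 1 if self.W[i] @ state >= 0 else -1
                if new_val != state[i]:
                    state[i] = new_val
                    changed = True
            if not changed:                            # fixed point reached
                break
        return state


# Toy usage: store two random patterns and recall one from a corrupted copy.
rng = np.random.default_rng(0)
patterns = [rng.choice([-1, 1], size=64) for _ in range(2)]
net = HopfieldNetwork(64)
net.store(patterns)
noisy = patterns[0].copy()
noisy[:8] *= -1                                        # flip 8 of the 64 bits
recovered = net.recall(noisy)
print(np.array_equal(recovered, patterns[0]))          # usually True
```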
This work builds on previous work by the authors in which a Siamese Convolutional Neural Network (SCNN) learns a mapping to a high-dimensional vector space that groups similar input images together and separates disparate ones. This mapping is used for image correspondence detection to recognize loop closures in a SLAM application. The output vectors of the learned mapping serve as the learned patterns of the Hopfield network, so when a similar output vector is encountered again, the Hopfield network can associate it with the learned landmark. Because the Hopfield network has a stable basin of attraction for each learned pattern, this approach is robust to variations in viewing angle, lighting, partial occlusion, et cetera.
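A minimal sketch of such a Siamese embedding, using a standard contrastive loss to pull matching pairs together and push disparate pairs apart, could look like the following; the layer sizes, names, and training step are hypothetical stand-ins rather than the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Hypothetical fully connected head mapping CNN features to an embedding."""
    def __init__(self, feat_dim=2048, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)        # unit-norm embeddings


def contrastive_loss(za, zb, same, margin=1.0):
    """Contrastive loss: pull matching pairs together, push others apart."""
    d = F.pairwise_distance(za, zb)
    return torch.mean(same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2))


# Toy training step on random "feature vectors" standing in for CNN outputs.
head = EmbeddingHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
fa, fb = torch.randn(8, 2048), torch.randn(8, 2048)
same = torch.randint(0, 2, (8,)).float()               # 1 = same place, 0 = different
loss = contrastive_loss(head(fa), head(fb), same)
opt.zero_grad(); loss.backward(); opt.step()
```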
Traditionally, an SCNN returns a Boolean match decision based on whether a metric on the output vectors, such as the L2 distance or cosine similarity, meets a certain threshold. With a Hopfield network, in contrast, the landmark corresponding to the most similar learned pattern is returned, and the difference between the input vector and the recalled vector can be used to reject spurious matches.
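The two decision rules can be contrasted roughly as follows, reusing the HopfieldNetwork sketch above; the binarization step, thresholds, and landmark lookup are hypothetical.

```python
import numpy as np

def binarize(z):
    """Map a real-valued embedding to a +/-1 pattern for the Hopfield network."""
    return np.where(z >= 0, 1, -1)

def scnn_match(z_query, z_prior, threshold=0.5):
    """Traditional SCNN decision: accept if the L2 distance is below a threshold."""
    return np.linalg.norm(z_query - z_prior) < threshold

def hopfield_match(z_query, hopfield_net, landmark_of_pattern, reject_frac=0.25):
    """Hopfield decision (sketch): recall the nearest stored pattern, then accept
    only if the recalled attractor stays close to the probe."""
    probe = binarize(z_query)
    recalled = hopfield_net.recall(probe)
    # The fraction of disagreeing bits between probe and attractor gates acceptance.
    if np.mean(probe != recalled) > reject_frac:
        return None                                      # spurious match rejected
    return landmark_of_pattern.get(recalled.tobytes())   # hypothetical pattern -> landmark map
```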
In this work, RGB images are fed into a convolutional neural network (CNN)-based feature extractor, such as the one used in ResNet-50, ResNet-101, or YOLO. The output of this extractor is a representation of the image that has proved meaningful for other tasks. This feature vector is fed into a fully connected neural network trained to find a representation that minimizes the difference between matching images and maximizes the difference between disparate ones. These output vectors are the patterns the Hopfield network learns as representations of individual landmarks.
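A sketch of this pipeline, reusing the EmbeddingHead and HopfieldNetwork classes from the earlier sketches and a torchvision ResNet-50 backbone as the feature extractor, might look like the following; the image size, embedding dimension, and binarization step are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Feature extractor: ResNet-50 with its classification layer removed, so the
# output is the 2048-dimensional feature vector (pretrained weights could be loaded here).
backbone = models.resnet50()
backbone.fc = nn.Identity()
backbone.eval()

embedding_head = EmbeddingHead(feat_dim=2048, embed_dim=128)   # from the earlier sketch
embedding_head.eval()

with torch.no_grad():
    image = torch.rand(1, 3, 224, 224)        # stand-in for an RGB camera frame
    features = backbone(image)                 # (1, 2048) CNN feature vector
    embedding = embedding_head(features)       # (1, 128) landmark embedding

# Binarize the embedding so it can be stored as a +/-1 pattern in the Hopfield network.
pattern = torch.where(embedding >= 0,
                      torch.ones_like(embedding),
                      -torch.ones_like(embedding)).squeeze(0).numpy()

hopfield = HopfieldNetwork(num_nodes=pattern.size)
hopfield.store([pattern])
```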
Additionally, Hopfield networks are relatively compact and fast. The risk of spurious matches has been shown to be negligible at a ratio of approximately 8 network nodes for every learned pattern. Recall requires only additions and multiplications, which should allow for fast performance.
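As a back-of-the-envelope check, the roughly 8-nodes-per-pattern figure is consistent with the classical Hopfield capacity estimate of about 0.138 reliably storable patterns per node; the node count below is a hypothetical example.

```python
# Classical Hopfield capacity estimate: ~0.138 reliably storable patterns per node,
# i.e. roughly 1 / 0.138 ≈ 7.2 nodes per stored pattern.
num_nodes = 128                         # e.g. a 128-bit binarized landmark embedding
max_patterns = int(0.138 * num_nodes)   # ≈ 17 landmarks storable with low spurious-recall risk
print(max_patterns)                     # 17
```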
This approach will be evaluated by comparing precision-recall curves of the correspondences detected and accepted by the Hopfield network against those of the SCNN with more traditional metrics. In addition to detection performance, the computational space and time complexities of the two approaches will be compared. These comparisons will be performed on both simulated and real-world image sets.
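A sketch of how such a precision-recall comparison might be computed, with randomly generated stand-in labels and scores in place of the actual image sets:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)    # ground-truth correspondence labels (stand-in)
scnn_scores = rng.random(200)            # e.g. negated L2 distance, rescaled to [0, 1]
hopfield_scores = rng.random(200)        # e.g. similarity between probe and recalled attractor

for name, scores in [("SCNN threshold", scnn_scores), ("Hopfield recall", hopfield_scores)]:
    precision, recall, _ = precision_recall_curve(y_true, scores)  # points of the PR curve
    print(name, "average precision:", average_precision_score(y_true, scores))
```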


