Exploiting a Prior 3D Map to Improve the Accuracy of CNN-Based Object Recognition
Siddarth Kaki and Todd Humphreys, The University of Texas at Austin
A technique is presented for improving the accuracy with which a convolutional neural network (CNN) recognizes new objects in a scene by exploiting a prior 3D map of the scene. The aim of this work is to reduce object recognition errors in cluttered environments by isolating new objects from the known background. Such isolation enables a CNN trained to recognize an enumerated set of objects to focus narrowly on the portions of images that contain new objects instead of having to process the whole scene. As a result, changes in a prior map can be quickly detected and semantically labeled, allowing confident navigation within an ever-evolving, cluttered environment.
Current techniques for object recognition in cluttered environments employ semantic instance segmentation to first segment an image into unidentified objects and then pass the resulting segments to a CNN for object recognition, rather than attempting to classify the image as a whole. These techniques can achieve high recognition accuracy for some scenes but can fail catastrophically for others, especially scenes of cluttered environments. By exploiting a highly accurate prior 3D point cloud map of the area, background clutter can be distinguished from new foreground objects, improving recognition accuracy for the latter.
To identify changes between the real world and the prior map, this paper’s proposed technique employs feature correlation and machine-learning-based object recognition. Although the accuracy of machine-learning-based object recognition has improved vastly over the past few years, such systems still struggle to accurately parse cluttered or noisy scenes. This paper presents a smart, automated cropping technique to improve recognition accuracy: 1) a micro aerial vehicle (MAV) is flown through a previously mapped environment, 2) the MAV takes new images, and feature descriptors are extracted from the images, 3) the MAV is projected into a prior 3D point cloud map of the environment, 4) a "virtual image" is taken from the same vantage point as the actual MAV’s camera, and feature descriptors are extracted from the virtual image, 5) feature descriptors from the real image and the virtual image are correlated to identify features that are missing from or have been added to the environment, 6) the real and virtual images are cropped down to only what has changed, 7) the cropped images are parsed by a CNN trained to detect and recognize objects, which are then semantically labeled accordingly, and 8) the positions of the identified objects are determined with respect to the prior map.
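Steps 5 and 6 can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes feature descriptors are available as NumPy arrays, swaps in a nearest-neighbor ratio test as the correlation step, and treats real-image features without a confident match in the virtual image as candidate changes. The function names `unmatched_features` and `crop_box`, the `ratio` and `margin` parameters, and the toy data are all hypothetical.

```python
import numpy as np

def unmatched_features(real_desc, virt_desc, ratio=0.8):
    """Boolean mask over real-image descriptors that have no confident
    match among the virtual-image descriptors (needs >= 2 virtual rows)."""
    # Pairwise Euclidean distances, shape (n_real, n_virtual).
    dists = np.linalg.norm(real_desc[:, None, :] - virt_desc[None, :, :], axis=2)
    nearest_two = np.sort(dists, axis=1)[:, :2]
    best, second = nearest_two[:, 0], nearest_two[:, 1]
    # Ratio test (an assumption, not taken from the paper): accept a match
    # only if the best candidate is clearly closer than the runner-up.
    matched = best < ratio * second
    return ~matched

def crop_box(keypoints_xy, mask, margin, image_shape):
    """Bounding box (x0, y0, x1, y1) around the masked keypoints,
    padded by `margin` pixels and clipped to the image; None if empty."""
    pts = keypoints_xy[mask]
    if len(pts) == 0:
        return None
    height, width = image_shape[:2]
    x0, y0 = np.floor(pts.min(axis=0)).astype(int) - margin
    x1, y1 = np.ceil(pts.max(axis=0)).astype(int) + margin
    return (int(max(x0, 0)), int(max(y0, 0)),
            int(min(x1, width)), int(min(y1, height)))

# Toy data: four map features; the real image re-observes three of them
# and also contains two features that belong to a new object.
virt_desc = 10.0 * np.eye(4)
real_desc = np.vstack([virt_desc[:3],
                       [5.0, 5.0, 5.0, 5.0],
                       [6.0, 6.0, 6.0, 6.0]])
keypoints = np.array([[10, 10], [20, 20], [30, 30],
                      [100, 120], [110, 140]], dtype=float)

new_mask = unmatched_features(real_desc, virt_desc)
box = crop_box(keypoints, new_mask, margin=5, image_shape=(200, 300))
```

A single bounding box is the simplest choice; if several new objects appear in one image, the unmatched keypoints would presumably need to be clustered first so that each object receives its own crop.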
The cropping process reduces the clutter in an image to improve recognition accuracy. Cropping fundamentally improves recognition accuracy because neural-network-based classification inherently includes a softmax classifier as the last layer of the network. By definition, a softmax layer distributes probability among the finite list of classes on which the network was trained such that the probabilities sum to unity. Thus, cropping away clutter removes evidence for competing classes, so the probability assigned to the most likely classification of the object in the image increases. As described earlier, modern CNN architectures employ semantic instance segmentation to address this problem. However, early results have shown that cropping before classification still performs better than segmentation alone in many scenarios. Additionally, cropping and segmentation may work in tandem: an image might first be cropped based on a coarse analysis and then be segmented before classification.
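The softmax argument can be made concrete with a small numerical sketch. The logits below are invented for illustration and are not results from the paper; the sketch assumes that cropping lowers the scores of the clutter classes while leaving the score of the target class unchanged.

```python
import numpy as np

def softmax(logits):
    """Map raw class scores to probabilities that sum to one."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical logits: index 0 is the new object, the rest are clutter classes.
full_scene = softmax([2.0, 1.5, 1.0, 0.5])    # clutter competes for probability
cropped = softmax([2.0, 0.0, -0.5, -1.0])     # same target score, weaker clutter

print(f"target probability, full image: {full_scene[0]:.3f}")
print(f"target probability, cropped:    {cropped[0]:.3f}")
```

Because the outputs must sum to one, any probability no longer claimed by clutter classes is redistributed, and under the stated assumption the target class receives a larger share.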
The general class of CNNs is chosen for object detection and recognition. CNNs have proven to be the most apt class of machine learning architectures for image classification. Several CNN architectures (such as MobileNet and Inception-v3) are considered, comparing performance metrics such as mean average precision (mAP), memory and computing footprint, and real-time performance. While Inception-v3 is the state of the art in terms of minimal error rate, the architecture requires significant computing resources, rendering real-time operation on a mobile platform infeasible. Though not as accurate as Inception-v3, the MobileNet class of architectures is designed to run in real time on a smartphone, making MobileNet ideal for real-time processing on a MAV. Coupled with smart cropping, the recognition accuracy of a MobileNet-based CNN parsing a cropped image approaches, and sometimes surpasses, that of an Inception-v3-based CNN parsing the full, original image.
For outdoor mapping purposes, a variety of objects may be encountered, ranging from cars to trees to benches. The CNN must be able to correctly identify as wide a range of objects as possible. However, training on large datasets is expensive in terms of cost, time, and computing resources. As such, the CNNs chosen are pre-trained on large general image datasets such as ImageNet and COCO. The pre-trained CNNs are partly re-trained by transfer learning, which allows a CNN trained on one dataset to be easily extended to recognize objects it was not initially trained on, without re-training the entire network from scratch.
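The idea of partial re-training can be sketched in a few lines. This is a toy stand-in, not the paper's setup: a fixed random projection with a ReLU plays the role of the frozen pre-trained layers, and only a new softmax head is trained by gradient descent. All names, sizes, and data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen, pre-trained layers (hypothetical weights).
W_backbone = rng.normal(size=(8, 16))
backbone_before = W_backbone.copy()

def features(x):
    """Frozen feature extractor: linear layer + ReLU, L2-normalized rows."""
    f = np.maximum(x @ W_backbone, 0.0)
    return f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(x, labels, n_classes, lr=0.1, steps=200):
    """Train only a new softmax head on top of the frozen features."""
    f = features(x)                      # computed once; backbone never updated
    W = np.zeros((f.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    losses = []
    for _ in range(steps):
        p = softmax_rows(f @ W + b)
        losses.append(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))
        grad = (p - onehot) / len(labels)   # gradient of cross-entropy w.r.t. logits
        W -= lr * (f.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b, losses

# Toy "new object" dataset with three classes the backbone never saw.
x = rng.normal(size=(60, 8))
labels = np.argmax(x[:, :3], axis=1)
W_head, b_head, losses = train_head(x, labels, n_classes=3)
```

Only `W_head` and `b_head` change during training, which is why this kind of re-training is far cheaper than training the entire network; with a real pre-trained CNN, the frozen layers would be the convolutional stack rather than a random projection.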
This paper will provide a thorough description of the system used, including the custom MAV and processing software. It will compare its results in cluttered scenes against the results of standard pipelines for segmentation and recognition.
A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy, “Semantic instance segmentation via deep metric learning,” CoRR, vol. abs/1703.10277, 2017. arXiv: 1703.10277.
J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” arXiv preprint arXiv:1611.10012, 2016.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., Curran Associates, Inc., 2012, pp. 1097–1105.
A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. arXiv: 1704.04861.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015. arXiv: 1512.00567.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 248–255.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision, Springer, 2014, pp. 740–755.
S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.