Integrating Low-Resolution Surveillance Camera and Smartphone Inertial Sensors for Indoor Positioning
Jiuxin Zhang and Pingqiang Zhou, ShanghaiTech University, Shanghai, China
Objective and related work
Accurate indoor positioning plays a key role in location-based services (LBS), especially indoor map navigation. Despite the strong demand for LBS, indoor positioning remains a grand challenge, and active research is needed to find a solution that is accurate yet cheap and easy to use.
In an indoor environment, buildings shield satellite signals, so positioning techniques based on global navigation satellite systems (GNSS) such as GPS perform poorly. For ease of use, additional equipment for indoor positioning should be minimized, so in most cases the smartphone is the best choice for a positioning service. However, no existing system using unmodified smartphones is accurate enough for LBS applications.
--Radio Frequency (RF)-based approaches have been widely used for indoor positioning. WiFi fingerprinting takes full advantage of existing WiFi routers. BLE and iBeacon positioning systems require additional Bluetooth equipment to achieve sufficient beacon density. Although RF-based systems are easy to set up, their usefulness for applications such as retail navigation and shelf-level advertising is limited, because their accuracy is low (errors on the order of meters) and they provide no orientation information.
--Smartphone Pedestrian Dead Reckoning (PDR) approaches need no extra equipment. They use the inertial sensors in the smartphone to collect acceleration and orientation data, and then calculate the relative displacement by integrating the acceleration twice. Once the initial position is specified, the current position in earth coordinates can be reckoned. However, errors accumulate rapidly over time, and the system relies on some other technique to set the initial position.
--Visible Light Positioning (VLP) systems use smartphone cameras together with additional modulated LEDs. In such a system, the smartphone takes pictures of several LEDs and calculates its position relative to them. Because visible light propagates in straight lines, the errors can be reduced significantly. However, such a system is not easy to use, because users must hold the phone up and take pictures of the LEDs on the ceiling. Another problem is that the system needs densely distributed modified LEDs, which may lead to high cost and is not suitable for all buildings.
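The drift problem of PDR noted above can be illustrated with a small numerical sketch. It assumes a phone that is actually at rest but whose accelerometer reports a small constant bias; the bias value (0.05 m/s^2), the 100 Hz sampling rate and the function name `dead_reckon` are illustrative assumptions, not values from this work.

```python
def dead_reckon(accels, dt, x0=0.0, v0=0.0):
    """Integrate acceleration twice (simple Euler steps) along one axis."""
    x, v = x0, v0
    positions = []
    for a in accels:
        v += a * dt   # first integral: acceleration -> velocity
        x += v * dt   # second integral: velocity -> position
        positions.append(x)
    return positions


# Hypothetical scenario: the true acceleration is zero, so every meter of
# reported displacement is pure error caused by the assumed sensor bias.
BIAS = 0.05   # m/s^2, assumed constant accelerometer bias
DT = 0.01     # s, assumed 100 Hz sampling
drift = dead_reckon([BIAS] * 6000, DT)   # 60 seconds of samples

for seconds in (10, 30, 60):
    print(f"after {seconds:2d} s: drift = {drift[seconds * 100 - 1]:.2f} m")
```

Under these assumptions the position error grows roughly as 0.5 * bias * t^2, that is, quadratically in time, which is why PDR alone needs an external reference to reset its position.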
In this work, we propose a novel indoor positioning system. Our system uses surveillance cameras, which are already widely deployed in most buildings. Because a camera's line of sight is straight, the positioning error can be reduced to the centimeter level. Unlike fingerprinting-based methods, which demand frequent calibration, the surveillance cameras in our system need to be calibrated only once. Further, the smartphone inertial sensors used in PDR are applied in our system to collect the gait features of the users who need the positioning service, which keeps the solution user-friendly. Finally, to prevent errors or failures caused by obstructions in front of the pedestrians, we propose an object tracking algorithm based on a Convolutional Neural Network (CNN), which can track pedestrians that are partly occluded or temporarily fully occluded.
The proposed indoor positioning system consists of two parts.
--The first part is an object tracking system using a surveillance camera.
To mimic worst-case equipment conditions in the real world, we use a low-resolution (480p) camera. The camera is installed on the wall 3 meters above the floor and collects a video stream of the pedestrians walking in its field of view, which has a fixed angle. The camera sends the real-time video to a back-end server, where it is processed with our proposed object-tracking methodology. To eliminate intrinsic distortion, the camera is pre-calibrated, and the rotation matrix between the camera’s coordinates and real-world coordinates is calculated in advance using four pairs of points. The pedestrians are tracked and their positional information is extracted as follows.
--First, we use the Gaussian Mixture Model (GMM) method to extract the foreground mask that represents moving objects. Each pixel of the image holds several Gaussian models whose parameters are initialized offline but trained online. Pixels that do not match any Gaussian model are regarded as foreground. After foreground segmentation, the parameters of the Gaussian models are updated by an online Expectation Maximization (EM) algorithm to renew the background features.
--After foreground segmentation, we use a shadow suppression algorithm to obtain clear outlines of the pedestrians. For each pedestrian’s foreground mask, we draw a minimum bounding rectangle to calculate the height, centroid and angle of inclination. This information is used to estimate the ground projection of the pedestrian’s center of gravity, which approximates the position where the pedestrian stands in the video view. The actual position is then calculated from the position in the video view by a coordinate transformation using the rotation matrix calculated before. When a pedestrian is occluded by another object, a clean foreground image can no longer be extracted. In this case, we use a CNN-based object tracking algorithm to continue tracking. In our implementation, the CNN is based on MDNet, which has 3 convolutional layers and 3 fully connected (FC) layers. The initial input of the CNN is the bounding box of the target pedestrian’s foreground image in the last frame in which the pedestrian was not occluded. The CNN then reads the subsequent frames, tracks the pedestrian and updates its position. The parameters of the convolutional layers are pre-trained offline, while the parameters of the FC layers are updated online.
--Meanwhile, the system continuously extracts the gait features of the pedestrians in the video, including azimuth, walking state (walking or standing) and step frequency. The heading azimuth can be easily calculated from the path history. The other features are obtained from the normalized pixel count of the bottom half of each pedestrian’s foreground mask. This count varies periodically while a person walks, so it indicates the walking state and, through its period, the step frequency.
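The coordinate transformation step described above can be sketched as follows. The text calls the image-to-world mapping a rotation matrix computed from four pairs of points; one common way to realize such a mapping for points on a flat floor is a planar homography, which four point pairs determine exactly, and the sketch assumes that interpretation. The pixel and floor coordinates, the function names and the use of the bottom center of an upright bounding box as the foot point are all illustrative assumptions, simpler than the height, centroid and inclination estimate described in the text.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_homography(image_pts, floor_pts):
    """Fit the 3x3 plane-to-plane map (last entry fixed to 1) from 4 point pairs."""
    A, b = [], []
    for (u, v), (x, y) in zip(image_pts, floor_pts):
        A.append([u, v, 1, 0, 0, 0, -u * x, -v * x])
        b.append(x)
        A.append([0, 0, 0, u, v, 1, -u * y, -v * y])
        b.append(y)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def image_to_floor(H, u, v):
    """Map a pixel (u, v) lying on the floor to floor coordinates in meters."""
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    x = (H[0][0] * u + H[0][1] * v + H[0][2]) / w
    y = (H[1][0] * u + H[1][1] * v + H[1][2]) / w
    return x, y


def foot_point(bbox):
    """Bottom center of an upright bounding box given as (x, y, width, height)."""
    x, y, w, h = bbox
    return x + w / 2, y + h


# Hypothetical calibration: four floor markers seen in a 640x480 frame (pixels)
# and their surveyed positions on the floor (meters).
IMAGE_PTS = [(100, 400), (540, 400), (420, 150), (220, 150)]
FLOOR_PTS = [(0.0, 0.0), (10.0, 0.0), (10.0, 14.0), (0.0, 14.0)]
H = fit_homography(IMAGE_PTS, FLOOR_PTS)

# Hypothetical pedestrian bounding box taken from the foreground mask.
position = image_to_floor(H, *foot_point((300, 200, 40, 120)))
print("estimated floor position (m):", position)
```

The calibration is done once per camera; afterwards every tracked foot point is converted with a single call to `image_to_floor`.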
--The second part is a pedestrian identification system based on smartphone inertial sensors.
After the tracking process, we have the positional information and gait features of the moving pedestrians in the video view. Next, we need to identify the target pedestrian among the candidate pedestrians in the video view to determine the target's position. Face-recognition systems do not fit this situation, because real-life surveillance cameras usually have low resolution and monitor large areas, so the pedestrians’ faces in the video view are too blurred to be recognized. Moreover, a surveillance camera can capture a pedestrian's face only when the pedestrian is facing the camera. Therefore, in our work, we instead use the accelerometers and magnetometers in the smartphones to collect the walking features of the pedestrians, including azimuth, walking state and step frequency. These walking features are matched against the gait features extracted in the first part of our system to recognize the user of the smartphone.
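The matching between phone-side and camera-side gait features can be sketched as a nearest-match search. The text does not specify the matching rule, so the weighted cost below, its weights, and the names `Gait`, `mismatch` and `identify` are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gait:
    azimuth_deg: float   # heading in degrees, 0..360
    walking: bool        # True = walking, False = standing
    step_hz: float       # step frequency in Hz (0 when standing)


def heading_gap(a, b):
    """Smallest absolute angle between two headings, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def mismatch(phone, track, w_heading=1.0 / 180.0, w_step=1.0, w_state=2.0):
    """Hypothetical weighted cost; lower means the track fits the phone better."""
    cost = w_heading * heading_gap(phone.azimuth_deg, track.azimuth_deg)
    cost += w_step * abs(phone.step_hz - track.step_hz)
    if phone.walking != track.walking:
        cost += w_state
    return cost


def identify(phone, tracks):
    """Return the id of the camera track whose gait best matches the phone."""
    return min(tracks, key=lambda tid: mismatch(phone, tracks[tid]))


# Made-up features: one phone and three pedestrians tracked by the camera.
phone = Gait(azimuth_deg=350.0, walking=True, step_hz=1.8)
tracks = {
    "A": Gait(10.0, True, 1.75),   # similar heading (across the 0/360 wrap) and cadence
    "B": Gait(170.0, True, 1.8),   # same cadence, opposite direction
    "C": Gait(350.0, False, 0.0),  # same heading, but standing still
}
best = identify(phone, tracks)
print("phone matched to track", best)
```

The heading difference is wrapped so that 350 and 10 degrees count as 20 degrees apart rather than 340. In practice the features would likely be compared over a time window rather than at a single instant, since several pedestrians can share a heading momentarily.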
Our experimental environment is a large room measuring 10m × 14m. A single surveillance camera is installed on the wall. People walk around carrying smartphones, and the system identifies them and calculates their physical positions.
The error is defined as the L1-norm difference between the position reported by our indoor positioning system and the manually marked position in each frame. Our results show that, when a pedestrian is not occluded, the mean error is about 9 centimeters and the maximum error is 17 centimeters. Under occlusion, the mean and maximum errors increase to 14 and 29 centimeters, respectively.
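The error metric defined above can be computed as in the following sketch. The positions are made-up sample values in meters, not measurements from the experiment, and the function names are illustrative.

```python
def l1_error(reported, truth):
    """L1-norm distance between two 2-D positions: |dx| + |dy|."""
    return abs(reported[0] - truth[0]) + abs(reported[1] - truth[1])


def summarize(reported_track, truth_track):
    """Return (mean, max) of the per-frame L1 errors."""
    errors = [l1_error(r, t) for r, t in zip(reported_track, truth_track)]
    return sum(errors) / len(errors), max(errors)


# Made-up example: three frames of reported and manually marked positions (m).
reported = [(1.00, 2.00), (1.50, 2.10), (2.05, 2.20)]
marked = [(1.02, 2.01), (1.50, 2.00), (2.00, 2.25)]

mean_err, max_err = summarize(reported, marked)
print(f"mean error = {mean_err * 100:.1f} cm, max error = {max_err * 100:.1f} cm")
```

Note that the L1 norm sums the absolute errors along both axes, so it is never smaller than the straight-line (Euclidean) distance between the two positions.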
Conclusions and significance
In this work, we propose a novel indoor positioning system based on surveillance cameras. Compared with RF-based approaches, our system achieves much higher accuracy (at the centimeter level). It is also more user-friendly than VLP-based systems. The system can be deployed rapidly in any building covered by surveillance cameras and needs no extra equipment. Unlike WiFi or magnetic field fingerprinting, the proposed system does not need regular calibration. In summary, our work is a successful attempt to apply computer vision to indoor positioning.