Robot indoor navigation point cloud map generation algorithm based on visual sensing

Abstract: At present, low-cost Red Green Blue Depth (RGB-D) sensors are mainly used for indoor robot environment perception, but the depth information obtained by RGB-D cameras suffers from poor accuracy and high noise, and the generated 3D color point cloud map has low accuracy. To solve these problems, this article proposes a vision sensor-based point cloud map generation algorithm for robot indoor navigation. The aim is to obtain a more accurate point cloud map through a fusion algorithm that combines visual SLAM with Kalman-filter visual-inertial navigation attitude estimation. The results show that, in the positioning speed test of the fusion algorithm, the average camera-tracking time is 23.4 ms, which meets the processing speed requirement of 42 frames per second. The yaw angle error of the fusion algorithm is the smallest, and its ATE test values are smaller than those of the inertial measurement unit (IMU) and Simultaneous Localization and Mapping (SLAM) algorithms. The proposed algorithm makes the mapping process more stable and robust, enables more accurate route planning with visual sensors, and improves the indoor positioning accuracy of the robot. In addition, the algorithm can obtain a dense point cloud map in real time, providing a more comprehensive approach to research on robot indoor navigation point cloud map generation.


Introduction
Against the background of the continuous development and progress of science and technology, and with the continuous expansion of the application scenarios of indoor intelligent mobile robots, the requirements for robot technology are also constantly increasing. Therefore, robot multi-sensor fusion technology has gradually been favored by many researchers in related fields [1][2][3]. At present, the image point cloud generation of robot navigation and positioning mainly adopts two methods. In the field of computer vision, after camera calibration is completed, corresponding point pairs are established across multiple images and the three-dimensional coordinates of the scene are then reconstructed; this is three-dimensional reconstruction by Structure from Motion (SfM). The other method is Simultaneous Localization and Mapping (SLAM). Among them, SLAM of robots in unknown environments is the key research point in the field of robot navigation and positioning, and it is also the prerequisite for robots to complete instructions [4]. However, for robots that need to work in more complex environments, traditional single-sensor visual SLAM can no longer satisfy the need to perceive environmental information, because the robot can only obtain information about a certain dimension of the environment through a single visual sensor. The limited environmental information cannot provide enough environmental features to assist positioning, resulting in low mapping quality and low navigation and positioning accuracy [5]. Although the Red Green Blue Depth (RGB-D) depth camera used in visual SLAM can provide more information than traditional cameras, it is limited by its own hardware: it is easily affected by light, and the robot is prone to tracking failure under conditions such as high-speed rotation. The inertial sensor has the advantages of a high output frequency for attitude estimation and high output accuracy under high-speed rotation, but its attitude output accumulates large errors during long-term operation. As a result, the robot's autonomous working ability in an indoor environment, its navigation and positioning accuracy, and its map construction quality are all insufficient. This research therefore aims to improve the robot's pose estimation accuracy: visual SLAM and Kalman-filter visual-inertial navigation are studied, and their attitude estimates are fused to generate a real-time and accurate robot indoor navigation point cloud map.

Related work
Under the background of the continuous development and progress of science and technology in the new era, as the application scenarios of indoor intelligent mobile robots continue to expand, the requirements for robot navigation and positioning technology are also constantly increasing. For robots that need to work in more complex environments, traditional single sensors can no longer meet their needs for perceiving environmental information. Therefore, scholars and experts at home and abroad have discussed the generation algorithms of indoor robot navigation point cloud maps in depth. In order to improve the trajectory planning accuracy of the detection robot, Zhang et al. proposed a new method for generating non-destructive testing trajectories of robots based on preprocessed point cloud data, and the results show that the method is reliable [6]. In order to obtain an accurate environmental cognitive map and achieve the goal of bionic robot navigation, Yu et al. proposed a navigation algorithm to construct an accurate environmental cognitive map, and experiments show that the algorithm is accurate [7]. To achieve precise positioning and obtain faster tracking speed and better depth maps, Wen et al. proposed an algorithm for constructing dense maps through a fast stereo visual-inertial navigation system, and the results show that the algorithm has certain advantages [8]. In order to estimate object recognition and pose reliably, Wang et al. proposed a new global point cloud descriptor; the results show that this method can solve the problem well and guide the robot to grasp objects with mirrored poses [9]. To solve the problems of traditional two-dimensional lidar scan-matching algorithms, such as a large amount of calculation and poor real-time performance, Liu et al. proposed a multi-resolution auxiliary historical point cloud matching algorithm, and the results verify the effectiveness and accuracy of the algorithm [10]. To ensure that non-expert users can plan robot trajectories intuitively and effectively, Zhang et al. proposed a new trajectory planning system, and the experimental results show that the system is effective and has certain advantages [11].
In order to improve the accuracy and performance of robot navigation, Yang et al. proposed a two-layer ant colony optimization algorithm that can be used for the autonomous navigation of robots, and experiments show that the algorithm is efficient [12]. To explore methods of robot navigation in dynamic environments, Wang et al. proposed projecting a 3D point cloud onto a 2D static global map in real time based on SLAM and constructed a 2D dynamic global map system; experiments show that the system is effective and reliable [13]. Li et al. proposed an algorithm for reconstructing remote indoor environments through incremental point cloud compression, with the purpose of improving the effectiveness of 3D mapping for remotely operated robots working in dangerous areas, and the research verifies the applicability of this method [14]. In order to improve the accuracy and resolution of all-in-focus images and 3D point clouds, Zhao et al. proposed a 3D vision sensing method based on focus superposition, and experiments show that this technology is effective and superior [5]. To address the problem that iterating only on the closest point in dynamic and other uncertain, dangerous environments cannot achieve accurate point cloud registration, Li et al. proposed a six-degree-of-freedom pose estimation and mapping framework based on segmented scan matching, and experiments show that this method can obtain higher pose estimation accuracy [15].
From the research results of domestic and foreign scholars and experts on robot indoor navigation point cloud map generation algorithms, it can be seen that pose estimation based on visual sensors such as cameras has gradually matured. However, current research on the accuracy of robot indoor navigation, at home and abroad, is still insufficient, and issues such as map construction quality remain difficult problems to be solved. Therefore, based on visual sensing, this research uses a fusion algorithm of visual SLAM and Kalman-filter visual-inertial navigation attitude estimation to generate a robot indoor navigation point cloud map.

The point cloud map is a very effective representation of a robot navigation map, and it is mainly obtained from laser point clouds or image point clouds [16,17]. The 3D laser scanners used for 3D point clouds are expensive, bulky, and complicated in data processing. There are two main methods of generating image point clouds: SfM and SLAM. SfM has difficulty recovering the real scale of the 3D scene structure and cannot meet high real-time requirements. With the development of sensor technology and computer vision technology, SLAM in the field of robotics is more suitable for application scenarios with high real-time requirements. The sensor commonly used in visual SLAM is the camera. Among them, the depth camera (RGB-D) can provide more information than traditional cameras, and it does not require the time-consuming and laborious depth computation of monocular or binocular cameras; in SLAM it is mainly used indoors. The depth camera simultaneously provides a color image $I_{RGB}:\Omega \to \mathbb{R}^{3}$ and a depth image $I_{D}:\Omega \to \mathbb{R}$. Suppose the focal length of the standard pinhole camera is $f$ and the offsets of the camera projection center along the image $x$ axis and $y$ axis are $c_{x}$ and $c_{y}$, respectively. The projection function $\pi$ that maps a point $P=(X,Y,Z)^{T}$ in three-dimensional space onto the image plane is shown in equation (1):

$$\pi(P)=\left(\frac{fX}{Z}+c_{x},\ \frac{fY}{Z}+c_{y}\right)^{T} \tag{1}$$

Given the pixel coordinates $p=(u,v)^{T}$ of the image and the corresponding depth value $Z(p)$, the back projection $\pi^{-1}$ reconstructs the three-dimensional point in space, as shown in equation (2):

$$\pi^{-1}(p,Z(p))=\left(\frac{(u-c_{x})Z(p)}{f},\ \frac{(v-c_{y})Z(p)}{f},\ Z(p)\right)^{T} \tag{2}$$

The camera rigid body transformation is defined in equation (3):

$$T=\begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix}\in SE(3) \tag{3}$$

where $R\in SO(3)$ represents an orthogonal rotation matrix and $t\in\mathbb{R}^{3}$ represents a translation vector. The camera pose is defined by this rigid body transformation matrix, and the corresponding twist coordinates are given in equation (4):

$$\xi=(v_{x},v_{y},v_{z},\omega_{x},\omega_{y},\omega_{z})^{T} \tag{4}$$

where $v_{x},v_{y},v_{z}$ represent the linear velocity of the camera and $\omega_{x},\omega_{y},\omega_{z}$ represent its angular velocity.
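To make equations (1)-(3) concrete, the following is a minimal sketch, not part of the original system: it implements the pinhole projection, its back-projection, and a rigid-body transform in Python, using assumed Kinect-style intrinsics rather than any calibration from this study.

```python
import numpy as np

# Assumed, illustrative intrinsics (not calibration values from the paper).
f, cx, cy = 525.0, 319.5, 239.5

def project(P):
    """Equation (1): map a 3D point P = (X, Y, Z) in the camera frame to pixel (u, v)."""
    X, Y, Z = P
    return np.array([f * X / Z + cx, f * Y / Z + cy])

def back_project(p, depth):
    """Equation (2): reconstruct the 3D point from pixel p = (u, v) and its depth Z(p)."""
    u, v = p
    return np.array([(u - cx) * depth / f, (v - cy) * depth / f, depth])

def transform(R, t, P):
    """Equation (3): apply a rigid-body transform (R, t) in SE(3) to a point."""
    return R @ P + t

P = np.array([0.5, -0.2, 2.0])
p = project(P)
assert np.allclose(back_project(p, P[2]), P)  # pi^{-1}(pi(P), Z(p)) recovers P
```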
According to the operation rules of Lie groups and Lie algebras, the rigid body transformation of the camera pose in space is shown in equation (5):

$$T=\exp(\hat{\xi}) \tag{5}$$

The camera pose corresponding to the first frame image is defined as the origin of coordinates, the camera pose is estimated from the relative change between two adjacent image frames, and the motion of the camera in three-dimensional space is recorded and described. The depth camera (RGB-D) obtains a color image and a depth image, and from the depth image the spatial three-dimensional coordinates corresponding to each pixel in the camera coordinate system are obtained. Assume that the set of feature points of the color image $I_{RGB}^{t-1}$ at time $t-1$ is $p_{t-1}$ and the corresponding set of three-dimensional points in space is $P_{t-1}$; at time $t$, the set of feature points of the color image $I_{RGB}^{t}$ is $p_{t}$. Matching $p_{t}$ against $p_{t-1}$ gives the correspondence between $I_{RGB}^{t-1}$ and $I_{RGB}^{t}$, and the projection error between them is shown in equation (6):

$$E_{rgb}(\xi)=\sum_{k}\left\| p_{t}^{k}-\pi\!\left(\exp(\hat{\xi})P_{t-1}^{k}\right)\right\|^{2} \tag{6}$$

In equation (6), $P_{t-1}^{k}$ represents the corresponding three-dimensional point coordinates in space. By optimizing the projection error of the color image, the pose of the camera can be obtained as shown in equation (7):

$$\xi^{*}=\arg\min_{\xi}E_{rgb}(\xi) \tag{7}$$

To improve the robustness of camera tracking, when the camera pose cannot be obtained in this way, the camera pose $\hat{\xi}$ is estimated by combining the depth image at time $t-1$ with the ICP algorithm, and this pose is then used to find matching features between the feature point set $p_{t}$ and $P_{t-1}$. In the SLAM point cloud map, only key frames that meet certain conditions are saved instead of all image frames, and the relative distance between two frames is defined as a weighted combination of the translation and rotation of the camera pose, as shown in equation (8):

$$\mathrm{dist}(\xi_{ij})=\xi_{ij}^{T}W\xi_{ij} \tag{8}$$

In equation (8), $W$ represents a diagonal matrix containing the weight of each parameter of $\xi_{ij}$. If the distance between the previous key frame $K_{i-1}$ and the current frame exceeds the set threshold, a new keyframe $K_{i}$ is created. The processing steps of the key frame are shown in Figure 1.

As shown in Figure 1, for a newly added keyframe $K$, the keyframes that share the same visual words are quickly retrieved through Bow (Bag of Visual Words) vectors to form a local keyframe set $K_{L}$. At the same time, with the 3D map point set corresponding to $K_{L}$, a joint optimization of projection error and back-projection error is performed, which improves the estimation accuracy of the camera pose and of the 3D map coordinates corresponding to the keyframe. Assume that the color image and depth image obtained by the depth camera at time $t$ are $I_{t}$ and $D_{t}$, those obtained at time $t-1$ are $I_{t-1}$ and $D_{t-1}$, and the camera transformation between the two moments is $\xi$. Then the back-projection error between $D_{t}$ and $D_{t-1}$ is shown in equation (9):

$$E_{icp}(\xi)=\sum_{k}\left\| v_{t}^{k}-\exp(\hat{\xi})\,v_{t-1}^{k}\right\|^{2} \tag{9}$$

In equation (9), $v_{t}^{k}$ represents the three-dimensional point in space obtained by back-projecting the pixel coordinates of the $k$-th map point in $D_{t}$. The projection error of the color image is shown in equation (10):

$$E_{rgb}(\xi)=\sum_{k}\left\| p_{t}^{k}-\pi\!\left(\exp(\hat{\xi})P_{t-1}^{k}\right)\right\|^{2} \tag{10}$$

Jointly optimizing the back-projection error of the depth image and the projection error of the color image gives the result shown in equation (11):

$$E(\xi)=w_{rgb}E_{rgb}(\xi)+E_{icp}(\xi) \tag{11}$$

In equation (11), $w_{rgb}$ represents the weight parameter of the color image projection error.
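The error terms above can be sketched as follows. This is an illustrative implementation of equations (6), (8), (9), and (11) under the same assumed intrinsics as before; the weights and data are placeholders, not values used in the paper.

```python
import numpy as np

f, cx, cy = 525.0, 319.5, 239.5  # assumed pinhole intrinsics (illustrative)

def project(P):
    # Equation (1): pinhole projection of a 3D point in the camera frame.
    return np.array([f * P[0] / P[2] + cx, f * P[1] / P[2] + cy])

def reprojection_error(R, t, pts3d_prev, pix_cur):
    # Equations (6)/(10): color-image projection error over matched features.
    return sum(np.sum((p - project(R @ P + t)) ** 2)
               for P, p in zip(pts3d_prev, pix_cur))

def icp_error(R, t, pts3d_prev, pts3d_cur):
    # Equation (9): point-to-point back-projection (ICP) error on depth data.
    return sum(np.sum((v - (R @ u + t)) ** 2)
               for u, v in zip(pts3d_prev, pts3d_cur))

def joint_error(R, t, pts3d_prev, pix_cur, pts3d_cur, w_rgb=0.5):
    # Equation (11): weighted combination of the two error terms.
    return (w_rgb * reprojection_error(R, t, pts3d_prev, pix_cur)
            + icp_error(R, t, pts3d_prev, pts3d_cur))

def keyframe_distance(xi_ij, W=np.diag([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])):
    # Equation (8): weighted norm of the relative twist between two frames.
    return float(xi_ij @ W @ xi_ij)
```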
Figure 2 shows the flow of the indoor navigation point cloud pose estimation algorithm for the RGB-D depth camera robot based on visual SLAM.
As shown in Figure 2, the study uses the ORB (Oriented FAST and Rotated BRIEF) operator to quickly extract and match the visual features of the color sequence images captured by the depth camera, and then combines them with the Iterative Closest Point (ICP) algorithm to estimate the camera pose in real time. The ICP error constraint is added to the camera pose optimization, and a graph optimization algorithm is then used to jointly optimize the color image projection error obtained by the depth camera and the back-projection error of the depth image; finally, a relatively accurate camera pose and a 3D sparse point cloud are obtained. However, in visual SLAM, tracking drift of the camera is unavoidable due to the continuous accumulation of attitude errors.
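As a concrete illustration of the ORB front end just described, the sketch below extracts and matches ORB features between two consecutive color frames with OpenCV; the file names are placeholders, and the matcher settings are typical choices rather than the exact configuration of this study.

```python
import cv2
import numpy as np

# Placeholder file names; any two consecutive RGB frames can be substituted.
img_prev = cv2.imread("frame_prev.png", cv2.IMREAD_GRAYSCALE)
img_cur = cv2.imread("frame_cur.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img_prev, None)
kp2, des2 = orb.detectAndCompute(img_cur, None)

# Hamming-distance brute-force matching with cross-check; keep the best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

pts_prev = np.float32([kp1[m.queryIdx].pt for m in matches])
pts_cur = np.float32([kp2[m.trainIdx].pt for m in matches])
# Together with the depth image, pts_prev / pts_cur give the 3D-2D and 3D-3D
# correspondences used by the error terms of equations (6) and (9).
```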

Point cloud map based on visual SLAM and Kalman filter visual-inertial navigation attitude fusion algorithm
In fact, camera pose estimation based only on the visual SLAM algorithm in an indoor environment still has limited real-time performance and a low output frequency, and it is prone to tracking failure under high-speed rotation. The inertial measurement unit (IMU) plays an important role in navigation and positioning: the inertial sensor has a high output frequency for attitude estimation and high output accuracy under high-speed rotation [18]. An IMU kinematics model is therefore established, and a Kalman filter is used to obtain an accurate attitude output. The visual-inertial navigation attitude fusion algorithm based on the Kalman filter is shown in Figure 3.
As shown in Figure 3, the system starts mapping and positioning immediately, obtaining IMU data and image frames. Then, ORB descriptors are extracted, and the IMU data are synchronized in time with the image frames. The pose is initialized by combining the image with the IMU position and attitude information. Next, Kalman filtering is performed, and the cycle ends after the updated IMU data are obtained. The research combines the advantages of the two pose estimation algorithms, using the attitude information output by the inertial navigation as the predicted value and the attitude information obtained by visual SLAM as the observation value. The Kalman filter is an optimal autoregressive data processing algorithm; before using it, the state equation and observation equation of the system must be obtained. The state equation of the inertial system is shown in equation (12):

$$\begin{pmatrix} \dot{\delta\theta} \\ \Delta\dot{b} \end{pmatrix}=\begin{pmatrix} -\lfloor\omega\times\rfloor & -I_{3} \\ 0 & 0 \end{pmatrix}\begin{pmatrix} \delta\theta \\ \Delta b \end{pmatrix}+\begin{pmatrix} -n_{g} \\ n_{bg} \end{pmatrix} \tag{12}$$

In equation (12), $\Delta b$ represents the gyroscope bias error and $\delta\theta$ represents the small angle error; $n_{g}$ is the Gaussian noise of the gyroscope, which obeys a normal distribution with zero mean, and $n_{bg}$ is the zero-mean normally distributed noise that drives the bias random walk model. Here $\lfloor\omega\times\rfloor$ represents the antisymmetric matrix of the rotation rate, as in equation (13):

$$\lfloor\omega\times\rfloor=\begin{pmatrix} 0 & -\omega_{z} & \omega_{y} \\ \omega_{z} & 0 & -\omega_{x} \\ -\omega_{y} & \omega_{x} & 0 \end{pmatrix} \tag{13}$$

To obtain the state transition matrix $\phi_{k}$, the system matrix needs to be discretized. This study uses a Taylor series expansion truncated at the second-order term, as shown in equation (14):

$$\phi_{k}=\exp(F_{k}\Delta t)=\sum_{n=0}^{\infty}\frac{(F_{k}\Delta t)^{n}}{n!}\approx I+F_{k}\Delta t+\frac{1}{2}(F_{k}\Delta t)^{2} \tag{14}$$

In equation (14), $n$ is an arbitrary non-negative integer and $F_{k}$ is the system matrix of the state equation. Combined with equation (12), the state transition matrix $\phi_{k}$ can be represented by equation (15):

$$\phi_{k}=\begin{pmatrix} \Theta & \Psi \\ 0 & I \end{pmatrix} \tag{15}$$

In equation (15), $\Theta$ and $\Psi$ both represent matrix blocks. To obtain the system noise covariance $Q_{k}$ used in the prediction process, the continuous noise matrix is discretized according to the state transition matrix and then approximated to a third-order form, as shown in equation (16):

$$Q_{k}=\int_{0}^{\Delta t}\phi(\tau)\,Q_{c}\,\phi(\tau)^{T}\,d\tau \tag{16}$$

With $\phi_{k}$ and the noise covariance matrix $Q_{k}$ obtained from equation (16), the Kalman filter prediction process is shown in equation (17):

$$\hat{x}_{k|k-1}=\phi_{k}\hat{x}_{k-1},\qquad P_{k|k-1}=\phi_{k}P_{k-1}\phi_{k}^{T}+Q_{k} \tag{17}$$

In equation (17), $P_{k|k-1}$ and $P_{k-1}$ represent the system covariance matrices, $\hat{x}_{k|k-1}$ represents the state predicted value at time $k$, and $\hat{x}_{k-1}$ represents the state estimated value at the previous moment. The prediction process is the basis of the Kalman filter update, which obtains the optimal estimate of the system state through the minimum-variance criterion. Over time, the extremely high output frequency of the IMU attitude estimate also leads to a large accumulated error in the final attitude. Therefore, the attitude output of the IMU can be corrected by the low-frequency output of visual SLAM, improving the robustness of the system. Using the observation matrix $H$, the observation equation is given in equation (18):

$$z_{k}=Hx_{k}+n_{k} \tag{18}$$

The update process of the Kalman filter follows from equation (18), as shown in equation (19) [23]:

$$K_{k}=P_{k|k-1}H^{T}\left(HP_{k|k-1}H^{T}+R_{k}\right)^{-1},\quad \hat{x}_{k}=\hat{x}_{k|k-1}+K_{k}\left(z_{k}-H\hat{x}_{k|k-1}\right),\quad P_{k}=\left(I-K_{k}H\right)P_{k|k-1} \tag{19}$$

In equation (19), $R_{k}$ represents the observation noise variance matrix. In the filtering process, initial values must be set for several parameters, including the initial state $X_{k}$, the initial system covariance matrix $P_{k}$, the system noise covariance $Q_{c}$, and the observation noise covariance matrix $R$.
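A minimal numeric sketch of the prediction-update cycle of equations (13)-(19) follows. The six-dimensional error state, the observation matrix, and all noise values here are illustrative assumptions, not the tuned parameters of the study.

```python
import numpy as np

dt = 0.005                      # 200 Hz IMU period (assumed)
I3, Z3 = np.eye(3), np.zeros((3, 3))

def skew(w):
    # Equation (13): antisymmetric matrix of the angular rate.
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def predict(x, P, omega, Q):
    # Equations (14) and (17): discretize F with a 2nd-order Taylor expansion,
    # then propagate the state and covariance.
    F = np.block([[-skew(omega), -I3], [Z3, Z3]])
    Phi = np.eye(6) + F * dt + 0.5 * (F * dt) @ (F * dt)
    return Phi @ x, Phi @ P @ Phi.T + Q

def update(x_pred, P_pred, z, H, R):
    # Equation (19): Kalman gain, state correction, covariance update.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(6) - K @ H) @ P_pred
    return x, P

# One illustrative cycle with placeholder data.
x, P = np.zeros(6), np.eye(6)
Q = 1e-6 * np.eye(6)
H = np.hstack([I3, Z3])                       # observe the attitude error only
R = 1e-4 * np.eye(3)
x, P = predict(x, P, omega=np.array([0.01, 0.0, 0.02]), Q=Q)
x, P = update(x, P, z=np.zeros(3), H=H, R=R)
```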
After time synchronization, both the IMU and the depth camera sensor start to accumulate from an initial value of zero, and the initial value of the gyroscope bias is given in equation (20); the initial values of the remaining parameters obtained from equation (20) are shown in equation (21). Since the initial value of the system covariance matrix determines the initial filter gain of the system but has little effect on the convergence speed, $P_{k}$ is set to a 6×6 identity matrix. The filtering performance is greatly affected by the initial value of the system noise covariance matrix $Q_{c}$: when its elements increase, the uncertainty of the system noise and system parameters increases, the gain matrix also increases, and the robustness of the system improves. The value of $Q_{c}$ that yields good convergence is shown in equation (22). When the element values in the matrix $R$ increase, the influence of measurement noise increases while the gain matrix decreases, the system correction weight weakens, and the transient response and steady-state value of the system also decrease. The choice of the matrices $Q_{c}$ and $R$ is therefore a trade-off: $R$ and $Q_{c}$ determine the degree of confidence placed in the observation and the prediction, respectively. In this study, the visual SLAM measurement is used as the observation value and the attitude output by the IMU as the prediction value, so a smaller noise value is assigned to $R$; the setting of this value after tuning is shown in equation (23). In general, the initial parameter values do not need to be re-adjusted unless the IMU device is replaced.
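The trade-off described above can be illustrated numerically: with the 6×6 identity initial covariance, enlarging the observation-noise matrix R shrinks the Kalman gain, so the visual observation corrects the IMU prediction less strongly. The specific R values below are placeholders, not those of equations (22) and (23).

```python
import numpy as np

P0 = np.eye(6)                                   # initial covariance: 6x6 identity, as in the text
H = np.hstack([np.eye(3), np.zeros((3, 3))])     # observe the attitude error only

def gain(P_pred, R):
    # Kalman gain from equation (19) for a given observation-noise matrix R.
    return P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)

for r in (1e-4, 1e-2, 1.0):
    K = gain(P0, r * np.eye(3))
    print(f"R = {r:g}  ->  max gain element = {K.max():.3f}")
# As R grows, the maximum gain element falls toward zero: the filter trusts
# the IMU prediction more and the visual observation less.
```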
As shown in Figure 4, the algorithm for generating the robot indoor navigation point cloud map is mainly divided into three modules. The tracking thread obtains the camera pose and the sparse point cloud through the depth camera and visual SLAM, then determines new keyframes, analyzes them, adds or deletes them as appropriate, and performs point cloud splicing and fusion. Finally, closed-loop detection is used to optimize the essential graph, and the attitude information output by visual SLAM and by inertial navigation is used as the observed value and the predicted value, respectively, to improve the robustness of the attitude acquisition process and finally obtain a more accurate robot indoor navigation point cloud map [24,25].
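As a sketch of the point cloud splicing and fusion step in the mapping module, the code below transforms each keyframe's local cloud into the map frame with its pose and merges the result, using a simple voxel-grid filter to suppress duplicate points; the voxel size, poses, and data are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def splice(global_cloud, local_cloud, R, t, voxel=0.02):
    """Transform a keyframe's local cloud into the map frame and merge it,
    keeping one point per occupied voxel to limit duplication."""
    world = local_cloud @ R.T + t
    merged = np.vstack([global_cloud, world]) if global_cloud.size else world
    _, idx = np.unique(np.floor(merged / voxel).astype(int), axis=0, return_index=True)
    return merged[np.sort(idx)]

# Two overlapping keyframe clouds with placeholder poses.
cloud = np.empty((0, 3))
local = np.random.default_rng(0).random((1000, 3))
cloud = splice(cloud, local, np.eye(3), np.array([0.0, 0.0, 0.0]))
cloud = splice(cloud, local, np.eye(3), np.array([0.01, 0.0, 0.0]))
print(cloud.shape)   # fewer than 2000 points remain after voxel fusion
```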

Experimental analysis of point cloud map generation for robot indoor navigation based on visual sensing
In the experiment, a Turtlebot is used as the robot platform and is equipped with a second-generation Kinect (Xbox) depth camera; a laptop with an Intel Core i5-3230 CPU and a 64-bit Debian Sid system serves as the main control computer, responsible for SLAM information processing and system operation control. The study extracts the visual features of the sequence images captured by the depth camera and then obtains the matching set of feature points from adjacent image frames to estimate the camera pose and its motion trajectory in space in real time, as shown in Figure 5(a). The depth image obtained by the depth camera is used together with the set of matched feature points between adjacent images, and a triangulation operation then reconstructs the spatial 3D point coordinates corresponding to the image pixels; the resulting sparse 3D point cloud is shown in Figure 5(b).
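For reference, the following sketch shows how one RGB-D frame can be back-projected into a colored point cloud of the kind accumulated into the maps above; the synthetic image, depth values, and Kinect-like intrinsics are illustrative assumptions, not data from the experiment.

```python
import numpy as np

H, W = 480, 640
f, cx, cy = 525.0, 319.5, 239.5
rgb = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)     # placeholder color frame
depth = np.full((H, W), 2.0)                                   # placeholder depth [m]

u, v = np.meshgrid(np.arange(W), np.arange(H))
valid = depth > 0                                              # drop pixels with missing depth
z = depth[valid]
x = (u[valid] - cx) * z / f                                    # equation (2), per pixel
y = (v[valid] - cy) * z / f
points = np.stack([x, y, z], axis=1)                           # N x 3 positions
colors = rgb[valid]                                            # N x 3 RGB values
print(points.shape, colors.shape)                              # (307200, 3) (307200, 3)
```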
As shown in Figure 5, the motion trajectory of the camera in space is almost the same as the line connecting the coordinate points of the three-dimensional sparse point cloud. To reduce the influence of error accumulation and obtain a globally consistent map, loop detection is introduced into the SLAM processing flow. When a keyframe is processed, the Bow vector of the newly generated keyframe is registered in the vector database; in the loop detection stage, the keyframes that share visual words with the current keyframe are retrieved from the constructed vector database and used as loop candidates. Then, the similarity between each candidate and the current keyframe is calculated, keyframes whose similarity exceeds a certain threshold are taken as loop candidates, and their adjacent frames are used to calculate the transformation relationship with the current frame; a candidate is accepted as a loop keyframe only when a sufficient number of keyframes are geometrically consistent with the current frame. In order to test the efficiency of the research method, the key steps are measured for speed, and the performance test results are shown in Table 1.
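A minimal sketch of the loop-candidate retrieval step is given below: keyframes are represented by bag-of-visual-words vectors, and those whose similarity with the current keyframe exceeds a threshold are kept as candidates. The random histograms, cosine similarity measure, and threshold are illustrative assumptions; the actual system retrieves candidates from the Bow vector database described above.

```python
import numpy as np

rng = np.random.default_rng(0)
keyframe_bows = rng.random((50, 128))                      # 50 stored keyframes, 128 visual words
current_bow = keyframe_bows[3] + 0.05 * rng.random(128)    # simulated revisit of keyframe 3

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

threshold = 0.95
candidates = [i for i, bow in enumerate(keyframe_bows)
              if cosine_similarity(bow, current_bow) > threshold]
print(candidates)   # keyframe 3 is retrieved as a loop candidate
```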
As shown in Table 1, it can be seen from the positioning speed test data that the average camera-tracking time is 23.4 ms, which meets the processing speed requirement of 42 frames per second. From the program performance measurements and the robot test results, the research method can meet the real-time requirements of indoor robot navigation point cloud map generation. The experiment uses an Inertial Labs IMU, model OS3DM, whose output frequency can reach 200 Hz; the IMU is fixed on the Kinect camera. The pose estimation information of the visual SLAM algorithm is used as the observation value to correct the output value of the IMU. To ensure the accuracy of the fusion algorithm, the IMU and the Kinect camera need to be aligned through a transformation between their coordinate systems; the time filter mechanism under ROS is then used to match the topic timestamps published by the two sensors, and the data are aligned through software time synchronization. Figure 6 shows the visual-inertial navigation attitude comparison and the ground truth.
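The software time-synchronization step can be sketched with the ROS 1 message_filters mechanism, one common realization of the "time filter mechanism under ROS" mentioned above; the topic names, queue size, and slop value below are assumptions for illustration, not the exact configuration used in the experiment.

```python
import rospy
import message_filters
from sensor_msgs.msg import Image, Imu

def fused_callback(rgb_msg, depth_msg, imu_msg):
    # Messages arrive here only when their timestamps agree within `slop` seconds,
    # so the SLAM observation and the IMU prediction refer to the same instant.
    rospy.loginfo("aligned stamps: %s / %s / %s",
                  rgb_msg.header.stamp, depth_msg.header.stamp, imu_msg.header.stamp)

rospy.init_node("vio_time_sync")
rgb_sub = message_filters.Subscriber("/camera/rgb/image_raw", Image)      # assumed topic
depth_sub = message_filters.Subscriber("/camera/depth/image_raw", Image)  # assumed topic
imu_sub = message_filters.Subscriber("/imu/data", Imu)                    # assumed topic

sync = message_filters.ApproximateTimeSynchronizer(
    [rgb_sub, depth_sub, imu_sub], queue_size=20, slop=0.01)
sync.registerCallback(fused_callback)
rospy.spin()
```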
It can be seen from Figure 6(a) that, over a relatively short time, the attitude information output by the IMU meets the requirements, but after a period of operation the output value drifts noticeably as the accumulated attitude error increases. The attitude information output by the visual SLAM algorithm also develops a certain error after long-term operation, although this error is smaller than that of the IMU. The error of the research algorithm is smaller than the output attitude errors of both the IMU and SLAM: the algorithm uses the attitude information obtained by visual SLAM as the observation value to correct the output value of the IMU, and its output is close to the real value in Figure 6(b). The visual-inertial navigation attitude errors are sorted and counted, and the results are shown in Table 2.
From Table 2, the maximum yaw angle error of the research method is 8.215° and its root mean square error is 2.143, which is smaller than the yaw angle errors of the IMU and RGBD-SLAM algorithms, indicating that the accuracy of the research algorithm is improved to a certain extent. Furthermore, the accuracy of the method is verified with the TUM-RGBD data set, which is used for the accuracy evaluation test. The absolute trajectory error (ATE) between the estimated camera trajectory and the real trajectory is measured, and the ATE statistics are shown in Table 3.

As shown in Table 3, across the 10 data sets, the camera ATE calculated by the fusion of visual SLAM and Kalman-filter visual-inertial navigation attitude has a minimum of 0.010 and a maximum of 0.073. The ATE of the IMU has a maximum of 0.092 and a minimum of 0.015, and the ATE of the RGBD-SLAM algorithm has a maximum of 0.914 and a minimum of 0.027. The ATE value of the studied fusion algorithm on each data set is smaller than those of the IMU and RGBD-SLAM algorithms. The test results show that the accuracy and robustness of the camera pose estimated by this research method are better than those of the other two methods. The camera trajectory estimated by the research method is compared with the real value of the camera trajectory, and the results are shown in Figure 7. Four representative camera trajectories are selected in Figure 7 to show the difference between the estimated trajectory and the true trajectory. It is obvious from Figure 7(a) that the estimated camera trajectory coincides closely with the real value and is close to the ground-truth track. It can be seen from Figure 7(b) that although the true trajectory is more complex, the trajectory estimated by the camera still coincides well with it. Although the range of the real trajectory in Figure 7(c) is larger, the difference between the estimated trajectory and the real trajectory is very small. In Figure 7(d), the true trajectory is partly open and partly closed, and the camera-estimated trajectory accurately avoids the missing trajectory segment and remains very close to the true trajectory. The results show that the camera trajectory estimated by the fusion algorithm of visual SLAM and Kalman-filter visual-inertial navigation attitude is very similar to the real camera trajectory. Compared with the IMU and RGBD-SLAM algorithms, the camera trajectory obtained by this algorithm is more accurate, real-time, and robust.
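For clarity, the sketch below computes an ATE-style RMSE between an estimated trajectory and a ground-truth trajectory after a simple translational alignment; the synthetic trajectories are placeholders, and the full TUM evaluation additionally solves for the aligning rotation, which is omitted here.

```python
import numpy as np

# Synthetic ground-truth and estimated trajectories (N x 3 positions), for illustration only.
gt = np.cumsum(np.random.default_rng(1).normal(0, 0.05, (500, 3)), axis=0)
est = gt + np.random.default_rng(2).normal(0, 0.02, gt.shape)

est_aligned = est - est.mean(axis=0) + gt.mean(axis=0)        # translational alignment
ate_rmse = np.sqrt(np.mean(np.sum((est_aligned - gt) ** 2, axis=1)))
print(f"ATE RMSE: {ate_rmse:.3f} m")
```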

Conclusion
This research mainly focuses on the problem of point cloud map generation for robot indoor navigation and applies an algorithm based on visual sensing that fuses visual SLAM with Kalman-filter visual-inertial navigation attitude estimation. The experimental results show that the movement trajectory of the camera in space is roughly the same as the line connecting the coordinate points of the three-dimensional sparse point cloud, which demonstrates the effectiveness of the method. In the positioning speed test, the average camera-tracking time is 23.4 ms, which meets the processing speed requirement of 42 frames per second and satisfies the real-time requirement of robot indoor navigation point cloud map generation. The error value of the fusion algorithm in this study is smaller than the output attitude errors of the IMU and SLAM. The maximum yaw angle error of the research method is 8.215° and its root mean square error is 2.143, smaller than the yaw angle errors of the IMU and RGBD-SLAM algorithms, indicating that the accuracy of the research algorithm has been improved to a certain extent. The ATE test values of this research method are smaller than those of the IMU and RGBD-SLAM algorithms, and the camera trajectory estimated by this algorithm is very close to the true camera trajectory in the data set. Although the fusion algorithm in this study has achieved good test results, it is still lacking in establishing a dense 3D point cloud, and follow-up research will focus on improving this aspect.