Xi'an Technological University
Subject: Computer Science, Software Engineering
eISSN: 2470-8038
Yu Haige ^{*} / Yu Fan ^{*} / Wei Yanxi ^{*}
Keywords : Robot, IMU, Stereo Vision, SLAM
Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 4, Issue 4, Pages 47-55, DOI: https://doi.org/10.21307/ijanmc-2019-070
License : (CC-BY-NC-ND 4.0)
Published Online: 27-January-2020
Vision-based SLAM has a notable weakness: it fails when the camera moves quickly or when the operating environment contains few features. To address this problem, this paper proposes a SLAM method that fuses IMU and vision. A stereo camera is used to extract ORB feature points from the images. During camera motion, if the number of extracted feature points is below a certain threshold and the camera motion cannot be estimated, or if the estimated camera rotation and translation exceed a certain threshold, the camera pose is estimated by fusing the IMU; otherwise, feature points alone are used to estimate the camera pose. Nonlinear optimization is used for both the pure feature-point pose estimation and the IMU-fused pose estimation. The experimental results show that binocular vision SLAM with IMU information estimates the camera pose more accurately.
With the development of robot technology, more and more robots are entering daily life, such as sweeping robots and shopping mall robots. Mobile robots are the product of the cross fusion of various disciplines and technologies, and SLAM (Simultaneous Localization and Mapping) is one of their key technologies. SLAM means that the robot builds a map of the surrounding environment in real time from sensor data, without any prior knowledge, and infers its own position from that map. From the 1980s to the present, more and more sensors have been used in SLAM, from early sonar, to later 2D/3D lidar, to monocular, binocular, RGB-D, ToF and other cameras. Compared with lidar, cameras have become the focus of current SLAM research because of their low price, light weight, rich image information, and wide range of applications. Stereo cameras generally consist of two pinhole cameras placed horizontally. Whereas monocular vision suffers from scale uncertainty and fails under pure rotation, binocular cameras can compute pixel depth directly. At the same time, unlike RGB-D cameras, stereo cameras collect images directly from ambient light and can be used both indoors and outdoors. Compared with lidar, the main disadvantages of the camera as a SLAM sensor are that images blur when the camera moves too fast, and that the camera fails in scenes with insufficient texture and few feature points.
To address these problems of visual SLAM, this paper proposes an algorithm that fuses IMU data with visual SLAM. The IMU provides a good initial pose for the system and, during camera motion, compensates for the shortcomings of visual SLAM, preserving the accuracy of camera pose estimation under fast camera motion and in texture-poor environments.
Camera models generally involve four coordinate systems: a pixel coordinate system, an image coordinate system, a world coordinate system, and a camera coordinate system, as shown in Figure 1.
Among them, O_{w} − X_{w}Y_{w}Z_{w} is the world coordinate system. The world coordinate system is the reference coordinate system in the visual SLAM system. The positions of the camera trajectory and map points are described based on this coordinate system. The unit is m.
O_{i} – xy is the image coordinate system. The image coordinate system takes the intersection of the camera optical axis and the image plane as its origin. The unit is mm.
O_{c} – X_{c}Y_{c}Z_{c} is the camera coordinate system. The camera coordinate system uses the camera optical center as the origin, and the directions parallel to the x-axis and y-axis of the image coordinate system are respectively taken as the X_{c}-axis and Y_{c}-axis, and the direction perpendicular to the image plane is the Z_{c}-axis. The unit is m.
O – uv is the pixel coordinate system. The origin of the pixel coordinate system is generally the upper left corner of the image, with the u axis pointing right, parallel to the x axis, and the v axis pointing down, parallel to the y axis. The unit is pixel.
The camera maps points of the three-dimensional world onto the two-dimensional image plane. This process is generally described by a pinhole model. Under the pinhole model, assume there is a spatial point P with coordinates [X Y Z]^{T}. After projection through the pinhole O, the point P falls on the imaging plane o – xy at the imaging point p, whose coordinates are [X′ Y′ Z′]^{T}. Let the distance from the imaging plane to the pinhole be the focal length f. Then, by the principle of triangle similarity:
So we can get:
The pixel coordinate system differs from the imaging plane by a scaling and a translation of the origin. Suppose that the pixel coordinates are scaled α times on the u axis and β times on the v axis, and the origin is translated by [c_{x}, c_{y}]^{T}, so we can get:
Equation (3) is substituted into equation (2) to get:
Here f_{x} = αf and f_{y} = βf. The unit of f is m and the unit of α and β is pixel/m, so the unit of f_{x} and f_{y} is pixel. Written as a matrix:
Among them, the matrix K is called the internal parameter matrix of the camera, and P is the coordinate representation of the space point in the camera coordinate system.
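The projection described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard pinhole relations u = f_x·X/Z + c_x and v = f_y·Y/Z + c_y; the function name and the numeric intrinsics are hypothetical and are not taken from the paper.

```python
def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a 3D point given in the camera frame to pixel coordinates (u, v).

    Assumes the ideal pinhole model without distortion.
    """
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return u, v


if __name__ == "__main__":
    # Hypothetical intrinsics, for illustration only.
    print(project_pinhole((0.2, -0.1, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

Dividing by Z is what makes the mapping lose depth: every point on the same ray through the optical center lands on the same pixel.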
Let the coordinate P of the space point in the camera coordinate system correspond to the coordinate P_{w} in the world coordinate system, and use coordinate transformation to obtain:
Among them, T represents the pose of the camera relative to the world coordinate system, and can also be called the external parameter of the camera. In summary, the pinhole camera model uses the triangle similarity relationship to obtain the relationship between space points and pixels, which is a relatively ideal model. In practice, there will be errors in the manufacture and installation of optical lenses, which will affect the propagation of light during the imaging process and cause distortion in the images collected by the camera. Here we mainly consider radial distortion and tangential distortion.
Radial distortion is caused by the shape of the lens, and the distortion increases as the distance between the pixel and the center of the image increases. Therefore, a polynomial function can be used to describe the changes before and after the distortion, that is, the quadratic and higher-order polynomial functions related to the distance between the pixel and the center of the image can be used for correction. The polynomial of the coordinate change before and after the radial distortion correction is as follows:
Among them, [x, y]^{T} is the coordinates of the uncorrected points, and [x_{corrected}, y_{corrected}]^{T} is the coordinates of the points after the distortion is corrected. r is the distance from the point (x, y) to the origin. k_{1},k_{2} and k_{3} are three radial distortion parameters. Usually these three parameters can be obtained by the calibration step.
Tangential distortion arises because the lens and the imaging plane cannot be made strictly parallel during camera assembly. It can be corrected using two further parameters, p_{1} and p_{2}:
Considering the two types of distortion, we can find the correct position of a pixel in the pixel coordinate system through 5 distortion coefficients:
In summary, the parameters describing the camera model mainly include f_{x}, f_{y}, c_{x}, and c_{y} in the camera's internal parameter matrix, and the distortion correction parameters k_{1}, k_{2}, k_{3}, p_{1}, and p_{2}.
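As an illustration of how the five coefficients act together, the sketch below assumes the commonly used radial-tangential polynomial model applied to coordinates on the normalized image plane, followed by the intrinsic mapping to pixels. The function names are hypothetical, and the exact form used by the authors may differ.

```python
def apply_distortion_model(x, y, k1, k2, k3, p1, p2):
    """Evaluate the radial-tangential polynomial at a normalized image point (x, y).

    Assumed model: radial factor 1 + k1*r^2 + k2*r^4 + k3*r^6 plus the usual
    tangential terms in p1 and p2.
    """
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_corrected = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_corrected = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_corrected, y_corrected


def to_pixel(x, y, fx, fy, cx, cy):
    """Map a normalized image point to pixel coordinates using the intrinsics."""
    return fx * x + cx, fy * y + cy
```

With all coefficients set to zero the function returns its input unchanged, which is a quick sanity check after calibration values are loaded.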
The binocular camera generally consists of two pinhole cameras placed horizontally, both observing the same object. The aperture centers of both cameras lie on one axis, and the distance between them is called the baseline b of the binocular camera. A space point P is imaged in both the left-eye camera and the right-eye camera, and the two image points are denoted P_{L} and P_{R}. Because of the baseline, the two imaging positions differ. Denoting the image coordinates on the left and right sides as x_{L} and x_{R}, the similarity of the triangles gives:
We can get:
The above model is an ideal model, which aims to explain the principle of measuring the actual three-dimensional point depth of the binocular camera. In practical applications, due to factors such as manufacturing and installation, it is difficult to achieve that the imaging planes of the binocular cameras are strictly on the same plane and the optical axes are strictly parallel. Therefore, before using a binocular camera for measurement, it should be calibrated to obtain the left and right camera internal parameters and the relative position relationship between the left and right cameras.
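Under the ideal model, depth follows from the disparity between the two image coordinates. The sketch below assumes the standard rectified-stereo relation z = f·b/d with d = x_L − x_R, where f is expressed in pixels; the function name and the example values are hypothetical.

```python
def stereo_depth(x_left, x_right, focal_px, baseline_m):
    """Depth of a point from its horizontal image coordinates in a rectified stereo pair.

    Assumes parallel optical axes and coplanar image planes (the ideal model).
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


if __name__ == "__main__":
    # Hypothetical values: 700 px focal length, 12 cm baseline, 20 px disparity.
    print(stereo_depth(420.0, 400.0, focal_px=700.0, baseline_m=0.12))
```

Because depth is inversely proportional to disparity, a fixed one-pixel matching error costs far more accuracy for distant points than for near ones.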
At present, methods for fusing a monocular vision sensor with an IMU can be divided into two types: loose coupling and tight coupling [1]. Loose coupling treats the vision sensor and the IMU as two separate modules, each of which computes pose information, and then fuses the results with an EKF [2] or a similar filter. Tight coupling jointly applies nonlinear optimization to the vision and IMU data to obtain the pose estimate. Because tight coupling makes full use of each sensor's data, this paper uses tight coupling to fuse vision and IMU data. First, the purely visual feature-point method is used to estimate the camera pose. Then, during camera motion, if the number of extracted feature points is below a certain threshold and the camera motion cannot be estimated, or if the estimated camera rotation and translation exceed a certain threshold, the camera pose is estimated by fusing the IMU; otherwise, feature points are still used to estimate the camera pose.
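The switching rule can be sketched as a small predicate. The threshold values, the names, and the choice to treat "too few features", "motion not estimated", and "motion too large" as alternative triggers are all assumptions made for illustration; the paper does not give concrete thresholds.

```python
from dataclasses import dataclass


@dataclass
class SwitchThresholds:
    # Hypothetical values; the paper only says "a certain threshold".
    min_features: int = 50
    max_rotation_rad: float = 0.5
    max_translation_m: float = 0.5


def use_imu_fusion(num_features, motion_estimated, rotation_rad, translation_m, th):
    """Return True when the pose should come from the IMU-fused estimator."""
    # Vision is considered unreliable if features are scarce or no motion was recovered.
    visual_failed = num_features < th.min_features or not motion_estimated
    # An implausibly large inter-frame motion also triggers the fallback.
    motion_too_large = motion_estimated and (
        rotation_rad > th.max_rotation_rad or translation_m > th.max_translation_m
    )
    return visual_failed or motion_too_large
```

In a tracking loop this predicate would be evaluated once per frame, after feature extraction and the purely visual pose estimate.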
The ORB (Oriented FAST and Rotated BRIEF) algorithm was proposed by Ethan Rublee et al. in 2011 [3]. The ORB feature combines the FAST keypoint detector with the BRIEF descriptor, adding orientation and scale invariance to FAST. Features are described using binary BRIEF descriptors, and feature matching compares the descriptors of pairs of feature points. Given pixel matches between the left and right camera images, the binocular camera can directly obtain the 3D position corresponding to each pixel. Therefore, a stereo camera-based SLAM system can use known 3D points and their matched projections in the current frame to obtain the current camera pose, without solving for camera motion using epipolar geometry [4].
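Binary descriptors are compared by Hamming distance, that is, the number of differing bits. The sketch below is a brute-force nearest-neighbour matcher over descriptors stored as `bytes` (an ORB descriptor is typically 256 bits, or 32 bytes). The function names and the `max_distance` cut-off are hypothetical, and a real system would normally rely on an optimized library matcher instead.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def match_descriptors(query, train, max_distance=50):
    """For each query descriptor, find the closest train descriptor.

    Returns a list of (query_index, train_index, distance) tuples, keeping only
    matches whose distance does not exceed max_distance.
    """
    matches = []
    if not train:
        return matches
    for qi, q in enumerate(query):
        best_j, best_d = min(
            ((j, hamming(q, t)) for j, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        if best_d <= max_distance:
            matches.append((qi, best_j, best_d))
    return matches
```

The same routine applies both to left-right stereo matching and to matching between consecutive frames; only the descriptor sets change.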
This paper first uses the EPnP method [5] to solve for the camera pose. EPnP makes more effective use of the matching point information and iteratively optimizes the camera pose. Given the coordinates $$\{{P}_{i}^{w},i=1,2,\dots ,n\}$$ of n space points in the world coordinate system and their corresponding coordinates $$\{{P}_{i}^{c},i=1,2,\dots ,n\}$$ in the image coordinate system, EPnP solves for the rotation matrix R and translation vector t of the camera motion. Four non-coplanar virtual control points are set in the world coordinate system, with homogeneous coordinates $$\{{C}_{i}^{w}|i=1,2,3,4\}$$. The relationship between the world coordinates of the space points and the control points is as follows:
Once the virtual control points are chosen, and provided the four control points are not coplanar, the weights {α_{ij}, j = 1, …, 4} are uniquely determined. In the camera coordinate system, the same weighted-sum relationship holds:
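The weights α_{ij} can be computed by solving a small linear system. The sketch below assumes the usual EPnP formulation, in which each point is a weighted sum of the four control points and the weights of each point sum to one; the function name is hypothetical and NumPy is assumed to be available.

```python
import numpy as np


def barycentric_weights(points_w, controls_w):
    """Weights alpha (n x 4) such that points_w[i] = sum_j alpha[i, j] * controls_w[j]
    and each row of alpha sums to 1.

    controls_w must contain four non-coplanar points, otherwise the 4x4 system
    is singular and numpy raises LinAlgError.
    """
    points_w = np.asarray(points_w, dtype=float)      # shape (n, 3)
    controls_w = np.asarray(controls_w, dtype=float)  # shape (4, 3)
    # Homogeneous form: stack a row of ones to enforce the sum-to-one constraint.
    C = np.vstack([controls_w.T, np.ones(4)])              # 4 x 4
    P = np.vstack([points_w.T, np.ones(len(points_w))])    # 4 x n
    return np.linalg.solve(C, P).T
```

Because the weights are invariant under rigid transformations, the values computed in the world frame can be reused unchanged in the camera frame.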
Substituting equation (13) into the camera model gives:
The image coordinates u_{i}, v_{i} in Equation (13) are known, so:
To recover the camera-frame coordinates corresponding to the 2D points, α_{ij} in the camera coordinate system is assumed to be identical to α_{ij} in the world coordinate system, so the problem reduces to finding the rotation and translation of the four control points. This leads to the linear system:
Among them, M is a 2n×12 matrix, and $$\mathbf{X}=[{C}_{1}^{cT},{C}_{2}^{cT},{C}_{3}^{cT},{C}_{4}^{cT}]$$ is a vector composed of 12 unknowns to be solved.
v_{i} is a right singular vector of M whose corresponding singular value is 0. It is obtained by computing the eigenvalues and eigenvectors of M^{T}M: the eigenvectors with eigenvalue 0 are the v_{i}. N is the dimension of the null space of M^{T}M, and β_{i} are the coefficients to be determined.
Depending on the position of the reference points, the dimension of the null space of M^{T}M may take the values 1, 2, 3, or 4. Because the distances between the control points are the same in the world coordinate system and in the camera coordinate system, six constraints can be obtained, from which the undetermined coefficients can be solved.
When N = 1, according to the constraints:
and so:
When N = 2:
Since β_{1} and β_{2} only appear in the equation as quadratic terms, let $$\mathbf{\beta}={[{\beta}_{1}^{2},{\beta}_{1}{\beta}_{2},{\beta}_{2}^{2}]}^{T}$$, and use the vector ρ to represent all $$\parallel {C}_{i}^{w}-{C}_{j}^{w}{\parallel}^{2}$$, thus obtaining the equation:
Where L is a 6×3 matrix composed of v_{1} and v_{2}.
When N = 3, L is a 6×6 matrix.
In summary, the coordinate solution of the reference point in the camera coordinate system can be obtained as the initial value of the optimization, the optimization variable is β = [β_{1}, β_{2}, …, β_{N}]^{T}, and the objective function is:
The β corresponding to the dimension with the smallest error is optimized to obtain the vector X, from which the coordinates of the control points in the camera coordinate system are recovered. The coordinates of the reference points in the camera coordinate system are then obtained from the barycentric coordinate coefficients. Finally, from the coordinates of the point set in the two coordinate systems, the pose transformation between the two coordinate systems is obtained. The solution steps are as follows:
a) Find the center point:
b) Subtract the center point from each point (de-centering):
c) Calculate the H matrix:
d) SVD decomposition of H matrix:
e) Calculate the rotation R:
If det(R) < 0, then R(2,·) = −R(2,·).
f) Calculate displacement t:
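Steps a) to f) amount to the classical SVD-based alignment of two 3D point sets. The sketch below follows those steps with NumPy under the convention p^c = R·p^w + t. The function name is hypothetical, and the reflection case is handled with the common diag(1, 1, det) correction, which is a swapped-in variant of the row sign flip described above.

```python
import numpy as np


def align_point_sets(points_w, points_c):
    """Estimate R, t such that points_c[i] ~= R @ points_w[i] + t."""
    pw = np.asarray(points_w, dtype=float)  # shape (n, 3), world frame
    pc = np.asarray(points_c, dtype=float)  # shape (n, 3), camera frame
    # a) center points (centroids) of both sets
    cw, cc = pw.mean(axis=0), pc.mean(axis=0)
    # b) de-centered coordinates
    qw, qc = pw - cw, pc - cc
    # c) 3x3 matrix H = sum_i qc_i * qw_i^T
    H = qc.T @ qw
    # d) singular value decomposition of H
    U, _, Vt = np.linalg.svd(H)
    # e) rotation, forcing det(R) = +1 so that R is not a reflection
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    # f) translation from the centroids
    t = cc - R @ cw
    return R, t
```

At least three non-collinear points are needed for the rotation to be unique; with the four control points of EPnP this condition holds whenever they are non-coplanar.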
Taking the EPnP solution as the initial value, the g2o library is used to optimize the camera pose nonlinearly. A least-squares problem is constructed to find the best camera pose:
Among them, u_{i} is the pixel coordinate of the projected point, K is the camera intrinsic matrix, ξ is the camera pose, and P_{i} is the coordinate of the space point.
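The quantity minimized in this least-squares problem is the reprojection error. The sketch below evaluates it for a pose given as a rotation matrix R and translation t rather than the Lie-algebra vector ξ, which is a simplification for illustration. The function name is hypothetical, and the 1/2 factor follows the usual least-squares convention.

```python
import numpy as np


def reprojection_cost(K, R, t, points_w, pixels):
    """Half the sum of squared pixel residuals u_i - project(K, R, t, P_i)."""
    K = np.asarray(K, dtype=float)
    points_w = np.asarray(points_w, dtype=float)  # shape (n, 3)
    pixels = np.asarray(pixels, dtype=float)      # shape (n, 2)
    points_c = points_w @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)
    projected = points_c @ K.T                    # homogeneous pixel coordinates
    uv = projected[:, :2] / projected[:, 2:3]     # divide by depth
    residuals = pixels - uv
    return 0.5 * float(np.sum(residuals ** 2))
```

An optimizer such as the one in g2o repeatedly evaluates residuals of this kind and their Jacobians while updating the pose.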
The measurement frequency of the IMU is usually higher than the frequency at which the camera captures images. For example, the binocular camera used in this article has a frame rate of up to 60 FPS and an IMU frequency of up to 500 Hz. Because of this difference, there are multiple IMU measurements between two consecutive frames. Therefore, to fuse the information of the two sensors, the IMU data must be pre-integrated [6]. That is, only the IMU information between the two image moments is integrated to obtain the relative pose, and the integration result is saved for the later joint optimization.
The IMU-based camera pose estimation method involves three coordinate systems: the world coordinate system, the IMU coordinate system, and the camera coordinate system. All poses and feature point coordinates are ultimately expressed in the world coordinate system. During the calculation, state quantities in the camera coordinate system are first converted to the IMU coordinate system and then to the world coordinate system.
In this article, the letter W denotes the world coordinate system, the letter B denotes the IMU coordinate system, R_{WB} denotes the rotation matrix from the IMU coordinate system to the world coordinate system, and p_{WB} denotes the translation vector from the IMU coordinate system to the world coordinate system.
The acceleration and angular velocity of the IMU are:
Among them, b^{a}(t) and b^{g}(t) represent the bias of the accelerometer and gyroscope respectively, η^{a}(t) and η^{g}(t) represent the noise of the accelerometer and gyroscope respectively, and _{W}g represents the gravity vector in the world coordinate system.
The derivatives of rotation, velocity, and translation are expressed as:
The rotation, speed and translation in the world coordinate system can be obtained by the general integral formula:
Use Equation (32) in discrete time for Euler integration:
The IMU model is obtained from equations (30) and (33):
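One discrete Euler step of this model can be sketched as follows. The sketch assumes the widely used formulation in which the gyroscope reading minus its bias drives the rotation through the SO(3) exponential map, and the accelerometer reading minus its bias is rotated into the world frame and combined with gravity. Noise terms are omitted and all function names are hypothetical.

```python
import numpy as np


def skew(w):
    """Skew-symmetric matrix such that skew(w) @ x equals the cross product w x x."""
    return np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])


def so3_exp(phi):
    """Rodrigues formula: rotation matrix for the rotation vector phi."""
    theta = np.linalg.norm(phi)
    S = skew(phi)
    if theta < 1e-10:
        return np.eye(3) + S  # first-order approximation for tiny angles
    return np.eye(3) + np.sin(theta) / theta * S + (1.0 - np.cos(theta)) / theta ** 2 * (S @ S)


def euler_step(R, v, p, gyro, accel, b_g, b_a, g_w, dt):
    """Propagate rotation R_WB, velocity v and position p over one IMU sample."""
    a_w = R @ (accel - b_a) + g_w               # acceleration in the world frame
    R_next = R @ so3_exp((gyro - b_g) * dt)
    v_next = v + a_w * dt
    p_next = p + v * dt + 0.5 * a_w * dt ** 2
    return R_next, v_next, p_next
```

For a stationary, level IMU the accelerometer measures the reaction to gravity, so the world-frame acceleration cancels to zero and the state does not drift.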
Suppose there are two image frames with time t_{i} and t_{j}, t_{j} > t_{i}. Therefore, the IMU’s pre-integration observation model is:
Among them, A, B, and C are the pre-integrated rotation noise term, the pre-integrated velocity noise term, and the pre-integrated translation noise term, respectively.
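The noise-free part of the pre-integration can be sketched as a running accumulation over the IMU samples between two frames. The sketch assumes the standard on-manifold formulation, with increments expressed in the IMU frame at time t_i, constant biases over the interval, and a fixed sample period. Function names are hypothetical and noise propagation is omitted.

```python
import numpy as np


def so3_exp(phi):
    """Rodrigues formula: rotation matrix for the rotation vector phi."""
    theta = np.linalg.norm(phi)
    S = np.array([[0.0, -phi[2], phi[1]], [phi[2], 0.0, -phi[0]], [-phi[1], phi[0], 0.0]])
    if theta < 1e-10:
        return np.eye(3) + S
    return np.eye(3) + np.sin(theta) / theta * S + (1.0 - np.cos(theta)) / theta ** 2 * (S @ S)


def preintegrate(gyro_samples, accel_samples, dt, b_g, b_a):
    """Accumulate relative rotation dR, velocity dv and position dp between two frames.

    The increments are independent of the absolute pose and of gravity, so they
    can be stored once and reused during joint optimization.
    """
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro_samples, accel_samples):
        a_corr = dR @ (np.asarray(a, dtype=float) - b_a)  # bias-corrected, in frame i
        dp = dp + dv * dt + 0.5 * a_corr * dt ** 2
        dv = dv + a_corr * dt
        dR = dR @ so3_exp((np.asarray(w, dtype=float) - b_g) * dt)
    return dR, dv, dp
```

With the rates quoted above (500 Hz IMU, 60 FPS camera), roughly eight IMU samples would be accumulated per image interval.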
For the pose between two adjacent frames, this paper still uses a nonlinear optimization method to fuse IMU information and visual information. Among them, the state quantities that need to be optimized are:
In equation (36), $${R}_{WB}^{j}$$, $${v}_{WB}^{j}$$, and $${p}_{WB}^{j}$$ are the rotation, velocity, and translation of the IMU coordinate system relative to the world coordinate system at time j; the remaining state quantities are the random-walk biases of the gyroscope and accelerometer at time j.
Therefore, the optimal state quantity θ is solved by optimizing the visual reprojection error and the IMU measurement error:
The host computer of the experimental platform is a laptop running Ubuntu 16.04 with 8 GB of memory and an Intel Core i5-8250U processor with a base frequency of 1.6 GHz. The robot platform is a Dashgo D1 mobile robot platform that supports the ROS development system. Its overall size is Φ406×210 and the diameter of the driving wheel is 125 mm. The binocular camera sensor is the MYNT EYE D1000-IR-120/Color.
The experiments in this paper are mainly aimed at the positioning accuracy of the robot. The evaluation index is the RMSE (root-mean-square-error) of the robot position:
Where $${\hat{p}}_{i}$$ is the estimated robot position and p_{i} is the actual robot position.
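The evaluation index can be computed as below. The sketch assumes the usual definition of position RMSE, the square root of the mean squared Euclidean distance between estimated and actual positions; the function name is hypothetical.

```python
import math


def position_rmse(estimated, actual):
    """Root-mean-square error between two equally long sequences of positions."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("need two non-empty position sequences of equal length")
    squared_error = sum(
        sum((e - a) ** 2 for e, a in zip(p_est, p_act))
        for p_est, p_act in zip(estimated, actual)
    )
    return math.sqrt(squared_error / len(estimated))
```

Positions may be 2D or 3D tuples, as long as both sequences use the same dimension and the same unit.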
In this paper, robot positioning experiments are performed in a corridor environment with few distinctive features and in a feature-rich indoor environment. In the corridor environment, a mobile robot carrying the experimental equipment travels 10 m at a constant speed in the positive direction of the camera, and the positioning accuracy of pure vision and of vision fused with the IMU are recorded separately. In the feature-rich indoor environment, a straight-line experiment is also performed: the mobile robot moves forward 5 m at a constant speed in the positive direction of the camera, at 2.5 times the speed of the previous experiment. Each experiment is repeated several times and the results are recorded.
The experimental results show that the positioning error of stereo vision fused with the IMU is smaller than that of pure vision, which indicates that visual positioning with IMU fusion is more accurate than vision-only positioning in low-texture environments and under fast robot motion.
This paper studies robot positioning technology and proposes a robot positioning method based on binocular vision fused with an IMU. Compared with the pure vision robot localization method, the proposed method is more robust in low-texture environments and under fast robot motion. The experimental results show that the visual positioning method integrated with the IMU overcomes the defects of pure visual positioning to a certain extent and improves the positioning accuracy of the robot.