
1 Introduction

The combination of vision and inertial sensors has long been a popular research field for three-dimensional structure recovery, ego-motion estimation and visual odometry. Both the monocular camera and the Inertial Measurement Unit (IMU) are low-cost, lightweight and complementary. A moving camera can provide accurate state estimation and rich 3D structure of the environment, but only up to an unknown metric scale, while high-frame-rate inertial sensors help handle fast camera motion, resolve the scale ambiguity and support short-term motion estimation.

Many visual-inertial fusion strategies have been proposed; they can be divided into loosely coupled and tightly coupled approaches. The loosely coupled strategy estimates the pose from each sensor separately and then fuses the results, whereas the tightly coupled strategy jointly optimizes all sensor states. Most recent works concentrate on tightly coupled visual-inertial odometry, using keyframe-based non-linear optimization [1,2,3,4] or filtering [5,6,7,8]. Tightly coupled non-linear optimization methods have attracted much interest in recent years due to their good trade-off between accuracy and computational efficiency. This article follows this trend and focuses on the unknown-scale problem of monocular systems.

Visual scale estimation is a research hotspot in monocular SLAM. The early MonoSLAM [11] initializes from a target of known size, which helps assign a precise scale to the estimated map. Filter-based methods include ROVIO [12], MSCKF [5] and [13, 14], where the scale information is added to the extended Kalman filter as an additional state variable. The work in [15] proposes a maximum-likelihood estimator for the scale of a monocular SLAM system. In [16] and in visual-inertial ORB-SLAM [9], the scale is estimated within an optimization process using methods such as Gauss-Newton. While promising, using all visual and inertial measurements for scale estimation means that outliers may be included, which degrades the accuracy of the scale estimate. Besides, the method introduced in [9] lacks a robust termination criterion for IMU initialization, which increases computation and reduces the benefit of the IMU information.

In this paper, we address the above problems of [9]. The main contribution of our work is two-fold. Firstly, we present an approach that estimates the scale, gravity and accelerometer bias together, and uses the estimated gravity as an indicator for identifying convergence and terminating the scale estimation procedure. Secondly, we propose a keyframe-based method that uses a weighting term to reduce the influence of large residuals, which refines the scale estimate.

The remainder of this article is organized as follows. In Sect. 2 we explain the camera model, the IMU noise model and the IMU kinematics model, and give a brief introduction to the IMU pre-integration technique. In Sect. 3 we describe our approach as a whole; in particular, we introduce the method for scale estimation and refinement, and propose an automatic termination criterion. Section 4 presents the performance of our approaches and compares them with the ground truth. We conclude the paper in Sect. 5.

2 Preliminaries

In this section, we first introduce some notation used throughout this paper: the matrix \( T_{EF} = \left[ {R_{EF} \mid {}_{E}P_{F} } \right] \) represents the transformation from reference frame \( F \) to reference frame \( E \).

Then we introduce some preliminary knowledge about the coordinate systems, the camera model, the inertial sensor model, and IMU pre-integration. Figure 1 shows the camera-IMU setup with its corresponding coordinate frames. The multiple camera-IMU units represent consecutive states at successive times, which helps in following the equations of Sect. 3.1. The camera provides the pose and the unscaled position in the camera frame \( C \). We denote the world reference frame by \( W \) and the IMU body frame by \( B \). The transformation \( T_{CB} = \left[ {R_{CB} \mid {}_{C}P_{B} } \right] \) between the camera and IMU reference systems can be calibrated using Kalibr [17].

Fig. 1. The relationship between different coordinate frames and multiple states of camera-IMU

2.1 Camera Model

Here we consider a conventional pinhole-camera model [22], in which any 3D point \( X_{C} \in {\mathbb{R}}^{3} \) in the camera reference frame is mapped to image coordinates \( x \in {\mathbb{R}}^{2} \) through the camera projection function \( \pi :{\mathbb{R}}^{3} \mapsto {\mathbb{R}}^{2} \):

$$ \pi (X_{C} ) = \begin{bmatrix} f_{u} \dfrac{x_{C} }{z_{C} } + c_{u} \\ f_{v} \dfrac{y_{C} }{z_{C} } + c_{v} \end{bmatrix}, \qquad X_{C} = \begin{bmatrix} x_{C} & y_{C} & z_{C} \end{bmatrix}^{T} $$
(1)

where \( \left[ {f_{u} \;\; f_{v} } \right]^{T} \) is the focal length and \( \left[ {c_{u} \;\; c_{v} } \right]^{T} \) is the principal point.
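As a minimal illustration, the following Python sketch implements the projection function (1); the intrinsic values are illustrative placeholders rather than calibrated parameters.

```python
import numpy as np

# Illustrative intrinsics (placeholders, not calibrated values).
FU, FV = 458.0, 457.0   # focal length [px]
CU, CV = 367.0, 248.0   # principal point [px]

def project(X_c):
    """Pinhole projection of Eq. (1): map X_c = [x_C, y_C, z_C] in the camera
    frame to pixel coordinates."""
    x, y, z = X_c
    return np.array([FU * x / z + CU,
                     FV * y / z + CV])

# Example: a point 2 m in front of the camera, slightly off the optical axis.
print(project(np.array([0.10, -0.05, 2.0])))   # -> [389.9, 236.575]
```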

2.2 Inertial Sensor Model and IMU Kinematics Model

An IMU generally integrates a 3-axis gyroscope and a 3-axis accelerometer; the measurements provide the angular velocity and the acceleration of the sensor at a high frame rate, expressed in the body frame \( B \). The IMU measurement model contains two kinds of noise: an additive white noise \( n(t) \) and a slowly varying sensor bias \( b(t) \) that follows a random walk, so we have:

$$ {}_{B}\widetilde{\omega }(t) = {}_{B}\omega (t) + b_{g} (t) + n_{g} (t) $$
(2)
$$ {}_{B}\widetilde{a}(t) = R^{T}_{WB} (t)({}_{W}a(t) - {}_{W}g) + b{}_{a}(t) + n_{a} (t) $$
(3)

where \( {}_{B}\widetilde{\omega }(t) \) and \( {}_{B}\widetilde{a}(t) \) are the measured values expressed in the body frame, and the true angular velocity \( {}_{B}\omega (t) \) and the true acceleration \( {}_{W}a(t) \) are the quantities we wish to recover. The left subscript \( W \) denotes the world frame, and \( R_{WB} \) is the rotational part of the transformation \( \left\{ {R_{WB} ,\;{}_{W}p} \right\} \), which maps a point from the sensor frame \( B \) to \( W \). The dynamics of the non-static biases \( b(t) \) are modeled as a random process:

$$ \dot{b}_{g} = n_{{b_{g} }} \,\,\,\dot{b}_{a} = n_{{b_{a} }} $$
(4)

where \( n_{{b_{g} }} \) and \( n_{{b_{a} }} \) are zero-mean Gaussian white noise processes. Our goal is to deduce the motion of the system from the IMU output. For this purpose, we use the following IMU kinematics model [11]:

$$ {}_{W}\dot{R}_{WB} = R_{WB} \;{}_{B}\omega^{ \wedge } ,\,{}_{W}\dot{\nu } = {}_{W}a,\,{}_{W}\dot{p} = {}_{W}\nu $$
(5)

2.3 IMU Pre-integration

The IMU pre-integration technique, incorporated into a SLAM framework, was proposed in [18]. Here we give an overview of its theory and its use within a monocular visual-inertial SLAM system. The pose and velocity of the IMU at time \( t + \Delta t \) are obtained by integrating Eq. (5):

$$ R_{WB} (t + \Delta t) = R_{WB} (t)Exp({}_{B}\omega (t)\Delta t) $$
(6)
$$ {}_{W}\nu (t + \Delta t) = {}_{W}\nu (t) + {}_{W}a(t)\Delta t $$
(7)
$$ {}_{W}p(t + \Delta t) = {}_{W}p(t) + {}_{W}\nu (t)\Delta t + \frac{1}{2}{}_{W}a(t)\Delta t^{2} $$
(8)

which assumes that \( {}_{W}a \) and \( {}_{B}\omega \) remain constant over the time interval \( \left[ {t,t + \Delta t} \right] \). Equations (6)–(8) become functions of the IMU measurements using Eqs. (2)–(3):

$$ R(t + \Delta t) = R(t)Exp((\tilde{\omega }(t) - b{}_{g}(t) - n_{g} (t))\Delta t) $$
(9)
$$ \nu (t + \Delta t) = \nu (t) + g\Delta t + R(t)(\tilde{a}(t) - b_{a} (t) - n_{a} (t))\Delta t $$
(10)
$$ p(t + \Delta t) = p(t) + \nu (t)\Delta t + \frac{1}{2}g\Delta t^{2} + \frac{1}{2}R(t)(\tilde{a}(t) - b_{a} (t) - n_{a} (t))\Delta t^{2} $$
(11)

Here the coordinate frame subscripts are dropped for readability. In Eqs. (6)–(11), \( \Delta t \) is the sampling interval of the IMU. We assume that the IMU is synchronized with the camera and provides measurements at discrete times \( k \). Integrating all IMU measurements between two consecutive keyframes at times \( k = i \) and \( k = j \), the pre-integrated terms \( \Delta R_{ij} \), \( \Delta v_{ij} \) and \( \Delta p_{ij} \) are expressed as:

$$ \Delta R_{ij} \;\dot{ = }\;R_{i}^{T} R_{j} \; = \;\prod\limits_{k = i}^{j - 1} {Exp((\tilde{\omega }_{k} - b{}_{gk} - n_{gk} )\Delta t)} $$
(12)
$$ \Delta \nu_{ij} \;\dot{ = }\;R_{i}^{T} (\nu_{j} - \nu_{i} - g\Delta t_{ij} )\; = \;\sum\limits_{k = i}^{j - 1} {\Delta R_{ik} (\tilde{a}_{k} - b_{ak} - n_{ak} )\Delta t} $$
(13)
$$ \begin{aligned} & \Delta p_{ij} \;\dot{ = }\;R_{i}^{T} (p_{j} - p_{i} - \nu_{i} \Delta t_{ij} - \frac{1}{2}g\Delta t_{ij}^{2} ) \\ & = \;\sum\limits_{k = i}^{j - 1} {\left[ {\Delta \nu_{ik} \Delta t + \frac{1}{2}\Delta R_{ik} (\tilde{a}_{k} - b_{ak} - n_{ak} )\Delta t^{2} } \right]} \\ \end{aligned} $$
(14)
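As a minimal illustration of Eqs. (12)–(14), the Python sketch below accumulates the pre-integrated terms from raw gyroscope and accelerometer samples between two keyframes, assuming fixed bias estimates and neglecting the noise terms.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(gyro, acc, dt, b_g, b_a):
    """Accumulate the pre-integrated terms dR, dv, dp of Eqs. (12)-(14) from
    raw IMU samples between two keyframes, with fixed bias estimates b_g, b_a
    and the noise terms neglected."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w_k, a_k in zip(gyro, acc):
        a_corr = a_k - b_a                                    # bias-corrected acceleration
        dp = dp + dv * dt + 0.5 * dR @ a_corr * dt**2         # Eq. (14): uses dR_ik, dv_ik
        dv = dv + dR @ a_corr * dt                            # Eq. (13)
        dR = dR @ Rotation.from_rotvec((w_k - b_g) * dt).as_matrix()  # Eq. (12)
    return dR, dv, dp
```

Note that \( \Delta p \) and \( \Delta v \) must be updated with the rotation and velocity accumulated up to sample \( k \), so the rotation update is applied last.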

3 Scale Estimation and Refinement with a Weighted Item

In this section, we first introduce the process of scale estimation based on visual-inertial ORB-SLAM [9]. Since some visual-inertial measurements between two keyframes may not be exact, we propose a weighting method for outlier handling and scale estimation refinement, inspired by [10]. Next, we present a robust termination criterion for the scale estimation procedure. Finally, we describe the scale benchmark, which is used to verify the accuracy of our estimated results.

3.1 Scale Estimation

In this section, we introduce the scale estimation method in detail; it estimates the scale \( s \), the gravity \( {}_{W}g \) and the accelerometer bias \( b_{a} \) jointly. The full state vector \( X \) is defined as:

$$ X = \left[ {s,{}_{W}g,b_{a} } \right]^{T} \in {\mathbb{R}}^{7 \times 1} $$
(15)

In a monocular SLAM system, the camera positions and 3D points are all determined only up to scale. This ambiguity can be resolved by integrating IMU data. First we consider the following equation, which introduces the visual scale \( s \) when transforming a position from the camera frame \( C \) to the IMU frame \( B \):

$$ {}_{W}p_{B} = s{}_{W}p_{C} + R_{WC} \;{}_{C}p_{B} $$
(16)

For two consecutive keyframes \( i \) and \( i + 1 \), the corresponding IMU position and velocity are obtained using the pre-integration Eqs. (13) and (14):

$$ {}_{W}p_{B}^{i + 1} = {}_{W}p_{B}^{i} + {}_{W}v_{B}^{i} \Delta t_{i,i + 1} + \frac{1}{2}{}_{W}g\Delta t_{i,i + 1}^{2} + R_{WB}^{i} (\Delta p_{i,i + 1} + J_{\Delta p}^{a} b_{a} ) $$
(17)
$$ {}_{W}v_{B}^{i + 1} = {}_{W}v_{B}^{i} + {}_{W}g\Delta t_{i,i + 1} + R_{WB}^{i} (\Delta v_{i,i + 1} + J_{\Delta v}^{a} b_{a} ) $$
(18)

where the Jacobian \( J_{( \cdot )}^{a} \) denotes a first-order approximation of the effect of changing the accelerometer bias. Substituting Eq. (16) into Eq. (17) yields:

$$ s{}_{W}p_{C}^{i + 1} = s{}_{W}p_{C}^{i} + {}_{W}v_{B}^{i} \Delta t_{i,i + 1} + \frac{1}{2}{}_{W}g\Delta t_{i,i + 1}^{2} + R_{WB}^{i} (\Delta p_{i,i + 1} + J_{\Delta p}^{a} b_{a} ) + (R_{WC}^{i} - R_{WC}^{i + 1} ){}_{C}p_{B} $$
(19)

To solve for these unknowns, we consider two relations of the form (19) over three consecutive keyframes (Fig. 1 shows an example) and exploit the velocity relation (18) to eliminate the velocities, which yields the following linear equation:

$$ \left[ {\alpha (i)\quad \beta (i)\quad \gamma (i)} \right]X = \psi (i) $$
(20)

where the visual scale \( s \), gravity \( {}_{W}g \) and accelerometer bias \( b_{a} \) are the unknown variables. Writing keyframes \( i,\,i + 1,\,i + 2 \) as 1, 2, 3 for readability, we have:

$$ \alpha (i) = ({}_{W}p_{C}^{2} - {}_{W}p_{C}^{1} )\Delta t_{23} - ({}_{W}p_{C}^{3} - {}_{W}p_{C}^{2} )\Delta t_{12} $$
(21)
$$ \beta (i) = 0.5I_{3 \times 3} (\Delta t_{12}^{2} \Delta t_{23} + \Delta t_{23}^{2} \Delta t_{12} ) $$
(22)
$$ \gamma (i) = R_{WB}^{2} J_{{\Delta p_{23} }}^{a} \Delta t_{12} + R_{WB}^{1} J_{{\Delta v_{12} }}^{a} \Delta t_{12} \Delta t_{23} - R_{WB}^{1} J_{{\Delta p_{12} }}^{a} \Delta t_{23} $$
(23)
$$ \begin{aligned} & \psi (i) = (R_{WC}^{1} - R_{WC}^{2} ){}_{C}p_{B} \Delta t_{23} - (R_{WC}^{2} - R_{WC}^{3} ){}_{C}p_{B} \Delta t_{12} - R_{WB}^{2} \Delta p_{23} \Delta t_{12} \\ & - R_{WB}^{1} \Delta v_{12} \Delta t_{12} \Delta t_{23} + R_{WB}^{1} \Delta p_{12} \Delta t_{23} \\ \end{aligned} $$
(24)

Stacking the relations (20) over every three consecutive keyframes, we obtain an overdetermined linear system, which we solve via Singular Value Decomposition (SVD) to obtain the scale \( s \), the gravity \( {}_{W}g \) and the accelerometer bias \( b_{a} \). Note that we can construct \( 3(N - 2) \) equations with 7 unknowns, where \( N \) is the number of keyframes, so at least 5 keyframes are needed.
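A minimal Python sketch of this construction is given below; the per-keyframe inputs (unscaled camera position, rotations, timestamp, and the pre-integrated terms with their bias Jacobians, stored under hypothetical field names) are assumed to be available from the tracking front-end, and np.linalg.lstsq is used as the SVD-based least-squares solver.

```python
import numpy as np

def build_row(kf1, kf2, kf3, p_CB):
    """One relation of Eq. (20) for a triple of consecutive keyframes.
    Each kf is assumed to be a dict holding the unscaled camera position
    'p_wc', the rotations 'R_wc' and 'R_wb', the timestamp 't', and the
    pre-integrated terms 'dp', 'dv' with bias Jacobians 'J_dp', 'J_dv'
    accumulated since the previous keyframe (hypothetical field names)."""
    dt12 = kf2['t'] - kf1['t']
    dt23 = kf3['t'] - kf2['t']
    alpha = ((kf2['p_wc'] - kf1['p_wc']) * dt23
             - (kf3['p_wc'] - kf2['p_wc']) * dt12)                    # Eq. (21)
    beta = 0.5 * np.eye(3) * (dt12**2 * dt23 + dt23**2 * dt12)        # Eq. (22)
    gamma = (kf2['R_wb'] @ kf3['J_dp'] * dt12
             + kf1['R_wb'] @ kf2['J_dv'] * dt12 * dt23
             - kf1['R_wb'] @ kf2['J_dp'] * dt23)                      # Eq. (23)
    psi = ((kf1['R_wc'] - kf2['R_wc']) @ p_CB * dt23
           - (kf2['R_wc'] - kf3['R_wc']) @ p_CB * dt12
           - kf2['R_wb'] @ kf3['dp'] * dt12
           - kf1['R_wb'] @ kf2['dv'] * dt12 * dt23
           + kf1['R_wb'] @ kf2['dp'] * dt23)                          # Eq. (24)
    C_i = np.hstack([alpha.reshape(3, 1), beta, gamma])               # 3 x 7 block
    return C_i, psi

def estimate_state(keyframes, p_CB):
    """Stack Eq. (20) over all consecutive keyframe triples and solve the
    overdetermined system for X = [s, g_W, b_a] in the least-squares sense."""
    blocks = [build_row(k1, k2, k3, p_CB)
              for k1, k2, k3 in zip(keyframes, keyframes[1:], keyframes[2:])]
    A = np.vstack([C for C, _ in blocks])
    b = np.concatenate([D for _, D in blocks])
    X, *_ = np.linalg.lstsq(A, b, rcond=None)
    return X  # X[0] = scale s, X[1:4] = gravity, X[4:7] = accelerometer bias
```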

Every time a new keyframe is inserted by ORB-SLAM, the procedure runs again to obtain updated estimates of the scale, gravity and accelerometer bias. When the termination criterion is satisfied, the estimation procedure terminates.

3.2 Weighting Method for Scale Estimation Refinement

The procedure of Sect. 3.1 feeds all visual-inertial measurements into the scale estimation, and these measurements may contain outliers, so we introduce a weight \( w_{i} \) to handle outliers and refine the estimate. Specifically, we use the initial estimate to weight each residual in a way similar to the Huber norm [20], and define the residual as the absolute error:

$$ r_{i} = \left| {C_{i} X_{est} - D_{i} } \right| $$
(25)

where \( X_{est} \) is the estimate obtained in Sect. 3.1, and \( C_{i} \) and \( D_{i} \) come from Eq. (20) for the i-th triple of consecutive keyframes:

$$ C_{i} = \left[ {\alpha (i)\quad \beta (i)\quad \gamma (i)} \right] $$
(26)
$$ D_{i} = \left[ {\psi (i)} \right] $$
(27)

The weight is determined by the residual:

$$ w_{i} = \begin{cases} 1 & r_{i} < threshold \\ \dfrac{threshold}{r_{i} } & \text{otherwise} \end{cases} $$
(28)

If a measurement is obviously inconsistent with the scale estimate, its \( w_{i} \) is set to zero. In our experiments, we set the threshold to 0.002. With the \( N \) keyframes in the scale estimation process, we can build an overconstrained linear system:

$$ \begin{bmatrix} w_{1} \cdot C_{1} \\ w_{2} \cdot C_{2} \\ \vdots \\ w_{N - 2} \cdot C_{N - 2} \end{bmatrix} \cdot X = \begin{bmatrix} w_{1} \cdot D_{1} \\ w_{2} \cdot D_{2} \\ \vdots \\ w_{N - 2} \cdot D_{N - 2} \end{bmatrix} $$
(29)

where \( C_{i} \) and \( D_{i} \) are given by Eqs. (26) and (27) for the i-th triple of consecutive keyframes. The refined state \( \hat{X} \) is then obtained by solving Eq. (29) via SVD.
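A minimal Python sketch of this refinement step is shown below; the residual of Eq. (25) is interpreted here as the Euclidean norm of each 3 × 1 error block, and the weighted system (29) is again solved with an SVD-based least-squares routine.

```python
import numpy as np

THRESHOLD = 0.002  # residual threshold used in Sect. 3.2

def refine_state(C_blocks, D_blocks, X_est, threshold=THRESHOLD):
    """Re-weight each keyframe-triple relation by its residual under the
    initial estimate X_est (Eqs. (25) and (28)), then re-solve the weighted
    system (29). The residual is taken as the norm of the 3x1 error block."""
    Cw, Dw = [], []
    for C_i, D_i in zip(C_blocks, D_blocks):
        r_i = np.linalg.norm(C_i @ X_est - D_i)                 # Eq. (25)
        w_i = 1.0 if r_i < threshold else threshold / r_i       # Eq. (28)
        Cw.append(w_i * C_i)
        Dw.append(w_i * D_i)
    X_hat, *_ = np.linalg.lstsq(np.vstack(Cw), np.concatenate(Dw), rcond=None)
    return X_hat
```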

3.3 Termination Criterion

In this section we propose an automatic criterion to determine when the scale estimate can be considered successful. Because the norm of the nominal gravity is a known constant of about 9.8 m/s², we regard the recovered gravity magnitude as one convergence indicator. The other is that the difference between consecutive solutions \( X \) of Sect. 3.1 stays below a certain threshold for several consecutive keyframes. The visual scale estimation terminates when both conditions are satisfied.
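A minimal Python sketch of this criterion is given below; it checks the recovered gravity magnitude against the nominal value and the stability of the estimated scale over the last few keyframes. The gravity tolerance (0.1 m/s²) and n = 5 follow Sect. 4.3, while the scale-change threshold is an illustrative value assumed for this sketch.

```python
import numpy as np

GRAVITY_NOMINAL = 9.8   # nominal gravity magnitude [m/s^2]
GRAVITY_TOL = 0.1       # tolerance on | ||g|| - 9.8 |, see Sect. 4.3
SCALE_TOL = 1e-3        # illustrative threshold on consecutive scale changes
N_STABLE = 5            # required number of consecutive stable estimates

def should_terminate(g_est, scale_history):
    """Return True when the recovered gravity magnitude is close to the
    nominal value AND the scale estimate has changed little over the last
    N_STABLE consecutive keyframes."""
    gravity_ok = abs(np.linalg.norm(g_est) - GRAVITY_NOMINAL) < GRAVITY_TOL
    if len(scale_history) < N_STABLE + 1:
        return False
    recent = scale_history[-(N_STABLE + 1):]
    scale_ok = all(abs(a - b) < SCALE_TOL for a, b in zip(recent, recent[1:]))
    return gravity_ok and scale_ok
```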

3.4 Scale Benchmark

In a monocular SLAM system, the translation decomposed from the essential matrix is ambiguous up to an unknown scale. To obtain a globally consistent scale factor, the visual-inertial ORB-SLAM system initializes the mean depth of all feature points to one. In other words, the true visual scale is determined at the start of the system initialization. Because the selection of the first two keyframes and the generation of map points are random during ORB-SLAM initialization, the initial scale is not fixed. For this reason, we need to compute the actual scale from the ground truth data, which is captured by a Leica MS50 and a motion capture system and provides the accurate 6D pose in the IMU body reference frame \( B \).

Once the ORB-SLAM initialization completes, it outputs an initial translation \( t \) between the first two keyframes. Meanwhile, we can compute the actual translation \( {}_{{C_{1} }}p_{{C_{2} }} \) from the corresponding ground truth states. The actual scale \( s \) is then computed by the following formula:

$$ {}_{{B_{1} }}p_{{B_{2} }} = R_{{B_{1} C_{2} }} \;{}_{{C_{2} }}p_{{B_{2} }} + R_{{B_{1} C_{1} }} \;{}_{{C_{1} }}p_{{C_{2} }} + {}_{{B_{1} }}p_{{C_{1} }} $$
(30)
$$ s = {}_{{C_{1} }}p_{{C_{2} }} / t $$
(31)

where \( {}_{{B_{1} }}p_{{B_{2} }} \) is the position of \( B_{2} \) in the body frame \( B_{1} \); \( B_{1} \) and \( B_{2} \) are the IMU frames corresponding to the camera frames \( C_{1} \) and \( C_{2} \) at the same timestamps (see Fig. 1). \( R_{{B_{1} C_{2} }} = R_{{B_{1} B_{2} }} R_{BC} \) is obtained from the relative orientation \( R_{{B_{1} B_{2} }} \) given by the ground truth data and the calibration \( R_{BC} \).
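A minimal Python sketch of the benchmark computation is shown below; rather than rearranging Eq. (30) symbolically, it composes the ground-truth camera positions from the body poses and the camera-IMU extrinsics, which yields the same relative camera translation, and the division in Eq. (31) is taken as the ratio of the translation norms.

```python
import numpy as np

def scale_benchmark(R_wb1, p_wb1, R_wb2, p_wb2, R_bc, p_bc, t_slam):
    """Scale benchmark of Sect. 3.4. (R_wbk, p_wbk) are the ground-truth body
    poses at the timestamps of the first two keyframes, (R_bc, p_bc) is the
    camera-IMU extrinsic calibration (p_bc = camera position in the body
    frame), and t_slam is the unscaled translation between the first two
    keyframes reported by ORB-SLAM."""
    p_wc1 = p_wb1 + R_wb1 @ p_bc                   # ground-truth position of C1 in W
    p_wc2 = p_wb2 + R_wb2 @ p_bc                   # ground-truth position of C2 in W
    p_c1c2 = (R_wb1 @ R_bc).T @ (p_wc2 - p_wc1)    # _{C1}p_{C2}
    return np.linalg.norm(p_c1c2) / np.linalg.norm(t_slam)   # Eq. (31) as norm ratio
```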

4 Experimental Results

We conducted several experiments using the sequences \( V1\_01\_easy \) and \( V2\_01\_easy \) from the EuRoC dataset [21] to analyze the performance of our approach. The dataset provides synchronized global-shutter stereo images at 20 Hz, IMU measurements at 200 Hz and trajectory ground truth. We ran the experiments in a virtual machine with 2 GB of RAM.

4.1 Scale Estimation Results

The scale estimation procedure runs every time a new keyframe is inserted by ORB-SLAM [19]. Figure 2 shows the estimated scale, gravity and accelerometer bias. All variables converge to stable values after about 11 s. Figure 2(a) shows that the converged scale (~2.25972) is quite close to the ground truth scale (2.28132), i.e. the scale benchmark computed by the method introduced in Sect. 3.4. Figure 2(b) indicates that the 3-axis accelerometer biases converge to almost 0. Figure 2(c) shows that the x- and z-axis components of gravity converge quickly, while the y-axis component converges to 9.256973 m/s² (near the nominal gravity value); hence the gravity direction is close to the y-axis. Figure 2(d) also shows the evolution of the estimated gravity magnitude (depicted in blue) against the nominal gravity value of 9.802 m/s² (in green); the two curves come close after about 11 s.

Fig. 2. The convergence of the scale, accelerometer biases and gravity in the sequence \( V1\_01\_easy \). (Color figure online)

Once a stable and accurate scale has been estimated, all 3D points in the map and the positions of the keyframes are updated according to the estimated scale. Figure 3(b) shows the final reconstructed sparse map; a processed image from \( V1\_01\_easy \) is also shown.
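As a minimal sketch, assuming the map points and keyframe positions are stored as arrays of up-to-scale coordinates, this update simply multiplies them by the converged scale factor:

```python
import numpy as np

def apply_scale(map_points, keyframe_positions, s):
    """Multiply all up-to-scale 3D map points and keyframe positions
    (assumed (N, 3) arrays) by the converged scale estimate s."""
    return np.asarray(map_points) * s, np.asarray(keyframe_positions) * s
```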

Fig. 3. A processed image and the reconstruction from sequence \( V1\_01\_easy \)

4.2 The Performance of Weighted Method for Scale Estimation Refinement

We evaluated the accuracy of the proposed scale estimation and refinement by comparing them with the scale benchmark computed by the method in Sect. 3.4. Tables 1 and 2 list the results of five tests for the sequences \( V1\_01\_easy \) and \( V2\_01\_easy \), respectively. The second column gives the scale estimate \( scale \), which is almost the same as the estimated scale \( s \) of [9]; the column \( w\_scale \) gives the result of the estimation refinement introduced in Sect. 3.2. The scale benchmark is shown in the last column. The results indicate that our scale estimation refinement method improves the accuracy of the estimated scale.

Table 1. The results of scale estimation and refinement, compared with scale benchmark for \( V1\_01\_easy \)
Table 2. The results of scale estimation and refinement, compared with scale benchmark for \( V2\_01\_easy \)

4.3 The Effect of Termination Criterion

Here we test our automatic criterion for deciding when the scale estimation is successful. In the sequence \( V1\_01\_easy \), the norm of the recovered gravity \( {}_{W}g \) gradually approaches the nominal gravity value of ~9.8 m/s²; after 11 s the difference falls below the threshold (0.1 m/s²). The other condition is satisfied after the estimated scales stay close to each other for \( n = 5 \) consecutive keyframes. Both conditions are satisfied after the procedure has run for about 11 s, as depicted in Fig. 2(a) and (d), and the scale estimation is considered converged at that moment. By contrast, the convergence time reported in [16] is 30 s, and its termination criterion is not specified.

5 Conclusions

In this paper, we have presented our approaches for visual scale estimation and refinement. Firstly, we presented an approach for the joint estimation of scale, gravity and accelerometer bias. Secondly, we proposed a weighting method for refining the monocular visual scale estimate, which uses weights \( w \) derived from a robust norm to handle outliers. Thirdly, we proposed an automatic way to identify convergence and terminate the scale estimation procedure. We experimentally showed on the EuRoC dataset [21] that the scale estimation is accurate, that the proposed weighting method further improves the scale accuracy of the monocular visual map, and that the termination criterion performs well.