Multimedia Material

A supplemental video for this work is available at https://youtu.be/Qrnpj2FD1e4.

1 Introduction

Event cameras, such as the Dynamic Vision Sensor (DVS) [1], are novel devices that output pixel-wise intensity changes (called “events”) asynchronously, at the time they occur. As opposed to standard cameras, they do not acquire entire image frames, nor do they operate at a fixed frame rate. This asynchronous and differential principle of operation drastically reduces power and bandwidth requirements. Endowed with microsecond temporal resolution, event cameras are able to capture high-speed motions, which would typically cause severe motion blur on standard cameras. In addition, event cameras have a very High Dynamic Range (HDR) (e.g., 140 dB compared to 60 dB of most standard cameras), which allows them to operate over a broad range of illumination conditions. Hence, event cameras open the door to tackling challenging scenarios that are inaccessible to standard cameras, such as high-speed and/or HDR tracking [2,3,4,5,6,7,8], control [9, 10] and Simultaneous Localization and Mapping (SLAM) [11,12,13,14,15,16].

Because existing computer vision algorithms designed for standard cameras do not directly apply to event cameras, the main challenge in visual processing with these novel sensors is to devise specialized algorithms that exploit the temporally asynchronous and spatially sparse nature of the data to unlock the sensors' potential. Some preliminary works addressed this issue by combining event cameras with additional sensors, such as standard cameras [8, 17, 18] or depth sensors [17, 19], to simplify the estimation task at hand. Although this approach achieved some success, it did not fully exploit the capabilities of event cameras, since such combined systems were limited by the lower dynamic range and slower response of the additional sensors. In this work, we tackle the problem of stereo 3D reconstruction for visual odometry (VO) or SLAM using event cameras alone. Our goal is to unlock the potential of event cameras by developing a method based on their working principle and using only events.

1.1 Related Work on Event-Based Depth Estimation

The majority of works on depth estimation with event cameras target the problem of “instantaneous” stereo, i.e., 3D reconstruction using events from a pair of synchronized cameras in stereo configuration (i.e., with a fixed baseline), during a very short time (ideally, on a per-event basis). Some of these works [20,21,22] follow the classical paradigm of solving stereo in two steps: epipolar matching followed by 3D point triangulation. Temporal coherence (e.g., simultaneity) of events across both left and right cameras is used to find matching events, and then standard triangulation [23] recovers depth. Other works, such as [24, 25], extend cooperative stereo [26] to the case of event cameras. These methods are typically demonstrated in scenes with static cameras and few moving objects, so that event matches are easy to find due to uncluttered event data.

Some works [27, 28] also target the problem of instantaneous stereo (depth maps produced using events over very short time intervals), but they use two non-simultaneous event cameras. These methods exploit a constrained hardware setup (two rotating event cameras with known motion) to either (i) recover intensity images on which conventional stereo is applied [27] or (ii) match events across cameras using temporal metrics and then use triangulation [28].

Recently, depth estimation with a single event camera has been shown in [11,12,13,14, 29]. These methods recover a semi-dense 3D reconstruction of the scene by integrating information from the events of a moving camera over a longer time interval, and therefore, require information of the relative pose between the camera and the scene. Hence, these methods do not target the problem of instantaneous depth estimation but rather the problem of depth estimation for VO or SLAM.

Contribution. This paper is, to the authors’ best knowledge, the first one to address the problem of non-instantaneous 3D reconstruction with a pair of event cameras in stereo configuration. Our approach is based on temporal coherence of events across left and right image planes. However, it differs from previous efforts (such as the instantaneous stereo methods [20,21,22, 27, 28]) in that: (i) we do not follow the classical paradigm of event matching plus triangulation, but rather a forward-projection approach that allows us to estimate depth without explicitly solving the event matching problem, (ii) we are able to handle sparse scenes (events generated by few objects) as well as cluttered scenes (events constantly generated everywhere in the image plane due to the motion of the camera), and (iii) we use camera pose information to integrate observations over time to produce semi-dense depth maps. Moreover, our method computes continuous depth values, as opposed to other methods, such as [11], which discretize the depth range.

Outline. Section 2 presents the 3D reconstruction problem considered and our solution, formulated as the minimization of an objective function that measures the temporal inconsistency of event time-surface maps across left and right image planes. Section 3 presents an approach to fuse multiple event-based 3D reconstructions into a single depth map. Section 4 evaluates our method on both synthetic and real event data, showing its good performance. Finally, Sect. 5 concludes the paper.

2 3D Reconstruction by Event Time-Surface Maps Energy Minimization

Our method is inspired by multi-view stereo pipelines for conventional cameras, such as DTAM [30], which aim at maximizing the photometric consistency through a number of narrow-baseline video frames. However, since event cameras do not output absolute intensity but rather intensity changes (the “events”), the direct photometric-consistency-based method cannot be readily applied. Instead, we exploit the fact that event cameras encode visual information in the form of microsecond-resolution timestamps of intensity changes.

For a stereo event camera, a detectable 3D point in the overlapping field of view (FOV) of the cameras generates an event on both the left and right cameras. Ideally, these two events should spike simultaneously and their coordinates should correspond according to the epipolar geometry defined by the two cameras. This property enables us to apply (and modify) an idea similar to DTAM, simply by replacing photometric consistency with stereo temporal consistency. However, as shown in [31], stereo temporal consistency does not strictly hold at the pixel level because of signal latency and jitter effects. Hence, we define our stereo temporal consistency criterion by aggregating measurements over spatio-temporal neighborhoods, rather than by comparing the event timestamps at two individual pixels, as we show next.

2.1 Event Time-Surface Maps

We propose a patch-based comparison of a pair of spike-history maps, in place of the photometric warping error used in DTAM [30]. Specifically, to create two distinctive maps, we advocate the use of time surfaces, inspired by their use for event-based pattern recognition in [32]. As illustrated in Fig. 1, the output of an event camera is a stream of events, where each event \(e_k=(u_k,v_k,t_k,p_k)\) consists of the space-time coordinates at which an intensity change of predefined size occurred and the sign (polarity \(p_k\in \{+1,-1\}\)) of that change. The time-surface map at time t is defined by applying an exponential decay kernel on the last spiking time \(t_\text {last}\) at each pixel coordinate \(\mathbf {x}=(u,v)^{T}\):

$$\begin{aligned} \mathcal {T}(\mathbf {x}, t) \doteq \exp \left( -\frac{t - t_\text {last}(\mathbf {x})}{\delta }\right) , \end{aligned}$$
(1)

where the decay rate \(\delta \) is a small constant (e.g., 30 ms in our experiments). For convenient visualization and processing, (1) is further rescaled to the range [0, 255]. Our objective function is built on a set of time-surface maps (1) at different observation times \(t = \{t_s\}\).
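For concreteness, the following minimal sketch (Python/NumPy) computes the time-surface map of (1) for a list of events; the array layout, function name and toy data are ours and only the 30 ms decay constant and the [0, 255] rescaling are taken from the text.

```python
import numpy as np

def time_surface(events, t, height, width, delta=0.030):
    """Time-surface map of Eq. (1), rescaled to [0, 255].

    events : iterable of (u, v, t_k, p_k) tuples with timestamps in seconds.
    t      : observation time (seconds).
    delta  : exponential decay rate (30 ms in the experiments).
    """
    t_last = np.full((height, width), -np.inf)   # last spike time per pixel
    for u, v, t_k, _ in events:                  # polarity is not used here
        if t_k <= t:
            t_last[v, u] = max(t_last[v, u], t_k)
    T = np.exp(-(t - t_last) / delta)            # pixels that never fired map to 0
    return (255.0 * T).astype(np.uint8)

# Toy usage: a single event at pixel (u=10, v=20), 10 ms before the observation time.
T = time_surface([(10, 20, 0.090, +1)], t=0.100, height=180, width=240)
print(T[20, 10])   # 255 * exp(-0.01 / 0.03) ~ 182
```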

Fig. 1.

Left: output of an event camera when viewing a rotating dot. Right: Time-surface map (1) at a time t, \(\mathcal {T}(\mathbf {x}, t)\), which essentially measures how far in time (with respect to t) the last event spiked at each pixel \(\mathbf {x}=(u,v)^{T}\). The brighter the color, the more recently the event was generated. Figure adapted from [33]. (Color figure online)

2.2 Problem Formulation

We follow a global energy minimization framework to estimate the inverse depth map \(\mathcal {D}\) in the reference view (RV) from a number of nearby stereo observations \(s \in \mathcal {S}_\mathrm{RV}\). A stereo observation at time t refers to a pair of time-surface maps created using (1), \(\bigl (\mathcal {T}_{\text {left}}(\cdot ,t), \mathcal {T}_{\text {right}}(\cdot ,t)\bigr )\). A stereo observation can be triggered either by a pose update or at a constant rate. For each pixel \(\mathbf {x}\) in the reference view, its inverse depth \(\rho ^{\star } \doteq 1/z^{\star }\) is estimated by optimizing the objective function:

$$\begin{aligned} \rho ^{\star } = \mathop {\text {arg min}}\limits _{\rho } C(\mathbf {x}, \rho ) \end{aligned}$$
(2)
$$\begin{aligned} C(\mathbf {x}, \rho ) \doteq \frac{1}{|\mathcal {S}_\mathrm{RV}|} \sum _{s \in \mathcal {S}_\mathrm{RV}}{\Vert \tau _{\text {left}}^s(\mathbf {x}_1(\rho )) - \tau _{\text {right}}^s(\mathbf {x}_2(\rho )) \Vert _{2}^{2}}, \end{aligned}$$
(3)

where \(\vert \mathcal {S}_\mathrm{RV}\vert \) denotes the number of neighboring stereo observations used for averaging. The function \(\tau _\text {left/right}^s(\mathbf {x})\) returns the values of the time-surface map \(\mathcal {T}_{\text {left/right}}(\cdot ,t)\) inside a \(w \times w\) patch centered at image point \(\mathbf {x}\), stacked into a \(w^2\)-vector. The residual

$$\begin{aligned} r_{s}(\rho ) \doteq \Vert \tau _{\text {left}}^s(\mathbf {x}_1(\rho )) - \tau _{\text {right}}^s(\mathbf {x}_2(\rho )) \Vert _{2} \end{aligned}$$
(4)

denotes the temporal difference in \(l_2\) norm between patches centered at \(\mathbf {x}_1\) and \(\mathbf {x}_2\) in the left and right event cameras, respectively.

The geometry behind the proposed objective function is illustrated in Fig. 2. Since we assume that the calibration (intrinsic and extrinsic parameters) as well as the pose \(T_{sr}\) of the left event camera at each observation are known (e.g., from a tracking algorithm such as [12, 14]), the points \(\mathbf {x}_1\) and \(\mathbf {x}_2\) are given by \(\mathbf {x}_1(\rho ) = \pi (T_{sr}\pi ^{-1}(\mathbf {x},\rho ))\) and \(\mathbf {x}_2(\rho ) = \pi (T_\mathrm{E}T_{sr}\pi ^{-1}(\mathbf {x},\rho ))\), respectively. The function \(\pi : \mathbb {R}^{3} \rightarrow \mathbb {R}^{2}\) projects a 3D point onto the camera’s image plane, while its inverse \(\pi ^{-1}: \mathbb {R}^{2} \rightarrow \mathbb {R}^{3}\) back-projects a pixel into 3D space given its inverse depth \(\rho \). \(T_\mathrm{E}\) denotes the transformation from the left to the right event camera. Note that all event coordinates \(\mathbf {x}\) are undistorted and rectified.

To verify that the proposed objective function (3) does lead to the optimum depth for a generic event in the reference view (Fig. 3(a)), a number of stereo observations from a real stereo event-camera sequence [34] have been created (Figs. 3(c) and (d)) and used to visualize the energy at the event location (Fig. 3(b)). The size of the patch is \(w=25\) pixels throughout the paper.
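The following minimal sketch (Python/NumPy) shows how the energy (3) can be evaluated for a single inverse-depth hypothesis via this forward projection; the pinhole helpers (back_project, project, patch) and the use of \(4\times 4\) homogeneous transforms for \(T_{sr}\) and \(T_\mathrm{E}\) are our own illustrative assumptions, and patch-boundary handling is omitted.

```python
import numpy as np

def back_project(x, rho, K_inv):
    """pi^{-1}: pixel x = (u, v) and inverse depth rho -> 3D point (rectified coords)."""
    return (K_inv @ np.array([x[0], x[1], 1.0])) / rho

def project(X, K):
    """pi: 3D point -> pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

def patch(T_map, x, w=25):
    """w x w patch of a time-surface map centered at x (nearest pixel, no boundary checks)."""
    u, v, r = int(round(x[0])), int(round(x[1])), w // 2
    return T_map[v - r:v + r + 1, u - r:u + r + 1].astype(np.float64).ravel()

def cost(x, rho, observations, T_sr_list, T_E, K, K_inv, w=25):
    """Energy C(x, rho) of Eq. (3): average squared temporal residual over observations."""
    X_ref = np.append(back_project(x, rho, K_inv), 1.0)        # homogeneous 3D point
    c = 0.0
    for (T_left, T_right), T_sr in zip(observations, T_sr_list):
        x1 = project((T_sr @ X_ref)[:3], K)                    # left camera at time s
        x2 = project((T_E @ T_sr @ X_ref)[:3], K)              # right camera at time s
        r = patch(T_left, x1, w) - patch(T_right, x2, w)       # residual of Eq. (4)
        c += r @ r                                             # squared l2 norm
    return c / len(observations)
```

Minimizing this cost over \(\rho\) (e.g., by the coarse search and Gauss-Newton refinement of Sect. 2.3) then yields the depth and, implicitly, the stereo event correspondences.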

Fig. 2.

Illustration of the geometry of the proposed problem and solution. The reference view (RV) is on the left, in which an event with coordinates \(\mathbf {x}\) is back-projected into 3D space with a hypothetical inverse depth \(\rho \). The optimal inverse depth \(\rho ^{\star }\), lying inside the search interval \([\rho _{\min }, \rho _{\max }]\), corresponds to the real location of the 3D point which fulfills the temporal consistency in each neighboring stereo observation s.

Fig. 3.

Proposed objective function. (a) A randomly selected event, at pixel \(\mathbf {x}\), is marked by a red circle in the reference view. The energy \(C(\mathbf {x},\rho )\) in (3) is visualized in (b) as a function of \(\rho \), with the thick red curve obtained by averaging the costs \(C(\mathbf {x}_i,\rho )\) of neighboring pixels \(\mathbf {x}_i\) in a patch centered at \(\mathbf {x}\) (indicated by curves with random colors). The vertical dashed line (black) indicates the ground truth inverse depth. The time-surface maps of the left and the right event cameras at one of the observation times are shown in (c) and (d), respectively, where the patches for measuring the temporal residual are indicated in red. (Color figure online)

Note that our approach significantly departs from classical two-step event-processing methods [20,21,22] that solve the stereo matching problem first and then triangulate the 3D point, which is prone to errors due to the difficulty in establishing correct event matches during very short time intervals. These two-step approaches work in a “back-projection” fashion, mapping 2D event measurements to 3D space. Instead, our approach combines matching and triangulation in a single step, operating in a forward-projection manner (from 3D space to 2D event measurements). As shown in Fig. 2, an inverse depth hypothesis \(\rho \) yields a 3D point, \(\pi ^{-1}(\mathbf {x},\rho )\), whose projection on both stereo image planes for all times “s” gives curves \(\mathbf {x}_1^s(\rho )\) and \(\mathbf {x}_2^s(\rho )\) that are compared in the objective function (3). Hence, an inverse depth hypothesis \(\rho \) establishes candidate stereo event matches, and the best matches are obtained once the objective function has been minimized with respect to \(\rho \).

2.3 Inverse Depth Estimation

The proposed objective function (3) is optimized using non-linear least squares methods. We use the Gauss-Newton method, which iteratively finds the root of the necessary optimality condition

$$\begin{aligned} \frac{\partial C}{\partial \rho } = \frac{2}{\vert \mathcal {S}_\mathrm{RV}\vert } \sum _{s \in \mathcal {S}_\mathrm{RV}}r_s \frac{\partial r_s}{\partial \rho } = 0. \end{aligned}$$
(5)

Substituting the first-order Taylor expansion of \(r_s\) at \(\rho _k\), \(r_s(\rho _k+\Delta \rho ) \approx r_s(\rho _k) + J_s(\rho _k)\Delta \rho \), into (5), we obtain

$$\begin{aligned} \sum _{s \in \mathcal {S}_\mathrm{RV}} J_s (r_s + J_s \Delta \rho ) = 0, \end{aligned}$$
(6)

where both the residual \(r_s\equiv r_s(\rho _k)\) and the Jacobian \(J_s \equiv J_s(\rho _k)\) are scalars. Consequently, the inverse depth \(\rho \) is iteratively updated by adding the increment

$$\begin{aligned} \Delta \rho = -\frac{\sum _{s \in \mathcal {S}_\mathrm{RV}} J_s r_s}{\sum _{s \in \mathcal {S}_\mathrm{RV}} J_s^2}. \end{aligned}$$
(7)

The Jacobian is computed by applying the chain rule,

$$\begin{aligned} \begin{aligned} J_s(\rho )&\doteq \frac{\partial }{\partial \rho } \Vert \tau _{\text {left}}^s(\mathbf {x}_1(\rho )) - \tau _{\text {right}}^s(\mathbf {x}_2(\rho )) \Vert _2\\&= \frac{1}{\Vert \tau _{\text {left}}^s - \tau _{\text {right}}^s \Vert _2 + \epsilon } \big (\tau _{\text {left}}^{s} - \tau _{\text {right}}^{s} \big )_{1 \times w^2} ^{T} \left( \frac{\partial \tau _{\text {left}}^{s}}{\partial \rho } - \frac{\partial \tau _{\text {right}}^{s}}{\partial \rho } \right) _{w^2 \times 1}, \end{aligned} \end{aligned}$$
(8)

where, for simplicity, the pixel notation \(\mathbf {x}_i(\rho )\) is omitted in the last equation. To avoid division by zero, a small number \(\epsilon \) is added to the norm of the residual vector. In practice, as shown by the investigation of the distribution of the temporal residuals \(r_s\) in Sect. 3.1, the residual is unlikely to be close to zero for valid stereo observations (i.e., patches with enough events). The derivative of the time-surface map with respect to the inverse depth is calculated by

$$\begin{aligned} \frac{\partial \varvec{\tau }^{s}}{\partial \rho } =\frac{\partial \varvec{\tau }^{s}}{\partial \mathbf {x}} \frac{\partial \mathbf {x}}{\partial \rho } =\left( \frac{\partial \varvec{\tau }^s}{\partial u}, \frac{\partial \varvec{\tau }^s}{\partial v}\right) _{w^2 \times 2} \left( \frac{\partial u}{\partial \rho }, \frac{\partial v}{\partial \rho } \right) ^{T}. \end{aligned}$$
(9)

The computation of \(\partial u / \partial \rho \) and \(\partial v/\partial \rho \) is given in the supplementary material.

The overall procedure is summarized in Algorithm 1. The inputs of the algorithm are the pixel coordinate \(\mathbf {x}\) of an event in the RV, a set of stereo observations (time-surface maps) \(\mathcal {T}^s_\text {left/right}\) (\(s \in \mathcal {S}_\mathrm{RV}\)), the relative pose \(T_{sr}\) from the RV to each involved stereo observation s, and the constant extrinsic parameters between the two event cameras, \(T_\mathrm{E}\). The inverse depths of all events in the RV are estimated independently; therefore, the computation is parallelizable. The basin of convergence is first localized by a coarse search over the range of plausible inverse depth values, followed by a nonlinear refinement using the Gauss-Newton method. The coarse search step is chosen to balance efficiency and accuracy in locating the basin of convergence, based on the observation that the basin is always wider than 0.2 m\(^{-1}\) in the experiments carried out.
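As a rough illustration of this procedure, the sketch below (Python/NumPy) performs the coarse search followed by Gauss-Newton refinement. It takes a callable returning the temporal residuals \(r_s(\rho)\) and, for brevity, approximates the Jacobians \(J_s\) by finite differences instead of the analytic chain rule of (8)–(9); step size, iteration count and tolerances are illustrative choices, not the authors' settings.

```python
import numpy as np

def estimate_inverse_depth(residuals, rho_min, rho_max, step=0.05, iters=10):
    """Coarse search over [rho_min, rho_max] followed by Gauss-Newton refinement.

    residuals : callable rho -> array of temporal residuals r_s(rho), Eq. (4).
    step      : coarse-search step, chosen smaller than the basin of convergence.
    """
    # 1. Coarse search: locate the basin of convergence of the energy (3).
    grid = np.arange(rho_min, rho_max, step)
    costs = [float(np.mean(residuals(rho) ** 2)) for rho in grid]
    rho = float(grid[int(np.argmin(costs))])

    # 2. Gauss-Newton refinement, Eqs. (5)-(7), with finite-difference Jacobians.
    h = 1e-4
    for _ in range(iters):
        r = residuals(rho)
        J = (residuals(rho + h) - r) / h           # dr_s / drho, one entry per s
        denom = float(J @ J)
        if denom < 1e-12:
            break
        d_rho = -float(J @ r) / denom              # increment of Eq. (7)
        rho += d_rho
        if abs(d_rho) < 1e-6:
            break
    return rho
```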

Fig. 4.

Probability density function (PDF) of the temporal residuals \(\{r_i\}\): empirical (gray) and Gaussian fit \(\mathcal {N}(\mu ,\sigma ^2)\) (red line). (Color figure online)

3 Semi-dense Reconstruction

The 3D reconstruction method presented in Sect. 2 produces a sparse depth map at the reference view (RV). To improve the density of the reconstruction while reducing the uncertainty of the estimated depth, we run the reconstruction method (Algorithm 1) on several RVs along time and fuse the results. To this end, the uncertainty of the inverse depth estimation is studied in this section. Based on the derived uncertainty, a fusion strategy is developed and is incrementally applied as sparse reconstructions of new RVs are obtained. Our final reconstruction approaches a semi-dense level as it reconstructs depth for all pixels that lie along edges.

3.1 Uncertainty of Inverse Depth Estimation

In the last iteration of the Gauss-Newton method, the inverse depth is updated by

$$\begin{aligned} \rho ^{\star }\equiv \rho _k \leftarrow \rho _{k} + \Delta \rho (\mathbf {r}), \end{aligned}$$
(10)

where \(\Delta \rho \) is a function of the residuals \(\mathbf {r}\doteq \{r_s \,\vert \, s \in \mathcal {S}_\mathrm{RV}\}\) as defined in (7). The variance \(\sigma _{\rho ^{\star }}^2\) of the inverse depth estimate can be derived using uncertainty propagation [35]. For simplicity, only the noise in the temporal residuals \(\mathbf {r}\) is considered:

$$\begin{aligned} \sigma _{\rho ^{\star }}^2 \approx \left( \frac{\partial \rho ^{\star }}{\partial \mathbf {r}}\right) ^{T} (\sigma _{r}^2 \mathtt {Id})\,\frac{\partial \rho ^{\star }}{\partial \mathbf {r}} =\frac{\sigma _r^{2}}{\sum _{s \in \mathcal {S}_\mathrm{RV}} J_s^{2}}. \end{aligned}$$
(11)

The derivation of this equation can be found in the supplementary material. We determine \(\sigma _{r}\) empirically by investigating the distribution of the temporal residuals \(\mathbf {r}\). Using the ground truth depth, we sample a large number of temporal residuals \(\mathbf {r}= \{r_1, r_2, ..., r_n\}\). The variance \(\sigma _{r}^2\) is obtained by fitting a Gaussian distribution to the histogram of \(\mathbf {r}\), as illustrated in Fig. 4.
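A minimal sketch of this uncertainty computation (Python/NumPy) is given below; the residual samples and Jacobians are assumed to be available from the estimation step, and the Gaussian fit is reduced to a sample standard deviation for brevity.

```python
import numpy as np

def residual_std(residual_samples):
    """Standard deviation sigma_r of a Gaussian fitted to sampled temporal residuals (Fig. 4)."""
    return float(np.asarray(residual_samples, dtype=np.float64).std(ddof=1))

def inverse_depth_variance(jacobians, sigma_r):
    """Eq. (11): sigma_rho^2 = sigma_r^2 / sum_s J_s^2."""
    J = np.asarray(jacobians, dtype=np.float64)
    return sigma_r ** 2 / float(J @ J)

# Toy usage with synthetic residuals and made-up Jacobians.
sigma_r = residual_std(np.random.normal(50.0, 12.0, size=10000))
print(inverse_depth_variance([3.0, -2.5, 4.1], sigma_r))
```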

Fig. 5.

Depth map fusion strategy. All stereo observations \((\mathcal {T}^s_\text {left}, \mathcal {T}^s_\text {right})\) are denoted by hollow circles and listed in chronological order. Neighboring RVs are fused into a chosen \(\text {RV}^{\star }\) (e.g., \(\text {RV}_{3}\)). Using the fusion from \(\text {RV}_{5}\) to \(\text {RV}_{3}\) as an example, the fusion rules are illustrated in the dashed square, in which a part of the image plane is visualized. The blue dots are the reprojections of 3D points in \(\text {RV}_{5}\) onto the image plane of \(\text {RV}_{3}\). Gray dots represent unassigned pixels, which will be assigned the estimates of blue dots lying within one pixel of them. Pixels that have already been assigned and are compatible with the projected estimates, e.g., the green ones, will be fused with them. Pixels that are not compatible (in red) will either remain or be replaced, depending on which distribution has the smaller uncertainty. (Color figure online)

3.2 Inverse Depth Fusion

To improve the density of the reconstruction, inverse depth estimates from multiple RVs are incrementally transferred to a selected reference view, \(\text {RV}^{\star }\), and fused. Assuming the inverse depth of a pixel in \(\text {RV}_{i}\) follows a distribution \(\mathcal {N}(\rho _a, \sigma _{a}^2)\), its corresponding location in \(\text {RV}^{\star }\) is typically a non-integer coordinate \(\mathbf {x}^f\), which affects the four neighboring pixel coordinates \(\{\mathbf {x}_{j}^{i}\}_{j=1}^{4}\). Using \(\mathbf {x}_{1}^{i}\) as an example, the fusion is performed based on the following rules:

  1.

    Assign \(\mathcal {N}(\rho _a, \sigma _{a}^2)\) to \(\mathbf {x}_{1}^{i}\) if no previous distribution exists.

  2.

    If there is an existing inverse depth distribution at \(\mathbf {x}_{1}^{i}\), e.g., \(\mathcal {N}(\rho _b, \sigma _{b}^2)\), the compatibility between the two inverse depth hypotheses is checked to decide whether they should be fused. Compatibility is evaluated using the \(\chi ^2\) test [35], where 5.99 is the 95% quantile of the \(\chi ^2\) distribution with two degrees of freedom:

    $$\begin{aligned} \frac{(\rho _a - \rho _b)^2}{\sigma _a^{2}} + \frac{(\rho _a - \rho _b)^2}{\sigma _b^{2}} < 5.99. \end{aligned}$$
    (12)

    If the two hypotheses are compatible, they are fused into a single inverse depth distribution:

    $$\begin{aligned} \mathcal {N}\left( \frac{\sigma _a^2 \rho _b + \sigma _b^2 \rho _a}{\sigma _a^2 + \sigma _b^2},\, \frac{\sigma _a^2 \sigma _b^2}{\sigma _a^2 + \sigma _b^2}\right) , \end{aligned}$$
    (13)

    otherwise, the distribution with the smaller variance is kept.

An illustration of the fusion strategy is given in Fig. 5.
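A minimal sketch of this per-pixel fusion rule (Python) is shown below; it operates on (inverse depth, variance) pairs and omits the reprojection of estimates between reference views and the assignment to the four neighboring pixels.

```python
def fuse(hypothesis_a, hypothesis_b, chi2_threshold=5.99):
    """Fuse two inverse-depth hypotheses (rho, sigma^2) at one pixel.

    Returns the fused Gaussian if the chi-square test of Eq. (12) passes,
    otherwise the hypothesis with the smaller variance.
    """
    if hypothesis_b is None:                         # rule 1: no previous estimate
        return hypothesis_a
    rho_a, var_a = hypothesis_a
    rho_b, var_b = hypothesis_b
    d2 = (rho_a - rho_b) ** 2
    if d2 / var_a + d2 / var_b < chi2_threshold:     # rule 2: compatible -> fuse, Eq. (13)
        rho_f = (var_a * rho_b + var_b * rho_a) / (var_a + var_b)
        var_f = var_a * var_b / (var_a + var_b)
        return (rho_f, var_f)
    return hypothesis_a if var_a < var_b else hypothesis_b  # keep the more certain one

# Example: two compatible hypotheses are merged by inverse-variance weighting.
print(fuse((0.50, 0.01), (0.52, 0.02)))   # -> (~0.507, ~0.0067)
```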

4 Experiments

The proposed stereo 3D reconstruction method is evaluated in this section. We first introduce the configuration of our stereo event-camera system and the datasets used in the experiments. Afterwards, both quantitative and qualitative evaluations are presented. Additionally, the depth fusion process is illustrated to highlight how it improves the density of the reconstruction while reducing depth uncertainty.

4.1 Stereo Event-Camera Setup

To evaluate our method, we use sequences from publicly available simulators [36] and datasets [34], and we also collect our own sequences using a stereo event-camera rig (Fig. 6). The stereo rig consists of two Dynamic and Active Pixel Vision Sensors (DAVIS) [37] of \(240\times 180\) pixel resolution, which are calibrated intrinsically and extrinsically using Kalibr [38]. Since our algorithm works on rectified and undistorted coordinates, the joint undistortion and rectification mapping is computed in advance.

As the stereo event-camera system moves, a new stereo observation \((\mathcal {T}^s_\text {left}, \mathcal {T}^s_\text {right})\) is generated whenever a pose update is available. The generation consists of two steps. The first step is to generate a rectified event map by collecting all events that occurred within the 10 ms preceding the pose update, as shown in Fig. 3(a). The second step is to refresh the time-surface maps of both the left and right event cameras, as shown in Figs. 3(c) and (d). One of the observations is selected as the RV. The rectified event map of the RV, together with the rest of the observations, is fed to the inverse depth estimation module (Algorithm 1). We use the rectified event map as a selection map, i.e., we estimate depth values only at the pixels with non-zero values in the rectified event map (as shown in Figs. 6(c) and (d)). As more and more RVs are reconstructed and fused together, the result becomes denser and more accurate.
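The selection step can be sketched as follows (Python/NumPy); events are assumed to be already undistorted and rectified, and the 10 ms window length is the one stated above.

```python
import numpy as np

def event_selection_map(events, t_pose, height, width, window=0.010):
    """Rectified event map used as a selection mask: a pixel is selected if at least
    one event fired there within `window` seconds before the pose update."""
    mask = np.zeros((height, width), dtype=bool)
    for u, v, t_k, _ in events:
        if t_pose - window <= t_k <= t_pose:
            mask[v, u] = True
    return mask

# Inverse depth is then estimated only at the selected pixels, e.g.:
# candidates = np.argwhere(mask)   # (v, u) coordinates of events in the RV
```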

4.2 Results

The evaluation is performed on six sequences: a synthetic sequence from the simulator [36], three sequences collected by ourselves (hand-held), and two sequences from [34] (with a stereo event camera mounted on a drone). A snapshot of each scene is given in the first column of Fig. 7. In the synthetic sequence, the stereo event-camera system looks orthogonally towards three fronto-parallel planes while performing a pure translation. Our three sequences show typical office scenes with various office supplies. The stereo event-camera rig is hand-held and performs arbitrary 6-DOF motion, which is recorded by a motion-capture system with sub-millimeter accuracy. The other two sequences are collected in a large indoor environment using a drone [34], with pose information also provided by a motion-capture system. These two sequences are very challenging for two reasons: (i) the scene contains a wide variety of structures, such as chairs, barrels, and a tripod on a cabinet, and (ii) the drone undergoes relatively high-speed motions during data collection.

Fig. 6.

Left, (a) and (b): the stereo event-camera rig used in our experiments, consisting of two synchronized DAVIS [37] devices. Right, (c) and (d): rectified event maps at one observation time.

Table 1. Quantitative evaluation on sequences with ground truth depth.
Fig. 7.

Results of the proposed method on several datasets. Images in the first column are raw intensity frames (neither rectified nor corrected for lens distortion). The second column shows the events (undistorted and rectified) in the left event camera of a reference view (RV). Semi-dense depth maps (after fusion with several neighboring RVs) are given in the third column, colored according to depth, from red (close) to blue (far). The fourth column visualizes the 3D point cloud of each sequence from a chosen perspective. No post-processing (such as regularization through median filtering [13]) was performed. (Color figure online)

Fig. 8.

Illustration of how the fusion strategy progressively improves the density of the reconstruction while reducing depth uncertainty. The first column shows the uncertainty maps \(\sigma _{\rho }\) before the fusion. The second to fourth columns report the uncertainty maps after fusing with 4, 8 and 16 neighboring estimates, respectively.

A quantitative evaluation on datasets with ground truth depth is given in Table 1, where we compare our method with two state-of-the-art instantaneous stereo matching methods, “Fast Cost-Volume Filtering” (FCVF) [39] and “Semi-Global Matching” (SGM) [40], applied to pairs of time-surface images (as in Figs. 3(c) and (d)). We report the mean depth error, the median depth error and the relative error (defined as the mean depth error divided by the depth range of the scene [13]). For a fair comparison, the fully dense depth maps returned by FCVF and SGM are masked by the non-zero pixels of the time-surface images. In addition, the boundaries of the depth maps are cropped according to the block size used in each implementation. The best results per sequence are highlighted in bold in Table 1. Our method outperforms the other two competitors on all sequences. Although FCVF and SGM also give satisfactory results on the synthetic sequence, they do not work well in more complicated scenarios, in which the observations are either not dense enough or the temporal consistency does not strictly hold within a single stereo observation.
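For reference, the error metrics used in Table 1 can be computed as in the following sketch (Python/NumPy); the exact masking and cropping details are assumptions, and the depth range is taken here over the evaluated pixels.

```python
import numpy as np

def depth_errors(depth_est, depth_gt, mask):
    """Mean, median and relative depth error over the selected (masked) pixels.

    The relative error is the mean depth error divided by the depth range of the
    scene, following [13]."""
    err = np.abs(depth_est[mask] - depth_gt[mask])
    depth_range = float(depth_gt[mask].max() - depth_gt[mask].min())
    return float(err.mean()), float(np.median(err)), float(err.mean()) / depth_range
```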

Reconstruction results on all sequences are visualized in Fig. 7. Images in the first column are raw intensity frames from the DAVIS. They convey the appearance of the scenes but are not used by our algorithm. The second column shows rectified and undistorted event maps in the left event camera of an RV. The number of events depends not only on the motion of the stereo rig but also on the amount of visual contrast in the scene. Semi-dense depth maps (after fusion with several neighboring RVs) are given in the third column, pseudo-colored from red (close) to blue (far). The last column visualizes the 3D point cloud of each sequence from a chosen perspective. Note that only points whose variance \(\sigma ^{2}_{\rho }\) is smaller than \(0.8 \times (\sigma _{\rho }^\mathrm{max})^{2}\) are visualized in 3D.

The reconstruction of the rectified events in a single RV is sparse and typically noisy. To show how the fusion strategy improves the density of the reconstruction and reduces the uncertainty, we additionally visualize the fusion process incrementally. As shown in Fig. 8, the first column visualizes the uncertainty maps before the fusion. The second to fourth columns show the uncertainty maps after fusing the result of an RV with its 4, 8 and 16 neighboring estimates, respectively. Hot colors indicate high uncertainty, while cold colors indicate low uncertainty. The result becomes increasingly dense and accurate as more and more RVs are fused. Note that the remaining highly uncertain estimates generally correspond to events caused by either noise or low-contrast patterns.

5 Conclusion

This paper has proposed a novel and effective solution to 3D reconstruction using a pair of temporally-synchronized event cameras in stereo configuration. To the best of the authors’ knowledge, this is the first work to address this problem, enabling stereo SLAM applications with event cameras. The proposed energy minimization method exploits the spatio-temporal consistency of the events across cameras to achieve high accuracy (between 1% and 5% relative error), and it outperforms state-of-the-art stereo methods applied to the same spatio-temporal image representation of the event stream. Future work includes the development of a full stereo visual odometry system, by combining the proposed 3D reconstruction strategy with a stereo-camera pose tracker, in a parallel tracking and mapping fashion [14].