
1 Introduction

The notion of a light field, which describes all light rays traveling in 3-D free space [1,2,3], has been utilized in various applications, such as view synthesis [4,5,6], depth estimation [7,8,9], synthetic refocusing [10, 11], super-resolution [7, 12], 3-D displays [13,14,15,16], and object recognition [17, 18]. Several researchers have published light field datasets [19,20,21,22] to accelerate research in this field. A light field dataset is usually represented as a set of multi-view images that are densely aligned with tiny viewpoint intervals.

Acquiring a light field is a challenging task due to the amount of data involved, because a light field typically consists of dozens of images. Several researchers have employed direct approaches such as using a moving camera gantry [2] or multiple cameras [23,24,25] to capture a target from different viewpoints. These approaches are costly in terms of the hardware or the time required to capture the entire light field. Meanwhile, lens-array based cameras [11, 26,27,28] and coded aperture/mask cameras [29,30,31,32,33,34,35] have also been utilized to achieve more efficient acquisition. In the case of lens-array based cameras, an entire light field can be captured with a single acquired image, but the spatial resolution of each image is in a trade-off relationship with the number of viewpoints. In contrast, coded aperture/mask cameras can capture a light field with the same spatial resolution as that of the image sensor, but the quality of the light field is in a trade-off relationship with the number of acquired images. Other researchers used view synthesis techniques [6, 7, 36, 37] to generate a dense light field from a few images that were acquired with sparser intervals, or even from a single image. However, view synthesis involves added difficulty in terms of accurately estimating depth/disparity from the given images and reproducing view-dependent phenomena such as non-Lambertian reflections and occlusions.

This paper focuses on the problem of efficiently acquiring a light field by using a coded aperture camera. With this camera, a light field is computationally reconstructed from several images that are acquired with different aperture patterns. Earlier methods [30,31,32] could not drastically reduce the number of images required to reconstruct a light field. However, with recent methods [33, 35], the number of acquired images has been reduced to only a few for a light field consisting of 5 \(\times \) 5 viewpoints. We want to further improve the trade-off relationship between the number of acquired images and the quality of the reconstructed light field. The key to achieving this goal is to find good aperture patterns together with a good reconstruction algorithm.

It is important to understand the principles behind reducing the number of acquired images. In previous works, this problem has been tackled from the perspective of compressive sensing [38,39,40]. Compressive sensing is a framework for reconstructing a signal from a reduced number of samples, where inherent structures in a target signal are exploited to seek the optimal sampling and reconstruction strategies. Along this theme, light-field acquisition has been formalized from the perspective of sparse representation using a learned dictionary [33] and approximation using the most significant basis vectors [35]. These methods can successfully reconstruct a light field from only a few acquired images. However, they often fail to reconstruct the details of the light field due to the limited representation capability of the dictionary and basis. Moreover, sparse reconstruction requires significant computational time due to the inherent complexity of the iterative algorithms that are used.

In contrast, we view the problem of acquiring a light field from the perspective of an auto-encoder [41, 42]. An auto-encoder employs a simple structure, in which an encoder network is connected to a decoder network, and it is trained to best approximate the identity mapping between the input and output. In our method, the encoder and decoder correspond to the image acquisition and reconstruction processes, respectively: the original light field is first reduced (encoded) to only a few images through the physical imaging process of the coded aperture camera, and these images are then combined to reconstruct (decode) a light field with the same size as the original one. We implemented this auto-encoder as a fully convolutional neural network (CNN) and trained it end-to-end by using a collection of training samples. The parameters of the trained network correspond to the aperture patterns and reconstruction algorithm that are jointly optimized over the training dataset. In short, our method can learn to capture and reconstruct a light field through a coded aperture camera by utilizing the powerful framework of a deep neural network (DNN).

We implemented our method by using Chainer [43] and trained it by using samples taken from 51 light-field datasets. We experimentally show that our method can successfully learn good image-acquisition and reconstruction strategies. By using our method, light fields consisting of 5 \(\times \) 5 or 8 \(\times \) 8 images can be reconstructed with high quality from only a few acquired images. Moreover, our method achieved superior performance over several state-of-the-art methods [6, 33, 35, 36] including both compressive-sensing based and view-synthesis based approaches. We also applied our method to a real prototype camera to show that it is capable of capturing a real 3-D scene.

We briefly review related work on the use of DNNs for similar applications. DNNs have been utilized to optimize imaging pipelines, including camera designs, for color image acquisition [44], depth estimation [45], and high-speed video acquisition [46]. A learning-based approach was also used for temporal interpolation of light field video [47]. As the work most similar to ours, compressively sampled light fields were successfully reconstructed using DNNs in [48]. However, in contrast to our method, the compression process (camera design) in [48] was not optimized in the training stage but determined beforehand. To the best of our knowledge, we are the first to use a DNN for designing the entire pipeline of a computational camera for light-field acquisition. Moreover, the difference between the network architectures resulted in a significant difference in reconstruction time: 6.7 min for [48] and only 1 s for ours.

2 Proposed Method

2.1 Light Field Capturing Through Coded Aperture Camera

Fig. 1. Schematic diagram and implementation of coded aperture camera

A schematic diagram of a coded aperture camera is illustrated in Fig. 1. An architecture that is equivalent to this diagram can be implemented by using relay optics and an LCoS (liquid crystal on silicon) device [31, 33, 49]. Here, we present a mathematical formulation for acquiring a light field through a coded aperture camera.

All incoming light rays that will be recorded by this camera are parameterized with four variables \((s, t, u, v)\), where \((s, t)\) and \((u, v)\) denote the intersections with the aperture and imaging planes, respectively. Therefore, the light field is defined over the 4-D space \((s, t, u, v)\), over which the light intensity is described as \(l(s,t,u,v)\). The light field is equivalently described as a set of rectified multi-view images called “sub-aperture images”, \(\{ x_{s,t}(u,v) = l(s,t,u,v) \}\), where \((s, t)\) corresponds to the viewpoint defined on the aperture plane.

We consider a coded aperture design where the transmittance of the aperture can be controlled for each position and for each acquisition. Let \(a_n(s,t)\) be the transmittance at position \((s, t)\) for the n-th acquisition (\(n=1,\dots , N\)). The observed image \(y_n(u,v)\) is formed as

$$\begin{aligned} y_n (u,v) = \iint a_n (s,t) l(s,t,u,v) dsdt = \iint a_n (s,t) x_{s,t}(u,v) dsdt. \end{aligned}$$
(1)

When the aperture plane is quantized into a finite number of square blocks, they can be numbered by an integer m (\(m = 1,\dots , M\)). Equation (1) is rewritten as

$$\begin{aligned} y_n (u,v) = \sum _{m=1}^{M} a_{n,m} x_m(u,v), \end{aligned}$$
(2)

where M is the total number of blocks, and \(a_{n,m}\) [equivalent to \(a_n(s,t)\)] is the transmittance of the aperture at position m for the n-th acquisition.

Reconstructing the light field is equivalent to estimating M sub-aperture images \(x_m(u,v)\) from the N given observations \(y_n (u,v)\). In particular, we are interested in the case of \(N \ll M\), where the entire light field could be reconstructed from only a few observed images. To this end, we need to find good transmittance patterns \(a_{n, m}\) and a corresponding reconstruction method.

In the remainder of this paper, we assume that each image has only a single channel for simplicity, and this applies to both \(x_m(u,v)\) and \(y_n (u,v)\). When dealing with a color image, we simply divide it into three independent single-channel images and apply the same procedure to them.
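As a concrete illustration of Eq. (2), the following NumPy sketch simulates the acquisition of N coded images from M sub-aperture images; the function and variable names are our own and do not come from the authors' code.

```python
import numpy as np

def simulate_acquisition(sub_aperture_images, aperture_patterns):
    """sub_aperture_images: array of shape (M, U, V), one single-channel image per block m.
       aperture_patterns:   array of shape (N, M), transmittances a_{n,m} in [0, 1]."""
    num_shots, num_views = aperture_patterns.shape
    observed = np.zeros((num_shots,) + sub_aperture_images.shape[1:])
    for n in range(num_shots):
        for m in range(num_views):
            # Each observed image is a transmittance-weighted sum of the views.
            observed[n] += aperture_patterns[n, m] * sub_aperture_images[m]
    return observed

# Equivalently: observed = np.tensordot(aperture_patterns, sub_aperture_images, axes=1)
```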

2.2 Formulation as Convolutional Neural Network

The observation (image acquisition) process of a coded aperture camera, which is given by Eq. (2), can be written in the form of a mapping as

$$\begin{aligned} f: \mathcal{X} \rightarrow \mathcal{Y} \end{aligned}$$
(3)

where \(\mathcal{X}\) represents a vector that contains all the pixels of \(x_m(u,v)\) for all \(m \in \{1, ..., M\}\). When each image has \(U \times V\) pixels, the length of the vector is UVM. Similarly, \(\mathcal{Y}\) represents a vector that contains all the pixels of \(y_n(u,v)\) for all \(n \in \{1, ..., N\}\). The reconstruction is also written as

$$\begin{aligned} g: \mathcal{Y} \rightarrow {\hat{\mathcal{X}}} \end{aligned}$$
(4)

where \( \hat{\mathcal{X}}\) corresponds to an estimate of \(\mathcal{X}\). The composite mapping \(h=g \circ f\) should be as close to the identity as possible, under the condition that \(N \ll M\). This problem can be regarded as that of an auto-encoder, where the encoder and decoder correspond to f and g, respectively. The goal of optimization is formulated with the squared error loss as

$$\begin{aligned} \arg \min _{f,g} |\mathcal{X} - \hat{\mathcal{X}} |^2 = \arg \min _{f,g} \sum _m \sum _{u,v} | x_m(u,v) - \hat{x}_m(u,v) |^2 \end{aligned}$$
(5)

where an estimate of \({x}_m(u,v)\) is written as \(\hat{x}_m(u,v)\).

On top of this formulation, we implemented the composite mapping \(h=g \circ f\) as a stack of 2-D convolutional layers. The entire network can be trained end-to-end by using a collection of training samples. The learned parameters in f and g correspond to the aperture patterns \(a_{n, m}\) and the reconstruction algorithm, respectively, both of which are jointly optimized over the training dataset. In short, our method can learn to capture a light field through a coded aperture camera by utilizing the powerful framework of a DNN. This network can easily be applied to a physical coded-aperture camera. The mapping f is conducted by the physical imaging process of the camera, the aperture patterns of which are configured in accordance with the learned parameters in f. Then, the images acquired by the camera are fed into the network corresponding to g, by which we can computationally reconstruct the target light field.

Fig. 2. Modeling light-field acquisition using CNN

An example with \(M=25\) and \(N=2\) is illustrated in Fig. 2. The architecture of our network is summarized below. Note that this is only an example we have found through trial and error, and one may find other architectures that are essentially equivalent in spirit but can perform better than ours.

In a 2-D convolutional network, data are represented as 2-D images with multiple channels. For the l-th convolution layer, the relationship between the input \(x^{l}_{c}(u,v)\) and output \(x^{l+1}_{c^{\prime }}(u,v)\), where c and \(c^{\prime }\) denote channels, is represented as

$$\begin{aligned} x^{l+1}_{c^{\prime }}(u,v) = \sum _c k_{c, c^{\prime }}(u,v) * x^{l}_{c}(u,v) + b_{c^{\prime }} \end{aligned}$$
(6)

which is optionally followed by an activation function. The 2-D convolution kernels \(k_{c, c^{\prime }}(u,v)\) and the bias terms \(b_{c^{\prime }}\) are optimized through the training stage. We used a one-pixel stride and appropriate zero padding to keep the image size unchanged before and after each convolution. Meanwhile, the number of channels (the range of c and \(c^{\prime }\)) can be changed arbitrarily between the input and output. Therefore, the image size (height and width) is kept constant throughout the network, while only the number of channels changes as the data proceed through the network. To apply this data structure to our purpose, we treated the indices n and m as channels. Therefore, the viewpoint m corresponds to the channels in the input and output data, and the index of the acquired images n corresponds to the channels of the intermediate data between f and g.

The mapping f has a few restrictions because it corresponds to the physical imaging process described by Eq. (2): the transmittance \(a_{n,m}\) should be limited to [0, 1], each pixel \((u, v)\) cannot be mixed with its neighbors, and the number of channels should be reduced from M to N. We implemented this using a single 2-D convolution layer with \(1 \times 1\) convolution kernels, where the weights are limited to the range [0, 1], and \(b_{c^{\prime }}=0\) for all \(c^{\prime }\). In this case, \(k_{c, c^{\prime }}\) is equivalent to \(a_{c^{\prime }, c}\), and Eq. (6) corresponds to Eq. (2). To maintain the range limitation during the training stage, we clipped the weights to [0, 1] each time the network was updated by using the gradient information obtained from a mini-batch. To better simulate the physical imaging process, Gaussian noise with standard deviation \(\sigma \) was added to the output of this network. This network is called the “image acquisition network” and denoted as \(\mathcal{N}_A\).
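To make this concrete, the following is a minimal sketch of such an image acquisition network. It is written in PyTorch purely for illustration (the authors' implementation used Chainer), and the class and argument names are our own assumptions.

```python
import torch
import torch.nn as nn

class AcquisitionNet(nn.Module):
    """Maps M sub-aperture views to N coded observations, as in Eq. (2)."""
    def __init__(self, num_views=25, num_shots=2, noise_sigma=0.005):
        super().__init__()
        # 1x1 convolution without bias: each output pixel is a weighted sum of
        # the M views at the same (u, v); the weights play the role of a_{n,m}.
        self.mask = nn.Conv2d(num_views, num_shots, kernel_size=1, bias=False)
        self.noise_sigma = noise_sigma

    def clip_weights(self):
        # Keep the transmittance values physically valid, i.e., within [0, 1].
        with torch.no_grad():
            self.mask.weight.clamp_(0.0, 1.0)

    def forward(self, light_field):                  # (batch, M, U, V)
        coded = self.mask(light_field)               # (batch, N, U, V)
        if self.training and self.noise_sigma > 0:
            # Simulate sensor noise on the acquired images.
            coded = coded + self.noise_sigma * torch.randn_like(coded)
        return coded
```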

Meanwhile, the mapping g can take an arbitrary form because it is a purely computational process. We decomposed this mapping into two networks, denoted as \(\mathcal{N}_{R1}\) and \(\mathcal{N}_{R2}\), respectively, where R stands for reconstruction. In the former network \(\mathcal{N}_{R1}\), the number of channels is gradually increased from N to M by using several convolution layers with \(5 \times 5\) kernels and linear activations, resulting in a tentative reconstruction of the target light field. This tentative result is further refined by the latter network \(\mathcal{N}_{R2}\), which employs the structure of very deep super resolution (VDSR) [50]. Here, 19 convolution layers with \(3 \times 3\) kernels and ReLU activations are followed by the addition of the tentative light field itself, which forces the network to learn only the residual (the difference between the tentative result and the ground truth). The former and latter networks are also called the “tentative reconstruction network” and the “refinement network,” respectively.
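Continuing the sketch above, the two reconstruction networks might look as follows. The channel progression in \(\mathcal{N}_{R1}\) (2 → 6 → 12 → 25 here) and the placement of the ReLU activations are illustrative assumptions; the actual configurations are given in Table 1.

```python
import torch.nn as nn

class TentativeReconstruction(nn.Module):            # N_R1
    def __init__(self, channels=(2, 6, 12, 25)):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            # 5x5 kernels with linear activations; padding keeps the image size.
            layers.append(nn.Conv2d(c_in, c_out, kernel_size=5, padding=2))
        self.body = nn.Sequential(*layers)

    def forward(self, coded):                         # (batch, N, U, V)
        return self.body(coded)                       # (batch, M, U, V)

class Refinement(nn.Module):                          # N_R2: VDSR-style residual network
    def __init__(self, num_views=25, mid_channels=64, depth=19):
        super().__init__()
        layers = [nn.Conv2d(num_views, mid_channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(mid_channels, num_views, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, tentative):
        # The stack predicts only the residual, which is added back to its input.
        return tentative + self.body(tentative)
```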

2.3 Implementation and Training Details

We considered several network configurations for different values of M and N, as summarized in Table 1. The number of viewpoints in the target light field (M) was set to 25 (5 \(\times \) 5) or 64 (8 \(\times \) 8). The number of acquired images (N) was set to 1, 2, or 4. We employed different architectures for \(\mathcal{N}_{R1}\) for different values of N. The numbers connected by arrows in the table indicate the transitions in the number of channels. The design principle here is to increase the number of channels gradually from N to M. Meanwhile, \(\mathcal{N}_{R2}\) depends only on M, and the number of channels in the intermediate layers was set to 64 for both \(M=25\) and \(M=64\). All of these architectures were found through trial and error, and there might be room for improvement.

Table 1. Network architecture for reconstruction
Table 2. Datasets used for training and testing

In the training stage, each training sample is a set of 2-D image blocks with 64 \(\times \) 64 pixels that are extracted from the same positions in the multi-view images. As mentioned earlier, a viewpoint is treated as a channel, and thus, each sample is represented as an M-channel image with 64 \(\times \) 64 pixels. The training samples were collected from several light field datasets provided at [19,20,21,22], the details of which are summarized in Table 2. The three color channels of each dataset were used as three individual datasets. Moreover, we extracted many training samples from each dataset by changing the position of the 2-D blocks; we took blocks every 32 pixels in both the horizontal and vertical directions. We also augmented the data by uniformly changing the intensity level of each sample; we multiplied the original samples by 1.0, 0.9, 0.8, 0.7, 0.6, and 0.5. Samples that were too smooth, i.e., those in which the intensity was almost uniform across the 64 \(\times \) 64 pixels, were removed from the training samples. Finally, we collected 295,200 and 166,680 samples for \(M=25\) and \(M=64\), respectively. Similar to the case with VDSR [50], our network can be trained with a small training dataset because it consists of only convolutional layers with small kernels, and thus, the number of parameters does not grow too large. In the testing stage, the entire light field can be handled at once because fully convolutional networks can accept images of arbitrary size.
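The sample-extraction procedure described above can be sketched as follows (NumPy; the smoothness threshold and the function name are illustrative assumptions, not values from the paper).

```python
import numpy as np

def extract_training_samples(light_field, block=64, stride=32,
                             scales=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5),
                             smooth_threshold=1e-3):
    """light_field: array of shape (M, H, W), one single-channel image per viewpoint."""
    samples = []
    _, height, width = light_field.shape
    for top in range(0, height - block + 1, stride):
        for left in range(0, width - block + 1, stride):
            patch = light_field[:, top:top + block, left:left + block]
            # Discard nearly uniform patches, which carry little training signal.
            if patch.var() < smooth_threshold:
                continue
            # Augment by uniformly scaling the intensity of each sample.
            for s in scales:
                samples.append(s * patch)
    return np.stack(samples)                          # (num_samples, M, 64, 64)
```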

Our method was implemented by using Chainer [43] version 3.2.0, a Python-based framework for neural networks. The batch size for training was set to 15. We used the built-in Adam optimizer [51]. The number of epochs was fixed to 20 for all experiments. The training with \(M=25\) and \(N=2\) took approximately 4 h on a PC equipped with an NVIDIA GeForce GTX 1080 Ti. In the testing stage, it took about 0.5 s to reconstruct an entire light field with 840 \(\times \) 593 pixels and 25 viewpoints. Our software will be made public soon [52].
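For reference, an end-to-end training loop over the three sub-networks could look like the sketch below. It follows the settings stated above (Adam, 20 epochs, weight clipping after every mini-batch update); the learning rate and the data-loader interface are assumptions, and PyTorch again stands in for the authors' Chainer code.

```python
import torch
import torch.nn as nn

def train(acquire, reconstruct, refine, loader, epochs=20, lr=1e-4):
    params = (list(acquire.parameters()) + list(reconstruct.parameters())
              + list(refine.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.MSELoss()                          # squared-error loss of Eq. (5)
    for _ in range(epochs):
        for light_field in loader:                    # batches of (batch, M, 64, 64) blocks
            coded = acquire(light_field)              # f: simulated coded-aperture acquisition
            tentative = reconstruct(coded)            # g, part 1: tentative light field
            output = refine(tentative)                # g, part 2: VDSR-style refinement
            loss = criterion(output, light_field)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Re-impose the physical constraint on the aperture transmittance
            # after every mini-batch update, as described in Sect. 2.2.
            acquire.clip_weights()
```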

3 Experiments

3.1 Network Design and Performance

One of the key factors of our network is the noise that is added to the acquired images. Ideally, the noise level assumed in the training stage should be close to that in real situations. However, it is impractical to train different networks for all possible noise levels. It is more likely that a network trained with a certain noise level will be applied to different noise levels. To see the effect of this discrepancy, we used different noise levels in the training and testing stages. Here, the noise level \(\sigma \) was defined with respect to the intensity range [0, 1] of the acquired images \(y_n(u,v)\). The results obtained with the Dragon and Bunnies dataset and \(N=2\) are summarized in Fig. 3 (left). When the noise in the testing stage was larger than that in the training stage, the reconstruction quality degraded considerably. Meanwhile, in the opposite case, the degradation was moderate. Therefore, we concluded that assuming a high noise level in the training stage is a safe strategy. We use \(\sigma \) = 0.005 in the training stage in the remainder of this paper.

Fig. 3. Performance analysis of our method with different noise levels (left) and different network architectures and training schemes (right)

Fig. 4. Obtained aperture patterns (left) and PSNR (right)

Fig. 5. Central views reconstructed by our method. Differences from ground truth (magnified by 5) are also presented

We next compared different network architectures and training schemes. We tested three cases: (a) \(\mathcal{N}_{A} + \mathcal{N}_{R1}\), (b) \(\mathcal{N}_{A} + \mathcal{N}_{R1} + \mathcal{N}_{R2}\), where \(\mathcal{N}_{A} + \mathcal{N}_{R1} \) and \(\mathcal{N}_{R2}\) were trained individually and then concatenated and fine-tuned, and (c) \(\mathcal{N}_{A} + \mathcal{N}_{R1} + \mathcal{N}_{R2}\), where the entire network was trained end-to-end from scratch. As shown in Fig. 3 (right), the importance of the refinement network \(\mathcal{N}_{R2}\) is clear. Meanwhile, the difference between (b) and (c) was negligible despite the fact that (b) required additional complexity for the training stage. Therefore, we adopted training strategy (c) for the remainder of this paper.

We then evaluated the performance of our method over several datasets. The number of viewpoints M was set to 25 (5 \(\times \) 5) and 64 (8 \(\times \) 8), and the number of acquired images N was set to 1, 2, and 4. The noise level for testing was set to 0.005. The obtained aperture patterns and quantitative scores in PSNR are summarized in Fig. 4. Several reconstructed images are presented in Fig. 5. From these results, we can see that the reconstruction quality improved as the number of acquired images increased. With \(N=1\), our method could not correctly reproduce the disparities among the viewpoints. However, by changing N from 1 to 2, the reconstruction quality improved significantly; fine details (see the close-ups) were successfully reproduced, and the correct amount of disparity was observed among the viewpoints. Meanwhile, the quality improvement from \(N=2\) to \(N=4\) was moderate. We observed the same tendency with both \(M=25\) and \(M=64\), though \(M=64\) was more challenging than \(M=25\) due to the larger number of viewpoints and the smaller number of training samples.

Fig. 6. Analysis on aperture patterns with \(M=25\) and \(N=2\)

3.2 Analysis on Aperture Patterns

To analyze the effect of the aperture patterns, we conducted another experiment. We used several sets of aperture patterns with \(M=25\) and \(N=2\), as summarized in Fig. 6(a). For (i), we used a set of aperture patterns that was trained normally, which is denoted as “ours (normal)” or simply “ours”. For (ii)–(vii), we fixed the aperture patterns during the training stage; in this case, only the reconstruction network was optimized over the training dataset. We generated a set of random aperture patterns for (ii) and uniformly scaled its brightness for (iii)–(v). The patterns for (vi) correspond to light field reconstruction from a pair of stereo images. The patterns in (vii) took randomized binary values, which serves as a simulation of binary apertures. Finally, the patterns in (viii) were trained over the dataset under the constraint that the second pattern was the first one rotated by 90\(^\circ \). This was easily implemented by using the same convolution kernels for the first and second image acquisitions but permuting the order of channels in the input data, as sketched below. This pattern set may have practical value because it can be implemented by using a static aperture pattern (such as a printed one, instead of an expensive electronic device such as an LCoS) with a simple rotation mechanism. This is denoted as “ours (rotated).” We set the noise level for testing to 0.005.
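The channel permutation that realizes the 90\(^\circ \)-rotation constraint can be computed as follows (NumPy sketch; the rotation direction is a convention choice and the helper name is ours).

```python
import numpy as np

def rot90_permutation(side=5):
    """Index permutation mapping a flattened side-by-side aperture grid to its
    90-degree rotation."""
    idx = np.arange(side * side).reshape(side, side)
    return np.rot90(idx).reshape(-1)

# Second acquisition: reuse the first shot's 1x1 kernel, but feed it the views
# reordered by this permutation, e.g. light_field[:, rot90_permutation(5), :, :].
```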

Fig. 7. Comparison with compressive-sensing based methods [33, 35]. (Top) quantitative reconstruction quality. (Bottom) reconstructed images and difference from the ground truth (magnified by 5)

Several interesting observations can be made from the results shown in Fig. 6(b), (c). First, a good reconstruction quality was attained even with non-optimized aperture patterns such as (ii) and (vii), probably because the reconstruction network is capable of adapting to different aperture patterns. Second, the brightness of the apertures (the total amount of light) is important for the quality; the degradation observed in (iii)–(vi) can be attributed to the poor signal-to-noise ratio of the acquired images. Finally, the rotated patterns in (viii) achieved quality close to that of the normally trained patterns (i), suggesting the possibility of developing a camera with a rotatable static aperture in the future.

3.3 Comparison with State-of-the-Art Methods

We compared our method against four state-of-the-art methods [6, 33, 35, 36], for which we used the software provided by the respective authors. To make the comparison fair, the configurations for the methods were kept unchanged from the original implementations as much as possible. To see the results more closely, please refer to the supplementary video.

We first present a comparison with two compressive-sensing based methods: Marwah’s method [33], which uses a sparse dictionary, and Yagi’s method [35], which uses the most significant basis vectors selected by PCA or NMF. The number of viewpoints M was set to 25 (5 \(\times \) 5) in accordance with the original implementations. The noise level was set to \(\sigma = 0\) and \(\sigma =0.005\). The results are summarized in Fig. 7. As shown in the graphs of the figure, our method performed clearly better than Yagi’s method. Marwah’s method achieved slightly better scores than ours with \(N=1\), but with \(N=2\) and \(N=4\), it was inferior to our method by significant margins. Moreover, the main drawback of Marwah’s method is its computational complexity; it took 11–90 h to reconstruct a single light field on our desktop computer. Shown at the bottom are the reconstructed central views and the differences from the ground truth obtained with Dragon and Bunnies, \(N=2\), and \(\sigma \) = 0.005. Our method achieved better visual quality than the others.

Fig. 8. Comparison with view-synthesis based methods [6, 36]. (Top) quantitative reconstruction quality. (Bottom) reconstructed images and difference from ground truth (magnified by 5) obtained with Dino and Kitchen

We next show a comparison with two view-synthesis based methods. Kalantari’s method [6] can reconstruct a light field with 8 \(\times \) 8 viewpoints given the four corner images. It was reported that this method performed better than several state-of-the-art view synthesis methods. Meanwhile, Srinivasan’s method [36] can reconstruct a light field with 8 \(\times \) 8 viewpoints from only one image corresponding to the central viewpoint (more precisely, the viewpoint located at the bottom-right nearest neighbor of the center). No noise was applied for these two methods. As for our method, the number of viewpoints M was likewise set to 64 (8 \(\times \) 8), but noise with \(\sigma =0.005\) was applied. The results are summarized in Fig. 8. Srinivasan’s method produced natural-looking images in around 30–60 s of computation time, but it failed to reproduce correct disparities among the viewpoints. Kalantari’s method achieved better quality than Srinivasan’s method, but it required four images as input and took about 30 min without GPU acceleration. Our method achieved overall better reconstruction quality than the other two with much less computation time (about 1 s).

3.4 Experiment Using Real Camera

Finally, we conducted an experiment using a real coded-aperture camera. We adopted the same hardware design as that reported in [31, 33, 49]. The resolution of the camera (FLIR GRAS-14S5C-C) was 1384 \(\times \) 1036 pixels, which corresponds to the spatial resolution of the light field acquired with it. The exposure time was set to 40 ms. We used a Nikon Rayfact lens (25 mm F1.4 SF2514MC). The aperture was implemented as an LCoS display (Forth Dimension Displays, SXGA-3DM) with 1280 \(\times \) 1024 pixels. We divided the central area of the LCoS display into \(5 \times 5\) regions, each with 150 \(\times \) 150 pixels. Accordingly, the angular resolution of the light field was 5 \(\times \) 5. We found that light rays could not be completely blocked by the black aperture (all 25 regions set to 0), which means that an image acquired with the black aperture was not completely black. Therefore, we subtracted the image acquired with the black aperture from each of the acquired images before using them for reconstruction.
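This dark-frame correction amounts to a simple per-pixel subtraction, for example (NumPy sketch; clipping residual negative values to zero is our added assumption):

```python
import numpy as np

def subtract_black_aperture(acquired, black):
    """acquired: (N, H, W) images taken with the coded aperture patterns;
       black:    (H, W) image taken with all 25 aperture regions set to 0."""
    corrected = acquired - black[np.newaxis, :, :]
    return np.clip(corrected, 0.0, None)              # clip residual negatives to zero
```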

The experimental setup and reconstructed light fields are presented in Fig. 9. We used four sets of aperture patterns: ours (normal), ours (rotated), NMF [35], and sequential pinholes. For the former three, the number of acquired images was set to 2, and the reconstruction was performed with the trained reconstruction networks for the first two and with the method presented in [35] for the third. The last one, sequential pinholes, corresponds to the direct acquisition of 25 images, where only 1 of the 25 regions on the aperture was opened at a time. As shown at the bottom of Fig. 9, our method achieved a striking reconstruction quality with only two acquired images. The results with our methods were even better than those obtained with 25 sequentially acquired images, because pinhole patterns suffer from noise due to a lack of light. We also observed natural disparities among the viewpoints in the reconstructed results. Please refer to the supplementary video for more details.

Fig. 9. Experiment using real coded-aperture camera

4 Conclusion

We proposed a learning-based framework for acquiring a light field through a coded aperture camera that reduces the number of acquired images to only a few. This framework was formulated from the perspective of an auto-encoder, which was implemented as a stack of fully convolutional layers and trained end-to-end by using a collection of training samples. We experimentally showed that our method can successfully learn good image-acquisition and reconstruction strategies. Using our method, light fields consisting of 5 \(\times \) 5 or 8 \(\times \) 8 images can be reconstructed with high quality from only a few acquired images. Moreover, our method achieved superior performance over several state-of-the-art methods. We also applied our method to a real prototype camera to show that it is capable of capturing a real 3-D scene.

Our future work includes several directions. The performance of our method could be improved by tweaking the architecture of the reconstruction network or by increasing the number of training samples. The squared-error loss used in our method could be replaced with, or supplemented by, loss functions that are closer to perceptual image quality, such as a VGG-based perceptual loss or the framework of a generative adversarial network (GAN) [53]. The development of a camera with a rotatable static aperture pattern would also be an interesting direction.