
1 Introduction

Automatic image colorization is a fundamental and extensively studied image processing problem in computer vision. The task consists of designing an algorithm that takes a gray-scale image as input and outputs a colorized version of the same image. The challenging part is to colorize it in a plausible and visually pleasing way. Many systems have been developed over the years, exploiting a wide variety of image processing techniques, but recently the image colorization problem, like many other problems in computer vision, has been approached with deep-learning methods. From a machine learning perspective, colorization is a generative problem. Generative techniques, such as Generative Adversarial Networks (GANs) [7], are therefore suitable for such a task. In particular, conditional GAN (cGAN) models seem especially appropriate for this purpose, since their structure allows the network to learn a mapping from an image x and (only if needed) a random noise vector z to a generated output image y. Standard GANs, on the contrary, only learn the mapping from the noise z to y.

Fig. 1. Example images generated using MetalGAN for 100 epochs and 100 meta-iterations. From left to right: grayscale image, ground truth, output of the network. The example images belong to two different clusters.

As with many deep-learning techniques, training a GAN or a cGAN requires a large amount of images. Large datasets usually grant a great diversity among images, allowing the network to better generalize its results. Nevertheless, having a huge number of images is often not feasible in real-world applications, or it simply requires too much storage space for an average system and high training times. Hence, using a smaller dataset eases the porting of current deep-learning colorization technologies to a more accessible level and helps achieve a better understanding of the colorization training process.

For these reasons, one of the aims of this work is to achieve good performance in the colorization task using a small number of images compared to standard datasets. In few-shot learning, a branch of the deep-learning field, the goal is to learn from a small number of inputs, or from a single input in the ideal case (one-shot learning): the network is exposed to a small quantity of examples and must be able to infer something when faced with a new example. This problem requires a high generalization capability of the network, which is a very difficult task and an open challenge in deep network research.

Recently, some novel and interesting ideas have highlighted a possible path towards a better generalization ability of the network. These ideas are based on the concept of learning to learn, i.e., adding a meta-layer of learning information above the usual learning process of the network. Generalization is achieved by introducing the concept of a task distribution instead of a single task, and the concept of episodes instead of instances. A task distribution is the family of different tasks to which the model has to adapt. Each task in the distribution has its own training and test sets, and its own loss function. A meta-training set is composed of training and test image samples, called episodes, belonging to different tasks. During training, these episodes are used to update the initial parameters (weights and biases) of the network in the direction of the sampled task. Results of meta-learning methods investigated in the literature are encouraging and achieve good performance on some few-shot datasets. For this reason, and since the goal of this work is to colorize images with a small number of examples, a meta-learning algorithm was employed to tune the network parameters on many different tasks. The chosen algorithm is Reptile [15], combined with an adversarial colorization network composed of a Generator G and a Discriminator D. In other words, the proposed method approaches the colorization problem as a meta-learning one. Intuitively, Reptile works by randomly selecting tasks, training a fast network on each task, and finally updating the weights of a slow network.

In this proposal, tasks are defined as clusters of the initial dataset. In fact, a typical initial dataset is unlabeled and contains a wide variety of images, usually photographs. In this setting, for example, one task could be to color all seaside landscapes, and another to color all cat photos. These tasks refer to the same problem and use the same dataset, but they are very different at a practical level. A very large amount of images could overcome the problem, showing as many seasides and cats as the network needs in order to differentiate between them. The troubles start when only a small dataset is available: such a dataset might not contain enough images for the network to learn to perform both example colorizations decently. The idea is therefore to treat different classes of images as different tasks. To divide the dataset into tasks, features were extracted using a standard approach (e.g., a Convolutional Neural Network, CNN) and the images were clustered with K-means. Each cluster is then considered a single task. During training, Reptile tunes the network G on the specific task corresponding to an input query image and therefore adapts the network to a specific colorization class.

Several problems and questions emerge when approaching few-shot colorization. First of all, how should the clustering be performed in order to generate a coherent and meaningful distribution of tasks? Does task specialization really improve the colorization, or is the act of automatically coloring a photo independent of the subject of the photo itself? Second, how should the meta-learning algorithm be combined with cGAN training, also to prevent overfitting the generator on few images? Last, since the purpose of this work is not to propose a solution to the colorization problem in general, but to propose a method that substantially reduces the amount of images involved in training without losses, or with only minor losses, with respect to state-of-the-art results, how should the actual performance of the network be evaluated against other approaches? In particular, what factors should be taken into account to state an improvement, not in colorization per se, but in few-shot colorization? In the light of these considerations, the contributions of this work are summarized as follows:

  • A new architecture, called MetalGAN, that combines meta-learning techniques and cGANs is proposed, specifying in detail how the generator and the discriminator parameters are updated;

  • A clustering procedure and a novel algorithm are described, and their ability to tackle image-to-image translation problems is highlighted;

  • An empirical demonstration, through visual results, that a very good colorization can be achieved even with a small dataset available during training;

  • A precise comparison between two modalities (i.e., our algorithm and cGAN-only training) is performed experimentally, using the same network model and hyper-parameters.

2 Related Work

Image Retrieval: Since we need the clustering to be as accurate as possible, we paid particular attention to recent image retrieval techniques that focus on obtaining optimal descriptors. Recently, deep learning has greatly improved the feature extraction phase of image retrieval. Some of the most interesting papers on the subject are [2, 6, 19, 20, 33] and, in particular, MAC descriptors [27], which we ended up using.

Conditional GANs: When a GAN generator is conditioned not only on a random noise vector, but also on more complex information such as text [21], labels [13], and especially images, the model to use is a conditional GAN (cGAN). cGANs allow better control over the output of the network and are therefore very suitable for many image generation tasks. In particular, cGANs conditioned on images were used both in a paired [9] and unpaired [35] way, to produce complex textures [32], to colorize sketches [25] or images [3], and more recently to produce outstanding image synthesis results [16, 30]. In this work, the output must be conditioned on the input gray-scale image, in order to train the network to generate only the colors of the image, not its shapes or the image itself.

Meta-Learning: The most relevant meta-learning studies for this work are the Model-Agnostic Meta-Learning (MAML) [5] and Reptile [15] algorithms. In particular, we incorporate the Reptile algorithm inside the training phase, allowing the parameters of the generator to be updated in the same fashion as Reptile does. A similar work using MAML is MetaGAN [34], where a generator is used to enhance classification models so that they can discriminate between real and fake data, providing generated samples for a task. The main purpose of MetaGAN is not to improve a generative network, but to perform better few-shot classification, using generated images to sharpen the decision boundary of the problem. On the contrary, in our approach, the generator is fed with task-related images, and the meta-learner is used to enhance the generator itself, instead of a few-shot classifier. Both MAML and Reptile are based on hyper-parameterized gradient descent, and they learn how to initialize network parameters. Other types of meta-learners work differently. For example, many algorithms learn how to parameterize the optimizer of the network [8, 18], or the optimizer itself is a network [1, 12, 31]. Moreover, one of the most general approaches is to use a recurrent neural network trained on the episodes of a set of tasks [4, 14, 26, 29]. The most interesting result of these meta-learners is the achievement of high performance on small datasets [10, 22, 28], or on datasets used for few-shot learning (e.g., Omniglot) [11].

3 Algorithm

This section describes the proposed algorithm in detail. Each subsection focuses on a different aspect of the method, and the complete architecture is explained last.

3.1 Clusterization of the Dataset

In order to exploit Reptile for image colorization, we need to treat our image dataset as if it were composed of a series of separate tasks. For this reason, we extract features from each image in the dataset using the activation_43 layer of ResNet50. Then, we compute MAC descriptors by applying max pooling and L2 normalization to the features. Given this set F of MAC descriptors, we first apply Principal Component Analysis (PCA) to reduce the feature dimension from 2048 to 512, and then apply K-means. K-means produces k clusters, and therefore divides the dataset into k tasks.
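A minimal sketch of this clustering step is given below, assuming torchvision and scikit-learn, and using the last convolutional block of ResNet50 as a stand-in for the activation_43 layer mentioned above (the exact layer choice and the helper names are illustrative, not the original implementation):

```python
import torch
import torch.nn.functional as F
from torchvision import models
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

backbone = models.resnet50(pretrained=True)
# Keep only the convolutional trunk: output feature maps of shape (B, 2048, H, W).
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

def mac_descriptor(image_batch):
    """MAC: global max pooling over the conv feature map, then L2 normalization."""
    with torch.no_grad():
        fmap = feature_extractor(image_batch)   # (B, 2048, H, W)
        mac = fmap.amax(dim=(2, 3))             # max pooling over spatial dims -> (B, 2048)
        return F.normalize(mac, p=2, dim=1)     # L2 normalization

def cluster_dataset(descriptors, k=64):
    """descriptors: (N, 2048) tensor of MAC descriptors for the whole dataset."""
    reduced = PCA(n_components=512).fit_transform(descriptors.cpu().numpy())
    labels = KMeans(n_clusters=k).fit_predict(reduced)   # cluster index = task index
    return labels
```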

Fig. 2. Some results of the clustering. It is evident that the images within each cluster have many features in common.

Hence, we expect to find, in each of these clusters, images which are similar to each other according to their features. For example, a cluster could contain images with grass, another one images with pets, and so on. A visual confirmation of this assumption is shown in Fig. 2.

3.2 cGAN

As the generator architecture, we chose the U-Net [23], which is one of the most common choices for this type of task, and we built the discriminator following the classic DCGAN architecture [17], i.e., with each module composed of Convolution, Batch Normalization, and ReLU layers. Lab is the color space used in this work, because it is the one that best approximates human vision; therefore, the generator takes as input a grayscale image \(x_i\) (the L channel) and outputs the ab channels. Then, we concatenate input and output to obtain the final result.
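The following hedged sketch illustrates the Lab-space inference described above: the generator maps the L channel to the ab channels, and the final Lab image is obtained by concatenation (the `generator` module is assumed to have a 1-in/2-out channel signature; this is not the authors' code):

```python
import torch

def colorize(generator: torch.nn.Module, L: torch.Tensor) -> torch.Tensor:
    """L: (B, 1, H, W) lightness channel; returns a (B, 3, H, W) Lab image."""
    ab = generator(L)                 # predicted a and b channels, shape (B, 2, H, W)
    return torch.cat([L, ab], dim=1)  # concatenate the input L with the generated ab
```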

We use the L1 loss to model the low frequencies of our output images and the adversarial loss to model the high frequencies, similarly to the pix2pix architecture proposed by Isola et al. [9].

Therefore, our objective function becomes:

$$\begin{aligned} \mathcal {L} = \mathbf w _\mathbf{adv }\mathcal {L}_\mathbf{adv } + \mathbf w _\mathbf{L1 }\mathcal {L}_\mathbf{L1 } \end{aligned}$$
(1)

where \(\mathbf w _\mathbf{adv }\) and \(\mathbf w _\mathbf{L1 }\) are the weights assigned to the different losses, because we want the L1 loss to have a stronger effect than the adversarial loss during training.
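A minimal sketch of this weighted objective in Eq. (1), assuming a BCE-based adversarial loss on the discriminator logits (the criterion choice and names are assumptions, not the original code):

```python
import torch
import torch.nn as nn

adv_criterion = nn.BCEWithLogitsLoss()  # adversarial term, computed on D's logits
l1_criterion = nn.L1Loss()              # reconstruction term on the ab channels

def generator_loss(d_logits_fake, fake_ab, real_ab, w_adv=1.0, w_l1=100.0):
    """Weighted objective of Eq. (1): w_adv * L_adv + w_L1 * L_L1."""
    real_labels = torch.ones_like(d_logits_fake)        # G tries to make D output "real"
    loss_adv = adv_criterion(d_logits_fake, real_labels)
    loss_l1 = l1_criterion(fake_ab, real_ab)
    return w_adv * loss_adv + w_l1 * loss_l1
```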

3.3 Meta-learning

As previously mentioned, we approach the generator training with a Reptile meta-learner. This means that, once a task has been chosen, for a fixed number of meta-iterations the task is sampled and the gradient of the generator loss function (1) is evaluated to perform an SGD optimization step. Fixing the initial generator parameters as \(\theta _G\), the inner-loop training defines a sequence \(\left( \tilde{\theta }_G^{(j)}\right) _{j = 0}^{N_{\mathrm {meta-iter}}}\), where \(\tilde{\theta }_G^{(0)} = \theta _G\). Hence, it updates the \(\tilde{\theta }_G^{(j)}\) parameters in the direction of the task. Once the inner loop is completed, the parameters are re-aligned with the Reptile rule:

$$\begin{aligned} \theta _G \leftarrow \theta _G + \lambda _{ML}\left( \tilde{\theta }^{(N_{\mathrm {meta-iter}})}_G - \theta _G\right) \end{aligned}$$
(2)

where \(\lambda _{ML}\) is the stepsize hyperparameter of Reptile.
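A minimal sketch of the re-alignment rule in Eq. (2), applied parameter-wise to the slow generator weights (function and argument names are illustrative, not the original code):

```python
import torch

@torch.no_grad()
def reptile_update(slow_params, fast_params, stepsize_ml=1e-3):
    """theta_G <- theta_G + lambda_ML * (theta_G_tilde - theta_G), as in Eq. (2)."""
    for theta, theta_tilde in zip(slow_params, fast_params):
        theta.add_(stepsize_ml * (theta_tilde.detach() - theta))
```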

Algorithm 1. The MetalGAN training process.

3.4 Complete Architecture of the System

The MetalGAN training process is detailed in Algorithm 1. The algorithm is parameterized by the number of epochs \(N_{\mathrm {epochs}}\), the number of meta-iterations \(N_{\mathrm {meta-iter}}\), the generator and discriminator learning rates \(\lambda _G\) and \(\lambda _D\), the Reptile stepsize \(\lambda _{ML}\), and the loss weights \(\mathbf w _\mathbf{adv }\) and \(\mathbf w _\mathbf{L1 }\). During training, we randomly select a query set \(Q = \{q_0,\dots ,q_z\}\). Each query \(q_i\) corresponds to a single cluster \(K(q_i)\). It is worth noting that two queries could point to the same cluster. Having this set, we are able to pick z different images at each epoch by sampling the task \(\tau (q_i)\), and to update the generator G as shown in Fig. 3.

Fig. 3. The MetalGAN architecture: the query \(q_i\) points to a cluster \(K(q_i)\) that is used as a task to train the generator G with Reptile.

The generator is updated by evaluating the gradients of its loss functions (adversarial loss \(\mathcal {L}_{\mathrm {adv}}\) and L1 loss \(\mathcal {L}_{\mathrm {L1}}\)) and adding them to obtain the error \(\varepsilon _G\). Then, the network parameters obtained in the inner loop, \(\tilde{\theta }_G^{(N_{\mathrm {meta-iter}})}\), are used to update the outer-loop generator parameters \(\theta _G\). In the last step, all images of the task \(\tau (q_i)\) are used to train the discriminator, computing the gradients of the discriminator adversarial and L1 losses and adding them to obtain the discriminator error \(\varepsilon _D\). The discriminator parameters \(\theta _D\) are updated accordingly.
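The structure of one MetalGAN epoch can be sketched as follows, reusing the generator_loss and reptile_update sketches from the previous subsections; sample_task and train_discriminator_on are hypothetical helpers, and the inner-loop optimizer choice is an assumption rather than a transcription of Algorithm 1:

```python
import copy
import torch

def metalgan_epoch(G, D, queries, clusters, n_meta_iter=100,
                   lr_g=1e-4, stepsize_ml=1e-3):
    for q in queries:
        task_images = clusters[q]                      # cluster K(q) used as task tau(q)

        # Inner loop: a fast copy of G is trained on the sampled task.
        G_fast = copy.deepcopy(G)
        opt_fast = torch.optim.SGD(G_fast.parameters(), lr=lr_g)
        for _ in range(n_meta_iter):
            L, ab_real = sample_task(task_images)      # hypothetical sampler: L channel + ground-truth ab
            ab_fake = G_fast(L)
            d_logits = D(torch.cat([L, ab_fake], dim=1))
            loss_g = generator_loss(d_logits, ab_fake, ab_real)
            opt_fast.zero_grad()
            loss_g.backward()
            opt_fast.step()

        # Outer loop: Reptile re-alignment of the slow generator (Eq. 2).
        reptile_update(list(G.parameters()), list(G_fast.parameters()), stepsize_ml)

        # Finally, the discriminator is trained on all images of the task.
        train_discriminator_on(D, G, task_images)      # hypothetical helper
```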

4 Experimental Results

For our experiments, we chose a slightly modified version of Mini-Imagenet [18]. Since our goal is not classification, we created our training and test sets using only images from the 64 classes contained in the training split of Mini-Imagenet. The total number of images in the dataset is 38,392. We define two sets of experiments: the first consists of training the cGAN without Reptile, and the second introduces Reptile and the feature clustering. For both of them, we set \(\mathbf w _\mathbf{adv } = 1\) and \(\mathbf w _\mathbf{L1 } = 10^2\). The learning rates of both the generator and the discriminator were set to \(\lambda _G = \lambda _D = 10^{-4}\). For K-means clustering, the parameter k was set to 64 in order to obtain clusters that are as disjoint as possible. For Reptile, we use 100 meta-iterations and a stepsize \(\lambda _{ML} = 10^{-3}\). 10% of the dataset images are used as query images. The number of epochs was set to 200. All tests were executed on an Nvidia 1080 Ti GPU.
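For reference, the hyper-parameters listed above can be collected in a single configuration (names are illustrative, not the authors' code):

```python
config = {
    "n_epochs": 200,
    "n_meta_iter": 100,       # Reptile inner-loop iterations
    "w_adv": 1.0,             # adversarial loss weight
    "w_l1": 1e2,              # L1 loss weight
    "lr_generator": 1e-4,
    "lr_discriminator": 1e-4,
    "stepsize_ml": 1e-3,      # Reptile stepsize lambda_ML
    "kmeans_k": 64,           # number of clusters, i.e., tasks
    "query_fraction": 0.10,   # 10% of dataset images used as queries
}
```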

Fig. 4. Results obtained using the cGAN only. Each group of three images is composed of the input of the network (grayscale image), the ground truth, and the output of the network.

4.1 cGAN Results

Fig. 4 reports some results produced by training the cGAN without the clustering and without Reptile, i.e., with a standard adversarial algorithm. The available training data are very scarce (\(\sim \)38 k images compared to the 1.3 M of the whole Imagenet dataset) and, for this reason, the network is not able to produce compelling results. In particular, the network often fails to understand the difference between foreground and background objects and therefore applies colors without following edges and borders. In general, the cGAN finds it very difficult to propagate color correctly and tends instead to apply uneven patches of color. Finally, due to the scarcity of data, the network cannot generalize in an acceptable way: the colors in the outputs are not sharp; on the contrary, the produced results are very blurry, and colors are often applied almost randomly.

4.2 MetalGAN Results

Results of MetalGAN are shown in Fig. 5. It is immediately evident how Reptile improves the results of the cGAN. In particular, colors are sharper and brighter. The reason is that Reptile tunes the generator on each cluster and therefore allows the network to focus on the predominant colors of each task; as a consequence, even with few examples the produced results are compelling and plausible. For example, in a task with many images containing grass or plants, there will be an abundance of different shades of green, and thus the network will learn very quickly to reproduce similar colors over the test set. On the contrary, an image that is very different from the majority of images in its task could be colorized poorly. This problem, however, is not very frequent, since the difference has to be very large in order to produce poor results.

Other examples can be found at implab.ce.unipr.it/?page_id=1011.

4.3 Quantitative Evaluation

In order to evaluate the quality of the generated samples, we used the Inception Score [24], since it is a good proxy for human judgement. We calculated the Inception Score of images generated by both the cGAN and MetalGAN (see Table 1). The score also measures the diversity of the generated images, so a higher score is better. The MetalGAN approach significantly improves over the standard cGAN score.
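For reference, a hedged sketch of how the Inception Score can be computed with a pretrained Inception v3 network (this is a standard formulation of the metric, not the authors' evaluation code; `images` is assumed to be a batch of generated RGB images already resized and normalized for Inception v3):

```python
import torch
import torch.nn.functional as F
from torchvision import models

def inception_score(images, splits=10):
    """Mean and std of exp(E_x[KL(p(y|x) || p(y))]) over `splits` splits."""
    inception = models.inception_v3(pretrained=True, transform_input=False).eval()
    with torch.no_grad():
        probs = F.softmax(inception(images), dim=1)          # p(y|x) for each image
    scores = []
    for chunk in probs.chunk(splits):
        p_y = chunk.mean(dim=0, keepdim=True)                # marginal p(y) over the split
        kl = (chunk * (chunk.log() - p_y.log())).sum(dim=1)  # KL(p(y|x) || p(y)) per image
        scores.append(kl.mean().exp())
    scores = torch.stack(scores)
    return scores.mean().item(), scores.std().item()
```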

Fig. 5. Results of MetalGAN. Each group of three images consists of the grayscale input given to the network, the ground truth, and the output of the network. The four images shown belong to different clusters.

Table 1. Inception Scores computed on generated images from the MiniImageNet dataset; mean and standard deviation are reported for both cGAN and MetalGAN results.

5 Conclusions

In standard adversarial generative settings, having few images at disposal during training leads to a complete failure in colorization. In this paper, we proposed a novel architecture, called MetalGAN, which mixes adversarial training with meta-learning techniques. As shown by the experimental results, even with few images the network trained with MetalGAN was able to produce a well-looking colorization. The clustering of the dataset and the use of clusters as tasks help direct the colorization towards the most probable suitable colors for the image, and meta-learning allows the network to be trained on few examples. As future developments, we plan to include the discriminator in the meta-learning training phase, and to test the method on other small datasets in order to prove the generalization capability of the proposed MetalGAN architecture.