
1 Introduction

Photo sharing services such as Flickr and Instagram are continuously evolving, progressively introducing new features for their ever-growing user bases. One of the most popular features is the option to apply photographic filters, which allow users to adjust the mood of their pictures in a completely automatic way. Several preset filters are available, corresponding to various image transformations, mostly related to shifts in the color distribution, variations in brightness and contrast, and the like.

In this work we investigate the problem of automatically detecting the application of photographic filters commonly used in photo sharing services. We show that it is possible to reliably distinguish between original and processed images. Moreover, we show that it is possible to identify, with very high confidence, which filter has been used. The objective of this preliminary work is twofold. On the one hand, it shows that certain kinds of distortions can be reliably identified, paving the way for future investigation of the automatic classification of processed vs. unprocessed images; on the other hand, it makes it possible to take into account the influence of photographic filters in other computer vision tasks. In fact, Chen et al. [9] showed that state-of-the-art image recognition approaches using Convolutional Neural Networks (CNNs) fail to correctly classify social media photos (especially from Instagram), where many pictures have been edited with photographic filters.

The approach we investigate in this paper is based on the use of Convolutional Neural Networks trained on a large dataset of images processed with 22 different photographic filters designed to reproduce those available on Instagram. We experimented with different architectures taken from the image recognition literature and we show how they can be adapted to achieve a very high classification rate.

The paper is organized as follows: Sect. 2 reports all the information about the photographic filters and the data used in the experimentation; Sect. 3 illustrates the classification strategy; Sect. 4 reports the results obtained and discusses their implications; Sect. 5 concludes the paper by summarizing our findings and by suggesting future directions of research.

2 Photographic Filters

In this work we consider the following 22 types of Instagram-like filters (descriptions are taken from the Instagram website):

  1. 1977: the increased exposure with a red tint gives the photograph a rosy, brighter, faded look;
  2. Amaro: adds light to an image, with the focus on the center;
  3. Apollo: lightly bleached, cyan-greenish color, some dusty texture;
  4. Brannan: increases contrast and exposure and adds a metallic tint;
  5. Earlybird: gives photographs an older look with a sepia tint and warm temperature;
  6. Gotham: produces a black and white high contrast image, with bluish undertones;
  7. Hefe: high contrast and saturation, with a similar effect to Lo-Fi but not quite as dramatic;
  8. Hudson: creates an “icy” illusion with heightened shadows, cool tint and dodged center;
  9. Inkwell: direct shift to black and white, with no extra editing;
  10. Kelvin: increases saturation and temperature to give the image a radiant “glow”;
  11. Lo-Fi: enriches color and adds strong shadows through the use of saturation and “warming” the temperature;
  12. Mayfair: applies a warm pink tone, subtle vignetting to brighten the photograph center, and a thin black border;
  13. Nashville: warms the temperature, lowers contrast and increases exposure to give a light “pink” tint, making it feel “nostalgic”;
  14. Poprocket: adds a creamy vintage and retro color effect;
  15. Rise: adds a “glow” to the image, with softer lighting of the subject;
  16. Sierra: gives a faded, softer look;
  17. Sutro: burns photo edges, increases highlights and shadows dramatically with a focus on purple and brown colors;
  18. Toaster: ages the image by “burning” the center and adds a dramatic vignette;
  19. Valencia: fades the image by increasing exposure and warming the colors, to give it an antique feel;
  20. Walden: increases exposure and adds a yellow tint;
  21. Willow: a monochromatic filter with subtle purple tones and a translucent white border;
  22. X-Pro II: increases color vibrance with a golden tint, high contrast and slight vignette added to the edges.

An example of the application of the 22 filters to an input image is reported in Fig. 1. Each filter is implemented as a sequence of basic image processing operations, such as: adjustment of color levels; adjustment of color curves (i.e. nonlinear channel transformations); brightness and contrast adjustment; addition of blur and/or noise; hue, saturation and lightness adjustment; addition of a vignette; use of a color layer (to generate a color cast); use of a gradient layer; conversion to black & white; and addition of flare. A schematic view of which basic operations are used for each filter is reported in Table 1.
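Each filter can thus be seen as a short pipeline of elementary operations. Below is a minimal sketch of such a pipeline using NumPy and Pillow; the recipe (a hypothetical warm, vignetted filter) and all numeric parameters are illustrative assumptions, not the implementation of any actual Instagram preset.

```python
import numpy as np
from PIL import Image, ImageEnhance

def warm_vintage(img: Image.Image) -> Image.Image:
    """Hypothetical filter combining several basic operations from Table 1."""
    # Brightness and contrast adjustment.
    img = ImageEnhance.Brightness(img).enhance(1.10)
    img = ImageEnhance.Contrast(img).enhance(0.90)

    arr = np.asarray(img.convert("RGB")).astype(np.float32) / 255.0

    # Nonlinear per-channel curves (gamma-like): lift red, depress blue.
    arr[..., 0] **= 0.9
    arr[..., 2] **= 1.1

    # Color cast layer: blend with a constant warm color.
    cast = np.array([1.0, 0.9, 0.7], dtype=np.float32)
    arr = 0.9 * arr + 0.1 * cast

    # Vignette: darken pixels with their distance from the image center.
    h, w = arr.shape[:2]
    y, x = np.ogrid[:h, :w]
    d = np.sqrt((x - w / 2) ** 2 + (y - h / 2) ** 2)
    arr *= (1.0 - 0.4 * (d / d.max()) ** 2)[..., None]

    return Image.fromarray((arr.clip(0.0, 1.0) * 255).astype(np.uint8))
```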

Fig. 1. Examples of the application of the 22 Instagram-like filters on one input image.

To generate a large-scale dataset, we randomly sampled 20 000 images from Places-205 [17] and applied the 22 filters to each of them, obtaining a dataset of 0.46 M images in total (original images included). The original images are randomly divided into training, validation, and test sets with ratios of 75%, 5%, and 20%, respectively.
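A minimal sketch of this generation procedure is given below; the directory layout, the helper names, and the assumption that filters is a list of 23 callables (index 0 being the identity, i.e. the unfiltered original) are illustrative, not taken from the original pipeline.

```python
import os
import random
from PIL import Image

SPLITS = {"train": 0.75, "val": 0.05, "test": 0.20}

def build_dataset(image_paths, filters, out_root, seed=0):
    # Split the *original* images first, so that all 23 variants of a
    # given photo end up in the same split and cannot leak across sets.
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(SPLITS["train"] * len(paths))
    n_val = int(SPLITS["val"] * len(paths))

    for i, path in enumerate(paths):
        split = "train" if i < n_train else "val" if i < n_train + n_val else "test"
        img = Image.open(path).convert("RGB")
        for label, apply_filter in enumerate(filters):
            out_dir = os.path.join(out_root, split, str(label))
            os.makedirs(out_dir, exist_ok=True)
            apply_filter(img).save(os.path.join(out_dir, os.path.basename(path)))
```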

Table 1. Summary of the basic image processing operations used in each of the 22 Instagram-like filters.

3 Investigated Strategy

In recent years, convolutional neural networks (CNNs) have emerged as the de facto standard for image classification. According to the deep learning paradigm, networks are composed of several layers that progressively transform the raw data into high-level information [11]. The input consists of the image pixels, and the image features are learned instead of being explicitly designed. The main drawback of CNNs is that their training requires large amounts of annotated data and computational time.

In most cases, training a network from scratch is not really necessary. In fact, it is possible to reuse a network previously trained on a different task by fine-tuning it with a relatively small amount of data. This strategy works because the features learned by the network tend to be quite general, providing information that can be exploited in various image classification domains (only the last layer needs to be adapted to the actual classification task) [1, 16].

The baseline for image classification is represented by AlexNet [10], a CNN trained on more than one million images distributed for the 2012 edition of the ImageNet Large Scale Visual Recognition Challenge [14]. Several other image classification tasks have been successfully addressed by fine-tuning AlexNet [13]. We argue that, as in other similar computer vision tasks [5, 6], simply fine-tuning a pre-trained network is not a viable solution to the problem of classifying Instagram-like filters. In fact, networks trained for object recognition tend to learn features that detect specific spatial patterns (i.e. those useful to discriminate the salient parts of the objects). For instance, the first convolutional layer usually learns to extract features that resemble Gabor filters. The network learns to be as invariant as possible with respect to variations in color, contrast, etc. In particular, such a network is expected to recognize the same objects in images that have been modified by the application of the Instagram-like filters.

To address the problem of classifying images into 23 categories (22 filters + the original image), we experimented with three different networks derived from the AlexNet, GoogLeNet, and LeNet architectures.

AlexNet is a network designed for the recognition of 1000 image categories. It includes five convolutional layers, followed by three fully-connected layers [10]. The input of the network is the image resampled to \(227\times 227\) pixels. The output of some convolutional layers is further processed by spatial max pooling, and rectified linear activations are applied to the output of both convolutional and fully-connected layers. A final softmax layer maps the activation values to a vector of 1000 probability estimates.

GoogLeNet has a very complex architecture including a large number of different layers, the majority of which perform convolutions, pooling, and rectified linear activations. Groups of convolutions form “Inception modules” that represent complex transformations of the data while requiring a relatively small number of parameters [15]. AlexNet and GoogLeNet were designed for the same classification problem and, as a result, they have the same kind of inputs and outputs (with the minor difference that GoogLeNet accepts \(224 \times 224\) input images).

LeNet was the first CNN proposed for an image classification task [12]. It was designed for the recognition of handwritten digits and includes two convolutional and two fully-connected layers. The network takes as input monochrome \(32 \times 32\) images and produces as output a vector of ten probabilities (one for each of the ten symbols in the Arabic numeral system).
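For reference, a minimal PyTorch sketch of a LeNet-style network matching this description follows; the layer widths are the classic LeNet-5 values and are an assumption here.

```python
import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 6, kernel_size=5),  # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                           # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),           # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                           # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

With the \(224 \times 224\) color inputs and 23 classes used in this work, in_channels becomes 3 and the flattened size grows from \(16 \cdot 5 \cdot 5\) to \(16 \cdot 53 \cdot 53\), which is what drives up the parameter count mentioned in the next paragraph.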

We adapted the three networks to our problem by resizing the last layer to 23 output units. In the case of LeNet, we also modified the input to accept \(224 \times 224\) color images (note that this significantly increases the number of parameters). We trained each network with 450 000 iterations of the stochastic gradient descent algorithm, where each iteration processes a mini-batch of 256 images. For AlexNet we also experimented with fine-tuning the standard version trained on the ImageNet data (to do so, we allowed the training procedure to update only the coefficients of the last layer).
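The two regimes, full training from scratch and last-layer fine-tuning, can be sketched as follows in PyTorch; the optimizer hyperparameters are assumptions, as the text only fixes the number of iterations and the mini-batch size.

```python
import torch
import torchvision

NUM_CLASSES = 23  # 22 filters + unfiltered originals

def make_alexnet(fine_tune_last_only: bool) -> torch.nn.Module:
    # Pretrained ImageNet weights only for the fine-tuning experiment;
    # the full-training experiment starts from random weights.
    net = torchvision.models.alexnet(
        weights="IMAGENET1K_V1" if fine_tune_last_only else None)
    # Resize the last fully connected layer to 23 output units.
    net.classifier[6] = torch.nn.Linear(net.classifier[6].in_features, NUM_CLASSES)
    if fine_tune_last_only:
        for name, param in net.named_parameters():
            param.requires_grad = name.startswith("classifier.6")
    return net

net = make_alexnet(fine_tune_last_only=False)
optimizer = torch.optim.SGD(
    (p for p in net.parameters() if p.requires_grad),
    lr=0.01, momentum=0.9, weight_decay=5e-4)  # assumed values

# Training loop skeleton: 450 000 SGD iterations, mini-batches of 256 images.
# for step, (images, labels) in enumerate(loader):  # loader yields batches of 256
#     loss = torch.nn.functional.cross_entropy(net(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```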

4 Experimental Results

The results obtained on the test set are shown in Table 2. All three network architectures achieve high classification rates: even the simple LeNet exceeds 94% accuracy. The best performing network is AlexNet, which obtained a 99% classification rate. GoogLeNet obtained slightly worse results (97.6%), but with little more than a tenth of the parameters. As expected, fine-tuning the original AlexNet trained for object recognition leads to quite poor results (a 67.5% classification rate).

Table 2. Summary of the networks evaluated in the experiments. For each network we report the training method (fine-tuning or full training from scratch), the depth (number of learnable layers between input and output), the number of parameters, and the classification rates obtained on the test set (the percentage of cases in which the correct class is the predicted one, and that in which it is among the five classes with the highest prediction scores).

More details can be found in Table 3, which reports the confusion matrix obtained with AlexNet on the test set. The diagonal of the matrix shows that the network was able to detect the 22 filters with very high precision: for all of them, more than 98% of the images are correctly classified. The main difficulty for the network lies in detecting the absence of any filter: only 91.6% of the original images were recognized as such. They are often classified as if the Hefe, Hudson, or Mayfair filters had been applied. These filters do not include any strong variation in the color distribution and, upon human inspection, appear quite natural. Among the filters, the highest level of confusion (about 2%) occurs between the Inkwell and Willow filters, which both produce gray-level images. In all other cases, the off-diagonal entries of the confusion matrix are below 1%.

Table 3. Confusion matrix obtained on the test set by the AlexNet architecture retrained for the filter detection task. Results are reported as percentages.
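The row-normalized percentages reported in Tables 3 and 4 can be computed from the test-set predictions with a few lines of NumPy, as in the following sketch (labels are assumed to be integer class indices in the range 0-22):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=23):
    # m[t, p] counts test images of true class t predicted as class p.
    m = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    # Normalize each row to percentages of the true class, as in Table 3;
    # the diagonal then gives the per-filter recall values quoted above.
    return 100.0 * m / m.sum(axis=1, keepdims=True)
```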

The behavior of GoogLeNet is very similar, as can be seen from the confusion matrix in Table 4. Results are in general slightly worse than those obtained by AlexNet, with the exception that the confusion between the Willow and Inkwell filters rises to 10.7%. For the sake of brevity, we omit the confusion matrices of LeNet and of the fine-tuned AlexNet.

Table 4. Confusion matrix obtained on the test set by the GoogLeNet architecture retrained for the filter detection task. Results are reported as percentages.
Fig. 2. Graphical representation of the coefficients learned for the 96 convolutions in the first level of the AlexNet architecture. (Color figure online)

As we previously argued, the poor performance of the fine-tuned network can be explained by the fact that the original training forced the network to discard information about the color distribution, which can be deceiving for object recognition but is useful for the classification of the filters. Qualitative evidence of this can be obtained by analyzing the coefficients learned by the first convolutional layer, which are shown in Fig. 2. For the standard AlexNet these coefficients form Gabor-like filters able to identify local features such as edges and corners. We obtained, instead, mostly low-pass filters sensitive to specific colors (red, green, blue, purple, yellow, and cyan, among others). A few filters seem able to detect edges at particular orientations, often with opponent colors on the two sides. Only four filters have been learned for the detection of fine details.
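The visualization of Fig. 2 amounts to tiling the first-layer kernels as small RGB images. A possible sketch with PyTorch and matplotlib follows; note that the Caffe-style AlexNet used here has 96 first-layer kernels, while torchvision's variant has 64, so the grid adapts to whatever the given network provides.

```python
import matplotlib.pyplot as plt

def show_first_layer(net, cols=12):
    # First conv layer weights, shape (num_kernels, 3, k, k).
    w = net.features[0].weight.detach().cpu()
    w = (w - w.min()) / (w.max() - w.min())  # rescale to [0, 1] for display
    rows = (len(w) + cols - 1) // cols
    _, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for ax in axes.flat:
        ax.axis("off")
    for ax, kernel in zip(axes.flat, w):
        ax.imshow(kernel.permute(1, 2, 0).numpy())  # channels-last for imshow
    plt.show()
```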

5 Conclusions

In this paper we have investigated the problem of automatically detecting the application of photographic filters commonly used in photo sharing services. To this end, a total of 22 types of Instagram-like filters were considered to generate a dataset of more than 0.46 M images from the Places-205 dataset. Three different deep Convolutional Neural Networks (CNNs) have been compared: AlexNet, GoogLeNet, and LeNet. Experimental results show that it is possible to determine with high accuracy both whether one of these filters has been applied and, if so, which one. In particular, we showed that a recognition accuracy of about 99.0% can be obtained by training an AlexNet from scratch for this specific problem.

The contribution of this preliminary work is twofold: first, it shows that it is possible to reliably identify certain types of distortions, opening the way for future investigation of the automatic classification of processed vs. unprocessed images; second, it makes it possible to take into account the influence of photographic filters in other computer vision tasks [2,3,4, 7, 8].