
1 Introduction

Image forensics is a branch of study that focuses on verifying the authenticity of images. Image forensics techniques can be broadly classified into two groups: active and passive. Active techniques rely on the insertion of a digital signature or watermark into an image, which gets modified if the image is tampered with to create a forgery. Passive techniques, on the other hand, do not require any prior knowledge about the test image and can detect forgeries using the pixel intensity values alone.

Image splicing is one of the most popular image forgery techniques; it involves the creation of a composite image using parts from two or more images. Figure 1 shows an example of a splicing forgery, where parts of one image are copied and pasted onto a different image to depict an event that never happened. There are different approaches to detecting splicing forgeries, such as compression-based [7], lighting condition-based [5], noise-based [6] and camera sensor-based [1, 2] methods. In the compression-based approach, the properties of JPEG compression are utilized to detect and localize the spliced regions. The lighting condition-based approach uses the mismatch between the lighting conditions estimated from different objects present in an image as a cue to detect the forgery. Inconsistencies in the noise profile are utilized for detecting the spliced parts in noise-based methods. In the camera sensor-based methods, different attributes of the imaging sensor, such as the color filter array (CFA) demosaicing technique and the photo response non-uniformity (PRNU) noise, are estimated and compared to reveal the presence of possible forgery.

This paper proposes a camera sensor-based method for the detection and localization of splicing forgery. The method is based on the intuition that the spliced region in an image is more likely to be captured by a camera which is different from the one used to capture the pristine regions. On the other hand, all the pixels in an authentic image are captured by a single camera only. Therefore, extracting camera-specific features from different parts of an image and checking their consistency may expose splicing forgery. The proposed method utilizes a convolutional neural network (CNN), trained for extracting camera-specific features for the camera model identification task. These features are further processed for detecting and localizing the spliced regions.

Fig. 1.

An example of image splicing. (Photo tampering throughout history, http://www.fourandsix.com/photo-tampering-history.)

2 Related Work

Camera model identification finds the model of the source camera used to capture a given image. Motivated by the success of deep learning in various computer vision tasks, researchers have focused on developing CNN-based camera model identification techniques [1, 9]. In these techniques, CNNs are trained on a large set of training images coming from different camera models. Once trained, the CNN extracts features that map the test images to their respective camera models.

Bondi et al. [2] proposed a method to detect and localize splicing forgeries using a CNN trained for camera model identification. While training the CNN, the authors considered only the patches having high texture content and excluded the homogeneous and the saturated patches, because such patches do not contain enough statistical information to extract camera-specific features [2]. Given an image under investigation, it is first divided into non-overlapping patches of size \(64\times 64\). An 18-dimensional feature vector is extracted from each patch using the pretrained CNN. The features are extracted from the last layer of the CNN, which is a softmax layer with 18 neurons. The output of each neuron in the softmax layer represents the probability of a patch being classified into one of the 18 camera models considered in the training stage. The features of all the patches present in the test image are clustered using the k-means clustering algorithm. If there is more than one cluster, the test image is decided to be spliced; otherwise it is decided to be authentic. In the case of a spliced image, the patches corresponding to the features in the cluster with the highest cardinality are considered authentic and the rest are considered spliced. The limitation of Bondi et al.'s method is that the features extracted from the softmax layer do not generalize well to unknown camera models. Since the output of the softmax layer consists of class probabilities, it can accurately characterize patches coming from known camera models. However, patches coming from unknown camera models will also be classified into one of the known camera models. Therefore, the 18-dimensional feature may sometimes fail to differentiate between camera models and hence may fail to detect splicing forgeries.
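The decision rule of this baseline can be sketched as follows. This is a minimal, illustrative 2-means over per-patch feature vectors (a library k-means with k chosen by a validation criterion would normally be used; the function name and the deterministic initialization are our own):

```python
import numpy as np

def largest_cluster_2means(feats, iters=20):
    """Illustrative 2-means over per-patch feature vectors (e.g. the
    18-dim softmax outputs used by Bondi et al. [2]). The larger
    cluster is taken as authentic, the rest as spliced."""
    # deterministic init: the first point and the point farthest from it
    d0 = np.linalg.norm(feats - feats[0], axis=1)
    c = np.stack([feats[0], feats[d0.argmax()]])
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None] - c[None], axis=2)
        lab = d.argmin(axis=1)                      # nearest-centroid assignment
        c = np.stack([feats[lab == k].mean(axis=0) if (lab == k).any() else c[k]
                      for k in (0, 1)])             # centroid update
    authentic = np.bincount(lab, minlength=2).argmax()
    return lab, authentic
```

Patches whose label differs from `authentic` would be flagged as spliced.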

Fig. 2.

Block diagram of the proposed method depicting the steps involved.

3 Proposed Method

The proposed method is based on the assumption that a spliced image is a composition of two or more images, which are likely to have been captured by different cameras. Therefore, extracting camera-specific features from small patches of an image and clustering them may give information about the authenticity of the image. In the case of an authentic image, since all the patches are captured by a single camera, there should be only one cluster. On the other hand, if the features form more than one cluster in the feature space, we can decide the test image to be spliced.

The main principles of the proposed method are as follows:

  1.

    The method extracts the camera-specific features from a layer before the softmax layer of a CNN trained to identify camera models. Because of this, the extracted features are not directly dependent on the camera models present in the training stage. This is in contrast to the features extracted by Bondi et al., which are the probability scores produced by the softmax layer of the CNN.

  2.

    The extracted features are clustered using the hierarchical agglomerative clustering (HAC) technique [3, 8] based on the cosine distance. The cosine distance between two feature vectors extracted from patches coming from the same camera is expected to be smaller than that between features from patches coming from two different cameras.
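The cosine distance used here is standard and can be written down directly; a minimal NumPy sketch (the helper name is ours):

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine similarity: close to 0 for vectors pointing
    the same way (same camera expected), larger for dissimilar ones."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Note that the cosine distance depends only on the direction of the feature vectors, not on their magnitude.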

Figure 2 shows the steps involved in the proposed method to detect and localize splicing forgeries. They are described below.

3.1 Feature Extraction Using CNN

3.1.1 CNN

The CNN architecture used in this work is similar to the one proposed by Bondi et al. [1]. It has 5 convolutional layers, 3 max-pooling layers and 2 fully connected layers. The input to the CNN is an image patch of size \(64\times 64\). There is a trade-off between the camera model discrimination ability and the localization ability of the method: as the patch size increases, the camera model discrimination ability increases while the splicing localization ability decreases. The details of the kernel size, the number of filters and the output feature map size of each layer are shown in Fig. 3.

Fig. 3.

The CNN architecture used in the proposed method.

3.1.2 Feature Extraction

Given an image under investigation \(\text {I}_T\) of size \(W \times H\), it is first divided into a number of non-overlapping patches \(\text {P}_{ij}\) of size \(64\times 64\), where \(i \in (1 ,\left\lfloor \frac{W}{64} \right\rfloor )\), \(j \in (1 ,\left\lfloor \frac{H}{64} \right\rfloor ) \) and \(\left\lfloor . \right\rfloor \) denotes the floor operator. Each image patch is fed to the CNN, which is already trained to perform camera model identification. Since our motive is not to get the camera model scores produced by the last layer, i.e. the softmax layer, we propose to extract features from the first fully-connected layer, denoted \(FC_1\). As \(FC_1\) has 128 neurons, the output features are 128-dimensional vectors. The reasons for extracting features from \(FC_1\) are as follows. Since the last layer classifies the patches into one of the camera model classes, its output depends on the camera models present during training. Hence, the features extracted from this layer will not generalize to patches coming from cameras not present in the training stage. On the other hand, \(FC_1\) does not learn to classify patches into any specific class; rather, it extracts features which are more generic. Therefore, the features extracted from \(FC_1\) are more effective in discriminating camera models not present in the training stage. If the feature vectors of two patches lie in close proximity in the feature space, the patches are considered to come from the same camera; if they are far apart, the patches are considered to have been captured by two different cameras.
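The tiling step above can be sketched in NumPy (the helper name is ours; in the full pipeline each returned patch would be fed to the trained CNN and its 128-dimensional \(FC_1\) activation kept as the patch feature):

```python
import numpy as np

def extract_patches(image, p=64):
    """Divide an H x W (x C) image into non-overlapping p x p patches
    P_ij, discarding the right/bottom remainder, so that
    floor(W/p) * floor(H/p) patches are produced in total."""
    H, W = image.shape[:2]
    return [image[j * p:(j + 1) * p, i * p:(i + 1) * p]
            for i in range(W // p) for j in range(H // p)]
```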

3.1.3 Localization of Forged Regions using HAC

After obtaining the features of all the patches, a binary image is estimated based on the proximity of the patches in the feature space. The CNN features of all the patches are clustered using HAC [3]. HAC combines feature vectors into small clusters, those clusters into larger clusters, and so forth, thus creating a hierarchy of clusters. Initially, each CNN feature is considered as a single-element cluster. At each step of the algorithm, the two most similar clusters are combined into a new, bigger cluster. This procedure is iterated until either all feature vectors are merged into a single cluster or the distances between the remaining clusters are greater than a certain threshold \(T_d\). The cluster containing the highest number of patches is then considered to consist of the authentic patches, and the other clusters are considered to be composed of either tampered regions or noise. In the case of a pristine image, all the features will be clustered into a single cluster. In the case of a spliced image, we are likely to get more than one cluster, since there is a high chance that the forged parts were captured by camera models different from the one used to capture the authentic region.
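The clustering step can be sketched with SciPy's hierarchical clustering. The average-linkage criterion below is our assumption (the paper does not specify which linkage is used), as is the toy feature dimensionality in the usage example:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_patches(features, T_d=0.4):
    """HAC over cosine distances, cut where the inter-cluster distance
    exceeds T_d. `features` is an (n_patches, d) array. Returns the
    per-patch cluster labels and the label of the largest cluster,
    which is taken as the authentic one."""
    Z = linkage(features, method='average', metric='cosine')
    labels = fcluster(Z, t=T_d, criterion='distance')
    authentic = np.bincount(labels).argmax()
    return labels, authentic
```

With an authentic image all patches end up in one cluster; with a spliced image the patches of the pasted region form a second, smaller cluster.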

Let \(\text {M}_{ij}\) be the binary output image for the patch \(\text {P}_{ij}\). If \(\text {P}_{ij}\) is an authentic patch, we assign 0 to all the pixels of \(\text {M}_{ij}\); otherwise, we assign 1 to all the pixels of \(\text {M}_{ij}\). A binary image \(\text {M}\) is created by putting together all the \(\text {M}_{ij}\)s in the appropriate image coordinates, i.e. \(\text {M}=[\text {M}_{ij}]\), where \(i \in (1 ,\left\lfloor \frac{W}{64} \right\rfloor )\) and \(j \in (1 ,\left\lfloor \frac{H}{64} \right\rfloor )\). Because of noise, there may be a few false-alarm patches scattered throughout \(\text {M}\). To remove these false alarms, morphological opening is performed on \(\text {M}\) with a structuring element of size \(128 \times 128\) to obtain the binary image \(\widehat{\text {M}}\). This is based on the assumption that the smallest possible forgery is of size \(128\times 128\). \(\widehat{\text {M}}\) localizes the potential forged patches of \(\text {I}_T\).
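On the patch grid, a \(128 \times 128\)-pixel structuring element corresponds to a \(2\times 2\) block of \(64\times 64\) patches, and opening is erosion followed by dilation. A self-contained NumPy sketch (a library routine such as `scipy.ndimage.binary_opening` would normally be used; the explicit loops here are for clarity):

```python
import numpy as np

def opening2x2(M):
    """Morphological opening of a binary patch-map M with a 2x2
    structuring element, removing isolated false-alarm patches."""
    h, w = M.shape
    er = np.zeros_like(M)
    # erosion: a window survives only if all four of its patches are set
    for y in range(h - 1):
        for x in range(w - 1):
            er[y, x] = M[y:y + 2, x:x + 2].all()
    out = np.zeros_like(M)
    # dilation: restore the full 2x2 footprint of each surviving window
    for y in range(h - 1):
        for x in range(w - 1):
            if er[y, x]:
                out[y:y + 2, x:x + 2] = 1
    return out
```

A single stray forged patch vanishes, while any solid \(2\times 2\) (or larger) block of forged patches is preserved.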

3.1.4 Detection of the Forged Image using Thresholding

Because of the presence of noise, some patches of an authentic image may be misidentified as forged. Therefore, an image-level decision needs to be taken about the test image, i.e. whether the image is pristine or spliced. For this, we check the percentage of forged patches present in \(\text {I}_T\). Let \(T_f\) represent the percentage of forged patches present in \(\text {I}_T\). We decide \(\text {I}_T\) to be pristine if the percentage of forged patches is less than a certain threshold \(T_p\), i.e. if \(T_f<T_p\). Otherwise, \(\text {I}_T\) is decided to be spliced (Fig. 4).
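The image-level rule reduces to a one-liner over the post-processed patch map \(\widehat{\text {M}}\), here represented as a binary array with one entry per patch (the function name and the default \(T_p\), taken from the value tuned in Sect. 4, are our choices):

```python
import numpy as np

def is_spliced(patch_map, T_p=0.08):
    """Image-level decision: pristine iff the fraction T_f of patches
    flagged as forged is below the threshold T_p."""
    T_f = float(np.mean(patch_map))   # fraction of forged patches
    return T_f >= T_p
```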

Fig. 4.

An example of forged image along with the ground truth binary image, the binary output image obtained after clustering and the final binary output image obtained after the morphological post-processing operation.

4 Experimental Results

We use the Dresden Image Database [4] to train the CNN for the camera model identification task. The database contains 16,961 images captured using 28 different camera models. Out of these 28 models, we use 18 camera models for training the CNN. Since the CNN accepts inputs of size \(64\times 64\), we divide the images of all 18 classes into patches of that size. To test the splicing detection and localization performance, two tampered datasets are created. Dataset \(\text {D}_1\) is created using images from the 18 known camera models (i.e. models present in the training stage), and dataset \(\text {D}_2\) is created using the 10 unknown camera models. Both datasets contain 300 pristine and 300 spliced images along with their ground truth binary images, representing the authentic and spliced parts. The tampered images are created as follows: two images \(\text {I}_1\) and \(\text {I}_2\) are first selected randomly from the dataset. Then, a randomly chosen rectangular region from \(\text {I}_2\) is pasted onto \(\text {I}_1\) at a random location. The size of the rectangular region varies from \(128\times 128\) to \(512\times 512\).

The CNN is first trained to perform the camera model identification task. Once the CNN is trained, we use it to extract features from the patches of size \(64\times 64\) present in the image under investigation. The features are later clustered to detect and localize the spliced regions.

As explained earlier, the proposed algorithm depends on two thresholds, \(T_d\) and \(T_p\). The threshold \(T_d\) is the maximum allowed distance between the means of two clusters. To find the optimal value of \(T_d\), we created 18,000 authentic feature pairs and 18,000 spliced feature pairs. The authentic pairs are obtained by taking two patches captured by the same camera model. The spliced pairs are obtained by taking two patches captured by different camera models. The receiver operating characteristic (ROC) curve is then computed by plotting the true positive rate (TPR) against the false positive rate (FPR) for various values of the threshold \(T_d\), as shown in Fig. 5. The optimal value of \(T_d\) is selected as the value that maximizes the accuracy according to the ROC. The threshold \(T_p\) is used to take the image-level decision on a test image. The optimal value of \(T_p\) is selected as the value that maximizes the accuracy according to the ROC results shown in Fig. 6a and b.
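The accuracy-maximizing selection of \(T_d\) can be sketched as a simple sweep over candidate values (synthetic distances in the usage example; in the paper the 18,000 authentic and 18,000 spliced pair distances play this role):

```python
import numpy as np

def pick_threshold(auth_dist, spliced_dist, candidates):
    """Sweep candidate thresholds: an authentic pair is correct when
    its distance falls below the threshold, a spliced pair when it
    falls at or above it; return the accuracy-maximizing threshold."""
    n = len(auth_dist) + len(spliced_dist)
    best_t, best_acc = None, -1.0
    for t in candidates:
        acc = ((auth_dist < t).sum() + (spliced_dist >= t).sum()) / n
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

The same sweep, applied to image-level decisions instead of pair distances, yields the optimal \(T_p\).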

Fig. 5.

ROC curve varying the threshold \(T_d\).

Fig. 6.

ROC curve for splicing detection

Table 1. Tampering detection results.

The ROC is obtained by plotting the TPR against the FPR for various values of the threshold \(T_p\), fixing \(T_d=0.4\), for the unknown and the known dataset, as shown in Fig. 6a and b respectively. In Fig. 6a, the area under the curve (AUC) of the proposed method is 0.9845, which shows that the best forgery detection is achieved for \(T_d = 0.4\); the AUC obtained using the method of Bondi et al. [2] is 0.8584. In Fig. 6b, the AUC of the proposed method is 0.9833, again achieved for \(T_d = 0.4\), whereas the AUC obtained using the method of Bondi et al. [2] is 0.9542. On the known and unknown datasets, the proposed method achieves accuracies (ACCs) of \(94.9\%\) and \(93.3\%\) respectively, while Bondi et al.'s method achieves ACCs of \(91.8\%\) and \(82\%\). Comparing the two ROCs, it is evident that on unknown datasets our method outperforms the existing method. Table 1 shows the results in terms of accuracy (ACC), TPR and true negative rate (TNR). For \(T_p = 0\), an image is considered pristine only if all its patches are pristine, so the TPR is maximal. From the table it is evident that \(T_d = 0.4\) and \(T_p = 0.08\) give the best result in terms of ACC. These values are later used in checking the authenticity of a given test image.

5 Conclusions

In this paper, we proposed an algorithm which exploits a CNN to extract camera-specific features from image patches. These features are further analysed in order to detect whether an image is forged or not. In the first phase, we train a CNN on images from the Dresden dataset to learn the camera-specific features of 18 camera models. In the second phase, we divide the test image into a number of non-overlapping patches and extract features using the trained CNN. These features are then clustered using the HAC technique for detecting and localizing splicing forgeries. The method can detect splicing in images coming from known as well as unknown camera models. Experiments were conducted on 1200 known as well as unknown images coming from 28 camera models. The experimental results show that the proposed method outperforms the state-of-the-art for images captured by both known and unknown camera models. Our ongoing work aims at detecting splicing forgeries in post-processed images using this framework.