
1 Introduction

Breast cancer is one of the most common cancers in women and accounts for a large number of deaths every year [1]. It is therefore essential to detect and diagnose the cancer at an early stage so that the right treatment can begin. However, the breast cancer histopathology dataset (BREAKHIS) is an imbalanced image dataset, and it is challenging to deal with because of the disproportionate number of samples in each class [2]. It is important to distinguish the cancer (Malignant) class from the non-cancer (Benign) class correctly so that an appropriate diagnosis is made at the right stage. Most real-world problems involve imbalanced datasets. The imbalanced nature of such a dataset yields results that are biased towards the majority class, while the minority class is left undetected.

In this work, an investigation is conducted on the breast cancer dataset to effectively discriminate minority class samples from the majority class through deep learning based experiments. A data augmentation based approach using pre-trained networks and transfer learning is used to separate cancerous patterns from non-cancerous conditions in a two-class (Benign versus Malignant) classification task. Transfer learning is a way of transferring knowledge from one specific domain or task to another [3]. The transfer learning approach has two main advantages: (1) the model is already trained on a large scale dataset, and reusing the existing model along with its predefined weights for our own classification task saves a lot of computational processing time; (2) the knowledge transferred from the large scale dataset allows classification to perform well even with a small dataset. Pre-trained networks are trained on a large scale dataset such as ImageNet, which is composed of millions of high-resolution images belonging to multiple categories or classes (approximately 1.4 million images and 1000 classes).

Data augmentation [4] is the process of synthetically increasing the number of samples in the dataset, which helps to resolve the overfitting problem that arises from an inadequate number of samples. Several transform operations can be applied for data augmentation, such as scaling, rotation, and horizontal and vertical flips. These transformations help to increase the number of samples of the minority class as well as the majority class, so that the minority class can be detected correctly during classification. The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the methodology for extracting features using pre-trained networks along with the data augmentation approach. Section 4 describes the dataset and the experimental work conducted. Finally, the conclusion is given in Sect. 5.

2 Literature Survey

BREAKHIS is an imbalanced dataset with an unequal class distribution: the number of Benign samples is significantly smaller than the number of Malignant samples. A large variation is also observed across these images. Several studies have been conducted either to resolve the issue or to study the effect of the imbalance. Gardner & Nicholas (2017) experimented on satellite images by implementing a Convolutional Neural Network (CNN) model to perform multi-label classification, comparing VGG-16, Inception-v3, ResNet-50 and ResNet-50 with data augmentation. The best performance, measured with evaluation parameters such as F1-score, precision and recall, was achieved by the ResNet-50 model combined with data augmentation and an ensemble technique [5]. Wang et al. (2017) proposed a method called neural augmentation, which allows a neural net to learn augmentations that improve the classifier performance, and compared it with traditional data augmentation approaches in the image classification task. Their results state that traditional augmentation followed by neural augmentation further improves the classification strength [6].

In data science, sampling is the most popular remedy for the class imbalance problem. Chawla (2009) described in detail the changes that can be made at the data level or the algorithmic level to overcome the class imbalance issue, such as data-level sampling techniques, i.e., oversampling, undersampling, SMOTE and variants of SMOTE such as SMOTEBOOST [7]. Hybrid strategies that combine oversampling of the minority class with undersampling of the majority class have also been proposed [21, 22]. In the field of computer vision, data augmentation has been shown to mitigate the imbalance problem to a certain extent, and hence it is the approach used in our deep learning experiment to counter the class imbalance.

Chen et al. (2012) conducted an experiment applying data augmentation on various pre-trained networks such as VGG16, ResNet-50, and DenseNet for satellite images. Their experimental results show that the best performance is achieved by an ensemble model that combines the results of three DenseNet121 models with a VGG-16 model [8].

3 Proposed Methodology and Techniques Used

Following the cue from the existing work, our work applies data augmentation [4], but on the minority class only, similar to the oversampling step in [21]. The minority augmented dataset, which is in essence balanced, is then used for transfer learning through pre-trained networks and eventually classified using a Weighted Support Vector Machine (SVM).

3.1 Data Augmentation of the Minority Class

The following data augmentation operations were applied to the minority class across the complete dataset: shear with a range of 10, zoom in and out within a range of 20 percent, and horizontal flip, along with resizing and pre-processing operations. Instances from the augmented minority set are shown in Fig. 1.

Fig. 1. Illustration of data augmentation applied on the minority class (Benign)
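The idea of augmenting only the minority class until it matches a target size can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: only the horizontal flip is implemented (shear and zoom are omitted), images are represented as plain nested lists, and the helper names and the `target_count` parameter are hypothetical, since the number of augmented samples generated is not stated here.

```python
import random


def horizontal_flip(image):
    """Mirror an image given as a list of pixel rows."""
    return [row[::-1] for row in image]


def augment_minority(images, target_count, seed=0):
    """Keep the originals and append flipped copies until target_count is reached.

    Hypothetical helper: a real pipeline would also apply shear and zoom,
    so that repeated draws do not produce identical copies.
    """
    rng = random.Random(seed)
    augmented = list(images)
    while len(augmented) < target_count:
        augmented.append(horizontal_flip(rng.choice(images)))
    return augmented


# Two toy 2 x 3 "images" standing in for minority (Benign) samples.
benign = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [0, 1, 2]]]
balanced_benign = augment_minority(benign, target_count=5)
print(len(balanced_benign))
print(horizontal_flip(benign[0]))
```

The majority class is left untouched in this sketch, which is what makes the resulting training set approximately balanced.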

3.2 Phase I: Classification Using Pre-trained Networks

The pre-trained model was used for performing the classification experiment [19]. The pre-trained networks used for our experiment are described below.

(i) ResNet-50: The ResNet-50 architecture [5] consists of a series of residual blocks, each made up of a set of repeated layers. In total it has 50 layers, comprising repeated convolution and pooling layers along with fully connected layers.

(ii) Inception-v3: The first Inception model was introduced as GoogLeNet. Over time, further versions followed, such as Inception-v2, Inception-v3, and Inception-v4. For our experiment, we have used Inception-v3, whose main idea is factorization, along with batch normalization [8]. Inception-v3 is a 42-layer architecture with fewer parameters and higher computational efficiency than previous Inception architectures.

A Global Average Pooling 2D layer is added on top of the pre-trained network along with the dense layers [11], and Sigmoid and ReLU activation functions are used.
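To make the role of the added layers concrete, the following sketch traces a forward pass through global average pooling, a ReLU dense layer and a sigmoid output. It is an illustration only: the feature map, weights and layer sizes are made-up toy values, not the configuration used in the experiments.

```python
import math


def global_average_pool(feature_maps):
    """Average each channel over all spatial positions (input: H x W x C nested lists)."""
    height, width = len(feature_maps), len(feature_maps[0])
    channels = len(feature_maps[0][0])
    return [
        sum(feature_maps[i][j][c] for i in range(height) for j in range(width))
        / (height * width)
        for c in range(channels)
    ]


def dense(inputs, weights, bias):
    """Fully connected layer with one weight row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


# Toy 2 x 2 feature map with 2 channels (hypothetical values).
features = [[[1.0, 0.0], [3.0, 2.0]], [[5.0, 4.0], [7.0, 6.0]]]
pooled = global_average_pool(features)
hidden = relu(dense(pooled, weights=[[0.5, -1.0], [1.0, 1.0]], bias=[0.0, -1.0]))
probability = sigmoid(dense(hidden, weights=[[1.0, -1.0]], bias=[0.0])[0])
print(pooled, hidden, probability)
```

Global average pooling reduces each channel to a single number, so the dense layers that follow need far fewer parameters than they would after flattening the full feature map.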

3.3 Phase II: Deep Feature Extraction and Transfer Learning

Phase II comprises deep feature extraction from the pre-trained model fine-tuned with the minority augmented dataset. The deep features are extracted from all the layers of the Inception-v3 pre-trained model, taken in order. The classifiers trained on the extracted deep features in our transfer learning experiment are explained below.

SVM.

An SVM classifier can be used for both linearly separable and non-separable problems, depending on the type of kernel used. The kernel is the function used to map the input feature space from a lower to a higher dimension. It can be selected as ‘linear’, ‘polynomial’, ‘rbf’, etc., depending on the task. SVM [9, 10] is commonly used as a classifier that solves a task by selecting an optimal decision boundary, or hyperplane, which separates the samples of the two classes. For the experimental task, we have used the ‘rbf’ kernel.
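As an illustration of what the ‘rbf’ kernel computes, the sketch below evaluates the standard radial basis function kernel, K(x, y) = exp(-gamma * ||x - y||^2), for pairs of feature vectors. The value of `gamma` and the vectors are assumptions chosen for the example; the gamma used in the experiments is not stated here.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum similarity of 1.0;
# similarity decays towards 0 as the vectors move apart.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5))
```

Because the kernel depends only on the distance between samples, the resulting decision boundary can be non-linear in the original feature space.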

Weighted SVM.

Weighted SVM [11] is also used for the evaluation in Phase II. The class weight is set to ‘balanced’ in order to deal with the imbalanced dataset, so that errors on the minority class are penalized more heavily than errors on the majority class. The kernel used for the classification task is ‘linear’, with the penalty parameter C set to 1.0. C controls the trade-off between a wide margin and misclassified training points; with C = 1, the classifier tolerates some wrongly classified data points. The other parameters set while training the classifier were a random state of zero and probability set to true.
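The effect of a ‘balanced’ class weight can be illustrated with a short calculation. The sketch assumes the heuristic used by scikit-learn, where each class weight is n_samples / (n_classes * class_count); the class counts of 488 Benign and 1,132 Malignant are the 400X training counts implied by the dataset section after holding out 100 test images per class, before any augmentation.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Training labels before augmentation (counts derived as described above).
labels = ["benign"] * 488 + ["malignant"] * 1132
weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

The minority class receives a weight above 1 and the majority class a weight below 1, so a misclassified Benign sample costs more during training. On a perfectly balanced set both weights equal 1.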

4 Experimental Results and Discussions

For our deep learning experiments, the popular Python framework Keras v2.1.6 has been used, with a TensorFlow backend performing all its internal computations. The model is trained for 3 epochs with 29 steps in each epoch. While training the pre-trained network, sparse categorical cross-entropy is used as the loss function for each fold, along with the Adam (Adaptive Moment Estimation) optimizer, a batch size of 32 in each fold, and the ReLU activation function [12] in the dense layer.

4.1 Dataset

Images in the BREAKHIS dataset belong to four different magnification factors, i.e., 40X, 100X, 200X and 400X [18, 20]. Experimental evaluation was performed using the 400X magnification factor images. This subset comprises 588 Benign and 1,232 Malignant images, as shown in Table 1. Random sampling is applied to separate the testing images from the training images, such that 100 samples each from the Malignant and Benign classes are used for testing, while the remaining samples are kept for training. Each input image was resized to 224 × 224 before performing the experiment. Five-fold cross-validation has been used in Phase I.

Table 1. Imbalanced image dataset of cancerous (Malignant) and non-cancerous (Benign) images at the 400X magnification factor in the training and testing sets.
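The hold-out step described above can be sketched as follows. The file names, the seed and the helper name are hypothetical placeholders; only the class counts (588 Benign, 1,232 Malignant) and the 100 test images per class come from the text.

```python
import random


def hold_out_per_class(samples_by_class, test_per_class, seed=0):
    """Randomly move a fixed number of samples per class into the test set."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        test[label] = shuffled[:test_per_class]
        train[label] = shuffled[test_per_class:]
    return train, test


# Placeholder file names standing in for the 400X images.
dataset = {
    "benign": [f"benign_{i}.png" for i in range(588)],
    "malignant": [f"malignant_{i}.png" for i in range(1232)],
}
train, test = hold_out_per_class(dataset, test_per_class=100)
print({label: (len(train[label]), len(test[label])) for label in dataset})
```

The test set is balanced by construction, while the training set keeps the original imbalance (488 Benign against 1,132 Malignant) until the minority class is augmented.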

4.2 Performance Evaluation

Several studies show that accuracy is not always a reliable measure of model performance when dealing with an imbalanced dataset. Accuracy indicates the overall fraction of samples that the classifier assigns to the correct class. Precision denotes the fraction of samples predicted as belonging to a class that actually belong to it. Recall denotes the fraction of samples actually belonging to a class that are correctly identified. The F-measure, or F1-score, combines recall and precision as shown in Eq. 1. The ROC curve is the graph of the true positive rate plotted against the false positive rate.

$$ \text{F1-score} = \frac{2 \times \left( \text{Precision} \times \text{Recall} \right)}{\text{Precision} + \text{Recall}} $$
(1)

The classification report has been generated using the scikit-learn package in Python to analyze the performance of the model in terms of precision, recall, F1-score [16] and the ROC (Receiver Operating Characteristic) curve.
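The metrics above can be computed directly from confusion counts, as in the following sketch. The counts are hypothetical and chosen only to illustrate why accuracy can mislead: a classifier that labels every one of 200 balanced test images as Malignant still reaches 50% accuracy while detecting no Benign samples at all.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score for one class, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical case 1: a reasonable classifier, Benign treated as the positive class.
print(precision_recall_f1(tp=80, fp=20, fn=20))

# Hypothetical case 2: every test image is predicted as Malignant.
# Accuracy is 100 / 200 = 0.5, yet no Benign sample is ever detected.
benign_metrics = precision_recall_f1(tp=0, fp=0, fn=100)
accuracy = (0 + 100) / 200
print(benign_metrics, accuracy)
```

Reporting precision, recall and F1-score per class exposes this failure, whereas accuracy alone does not.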

4.3 Results and Discussions

4.3.1 Phase I Experiments

The experimental analysis shows a large gap between train and test accuracy. The average train accuracy of Inception-v3 lies approximately in the 70 to 80% range both with and without data augmentation, whereas the average test accuracy improves from 47 to 50%, as shown in Table 3. The experiments performed using the Inception-v3 pre-trained network with and without data augmentation are reported in Tables 2 and 3. Inception-v3 by itself is unable to detect the minority class, which agrees with the observation in the literature that results on an imbalanced dataset are biased towards the majority class. In contrast, the experiments that apply data augmentation to the minority class separately are able to detect the minority class effectively, as illustrated in Table 2 and Fig. 2. Augmenting the minority class alone also detects the minority class more efficiently than Inception-v3 with data augmentation applied to both classes.

Figure 3 depicts test samples evaluated on the Inception-v3 pre-trained network after applying data augmentation on the minority class, using one of the best folds of the five-fold cross-validation. Figure 4 shows sample test images without data augmentation. A similar experiment was conducted with the ResNet-50 architecture to measure the performance of the data augmentation technique, as illustrated in Tables 4 and 5. The experimental results show that Inception-v3 outperforms ResNet-50 in all cases.

Table 2. Comparison of Performance evaluation on the basis of Precision, Recall, and F1-score using cancer dataset in case of Inception-v3 pre-trained network.
Table 3. Comparison of performance evaluation of train and test accuracy using cancer dataset in case of Inception-v3 pre-trained network.
Fig. 2. Visualization curves (i) to (iv) for the Inception-v3 pre-trained network after applying data augmentation on the minority class with 4-fold validation

Fig. 3. Samples tested on Inception-v3 pre-trained network after applying data augmentation on minority class

Fig. 4. Samples tested on Inception-v3 pre-trained network without applying data augmentation

Table 4. Comparison of performance evaluation on the basis of precision, recall, and F1-score using cancer dataset in case of ResNet-50 pre-trained network.
Table 5. Comparison of performance evaluation of train and test accuracy using cancer dataset in case of ResNet-50 pre-trained network.

4.3.2 Phase II Experiments

Experiments were conducted to evaluate the performance of Inception-v3 under different conditions on the basis of precision, recall, F1-score and accuracy. Inception-v3 with data augmentation applied on the minority class, which gave the best result in the Phase I experiment, was compared with the SVM and weighted SVM classifiers trained on the features extracted from the Inception-v3 pre-trained model. The experimental analysis shows that Inception-v3 with weighted SVM and data augmentation applied on the minority class outperforms the other Inception-v3 based experiments, as shown in Table 6 and Fig. 5.

Table 6. Comparison of Performance evaluation of techniques applied on the minority class
Fig. 5. (i) ROC curve in case of Inception-v3 with weighted SVM (ii) ROC curve in case of Inception-v3 with weighted SVM and data augmentation applied on the minority class

5 Conclusion

This work proposes a solution for countering class imbalance in deep learning, specifically for the BREAKHIS breast cancer dataset. The results show that accuracy alone cannot be relied upon when evaluating a model on an imbalanced dataset; other parameters such as F1-score, precision, recall, and ROC need to be considered. The experiments were conducted in two phases. Phase I investigates the effect of data augmentation applied on the minority class only, for the pre-trained networks Inception-v3 and ResNet-50. Better results are obtained when data augmentation is applied on the minority class with the Inception-v3 pre-trained model. In Phase II, features were extracted from all layers of Inception-v3 and learned by the SVM and weighted SVM classifiers. The results show that Inception-v3 with data augmentation on the minority class works best with transfer learning using weighted SVM, as compared to the other configurations. It was also observed that a model can report a high test accuracy while its predictions remain biased towards the majority class; the minority class is then left undetected, and the model assigns the test images to the majority class regardless of the patient's true condition. It is therefore important to assess model performance using all the described parameters instead of accuracy alone. Future work involves using an ensemble deep feature approach to extract a new set of features that effectively distinguishes the minority class from the majority class using pre-trained models.