1 Introduction

Microscopic image analysis is a helpful and emerging area of medical research. Hematoxylin and eosin (H&E)-stained histopathology images (HIs) are segmented to obtain region-specific characteristics for detailed analysis. Statistical features are used in image segmentation to mark the object of interest [1,2,3]. Wang et al. presented a multi-scale region growing and curvature scaling method for automatic breast cell nuclei segmentation and classification (ANSC) [2]. Wang [4] proposed a semi-automatic method (SAM) for cell segmentation. Various statistical features have been studied to enhance different regions of interest [4,5,6]. HI segmentation finds applications in the identification of diverse objects such as tissues and glands [7,8,9]. Vu et al. [8] proposed a technique based on class-specific feature learning to separate interclass differences, named the discriminative feature-oriented dictionary learning method (DFDLM). Naylor et al. [9] presented a nuclei segmentation using deep regression (NSDR) approach that targets touching nuclei regions. In most HI segmentation methods, basic marking is performed first, followed by stain decomposition [10] as required by the dataset. Laplacian-of-Gaussian filtering and Gaussian mixture model (GMM)-based pixel clustering have been investigated for seed point extraction in nuclei segmentation in [11] and [12], respectively.

Many researchers have investigated localization through segmentation as a preprocessing step for feature extraction. Support vector machine (SVM) and convolutional neural network based classifiers, which rely on hyperplane selection, are commonly used [13,14,15,16,17]. Yan et al. [18] presented a hybrid technique for breast cancer HI classification using a convolutional and recurrent deep neural network (CRDNN). Yang et al. [19] worked on feature selection for high-dimensional data mining using nearest neighbor-based feature weighting. Klein et al. [20] developed a fast Bayesian optimization technique for machine learning hyperparameters on large datasets. Most of the methods discussed above face problems in feature extraction due to overlapped nuclei, which in turn reduces the dependability of classification.

This paper proposes a classification of HIs for cancer detection that uses nuclei localization as a preprocessing step. A significant number of relevant features are extracted from the segmented HI using a combination of bag of visual words (BoW) and handcrafted features. The significance of the handcrafted features is tested using neighborhood component analysis (NCA) [19]. Further, the SVM [15] and multilayer perceptron (MLP) [16] classifiers are applied, with hyperparameters optimized using Bayesian optimization [20], for benign and malignant HI classification.

In the remainder of the manuscript, Sect. 2 presents the nuclei localization method, Sect. 3 explains the selection of appropriate features, and the details of the classification models are given in Sect. 4. The experimental setup is presented in Sect. 5, followed by the result analysis in Sect. 6. Finally, Sect. 7 concludes the research findings.

2 Localization method

The proposed method is presented as a block diagram in Fig. 1: Fig. 1a shows the localization part and Fig. 1b shows the classification part. In localization, the input HI (\( f \)) is preprocessed, followed by the identification of the nuclei regions and nuclei boundaries. The final outcome of the described HI processing is the complete nuclei-segmented (localized) image (\( f_{\text{L}} \)).

Fig. 1 Block diagram of the proposed algorithm: (a) nuclei localization in HI and (b) HI classification into benign and malignant

Fig. 1b presents the combination of the proposed feature extraction and classification. Details of the proposed feature extraction and classification are explained in Sects. 3 and 4, respectively.

For preprocessing, stain decomposition is first applied to \( f \). It focuses on stain co-occurrence in association with the circular mixture model and soft clustering of pixels [10]. The pixel-level clustering exploits the periodicity of the hue signal on the unit circle for decomposition. It results in a preprocessed image \( f_{\text{p}} \). The nucleus contains euchromatic regions (active regions, comparatively brighter than other regions) and heterochromatic regions (inactive regions, comparatively darker than other regions). The nuclei are always darker regions and are referred to as key points. Figure 2 illustrates the benign and malignant images with their ground truth and stain-decomposed counterparts.

Fig. 2 Illustration of hematoxylin and eosin stained images: (a) original image, (b) preprocessed image, (c) ground truth and (d) the decomposed hematoxylin stain component for benign (upper) and malignant (lower) images

2.1 Nuclei initialization

The nuclei initialization is performed on the preprocessed HI to initiate the segmentation. First, \( f_{\text{p}} \) is converted to the corresponding grayscale image \( f_{\text{g}} \) in the range [0, 1] and further enhanced using a normalization factor \( \alpha \) in order to transform the range from [0.15, 0.4] to [0, 0.4]. The pixel values of the enhanced grayscale HI (\( f_{\text{r}} \)) are given by Eq. 1:

$$ f_{\text{r}}(x,y) = \begin{cases} 0, & \text{if}\; f_{\text{g}}(x,y) \le 0.15 \\ \alpha \times f_{\text{g}}(x,y), & \text{if}\; 0.15 < f_{\text{g}}(x,y) \le 0.4 \\ f_{\text{g}}(x,y), & \text{otherwise} \end{cases} $$
(1)
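The piecewise mapping of Eq. 1 can be sketched in a few lines of Python. This is only an illustration: the image is assumed to be a nested list of grayscale values in [0, 1], and the value of \( \alpha \) is not specified in the text, so the one used below is a hypothetical placeholder.

```python
def enhance_gray(f_g, alpha, low=0.15, high=0.4):
    """Apply the piecewise enhancement of Eq. 1 to a 2D grayscale image in [0, 1]."""
    f_r = []
    for row in f_g:
        out_row = []
        for value in row:
            if value <= low:
                out_row.append(0.0)            # very dark pixels are set to zero
            elif value <= high:
                out_row.append(alpha * value)  # mid-range pixels are rescaled by alpha
            else:
                out_row.append(value)          # brighter pixels are left unchanged
        f_r.append(out_row)
    return f_r


# Hypothetical 2x3 image and a hypothetical alpha (the text does not give its value)
f_g = [[0.10, 0.20, 0.40],
       [0.15, 0.30, 0.80]]
f_r = enhance_gray(f_g, alpha=0.5)
```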

The difference of Gaussian (DoG) is applied to \( f_{\text{r}} \) and returns \( f_{\text{DoG}} \). Similarly, the Hessian of Laplacian of Gaussian (HLoG) operator is applied to \( f_{\text{r}} \) and produces \( f_{\text{HLoG}} \). All three images (\( f_{\text{DoG}} \), \( f_{\text{HLoG}} \) and \( f_{\text{r}} \)) are segmented through Otsu thresholding [21], and the three segmented images are combined to identify the nuclei key points. The nuclei region is processed with morphological erosion, and overlapped nuclei are separated by taking the nuclei radius \( r \le R \) as a constraint to obtain the nuclei seeds. Large regions are treated as multiple nuclei, taking the nuclei shape and size into account. The result of this step is the nuclei-center-marked image (\( f_{\text{Mark}} \)).
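The Otsu thresholding step can be illustrated with a minimal histogram-based sketch. It assumes pixel values in [0, 1] given as a flat list, and it assumes that nuclei are the darker class, so the mask keeps values below the threshold. The DoG and HLoG filtering and the rule for combining the three segmented images are not reproduced here.

```python
def otsu_threshold(values, bins=256):
    """Return the threshold in [0, 1] that maximizes the between-class variance."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1

    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(bins):
        w0 += hist[t]               # number of pixels in the dark class
        sum0 += t * hist[t]
        w1 = total - w0             # number of pixels in the bright class
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_bin = between_var, t
    return (best_bin + 1) / bins    # upper edge of the last dark-class bin


# Hypothetical bimodal intensities: dark nuclei pixels and bright background pixels
values = [0.1] * 50 + [0.8] * 50
threshold = otsu_threshold(values)
nuclei_mask = [v < threshold for v in values]  # assumption: nuclei are the darker class
```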

2.2 The nuclei region estimation

The nuclei regions are identified by applying the normalized graph cut method [22] to \( f_{\text{p}} \), followed by the key point-contour link creation algorithm [5], which uses the marked key points of \( f_{\text{Mark}} \). It connects the key points with the outcome of the normalized graph cut method. The link length is chosen as 3–7 pixels based on the size of the nuclei. The image \( f_{\text{p}} \) is represented as a graph \( G = (V, E) \), where \( V \) is the set of vertices \( \{ {\text{v}}_{1} , {\text{v}}_{2} , \ldots \} \) and \( E \) is the set of edges \( \{\upvarepsilon_{1} ,\upvarepsilon_{2} , \ldots \} \). The links are categorized into two sections, object \( O \) and background \( B \), and the normalized cut between them is defined as:

$$ N_{\text{cut}} (O,B) = \frac{{{\text{cut}}(O,B)}}{{{\text{assoc}}(O,V)}} + \frac{{{\text{cut}}(B,O)}}{{{\text{assoc}}(B,V)}} $$
(2)

where

$$ \begin{aligned} {\text{cut}}(O,B) &= \sum\limits_{u \in O,\, v \in B} w(u,v), \\ {\text{assoc}}(O,V) &= \sum\limits_{u \in O,\, v \in V} w(u,v). \end{aligned} $$
(3)
$$ {\text{Here}}\quad w({\text{v}}_{i}, {\text{v}}_{j}) = w_{ij} = \begin{cases} 1, & \text{if}\; ({\text{v}}_{i}, {\text{v}}_{j}) \in E \\ 0, & \text{otherwise} \end{cases} \quad \forall\, {\text{v}}_{i}, {\text{v}}_{j} \in V $$
(4)
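As a worked illustration of Eqs. 2–4, the following sketch evaluates the normalized cut of a small hypothetical graph with binary edge weights. The four-vertex graph and the two candidate partitions are assumptions made for this example only.

```python
def cut(weights, a, b):
    """Sum of edge weights w(u, v) with u in a and v in b (Eq. 3)."""
    return sum(weights[u][v] for u in a for v in b)


def normalized_cut(weights, obj, bkg):
    """Normalized cut between object and background vertex sets (Eq. 2)."""
    vertices = obj | bkg
    assoc_obj = cut(weights, obj, vertices)  # assoc(O, V)
    assoc_bkg = cut(weights, bkg, vertices)  # assoc(B, V)
    return cut(weights, obj, bkg) / assoc_obj + cut(weights, bkg, obj) / assoc_bkg


# Hypothetical 4-vertex chain 0-1-2-3 with binary weights as in Eq. 4
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
balanced = normalized_cut(weights, {0, 1}, {2, 3})    # 1/3 + 1/3
unbalanced = normalized_cut(weights, {0}, {1, 2, 3})  # 1/1 + 1/5
```

In this toy case the balanced partition has the lower normalized cut, which is the behavior that makes the criterion less prone to cutting off isolated vertices.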

The normalized graph cut method is applied recursively and separates strong and weak links. Strong links correspond to nuclei connections, whereas weak links correspond to other objects. The overall shape of the nuclei in the HI is defined by the strong links, and the image containing the extracted nuclei regions is denoted \( f_{\text{RE}} \). The image \( f_{\text{RE}} \) may suffer from the boundary region problem. In the proposed method, this problem is addressed by nuclei boundary estimation.

2.3 Nuclei boundary estimation

The nuclei boundary estimation corresponding to the estimated nuclei region points is illustrated in Fig. 3a, b. The contours are extracted from the nuclei edges with an optimum boundary estimation, as displayed at the bottom of Fig. 3b. Nuclei boundary extraction by the combination of receptive fields (CORF) model is based on edge detection, in which edges are extracted unit-wise. A combination of small edge sections is taken as the receptive field (RF) unit. The response \( R_{\text{S}} \) of a CORF operator is defined as the weighted geometric mean of the responses of all edge sections; for more details, see [11].

Fig. 3 Nuclei boundary estimation by combination of receptive fields (CORF), improved by modified gradient at discontinuity (MGD): (a) working principle of the CORF model and MGD model, (b) visual illustration of nuclei boundary estimation by the CORF model and boundary refinement by the MGD model

Boundaries in the segmentation outcome that are not aligned with each other are removed using the nucleus center-to-boundary association. CORF followed by the modified gradient at discontinuity (MGD) [23] yields a clear demarcation of the nuclei, indicates the nuclei boundaries in the HI and returns \( f_{\text{BE}} \). Further, the complete nuclei localization is obtained by setting the region inside the estimated boundary to 1s and the rest of the image \( f_{\text{BE}} \) to 0s, which results in the complete nuclei mask \( f_{\text{Mask}} \). Applying \( f_{\text{Mask}} \) to \( f_{\text{p}} \) produces the final localized image \( f_{\text{L}} \).

3 Feature extraction and selection

Appropriate class prediction is the prime focus of the proposed HI analysis. A classifier needs a set of features to assign the data to their suitable classes.

3.1 Bag of visual words

A set of 150 shape features is extracted from \( f_{\text{p}} \) using the bag of visual words (BoW) model [24]. A codebook containing a certain number of code words (or visual words) is constructed from their local descriptors or features.

3.2 Handcrafted features

Here, handcrafted (HC) features based on the internal structure of the nuclei are proposed. The localized image \( f_{\text{L}} \) is used for the extraction of HC features. It is separated into two regions, the heterochromatic region (HCR) [25] and the euchromatic region (ECR) [25], through stain color differentiation. A modified threshold, equal to twice the Otsu threshold, is used to separate the HCR and ECR of the nuclei. The nuclei component above the threshold value is the ECR and the rest is the HCR. According to histopathology analysis, the ECR and HCR constituents are approximately equal in benign tumors, whereas in malignant tumors the HCR dominates the ECR in all tissue structures. The size of the nuclei increases in the presence of a tumor, along with increases in shape irregularity and heterochromaticity. The set of 31 HC features (F1–F31) is defined in Table 1.
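The HCR/ECR separation can be sketched as a simple thresholding rule. The sketch assumes that the nuclei pixels are available as grayscale intensities in [0, 1] and that the Otsu threshold has already been computed; the pixel values, the threshold and the helper names are hypothetical.

```python
def split_nucleus_regions(pixels, otsu_t):
    """Split nuclei pixels into HCR and ECR using twice the Otsu threshold."""
    threshold = 2.0 * otsu_t
    ecr = [p for p in pixels if p > threshold]   # brighter, euchromatic component
    hcr = [p for p in pixels if p <= threshold]  # darker, heterochromatic component
    return hcr, ecr


def hcr_fraction(pixels, otsu_t):
    """Fraction of nuclei pixels assigned to the HCR."""
    hcr, _ = split_nucleus_regions(pixels, otsu_t)
    return len(hcr) / len(pixels)


# Hypothetical nuclei intensities and a hypothetical Otsu threshold of 0.1
pixels = [0.05, 0.10, 0.18, 0.30, 0.45]
hcr, ecr = split_nucleus_regions(pixels, otsu_t=0.1)
fraction = hcr_fraction(pixels, otsu_t=0.1)
```

A fraction of this kind is one possible way to quantify the HCR dominance described above; whether it matches any of the features F1–F31 in Table 1 is not stated in the text.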

Table 1 Features description of nuclei and image level

3.3 Neighborhood components analysis

Neighborhood components analysis (NCA) is a supervised learning method for classifying multivariate data into distinct classes according to the significance parameter in data [19].

To understand NCA, consider a set of \( N \) training samples \( T = \{ (x_1, y_1), \ldots , (x_i, y_i), \ldots , (x_N, y_N)\} \), where each \( x_i \) is a d-dimensional feature vector with class label \( y_i \). The weighting vector \( w \) is determined so as to select a feature subset that optimizes nearest neighbor classification. The weighted distance \( D_{\text{w}} \) between two samples \( x_{i} \) and \( x_{j} \) in terms of the weighting vector \( w \) is given as:

$$ D_{\text{w}} (x_{i} ,x_{j} ) = \sum\limits_{l = 1}^{d} {w_{l}^{2} \left| {x_{il} - x_{jl} } \right|} $$
(5)

where \( w_{l} \) represents the weight associated with the \( l \)th feature. The weights are chosen to maximize the leave-one-out accuracy of 1-nearest neighbor classification on the training set \( T \). To obtain an effective approximation, the reference point is selected through a probability distribution. The probability that \( x_i \) picks \( x_j \) as its reference point is given as:

$$ p_{ij} = \left\{ {\begin{array}{*{20}c} {\frac{{\kappa \left( {D_{w} \left( {x_{i} ,x_{j} } \right)} \right)}}{{\sum\nolimits_{k \ne i} {\kappa \left( {D_{w} \left( {x_{i} ,x_{k} } \right)} \right)} }}} & {{\text{if}}\;i \ne j} \\ {0,} & {{\text{if}}\;i = j} \\ \end{array} } \right. $$
(6)

where the kernel \( \kappa (z) = \exp ( - z/\sigma ) \) is used with a kernel width \( \sigma \). The kernel width is an input parameter that plays a key role in deciding the reference point. Thus, the probability of correct classification of the query point \( x_i \) is given by:

$$ p_{i} = \sum\nolimits_{j} {y_{{i{\kern 1pt} j}} p_{{i{\kern 1pt} j}} } $$
(7)

where \( y_{ij} = 1 \) if \( y_{i} = y_{j} \) and \( y_{ij} = 0 \) otherwise. As a result, the leave-one-out classification accuracy can be approximated as:

$$ \rho (w) = \frac{1}{N}\sum\limits_{i} {p_{i} } = \frac{1}{N}\sum\limits_{i} {\sum\limits_{j} {y_{{i{\kern 1pt} j}} p_{{i{\kern 1pt} j}} } } . $$
(8)
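Equations 5–8 can be checked numerically on a toy problem. The sketch below uses a hypothetical four-sample dataset in which the first feature separates the classes and the second does not. With all weight on the first feature the approximate leave-one-out accuracy is close to 1, and with all weight on the second it is close to 0.

```python
import math


def weighted_distance(x_i, x_j, w):
    """Weighted distance D_w of Eq. 5."""
    return sum(w_l ** 2 * abs(a - b) for w_l, a, b in zip(w, x_i, x_j))


def reference_probabilities(X, w, sigma):
    """Matrix of p_ij from Eq. 6, with p_ii = 0."""
    n = len(X)
    P = []
    for i in range(n):
        kernel = [0.0 if j == i
                  else math.exp(-weighted_distance(X[i], X[j], w) / sigma)
                  for j in range(n)]
        total = sum(kernel)
        P.append([k / total for k in kernel])
    return P


def loo_accuracy(X, y, w, sigma):
    """Approximate leave-one-out accuracy rho(w) of Eq. 8."""
    P = reference_probabilities(X, w, sigma)
    n = len(X)
    p = [sum(P[i][j] for j in range(n) if y[i] == y[j]) for i in range(n)]  # Eq. 7
    return sum(p) / n


# Hypothetical data: feature 0 is informative, feature 1 is not
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.0], [1.1, 0.0]]
y = [0, 0, 1, 1]
acc_informative = loo_accuracy(X, y, w=[1.0, 0.0], sigma=0.1)
acc_uninformative = loo_accuracy(X, y, w=[0.0, 1.0], sigma=0.1)
```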

As σ tends to zero, \( \rho (w) \) becomes the true leave-one-out classification accuracy. A regularization term (\( \lambda > 0 \)) is further introduced to perform feature selection and alleviate overfitting; hence, the objective function is modified as:

$$ \rho (w) = \frac{1}{N}\sum\limits_{i} {\sum\limits_{j} {y_{{i{\kern 1pt} j}} p_{{i{\kern 1pt} j}} } } - \lambda \sum\nolimits_{l = 1}^{d} {w_{l}^{2} } $$
(9)

The regularization parameter is tuned using cross validation. To update the weights, the regularized objective function \( \rho (w) \) is differentiated with respect to \( w_l \) as follows:

$$ \frac{\partial \rho (w)}{\partial w_{l}} = \frac{1}{N}\sum\limits_{i} \sum\limits_{j} y_{ij} \left[ \frac{2}{\sigma} p_{ij} \left( \sum\limits_{k \ne i} p_{ik} \left| x_{il} - x_{kl} \right| - \left| x_{il} - x_{jl} \right| \right) w_{l} \right] - 2\lambda w_{l} $$
(10)
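The derivative can be verified against a finite-difference approximation. The sketch below implements the regularized objective of Eq. 9 and the gradient obtained by differentiating it term by term, keeping the \( 1/N \) factor of Eq. 9, and then performs one gradient ascent step. The toy data, \( \sigma \), \( \lambda \) and the step size are hypothetical values chosen for illustration.

```python
import math


def probabilities(X, w, sigma):
    """p_ij of Eq. 6 using the weighted distance of Eq. 5."""
    n = len(X)
    P = []
    for i in range(n):
        kernel = [0.0 if j == i else math.exp(
            -sum(wl * wl * abs(a - b) for wl, a, b in zip(w, X[i], X[j])) / sigma)
            for j in range(n)]
        total = sum(kernel)
        P.append([k / total for k in kernel])
    return P


def objective(X, y, w, sigma, lam):
    """Regularized objective of Eq. 9."""
    P = probabilities(X, w, sigma)
    n = len(X)
    acc = sum(P[i][j] for i in range(n) for j in range(n) if y[i] == y[j]) / n
    return acc - lam * sum(wl * wl for wl in w)


def gradient(X, y, w, sigma, lam):
    """Derivative of the objective with respect to each weight w_l."""
    P = probabilities(X, w, sigma)
    n = len(X)
    grad = []
    for l in range(len(w)):
        total = 0.0
        for i in range(n):
            # Expected absolute difference in feature l under p_i.
            mean_diff = sum(P[i][k] * abs(X[i][l] - X[k][l]) for k in range(n))
            for j in range(n):
                if j != i and y[i] == y[j]:
                    total += P[i][j] * (mean_diff - abs(X[i][l] - X[j][l]))
        grad.append(2.0 * w[l] / (sigma * n) * total - 2.0 * lam * w[l])
    return grad


# Hypothetical data and settings
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.0], [1.1, 0.0]]
y = [0, 0, 1, 1]
w = [0.7, 0.4]
g = gradient(X, y, w, sigma=1.0, lam=0.01)
step = 0.1  # hypothetical gradient ascent step size
w_next = [wl + step * gl for wl, gl in zip(w, g)]
```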

The above derivative leads to the corresponding gradient ascent update equation, with which feature optimization is performed on the 31 extracted HC features. NCA is applied here to remove redundant features. It indicates that about one-third of the features do not carry useful information, and these are removed. This optimization provides 20 significant HC features; hence, further processing is performed using the significant HC features.

4 Classification models

The block diagram of the general classification model with feature extraction and selection, along with model hyperparameter optimization, is shown in Fig. 1b. In the proposed computer-aided cancer diagnosis method, two classifiers, SVM and MLP, are employed for classification.

4.1 Support vector machine

The SVM [15] classification model provides high flexibility in classifying distinct classes. Tolerance to misclassified samples is introduced in the SVM through the soft-margin parameter \( C \). The formulation of the soft-margin linear SVM is given as:

$$ {\text{Minimize}}\left[ {\frac{1}{2}\sum\limits_{i = 1}^{n} {w_{i}^{2} } + C\sum\limits_{i = 1}^{N} {\xi_{i} } } \right] $$
(11)

subject to \( y_{i} \left( {\vec{w} \cdot \vec{x}_{i} + b} \right) \ge 1 - \xi_{i} \) and \( \xi_{i} \ge 0 \) for \( i = 1, \ldots , N \).

Additional separation can be introduced by a nonlinear projection into a high-dimensional space using the Gaussian kernel, defined as:

$$ K\left( {\vec{x}_{i} ,\vec{x}_{j} } \right) = \exp \left( { - \gamma \left\| {\vec{x}_{i} - \vec{x}_{j} } \right\|^{2} } \right) $$
(12)
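For illustration, the soft-margin objective of Eq. 11 and the Gaussian kernel of Eq. 12 can be evaluated directly. The sketch relies on the standard fact that, for fixed \( \vec{w} \) and \( b \), the smallest feasible slack is \( \xi_i = \max(0, 1 - y_i(\vec{w} \cdot \vec{x}_i + b)) \). The weights, samples and parameter values are hypothetical.

```python
import math


def gaussian_kernel(x_i, x_j, gamma):
    """Gaussian kernel of Eq. 12."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def soft_margin_objective(w, b, X, y, C):
    """Value of Eq. 11 with each slack set to its smallest feasible value."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return 0.5 * sum(w_k ** 2 for w_k in w) + C * sum(slacks)


# Hypothetical linear model and samples with labels in {-1, +1}
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
value = soft_margin_objective(w=[1.0, 0.0], b=0.0, X=X, y=y, C=1.0)

k_same = gaussian_kernel([0.0, 0.0], [0.0, 0.0], gamma=1.0)
k_unit = gaussian_kernel([0.0, 0.0], [1.0, 0.0], gamma=1.0)
```

Only the third sample lies inside the margin here, so it alone contributes a slack of 0.5 to the objective.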

The SVM model is trained using \( k \)-fold cross validation. The training and testing process is repeated \( k \) times, tracking the performance of the model in predicting the holdout set with a performance metric such as accuracy. Tenfold cross validation with \( C = 1 \) and \( \gamma = 1 \) is used for SVM model training and testing. A set of 40 objective evaluations provides the best feasible point, with a box constraint of 0.0012 and a kernel scale of 0.0192.
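The fold construction behind \( k \)-fold cross validation can be sketched as follows. The sketch splits sample indices into contiguous folds without shuffling, which is a simplification, and `fit_and_score` is a hypothetical callback standing in for training a classifier and returning its holdout accuracy.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def cross_validate(n_samples, k, fit_and_score):
    """Average the holdout score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Tenfold split of 290 samples (the size of the Exp. 1 image set: 160 + 130)
folds = k_fold_indices(290, 10)

# Hypothetical scorer that only reports the holdout fraction, for demonstration
mean_score = cross_validate(290, 10, lambda train, test: len(test) / 290)
```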

4.2 Multilayer perceptron model

The performance of a neural network based classifier depends on the model selected and on appropriate training of the model. We have trained an MLP [16] for binary classification with some nonlinearity, described for an input feature vector \( x \) as:

$$ o^{(0)} = x,\quad o^{(l)} = F^{l} \left( {W^{l} \hat{o}^{{\left( {l - 1} \right)}} } \right)\quad {\text{for}}\quad l = 1, \ldots ,L. $$
(13)

The input vector \( x \) is taken as the “output of the zeroth layer”. The hat notation \( \hat{o}^{{\left( {l - 1} \right)}} \) represents an operation in which the number 1 is prepended to a vector to increase its dimension. Hence, the bias terms of layer \( l \) can be written as the first column of the matrix \( W^{l} \). The notation \( F^{l} \) represents the application of an activation function (sigmoid) to all components of a vector. The softmax function is used as the activation function in the output layer of the five-layered (\( {\text{Input}} + 3\;{\text{hidden}} + {\text{output}} \)) MLP model. The numbers of neurons in the three hidden layers are 6, 10 and 8, respectively. The feed-forward, fully connected MLP model is implemented with sigmoid activation. The model is trained using backpropagation with learning rate \( \eta = 0.12 \), taking the input features as inputs and the image labels as the target variable. For the parameter update, the gradient of the error with respect to the weights, termed the weight error, is computed using the stochastic gradient descent algorithm. The parameters are updated so that each step moves the MLP closer to the error minimum. A batch size of 5 samples and 1000 epochs are used for the optimum result.
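The forward pass of Eq. 13 with the hat notation can be sketched as below. The hidden layer sizes (6, 10, 8), the sigmoid hidden activations and the softmax output follow the text; the input dimension of 4, the two output neurons (benign and malignant), the random weight range and the seed are assumptions made for this illustration. Training by backpropagation is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def layer(W, o_prev, activation):
    """One layer of Eq. 13: prepend 1 (hat notation) so the first column of W holds the biases."""
    o_hat = [1.0] + list(o_prev)
    z = [sum(w * o for w, o in zip(row, o_hat)) for row in W]
    return activation(z)


def forward(weights, x):
    """Sigmoid activations in the hidden layers, softmax in the output layer."""
    o = list(x)
    for W in weights[:-1]:
        o = layer(W, o, lambda z: [sigmoid(v) for v in z])
    return layer(weights[-1], o, softmax)


def init_weights(sizes, seed=0):
    """Random weight matrices of shape (n_out, n_in + 1) for consecutive layer sizes."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


# Hypothetical 4-feature input; hidden layers of 6, 10 and 8 neurons; 2 output classes
weights = init_weights([4, 6, 10, 8, 2])
probs = forward(weights, [0.2, 0.5, 0.1, 0.9])
```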

4.3 Hyperparameter tuning framework

Further, hyperparameter tuning is done with the objective to maximize the validation accuracy as:

$$ X^{*} = \arg \max\limits_{X \in \chi } f(X) $$
(14)

where \( \chi \subseteq R^{D} \) is the hyperparameter search space and \( f(X) \) represents the model performance on validation data for a set of hyperparameters \( X \). The search space is bounded between \( l \) and \( u \), the D-dimensional vectors denoting the lower and upper ranges, respectively. The ultimate goal is to optimize the hyperparameters on the whole training data. We start by taking a small subset of the training data to identify the optimal hyperparameters using Bayesian optimization [20], which is applied repeatedly to a number of different small subsets. The mean of the optimal hyperparameters is then taken as a robust estimate. The parameters C, γ and kernel are optimized for the SVM, and the hidden layers, activation, solver and learning rate are optimized for the MLP classifier.
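The subset-and-average scheme can be sketched as follows. Note that this sketch swaps in plain random search for Bayesian optimization to stay short; it is not the optimizer used in the paper. The objective factory, the bounds and the per-subset parameters are all hypothetical stand-ins for validation accuracy measured on real data subsets.

```python
import random


def random_search(objective, lower, upper, n_trials, rng):
    """Maximize objective over the box [lower, upper] (stand-in for Bayesian optimization)."""
    best_x, best_val = None, float("-inf")
    for _ in range(n_trials):
        x = [rng.uniform(l, u) for l, u in zip(lower, upper)]
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x


def tune_on_subsets(make_objective, subsets, lower, upper, n_trials=40, seed=0):
    """Optimize on each subset separately, then average the per-subset optima."""
    rng = random.Random(seed)
    optima = [random_search(make_objective(s), lower, upper, n_trials, rng)
              for s in subsets]
    return [sum(x[k] for x in optima) / len(optima) for k in range(len(lower))]


def make_objective(subset_center):
    # Hypothetical surrogate for validation accuracy on one data subset
    def objective(x):
        return -((x[0] - subset_center) ** 2) - (x[1] - 1.0) ** 2
    return objective


lower, upper = [0.0, 0.0], [2.0, 2.0]
subsets = [0.8, 1.0, 1.2]  # each value parameterizes one hypothetical subset
x_star = tune_on_subsets(make_objective, subsets, lower, upper)
```

Averaging the per-subset optima keeps the estimate inside the bounding box because the box is convex.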

5 Experimental setup

The experimental setup covers the datasets used for localization and classification and the evaluation parameters calculated for performance analysis. Bisque is an acronym for Bio-Image Semantic Query User Environment, which provides a cloud-based system to store, organize, visualize and analyze images from various datasets; the breast cancer (BC) dataset is one of the Bisque collections. The Bisque dataset contains 32 benign and 26 malignant cases [26]. The Breast Cancer Histopathological Image Classification (BreakHis) dataset contains 9109 microscopic images of breast tumor tissue (from 82 patients) acquired at 40×, 100×, 200× and 400× magnifying factors [7].

5.1 Proposed multi-organ dataset

A set of microscopic images from multiple organs is prepared and analyzed, with 10 cases for each organ. The images are taken at magnifications of 40×, 100×, 400× and 1000×. The images at 100× are used for the gross analysis of cancer detection based on localized nuclei characteristics, and those at 1000× are used for the analysis of nuclei localization along with ECR and HCR segmentation.

5.2 Evaluation parameters

To measure the performance of the proposed localization method, the following segmentation parameters are used: F1-score (FS) [27, 28], Jaccard index (JI) [29] and Hausdorff distance (HD) [30]. Accuracy and area under the curve (AUC) [31] are used to measure classification performance, along with the receiver operating characteristic (ROC) curve [31].
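The three segmentation measures can be illustrated using their standard definitions. The sketch assumes flattened binary masks for FS and JI and 2D boundary point sets for HD; the exact variants used in [27–30] may differ, and the example masks and points are hypothetical.

```python
import math


def f1_and_jaccard(pred, truth):
    """F1-score and Jaccard index of two binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks are empty, treated as perfect agreement
    return 2 * tp / (2 * tp + fp + fn), tp / (tp + fp + fn)


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty sets of 2D points."""
    def directed(p_set, q_set):
        return max(min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in q_set)
                   for p in p_set)
    return max(directed(a, b), directed(b, a))


# Hypothetical predicted and ground-truth masks (flattened) and boundary points
pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
fs, ji = f1_and_jaccard(pred, truth)
hd = hausdorff_distance([(0, 0), (1, 0)], [(0, 0), (4, 0)])
```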

6 Results and discussion

The localization and classification are evaluated both qualitatively and quantitatively.

6.1 Localization work evaluation

The image localization is visually illustrated in Fig. 4. The images in the upper three rows are from the Bisque dataset, the fourth-row image is from the proposed dataset, and the lower two rows are from the BreakHis dataset. Figure 4a shows the H&E-stained original image, and Fig. 4b depicts the preprocessed image, followed by the corresponding ground truth in Fig. 4c. The hematoxylin component of the stain-decomposed image is shown in Fig. 4d. Figure 4e–g shows the nuclei-segmented image and the HCR and ECR components, respectively. The first and fifth rows are benign cases, and the rest are malignant cases. In malignant cases, the HCR increases inside the nuclei.

Fig. 4 (a) Original image, (b) preprocessed image, (c) ground truth, (d) the hematoxylin component of the stain-decomposed image, (e) nuclei-segmented image, (f) heterochromatic component, (g) euchromatic component of the nuclei

The quantitative performance of the proposed localization method, in terms of average FS, JI and HD, is shown in Table 2 for the Bisque and BreakHis (400× magnification images) datasets. The proposed method provides an average FS of 0.861, which is 23% and 21% better than the best performing baseline methods NSDR and DFDLM, respectively. The proposed method provides a JI of about 0.721 for the BreakHis dataset images, with an overall average of 0.713. The average accuracy is 0.919, which is 10% greater than that of the NSDR method.

Table 2 Average FS, JI and HD for Bisque and BreakHis

The performance of the proposed localization method is also validated on the proposed multi-organ dataset, which comprises 10 images from each organ (breast, cervix and tongue) with equal counts of benign and malignant cases, as shown in Table 3. The accuracy of the proposed method is 0.918, which is on par with the results on the standard datasets.

Table 3 The FS, JI and HD of the proposed multi-organ (Breast = Bst, Cervix = Cvx, Tongue = Ton) image dataset at 1000× magnification

6.2 Experimental setup for classification

The experimental setup is divided into three categories with different dataset and image augmentation combinations. The small dataset size is increased by flipping and by shearing at 10 degrees along the horizontal (or vertical) axis, which gives two sets of images. Random cropping and 10% random noise addition provide two further sets of images. The image dataset thus becomes five times its original size (four regenerated sets + original), as shown in Table 4. Exp. 1 is designed using the Bisque dataset with 160 benign and 130 malignant HIs. Exp. 2 is designed with the BreakHis data, an imbalanced dataset with a total of 9050 images, of which 2890 are benign and 6160 are malignant. In Exp. 3, the dataset imbalance problem is addressed. For balancing, five selected images each are taken from the ductal carcinoma in situ (DCiS) [7]. This provides a set of 5850 images at 400× from the BreakHis dataset. Exp. 3 comprises 2890 benign and 2960 malignant HIs.

Table 4 Dataset description for classification experiments (five times the original sample images)

6.3 Evaluation of the classification work

The MLP classifier performed best using the combination of BoW and HC features, with an average accuracy of 96.75% for the balanced dataset (Exp. 3) and 93.86% for the imbalanced dataset (Exp. 2), as shown in Table 5.

Table 5 Average accuracy (Ac%) and area under the curve (AUC) for Bisque and BreakHis datasets classification

The MLP provides the highest AUC of 0.94 for the balanced dataset. The confusion matrices for Exp. 1–Exp. 3 using BoW with HC features are shown in Table 6. The average accuracy of the proposed method is 95.03%, which is 10% and 7% higher than the CRDNN and MDC methods, respectively. The average AUC is 0.92, compared with 0.84 and 0.85 for MDC and CRDNN, respectively, as shown in Table 7.

Table 6 Confusion matrix (in %) for Exp. 1–Exp. 3, classifying data into benign (B) and malignant (M) classes
Table 7 Comparison of classification performance with previous works on BreakHis dataset

7 Conclusion

This paper presents an HI localization-based classification method that uses a combination of BoW features and HC features. The proposed method categorizes HIs into benign and malignant classes. The HC features are extracted on the basis of the intra-nuclei region separation of the localized image into two components: HCRs and ECRs. A total of 31 HC features are extracted, of which 20 significant features are selected using neighborhood components analysis. The BoW features, in association with the HC features, are used for classification with MLP and SVM models. The simulation results are obtained using the Bisque, BreakHis and proposed datasets. The proposed localization method attains an average accuracy of 91.85%. The MLP model with the balanced dataset is reported as the best performer, and the proposed method attains an average accuracy of 95.03%, which is 10% and 7% higher than that of the CRDNN and MDC methods, respectively.