
1 Introduction

Analysis of binary images is still among the most popular applications of machine vision, both in industry and in other computer vision tasks where the shape of objects plays the dominant role. Although in some industrial applications with controlled lighting conditions, as well as in the analysis of high quality scanned documents, classical global thresholding algorithms, such as the well-known Otsu method [24], may be sufficient, for unevenly illuminated objects or degraded document images even the use of more advanced adaptive methods might be challenging in some cases. For some of the popular adaptive methods, e.g. those proposed by Niblack [18] or Sauvola [29], the obtained results may be far from expectations, especially in outdoor scenarios. On the other hand, more sophisticated methods may be troublesome to implement in embedded systems and devices with low computing performance.

Typical areas of application where the quality of binary images obtained from natural images is important include Optical Character Recognition (OCR), Optical Mark Recognition (OMR), recognition of QR codes, self-localization, terrain exploration and path following in the autonomous navigation of vehicles and mobile robots, video monitoring and inspection, etc. Nevertheless, due to the lack of image and video datasets containing both natural and ground truth images other than document images, a widely accepted approach to the performance evaluation of image binarization methods is the use of the datasets provided yearly by the organizers of the Document Image Binarization COmpetitions (DIBCO), taking place during two major conferences, namely the International Conference on Document Analysis and Recognition (ICDAR) and the International Conference on Frontiers in Handwriting Recognition (ICFHR).

Although these datasets contain images with more and more challenging distortions each year, another interesting possibility is the additional verification of the proposed methods on the images included in the Bickley Diary dataset [5], containing 92 photocopies of individual pages from a diary written ca. 100 years ago by the wife of one of the first missionaries in Malaysia, Bishop George H. Bickley. Since the distortions in this dataset are related not only to the overall noise caused by photocopying, but also to discoloration and water stains, as well as differences in ink contrast across the years, it may be considered even more challenging than the DIBCO datasets [26]. To ensure a reliable verification of the advantages of the method proposed in this paper, all currently available DIBCO datasets together with the Bickley Diary database have been used.

Although many different approaches to image binarization have been presented over the years, including adaptive methods proposed e.g. by Bradley [2], Feng [6], Niblack [18], Sauvola [29] or Wolf [35], and their modifications [28, 30], the computational effort required by each newly developed algorithm usually increases. Good examples are the applications of local features with the use of Gaussian Mixture Models [17] or the use of deep neural networks [32], where multiple processing stages are necessary. Comparisons of popular methods and their overviews can be found in recent survey papers and books [3, 31]. Many methods require additional background removal, median filtering or morphological processing, as well as a time-consuming training process in the case of the recently popular deep convolutional neural networks. Therefore, our motivation is to increase the performance of some classical methods by means of efficient image preprocessing rather than to compete with sophisticated state-of-the-art methods and solutions based on deep learning, considering also the time-quality efficiency challenges [14].

2 Modelling the Histograms of Distorted Images

2.1 Generalized Gaussian Distribution and Gaussian Mixture Model

The application areas of the Generalized Gaussian Distribution (GGD) cover a wide range of signal and image processing methods based on the estimation of various models, including e.g. tangential wavelet coefficients used to compress three-dimensional triangular mesh data [12] or the generation of augmented quaternion random variables with the GGD [7]. Some other popular applications are related to no-reference image quality assessment (IQA) based on the natural scene statistics (NSS) model, used to describe certain regular statistical properties of natural images [38], as well as image segmentation [33] and the approximation of an atmospheric point spread function (APSF) kernel [34].

One of the main advantages of the GGD is that it covers other popular distributions, namely the Gaussian and Laplacian distributions, the uniform distribution, an impulse function, as well as some other special cases [9, 10]. Estimation of its parameters is possible using various methods [37]. Its extension to the multidimensional case [25] and to complex-valued variables [19] is also possible.

The probability density function of the GGD can be expressed as [4]:

$$\begin{aligned} f(x)=\frac{\lambda \cdot p}{2 \cdot \varGamma \left( \frac{1}{p}\right) }e^{-[\lambda \cdot |x|]^{p}}, \end{aligned}$$
(1)

where p denotes the shape parameter, \(\varGamma (z)=\int _{0}^{\infty }t^{z-1}e^{-t}dt, z>0\) [23] and \(\lambda \) is the scale parameter related to the standard deviation \(\sigma \) of the distribution by \(\lambda (p,\sigma )=\frac{1}{\sigma }\left[ \frac{\varGamma (\frac{3}{p})}{\varGamma (\frac{1}{p})}\right] ^{\frac{1}{2}}\). The choice of the parameter \(p=1\) corresponds to the Laplacian distribution, whereas \(p=2\) is typical for the Gaussian distribution. When \(p \rightarrow \infty \), the GGD density function approaches a uniform distribution, and for \(p \rightarrow 0\), f(x) becomes an impulse function.
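A minimal sketch of Eq. (1) and the \(\lambda (p,\sigma )\) relation is given below; the function names are illustrative and the SciPy-based check of the Gaussian special case (\(p=2\)) is only a sanity test, not part of the described method.

```python
import numpy as np
from scipy.special import gamma

def ggd_lambda(p, sigma):
    """Scale parameter lambda(p, sigma) from the relation given in the text."""
    return (1.0 / sigma) * np.sqrt(gamma(3.0 / p) / gamma(1.0 / p))

def ggd_pdf(x, p, sigma):
    """Probability density of the zero-mean GGD, Eq. (1)."""
    lam = ggd_lambda(p, sigma)
    return lam * p / (2.0 * gamma(1.0 / p)) * np.exp(-(lam * np.abs(x)) ** p)

# p = 2 reduces to the Gaussian case, p = 1 to the Laplacian case.
x = np.linspace(-4, 4, 9)
print(np.allclose(ggd_pdf(x, 2.0, 1.0),
                  np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)))  # True
```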

A Gaussian Mixture Model (GMM) consists of Gaussian distribution components defined by their locations \(\mu \) and standard deviations \(\sigma \), together with a vector of mixing proportions. Since the main purpose of this investigation is image binarization, the application of two Gaussian components is considered, as only two classes of pixels are assumed. In other words, it is assumed that only two clusters are present in the image, consisting of pixels representing text and background respectively, and, in the ideal case, each cluster is represented by a single Gaussian component. Having computed the parameters of the GMM with two Gaussian components, the initial threshold should be located between the locations \(\mu \) of both distributions and may be calculated in several ways, e.g. as the intersection point of the two determined curves.

The GMM parameters can be determined using the iterative Expectation-Maximization (EM) algorithm, which alternates between two steps until convergence is achieved. The expectation step estimates the expected class membership for each observation, and the maximization step optimizes the parameters of the probability distributions using maximum likelihood.
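The sketch below illustrates the two-component fit and the intersection-point threshold; scikit-learn's EM-based GaussianMixture is used here purely for illustration (the estimator actually used in the paper is the fast approximated moment based method [8]), and the function name is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import norm

def gmm2_threshold(grey_image):
    """Fit a 2-component GMM to pixel intensities and return the intersection
    of the two weighted Gaussian curves located between their means."""
    samples = grey_image.reshape(-1, 1).astype(float)
    gmm = GaussianMixture(n_components=2, covariance_type='full',
                          random_state=0).fit(samples)
    mu = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel())
    w = gmm.weights_.ravel()
    # Scan the intensity axis between the two means for the crossing point
    # of the weighted component densities.
    lo, hi = np.sort(mu)
    x = np.linspace(lo, hi, 1000)
    diff = w[0] * norm.pdf(x, mu[0], sigma[0]) - w[1] * norm.pdf(x, mu[1], sigma[1])
    crossing = np.where(np.diff(np.sign(diff)) != 0)[0]
    return x[crossing[0]] if crossing.size else 0.5 * (lo + hi)
```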

2.2 General Assumptions for Natural Images

Natural images representing old handwritten or machine-printed documents contain some specific distortions resulting from the gradual degradation of the original manuscripts or printings over the years. Visible imperfections, such as faded and low contrast ink, as well as the presence of noise and stains, influence the histogram of the image. Hence, assuming the analysis of greyscale images, more intermediate grey levels may be observed, similarly as for binary images corrupted by Gaussian noise, as illustrated in Fig. 1 for the sample image no. 7 from the DIBCO2017 dataset. This similarity is especially well visible when two Gaussian distributions are used to model the histogram of the ground truth (GT) binary image corrupted by Gaussian noise.
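A short sketch of this assumption, in the spirit of Fig. 1, is given below: a {0, 255} ground truth image is corrupted by Gaussian noise and its 256-bin histogram is computed for comparison with the histogram of the degraded greyscale document. The function name and noise level are illustrative assumptions.

```python
import numpy as np

def noisy_binary_histogram(gt_binary, sigma=30.0, seed=0):
    """Corrupt a {0, 255} ground truth image with Gaussian noise and return
    its 256-bin histogram."""
    rng = np.random.default_rng(seed)
    noisy = gt_binary.astype(float) + rng.normal(0.0, sigma, gt_binary.shape)
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    hist, _ = np.histogram(noisy, bins=256, range=(0, 256))
    return hist
```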

Fig. 1. Illustration of the general assumption of the proposed approach based on the similarity of histograms: (a) sample greyscale document image, (b) ground truth binary image, (c) binary image corrupted by Gaussian noise, (d)-(f) histograms of the respective images.

Fig. 2. Exemplary results of the approximation of the histogram of the sample image from the DIBCO2017 dataset with the obtained thresholds: (a) using the GGD, (b) for two Gaussian distribution components of the GMM.

Therefore, the approximation of histograms by the GMM with two Gaussian components should be useful for the initial thresholding step, eliminating most of the background information. It can be conducted by choosing a threshold between the two peaks of the approximated histogram defined by their location parameters \(\mu \). Another possible approach, investigated in one of our previous papers [11], is the use of a single Gaussian distribution or the GGD, being its extended version. Nevertheless, the choice of its location parameter \(\mu \) as the initial threshold leads to the elimination of less background information. The comparison of the parameters of the GGD and the GMM with two Gaussian components (further referred to as GMM2), obtained for the sample image no. 7 from DIBCO2017 with a typical bimodal histogram, is shown in Fig. 2, where the threshold selected for the GMM is marked as the intersection point of both Gaussian curves.

Fig. 3. Exemplary results obtained for image no. 8 from the DIBCO2018 dataset: (a) original image, (b) ground truth image, (c) histogram and threshold obtained using the GGD, (d) histogram and threshold for the GMM2, (e) and (f) respective thresholding results obtained for the two methods.

For some images the GMM2 components may be located closer to each other and therefore the choice of an appropriate threshold may be more troublesome. In such situations the solution proposed in the paper [11] may be insufficient, whereas the application of the GMM2 makes it possible to remove the background information more effectively. Some exemplary results, obtained for sample image no. 8 from the more challenging DIBCO2018 dataset, are presented in Fig. 3, where the greater ability to remove unnecessary background can be clearly observed for the GMM2. Images obtained in this way may be subjected to further binarization steps.

3 Proposed Method

3.1 Improved Two-Step Binarization Algorithm

Taking advantage of the similarity between the histograms of degraded document images converted to greyscale and binary GT images corrupted by Gaussian noise, chosen as the most widespread type of noise in practical applications, the first step of the proposed algorithm is the calculation of the parameters of the Gaussian distributions. Since the universality of the proposed approach is of great importance, three possible models are used: a single Gaussian distribution, the GGD and the GMM2. Using the most relevant parameters, \(\mu \) and \(\sigma \), several variants of possible thresholds \(X_{thr}\) have been tested, including:

  • location parameter \(\mu _{GGD}\) of the GGD (originally proposed in [11]),

  • location parameter \(\mu _G\) of the single Gaussian distribution,

  • location parameter \(\mu _G\) lowered by Gaussian standard deviation \(\sigma _G\),

  • intersection of two GMM2 curves (thr), as shown in Fig. 2b,

  • the greater of the two GMM2 location parameters \(\mu _{GMMmax}\),

  • weighted average of the two GMM2 locations: \(\mu _{GMM01} \cdot w_{01} + \mu _{GMM02} \cdot w_{02}\),

  • weighted average of the two GMM2 locations lowered by the respective standard deviations: \((\mu _{GMM01} - \sigma _{01}) \cdot w_{01} + (\mu _{GMM02} - \sigma _{02}) \cdot w_{02}\),

  • the minimum of the above threshold values.

The weighting coefficients \(w_{01}\) and \(w_{02}\) have been determined during the calculation of the GMM and normalized so that \(w_{01} + w_{02} = 1\). To avoid the necessity of using sophisticated estimators based on maximum likelihood, moments, entropy matching or global convergence [27], the values of the four GGD parameters, i.e. the shape parameter p, the location parameter \(\mu \), the scale parameter \(\lambda \) and the standard deviation \(\sigma \), as well as the parameters of the GMM2, have been determined using the fast approximated method based on the standardized moment, described in the paper [8].
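A sketch of the candidate thresholds listed above is shown below, assuming that the GGD, single Gaussian and GMM2 parameters have already been estimated; the function and key names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def candidate_thresholds(mu_ggd, sigma_ggd, mu_g, sigma_g,
                         mu_gmm, sigma_gmm, w_gmm, thr_intersection):
    """Return the threshold variants X_thr examined in the first stage."""
    mu1, mu2 = mu_gmm
    s1, s2 = sigma_gmm
    w1, w2 = np.asarray(w_gmm, dtype=float) / np.sum(w_gmm)  # w1 + w2 = 1
    variants = {
        'ggd_mu': mu_ggd,                          # originally proposed in [11]
        'gauss_mu': mu_g,
        'gauss_mu_minus_sigma': mu_g - sigma_g,
        'gmm2_intersection': thr_intersection,     # as in Fig. 2b
        'gmm2_mu_max': max(mu1, mu2),
        'gmm2_weighted_mu': mu1 * w1 + mu2 * w2,
        'gmm2_weighted_mu_minus_sigma': (mu1 - s1) * w1 + (mu2 - s2) * w2,
    }
    variants['minimum'] = min(variants.values())
    return variants
```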

Additionally, all the above parameters have also been calculated after filtering the 256-bin image histograms with a 5-element median filter used to remove peaks. However, the results obtained using this approach have been worse for all databases, and slightly higher binarization accuracy has been observed only for a few images. Therefore, all further experiments have been conducted using the original histograms without the additional time-consuming filtering.

To improve text readability, the determined threshold \(X_{thr}\) is used instead of the maximum intensity value in the classical normalization of pixel intensity levels, applied only to the intensity levels not exceeding \(X_{thr}\), as

$$\begin{aligned} Y(i,j) = \left| \frac{(X(i,j) - X_{min})\cdot 255}{X_{thr} - X_{min}}\right| , \end{aligned}$$
(2)

where \( 0 \le X_{min}< X_{thr} < X_{max} \le 255 \) and

  • \( X_{thr} \) is the upper threshold determined during the proposed preprocessing,

  • \( X_{min} \) is the minimum intensity of all image pixels,

  • \( X_{max} \) is the maximum intensity of all image pixels,

  • X(i, j) is the intensity level of the input pixel at coordinates (i, j),

  • Y(i, j) is the intensity level of the output pixel at coordinates (i, j).

Assuming the presence of dark text on a brighter background, in order to partially remove the bright background data, which usually contain distortions not influencing the text information, the intensity values of all pixels brighter than \(X_{thr}\) are set to 255 independently of formula (2).
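A minimal sketch of this first-stage operation is given below: the stretching of Eq. (2) is applied to intensities up to \(X_{thr}\) and the remaining pixels are clipped to 255. An 8-bit greyscale input is assumed and the function name is illustrative.

```python
import numpy as np

def stretch_below_threshold(x, x_thr):
    """Normalize intensities below x_thr to the full 0-255 range (Eq. 2) and
    set the remaining (background) pixels to 255."""
    x = x.astype(float)
    x_min = x.min()
    y = np.abs((x - x_min) * 255.0 / (x_thr - x_min))
    y[x > x_thr] = 255.0
    return np.clip(np.round(y), 0, 255).astype(np.uint8)
```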

As a result, the brightness range is limited from \( \langle X_{min} \; ; \; X_{max} \rangle \) to \( \langle X_{min} \; ; \; X_{thr} \rangle \) with additional normalization to the range \( \langle 0 \; ; \; 255 \rangle \), which increases the dynamic range for images with overexposure or visibly low ink contrast regardless of the selected upper threshold. Finally, an intermediate image with partially eliminated background is obtained, which serves as the input for other classical global or adaptive binarization methods. Since such images are better balanced in terms of text and background information, it is assumed that the finally obtained thresholds should be closer to expectations in comparison with those achieved by the same methods without the proposed preprocessing.

3.2 Acceleration of Calculations Using the Monte Carlo Method

The idea of the Monte Carlo method is based on a significant decrease of the number of analysed pixels while preserving the statistical properties of the image histogram. According to the law of large numbers and the central limit theorem, in such a statistical experiment the sequence of successive approximations of the estimated value converges to the sought solution. Therefore, using a pseudo-random number generator with a uniform distribution, the number of analysed pixels can be limited, decreasing the computational burden [21].

To avoid the necessity of using two independent generators to draw the coordinates of pixels, the image is initially reshaped into a one-dimensional vector V containing the intensities of all \(M \times N\) pixels. Applying a pseudo-random number generator with good statistical properties and a uniform distribution, n independent draws of positions in the vector V are conducted. To build an estimate of the simplified histogram, the number of randomly drawn pixels (k) for each intensity level is counted and the estimate is calculated according to

$$\begin{aligned} \hat{L}_{MC} = \frac{k}{n} \cdot M \cdot N, \end{aligned}$$
(3)

where k denotes the number of randomly chosen pixels of the given intensity, n is the total number of draws and \(M \times N\) determines the image size. In some applications a random choice of pixels can also be made in parallel to increase the computational speed.
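The sketch below implements the estimate of Eq. (3): the image is flattened into the vector V, n positions are drawn uniformly, and the per-level counts are rescaled back to the full image size. The function name is an illustrative assumption.

```python
import numpy as np

def monte_carlo_histogram(grey_image, n, seed=None):
    """Estimate the 256-bin histogram from n uniformly drawn pixel positions."""
    rng = np.random.default_rng(seed)
    v = grey_image.ravel()                    # one-dimensional vector V
    idx = rng.integers(0, v.size, size=n)     # n independent draws
    k, _ = np.histogram(v[idx], bins=256, range=(0, 256))
    return k / n * v.size                     # L_hat_MC = k / n * M * N
```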

Analysing the convergence of the method [11], the estimation error can be determined as

$$\begin{aligned} \varepsilon _\alpha = \frac{u_\alpha }{\sqrt{n}} \cdot \sqrt{\frac{K}{M \cdot N} \cdot \left( 1 - \frac{K}{M \cdot N}\right) }, \end{aligned}$$
(4)

where K is the total number of pixels with a given intensity and \(u_\alpha \) represents the two-sided critical value. Nevertheless, the influence of even relatively high values of the above estimation error (calculated for the histogram) on the determined binarization thresholds is marginal.
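A small numeric illustration of Eq. (4) follows; the confidence level (95%, i.e. \(u_\alpha \approx 1.96\)), the image size and the intensity share are illustrative assumptions, not values from the paper.

```python
import math

def mc_estimation_error(n, K, M, N, u_alpha=1.96):
    """Estimation error of Eq. (4) for an intensity level covering K pixels."""
    p = K / (M * N)
    return u_alpha / math.sqrt(n) * math.sqrt(p * (1.0 - p))

# e.g. 5% of a 1000 x 800 image drawn (n = 40000), level covering 10% of pixels:
print(mc_estimation_error(n=40000, K=80000, M=1000, N=800))  # about 0.003
```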

The estimated simplified histogram obtained in this way may be successfully used as the input data for histogram based global thresholding methods [13, 22]; however, its use for adaptive thresholding would require the division of images into regions. Nevertheless, the direct application of this approach to typical adaptive methods based on the analysis of the local neighbourhood of each pixel, such as Bradley [2], Niblack [18] or Sauvola [29], would be troublesome.

In the proposed approach the Monte Carlo method is applied to reduce the computational effort of the first step of the algorithm. Due to the use of the simplified histogram, it is possible to estimate the initial upper threshold \(X_{thr}\) using a significantly reduced number of samples. To limit the influence of potentially imbalanced intensities of the randomly chosen pixels for a small number of draws, the Monte Carlo experiment may be repeated and the median of the determined thresholds selected as the result. To verify the stability of this approach, experiments have been conducted with the use of 2.5%, 5%, 7.5%, 10%, 12.5% and 15% of the total number of pixels in the consecutive images from all available DIBCO datasets, as well as for all 92 images from the Bickley Diary database. For the lower percentages, median values from 3, 5, 7 and 9 Monte Carlo experiments have been chosen, although it should be noted that, from the computational point of view, e.g. the use of 9 draws for 5% of the pixels can be treated as equivalent to a random choice of 45% of all pixels (not counting the time necessary for the selection of the median value).
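A sketch of the repeated experiment with the median threshold is shown below; `threshold_fn` stands for any of the first-stage threshold estimators operating on a histogram (an assumption, since the paper does not fix a single variant here), and the function name is illustrative.

```python
import numpy as np

def median_mc_threshold(grey_image, threshold_fn, fraction=0.05, repeats=9, seed=0):
    """Repeat the Monte Carlo histogram estimate and return the median of the
    thresholds computed by threshold_fn from the simplified histograms."""
    rng = np.random.default_rng(seed)
    v = grey_image.ravel()
    n = int(fraction * v.size)
    thresholds = []
    for _ in range(repeats):
        idx = rng.integers(0, v.size, size=n)
        hist, _ = np.histogram(v[idx], bins=256, range=(0, 256))
        thresholds.append(threshold_fn(hist / n * v.size))
    return float(np.median(thresholds))
```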

The second stage of the proposed approach, which relies on some previously proposed binarization methods, does not use the Monte Carlo method, although for some of the global methods it might be possible as well and should be considered in further research. Nevertheless, in the first experiment it has been assumed that n is equal to the total number of pixels (\(M \times N\)); however, as described further, it may be significantly reduced by applying the Monte Carlo method without affecting the accuracy of binarization.

4 Experimental Results

The verification of the proposed approach has been made using 208 images: 116 images from the 9 available DIBCO datasets (2009 to 2018), converted to greyscale according to the popular ITU-R Recommendation BT.601, and 92 monochrome images from the Bickley Diary dataset. All the calculations have been made for full images, as well as for a limited number of samples, applying the Monte Carlo method with repetitions and the median choice, as stated above. During the first stage of the algorithm, all threshold variants listed in Sect. 3.1 have been examined. The images with partially eliminated background information obtained in this way have been subjected to further binarization in the second stage, using popular thresholding methods, such as: a fixed threshold (0.5 of the intensity range), Bernsen [1] (also with the local Gaussian window), Bradley [2], Otsu [24], Sauvola [29] and Wolf [35].
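A sketch of such a two-stage pipeline is given below, using the BT.601 luma weights for the greyscale conversion and scikit-image's threshold_otsu and threshold_sauvola as stand-ins for the second-stage methods; this is not the authors' implementation, and the function names are assumptions.

```python
import numpy as np
from skimage.filters import threshold_otsu, threshold_sauvola

def to_grey_bt601(rgb):
    """ITU-R BT.601 luma conversion."""
    return (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1]
            + 0.114 * rgb[..., 2]).astype(np.uint8)

def two_stage_binarization(rgb, x_thr, method='otsu'):
    """First stage: Eq. (2) stretching below x_thr; second stage: a classical
    thresholding method, with text (dark) pixels returned as True."""
    grey = to_grey_bt601(rgb).astype(float)
    pre = np.clip((grey - grey.min()) * 255.0 / (x_thr - grey.min()), 0, 255)
    pre[grey > x_thr] = 255.0
    pre = pre.astype(np.uint8)
    if method == 'otsu':
        return pre <= threshold_otsu(pre)                  # global second stage
    return pre <= threshold_sauvola(pre, window_size=25)   # adaptive second stage
```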

Finally, the obtained results have been compared with the direct use of the above mentioned methods without the proposed preprocessing. To make a reliable comparison of the final binarization results, according to widely accepted methodologies [20], some typical metrics based on the counting of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) pixels, such as Precision, Recall, F-Measure, Specificity and Accuracy, have been calculated, assuming the pixels representing text as “ones” and the background pixels as “zeros”. Additionally, some other metrics, such as PSNR, Distance Reciprocal Distortion (DRD) [15] and the Misclassification Penalty Metric (MPM) [36], have also been computed.
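A minimal sketch of the counting-based metrics is given below, with text pixels treated as “ones”; DRD and MPM are omitted for brevity, no zero-division guards are included, and the PSNR uses a maximum value of 1 for binary images, which is an assumption of this sketch.

```python
import numpy as np

def binarization_metrics(pred, gt):
    """Precision, Recall, F-Measure, Specificity, Accuracy and PSNR for binary
    images given as boolean arrays (True = text)."""
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / pred.size
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    psnr = 10 * np.log10(1.0 / mse) if mse > 0 else float('inf')
    return dict(precision=precision, recall=recall, f_measure=f_measure,
                specificity=specificity, accuracy=accuracy, psnr=psnr)
```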

Although the values of the estimated parameters may differ slightly for each independent execution of the Monte Carlo method, especially for a low number of drawn samples (n), the overall influence of the number of randomly drawn pixels on the final binarization accuracy is unnoticeable, even when using 2.5% of the pixels with 3-fold drawing and the choice of the median threshold in the first stage. Hence, only the results obtained for full images, considered easier for potential recalculation, are presented in this paper, although the same results have been achieved applying the Monte Carlo method for almost all images. To avoid the presentation of all metrics based on the number of TP, TN, FP and FN pixels, we have focused on accuracy and PSNR, as well as some alternative metrics, namely DRD and MPM.

Fig. 4. Comparison of the average accuracy, PSNR, DRD and MPM values for 208 images using popular binarization methods with and without the proposed preprocessing.

The results of experiments conducted for all DIBCO datasets and Bickley Diary database are illustrated in Fig. 4. Better results are represented by higher accuracy and PSNR, but lower DRD and MPM values.

As may be observed, the influence of the preprocessing for the Sauvola and Wolf methods is marginal; however, its application to some other methods, including the simplest fixed thresholding and the classical global Otsu method, leads to a significant improvement of all average metrics presented in Fig. 4. Analysing the accuracy, PSNR and DRD values, the use of the proposed preprocessing for the Otsu method, as well as for the adaptive Bradley and Bernsen algorithms, leads to better results than those achieved by the Sauvola and Wolf algorithms, even though the former methods applied directly have led to much worse binarization results. In almost all cases the application of the proposed preprocessing based on the weighted average of the two GMM2 locations lowered by the respective standard deviations, \((\mu _{GMM01} - \sigma _{01}) \cdot w_{01} + (\mu _{GMM02} - \sigma _{02}) \cdot w_{02}\), leads to better results, also in comparison with the previously proposed method [11] using the location parameter \(\mu _{GGD}\) of the GGD.

Fig. 5. Comparison of exemplary final binarization results: (a) and (b) without preprocessing, (c) and (d) with the use of \(X_{thr} = \mu _{GGD}\) [11], (e) and (f) with the use of the proposed preprocessing with \(X_{thr} = (\mu _{GMM01} - \sigma _{01}) \cdot w_{01} + (\mu _{GMM02} - \sigma _{02}) \cdot w_{02}\), for Bernsen (left images) and Bradley (right images) thresholding.

Nevertheless, considering the results indicated as “best”, some minor exceptions occur, especially for the MPM results, where the results obtained for some other variants of the possible thresholds \(X_{thr}\) are slightly better. For the fixed threshold (0.5) the best accuracy, PSNR and DRD may be obtained applying \(X_{thr} = max(\mu _{GMM01},\mu _{GMM02})\), whereas the use of \(X_{thr} = \mu _{GGD} - \sigma _{GGD}\) leads to the best DRD and MPM values for the Otsu and Bradley methods, as well as the best DRD for Wolf and all metrics for Bernsen thresholding. The use of the weighted average of the two GMM2 locations (without lowering) slightly improves the MPM results for the Wolf method and the DRD for Sauvola, as well as the PSNR and accuracy for both of them. Nonetheless, as can be seen in Fig. 4, the differences between the results for the “best” variants and the most universal method, utilizing the formula \(X_{thr} = (\mu _{GMM01} - \sigma _{01}) \cdot w_{01} + (\mu _{GMM02} - \sigma _{02}) \cdot w_{02}\), are relatively small.

The best overall accuracy, equal to 0.9336, and PSNR = 13.1614 have been achieved by applying the proposed method followed by Bradley thresholding. The same popular adaptive method, implemented e.g. as the adaptthresh function in the MATLAB environment, with the GGD based preprocessing [11] leads to noticeably worse results (ACC = 0.9279 and PSNR = 12.5857), whereas its direct application (without preprocessing) gives an accuracy equal to 0.9187 and PSNR = 12.1072.

A visual comparison of the final binarization results for sample image no. 8 from the challenging DIBCO2018 dataset is shown in Fig. 5, where the advantages of the proposed approach, also over the GGD based preprocessing [11], are clearly visible, especially for the Bernsen method shown in the left part. Nevertheless, improvements can also be noticed for Bradley thresholding.

5 Summary and Future Work

The experimental results presented in the paper confirm the usefulness of the preprocessing of degraded document images based on histogram modelling using the GGD and GMM based methods. The combination of the proposed approach with some well-known thresholding algorithms makes it possible to enhance the binarization results significantly, making them comparable with those of more sophisticated methods. Even though the final results may be outperformed by some other methods, e.g. utilizing deep learning [32], the presented approach is relatively fast, also due to the use of the Monte Carlo method, and does not require a long training process with many images. The proposed approach may be easily combined with methods proposed by other researchers, although, considering the results achieved for the Sauvola and Wolf methods, the achieved improvements may be smaller.

Directions of our future research include an attempt at a further simplification of the histogram modelling step, as well as the combination of the proposed preprocessing with statistical methods [13] and some region based thresholding methods [16], which are usually much faster than typical adaptive methods requiring the analysis of the local neighbourhood of each pixel. Considering potential applications in robotics, related to the real-time analysis of natural images, our efforts will be oriented towards a further acceleration of the image processing operations preceding the final binarization step.