
1 Introduction

In recent years, with the development of the Internet and the popularity of digital cameras and mobile phones, images and videos have become pervasive in daily life, and understanding their content has become a hotspot in multimedia information technology. The text in videos and images carries rich semantic information and is an important clue for understanding their content. Detecting and recognizing text in such images or videos is therefore of great significance in many fields, such as image understanding, video content analysis and content-based image or video retrieval. Text recognition can already meet application requirements to a certain degree, and many companies have released commercial software packages [1, 2]. However, when applied to natural scenes, performance degrades. One key reason is that the accuracy of text detection drops, so the overall performance of the recognition system decreases as well. Usually, a rectangular box is used to label the text region. However, due to complex backgrounds and varying viewpoints, text in images may appear at different positions, orientations and scales, so a rectangular box often cannot mark the text region accurately, and the resulting distorted text image degrades the performance of text recognition algorithms. Therefore, a robust text detection system should detect text at arbitrary positions and orientations accurately.

Current text detection methods can be summarized into three categories [3]: gradient-based methods, connected-region-based methods and texture-based methods. Text areas are usually rich in edges, while background regions have relatively few, so edges can be used to determine the existence of text. Popular edge detectors, such as the Sobel edge detector and the Harris corner detector, have been applied to text detection. However, when the background becomes complex, edge features produce many false alarms [4].

Text detection based on connected regions mainly exploits the local similarity of features such as character stroke width and color [5,6,7,8,9]. Color-based features can detect text in different orientations, but they are sensitive to changes of the environment. Maximally Stable Extremal Region (MSER) algorithms [10,11,12] examine the variation of local features to find regions with similar properties. The MSER detector is invariant to rotation, scale and affine transformation, so it can effectively detect text under varying viewpoints and scales.

As for texture features, Yi and Tian [13] proposed a text detection algorithm based on the gradient, color and geometric characteristics of characters, concatenating characters with similar features into text areas. The texture of a text area is considered different from that of the background. Zhao et al. [14] applied the wavelet transform to extract texture information from images, dividing the image into multiple patches and classifying each patch as text or background according to its texture.

Images can be taken from any position, so text in an image can appear in any pose. However, most text detection methods label the text region with only a rectangular box, which often includes a lot of background. In this paper, we propose a text detection method that detects text and labels the text region more accurately. With a quadrilateral label for the text region, the inertia spindle is used to estimate the affine parameters of the region. The text region can then be rectified by an affine transform, and the rectified text can be recognized more accurately.

2 Text Detection

Given an image, localizing the text region precisely is the first step toward guaranteeing the performance of the whole text recognition system. Characters share similar features, such as color and texture, so we apply a region-based feature to detect text. However, region-based features also introduce false alarms, so geometric features and the stroke width are used to purify the candidate regions. With these initial character detection results, a quadrilateral is introduced to localize the text region accurately. The procedure of text detection is shown in Fig. 1.

Fig. 1. The flow chart of text detection

In order to detect all possible text regions, the region-based MSER feature [10] is applied. The principle of MSER resembles the watershed algorithm: incrementing the threshold is like raising a water level, and as the water rises, shallow valleys are flooded. Viewed from above, the scene divides into land and water, analogous to a binary image. In an image, connected regions whose features have low variance are merged together. The mathematical definition is as follows:

$$ q_{i} = \left| {Q_{i + \Delta } - Q_{i - \Delta } } \right|/\left| {Q_{i} } \right| $$
(1)

where \( Q_{i} \) denotes a certain connected region when the threshold is \( i \), \( \Delta \) is a small change of the gray-scale threshold, and \( q_{i} \) is the rate of change of the region \( Q_{i} \) at threshold \( i \). When \( q_{i} \) is a local minimum, \( Q_{i} \) is a maximally stable extremal region. MSER is robust to affine transformation, so it can effectively detect text under perspective transformation and scale change.
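As a minimal illustration, the MSER stage can be realized with OpenCV's built-in detector; the input file name and parameter values below are placeholders, and the setters follow the OpenCV 4.x Python API.

```python
import cv2

img = cv2.imread("scene.jpg")                  # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()
mser.setDelta(5)          # gray-level step, the Delta of Eq. (1)
mser.setMinArea(60)       # discard tiny regions (illustrative values)
mser.setMaxArea(14400)

regions, bboxes = mser.detectRegions(gray)
# `regions` holds the pixel coordinates of each maximally stable
# extremal region; `bboxes` holds the corresponding (x, y, w, h) boxes.
```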

Although MSER can pick out most of the text, it also includes some non-text parts. Non-text areas can be discarded by examining the properties of the candidate regions. Based on the geometric properties of text regions, the width-to-height ratio, eccentricity and Euler number are applied. Eccentricity is defined as the ratio between the major axis and the minor axis of the image region. The Euler number is the total number of objects in the image minus the total number of holes in those objects.

Besides geometric features, the stroke width is another robust feature: text areas tend to have smaller stroke-width variation than non-text areas. We therefore apply the stroke width transform (SWT) [15] to calculate the stroke width of the candidate regions, and the variance of the stroke width is used to determine the existence of text.
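A sketch of this purification step is given below. The geometric properties come from scikit-image's regionprops; the stroke width is approximated by twice the distance transform inside the region, a lightweight proxy rather than the full SWT of [15]; all thresholds are illustrative.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.measure import label, regionprops

def is_text_candidate(mask):
    """mask: binary image containing a single candidate region."""
    props = regionprops(label(mask.astype(np.uint8)))[0]
    h = props.bbox[2] - props.bbox[0]
    w = props.bbox[3] - props.bbox[1]
    if not (0.1 < w / h < 10.0):           # width-to-height ratio
        return False
    if props.eccentricity > 0.995:         # overly elongated blob
        return False
    if props.euler_number < -4:            # too many holes for a character
        return False
    # Stroke-width proxy: twice the distance to the nearest background pixel.
    widths = 2.0 * distance_transform_edt(mask)[mask > 0]
    return widths.std() / (widths.mean() + 1e-6) < 0.6   # low variation
```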

3 External Bounding Box Estimation

After text detection, each candidate character is labeled by a rectangular box. Usually the text region is represented by the maximal external box of all candidate character regions, as shown in Fig. 2(b). However, this box includes much background and cannot represent the orientation of the text region, which is needed for the subsequent affine transform. In this section, we introduce the proposed method for accurate external bounding box estimation, as shown in Fig. 2(c).

Fig. 2. Bounding box for the text region. (a) bounding box for each character from text detection, (b) maximal external bounding box, (c) adjusted bounding box.

In order to determine the affine parameters, we need an accurate bounding box that represents the contour of the text region. To determine this box, the four corner points of the quadrilateral must be decided according to the distribution of characters. Since the bounding box of each character already gives that character's position, the corners of the character bounding boxes are used to determine the external quadrilateral. Specifically, we find the extreme points among all corners of the character bounding boxes, denoting the minimal and maximal horizontal positions as \( X_{\hbox{min} } ,X_{\hbox{max} } \) and the minimal and maximal vertical positions as \( Y_{\hbox{min} } ,Y_{\hbox{max} } \). According to the distribution of these extreme points, there are three cases for determining the external quadrilateral.
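As a concrete sketch, the extreme points and the number of distinct corners they define, which selects among the three cases below, can be computed as follows (assuming the corners are collected in an N×2 NumPy array):

```python
import numpy as np

def extreme_points(corners):
    """corners: Nx2 array of (x, y) character-box corners."""
    x, y = corners[:, 0], corners[:, 1]
    pts = [
        tuple(corners[np.argmin(x)]),   # attains X_min
        tuple(corners[np.argmax(x)]),   # attains X_max
        tuple(corners[np.argmin(y)]),   # attains Y_min
        tuple(corners[np.argmax(y)]),   # attains Y_max
    ]
    distinct = list(dict.fromkeys(pts)) # merge coincident extreme points
    return distinct  # 4 points -> Sect. 3.1, 3 -> Sect. 3.2, 2 -> Sect. 3.3
```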

3.1 Quadruple Extreme Points

In this case, the four extreme points do not coincide; each extreme point is one corner of the external bounding box, as shown in Fig. 3(a). The external quadrilateral can then be determined directly by these four points, and the parameters of the affine transform can be estimated from it. The details of the affine transform are introduced in Sect. 4.

Fig. 3. Three different cases to determine the external bounding box. (a) the quadruple extreme points case; (b), (c) the triple extreme points case and its external bounding box; (d), (e) two possible configurations of twin extreme points; (f) the external bounding box for the twin extreme points case.

3.2 Triple Extreme Points

In this case, two extreme values occur at one point and the other two at two different points, so the four extreme values determine only three corners of the external bounding box, as shown in Fig. 3(b). Three points determine a triangle; the fourth point must then be determined from the distribution of characters (see the sketch after this paragraph). First, an enclosing rectangle of the triangle is found by drawing axis-parallel lines through the triangle's vertices. The interior of this rectangle is divided into four sub-regions, three of which lie outside the triangle. The fourth point should lie in the sub-region where most characters reside, where character density is measured by the density of corner points of the character bounding boxes. The character densities of the three sub-regions are computed and compared, and the corner of the sub-region with the maximum density is selected as the fourth point of the external quadrilateral. Figure 3(c) shows a solution for the triple extreme points case.
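One plausible reading of this step in code, with the quadrant split taken at the center of the enclosing rectangle (an assumption, since the split lines are not fixed above), is:

```python
import numpy as np

def fourth_corner(tri_pts, corners):
    """tri_pts: the three known corners; corners: all character-box corners."""
    tri = np.asarray(tri_pts, dtype=float)
    pts = np.asarray(corners, dtype=float)
    (xmin, ymin), (xmax, ymax) = tri.min(axis=0), tri.max(axis=0)
    cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    rect = [(xmin, ymin), (xmax, ymin), (xmin, ymax), (xmax, ymax)]
    # Candidate corners: rectangle corners not already taken by the triangle.
    cand = [p for p in rect
            if np.hypot(tri[:, 0] - p[0], tri[:, 1] - p[1]).min() > 1e-6]
    def density(p):  # character-corner count in p's quadrant of the rectangle
        in_x = pts[:, 0] > cx if p[0] > cx else pts[:, 0] <= cx
        in_y = pts[:, 1] > cy if p[1] > cy else pts[:, 1] <= cy
        return int(np.sum(in_x & in_y))
    return max(cand, key=density)
```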

3.3 Twin Extreme Points

In this case, the four extreme values determine only two corner points because of coincidences among the extreme points, as shown in Fig. 3(d), (e). The other two points of the external quadrilateral must then be determined from the distribution of characters. The inertia spindle of the detected characters represents the main orientation of the text region, so we propose to use its direction to determine the remaining two corners. Through each of the two extreme points, a line parallel and a line perpendicular to the inertia spindle are drawn; the intersections of these two sets of lines determine the other two corners of the external bounding box, as shown in Fig. 3(f).

Similarly, the inertia spindle of the text region can be computed from the corner points of the character bounding boxes. Let \( C \) be the set of corners \( (x_{i} ,y_{i} ) \), \( i = 1,2, \ldots ,N \), where \( N \) is the total number of corners of the detected character bounding boxes, and let \( (\bar{x},\bar{y}) \) be the centroid of all corners. The moment of inertia of the corner set \( C \) is then defined as

$$ G_{c} = \sum\nolimits_{i = 1}^{N} {[(x_{i} - \bar{x})^{2} + (y_{i} - \bar{y})^{2} ]} $$
(2)

The moment of inertia of the corner set \( C \) about a straight line \( L \) through the centroid with angle \( \theta \) is defined as

$$ G_{\theta } = \sum\nolimits_{i = 1}^{N} {[(x_{i} - \bar{x})\sin \theta - (y_{i} - \bar{y})\cos \theta ]^{2} } $$
(3)

The inertia spindle is defined by the angle \( \hat{\theta } \) that minimizes the moment of inertia \( G_{\theta } \):

$$ \hat{\theta } = \arg \min_{\theta } G_{\theta } $$
(4)

Differentiating the inertia moment \( G_{\theta } \) with respect to \( \theta \) gives

$$ \begin{aligned} G_{\theta }^{'} & = 2\sum\nolimits_{i = 1}^{N} {[(x_{i} - \bar{x})\sin \theta - (y_{i} - \bar{y})\cos \theta ]} \cdot [(x_{i} - \bar{x})\cos \theta + (y_{i} - \bar{y})\sin \theta ] \\ & = \sum\nolimits_{i = 1}^{N} {[(x_{i} - \bar{x})^{2} - (y_{i} - \bar{y})^{2} ]\sin 2\theta } - 2\sum\nolimits_{i = 1}^{N} {(x_{i} - \bar{x})(y_{i} - \bar{y})\cos 2\theta } \end{aligned} $$
(5)

Setting \( G_{\theta }^{'} = 0 \) then yields

$$ \tan 2\hat{\theta } = \frac{{2\sum\nolimits_{i = 1}^{N} {(x_{i} - \bar{x})(y_{i} - \bar{y})} }}{{\sum\nolimits_{i = 1}^{N} {[(x_{i} - \bar{x})^{2} - (y_{i} - \bar{y})^{2} ]} }} $$
(6)

Let \( m_{20} = \sum\nolimits_{i = 1}^{N} {(x_{i} - \bar{x})^{2} } \), \( m_{02} = \sum\nolimits_{i = 1}^{N} {(y_{i} - \bar{y})^{2} } \) and \( m_{11} = \sum\nolimits_{i = 1}^{N} {(x_{i} - \bar{x})(y_{i} - \bar{y})} \); then Eq. (6) can be written as

$$ \tan 2\hat{\theta } = \frac{{2m_{11} }}{{m_{20} - m_{02} }} $$
(7)

Because

$$ \tan 2\theta = \frac{2\tan \theta }{{1 - \tan^{2} \theta }} $$
(8)

Substituting Eq. (8) into Eq. (7), we get

$$ m_{11} \tan^{2} \hat{\theta } + (m_{20} - m_{02} )\tan \hat{\theta } - m_{11} = 0 $$
(9)
$$ \tan \hat{\theta }_{1,2} = \frac{{ - (m_{20} - m_{02} ) \pm \sqrt {(m_{20} - m_{02} )^{2} + 4m_{11}^{2} } }}{{2m_{11} }} $$
(10)

Of the two orientations \( \hat{\theta }_{1,2} \), the one that makes the second derivative of \( G_{\theta } \) greater than zero is selected as the direction of the inertia spindle. The inertia spindle is then the line passing through the centroid with direction \( \hat{\theta } \):

$$ y - \bar{y} = (x - \bar{x})\tan \hat{\theta } $$
(11)
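A compact sketch of the spindle estimation of Eqs. (2)-(10) follows; `corners` is assumed to be an N×2 NumPy array of character-box corners. Rather than solving the quadratic of Eq. (10) and testing both roots explicitly, it uses the equivalent closed form \( \hat{\theta } = \tfrac{1}{2}\arctan\!2(2m_{11} ,m_{20} - m_{02} ) \), whose quadrant handling selects the minimizing root automatically.

```python
import numpy as np

def spindle_angle(corners):
    """Angle of the inertia spindle of an Nx2 array of corner points."""
    x, y = corners[:, 0], corners[:, 1]
    dx, dy = x - x.mean(), y - y.mean()
    m20, m02, m11 = (dx**2).sum(), (dy**2).sum(), (dx*dy).sum()
    # Half-angle form of Eq. (6); atan2 picks the root minimizing G_theta.
    return 0.5 * np.arctan2(2.0 * m11, m20 - m02)
```

For a roughly horizontal text line, \( m_{20} > m_{02} \) and \( m_{11} \approx 0 \), so the estimated angle is close to zero, as expected.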

Let \( \rho = \tan \hat{\theta } \); the other two corners of the external bounding box are then the intersections of the lines parallel and perpendicular to this direction through the extreme points. Eq. (12) gives the four lines:

$$ \begin{aligned} y_{1} = \rho (x - x_{\hbox{max} } ) + y_{\hbox{min} } ;\,\,\,y_{2} = - \frac{1}{\rho }(x - x_{\hbox{max} } ) + y_{\hbox{min} } \hfill \\ y_{3} = \rho (x - x_{\hbox{min} } ) + y_{\hbox{max} } ;\,\,\,y_{4} = - \frac{1}{\rho }(x - x_{\hbox{min} } ) + y_{\hbox{max} } \hfill \\ \end{aligned} $$
(12)

Solving Eq. (12), the coordinates of the other two corners of the external bounding box are obtained as

$$ (\frac{{\rho^{2} x_{\hbox{min} } + \rho (y_{\hbox{min} } - y_{\hbox{max} } ) + x_{\hbox{max} } }}{{\rho^{2} + 1}},\;\frac{{\rho^{2} y_{\hbox{min} } + \rho (x_{\hbox{max} } - x_{\hbox{min} } ) + y_{\hbox{max} } }}{{\rho^{2} + 1}}) $$
(13)
$$ (\frac{{\rho^{2} x_{\hbox{max} } + \rho (y_{\hbox{max} } - y_{\hbox{min} } ) + x_{\hbox{min} } }}{{\rho^{2} + 1}},\;\frac{{\rho^{2} y_{\hbox{max} } + \rho (x_{\hbox{min} } - x_{\hbox{max} } ) + y_{\hbox{min} } }}{{\rho^{2} + 1}}) $$
(14)
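The closed forms of Eqs. (13) and (14) translate directly into code; the function below is a direct transcription, given \( \rho = \tan \hat{\theta } \) and the two known extreme corners.

```python
def twin_case_corners(rho, x_min, x_max, y_min, y_max):
    """Remaining two corners for the twin extreme points case, Eqs. (13)-(14)."""
    d = rho**2 + 1.0
    p1 = ((rho**2 * x_min + rho * (y_min - y_max) + x_max) / d,
          (rho**2 * y_min + rho * (x_max - x_min) + y_max) / d)
    p2 = ((rho**2 * x_max + rho * (y_max - y_min) + x_min) / d,
          (rho**2 * y_max + rho * (x_min - x_max) + y_min) / d)
    return p1, p2
```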

4 Affine Transformation

After the external bounding box is obtained, the parameters of the affine transform are estimated first; the affine transform is then applied to rectify the distorted text region.

4.1 Affine Parameter Estimation

According to the imaging principle, the original text is projected onto the image plane. Consider two lines \( l_{1} ,l_{2} \) that are images of parallel lines on the text plane, where \( l_{i} :\,a_{i} x + b_{i} y + c_{i} = 0\;\left( {i = 1,2} \right) \). If \( l_{1} ,l_{2} \) remain parallel in the image plane, the corresponding direction \( (r_{1} ,s_{1} ,t_{1} ) \) of the text plane can be represented as

$$ \left( \begin{aligned} r_{1} \hfill \\ s_{1} \hfill \\ t_{1} \hfill \\ \end{aligned} \right) = \frac{1}{{\sqrt {a_{1}^{2} + b_{1}^{2} } }}\left( \begin{aligned} b_{1} \hfill \\ - a_{1} \hfill \\ 0 \hfill \\ \end{aligned} \right) $$
(15)

If \( l_{1} ,l_{2} \) are not parallel in the image plane and intersect at the vanishing point \( (x_{1} ,y_{1} ) \), then the direction \( (r_{1} ,s_{1} ,t_{1} ) \) of the original text plane can be represented as

$$ \left( \begin{aligned} r_{1} \hfill \\ s_{1} \hfill \\ t_{1} \hfill \\ \end{aligned} \right) = \frac{1}{{\sqrt {x_{1}^{2} + y_{1}^{2} + f^{2} } }}\left( \begin{aligned} x_{1} \hfill \\ y_{1} \hfill \\ f \hfill \\ \end{aligned} \right) $$
(16)

where \( f \) is the focal length of the camera. With two sets of lines in the image plane, for example the two pairs of opposite sides of the external bounding box, the normal of the original text plane can be determined as

$$ \left( \begin{aligned} r_{3} \hfill \\ s_{3} \hfill \\ t_{3} \hfill \\ \end{aligned} \right) = \left( \begin{aligned} r_{1} \hfill \\ s_{1} \hfill \\ t_{1} \hfill \\ \end{aligned} \right) \times \left( \begin{aligned} r^{\prime}_{2} \hfill \\ s^{\prime}_{2} \hfill \\ t^{\prime}_{2} \hfill \\ \end{aligned} \right) $$
(17)

where \( (r_{2}^{'} ,s_{2}^{'} ,t_{2}^{'} ) \) is the direction determined by the other pair of lines in the image plane.
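A sketch of this plane estimation is given below, assuming each border line is given in the form \( ax + by + c = 0 \) and the focal length \( f \) is known; the function names are illustrative.

```python
import numpy as np

def plane_direction(l1, l2, f, eps=1e-6):
    """One text-plane direction from a pair of image lines (a, b, c)."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if abs(det) < eps:                       # parallel pair, Eq. (15)
        v = np.array([b1, -a1, 0.0])
    else:                                    # vanishing point, Eq. (16)
        x1 = (b1 * c2 - b2 * c1) / det
        y1 = (a2 * c1 - a1 * c2) / det
        v = np.array([x1, y1, f])
    return v / np.linalg.norm(v)

def plane_normal(pair1, pair2, f):
    """Text-plane normal from two pairs of border lines, Eq. (17)."""
    v1 = plane_direction(*pair1, f)
    v2 = plane_direction(*pair2, f)
    n = np.cross(v1, v2)
    return n / np.linalg.norm(n)
```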

4.2 Affine Correction

With the estimate of the original text plane, we can rectify the distorted text in the image plane. Figure 4 shows the relationship between the text plane and the image plane. The mapping from a point \( (X_{t}^{0} ,Y_{t}^{0} ,Z_{t}^{0} ) \) on the text plane \( O_{t} X_{t} Y_{t} Z_{t} \) to a point \( (x_{t}^{0} ,y_{t}^{0} ) \) on the image plane can be written as follows:

Fig. 4. The projection of text to image

$$ \left( \begin{aligned} x_{t}^{0} \hfill \\ y_{t}^{0} \hfill \\ \end{aligned} \right) = \left( {\begin{array}{*{20}c} f & 0 \\ 0 & f \\ \end{array} } \right)\left( \begin{aligned} \left( {\frac{{X_{it}^{0} + X_{0} }}{{Z_{it}^{0} + Z_{0} }}} \right) \hfill \\ \left( {\frac{{Y_{it}^{0} + Y_{0} }}{{Z_{it}^{0} + Z_{0} }}} \right) \hfill \\ \end{aligned} \right) $$
(18)

where

$$ \left( \begin{aligned} X_{it}^{0} \hfill \\ Y_{it}^{0} \hfill \\ Z_{it}^{0} \hfill \\ \end{aligned} \right) = \left( {\begin{array}{*{20}c} {r_{1} } & {r_{2} } & {r_{3} } \\ {s_{1} } & {s_{2} } & {s_{3} } \\ {t_{1} } & {t_{2} } & {t_{3} } \\ \end{array} } \right)\left( \begin{aligned} X_{t}^{0} \hfill \\ Y_{t}^{0} \hfill \\ Z_{t}^{0} \hfill \\ \end{aligned} \right) $$
(19)

In order to reconstruct the front view of the text plane, we apply the affine transformation to correct the image:

$$ \left( \begin{aligned} x_{it}^{0} \hfill \\ y_{it}^{0} \hfill \\ \end{aligned} \right) = \left( {\begin{array}{*{20}c} f & 0 \\ 0 & f \\ \end{array} } \right)\left( \begin{aligned} \left( {\frac{{X_{t}^{0} + X_{0} }}{{Z_{0} }}} \right) \hfill \\ \left( {\frac{{Y_{t}^{0} + Y_{0} }}{{Z_{0} }}} \right) \hfill \\ \end{aligned} \right) $$
(20)

On the text plane, \( Z_{t}^{0} = 0 \). Since the origin \( (X_{0} ,Y_{0} ,Z_{0} ) \) of \( O_{t} X_{t} Y_{t} Z_{t} \) and \( O_{it} X_{it} Y_{it} Z_{it} \) is mapped to the same point \( \left( {x_{0} ,y_{0} } \right) \) in the image plane, we get

$$ x_{0} = f\frac{{X_{0} }}{{Z_{0} }},y_{0} = f\frac{{Y_{0} }}{{Z_{0} }} $$
(21)

The correction can then be done by Eq. (22):

$$ x_{it}^{0} = x_{0} + f\frac{{\left| {\begin{array}{*{20}c} {x_{0} - x_{t}^{0} } & {r_{2} f - t_{2} x_{t}^{0} } \\ {y_{0} - y_{t}^{0} } & {s_{2} f - t_{2} y_{t}^{0} } \\ \end{array} } \right|}}{{\left| {\begin{array}{*{20}c} {r_{1} f - t_{1} x_{t}^{0} } & {r_{2} f - t_{2} x_{t}^{0} } \\ {s_{1} f - t_{1} y_{t}^{0} } & {s_{2} f - t_{2} y_{t}^{0} } \\ \end{array} } \right|}},\,\,y_{it}^{0} = y_{0} + f\frac{{\left| {\begin{array}{*{20}c} {r_{1} f - t_{1} x_{t}^{0} } & {x_{0} - x_{t}^{0} } \\ {s_{1} f - t_{1} y_{t}^{0} } & {y_{0} - y_{t}^{0} } \\ \end{array} } \right|}}{{\left| {\begin{array}{*{20}c} {r_{1} f - t_{1} x_{t}^{0} } & {r_{2} f - t_{2} x_{t}^{0} } \\ {s_{1} f - t_{1} y_{t}^{0} } & {s_{2} f - t_{2} y_{t}^{0} } \\ \end{array} } \right|}} $$
(22)

That is, each point on the image plane is mapped to a new position according to the affine transform. To make the rectified image smooth, bilinear interpolation is applied to reconstruct the rectified text images.
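Eq. (22) defines the exact per-pixel mapping. As a compact practical stand-in, not the paper's formulation, one can re-project the estimated external quadrilateral to a frontal rectangle with a perspective warp; OpenCV's INTER_LINEAR flag supplies the bilinear resampling mentioned above, and the output size is a free choice.

```python
import cv2
import numpy as np

def rectify_quad(img, quad, out_w, out_h):
    """quad: 4x2 array of corners ordered TL, TR, BR, BL."""
    src = np.asarray(quad, dtype=np.float32)
    dst = np.float32([[0, 0], [out_w - 1, 0],
                      [out_w - 1, out_h - 1], [0, out_h - 1]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (out_w, out_h),
                               flags=cv2.INTER_LINEAR)
```

When the quadrilateral truly is the image of a planar rectangle, this warp recovers the same frontal view up to the choice of output size and aspect ratio.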

5 Experimental Results

We used the MSRA-TD500 database [16] to test the performance of the algorithm. This dataset contains 500 indoor and outdoor natural images. The indoor images are mainly signs, doorplates and caution plates, while the outdoor images are mostly guide boards and billboards against complex backgrounds. The image resolutions vary from 1296 × 864 to 1920 × 1280. Some rectified results for the different cases are shown in Fig. 5. The proposed method provides accurate detection of the text region under different conditions. With the external quadrilateral of the text region, the parameters of the affine transform can be estimated precisely, and consequently distorted images can be rectified. In Fig. 5(d), the corner points lie mainly in the middle of the text region, so the detected text region is shifted. However, the inertia spindle still provides the main direction of the text, and the region can therefore be rectified, as shown in Fig. 5(f).

Fig. 5. Rectified results. (a) bounding box for the quadruple extreme points case; (b), (c) bounding boxes for the triple extreme points case; (d) bounding box for the twin extreme points case; (e), (f), (g) rectified results for (a), (d), (b), respectively.

To quantify the effectiveness of the detection algorithm, the commonly used precision (P) and recall (R) (Eq. 23) and F-score (Eq. 24) are applied:

$$ P = \frac{tp}{tp + fp} \times 100\% ,\;R = \frac{tp}{tp + fn} \times 100\% $$
(23)
$$ F = 2\frac{P \times R}{P + R} $$
(24)

where \( tp \) is the area of text regions correctly detected in the image, \( fp \) is the non-text area erroneously detected (the false detection area), and \( fn \) is the text area that is not detected in the scene image. The F-score is the harmonic mean of precision and recall. The proposed method (TD-Affine) was tested on the MSRA-TD500 database and achieved a precision of 0.58 and a recall of 0.62. Compared with other methods, as shown in Table 1, the proposed method keeps the high recall achieved by the MSER detector, while geometric and stroke-width features remove false alarms and the final text region is determined from the distribution of characters. Since we aim to preserve the main direction of the text region rather than its exact extent, the precision can still be improved; keeping the main direction is what matters for the estimation of the affine transformation. Figure 6 shows more affine-rectified images. In the first two rows, all the text lies in the same direction, and the proposed method re-projects it to the frontal view. When text runs in different directions, as in the third row, the global inertia spindle is balanced among the local regions, so the proposed method cannot adjust each local region. Similarly, a complex background, as in the fourth row, disturbs the estimation of the inertia spindle of the text region, and the wrongly estimated affine parameters result in wrong rectification.
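For reference, the area-based metrics of Eqs. (23) and (24) can be computed from boolean masks of the detected and ground-truth text regions, as in the sketch below (the mask names are hypothetical).

```python
import numpy as np

def prf(detected, ground_truth):
    """detected, ground_truth: boolean masks of the same shape."""
    tp = np.logical_and(detected, ground_truth).sum()   # correctly detected area
    fp = np.logical_and(detected, ~ground_truth).sum()  # falsely detected area
    fn = np.logical_and(~detected, ground_truth).sum()  # missed text area
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```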

Table 1. Comparison of text detection results on MSRA-TD500
Fig. 6. More results of affine rectification. The left column shows the character detection results; the right column shows the corresponding rectified results.

To show the effectiveness of the affine correction, we also performed text recognition experiments on the dataset, using ABBYY FineReader [17] to recognize the text. Table 2 shows the performance of text recognition with and without affine correction. The results show that the proposed affine parameter estimation and correction method improves the text recognition rate. Since the proposed method mainly considers the main direction of the text region, the parameters of the affine transform can be estimated accurately even when the detected region itself is not exact, and consequently the distorted images can be rectified precisely.

Table 2. Text recognition results on MSRA-TD500

6 Conclusion

Text often provides important context information for images. To realize text detection and recognition in natural environments, this paper proposed an affine-transform-based text detection algorithm. Since text detection is an essential prerequisite for text recognition, providing high-quality images for recognition is necessary. We therefore not only detect the location of text but also estimate the distortion of the text region. We utilize the MSER and SWT algorithms to detect the text area, and then combine the inertia spindle with the known extreme points to find the four vertices of the external quadrilateral of the text area. The affine parameters obtained from these four vertices are used for affine rectification of the distorted text. As a result, the text recognition rate is improved significantly with the rectified text as input.