
1 Introduction

According to [1], automatic document processing covers three main categories: doctype classification, data capture/Functional Role Labeling, and document sets. Doctype classification assigns a document image to a prestored template. Data capture is the extraction of relevant, human-understandable information from the document image. The document sets category relates documents to their contents according to business logic. In this paper, we focus on automatic data capture from invoices regardless of their high geometric variation.

Figure 1 shows examples of entities in contiguous (Fig. 1(a)) and noncontiguous (Fig. 1(b)) structure. It illustrates how the proximity, direction and graphical elements connecting Reference Words (RWs), e.g., “FACTURE No”, “Date”, “Net à payer”, with Key Fields (KFs), e.g., “006651”, “22/08/2015”, “228 276.300”, vary from one invoice to another.

In this context, several early works, such as [2], learn a local structure layout from a training document and reuse it to extract the fields of a test document. The weakness of such approaches is that they require human intervention for labeling semantic fields. The authors in [3, 4] propose to correct mislabeling by adding the missing labels. However, they require highly regular structures as well as blocks and segments obtained automatically by OCR (Optical Character Recognition). In addition, their mislabeling correction is based on matching a structure graph with a model graph.

Fig. 1. Sample of entities showing the diversity of layout styles used in invoices. (a) Entity in contiguous structure. (b) Entity in noncontiguous structure.

Fig. 2. (a) Entity in noncontiguous structure with superfluous tokens. (b) Entity in contiguous structure with superfluous tokens (Color figure online).

Experimental studies have shown that the mismatch between the unstructured data produced by an OCR engine such as Tesseract (OCR without layout analysis) and its physical representation generates another type of mislabeling, called superfluous tokens. This problem is caused by the mishandling of spaces by the OCR. Figure 2 shows a sample of entities with superfluous-token mislabeling. At the bottom of Fig. 2(a), there is the result of the labeling applied to the OCR text to extract the “Balance Due”. At the top, there is the physical representation of this labeling. The final impact of this mishandling is a wrong extraction of the entity, in which “0,000” is a noisy token.

In an earlier work [5], we proposed a method for handling entities in contiguous structure. However, it is not able to extract entities in noncontiguous structure. Moreover, it does not exploit the physical and logical structure of the entity in the invoice, which is the purpose of this work. The ultimate goal of our method is to extract only the relevant tokens of an entity (framed in red) and to increase the accuracy of the extraction process.

Our contributions are: (i) a robust system for entity extraction based on a contextual search of the local structure of each entity. There is thus no need to classify contiguous and noncontiguous structures: the system starts its contextual search assuming a contiguous structure and, if no result is found, automatically moves to the extraction of the entity in noncontiguous structure. (ii) The adoption of a correction step to eliminate the superfluous tokens caused by the labeling step.

In the remainder of this paper, we first describe our solution in detail. Next, we discuss the obtained experimental results. Finally, the paper is concluded.

Fig. 3. Global schema of the proposed method.

2 Proposed Method

The overview of our proposed method is given in Fig. 3. First, the invoice image is processed by an OCR engine. Second, entities are labeled in the OCRed invoice image. Once labeled, the local structure of each entity is detected. For each entity structure, a token model is generated in order to eliminate the superfluous tokens caused by the labeling step. In this model, the KFs represent the tokens and the distances represent the relationships between them. Finally, an incremental algorithm concatenates every two consecutive tokens whose distance respects a certain threshold.

2.1 Labeling

Entities are labeled in the invoice using Patterns of Regular Expressions (Regex). Each invoice I is defined as:

$$\begin{aligned} I=\{L_i\} \end{aligned}$$
(1)

Where \(\{L_i\}\) is a set of labels. Each label is represented by:

$$\begin{aligned} L_{i}=\{R_i,F_i\} \end{aligned}$$
(2)

Where \(R_i\) is the Reference Words (RWs) label and \(F_i\) is the Key Fields (KFs) label.
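As a minimal illustration of this labeling step, the sketch below applies hypothetical Regex patterns to an OCRed invoice; the entity names and patterns are assumptions, not the ones used in our system.

```python
import re

# Hypothetical patterns: "R" captures the RWs label R_i, "F" the KFs label F_i.
PATTERNS = {
    "InvoiceNumber": {
        "R": re.compile(r"FACTURE\s*N[o0]", re.IGNORECASE),
        "F": re.compile(r"\b\d{4,8}\b"),
    },
    "InvoiceDate": {
        "R": re.compile(r"\bDate\b", re.IGNORECASE),
        "F": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    },
}

def label_invoice(ocr_text):
    """Return the labels {L_i} = {(R_i, F_i)} found in the OCRed invoice (Eqs. 1-2)."""
    labels = []
    for entity, pats in PATTERNS.items():
        r_match, f_match = pats["R"].search(ocr_text), pats["F"].search(ocr_text)
        if r_match and f_match:
            labels.append((entity, r_match.group(), f_match.group()))
    return labels

print(label_invoice("FACTURE No 006651   Date : 22/08/2015"))
```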

2.2 Entity Extraction in Contiguous Structure

Tokenization. The tokenization represents a label as a set of tokens, SetT, separated by whitespace characters:

$$\begin{aligned} L_{i}=SetT \end{aligned}$$
(3)

The tokens of \(R_i\) are defined as:

$$\begin{aligned} {SetR=\{T_i^R|SetR\in SetT\}} \end{aligned}$$
(4)

The tokens of \(F_i\) are defined as:

$$\begin{aligned} {SetF=\{T_i^F|SetF\in SetT\}} \end{aligned}$$
(5)
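As a simple illustration, assuming that only whitespace separates the tokens:

```python
def tokenize(label_text):
    """Split a label L_i into its set of tokens SetT on whitespace (Eq. 3)."""
    return label_text.split()

# tokenize("FACTURE No 006651") -> ["FACTURE", "No", "006651"]
# SetR = {"FACTURE", "No"} (tokens of R_i), SetF = {"006651"} (tokens of F_i)
```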

Tokens Filtering. In this step, Algorithm 1 is applied iteratively to delete SetR. The algorithm stops when SetR is empty. At this stage, SetT contains only the KFs tokens SetF. Each token is stored with its bounding box, which is defined by:

$$\begin{aligned} {T_i^F\rightarrow [x_i^F,y_i^F,w_i^F,h_i^F]} \end{aligned}$$
(6)

Where \(x_i^F\) and \(y_i^F\) represent the coordinates of the upper-left corner, \(w_i^F\) the width and \(h_i^F\) the height of the rectangle.

Algorithm 1
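Algorithm 1 is given as a figure in the original paper and is not reproduced here; the sketch below is only one possible reading of it, under the assumption that each token is a dict carrying its text and bounding box as in Eq. (6).

```python
def filter_tokens(set_t, set_r):
    """Iteratively delete the RWs tokens SetR from SetT until SetR is empty,
    so that SetT keeps only the KFs tokens SetF (assumed reading of Algorithm 1).
    Each token is a dict {"text": str, "box": (x, y, w, h)} following Eq. (6)."""
    remaining = list(set_t)
    pending = list(set_r)
    while pending:
        token = pending.pop()
        remaining = [t for t in remaining if t["text"] != token["text"]]
    return remaining  # = SetF, each token kept with its bounding box
```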

Relevant Tokens Clustering. In this step, we propose a correction module for eliminating the superfluous tokens. It models the arrangement of the relevant tokens of the local entity. The geometric relations of a structure are modeled by distance measurements. Clustering the relevant tokens and eliminating the noisy content require measuring the distance between consecutive tokens \(T_i^F\) and \(T_j^F\) (\(j=i+1\)). Each distance is calculated as:

$$\begin{aligned} {d_{ij}=x_j^F-(x_i^F+w_i^F)} \end{aligned}$$
(7)

The incremental algorithm, detailed in Algorithm 2, is applied to concatenate the relevant tokens. SetF contains at least one token. If it contains exactly one, that token is the relevant one. If \(|SetF| > 1\), then we need to cluster the relevant tokens. To achieve this goal, a threshold S is defined as the maximum distance allowed between two consecutive tokens. This threshold is set empirically. Each measured distance is compared with S. If \(d_{ij}\le S\), then the tokens are concatenated. Otherwise, the algorithm stops and the remaining tokens, SetN, are ignored.

Algorithm 2
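A minimal sketch of this incremental concatenation is given below, assuming the tokens of SetF are ordered from left to right and using the threshold value of around 32 reported in Sect. 3.3; Algorithm 2 itself appears only as a figure in the original paper.

```python
def cluster_relevant_tokens(set_f, S=32):
    """Concatenate consecutive KFs tokens while their gap stays below S."""
    if not set_f:
        return ""
    relevant = [set_f[0]]
    for prev, curr in zip(set_f, set_f[1:]):
        x_i, _, w_i, _ = prev["box"]
        x_j = curr["box"][0]
        d_ij = x_j - (x_i + w_i)   # Eq. (7)
        if d_ij <= S:
            relevant.append(curr)
        else:
            break                  # the remaining tokens SetN are ignored
    return " ".join(t["text"] for t in relevant)
```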

2.3 Entity Extraction in Noncontiguous Structure

An entity in noncontiguous structure means that its RWs and KFs appear in the invoice in a vertical arrangement. Since drawing relationships between the RWs and all the KFs is time-consuming and of no avail, we propose to filter the labels. This requires detecting the KFs located in a given region.

For relevant entity extraction, we build a graph of structural relationships. This graph is called the Noncontiguous Graph.

Noncontiguous Graph Building. For noncontiguous entity structure extraction, as detailed in Algorithm 3, a graph \(G=(N,M,E)\) is built in which N is the node of the label \(R_i\), M is a finite set of nodes representing the labels \(F_j\) whose centers lie below the center of N, and \(E\subseteq N\times M\) is a finite set of arcs representing the geometric relationships between the node N and the nodes of M. Each arc \(e_{ij}\in E\) relating the node N and \(m_j\) is denoted \(Nm_j\). We define a feature vector that describes the geometric relationships between N and \(m_j\):

$$\begin{aligned} {a_{ij}=(CN_i,Cm_j,e_{ij})} \end{aligned}$$
(8)

Where \(CN_i\) is the center of the node N (step 4 in Algorithm 3), \(Cm_j\) is the center of each node \(m_j\) (step 6 in Algorithm 3), and \(e_{ij}\) is the distance separating the bounding boxes of the labels corresponding to N and \(m_j\) (step 10 in Algorithm 3), as shown in Fig. 4. The idea is to detect the nearest \(m_j\) to N (step 13 in Algorithm 3). We consider only the nodes whose centers lie below the center of N. The distances are calculated as:

$$\begin{aligned} e_{ij}= {\left\{ \begin{array}{ll} 1,&{} \text {if } Cm_j(2)>CN_i(2) \\ 0,&{} \text {else} \end{array}\right. } \end{aligned}$$
(9)

Where \(Cm_j(2)\) is the second coordinate (ordinate) of the center \(Cm_j\) and \(CN_i(2)\) is the second coordinate of the center \(CN_i\). \(e_{ij}\) is computed to filter the KFs labels, i.e., we consider only the centers located below the center of N.
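The following sketch illustrates this filtering, under the assumption that the distance to the nearest \(m_j\) is measured between bounding-box centers (the paper only states that the nearest node is selected).

```python
def center(box):
    """Center of a bounding box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def nearest_kf_below(rw_box, kf_boxes):
    """Keep only the KFs labels whose center lies below the center of the RWs
    node N (Eq. 9), then return the nearest one (assumed Euclidean distance)."""
    cn = center(rw_box)
    below = [b for b in kf_boxes if center(b)[1] > cn[1]]
    if not below:
        return None
    return min(below, key=lambda b: ((center(b)[0] - cn[0]) ** 2 +
                                     (center(b)[1] - cn[1]) ** 2) ** 0.5)
```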

Fig. 4. Noncontiguous graph.

Fig. 5. Subgraph of tokens.

The centers are calculated to determine the nearest \(m_j\) to N. In Fig. 4, \(m_4\) is the nearest KFs node to N, i.e., \(m_4\) is the relevant KFs label. However, the latter may contain noisy tokens that must be eliminated. So, we need to tokenize the relevant KFs label (step 14 in Algorithm 3) in order to cluster the relevant tokens and ignore the noisy ones.

The difficulty of detecting the relevant tokens of a field in a vertical structure resides in this step. In a horizontal structure, the starting token from which the clustering begins is known, and noisy tokens are found only on the right side. In a vertical structure, by contrast, it is first necessary to determine the starting token; then a sweeping must be performed to eliminate noisy tokens on the left and then on the right. To achieve this goal, a subgraph of relationships is built between the node N and the tokens \(K=\{k_j\}\) of the nearest node \(m_4\). The nearest token is the frame used for the sweeping. To determine it, we calculate the distance \(p_{ij}\) between the node N and each token \(k_j\) (step 16 in Algorithm 3). The nearest token has the minimum distance to N (step 18 in Algorithm 3). We call this token “ind”, as it represents the index from which the sweeping begins. In Fig. 5, the “ind” is \(k_2\).
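A small sketch of this step is given below; since the exact definition of \(p_{ij}\) appears only inside Algorithm 3, the center-to-center distance used here is an assumption.

```python
def find_index_token(rw_box, tokens):
    """Return the position of the index token 'ind': the token of the nearest
    KFs node with minimum (assumed center-to-center) distance to the RWs node N."""
    xn, yn, wn, hn = rw_box
    cn = (xn + wn / 2.0, yn + hn / 2.0)

    def dist(tok):
        x, y, w, h = tok["box"]
        cx, cy = x + w / 2.0, y + h / 2.0
        return ((cx - cn[0]) ** 2 + (cy - cn[1]) ** 2) ** 0.5

    return min(range(len(tokens)), key=lambda j: dist(tokens[j]))
```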

Sweeping. The sweeping is the token-by-token exploration of an entity. It is done in both directions, first to the left (step 20 in Algorithm 3) and then to the right (step 23 in Algorithm 3), to eliminate superfluous tokens. The geometric relationships, given by the distances measured between the tokens inside \(Left\_M\), are used to concatenate the relevant tokens. This matrix is defined as:

$$\begin{aligned} {Left\_M=(K(1:ind))} \end{aligned}$$
(10)

At each step, we calculate two distances between two consecutive tokens. The first distance is calculated as:

$$\begin{aligned} {n_{Z-1,Z}=Left\_M_{(Z,1)}-(Left\_M_{(Z-1,1)}+Left\_M_{(Z-1,3)})} \end{aligned}$$
(11)

This distance must not exceed the threshold S previously defined (see Sect. 2.2).

To ensure the horizontal alignment of consecutive tokens, we also need the distance between their second coordinates. This distance must not exceed a certain threshold H and is calculated as:

$$\begin{aligned} {g_{Z-1,Z}=Left\_M_{(Z,2)}-Left\_M_{(Z-1,2)}} \end{aligned}$$
(12)

The outcome of the left sweeping is a \(Left\_M\) containing only one element, which groups the relevant tokens. This element is added at the beginning of the matrix \(Right\_M\) created for the right sweeping, so that all relevant tokens end up grouped in \(Right\_M\) (step 25 in Algorithm 3), which is defined as:

$$\begin{aligned} {Right\_M=(K(Left\_M+1:end))} \end{aligned}$$
(13)

The concatenation in the right sweeping follows the same principles as in the left sweeping. Figure 5 shows the sweeping process for token concatenation. In Fig. 5(b), a subgraph of geometric relationships is established between the nodes. The nearest token, ind, having the minimum distance to the node N, is detected; here it is “19”. In Fig. 5(c), \(Left\_M\) contains two tokens, “0,000” and “19”. The distance \(n_{Z-1,Z}\) between these tokens exceeds the threshold S, so the token “0,000” is eliminated. In Fig. 5(d), \(Left\_M\), now containing one element, is inserted at the start of \(Right\_M\) and the right sweeping begins. In Fig. 5(e), the right sweeping concatenates the relevant tokens (“19”, “440,000”).
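The sketch below summarizes the two-direction sweeping, assuming the tokens are ordered from left to right and using the threshold values of around 32 for S and 12 for H reported in Sect. 3.3; handling the sign of Eq. (12) with an absolute value is our assumption.

```python
def sweep(tokens, ind, S=32, H=12):
    """Left-then-right sweeping around the index token 'ind' (Eqs. 10-13)."""
    def consecutive(left, right):
        xl, yl, wl, _ = left["box"]
        xr, yr, _, _ = right["box"]
        # Eq. (11): horizontal gap; Eq. (12): vertical alignment of consecutive tokens
        return (xr - (xl + wl)) <= S and abs(yr - yl) <= H

    kept = [tokens[ind]]
    for j in range(ind - 1, -1, -1):        # left sweeping: drop noisy tokens on the left
        if consecutive(tokens[j], kept[0]):
            kept.insert(0, tokens[j])
        else:
            break
    for j in range(ind + 1, len(tokens)):   # right sweeping on the remaining tokens
        if consecutive(kept[-1], tokens[j]):
            kept.append(tokens[j])
        else:
            break
    return " ".join(t["text"] for t in kept)
```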

3 Experiments

3.1 Dataset

For testing, we use a dataset of 930 real invoices obtained from the Compagnie des Phosphates de Gafsa (CPG). The entities are categorized into 7 types: Invoice Number (No), Invoice Date (DT), Account Identity (AI), Pre-tax Amount (PA), Total Including Tax (IT), Holdback (H) and Balance Due (BA).

Algorithm 3

It is important to note that our system supports data extraction from grayscale, color and bi-tonal (black and white) images, and is insensitive to the multiplicity of fonts. Although the preprocessing step is not part of this work, our system is able to handle slightly noisy invoices with a slight skew. These invoices contain graphical elements, logos, vertical and horizontal lines, and tables. Some of the processed invoices are shown in Fig. 6.

We have used our ground truth to evaluate our system’s performance. This ground truth was manually prepared.

3.2 Erroneous RWs Correction

In our system, entities are labeled using Regex. The patterns are written to tolerate some OCR errors in the RWs, such as confusing zero with a capital or lowercase O (e.g., Facture no\(\Rightarrow \) Facture n0). This allows unconstrained input that nearly matches the Regex pattern to be taken into account and significantly improves performance. The refined Regex allowed us to correctly detect 62 No, 13 DT, 53 PA, 17 IT, 153 AI, and 23 BA.
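As an illustrative sketch only, one way to let an RWs pattern tolerate the O/0 confusion described above is shown below; the actual refined patterns of our system are not reproduced here.

```python
import re

# "[o0]" accepts a zero where the OCR confuses it with the letter O.
RW_INVOICE_NO = re.compile(r"Facture\s*n[o0]", re.IGNORECASE)

print(bool(RW_INVOICE_NO.search("Facture n0 006651")))   # True (OCR error tolerated)
print(bool(RW_INVOICE_NO.search("FACTURE No 006651")))   # True
```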

3.3 Structure Correction Evaluation

To capture contiguous and noncontiguous entity structures, a set of Regex patterns is used in conjunction with the geometric relationships between labels. The correction step is integrated to eliminate superfluous tokens, with the goal of increasing the accuracy of the extraction. The correction step allowed us to correct 100 % of the superfluous tokens and yielded a growth in accuracy. Table 1 shows the impact of this step for each entity, comparing two options: without correction (W/o C) and with correction (With C). The most interesting result is for the No entity, since it has a countless number of formats. The obtained rates justify the fixed threshold distances between consecutive tokens. We use two thresholds: S, around 32, is fixed for concatenating the tokens in both contiguous and noncontiguous structures; the second threshold H, around 12, is fixed only in the noncontiguous structure to ensure the alignment and consecutiveness of the tokens. The chosen thresholds show the correction power of our method. To ensure the robustness of our correction method, we propose to strengthen these thresholds with other features, such as font size, to avoid the bad detections that can arise from relying on only a few thresholds in other models.

Fig. 6. Sample of invoices in our dataset.

Table 1. Impact of correction
Table 2. Missed entities
Table 3. Rates comparison

Missed entities, as detailed in Table 2, are due to the following issues. Errors in RW mean that the RWs are completely wrong, so they cannot be identified by the Regex. Errors in KF mean that the KFs are partially corrected or not corrected at all: if one field is not properly extracted, the entity is regarded as erroneous. Confusing labels (CL) mean that a label is not associated with the correct entity, which leads to a failed match for another entity. Finally, the OCR sometimes misses the text zone (MT) due to a skewed image, noise, degraded characters, bad detection of the tabular structure, etc.

3.4 Comparison with Existing System

Table 3 synthesizes the obtained Recall and Precision of our system. In this table, we also compare our work with the results obtained by the system proposed in [3]. Recall and Precision are defined as:

$$\begin{aligned} Recall=\frac{relevant~extracted~entities}{relevant~entities} \end{aligned}$$
(14)
$$\begin{aligned} Precision=\frac{relevant~extracted~entities}{extracted~entities} \end{aligned}$$
(15)
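A trivial sketch of these measures, with hypothetical counts used purely for illustration:

```python
def recall(relevant_extracted, relevant_total):
    """Eq. (14): share of ground-truth entities that were correctly extracted."""
    return relevant_extracted / relevant_total

def precision(relevant_extracted, extracted_total):
    """Eq. (15): share of extracted entities that are actually relevant."""
    return relevant_extracted / extracted_total

# Hypothetical counts: 90 relevant entities extracted out of 100 in the ground
# truth, with 95 extractions overall -> recall = 0.90, precision ~= 0.947
print(recall(90, 100), precision(90, 95))
```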

4 Conclusion

We have proposed an approach for entity extraction from scanned invoices. We have shown that adopting the local structure of entities is very efficient for data extraction and represents a powerful tool for dealing with entities of varying layout. Our method is reinforced by a correction step for superfluous token elimination. The experimental results showed an interesting improvement in the performance and accuracy of the extraction process.