Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats

Ahmad, Riaz; Afzal, Muhammad Tanvir; Qadir, Muhammad Abdul

doi:10.1007/978-3-319-46565-4_23


Part of the book series: Communications in Computer and Information Science (CCIS, volume 641)

Included in the following conference series:

Semantic Web Evaluation Challenge


Abstract

Extracting information from PDF sources is a tedious task. Most existing approaches use either a tag-based format, such as HTML or XML, or plain text for the extraction. In this paper, we present an information extraction technique for research papers that exploits both the XML and plain-text formats. The patterns and rules are prepared from the integrated formats. Furthermore, processing XML and plain text selectively, depending on the situation, complements the approach and helps achieve high accuracy. The proposed approach is heuristic-based and extracts information about the logical structure and supporting materials of research papers.
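The abstract does not spell out the rules themselves, so the following is only a minimal sketch of the general idea of combining the two formats, not the authors' system. It assumes the PDF has already been converted to both XML and plain text by some external tool. The tag names (`h1`, `h2`), the heading pattern, and the function names are all hypothetical choices made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule for the plain-text view: numbered headings such as
# "1 Introduction" or "2.1 Method" that occupy a whole line.
HEADING_RULE = re.compile(
    r"^(\d+(?:\.\d+)*)\.?[ \t]+([A-Z][^\n]{2,80})$", re.MULTILINE
)


def headings_from_xml(xml_source):
    """Collect heading text from <h1>/<h2> elements (tag names are assumed)."""
    root = ET.fromstring(xml_source)
    found = []
    for element in root.iter():
        if element.tag in ("h1", "h2"):
            text = "".join(element.itertext()).strip()
            if text:
                found.append(text)
    return found


def headings_from_text(plain_text):
    """Apply the numbered-heading rule to the plain-text view."""
    return [f"{num} {title.strip()}" for num, title in HEADING_RULE.findall(plain_text)]


def _key(heading):
    """Normalise a heading so both views can be compared: drop numbering, lowercase."""
    return re.sub(r"^\d+(?:\.\d+)*\.?\s*", "", heading).lower()


def extract_headings(xml_source, plain_text):
    """Merge headings found in either view, ordered by position in the plain text."""
    merged = {}
    for heading in headings_from_xml(xml_source) + headings_from_text(plain_text):
        merged.setdefault(_key(heading), heading)

    def position(heading):
        index = plain_text.find(heading)
        # Headings that cannot be located in the plain text are placed last.
        return index if index >= 0 else len(plain_text)

    return sorted(merged.values(), key=position)


# Toy inputs: the XML view misses "2 Method"; the text rule misses the
# unnumbered "Abstract" heading. Each view covers the other's gap.
SAMPLE_XML = (
    "<article><h1>Abstract</h1><region>Short summary.</region>"
    "<h1>1 Introduction</h1><region>Some body text.</region>"
    "<h1>3 Results</h1></article>"
)
SAMPLE_TEXT = (
    "Abstract\nShort summary.\n1 Introduction\nSome body text.\n"
    "2 Method\nMore text here.\n3 Results\nFinal numbers.\n"
)

print(extract_headings(SAMPLE_XML, SAMPLE_TEXT))
```

In this toy setup neither view alone recovers every heading, which is the kind of situation where consulting both formats could pay off. A real rule set would need many more patterns and some validation against the document layout.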



Acknowledgments

We would like to thank the Computer Science Department of CUST (Capital University of Science & Technology) for providing us with a research lab and knowledgeable support to conduct this research. We are also thankful to the members of the CDSC (Center for Distributed & Semantic Computing) research group for supporting us on different occasions in completing this work.

Author information

Authors and Affiliations

Department of Computer Science, Capital University of Science & Technology, Islamabad, Pakistan
Riaz Ahmad, Muhammad Tanvir Afzal & Muhammad Abdul Qadir


Corresponding author

Correspondence to Riaz Ahmad.

Editor information

Editors and Affiliations

IT Systems Engineering, Hasso-Plattner Institute, Potsdam, Germany
Harald Sack
Leibniz Universität Hannover, Hannover, Germany
Stefan Dietze
Elsevier B.V., Amsterdam, The Netherlands
Anna Tordai
Universität Bonn, Bonn, Germany
Christoph Lange


About this paper

Cite this paper

Ahmad, R., Afzal, M.T., Qadir, M.A. (2016). Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats. In: Sack, H., Dietze, S., Tordai, A., Lange, C. (eds) Semantic Web Challenges. SemWebEval 2016. Communications in Computer and Information Science, vol 641. Springer, Cham. https://doi.org/10.1007/978-3-319-46565-4_23


DOI: https://doi.org/10.1007/978-3-319-46565-4_23
Published: 09 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46564-7
Online ISBN: 978-3-319-46565-4
eBook Packages: Computer Science, Computer Science (R0)
