Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats

  • Conference paper
Semantic Web Challenges (SemWebEval 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 641)

Abstract

Extracting information from PDF sources is a tedious task. Most existing approaches rely on either a tag-based format, such as HTML or XML, or a plain-text format. In this paper, we present an information extraction technique for research papers that intelligently exploits both the XML and plain-text formats. Patterns and rules are prepared from the integrated formats. Furthermore, processing XML and plain text selectively, depending on the situation, complements the approach and helps it achieve high accuracy. The proposed approach is heuristic-based and extracts information about the logical structure and supportive materials of research papers.
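The idea of combining the two formats can be illustrated with a minimal sketch. It is not the authors' rule set, which this preview does not show. It assumes that a PDF has already been converted to both XML and plain text, that section headings appear in the XML under a hypothetical `h1` tag, and that a hypothetical regular-expression rule for numbered headings serves as the plain-text fallback. All function names are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical plain-text rule: numbered headings such as
# "1 Introduction" or "2.1 Related Work", each on its own line.
HEADING_RULE = re.compile(
    r"^(\d+(?:\.\d+)*)\.?[ \t]+([A-Z][^\n]{2,80})$", re.MULTILINE
)


def headings_from_xml(xml_source, tag="h1"):
    """Collect heading text from elements named `tag` (an assumed tag name)."""
    try:
        root = ET.fromstring(xml_source)
    except ET.ParseError:
        # Malformed XML: report nothing so the caller can fall back to text.
        return []
    # Join nested text and normalise whitespace for each heading element.
    return [" ".join("".join(el.itertext()).split()) for el in root.iter(tag)]


def headings_from_text(text):
    """Apply the numbered-heading rule to the plain-text rendering."""
    return [m.group(2).strip() for m in HEADING_RULE.finditer(text)]


def extract_headings(xml_source, text):
    """Prefer headings found in the XML; fall back to the plain-text rule."""
    found = [h for h in headings_from_xml(xml_source) if h]
    return found if found else headings_from_text(text)
```

Calling `extract_headings(xml_source, text)` returns the XML headings when the assumed tag is present and otherwise applies the text rule, which is one simple way to let each format cover for the other's gaps.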

Notes

  1. http://pdfx.cs.man.ac.uk

  2. http://2016.eswc-conferences.org/assessing-quality-scientific-output-its-ecosystem

  3. https://pdfbox.apache.org/

  4. https://www.editpadpro.com/

  5. http://Jena.apache.org

  6. http://www.cdsc-cust.org/downloads

  7. https://github.com/angelobo/SemPubEvaluator

Acknowledgments

We would like to thank the Computer Science Department of CUST (Capital University of Science & Technology) for providing us with a research lab and knowledgeable support to conduct this research. We are also thankful to the members of the CDSC (Center for Distributed & Semantic Computing) research group for supporting us on various occasions in completing this work.

Author information

Corresponding author

Correspondence to Riaz Ahmad.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ahmad, R., Afzal, M.T., Qadir, M.A. (2016). Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats. In: Sack, H., Dietze, S., Tordai, A., Lange, C. (eds) Semantic Web Challenges. SemWebEval 2016. Communications in Computer and Information Science, vol 641. Springer, Cham. https://doi.org/10.1007/978-3-319-46565-4_23

  • DOI: https://doi.org/10.1007/978-3-319-46565-4_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46564-7

  • Online ISBN: 978-3-319-46565-4

  • eBook Packages: Computer Science, Computer Science (R0)
