Section-wise indexing and retrieval of research articles

Cluster Computing

Abstract

Extracting relevant information is a pressing need of the scholarly community. A number of systems, such as search engines, citation indexes, and digital libraries, are available for finding relevant information in the scientific literature. For a given search query, however, users are often presented with a long list of irrelevant documents, mainly because of the sheer number of full-text documents available and also because of the unstructured nature of the indexed scientific resources. Contemporary systems have formally defined the structure of scientific documents. However, populating this already available enriched structure from unstructured or semi-structured scientific documents has not been addressed previously. In this paper, we design, implement, and evaluate an automated technique that tags each paper’s content with the logical sections appearing in the scientific document. The proposed system has been evaluated against a benchmark and has also been compared with machine learning techniques that may be used for the same task. It has been empirically shown that the overall correctness and completeness of the proposed technique are 0.78 and 0.79, respectively, giving an overall accuracy of about 0.78. These results compare favorably with machine learning based classification. The developed system may help future information retrieval systems, digital libraries, and citation indexes to index, retrieve, rank, and visualize the most relevant scientific documents for the scientific community.
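
The abstract does not spell out the tagging rules or the metric definitions, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that section headings can be mapped to logical sections with simple keyword rules (the `SECTION_KEYWORDS` table is hypothetical), that "correctness" and "completeness" correspond to precision and recall, and that the "overall accuracy" is their harmonic mean. Under that last assumption, 0.78 and 0.79 do combine to roughly 0.78.

```python
# Hypothetical keyword rules; the paper's actual rules are not given in the abstract.
SECTION_KEYWORDS = {
    "Introduction": ("introduction", "background", "motivation"),
    "Related Work": ("related work", "literature review", "state of the art"),
    "Methodology": ("method", "approach", "proposed", "architecture"),
    "Results": ("result", "evaluation", "experiment"),
    "Conclusion": ("conclusion", "future work", "summary"),
}


def tag_heading(heading: str) -> str:
    """Map a raw section heading to a logical section label, or 'Unknown'."""
    text = heading.lower()
    for label, keywords in SECTION_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return label
    return "Unknown"


def precision_recall(predicted, gold):
    """Precision (assumed 'correctness') and recall (assumed 'completeness').

    A prediction of 'Unknown' is treated as "no tag assigned".
    """
    correct = sum(1 for p, g in zip(predicted, gold) if p == g and p != "Unknown")
    tagged = sum(1 for p in predicted if p != "Unknown")
    precision = correct / tagged if tagged else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed 'overall accuracy')."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    headings = ["1. Introduction", "2 Related Work", "3 Proposed Approach",
                "4 Experimental Results", "5 Conclusions and Future Work"]
    print([tag_heading(h) for h in headings])
    # Figures reported in the abstract, combined under the harmonic-mean assumption.
    print(round(f_measure(0.78, 0.79), 2))
```

A keyword table keeps the sketch transparent; a real system would need to handle headings that match several labels or none, which is presumably where the comparison with machine learning classifiers becomes relevant.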

Author information

Corresponding author

Correspondence to Abdul Shahid.

About this article

Cite this article

Shahid, A., Afzal, M.T. Section-wise indexing and retrieval of research articles. Cluster Comput 21, 481–492 (2018). https://doi.org/10.1007/s10586-017-0914-4

  • DOI: https://doi.org/10.1007/s10586-017-0914-4
