Text Classification of Technical Papers Based on Text Segmentation

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7934)

Abstract

The goal of this research is to design a multi-label classification model that determines the research topics of a given technical paper. Based on the idea that papers are well organized and that some parts of a paper are more important than others for text classification, segments such as the title, abstract, introduction and conclusion are used intensively in the text representation. In addition, two new features, called Title Bi-Gram and Title SigNoun, are used to improve performance. The experimental results indicate that feature selection based on text segmentation and these two features are effective. Furthermore, we propose a new model for text classification based on the structure of papers, called the Back-off model, which achieves a 60.45% Exact Match Ratio and a 68.75% F-measure. The Back-off model also outperforms two existing methods, ML-kNN and the Binary Approach.
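The two evaluation measures named in the abstract can be illustrated with a minimal sketch. The abstract does not say which variant of the F-measure is reported, so the code below assumes the example-based (per-paper) F1 that is common in multi-label evaluation; the function names and the toy topic labels are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def exact_match_ratio(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of papers whose predicted topic set equals the gold set exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def example_based_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean per-paper F1 (assumed variant; the paper may average differently)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    total = 0.0
    for g, p in zip(gold, pred):
        if not g and not p:
            # Both sets empty: count as a perfect match by convention.
            total += 1.0
        else:
            # F1 = 2 * |intersection| / (|gold| + |predicted|)
            total += 2 * len(g & p) / (len(g) + len(p))
    return total / len(gold)


# Toy data: three papers with hypothetical topic labels.
gold = [{"parsing"}, {"mt", "semantics"}, {"ir"}]
pred = [{"parsing"}, {"mt"}, {"ir", "qa"}]

emr = exact_match_ratio(gold, pred)
f1 = example_based_f1(gold, pred)
print(f"Exact Match Ratio: {emr:.4f}")
print(f"Example-based F1:  {f1:.4f}")
```

Exact Match Ratio gives no credit for a partially correct label set, which is why it is typically lower than the F-measure on the same predictions, as with the 60.45% and 68.75% reported above.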

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nguyen, T.H., Shirai, K. (2013). Text Classification of Technical Papers Based on Text Segmentation. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds) Natural Language Processing and Information Systems. NLDB 2013. Lecture Notes in Computer Science, vol 7934. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38824-8_25

  • DOI: https://doi.org/10.1007/978-3-642-38824-8_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38823-1

  • Online ISBN: 978-3-642-38824-8

  • eBook Packages: Computer Science, Computer Science (R0)
