Text Classification of Technical Papers Based on Text Segmentation

Nguyen, Thien Hai; Shirai, Kiyoaki

doi:10.1007/978-3-642-38824-8_25

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7934)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

The goal of this research is to design a multi-label classification model that determines the research topics of a given technical paper. Based on the idea that papers are well organized and that some parts of a paper are more important than others for text classification, segments such as the title, abstract, introduction and conclusion are used intensively in the text representation. In addition, two new features, Title Bi-Gram and Title SigNoun, are used to improve performance. The experimental results indicate that feature selection based on text segmentation and these two features are both effective. Furthermore, we propose a new model for text classification based on the structure of papers, called the Back-off model, which achieves an Exact Match Ratio of 60.45% and an F-measure of 68.75%. The Back-off model also outperforms two existing methods, ML-kNN and the Binary Approach.
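
The two scores reported in the abstract are standard measures for multi-label classification. The sketch below shows one common way to compute them, assuming each paper has a gold set and a predicted set of topic labels. The abstract does not say which F-measure variant was used, so the example-based (per-paper) F-measure here is an assumption, and the function names and toy labels are hypothetical rather than taken from the paper.

```python
from typing import List, Set


def exact_match_ratio(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of papers whose predicted label set equals the gold label set."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def example_based_f_measure(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean over papers of 2|G & P| / (|G| + |P|).

    Assumption: this is the example-based variant; the paper may have
    used a different averaging scheme (for example micro or macro).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    total = 0.0
    for g, p in zip(gold, pred):
        denom = len(g) + len(p)
        # Two empty label sets count as a perfect match for that paper.
        total += 1.0 if denom == 0 else 2 * len(g & p) / denom
    return total / len(gold)


# Hypothetical toy data: four papers with one or two topic labels each.
gold_labels = [{"parsing"}, {"mt", "semantics"}, {"ir"}, {"mt"}]
pred_labels = [{"parsing"}, {"mt"}, {"ir", "qa"}, {"mt"}]

emr = exact_match_ratio(gold_labels, pred_labels)
f1 = example_based_f_measure(gold_labels, pred_labels)
print(f"Exact Match Ratio: {emr:.4f}")
print(f"F-measure: {f1:.4f}")
```

Exact Match Ratio gives no credit for partially correct label sets, which is why it is typically lower than the F-measure on the same predictions, as in the figures reported above.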

Author information

Authors and Affiliations

Japan Advanced Institute of Science and Technology, Japan
Thien Hai Nguyen & Kiyoaki Shirai

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, 2 rue Conté, 75003, Paris, France
Elisabeth Métais
School of Computing, Science and Engineering, University of Salford, The Crescent, M5 4WT, Salford, Lancashire, UK
Farid Meziane, Sunil Vadera & Mohamad Saraee
Department of Decision and Information Sciences School of Business Administration, Oakland University, 306 Elliott Hall, 48309, Rochester, MI, USA
Vijayan Sugumaran

About this paper

Cite this paper

Nguyen, T.H., Shirai, K. (2013). Text Classification of Technical Papers Based on Text Segmentation. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds) Natural Language Processing and Information Systems. NLDB 2013. Lecture Notes in Computer Science, vol 7934. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38824-8_25

DOI: https://doi.org/10.1007/978-3-642-38824-8_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38823-1
Online ISBN: 978-3-642-38824-8
eBook Packages: Computer Science, Computer Science (R0)
