A Hybrid Dimension Reduction Technique for Document Clustering

Nebu, Cynthia Marea; Joseph, Sumy

doi:10.1007/978-3-319-28031-8_35

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 424)

Abstract

This paper proposes a hybrid approach to dimension reduction in text classification problems that addresses the curse of dimensionality. The approach combines Feature Selection (FS) and Feature Extraction (FE) methods, which capture different aspects of feature relevance, to reduce the dimension of large text datasets effectively. Combining several FS methods prevents the selected feature set from being biased toward any single FS method. Several FS methods (Term Variance, Document Frequency, Information Gain, Shannon's Entropy measure, Mean-Median and Mean Absolute Difference) were implemented, and their performance within the hybrid system was compared. The features selected by the individual FS methods are merged using three approaches: Union, Intersection and Modified Union. The merged feature sublists then undergo Feature Extraction by PCA, and the reduced feature sublist is clustered with k-means. Finally, the sentiment score of each cluster is calculated using the SentiWordNet database, which gives the polarity of the data. Experiments were conducted on the benchmark datasets Reuters-21578 and Classic4. Performance evaluation using precision, recall, F-score and accuracy shows that the proposed method performs better than competing methods.
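The feature selection and merging stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the function names, the cutoff `k`, and the scoring formulas (document frequency as the number of documents containing a term, term variance as the population variance of a term's counts across documents) are all assumptions chosen for illustration. Only the Union and Intersection merges are shown, because the abstract does not define Modified Union, and the later PCA, k-means and SentiWordNet steps are omitted.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, rows), where rows[d][j] is the count of vocab[j] in document d."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


def document_frequency(rows):
    """Score each term by the number of documents that contain it."""
    return [sum(1 for row in rows if row[j] > 0) for j in range(len(rows[0]))]


def term_variance(rows):
    """Score each term by the population variance of its counts across documents."""
    n_docs = len(rows)
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n_docs
        scores.append(sum((x - mean) ** 2 for x in column) / n_docs)
    return scores


def top_k(vocab, scores, k):
    """Keep the k best-scoring terms; ties are broken alphabetically."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: (-pair[1], pair[0]))
    return {term for term, _ in ranked[:k]}


# Hypothetical toy corpus, not taken from Reuters-21578 or Classic4.
docs = [
    "oil prices rise as oil supply falls",
    "stock market falls on oil fears",
    "market rally lifts stock prices",
    "index terms index documents index retrieval",
]

vocab, rows = term_document_matrix(docs)
by_df = top_k(vocab, document_frequency(rows), k=5)
by_tv = top_k(vocab, term_variance(rows), k=5)

# Merge the per-method selections.
union_features = by_df | by_tv          # kept by at least one method
intersection_features = by_df & by_tv   # kept by every method

print(sorted(union_features))
print(sorted(intersection_features))
```

In this toy corpus the two scores disagree on "index" (frequent inside one document, so high variance but low document frequency) and on "stock", so the union is larger than either individual selection and the intersection is smaller. This is the trade-off between coverage and strictness that the merging approaches control.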

Author information

Authors and Affiliations

Amal Jyothi College of Engineering, Koovappally, Kerala, India
Cynthia Marea Nebu & Sumy Joseph

Corresponding author

Correspondence to Cynthia Marea Nebu.

Editor information

Editors and Affiliations

Dep. of Computer Science, VŠB – Technical Univ. of Ostrava, Ostrava, Czech Republic
Václav Snášel
(MIR Labs), Scientific Net Innov & Res Excel, Auburn, Washington, USA
Ajit Abraham
Faculty of Elec. Eng. & Comp. Sci., VŠB - Technical University of Ostrava, Ostrava-Poruba, Czech Republic
Pavel Krömer
Department of Paper Technology, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
Millie Pant
Fac of Info & Comm, Comp Inte & Tech Lab, Universiti Teknikal Malaysia Melaka, Durian Tunggal, Malaysia
Azah Kamilah Muda

About this paper

Cite this paper

Nebu, C.M., Joseph, S. (2016). A Hybrid Dimension Reduction Technique for Document Clustering. In: Snášel, V., Abraham, A., Krömer, P., Pant, M., Muda, A. (eds) Innovations in Bio-Inspired Computing and Applications. Advances in Intelligent Systems and Computing, vol 424. Springer, Cham. https://doi.org/10.1007/978-3-319-28031-8_35

DOI: https://doi.org/10.1007/978-3-319-28031-8_35
Published: 15 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28030-1
Online ISBN: 978-3-319-28031-8
eBook Packages: Engineering, Engineering (R0)
