Abstract
This paper proposes a hybrid approach to dimension reduction in text classification problems that addresses the curse of dimensionality. The approach combines Feature Selection (FS) and Feature Extraction (FE) methods, which consider different aspects of feature relevance, to reduce the dimension of large text datasets effectively. It also prevents the selected features from being biased in favor of any single FS method. Several FS methods, including Term Variance, Document Frequency, Information Gain, Shannon's entropy measure, Mean-Median, and Mean Absolute Difference, were implemented, and their performance within the hybrid system was compared. The features selected by the individual FS methods are merged using three approaches: Union, Intersection, and Modified Union. The merged feature sublists then undergo Feature Extraction by Principal Component Analysis (PCA), and the reduced feature sublist is clustered with k-means. Finally, the sentiment score of each cluster is calculated using the SentiWordNet database, which gives the polarity of the data. The experiments were conducted on the benchmark datasets Reuters-21578 and Classic4. Evaluation using precision, recall, F-score, and accuracy shows that the proposed method performs better than competing methods.
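The merging step can be illustrated with a minimal sketch. It assumes a toy document-term count matrix and scores terms with only two of the FS methods named above (Document Frequency and Term Variance). The data, the cutoff `k`, and the helper names are hypothetical and are not taken from the paper. Modified Union is left out because the abstract does not define it.

```python
from statistics import pvariance

# Toy document-term count matrix (hypothetical data):
# rows are documents, columns are terms.
terms = ["market", "stock", "the", "retrieval", "index"]
counts = [
    [3, 2, 5, 0, 0],
    [0, 1, 4, 0, 0],
    [4, 1, 6, 0, 0],
    [0, 0, 5, 2, 3],
]

def document_frequency(column):
    """Number of documents in which the term occurs at least once."""
    return sum(1 for c in column if c > 0)

def term_variance(column):
    """Population variance of the term's counts across documents."""
    return pvariance(column)

def top_k(scores, k):
    """Return the set of the k terms with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# One list of per-document counts for each term.
columns = {t: [row[j] for row in counts] for j, t in enumerate(terms)}
df_scores = {t: document_frequency(col) for t, col in columns.items()}
tv_scores = {t: term_variance(col) for t, col in columns.items()}

k = 3  # assumed cutoff; each FS method keeps its k best terms
selected_df = top_k(df_scores, k)
selected_tv = top_k(tv_scores, k)

# Merge the two selections: Union keeps a term chosen by either
# method, Intersection keeps only terms chosen by both.
merged_union = selected_df | selected_tv
merged_intersection = selected_df & selected_tv

print(sorted(merged_union))
print(sorted(merged_intersection))
```

On this toy data the two methods agree only on "market", so the intersection is much smaller than the union. This shows how the choice of merge strategy changes the size of the feature list passed on to the extraction step.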
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Nebu, C.M., Joseph, S. (2016). A Hybrid Dimension Reduction Technique for Document Clustering. In: Snášel, V., Abraham, A., Krömer, P., Pant, M., Muda, A. (eds) Innovations in Bio-Inspired Computing and Applications. Advances in Intelligent Systems and Computing, vol 424. Springer, Cham. https://doi.org/10.1007/978-3-319-28031-8_35
DOI: https://doi.org/10.1007/978-3-319-28031-8_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28030-1
Online ISBN: 978-3-319-28031-8
eBook Packages: Engineering, Engineering (R0)