A Hybrid Dimension Reduction Technique for Document Clustering

  • Conference paper
Innovations in Bio-Inspired Computing and Applications

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 424))

Abstract

This paper proposes a hybrid approach to dimension reduction in text classification problems, aimed at overcoming the curse of dimensionality. The approach combines Feature Selection (FS) and Feature Extraction (FE) methods, which consider different aspects of feature relevance, to reduce the dimension of large text datasets effectively. Combining methods keeps the selection from being biased toward any single FS method. Several FS methods, namely Term Variance, Document Frequency, Information Gain, Shannon's Entropy measure, Mean-Median and Mean Absolute Difference, were implemented, and their performance within the hybrid system was compared. The features selected by the individual FS methods are merged using three approaches: Union, Intersection and Modified Union. The resulting feature sublists then undergo Feature Extraction by PCA, and the reduced feature sublist is clustered with k-means. Finally, the sentiment score of each cluster is calculated using the SentiWordNet database, which gives the polarity of the data. The experiments were conducted on the benchmark datasets Reuters-21578 and Classic4. Performance evaluation using precision, recall, F-score and accuracy indicates that the proposed method outperforms competing methods.
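
The selection-and-merge stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy term counts, the helper `top_k`, and the cutoff `k=2` are all assumptions made for illustration, only two of the listed FS methods (Term Variance and Document Frequency) are shown, and Modified Union is omitted because the abstract does not define it. The PCA and k-means stages would operate on the merged feature set afterwards.

```python
from statistics import pvariance

def term_variance(column):
    """Population variance of one term's frequencies across documents."""
    return pvariance(column)

def document_frequency(column):
    """Number of documents in which the term occurs at least once."""
    return sum(1 for count in column if count > 0)

def top_k(term_columns, score, k):
    """Hypothetical helper: the set of k terms with the highest score."""
    ranked = sorted(term_columns, key=lambda t: score(term_columns[t]), reverse=True)
    return set(ranked[:k])

# Toy data (made up): term -> frequency in each of four documents.
term_columns = {
    "market":    [5, 0, 4, 1],
    "stock":     [3, 0, 0, 0],
    "the":       [2, 2, 2, 2],
    "retrieval": [0, 1, 0, 1],
}

# Each FS method ranks the terms by its own notion of relevance.
by_variance = top_k(term_columns, term_variance, k=2)
by_doc_freq = top_k(term_columns, document_frequency, k=2)

# Merge the per-method selections.
merged_union = by_variance | by_doc_freq          # kept by at least one method
merged_intersection = by_variance & by_doc_freq   # kept by every method

print(sorted(merged_union))
print(sorted(merged_intersection))
```

In this toy case the two criteria disagree: Term Variance favors terms whose counts differ sharply between documents, while Document Frequency favors terms that appear in many documents. Union keeps every term either method selects, and Intersection keeps only the terms both agree on, which is why the merged lists differ in size.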

Corresponding author

Correspondence to Cynthia Marea Nebu.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Nebu, C.M., Joseph, S. (2016). A Hybrid Dimension Reduction Technique for Document Clustering. In: Snášel, V., Abraham, A., Krömer, P., Pant, M., Muda, A. (eds) Innovations in Bio-Inspired Computing and Applications. Advances in Intelligent Systems and Computing, vol 424. Springer, Cham. https://doi.org/10.1007/978-3-319-28031-8_35

  • DOI: https://doi.org/10.1007/978-3-319-28031-8_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28030-1

  • Online ISBN: 978-3-319-28031-8

  • eBook Packages: Engineering, Engineering (R0)
