A Clustering Based Feature Selection Method Using Feature Information Distance for Text Data

Chao, Shilong; Cai, Jie; Yang, Sheng; Wang, Shulin

doi:10.1007/978-3-319-42291-6_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9771)

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

Feature selection is a key step in text classification. This paper proposes a new feature selection method based on clustering features by information distance. The method uses an information distance measure to build a space of feature clusters. First, the K-medoids clustering algorithm groups the features into k clusters. Second, the feature with the largest mutual information with the class is selected from each cluster to form a feature subset. Finally, the target number of features is chosen from this subset with the mRMR algorithm. The algorithm fully accounts for the diversity among features. Unlike mRMR used alone, which is an incremental search, it avoids falling prematurely into a local optimum. Experimental results show that the features selected by the proposed algorithm yield better classification accuracy.
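
The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the information distance, so the sketch assumes the variation of information, d(X, Y) = H(X|Y) + H(Y|X); features are assumed to be discrete columns (for example, binarized term occurrences); mRMR is taken in its common difference form (relevance minus mean redundancy); and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(xs):
    """Shannon entropy in bits of a discrete sequence."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())


def mutual_info(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def info_distance(xs, ys):
    """ASSUMPTION: variation of information, H(X|Y) + H(Y|X)."""
    return entropy(xs) + entropy(ys) - 2.0 * mutual_info(xs, ys)


def k_medoids(dist, k, iters=50, seed=0):
    """Alternating k-medoids on a precomputed distance matrix.

    Returns clusters as lists of feature indices. Empty clusters are
    dropped, so fewer than k clusters may come back.
    """
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)
    clusters = []
    for _ in range(iters):
        groups = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            groups[nearest].append(i)
        clusters = [g for g in groups.values() if g]
        # New medoid of a cluster: the member with the smallest total distance.
        new = [min(g, key=lambda c: sum(dist[c][j] for j in g)) for g in clusters]
        if set(new) == set(medoids):
            break
        medoids = new
    return clusters


def mrmr(features, labels, candidates, target):
    """Greedy mRMR (difference form) restricted to the candidate indices."""
    relevance = {f: mutual_info(features[f], labels) for f in candidates}
    selected = [max(candidates, key=relevance.get)]
    while len(selected) < min(target, len(candidates)):
        rest = [f for f in candidates if f not in selected]

        def score(f):
            red = sum(mutual_info(features[f], features[s]) for s in selected)
            return relevance[f] - red / len(selected)

        selected.append(max(rest, key=score))
    return selected


def select_features(features, labels, k, target, seed=0):
    """Cluster features, keep one representative per cluster, then run mRMR."""
    n = len(features)
    dist = [[info_distance(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    clusters = k_medoids(dist, k, seed=seed)
    # Representative: the cluster member with the largest MI with the class.
    reps = [max(c, key=lambda f: mutual_info(features[f], labels))
            for c in clusters]
    return mrmr(features, labels, reps, target)


# Toy data (made up for illustration): each inner list is one feature column.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # matches the class exactly
    [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of the first feature
    [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the class
    [0, 0, 0, 1, 1, 1, 1, 0],  # partially informative
    [1, 1, 1, 0, 0, 0, 0, 1],  # complement of the previous feature
]
print(select_features(toy_features, toy_labels, k=3, target=2))
```

In this sketch, duplicated features sit at distance zero from each other and therefore land in the same cluster, so at most one of them reaches the mRMR stage. That is the sense in which clustering first can enforce diversity before the incremental search begins.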

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant No. 61472467).

Author information

Authors and Affiliations

College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
Shilong Chao, Jie Cai, Sheng Yang & Shulin Wang

Corresponding author

Correspondence to Sheng Yang.

Editor information

Editors and Affiliations

Tongji University, Shanghai, China
De-Shuang Huang
Polytechnic of Bari, Bari, Italy
Vitoantonio Bevilacqua
University of Wollongong, North Wollongong, New South Wales, Australia
Prashan Premaratne

About this paper

Cite this paper

Chao, S., Cai, J., Yang, S., Wang, S. (2016). A Clustering Based Feature Selection Method Using Feature Information Distance for Text Data. In: Huang, DS., Bevilacqua, V., Premaratne, P. (eds) Intelligent Computing Theories and Application. ICIC 2016. Lecture Notes in Computer Science, vol 9771. Springer, Cham. https://doi.org/10.1007/978-3-319-42291-6_12

DOI: https://doi.org/10.1007/978-3-319-42291-6_12
Published: 12 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42290-9
Online ISBN: 978-3-319-42291-6
eBook Packages: Computer Science, Computer Science (R0)
