New Labeling Strategy for Semi-supervised Document Categorization

Zhu, Yan; Jing, Liping; Yu, Jian

doi:10.1007/978-3-642-10488-6_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5914)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Semi-supervised learning usually requires some prior knowledge to supervise the learning process, such as seeds in Seeded-Kmeans, pairwise constraints in COP-Kmeans, and labeled data for training an initial useful classifier in S3VM. Such prior knowledge is generally provided by domain experts, so it is very expensive to obtain. In this paper, we propose a new automatic document labeling strategy that derives much more prior knowledge from the very limited labeled data and the whole data set. Experimental results on the 20-Newsgroup text data show that the new strategy is helpful for semi-supervised document categorization and improves the learning performance.
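
To make concrete what "seeds" means as prior knowledge, here is a minimal sketch of seeded k-means in the general style of Seeded-Kmeans: a few labeled points initialise the centroids, and ordinary k-means iterations follow. It is not the labeling strategy proposed in this paper, which the abstract does not detail. The function name `seeded_kmeans`, the helper names, and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(points, seeds, max_iter=100):
    """Cluster `points`, using `seeds` (index -> label) only for initialisation.

    Hypothetical helper: every cluster must have at least one seed.
    """
    # Group the seed points by their given label.
    groups = defaultdict(list)
    for idx, label in seeds.items():
        groups[label].append(points[idx])
    labels = sorted(groups)

    # Initial centroid of each cluster = mean of its seed points.
    centroids = {c: mean(groups[c]) for c in labels}

    assignment = None
    for _ in range(max_iter):
        # Assign every point (seeds included) to its nearest centroid.
        new_assignment = [
            min(labels, key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment

        # Recompute each centroid from its current members.
        for c in labels:
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids


# Toy data: two well-separated groups, one seed per group.
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = seeded_kmeans(pts, {0: "a", 3: "b"})
print(assignment)
```

In this sketch the seed labels only shape the starting centroids and may be reassigned later; a variant that keeps seed labels fixed would skip reassignment for the seeded indices. The more seeds are available, the better the initial centroids tend to be, which is why deriving additional labeled data automatically can help.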

Author information

Authors and Affiliations

School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
Yan Zhu, Liping Jing & Jian Yu

Editor information

Editors and Affiliations

Faculty of Computer Science, Institute for Knowledge and Business Engineering, University of Vienna, Brünner Straße 72, 1210, Vienna, Austria
Dimitris Karagiannis
School of Electronic Engineering and Computer Science, Peking University, No. 5 Yiheyuan Road, 100871, Beijing, China
Zhi Jin

About this paper

Cite this paper

Zhu, Y., Jing, L., Yu, J. (2009). New Labeling Strategy for Semi-supervised Document Categorization. In: Karagiannis, D., Jin, Z. (eds) Knowledge Science, Engineering and Management. KSEM 2009. Lecture Notes in Computer Science, vol 5914. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10488-6_16

DOI: https://doi.org/10.1007/978-3-642-10488-6_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10487-9
Online ISBN: 978-3-642-10488-6
eBook Packages: Computer Science, Computer Science (R0)
