New Labeling Strategy for Semi-supervised Document Categorization

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5914)

Abstract

Semi-supervised learning usually requires some prior knowledge to supervise the learning process, such as seeds in Seeded-Kmeans, pairwise constraints in COP-Kmeans, and labeled data for training a useful initial classifier in S3VM. Such prior knowledge is generally provided by a domain expert, so it is expensive to obtain. In this paper, we propose a new automatic document labeling strategy that derives much more prior knowledge from the very limited labeled data and the whole data set. Experimental results on 20-Newsgroup text data show that the new strategy is helpful for semi-supervised document categorization and improves learning performance.
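
To make the role of such prior knowledge concrete, the following is a minimal sketch of how labeled seeds can initialize k-means, in the spirit of Seeded-Kmeans. It is not the labeling strategy proposed in this paper, which the abstract does not describe. The function names `seeded_kmeans` and `mean_point`, the toy 2-D points, and the use of Euclidean distance are all assumptions made for illustration; document data would normally be high-dimensional term vectors.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_point(pts):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))


def seeded_kmeans(points, seeds, n_iter=20):
    """Cluster `points` with k-means, initialized from labeled seeds.

    points: list of coordinate tuples.
    seeds:  dict mapping a cluster label to a list of indices into `points`
            (hypothetical input format chosen for this sketch).
    Returns one label per point.
    """
    labels = sorted(seeds)
    # Prior knowledge enters here: each centroid starts at the mean of the
    # seed points that the expert labeled for that cluster.
    centroids = [mean_point([points[i] for i in seeds[lab]]) for lab in labels]
    assign = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assign = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        # Update step: recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centroids[k] = mean_point(members)
    return [labels[a] for a in assign]


# Toy data: two well-separated groups, one labeled seed per group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = {"a": [0], "b": [3]}
result = seeded_kmeans(points, seeds)
```

With only one seed per cluster the initial centroids can be poor on real data, which is the situation the abstract points to: more labeled examples give better starting points, but each one costs expert effort.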

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

Cite this paper

Zhu, Y., Jing, L., Yu, J. (2009). New Labeling Strategy for Semi-supervised Document Categorization. In: Karagiannis, D., Jin, Z. (eds) Knowledge Science, Engineering and Management. KSEM 2009. Lecture Notes in Computer Science, vol 5914. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10488-6_16

  • DOI: https://doi.org/10.1007/978-3-642-10488-6_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-10487-9

  • Online ISBN: 978-3-642-10488-6

  • eBook Packages: Computer Science, Computer Science (R0)
