Abstract
There are circumstances where classification is required only if a certain condition, such as a specific level of quality, is met. This paper investigates a semi-automatic solution in which only the predictions for the documents that are most likely to be correctly classified are considered. The method provides high-quality automatic classification for large subsets of the collection and relies on human expertise for the “most complicated” decisions. The paper presents different approaches to measuring document difficulty and discusses the benefits of applying them to semi-automatic classification. In addition, experiments are carried out to show the results achieved for different subsets of the collection. The experiments show that quality can be improved significantly for large subsets of two different collections (e.g. a 13% micro-F1 increase with 70% of the documents). Furthermore, they show that the approach provides a flexible mechanism for applying automatic classification to specific subsets as long as specific constraints are met.
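The split between automatic and manual decisions described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not specify how difficulty is measured, so the `difficulty` function below (one minus the gap between the two highest class scores) is an assumption, as are the names `split_by_difficulty` and `auto_fraction` and the single-label setting.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]  # class label -> classifier score for one document


def difficulty(scores: Scores) -> float:
    """Hypothetical difficulty measure (an assumption, not from the paper):
    a small gap between the two best class scores means a hard document."""
    ranked = sorted(scores.values(), reverse=True)
    margin = ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
    return 1.0 - margin


def split_by_difficulty(
    doc_scores: Dict[str, Scores], auto_fraction: float
) -> Tuple[Dict[str, str], List[str]]:
    """Classify the easiest `auto_fraction` of documents automatically and
    return the remaining document ids for human review."""
    # Easiest documents first.
    ordered = sorted(doc_scores, key=lambda doc: difficulty(doc_scores[doc]))
    n_auto = int(len(ordered) * auto_fraction)
    # Accept the top-scoring class for the easy documents only.
    auto = {
        doc: max(doc_scores[doc], key=doc_scores[doc].get)
        for doc in ordered[:n_auto]
    }
    manual = ordered[n_auto:]
    return auto, manual


# Toy scores, invented for illustration only.
doc_scores = {
    "d1": {"sports": 0.90, "politics": 0.10},
    "d2": {"sports": 0.55, "politics": 0.45},
    "d3": {"sports": 0.20, "politics": 0.80},
    "d4": {"sports": 0.40, "politics": 0.60},
}
auto, manual = split_by_difficulty(doc_scores, auto_fraction=0.5)
print(auto)    # predictions accepted automatically
print(manual)  # documents left for a human expert
```

Raising or lowering `auto_fraction` trades coverage against expected quality, which is the flexibility the abstract refers to: a stricter quality constraint corresponds to a smaller automatically classified subset.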
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martinez-Alvarez, M., Yahyaei, S., Roelleke, T. (2012). Semi-automatic Document Classification: Exploiting Document Difficulty. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_43
DOI: https://doi.org/10.1007/978-3-642-28997-2_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28996-5
Online ISBN: 978-3-642-28997-2
eBook Packages: Computer Science, Computer Science (R0)