Semi-automatic Document Classification: Exploiting Document Difficulty

Martinez-Alvarez, Miguel; Yahyaei, Sirvan; Roelleke, Thomas

doi:10.1007/978-3-642-28997-2_43


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7224)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

There are circumstances where classification is required only if a certain condition, such as a specific level of quality, is met. This paper investigates a semi-automatic solution in which only the predictions for the documents that are more likely to be correctly classified are considered. This method provides high-quality automatic classification for large subsets of the collection and employs human expertise for the “most complicated” decisions. The paper presents different approaches to measuring document difficulty and discusses the benefits of applying them to semi-automatic classification. In addition, experiments are carried out to show the results achieved for different subsets of the collection. The experiments show that it is possible to improve quality significantly for large subsets (e.g. a 13% micro-F₁ increase with 70% of the documents) of two different collections. Furthermore, the approach provides a flexible mechanism for applying automatic classification to specific subsets while specific constraints are met.
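
The routing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not say how document difficulty is measured, so the sketch assumes a hypothetical proxy (the gap between the two highest class scores) and single-label classification. The names `margin`, `split_by_difficulty` and `auto_fraction` are invented for illustration.

```python
from typing import Dict, List, Tuple

# Class label -> classifier score for one document (assumed input format).
Scores = Dict[str, float]


def margin(scores: Scores) -> float:
    """Hypothetical difficulty proxy: gap between the two highest scores.

    A small gap means the classifier is undecided, so the document is
    treated as difficult. Assumes at least one class score is present.
    """
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]


def split_by_difficulty(
    docs: Dict[str, Scores], auto_fraction: float
) -> Tuple[Dict[str, str], List[str]]:
    """Auto-label the easiest `auto_fraction` of documents; defer the rest.

    Returns (automatic predictions, ids of documents left for a human).
    """
    if not 0.0 <= auto_fraction <= 1.0:
        raise ValueError("auto_fraction must be in [0, 1]")
    # Easiest documents (largest margin) come first.
    ordered = sorted(docs, key=lambda d: margin(docs[d]), reverse=True)
    n_auto = int(auto_fraction * len(ordered))  # rounds down
    # For the easy documents, predict the highest-scoring class.
    auto = {d: max(docs[d], key=docs[d].get) for d in ordered[:n_auto]}
    manual = ordered[n_auto:]
    return auto, manual
```

Raising `auto_fraction` sends fewer documents to human experts at the cost of accepting harder automatic decisions, which is the trade-off the abstract describes between subset size and classification quality.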

Author information

Authors and Affiliations

Queen Mary, University of London, UK
Miguel Martinez-Alvarez, Sirvan Yahyaei & Thomas Roelleke
Globe Business Publishing Ltd., UK
Miguel Martinez-Alvarez

Editor information

Editors and Affiliations

Yahoo! Research, Diagonal 177, 08018, Barcelona, Spain
Ricardo Baeza-Yates & B. Barla Cambazoglu
Centrum Wiskunde & Informatica, Science Park 123, Amsterdam, The Netherlands
Arjen P. de Vries
Websays, Nàpols 294 7-4, 08025, Barcelona, Spain
Hugo Zaragoza
Yahoo! Research, Diagonal 177, 08018, Barcelona, Spain
Vanessa Murdock
Yahoo! Labs, Tower 3, Matam Park, 31905, Haifa, Israel
Ronny Lempel
ISTI-CNR, via G. Moruzzi, 1, 56124, Pisa, Italy
Fabrizio Silvestri

About this paper

Cite this paper

Martinez-Alvarez, M., Yahyaei, S., Roelleke, T. (2012). Semi-automatic Document Classification: Exploiting Document Difficulty. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_43

DOI: https://doi.org/10.1007/978-3-642-28997-2_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28996-5
Online ISBN: 978-3-642-28997-2
eBook Packages: Computer Science, Computer Science (R0)
