Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts Under Precision and Recall Constraints

  • Conference paper
Digital Libraries for Open Knowledge (TPDL 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11057)

Abstract

Digital libraries strive to integrate automatic subject indexing methods into operative information retrieval systems, yet integration is prevented by misleading and incomplete semantic annotations. For this reason, we investigate approaches to detect documents for which quality criteria are met. In contrast to mainstream methods, our approach, named Qualle, estimates quality at the document level rather than the concept level. Qualle combines different machine learning models into a deep, multi-layered regression architecture that comprises a variety of content-based indicators, in particular label set size calibration. We evaluated the approach on very short texts from law and economics, investigating the impact of different feature groups on recall estimation. Our results show that Qualle effectively determined subsets of previously unseen data where considerable gains in document-level recall can be achieved while precision is upheld. Such filtering can therefore be used to control compliance with data quality standards in practice. Qualle makes it possible to trade off indexing quality against collection coverage, and it can complement semi-automatic indexing to process large datasets more efficiently.
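The filtering idea in the abstract can be illustrated with a minimal sketch. It is not Qualle's implementation: it assumes that some regression model has already produced a recall estimate per document, and it only shows how a threshold on those estimates trades collection coverage for expected quality. The function name `select_documents`, the document identifiers, and the estimate values are hypothetical.

```python
from typing import Dict, List, Tuple


def select_documents(
    estimated_recall: Dict[str, float], threshold: float
) -> Tuple[List[str], float]:
    """Keep documents whose estimated recall meets the threshold.

    Returns the selected document ids (sorted) and the coverage, i.e. the
    fraction of the collection that passes the filter.
    """
    if not estimated_recall:
        return [], 0.0
    selected = sorted(
        doc_id for doc_id, recall in estimated_recall.items() if recall >= threshold
    )
    coverage = len(selected) / len(estimated_recall)
    return selected, coverage


# Hypothetical per-document recall estimates (made-up values for illustration;
# in practice they would come from a trained document-level regressor).
estimates = {"doc1": 0.82, "doc2": 0.35, "doc3": 0.61, "doc4": 0.90}

# A lenient threshold keeps more documents; a strict one keeps fewer.
lenient_ids, lenient_coverage = select_documents(estimates, threshold=0.6)
strict_ids, strict_coverage = select_documents(estimates, threshold=0.85)
```

Raising the threshold shrinks the automatically indexed subset; documents that fall below it could be routed to semi-automatic or manual indexing instead.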

Notes

  1. For brevity, the term subject may be omitted in subject indexing, subject indexer, ..., respectively.

  2. The concept identifier 29638-6 refers to the concept “Low-interest-rate policy”.

  3. If only ranking is relevant, rank-based correlation coefficients should be considered.

  4. http://eurovoc.europa.eu/, accessed: 31.12.2017.

  5. http://www.ke.tu-darmstadt.de/resources/eurlex, accessed: 31.12.2017.

  6. http://zbw.eu/stw/version/latest/about.en.html, accessed: 09.01.2018.

Author information

Correspondence to Martin Toepfer.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Toepfer, M., Seifert, C. (2018). Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts Under Precision and Recall Constraints. In: Méndez, E., Crestani, F., Ribeiro, C., David, G., Lopes, J. (eds) Digital Libraries for Open Knowledge. TPDL 2018. Lecture Notes in Computer Science, vol 11057. Springer, Cham. https://doi.org/10.1007/978-3-030-00066-0_1

  • DOI: https://doi.org/10.1007/978-3-030-00066-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00065-3

  • Online ISBN: 978-3-030-00066-0

  • eBook Packages: Computer Science (R0)
