
Active Learning Strategies for Technology Assisted Sensitivity Review

  • Conference paper
  • In: Advances in Information Retrieval (ECIR 2018)

Abstract

Government documents must be reviewed to identify and protect any sensitive information, such as personal information, before they can be released to the public. However, in the era of digital government documents, such as e-mail, traditional sensitivity review procedures are no longer practical, for example due to the sheer volume of documents to be reviewed. Therefore, new technology-assisted review protocols are needed to integrate automatic sensitivity classification into the sensitivity review process. To effectively assist sensitivity review, such assistive technologies must incorporate reviewer feedback, so that sensitivity classifiers can quickly learn and adapt to the sensitivities within a collection when the types of sensitivity are not known a priori. In this work, we present a thorough evaluation of active learning strategies for sensitivity review. Moreover, we present an active learning strategy that integrates reviewer feedback, from sensitive text annotations, to identify features of sensitivity that enable us to learn an effective sensitivity classifier (0.7 Balanced Accuracy) using significantly less reviewer effort, according to the sign test (\(p<0.01\)). This approach results in a 51% reduction in the number of documents that must be reviewed to achieve the same level of classification accuracy, compared to deploying the approach without annotation features.
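The two mechanisms named in the abstract can be illustrated with a minimal sketch: Balanced Accuracy as the evaluation metric (mean per-class recall, appropriate when sensitive documents are rare), and uncertainty sampling as a baseline active learning selection rule. The function names and the selection criterion below are illustrative assumptions, not the authors' actual implementation, which additionally exploits annotation-derived features.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall (BAC): insensitive to class imbalance,
    which matters because sensitive documents are typically a small
    fraction of a collection."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return sum(recalls) / len(recalls)

def most_uncertain(pool_probs):
    """Uncertainty sampling: suggest for review the pooled document
    whose estimated P(sensitive) lies closest to the 0.5 decision
    boundary, i.e. the one the classifier is least sure about."""
    return min(range(len(pool_probs)), key=lambda i: abs(pool_probs[i] - 0.5))
```

For example, `balanced_accuracy([1, 1, 1, 0], [1, 1, 1, 1])` is 0.5 (perfect recall on the sensitive class, zero on the non-sensitive class), whereas plain accuracy would report a misleading 0.75.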


Notes

  1.

    http://www.right2info.org/access-to-information-laws/access-to-information-laws.

  2.

    In active learning parlance, “query” usually refers to membership queries, i.e., the system poses queries in the form of instances to be reviewed. In this work, we use query in the IR sense, i.e., a textual passage used to retrieve relevant documents from an IR system. For membership queries, we say that the system suggests documents to be reviewed.

  3.

    In practice, this means that we randomly down-sample the classifier’s training data to loosely match the class frequencies. In preliminary experiments, this led to uniform improvements of \(\sim\)+0.4 Balanced Accuracy across all tested approaches, after all documents had been reviewed.
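Footnote 3's down-sampling step can be sketched as below. The function name, the `ratio` parameter, the seeding, and the exact balancing rule are illustrative assumptions; the footnote only states that the majority class is randomly down-sampled to loosely match the class frequencies.

```python
import random

def downsample_majority(docs, labels, ratio=1.0, seed=0):
    """Randomly down-sample every non-minority class so that its size
    is at most ratio * |minority class|, loosely rebalancing the
    classifier's training data."""
    rng = random.Random(seed)  # fixed seed keeps experiments repeatable
    by_class = {}
    for d, y in zip(docs, labels):
        by_class.setdefault(y, []).append(d)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    target = max(1, int(ratio * len(by_class[minority])))
    out_docs, out_labels = [], []
    for c, ds in by_class.items():
        keep = ds if c == minority else rng.sample(ds, min(target, len(ds)))
        for d in keep:
            out_docs.append(d)
            out_labels.append(c)
    return out_docs, out_labels
```

With two sensitive and eight non-sensitive documents and `ratio=1.0`, this keeps both sensitive documents and a random two of the eight non-sensitive ones.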



Author information

Correspondence to Graham McDonald.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

McDonald, G., Macdonald, C., Ounis, I. (2018). Active Learning Strategies for Technology Assisted Sensitivity Review. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol. 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_33


  • DOI: https://doi.org/10.1007/978-3-319-76941-7_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-76940-0

  • Online ISBN: 978-3-319-76941-7

