The Beauty of Small Data: An Information Retrieval Perspective

Applied Data Science

Abstract

This chapter focuses on those Data Science problems that we will refer to as “Small Data” problems. Over the past 20 years, we have accumulated considerable experience with Information Retrieval applications that allow effective search on collections no larger than tens or hundreds of thousands of documents. In this chapter, we want to highlight a number of lessons learned in dealing with such document collections.

The better-known term “Big Data” has in recent years created a lot of buzz, but also frequent misunderstandings. To use a provocative simplification, the magic of Big Data often lies in the fact that the sheer volume of data necessarily brings redundancy, which can be detected in the form of patterns. Algorithms can then be trained to recognize and process these repeated patterns in the data streams.

Conversely, “Small Data” approaches do not operate on volumes of data big enough to exploit repetitive patterns to a successful degree. While there have been spectacular applications of Big Data technology, we are convinced that there are and will remain countless, equally exciting, “Small Data” tasks, across all industrial and public sectors, and also for private applications. They have to be approached in a very different manner to Big Data problems. In this chapter, we will first argue that the task of retrieving documents from large text collections (often termed “full text search”) can become easier as the document collection grows. We then present two exemplary “Small Data” retrieval applications and discuss the best practices that can be derived from such applications.

Acknowledgments

The retrieval applications “Stiftung Schweiz” and “Expert Match” were partially funded by the Swiss funding agency CTI under grants no. 15666.1 and no. 13235.1.

Author information

Corresponding author

Correspondence to Martin Braschler.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Braschler, M. (2019). The Beauty of Small Data: An Information Retrieval Perspective. In: Braschler, M., Stadelmann, T., Stockinger, K. (eds) Applied Data Science. Springer, Cham. https://doi.org/10.1007/978-3-030-11821-1_13

  • DOI: https://doi.org/10.1007/978-3-030-11821-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11820-4

  • Online ISBN: 978-3-030-11821-1

  • eBook Packages: Computer Science, Computer Science (R0)
