What was the Query? Generating Queries for Document Sets with Applications in Cluster Labeling

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9103)

Abstract

We address the task of generating a query that retrieves a given set of documents. In its abstract form, this can be seen as a “compression” of the document set into a short query. But the task also has a real-world application: cluster labeling (e.g., for faceted search). We propose to label a cluster with a query that approximately retrieves the cluster’s documents. To be generalizable, our approach does not require access to a search index but only to a public interface such as an API. This way, our approach can also be implemented on the client side.

In an experimental evaluation, a basic version of our approach using a simple retrieval model is on par with standard cluster labeling techniques. A further user study reveals that queries as labels are often preferred when they are not too long.
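The idea of a query that approximately retrieves a given document set can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes a greedy term selection scored by F1 against the target set, and it treats the search engine as a black-box function, in the spirit of the "public interface" requirement. All names (`DOCS`, `toy_search`, `greedy_query`) and the toy documents are hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical toy collection standing in for a real search engine's corpus.
DOCS = {
    "d1": "jaguar car engine",
    "d2": "jaguar car dealer",
    "d3": "jaguar cat jungle",
    "d4": "python snake jungle",
}


def toy_search(terms: List[str]) -> Set[str]:
    """Black-box stand-in for a search API: conjunctive Boolean retrieval."""
    return {doc_id for doc_id, text in DOCS.items()
            if all(term in text.split() for term in terms)}


def f1(retrieved: Set[str], target: Set[str]) -> float:
    """Harmonic mean of precision and recall of `retrieved` w.r.t. `target`."""
    hits = len(retrieved & target)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(target)
    return 2 * precision * recall / (precision + recall)


def greedy_query(target: Set[str],
                 vocabulary: Iterable[str],
                 search: Callable[[List[str]], Set[str]],
                 max_terms: int = 3) -> Tuple[List[str], float]:
    """Greedily add the term that most improves F1; stop when nothing improves."""
    query: List[str] = []
    best = 0.0
    candidates = sorted(set(vocabulary))  # sorted for deterministic tie-breaking
    while len(query) < max_terms:
        scored = [(f1(search(query + [term]), target), term)
                  for term in candidates if term not in query]
        if not scored:
            break
        score, term = max(scored, key=lambda pair: pair[0])
        if score <= best:  # no strict improvement: keep the shorter query
            break
        query.append(term)
        best = score
    return query, best


# Candidate terms are taken from the target documents themselves.
target = {"d1", "d2"}
vocabulary = {word for doc_id in target for word in DOCS[doc_id].split()}
label, score = greedy_query(target, vocabulary, toy_search)
```

In this toy setting, the resulting `label` would serve as the cluster label for the two car-related documents. Stopping as soon as no term strictly improves the score keeps the query short, which matches the abstract's observation that shorter queries tend to be preferred as labels.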


Notes

  1. http://credo.fub.it/ambient/, last accessed: May 20, 2014.


Corresponding author

Correspondence to Matthias Hagen.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Hagen, M., Michel, M., Stein, B. (2015). What was the Query? Generating Queries for Document Sets with Applications in Cluster Labeling. In: Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2015. Lecture Notes in Computer Science, vol 9103. Springer, Cham. https://doi.org/10.1007/978-3-319-19581-0_10

  • DOI: https://doi.org/10.1007/978-3-319-19581-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19580-3

  • Online ISBN: 978-3-319-19581-0

  • eBook Packages: Computer Science, Computer Science (R0)
