Techniques for Processing LSI Queries Incorporating Phrases

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2014)

Abstract

Latent semantic indexing (LSI) is a well-established technique that provides broad capabilities for information search, categorization, clustering, and discovery. However, the technique has some limitations in practice. One is that the classical implementation of LSI provides no flexible mechanism for dealing with phrases: a phrase can be used as a whole in a query only if it was identified a priori and treated as a unit when the LSI index was created. This requirement has greatly hindered the use of phrases in LSI applications. This paper presents a method for dealing with phrases in LSI-based information systems on an ad hoc basis, at query time, without requiring any prior knowledge of the phrases of interest. The approach is fast enough to be used during real-time query execution.
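
To illustrate the limitation described in the abstract, the following is a minimal Python sketch of how a classical term-based index builds a query vector. It is not taken from the paper and does not show the paper's method. The function names, the toy documents, and the convention of joining a phrase with an underscore are all assumptions made for illustration, and the SVD projection that LSI applies to the resulting vector is omitted. The sketch shows only that a phrase gets its own dimension when it was declared before the vocabulary was built.

```python
from collections import Counter


def tokenize(text, phrases=()):
    """Lowercase and split on whitespace. Each phrase in `phrases` is fused
    into one token ("new york" -> "new_york"). Naive substring matching,
    for illustration only."""
    text = text.lower()
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text.split()


def build_vocabulary(docs, phrases=()):
    """Index-time step: the set of tokens that will become vector dimensions."""
    vocab = set()
    for doc in docs:
        vocab.update(tokenize(doc, phrases))
    return sorted(vocab)


def query_vector(query, vocab, phrases=()):
    """Query-time step: raw term counts over the fixed vocabulary."""
    counts = Counter(tokenize(query, phrases))
    return [counts.get(term, 0) for term in vocab]


# Hypothetical toy collection.
docs = ["new york city hall", "york minster is new to me"]

# Case 1: no phrases declared at index time. The query "new york" can only
# activate the separate dimensions "new" and "york".
plain_vocab = build_vocabulary(docs)
plain_query = query_vector("new york", plain_vocab)

# Case 2: the phrase was known a priori, so it has its own dimension.
known = ("new york",)
phrase_vocab = build_vocabulary(docs, known)
phrase_query = query_vector("new york", phrase_vocab, known)
```

In the first case the query is indistinguishable from any other query containing "new" and "york" separately; in the second, the phrase is a unit, but only because it was supplied before indexing.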

Acknowledgements

The author would like to thank the members of the Semantic Engineering staff at Agilex Technologies who participated in the reported testing.

Author information

Corresponding author

Correspondence to Roger Bradford.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Bradford, R. (2015). Techniques for Processing LSI Queries Incorporating Phrases. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2014. Communications in Computer and Information Science, vol 553. Springer, Cham. https://doi.org/10.1007/978-3-319-25840-9_7

  • DOI: https://doi.org/10.1007/978-3-319-25840-9_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25839-3

  • Online ISBN: 978-3-319-25840-9

  • eBook Packages: Computer Science, Computer Science (R0)