Abstract
Latent semantic indexing (LSI) is a well-established technique that provides broad capabilities for information search, categorization, clustering, and discovery. The technique has limitations, however. One is that the classical implementation of LSI provides no flexible mechanism for dealing with phrases: a phrase can be used as a whole in a query only if it was identified a priori and treated as a unit when the LSI index was created. This requirement has greatly hindered the use of phrases in LSI applications. This paper presents a method for handling phrases in LSI-based information systems on an ad hoc basis – at query time, without requiring any prior knowledge of the phrases of interest. The approach is fast enough to be used during real-time query execution.
Acknowledgements
The author would like to thank the members of the Semantic Engineering staff at Agilex Technologies who participated in the reported testing.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Bradford, R. (2015). Techniques for Processing LSI Queries Incorporating Phrases. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2014. Communications in Computer and Information Science, vol 553. Springer, Cham. https://doi.org/10.1007/978-3-319-25840-9_7
DOI: https://doi.org/10.1007/978-3-319-25840-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25839-3
Online ISBN: 978-3-319-25840-9
eBook Packages: Computer Science, Computer Science (R0)