Abstract
Passage Retrieval is an alternative approach to Information Retrieval that considers text fragments independently rather than assessing the global relevance of documents. In this setting, a document is not penalized when its relevant information is surrounded by text that deviates from the topic of interest. In this paper, we study the impact of considering such text fragments on a document clustering process. The use of clustering in Information Retrieval is mainly supported by the cluster hypothesis, which states that relevant documents tend to be more similar to each other than to non-relevant documents, and hence that a clustering process is likely to group them together. Previous experiments have shown that clustering the top-ranked documents retrieved in response to a user’s query allows Information Retrieval systems to improve their effectiveness. In the clustering processes used in these studies, documents were considered as a whole. However, the fact that a document can address more than one topic or concept may also affect the document clustering process. Considering the passages of the retrieved documents separately may yield clusters that are more representative of the topics addressed. Several approaches are assessed, and the results show that using text fragments in the clustering process may prove beneficial.
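The intuition behind the abstract can be illustrated with a small sketch. It is not the method evaluated in the paper: it assumes a plain bag-of-words representation with cosine similarity, and the toy texts and the names `bag`, `cosine`, `doc_a`, and `passages_b` are hypothetical. It shows how a document that mixes an on-topic passage with an off-topic one may look less similar to a relevant document when compared as a whole than when its passages are compared separately.

```python
import math
from collections import Counter

def bag(text):
    """Lowercased bag-of-words vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy data (an assumption for illustration only):
# doc_a is entirely on one topic, while doc_b mixes an on-topic
# passage with an off-topic one.
doc_a = "clustering retrieval documents clustering relevance"
passages_b = [
    "clustering retrieval relevance documents",
    "football match stadium goal season referee",
]
doc_b = " ".join(passages_b)

# Similarity when doc_b is treated as a single unit.
whole_sim = cosine(bag(doc_a), bag(doc_b))

# Similarity when each passage of doc_b is compared separately.
best_passage_sim = max(cosine(bag(doc_a), bag(p)) for p in passages_b)

print(f"whole document: {whole_sim:.3f}")
print(f"best passage:   {best_passage_sim:.3f}")
```

In this toy case the off-topic passage dilutes the whole-document similarity, so a clustering step working on passages would have a stronger signal for grouping the on-topic material together.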
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lamprier, S., Amghar, T., Levrat, B., Saubion, F. (2008). Using Text Segmentation to Enhance the Cluster Hypothesis. In: Dochev, D., Pistore, M., Traverso, P. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2008. Lecture Notes in Computer Science, vol 5253. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85776-1_7
DOI: https://doi.org/10.1007/978-3-540-85776-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85775-4
Online ISBN: 978-3-540-85776-1
eBook Packages: Computer Science (R0)