Word Distribution Based Methods for Minimizing Segment Overlaps

Vasak, Joe; Song, Fei

doi:10.1007/978-3-540-74628-7_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4629)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Dividing a coherent text into a sequence of coherent segments is challenging because different topics and subtopics are often related to one or more common themes. Using lexical cohesion, we can track words and their repetitions and break the text into segments at points where the lexical chains are weak. However, some words are distributed more or less evenly across a document (we call these document-dependent or distributional stopwords), which makes it difficult to separate one segment from another. To minimize the overlaps between segments, we propose two new measures, both based on word distribution, for removing distributional stopwords. Our experimental results show that the new measures are efficient to compute and effective at improving segmentation performance on expository text and transcribed lecture text.
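
The idea of a distributional stopword can be illustrated with a small sketch. The abstract does not define the two proposed measures, so the code below is not the authors' method: it substitutes an assumed stand-in measure, the normalized entropy of a word's counts across equal-size blocks of the document. The function names, the block count, and the threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def distribution_evenness(tokens, num_blocks=4):
    """Score each word by how evenly it is spread over the document.

    Assumed stand-in measure (not taken from the paper): split the token
    stream into `num_blocks` equal-size blocks and compute the entropy of
    the word's per-block counts, normalized to the range [0, 1].
    A score near 1 means the word occurs about equally often in every block.
    """
    if num_blocks < 2:
        raise ValueError("num_blocks must be at least 2")
    block_size = math.ceil(len(tokens) / num_blocks)
    per_block_counts = {}
    for position, token in enumerate(tokens):
        block = position // block_size
        per_block_counts.setdefault(token, [0] * num_blocks)[block] += 1

    scores = {}
    for word, counts in per_block_counts.items():
        total = sum(counts)
        entropy = -sum((c / total) * math.log(c / total) for c in counts if c)
        scores[word] = entropy / math.log(num_blocks)
    return scores


def remove_distributional_stopwords(tokens, num_blocks=4, threshold=0.9):
    """Drop words whose evenness score reaches `threshold`.

    Words occurring fewer than `num_blocks` times are kept, since they
    cannot appear in every block. Returns (filtered_tokens, stopwords).
    """
    scores = distribution_evenness(tokens, num_blocks)
    frequency = Counter(tokens)
    stopwords = {
        word
        for word, score in scores.items()
        if score >= threshold and frequency[word] >= num_blocks
    }
    return [t for t in tokens if t not in stopwords], stopwords


if __name__ == "__main__":
    # Toy document: "system" recurs in every part, the other words are local.
    document = (
        "system cat cat system dog dog system fish fish system bird bird"
    ).split()
    filtered, stopwords = remove_distributional_stopwords(document)
    print(sorted(stopwords))
    print(filtered)
```

In this toy input, a word that recurs in every part of the document is flagged and removed, while words confined to one part are kept, so the remaining vocabulary of neighbouring parts overlaps less. A real system would apply such filtering before a lexical-cohesion segmenter and would need the threshold and block count tuned on data.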

Author information

Authors and Affiliations

Department of Computing and Information Science, University of Guelph, Guelph, Ontario, N1G 2W1, Canada
Joe Vasak & Fei Song

Editor information

Václav Matoušek, Pavel Mautner

About this paper

Cite this paper

Vasak, J., Song, F. (2007). Word Distribution Based Methods for Minimizing Segment Overlaps. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2007. Lecture Notes in Computer Science, vol 4629. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74628-7_21

DOI: https://doi.org/10.1007/978-3-540-74628-7_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74627-0
Online ISBN: 978-3-540-74628-7
eBook Packages: Computer Science, Computer Science (R0)
