Improving Efficiency of Sentence Boundary Detection by Feature Selection

Ho, Thi-Nga; Chong, Tze Yuang; Do, Van Hai; Pham, Van Tung; Chng, Eng Siong

doi:10.1007/978-3-662-49390-8_58


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9622)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

The goal of sentence boundary detection (SBD) is to predict the presence or absence of a sentence boundary in an unstructured word sequence that contains no punctuation. In this paper, we propose a feature selection approach to obtain more effective features for the SBD classifier. Specifically, each observed word is assessed for its correlation with the sentence boundary, measured by pointwise mutual information, before it is used as a feature of the classifier. Using a linear-chain CRF model to predict the sentence boundaries of a text sequence, experiments on a part of the English Gigaword 2nd Edition corpus show that the proposed method reduces the number of model parameters by up to 44.87% while maintaining an F1-score comparable to that of the original model.
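The word-selection idea in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of pointwise mutual information, PMI(w, b) = log2(p(w, b) / (p(w) p(b))), estimated from token counts. The exact formulation, the selection threshold, and how the selected words enter the CRF feature templates are not given in this excerpt, so the function names (`pmi_scores`, `select_words`), the label "B" for a boundary position, and the threshold below are all hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens, labels, boundary_label="B"):
    """Return {word: PMI(word, boundary)} from parallel token/label lists.

    Assumption: labels[i] == boundary_label means a sentence boundary
    follows tokens[i]. Words that never occur at a boundary would have
    PMI of minus infinity and are simply left out of the result.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    n = len(tokens)
    word_counts = Counter(tokens)
    # Count only the occurrences of each word at a boundary position.
    joint_counts = Counter(
        w for w, y in zip(tokens, labels) if y == boundary_label
    )
    n_boundary = sum(joint_counts.values())

    scores = {}
    for word, c_joint in joint_counts.items():
        p_joint = c_joint / n
        p_word = word_counts[word] / n
        p_boundary = n_boundary / n
        scores[word] = math.log2(p_joint / (p_word * p_boundary))
    return scores


def select_words(scores, threshold=0.0):
    """Keep words whose PMI with the boundary exceeds a (hypothetical) threshold."""
    return {word for word, score in scores.items() if score > threshold}


# Toy data: a boundary follows each occurrence of "now".
tokens = ["so", "we", "go", "now", "so", "it", "ends", "now"]
labels = ["O", "O", "O", "B", "O", "O", "O", "B"]
scores = pmi_scores(tokens, labels)
selected = select_words(scores, threshold=1.0)
```

In this toy sequence only "now" co-occurs with a boundary, so it is the only word selected. In a setup like the one the abstract describes, a filter of this kind would limit which observed words generate classifier features, which is one way the parameter count could shrink.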

Notes

1. The formula is explained in the documentation of the CRF++ toolkit: https://taku910.github.io/crfpp/.
2. https://catalog.ldc.upenn.edu/LDC2005T12.

Acknowledgements

This work is supported by the DSO-funded project MAISON DSOCL14045.

Author information

Authors and Affiliations

School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, Singapore
Thi-Nga Ho, Tze Yuang Chong, Van Tung Pham & Eng Siong Chng
Temasek Laboratories@NTU, Nanyang Technological University, 50 Nanyang Drive, Singapore, Singapore
Thi-Nga Ho, Van Tung Pham & Eng Siong Chng
Advanced Digital Sciences Center, 1 Fusionopolis Way, Singapore, Singapore
Van Hai Do

Corresponding author

Correspondence to Thi-Nga Ho.

Editor information

Editors and Affiliations

Wrocław University of Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Wrocław University of Technology, Wrocław, Poland
Bogdan Trawiński
Iwate Prefectural University, Takizawa, Japan
Hamido Fujita
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong

About this paper

Cite this paper

Ho, TN., Chong, T.Y., Do, V.H., Pham, V.T., Chng, E.S. (2016). Improving Efficiency of Sentence Boundary Detection by Feature Selection. In: Nguyen, N.T., Trawiński, B., Fujita, H., Hong, TP. (eds) Intelligent Information and Database Systems. ACIIDS 2016. Lecture Notes in Computer Science, vol 9622. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49390-8_58

DOI: https://doi.org/10.1007/978-3-662-49390-8_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49389-2
Online ISBN: 978-3-662-49390-8
eBook Packages: Computer Science, Computer Science (R0)