Abstract
The goal of sentence boundary detection (SBD) is to predict the presence or absence of a sentence boundary in an unstructured word sequence that contains no punctuation. In this paper, we propose a feature selection approach that yields more effective features for the SBD classifier. Specifically, each observed word is assessed for its correlation with the sentence boundary, measured by pointwise mutual information, before it is used as a feature of the classifier. Using a linear-chain CRF model to predict the sentence boundaries of a text sequence, experiments on part of the English Gigaword 2nd Edition corpus show that the proposed method reduces the number of model parameters by up to 44.87% while maintaining an F1-score comparable to that of the original model.
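As a rough illustration of the word-level selection step described in the abstract, the sketch below scores each word by its pointwise mutual information with the boundary label, using the standard definition \(\mathrm{PMI}(w, b) = \log_2 \frac{p(w, b)}{p(w)\,p(b)}\), and keeps only words whose score exceeds a threshold. The function names, the "B"/"O" labelling scheme, the maximum-likelihood probability estimates, and the threshold value are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math
from collections import Counter


def pmi_scores(tokens, labels, boundary="B"):
    """Return PMI(word, boundary) for every word seen with the boundary label.

    tokens and labels are parallel lists. A label equal to `boundary` marks a
    word that is followed by a sentence boundary (hypothetical labelling
    scheme). Words never seen with the boundary label have PMI of minus
    infinity and are left out of the result.
    """
    n = len(tokens)
    word_counts = Counter(tokens)
    joint_counts = Counter(w for w, y in zip(tokens, labels) if y == boundary)
    p_boundary = sum(joint_counts.values()) / n

    scores = {}
    for word, joint in joint_counts.items():
        p_word = word_counts[word] / n
        p_joint = joint / n
        scores[word] = math.log2(p_joint / (p_word * p_boundary))
    return scores


def select_words(scores, threshold=0.0):
    """Keep words whose PMI with the boundary exceeds `threshold` (assumed value)."""
    return {word for word, score in scores.items() if score > threshold}


# Toy data: three two-word "sentences" with the punctuation removed.
tokens = ["we", "left", "he", "stayed", "we", "ran"]
labels = ["O", "B", "O", "B", "O", "B"]
scores = pmi_scores(tokens, labels)
selected = select_words(scores)
```

Only the selected words would then be passed to the classifier as lexical features, which is one way such a filter could shrink the parameter count of a CRF model.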
Notes
- 1. The formula is explained in the documentation of the CRF++ toolkit: https://taku910.github.io/crfpp/.
- 2.
Acknowledgements
This work is supported by the DSO-funded project MAISON DSOCL14045.
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ho, TN., Chong, T.Y., Do, V.H., Pham, V.T., Chng, E.S. (2016). Improving Efficiency of Sentence Boundary Detection by Feature Selection. In: Nguyen, N.T., Trawiński, B., Fujita, H., Hong, TP. (eds) Intelligent Information and Database Systems. ACIIDS 2016. Lecture Notes in Computer Science, vol 9622. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49390-8_58
DOI: https://doi.org/10.1007/978-3-662-49390-8_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49389-2
Online ISBN: 978-3-662-49390-8
eBook Packages: Computer Science, Computer Science (R0)