A Semi-automatic Structure Learning Method for Language Modeling

  • Conference paper
  • In: Text, Speech, and Dialogue (TSD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)


Abstract

This paper presents a semi-automatic method for statistical language modeling. The method addresses the structure learning problem of the linguistic class prediction model (LCPM) in class-dependent N-grams that support multiple linguistic classes per word. The structure of the LCPM is designed within the Factorial Language Model framework, combining a knowledge-based approach with a data-driven technique. First, simple linguistic knowledge is used to define a set of linguistic features appropriate to the application and to sketch the LCPM's main structure. Next, an automatic algorithm, grounded in solid information-theoretic concepts, selects the relevant factors associated with the chosen features and establishes the LCPM's definitive structure. This approach is based on the so-called Buried Markov Models [1]. Although only preliminary results were obtained, they afford great confidence in the method's ability to learn, from the data, LCPM structures that accurately represent the application's real dependencies and also favor training robustness.
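The paper itself does not include code; as one illustrative reading of the data-driven step, the relevance of a candidate factor to the predicted class can be scored with estimated mutual information, in the spirit of the relevance criterion of Peng et al. [9] (a full mRMR-style selector would also penalize redundancy among factors). The function names, sample format, and threshold below are hypothetical sketches, not the paper's algorithm:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)                  # empirical joint counts
    px = Counter(x for x, _ in pairs)       # marginal counts of X
    py = Counter(y for _, y in pairs)       # marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy * log2( p_xy / (p_x * p_y) ), with counts folded in
        mi += p_xy * log2(p_xy * n * n / (px[x] * py[y]))
    return mi

def select_factors(samples, target_key, candidate_keys, threshold=0.01):
    """Keep candidate factors whose estimated MI with the target
    exceeds a threshold (relevance-only criterion)."""
    selected = []
    for key in candidate_keys:
        pairs = [(s[key], s[target_key]) for s in samples]
        if mutual_information(pairs) > threshold:
            selected.append(key)
    return selected

# Toy usage: the class is fully determined by the POS factor,
# while the "noise" factor is statistically independent of it.
samples = []
for i in range(100):
    pos = "N" if i % 2 == 0 else "V"
    samples.append({"pos": pos,
                    "cls": "A" if pos == "N" else "B",
                    "noise": i % 4 < 2})
print(select_factors(samples, "cls", ["pos", "noise"]))  # → ['pos']
```

On the toy data the informative factor yields one full bit of mutual information with the class, while the independent factor scores (essentially) zero and is discarded.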


Notes

  1. The lighter notation \(f^{1:K}\) is used to express the feature set \(\{f^1, f^2, \dots , f^K\}\). The same convention is used, from now on, with \(f_{i:j}^{m:n}\) representing a factor set (where \(f_{\tau }^{\nu }, \tau =i,\dots ,j\quad \nu =m,\dots ,n\;\) represents the factor corresponding to the linguistic feature \(f^{\nu }\) at time \(\tau \)).
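As an illustrative expansion of this convention (our example, not taken from the paper), two features over two adjacent time steps unpack to four factors:

```latex
f_{t-1:t}^{1:2} \;=\; \{\, f_{t-1}^{1},\; f_{t-1}^{2},\; f_{t}^{1},\; f_{t}^{2} \,\}
```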

References

  1. Bilmes, J.: Natural statistical models for automatic speech recognition. Ph.D. thesis, U.C. Berkeley, Department of EECS, CS Division (1999)


  2. Bilmes, J.: Natural statistical models for automatic speech recognition. Technical report, International Computer Science Institute, October 1999


  3. Bilmes, J., Kirchhoff, K.: Factored language models and generalized parallel backoff. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4–6, May 2003. https://www.aclweb.org/anthology/N03-2002

  4. Federico, M.: Language models. Presented at the Fourth Machine Translation Marathon - Open Source Tools for Machine Translation, January 2010


  5. Federico, M., Cettolo, M.: Efficient handling of n-gram language models for statistical machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation, pp. 88–95. Association for Computational Linguistics, Prague, June 2007. https://www.aclweb.org/anthology/W07-0712

  6. Kirchhoff, K., Bilmes, J., Duh, K.: Factored language model tutorial. Technical report, University of Washington, Department of EE, February 2008


  7. Kirchhoff, K., Yang, M.: Improved language modeling for statistical machine translation. In: Proceedings of the ACL Workshop on Building and Using Parallel Texts, ParaText 2005, pp. 125–128. Association for Computational Linguistics, Stroudsburg (2005). http://dl.acm.org/citation.cfm?id=1654449.1654476

  8. Mateus, M., Brito, A., Duarte, I., Faria, I.: Gramática da Língua Portuguesa. Editorial Caminho, Alfragide, Portugal (2003)


  9. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005)


  10. Santos, D., Rocha, P.: Evaluating CETEMPúblico, a free resource for Portuguese. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 450–457. Association for Computational Linguistics, Stroudsburg, July 2001. http://www.linguateca.pt/superb/busca_publ.pl?idi=1141122240


Author information


Correspondence to Vitor Pera.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Pera, V. (2019). A Semi-automatic Structure Learning Method for Language Modeling. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_14


  • DOI: https://doi.org/10.1007/978-3-030-27947-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
