A Semi-automatic Structure Learning Method for Language Modeling

  • Conference paper
  • In: Text, Speech, and Dialogue (TSD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)


Abstract

This paper presents a semi-automatic method for statistical language modeling. The method addresses the structure learning problem of the linguistic class prediction model (LCPM) in class-dependent N-grams that support multiple linguistic classes per word. The structure of the LCPM is designed within the Factorial Language Model framework, combining a knowledge-based approach with a data-driven technique. First, simple linguistic knowledge is used to define a set of linguistic features appropriate to the application and to sketch the LCPM's main structure. Next, an automatic algorithm, grounded in solid information-theoretic concepts, selects the relevant factors associated with the chosen features and establishes the LCPM's definitive structure. This approach is based on the so-called Buried Markov Models [1]. Although only preliminary results were obtained, they afford great confidence in the method's ability to learn, from the data, LCPM structures that accurately represent the application's real dependencies and also favor training robustness.
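The paper itself does not include code; as one illustrative reading of the data-driven step, the relevance of a candidate factor to the predicted class can be scored with estimated mutual information, in the spirit of the relevance criterion of Peng et al. [9] (a full mRMR-style selector would also penalize redundancy among factors). The function names, sample format, and threshold below are hypothetical sketches, not the paper's algorithm:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)                  # empirical joint counts
    px = Counter(x for x, _ in pairs)       # marginal counts of X
    py = Counter(y for _, y in pairs)       # marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy * log2( p_xy / (p_x * p_y) ), with counts folded in
        mi += p_xy * log2(p_xy * n * n / (px[x] * py[y]))
    return mi

def select_factors(samples, target_key, candidate_keys, threshold=0.01):
    """Keep candidate factors whose estimated MI with the target
    exceeds a threshold (relevance-only criterion)."""
    selected = []
    for key in candidate_keys:
        pairs = [(s[key], s[target_key]) for s in samples]
        if mutual_information(pairs) > threshold:
            selected.append(key)
    return selected

# Toy usage: the class is fully determined by the POS factor,
# while the "noise" factor is statistically independent of it.
samples = []
for i in range(100):
    pos = "N" if i % 2 == 0 else "V"
    samples.append({"pos": pos,
                    "cls": "A" if pos == "N" else "B",
                    "noise": i % 4 < 2})
print(select_factors(samples, "cls", ["pos", "noise"]))  # → ['pos']
```

On the toy data the informative factor yields one full bit of mutual information with the class, while the independent factor scores (essentially) zero and is discarded.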


Notes

  1. The lighter notation \(f^{1:K}\) is used to express the feature set \(\{f^1, f^2, \dots , f^K\}\). The same convention is used, from now on, with \(f_{i:j}^{m:n}\) representing a factor set (where \(f_{\tau }^{\nu }, \tau =i,\dots ,j\quad \nu =m,\dots ,n\;\) represents the factor corresponding to the linguistic feature \(f^{\nu }\) at time \(\tau \)).
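As an illustrative expansion of this convention (our example, not taken from the paper), two features over two adjacent time steps unpack to four factors:

```latex
f_{t-1:t}^{1:2} \;=\; \{\, f_{t-1}^{1},\; f_{t-1}^{2},\; f_{t}^{1},\; f_{t}^{2} \,\}
```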

References

  1. Bilmes, J.: Natural statistical models for automatic speech recognition. Ph.D. thesis, U.C. Berkeley, Department of EECS, CS Division (1999)


  2. Bilmes, J.: Natural statistical models for automatic speech recognition. Technical report, International Computer Science Institute, October 1999


  3. Bilmes, J., Kirchhoff, K.: Factored language models and generalized parallel backoff. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4–6, May 2003. https://www.aclweb.org/anthology/N03-2002

  4. Federico, M.: Language models. Presented at the Fourth Machine Translation Marathon - Open Source Tools for Machine Translation, January 2010


  5. Federico, M., Cettolo, M.: Efficient handling of n-gram language models for statistical machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation, pp. 88–95. Association for Computational Linguistics, Prague, June 2007. https://www.aclweb.org/anthology/W07-0712

  6. Kirchhoff, K., Bilmes, J., Duh, K.: Factored language model tutorial. Technical report, University of Washington, Department of EE, February 2008


  7. Kirchhoff, K., Yang, M.: Improved language modeling for statistical machine translation. In: Proceedings of the ACL Workshop on Building and Using Parallel Texts, ParaText 2005, pp. 125–128. Association for Computational Linguistics, Stroudsburg (2005). http://dl.acm.org/citation.cfm?id=1654449.1654476

  8. Mateus, M., Brito, A., Duarte, I., Faria, I.: Gramática da Língua Portuguesa. Editorial Caminho, Alfragide, Portugal (2003)


  9. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005)


  10. Santos, D., Rocha, P.: Evaluating CETEMPúblico, a free resource for Portuguese. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 450–457. Association for Computational Linguistics, Stroudsburg, July 2001. http://www.linguateca.pt/superb/busca_publ.pl?idi=1141122240


Author information


Correspondence to Vitor Pera.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Pera, V. (2019). A Semi-automatic Structure Learning Method for Language Modeling. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_14


  • DOI: https://doi.org/10.1007/978-3-030-27947-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
