Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7181)

Abstract

This work aims to develop a representative treebank for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and a reliable treebank for it will have a significant impact on the state of the art in Urdu language processing. For the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset and a functional tagset have been proposed. The construction of the treebank is based on an existing corpus of 19 million words of Urdu. Part-of-speech (POS) tagging and annotation of a selected set of sentences from different sub-domains of this corpus are being carried out manually, and the work performed to date is presented here. The hierarchical annotation scheme we adopted combines a phrase structure (PS) with a hybrid dependency structure (HDS).
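
The layering of the three tagsets can be pictured with a small sketch. It is only an illustration under stated assumptions: the class `Node`, its fields, the labels `S`, `NP`, `VP`, `N`, `V`, `SUBJ`, `OBJ` and the tokens `w1` to `w3` are hypothetical placeholders, not the URDU.KON-TB tagsets or data. Each leaf carries a POS tag, each phrase carries a syntactic tag, and any node may additionally carry a functional tag, so that phrase structure and dependency-like information sit in one tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of a hypothetical hierarchically annotated tree."""
    label: str                      # POS tag on a leaf, syntactic tag on a phrase
    function: Optional[str] = None  # optional functional tag (placeholder names)
    token: Optional[str] = None     # surface word; set only on leaves
    children: List["Node"] = field(default_factory=list)

    def bracketed(self) -> str:
        """Render the tree in a labelled bracket notation."""
        tag = self.label if self.function is None else f"{self.label}-{self.function}"
        if self.token is not None:
            return f"({tag} {self.token})"
        inner = " ".join(child.bracketed() for child in self.children)
        return f"({tag} {inner})"

    def leaves(self) -> List[Tuple[str, str]]:
        """Return (token, POS tag) pairs in sentence order."""
        if self.token is not None:
            return [(self.token, self.label)]
        pairs: List[Tuple[str, str]] = []
        for child in self.children:
            pairs.extend(child.leaves())
        return pairs


# A verb-final toy sentence with placeholder tokens w1, w2, w3.
tree = Node("S", children=[
    Node("NP", function="SUBJ", children=[Node("N", token="w1")]),
    Node("NP", function="OBJ", children=[Node("N", token="w2")]),
    Node("VP", children=[Node("V", token="w3")]),
])

print(tree.bracketed())
# (S (NP-SUBJ (N w1)) (NP-OBJ (N w2)) (VP (V w3)))
```

Keeping the functional tag as a separate field, rather than fusing it into the label string, lets the same tree be read either as a plain phrase structure (ignore `function`) or as a source of grammatical relations (collect `function` values per phrase).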

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Abbas, Q. (2012). Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_6

  • DOI: https://doi.org/10.1007/978-3-642-28604-9_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28603-2

  • Online ISBN: 978-3-642-28604-9

  • eBook Packages: Computer Science, Computer Science (R0)