Abstract
This paper describes the parallel corpora being created simultaneously in 12 major Indian languages, including English, under a nationally funded project named the Indian Languages Corpora Initiative (ILCI), run through a consortium of institutions across India. The project runs in two phases. The first phase has two distinct goals: creating a parallel sentence-aligned corpus, and annotating the corpora for parts of speech (POS) according to the recently evolved national standard under the Bureau of Indian Standards (BIS). This phase finishes in April 2012, and the next phase, with newer domains and more national languages, is likely to begin in May 2012. The goal of the current phase is to create parallel aligned POS-tagged corpora in 12 major Indian languages (including English), with Hindi as the source language, in the health and tourism domains. With a target of 25 thousand sentences in each domain, the total number of words in each domain has reached up to 400 thousand, the largest size for a parallel corpus in any pair of Indian languages. A careful attempt has been made to capture various types of text. After analysing the two domains, we divided each into sub-domains and then looked for source text in those particular sub-domains. With a preferred structure of the corpora in mind, we also present our experience in selecting the source text and recount problems such as judging how well each sub-domain is represented in the corpora. The POS annotation framework used for this corpus creation has also seen new changes in the POS tagsets, and we give a brief account of the framework being applied in this endeavor.
Notes
- 1.
- 2. As per Census of India, 2001: http://censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/Statement5.html
- 3. No standard published reference can be given for this tagset as yet. We refer to the document circulated in the consortium meetings, referred to as “Linguistic Resource Standards: Standards for POS Tagsets for Indian Languages”, ver. 005, August 2010.
- 4.
Appendix I: Super Set of POS Tags for Indian Languages
Sl. No. | Category (Category. Type. Subtype) | Label | Annotation convention |
---|---|---|---|
1 | 1 Noun | N | N |
2 | 1.1 Common | NN | N_NN |
3 | 1.2 Proper | NNP | N_NNP |
4 | 1.3 Verbal | NNV | N_NNV |
5 | 1.4 Nloc | NST | N_NST |
6 | 2 Pronoun | PR | PR |
7 | 2.1 Personal | PRP | PR_PRP |
8 | 2.2 Reflexive | PRF | PR_PRF |
9 | 2.3 Relative | PRL | PR_PRL |
10 | 2.4 Reciprocal | PRC | PR_PRC |
11 | 2.5 Wh-word | PRQ | PR_PRQ |
12 | 3 Demonstrative | DM | DM |
13 | 3.1 Deictic | DMD | DM_DMD |
14 | 3.2 Relative | DMR | DM_DMR |
15 | 3.3 Wh-word | DMQ | DM_DMQ |
16 | 4 Verb | V | V |
17 | 4.1 Main | VM | V_VM |
18 | 4.1.1 Finite | VF | V_VM_VF |
19 | 4.1.2 Non-finite | VNF | V_VM_VNF |
20 | 4.1.3 Infinitive | VINF | V_VM_VINF |
21 | 4.1.4 Gerund | VNG | V_VM_VNG |
22 | 4.2 Auxiliary | VAUX | V_VAUX |
23 | 5 Adjective | JJ | JJ |
24 | 6 Adverb | RB | RB |
25 | 7 Postposition | PSP | PSP |
26 | 8 Conjunction | CC | CC |
27 | 8.1 Co-ordinator | CCD | CC_CCD |
28 | 8.2 Subordinator | CCS | CC_CCS |
29 | 8.2.1 Quotative | UT | CC_CCS_UT |
30 | 9 Particles | RP | RP |
31 | 9.1 Default | RPD | RP_RPD |
32 | 9.2 Classifier | CL | RP_CL |
33 | 9.3 Interjection | INJ | RP_INJ |
34 | 9.4 Intensifier | INTF | RP_INTF |
35 | 9.5 Negation | NEG | RP_NEG |
36 | 10 Quantifiers | QT | QT |
37 | 10.1 General | QTF | QT_QTF |
38 | 10.2 Cardinals | QTC | QT_QTC |
39 | 10.3 Ordinals | QTO | QT_QTO |
40 | 11 Residuals | RD | RD |
41 | 11.1 Foreign word | RDF | RD_RDF |
42 | 11.2 Symbol | SYM | RD_SYM |
43 | 11.3 Punctuation | PUNC | RD_PUNC |
44 | 11.4 Unknown | UNK | RD_UNK |
45 | 11.5 Echo-words | ECH | RD_ECH |
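The annotation convention in the table builds each tag by joining the labels along the category hierarchy with underscores, so `V_VM_VF` reads as Verb, Main, Finite. The following is a minimal sketch of how such tags could be decomposed and compared. The function names (`split_tag`, `label`, `is_subtype_of`) are hypothetical and not part of the ILCI tools or the BIS document; only the tag strings are taken from the table.

```python
from typing import List


def split_tag(annotation: str) -> List[str]:
    """Split an annotation such as 'V_VM_VF' into its levels, top level first."""
    return annotation.split("_")


def label(annotation: str) -> str:
    """Return the most specific label, i.e. the last level of the annotation."""
    return split_tag(annotation)[-1]


def is_subtype_of(annotation: str, ancestor: str) -> bool:
    """True if `ancestor` matches the leading levels of `annotation`.

    Levels are compared whole, so 'N_NNP' is under 'N' but not under 'NN'.
    """
    levels = split_tag(annotation)
    ancestor_levels = split_tag(ancestor)
    return levels[: len(ancestor_levels)] == ancestor_levels


if __name__ == "__main__":
    # Tag strings below are copied from the Appendix I table.
    for tag in ["N_NNP", "V_VM_VF", "CC_CCS_UT", "RD_PUNC"]:
        print(tag, "->", split_tag(tag), "label:", label(tag))
    print(is_subtype_of("V_VM_VF", "V_VM"))
```

Comparing whole levels rather than raw string prefixes is a deliberate choice in this sketch: a plain prefix test would wrongly place `N_NNP` under a hypothetical query for `N_NN`.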
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Choudhary, N., Jha, G.N. (2014). Creating Multilingual Parallel Corpora in Indian Languages. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_43
DOI: https://doi.org/10.1007/978-3-319-08958-4_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08957-7
Online ISBN: 978-3-319-08958-4
eBook Packages: Computer Science (R0)