Creating Multilingual Parallel Corpora in Indian Languages

Choudhary, Narayan; Jha, Girish Nath

doi:10.1007/978-3-319-08958-4_43

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8387)

Included in the following conference series:

Language and Technology Conference

Abstract

This paper describes the parallel corpora being created simultaneously in 12 major Indian languages, including English, under a nationally funded project, the Indian Languages Corpora Initiative (ILCI), run by a consortium of institutions across India. The project runs in two phases. The first phase has two distinct goals: creating a sentence-aligned parallel corpus and annotating it for parts of speech (POS) according to the recently evolved national standard under the Bureau of Indian Standards (BIS). This phase finishes in April 2012, and the next phase, with newer domains and more national languages, is likely to begin in May 2012. The goal of the current phase is to create parallel, aligned, POS-tagged corpora in 12 major Indian languages (including English), with Hindi as the source language, in the health and tourism domains. With a target of 25 thousand sentences in each domain, the total number of words in each domain has reached up to 400 thousand, the largest size for a parallel corpus in any pair of Indian languages. A careful attempt has been made to capture various types of text. After analysing the two domains, we divided them into sub-domains and then looked for source text in each sub-domain. With a preferred structure of the corpora in mind, we also present our experience in selecting the source text and recount problems such as judging how well each sub-domain is represented in the corpora. The POS annotation framework used for this corpus creation has also seen changes to its POS tagsets, and we give a brief account of the framework being applied in this endeavor.

Notes

1. http://www.ethnologue.com/show_country.asp?name=in (accessed: 4 September 2011)
2. As per the Census of India, 2001: http://censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/Statement5.html
3. No standard published reference can be given for this tagset as yet. We refer to the document circulated in the consortium meetings, titled “Linguistic Resource Standards: Standards for POS Tagsets for Indian Languages”, ver. 005, August 2010.
4. http://www.sil.org/iso639-3/default.asp

Author information

Authors and Affiliations

Jawaharlal Nehru University, New Delhi, India
Narayan Choudhary & Girish Nath Jha

Corresponding author

Correspondence to Narayan Choudhary.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
IMMI-CNRS, Orsay, France
Joseph Mariani

Appendix I: Super Set of POS Tags for Indian Languages

Sl. No.	Category (Category. Type. Subtype)	Label	Annotation convention
1	1 Noun	N	N
2	1.1 Common	NN	N_NN
3	1.2 Proper	NNP	N_NNP
4	1.3 Verbal	NNV	N_NNV
5	1.4 Nloc	NST	N_NST
6	2 Pronoun	PR	PR
7	2.1 Personal	PRP	PR_PRP
8	2.2 Reflexive	PRF	PR_PRF
9	2.3 Relative	PRL	PR_PRL
10	2.4 Reciprocal	PRC	PR_PRC
11	2.5 Wh-word	PRQ	PR_PRQ
12	3 Demonstrative	DM	DM
13	3.1 Deictic	DMD	DM_DMD
14	3.2 Relative	DMR	DM_DMR
15	3.3 Wh-word	DMQ	DM_DMQ
16	4 Verb	V	V
17	4.1 Main	VM	V_VM
18	4.1.1 Finite	VF	V_VM_VF
19	4.1.2 Non-finite	VNF	V_VM_VNF
20	4.1.3 Infinitive	VINF	V_VM_VINF
21	4.1.4 Gerund	VNG	V_VM_VNG
22	4.2 Auxiliary	VAUX	V_VAUX
23	5 Adjective	JJ	JJ
24	6 Adverb	RB	RB
25	7 Postposition	PSP	PSP
26	8 Conjunction	CC	CC
27	8.1 Co-ordinator	CCD	CC_CCD
28	8.2 Subordinator	CCS	CC_CCS
29	8.2.1 Quotative	UT	CC_CCS_UT
30	9 Particles	RP	RP
31	9.1 Default	RPD	RP_RPD
32	9.2 Classifier	CL	RP_CL
33	9.3 Interjection	INJ	RP_INJ
34	9.4 Intensifier	INTF	RP_INTF
35	9.5 Negation	NEG	RP_NEG
36	10 Quantifiers	QT	QT
37	10.1 General	QTF	QT_QTF
38	10.2 Cardinals	QTC	QT_QTC
39	10.3 Ordinals	QTO	QT_QTO
40	11 Residuals	RD	RD
41	11.1 Foreign word	RDF	RD_RDF
42	11.2 Symbol	SYM	RD_SYM
43	11.3 Punctuation	PUNC	RD_PUNC
44	11.4 Unknown	UNK	RD_UNK
45	11.5 Echo-words	ECH	RD_ECH
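
The annotation convention in the table encodes the hierarchy directly: each label joins the category, type and subtype tags with underscores, so V_VM_VF reads as Verb, Main, Finite. The sketch below illustrates that reading; it is not part of the ILCI tooling. The names ANNOTATION_LABELS, split_label and parent_label are hypothetical, and the rows for Adjective, Adverb and Postposition are assumed to use the bare tag (JJ, RB, PSP) as their annotation label, since they have no subtypes.

```python
# Annotation labels copied from the "Annotation convention" column of Appendix I.
# Assumption: JJ, RB and PSP have no subtypes, so the bare tag is the label.
ANNOTATION_LABELS = """
N N_NN N_NNP N_NNV N_NST
PR PR_PRP PR_PRF PR_PRL PR_PRC PR_PRQ
DM DM_DMD DM_DMR DM_DMQ
V V_VM V_VM_VF V_VM_VNF V_VM_VINF V_VM_VNG V_VAUX
JJ RB PSP
CC CC_CCD CC_CCS CC_CCS_UT
RP RP_RPD RP_CL RP_INJ RP_INTF RP_NEG
QT QT_QTF QT_QTC QT_QTO
RD RD_RDF RD_SYM RD_PUNC RD_UNK RD_ECH
""".split()

VALID_LABELS = frozenset(ANNOTATION_LABELS)


def split_label(label):
    """Return the (category, type, subtype) levels encoded in a label.

    Levels that are absent come back as None, e.g. "N_NN" -> ("N", "NN", None).
    Raises ValueError for a label that is not in the Appendix I superset.
    """
    if label not in VALID_LABELS:
        raise ValueError(f"not in the Appendix I superset: {label!r}")
    parts = label.split("_")
    return tuple(parts + [None] * (3 - len(parts)))


def parent_label(label):
    """Return the label one level up the hierarchy, or None for a top-level category."""
    split_label(label)  # validates the label first
    head, sep, _ = label.rpartition("_")
    return head if sep else None
```

A check of this kind could be used to reject mistyped tags during annotation, or to back off from a fine-grained label such as CC_CCS_UT to its parent CC_CCS when a language does not use the finer distinction.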

Cite this paper

Choudhary, N., Jha, G.N. (2014). Creating Multilingual Parallel Corpora in Indian Languages. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_43

DOI: https://doi.org/10.1007/978-3-319-08958-4_43
Published: 26 July 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08957-7
Online ISBN: 978-3-319-08958-4
eBook Packages: Computer Science, Computer Science (R0)
