Abstract
Cross-language plagiarism detection aims to identify similar text fragments across different languages. Only a limited number of free Arabic-English parallel corpora are available. This paper reports on the creation of an Arabic-English parallel corpus for evaluating Arabic-English cross-language document similarity detection. It first reviews the available Arabic-English corpora and their limitations, then describes how our dataset was built. The proposed dataset is a parallel Arabic-English corpus aligned at different granularities. It is built from comparable and parallel corpora and includes both human-translated and machine-translated text. In addition, the collected texts were written by authors of varying ability. Cross-language plagiarism detection methods are applied to the dataset and then assessed.
Acknowledgements
This research was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-track Research Funding Program.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Aljuaid, H. (2020). Arabic-English Corpus for Cross-Language Textual Similarity Detection. In: Kim, K., Kim, HY. (eds) Information Science and Applications. Lecture Notes in Electrical Engineering, vol 621. Springer, Singapore. https://doi.org/10.1007/978-981-15-1465-4_52
DOI: https://doi.org/10.1007/978-981-15-1465-4_52
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1464-7
Online ISBN: 978-981-15-1465-4
eBook Packages: Engineering, Engineering (R0)