Arabic-English Corpus for Cross-Language Textual Similarity Detection

Aljuaid, Hanan

doi:10.1007/978-981-15-1465-4_52

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 621)

Abstract

Cross-language plagiarism detection aims to find similar fragments of text across different languages. Few free Arabic-English parallel corpora are available. This paper reports on the creation of Arabic-English parallel corpora for evaluating Arabic-English cross-language document similarity detection. It first discusses the available Arabic-English corpora and their limitations, then explains how our dataset was built. The proposed dataset is a parallel Arabic-English corpus aligned at different granularities. It is built from comparable and parallel corpora and includes both human-translated and machine-translated text. In addition, the collected texts were written by authors of varying ability. Cross-language plagiarism detection methods are applied to the dataset and then assessed.

Acknowledgements

This research was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-track Research Funding Program.

Author information

Authors and Affiliations

Computer Sciences Department, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University (PNU), 84428, Al Riyad, Saudi Arabia
Hanan Aljuaid

Corresponding author

Correspondence to Hanan Aljuaid.

Editor information

Editors and Affiliations

Kyonggi University, Suwon-si, Korea (Republic of)
Kuinam J. Kim
School of Games, Hongik University, Chungchengnam-do, Korea (Republic of)
Hye-Young Kim

About this paper

Cite this paper

Aljuaid, H. (2020). Arabic-English Corpus for Cross-Language Textual Similarity Detection. In: Kim, K., Kim, HY. (eds) Information Science and Applications. Lecture Notes in Electrical Engineering, vol 621. Springer, Singapore. https://doi.org/10.1007/978-981-15-1465-4_52

DOI: https://doi.org/10.1007/978-981-15-1465-4_52
Published: 19 December 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1464-7
Online ISBN: 978-981-15-1465-4
eBook Packages: Engineering, Engineering (R0)
