Arabic-English Corpus for Cross-Language Textual Similarity Detection

  • Conference paper
Information Science and Applications

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 621)

Abstract

Cross-language plagiarism detection aims to identify similar text fragments across different languages. Only a limited number of free Arabic-English parallel corpora are available. This paper reports on the creation of Arabic-English parallel corpora for evaluating Arabic-English cross-language document similarity detection. It reviews the available Arabic-English corpora and their limitations, then describes how the dataset was built. The proposed dataset is a parallel Arabic-English resource aligned at several granularities. It is built from comparable and parallel corpora and includes both human-translated and machine-translated text. In addition, the collected texts were written by authors of varying ability. Cross-language plagiarism detection methods are applied to the dataset and then assessed.

Acknowledgements

This research was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-track Research Funding Program.

Author information

Corresponding author

Correspondence to Hanan Aljuaid.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Aljuaid, H. (2020). Arabic-English Corpus for Cross-Language Textual Similarity Detection. In: Kim, K., Kim, HY. (eds) Information Science and Applications. Lecture Notes in Electrical Engineering, vol 621. Springer, Singapore. https://doi.org/10.1007/978-981-15-1465-4_52

  • DOI: https://doi.org/10.1007/978-981-15-1465-4_52

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-1464-7

  • Online ISBN: 978-981-15-1465-4

  • eBook Packages: Engineering, Engineering (R0)
