Part of the book series: Communications in Computer and Information Science (CCIS, volume 716)

Abstract

This paper presents research results from building a Polish semantic Internet search engine, Natively Enhanced Knowledge Sharing Technologies (NEKST), and its plagiarism detection module. The main goal is to describe the engine's tools and algorithms and their use within the Open System for Antiplagiarism (OSA).

Notes

  1. http://ipipan.waw.pl/

  2. http://sgjp.pl/morfeusz/index.html.en

  3. https://www.wikipedia.org/

  4. http://sgjp.pl/morfeusz/index.html.en

  5. http://osaweb.pl/

  6. http://polon.nauka.gov.pl/repozytorium-orpd

  7. http://oozie.apache.org/

  8. http://osaweb.pl/node/22?language=en

Acknowledgement

The author acknowledges his contribution to the NEKST project (http://nekst.ipipan.waw.pl), funded by the Innovative Economy Operational Programme (POIG.01.01.02-14-013/09), and to the OSA project, funded by the Interuniversity Centre for IT (MUCI – Międzyuniwersyteckie Centrum Informatyzacji, http://muci.edu.pl).

Author information

Corresponding author

Correspondence to Radosław Szmit.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Szmit, R. (2017). Fast Plagiarism Detection in Large-Scale Data. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_27

  • DOI: https://doi.org/10.1007/978-3-319-58274-0_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-58273-3

  • Online ISBN: 978-3-319-58274-0

  • eBook Packages: Computer Science, Computer Science (R0)
