Fast Plagiarism Detection in Large-Scale Data

Szmit, Radosław

doi:10.1007/978-3-319-58274-0_27

Part of the book series: Communications in Computer and Information Science (CCIS, volume 716)

Included in the following conference series:

International Conference: Beyond Databases, Architectures and Structures

Abstract

This paper presents research results from building a Polish semantic Internet search engine, Natively Enhanced Knowledge Sharing Technologies (NEKST), and its plagiarism detection module. The main goal is to describe the tools and algorithms of the engine and its use within the Open System for Antiplagiarism (OSA).

Acknowledgement

The author acknowledges his contribution to the NEKST project (http://nekst.ipipan.waw.pl), funded by the Innovative Economy Operational Programme (POIG.01.01.02-14-013/09), and to the OSA project, funded by the Interuniversity Centre for IT (MUCI, Międzyuniwersyteckie Centrum Informatyzacji, http://muci.edu.pl).

Author information

Authors and Affiliations

Institute of Control and Industrial Electronics, Warsaw University of Technology, ul. Koszykowa 75, 00-662, Warsaw, Poland
Radosław Szmit

Corresponding author

Correspondence to Radosław Szmit.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Dariusz Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Paweł Kasprowski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Bożena Małysiak-Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Daniel Kostrzewa

Cite this paper

Szmit, R. (2017). Fast Plagiarism Detection in Large-Scale Data. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_27

DOI: https://doi.org/10.1007/978-3-319-58274-0_27
Published: 27 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58273-3
Online ISBN: 978-3-319-58274-0
eBook Packages: Computer Science, Computer Science (R0)
