Authorship Identification of Source Codes

  • Conference paper
Web and Big Data (APWeb-WAIM 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10366)

Abstract

Source code authorship identification is a special case of authorship identification from documents: the task is to identify the authors of source code or programs from code samples written by known programmers. Its main applications include detecting software intellectual property infringement, detecting malicious code, and supporting software maintenance and updates. This paper proposes an approach that constructs author profiles of programmers from a logic model of continuous and discrete word-level n-grams, together with a multi-level context model of operations, loops, arrays and methods. We then use sequential minimal optimization (SMO) to train a support vector machine that identifies the authorship of source code. The advantage of these author profiles is that they can capture explicit and implicit personal programming preference patterns within and between keywords, identifiers, operators, statements, methods and classes. Experimental results on programs from two open source websites show that our approach achieves high accuracy and outperforms the baseline methods.
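
To make the n-gram part of the abstract concrete, the following is a minimal sketch of how word-level n-grams might be counted into an author profile. It is not the paper's implementation. The abstract does not define its models, so several things here are assumptions: "continuous" is read as adjacent tokens, "discrete" is read as ordered but non-adjacent tokens inside a small window, and the tokenizer, the `window` parameter, and all function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed tokenizer: identifiers/keywords, integer literals, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(source):
    """Split source text into word-level tokens."""
    return TOKEN_RE.findall(source)


def continuous_ngrams(tokens, n):
    """n-grams made of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def discrete_ngrams(tokens, n, window):
    """n-grams whose tokens keep their order but are not all adjacent.

    Assumption: each n-gram starts at position i and draws its other
    tokens from the next (window - 1) positions.
    """
    if n < 2:
        raise ValueError("discrete n-grams need n >= 2")
    grams = []
    for i in range(len(tokens)):
        later = range(i + 1, min(i + window, len(tokens)))
        for idx in combinations(later, n - 1):
            if idx[-1] - i == n - 1:
                continue  # all positions adjacent: that is a continuous n-gram
            grams.append((tokens[i],) + tuple(tokens[j] for j in idx))
    return grams


def author_profile(sources, n=2, window=4):
    """Count both kinds of n-gram over all code samples of one author."""
    profile = Counter()
    for source in sources:
        tokens = tokenize(source)
        profile.update(("cont", g) for g in continuous_ngrams(tokens, n))
        profile.update(("disc", g) for g in discrete_ngrams(tokens, n, window))
    return profile


if __name__ == "__main__":
    profile = author_profile(["for (int i = 0; i < n; i++) { sum += a[i]; }"])
    for feature, count in profile.most_common(5):
        print(feature, count)
```

Counts like these could be turned into fixed-length feature vectors and passed to an SVM trainer. The multi-level context model mentioned in the abstract is not covered by this sketch.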

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NO. 61672098, NO. 61272361).

Author information

Correspondence to Chunxia Zhang.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Zhang, C., Wang, S., Wu, J., Niu, Z. (2017). Authorship Identification of Source Codes. In: Chen, L., Jensen, C., Shahabi, C., Yang, X., Lian, X. (eds) Web and Big Data. APWeb-WAIM 2017. Lecture Notes in Computer Science, vol 10366. Springer, Cham. https://doi.org/10.1007/978-3-319-63579-8_22

  • DOI: https://doi.org/10.1007/978-3-319-63579-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63578-1

  • Online ISBN: 978-3-319-63579-8

  • eBook Packages: Computer Science, Computer Science (R0)