Abstract
Detecting code flaws in programs is a vital aspect of software maintenance and security. Classic code flaw detection techniques rely on program analysis to check whether the code logic violates certain predefined rules. In many cases, however, program analysis falls short of understanding the semantics of a function (e.g., the functionality of an API), and thus has difficulty judging whether the function and its related behaviors would lead to a security bug. In response, we propose an automated data-driven annotation strategy to enhance the understanding of function semantics during flaw detection. Our SparrowHawk source code analysis system uses a programming-language-aware text similarity comparison to efficiently annotate the attributes of functions. With the annotation results, SparrowHawk uses the Clang Static Analyzer to guide security analyses.
To evaluate SparrowHawk, we applied it to memory corruption detection, which relies on the annotation of customized memory allocation/release functions. The experimental results show that by introducing function annotation to the original source code analysis, SparrowHawk achieves more effective and efficient flaw detection, and successfully discovers 51 new memory corruption vulnerabilities in popular open source projects such as FFmpeg and the kernel of the OpenHarmony IoT operating system.
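As a rough illustration of annotating functions by name similarity, the sketch below labels a function as an allocator or a releaser when one of its name tokens closely resembles a known keyword. It is not SparrowHawk's method: it swaps in a plain character-level ratio from Python's `difflib` for the learned similarity model, and the seed keywords, the `split_identifier` and `annotate` helpers, and the 0.8 threshold are all hypothetical choices made for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical seed keywords; a data-driven system would learn these instead.
SEEDS = {
    "alloc": ["alloc", "malloc", "calloc", "new", "create"],
    "release": ["free", "release", "destroy", "delete"],
}


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase tokens."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        # Handles runs of capitals (HTTPServer), capitalized words, and digits.
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens]


def annotate(name, threshold=0.8):
    """Return 'alloc', 'release', or None for a function name.

    The score is the best character-level similarity between any name
    token and any seed keyword (an assumption of this sketch).
    """
    best_label, best_score = None, 0.0
    for token in split_identifier(name):
        for label, seeds in SEEDS.items():
            for seed in seeds:
                score = SequenceMatcher(None, token, seed).ratio()
                if score > best_score:
                    best_label, best_score = label, score
    return best_label if best_score >= threshold else None


if __name__ == "__main__":
    for fn in ["OsMemAlloc", "av_freep", "strlen"]:
        print(fn, "->", annotate(fn))
```

Labels produced this way could then be handed to a static analyzer as a list of custom allocation and release functions; how SparrowHawk passes its annotations to Clang is not described in this abstract.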
Acknowledgment
We would like to thank the anonymous reviewers for their helpful comments. This work was partially supported by the National Natural Science Foundation of China (Grant No. U19B2023), the National Key Research and Development Program of China (Grant No. 2020AAA0107800), and the National Natural Science Foundation of China (Grant No. 62002222).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Lyu, Y., Gao, W., Ma, S., Sun, Q., Li, J. (2021). SparrowHawk: Memory Safety Flaw Detection via Data-Driven Source Code Annotation. In: Yu, Y., Yung, M. (eds) Information Security and Cryptology. Inscrypt 2021. Lecture Notes in Computer Science, vol 13007. Springer, Cham. https://doi.org/10.1007/978-3-030-88323-2_7
DOI: https://doi.org/10.1007/978-3-030-88323-2_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88322-5
Online ISBN: 978-3-030-88323-2
eBook Packages: Computer Science (R0)