SparrowHawk: Memory Safety Flaw Detection via Data-Driven Source Code Annotation

Lyu, Yunlong; Gao, Wang; Ma, Siqi; Sun, Qibin; Li, Juanru

doi:10.1007/978-3-030-88323-2_7

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 13007)

Included in the following conference series:

International Conference on Information Security and Cryptology

Abstract

Detecting code flaws in programs is a vital aspect of software maintenance and security. Classic code flaw detection techniques rely on program analysis to check whether the code logic violates certain predefined rules. In many cases, however, program analysis falls short of understanding the semantics of a function (e.g., the functionality of an API), and thus has difficulty judging whether the function and its related behaviors would lead to a security bug. In response, we propose an automated data-driven annotation strategy to enhance the understanding of function semantics during flaw detection. Our SparrowHawk source code analysis system uses a programming-language-aware text similarity comparison to efficiently annotate the attributes of functions. With the annotation results, SparrowHawk uses the Clang Static Analyzer to guide security analyses.

To evaluate SparrowHawk, we applied it to memory corruption detection, which relies on the annotation of customized memory allocation/release functions. The experimental results show that by adding function annotation to the original source code analysis, SparrowHawk achieves more effective and efficient flaw detection, and successfully discovers 51 new memory corruption vulnerabilities in popular open source projects such as FFmpeg and the kernel of the OpenHarmony IoT operating system.
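To make the idea of annotating customized allocation/release functions by name similarity concrete, the following is a minimal sketch and not SparrowHawk's actual model. It assumes hypothetical seed vocabularies (`ALLOC_SEEDS`, `FREE_SEEDS`), a hypothetical threshold of 0.8, and uses Python's standard `difflib.SequenceMatcher` as a stand-in for the learned, programming-language-aware similarity comparison described in the abstract.

```python
import re
from difflib import SequenceMatcher

# Hypothetical seed vocabularies chosen for illustration; not from the paper.
ALLOC_SEEDS = ["malloc", "calloc", "realloc", "alloc", "new", "create"]
FREE_SEEDS = ["free", "release", "destroy", "delete", "dealloc"]


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase subwords."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def best_similarity(subwords, seeds):
    """Highest string-similarity ratio between any subword and any seed."""
    return max(
        (SequenceMatcher(None, w, s).ratio() for w in subwords for s in seeds),
        default=0.0,
    )


def annotate(name, threshold=0.8):
    """Label a function name as allocator, deallocator, or unknown.

    The threshold is an assumed value, not one reported by the authors.
    """
    subwords = split_identifier(name)
    alloc_score = best_similarity(subwords, ALLOC_SEEDS)
    free_score = best_similarity(subwords, FREE_SEEDS)
    if max(alloc_score, free_score) < threshold:
        return "unknown"
    return "allocator" if alloc_score >= free_score else "deallocator"


if __name__ == "__main__":
    for fn in ["av_mallocz", "av_freep", "xmlFreeNode", "strlen"]:
        print(fn, annotate(fn))
```

In a pipeline of the kind the abstract describes, labels like these would then be handed to a static analyzer so that its memory checkers treat the project-specific wrappers like the standard allocation and release routines; how SparrowHawk passes annotations to the Clang Static Analyzer is not specified in this abstract.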

Acknowledgment

We would like to thank the anonymous reviewers for their helpful comments. This work was partially supported by the National Natural Science Foundation of China (Grant No. U19B2023), the National Key Research and Development Program of China (Grant No. 2020AAA0107800), and the National Natural Science Foundation of China (Grant No. 62002222).

Author information

Authors and Affiliations

University of Science and Technology of China, Hefei, China
Yunlong Lyu & Qibin Sun
Shanghai Jiao Tong University, Shanghai, China
Wang Gao & Juanru Li
The University of Queensland, Brisbane, Australia
Siqi Ma

Corresponding authors

Correspondence to Yunlong Lyu or Juanru Li.

Editor information

Editors and Affiliations

Shanghai Jiao Tong University, Shanghai, China
Yu Yu
Columbia University, New York, NY, USA
Moti Yung

About this paper

Cite this paper

Lyu, Y., Gao, W., Ma, S., Sun, Q., Li, J. (2021). SparrowHawk: Memory Safety Flaw Detection via Data-Driven Source Code Annotation. In: Yu, Y., Yung, M. (eds) Information Security and Cryptology. Inscrypt 2021. Lecture Notes in Computer Science, vol 13007. Springer, Cham. https://doi.org/10.1007/978-3-030-88323-2_7

DOI: https://doi.org/10.1007/978-3-030-88323-2_7
Published: 18 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88322-5
Online ISBN: 978-3-030-88323-2
eBook Packages: Computer Science, Computer Science (R0)
