Abstract
In this chapter, we develop EmbTE, a method for source code topic extraction based on word embedding techniques. We also apply LDA and NMF to extract topics from source code. Empirical comparisons show that EmbTE yields more coherent topics than LDA and NMF, and that EmbTE with the CBOW model performs better than EmbTE with the Skip-gram model. We also identify the most contributory terms in source code using our proposed term selection algorithm, and find that method names, method comments, class names, and class comments are the most contributory term types.
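To make the notion of "more coherent topics" concrete, the following is a minimal sketch of one widely used coherence score, the UMass measure, computed over tokenized source files. The abstract does not say which coherence measure the chapter uses, so UMass is an assumption chosen for illustration, and the function name `umass_coherence`, the toy documents, and the example topics are all hypothetical.

```python
import math


def umass_coherence(topic_terms, documents):
    """UMass coherence of one topic (illustrative; not the chapter's measure).

    topic_terms: list of terms, ordered from most to least probable.
    documents:   list of token lists, e.g. terms taken from method names
                 and comments of one source file each (assumed input shape).
    Scores are <= 0 in typical cases; values closer to 0 mean the topic's
    terms co-occur in the same documents more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*terms):
        # Number of documents that contain every one of the given terms.
        return sum(1 for d in doc_sets if all(t in d for t in terms))

    score = 0.0
    for m in range(1, len(topic_terms)):
        for l in range(m):
            d_l = doc_freq(topic_terms[l])
            if d_l == 0:
                continue  # term never occurs; skip to avoid division by zero
            d_ml = doc_freq(topic_terms[m], topic_terms[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Hypothetical toy corpus: each list stands for the terms of one source file.
docs = [
    ["parse", "json", "file", "read"],
    ["read", "file", "buffer", "stream"],
    ["parse", "json", "token", "read"],
    ["render", "button", "click", "widget"],
]

related_topic = ["read", "file", "parse"]
mixed_topic = ["read", "button", "token"]

print(umass_coherence(related_topic, docs))
print(umass_coherence(mixed_topic, docs))
```

Under this measure, a topic whose top terms tend to appear in the same files scores higher than one that mixes terms from unrelated files, which is the sense in which one extraction method can be said to produce more coherent topics than another.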
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Zhang, W.E., Sheng, Q.Z. (2018). Knowledge Extraction from Unstructured Data on the Web. In: Managing Data From Knowledge Bases: Querying and Extraction. Springer, Cham. https://doi.org/10.1007/978-3-319-94935-2_5
DOI: https://doi.org/10.1007/978-3-319-94935-2_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94934-5
Online ISBN: 978-3-319-94935-2
eBook Packages: Computer Science (R0)