Knowledge Extraction from Unstructured Data on the Web

Zhang, Wei Emma; Sheng, Quan Z.

doi:10.1007/978-3-319-94935-2_5



Abstract

In this chapter, we developed EmbTE, a method for source code topic extraction based on word embedding techniques. We also adopted LDA and NMF to extract topics from source code. Empirical comparisons show that EmbTE outperforms LDA and NMF by providing more coherent topics, and that EmbTE performs better with the CBOW model than with the Skip-gram model. We also identified the most contributory terms in source code using our proposed term selection algorithm. We found that method names, method comments, class names, and class comments are the most contributory term types.
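
As a rough illustration of what "more coherent topics" can mean in practice, the sketch below scores a topic's top terms with a UMass-style document co-occurrence measure. This is an assumption made for illustration only: the abstract does not state which coherence measure the chapter uses, and the function name `umass_coherence` and the toy corpus are hypothetical.

```python
import math


def umass_coherence(topic_words, documents):
    """UMass-style coherence of a topic's top terms (illustrative sketch).

    topic_words: top terms of one topic, most probable first.
    documents:   list of tokenized documents (lists of terms).
    Higher (less negative) scores mean the terms co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every given word.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            denom = doc_freq(topic_words[j])
            if denom == 0:
                continue  # skip terms that never occur in the corpus
            joint = doc_freq(topic_words[i], topic_words[j])
            score += math.log((joint + 1) / denom)
    return score


# Toy corpus: each "document" stands in for the terms of one source file.
docs = [
    ["file", "read", "stream"],
    ["file", "write", "stream"],
    ["button", "click", "event"],
    ["button", "event", "listener"],
]

coherent_score = umass_coherence(["file", "stream"], docs)
mixed_score = umass_coherence(["file", "button"], docs)
print(coherent_score > mixed_score)
```

Terms that appear together in the same files ("file", "stream") score higher than terms drawn from unrelated files ("file", "button"), which is the intuition behind comparing topic extraction methods by coherence.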




Author information

Authors and Affiliations

Department of Computing, Macquarie University, Sydney, NSW, Australia
Wei Emma Zhang & Quan Z. Sheng


About this chapter

Cite this chapter

Zhang, W.E., Sheng, Q.Z. (2018). Knowledge Extraction from Unstructured Data on the Web. In: Managing Data From Knowledge Bases: Querying and Extraction. Springer, Cham. https://doi.org/10.1007/978-3-319-94935-2_5


DOI: https://doi.org/10.1007/978-3-319-94935-2_5
Published: 01 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94934-5
Online ISBN: 978-3-319-94935-2
eBook Packages: Computer Science, Computer Science (R0)
