Knowledge Extraction from Unstructured Data on the Web

Managing Data From Knowledge Bases: Querying and Extraction

Abstract

In this chapter, we developed EmbTE, a method for source code topic extraction based on word embedding techniques. We also applied LDA and NMF to extract topics from source code. The empirical comparisons show that EmbTE outperforms LDA and NMF by producing more coherent topics, and that EmbTE with the CBOW model performs better than EmbTE with the Skip-gram model. We also identified the most contributory terms in source code using our proposed term selection algorithm. We found that method names, method comments, class names and class comments are the most contributory term types.
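The CBOW and Skip-gram models compared above differ in how training examples are formed from a token sequence: CBOW predicts a word from its surrounding context, while Skip-gram predicts each context word from the center word. The following is a minimal sketch of that difference only, not of EmbTE itself. The function names, the window size, and the example tokens are hypothetical and do not come from the chapter.

```python
from typing import List, Tuple


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context words, target word) examples.

    CBOW predicts the center word from the words around it.
    """
    examples = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of position i.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center word, context word) pairs.

    Skip-gram predicts each surrounding word from the center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical terms, as might be obtained by splitting identifiers and comments.
tokens = ["parse", "config", "file", "reader"]

for context, target in cbow_examples(tokens, window=1):
    print("CBOW     ", context, "->", target)
for center, context_word in skipgram_examples(tokens, window=1):
    print("Skip-gram", center, "->", context_word)
```

In gensim's `Word2Vec`, the same choice is made with the `sg` parameter (`sg=0` for CBOW, `sg=1` for Skip-gram). Whether the chapter's experiments were configured this way is an assumption.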

Notes

  1. https://github.com/

  2. https://github.com/javaparser/javaparser

  3. http://snowball.tartarus.org/algorithms/english/stop.txt

  4. http://nlp.stanford.edu/

  5. https://radimrehurek.com/gensim/

  6. https://code.google.com/archive/p/word2vec/

  7. https://github.com/AKSW/Palmetto

  8. http://scikit-learn.org/

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Zhang, W.E., Sheng, Q.Z. (2018). Knowledge Extraction from Unstructured Data on the Web. In: Managing Data From Knowledge Bases: Querying and Extraction. Springer, Cham. https://doi.org/10.1007/978-3-319-94935-2_5

  • DOI: https://doi.org/10.1007/978-3-319-94935-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94934-5

  • Online ISBN: 978-3-319-94935-2

  • eBook Packages: Computer Science, Computer Science (R0)
