
APIReal: an API recognition and linking approach for online developer forums

Empirical Software Engineering

Abstract

When discussing programming issues on social platforms (e.g., Stack Overflow, Twitter), developers often mention APIs in natural language text. Extracting these API mentions is a prerequisite to effective indexing and searching for API-related information in software engineering social content. The task involves two steps: 1) distinguishing API mentions from other English words (i.e., API recognition), and 2) disambiguating a recognized API mention to its unique fully qualified name (i.e., API linking). Software engineering social content lacks consistent API mentions and sentence writing format. As a result, API recognition and linking have to deal with the inherent ambiguity of API mentions in informal text. For example, a common word may be used in its API sense or in its normal sense (e.g., append, apply, and merge); the simple name of an API can map to several APIs of the same library or of different libraries; and different writing forms of an API should be linked to the same API. In this paper, we propose a semi-supervised machine learning approach that exploits name synonyms and the rich semantic context of API mentions for API recognition in informal text. Based on the results of our API recognition approach, we further propose an API linking approach that leverages a set of domain-specific heuristics, including mention-mention similarity, scope filtering, and mention-entry similarity, to determine which API in the knowledge base a recognized API mention actually refers to. To evaluate our API recognition approach, we use 1205 API mentions of three libraries (Pandas, Numpy, and Matplotlib) from Stack Overflow text. We also evaluate our API linking approach with 120 recognized API mentions of these three libraries.
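To make the linking step concrete, the following is a minimal sketch of how a simple name such as append can map to several knowledge-base entries and how scope filtering plus a mention-entry similarity score might narrow them down. It is not the approach evaluated in the paper: the toy knowledge base, the `ApiEntry` type, the `link_mention` function, the use of library tags as the scope, and the Jaccard token overlap used as the similarity measure are all assumptions made for illustration. The recognition step (the semi-supervised model) is not covered here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiEntry:
    fqn: str          # fully qualified name
    library: str      # library the API belongs to
    description: str  # short text used for mention-entry similarity


# Toy entries for illustration only; not a real knowledge-base snapshot.
KNOWLEDGE_BASE = [
    ApiEntry("pandas.DataFrame.append", "pandas",
             "append rows of other to the end of this frame"),
    ApiEntry("pandas.Series.append", "pandas",
             "concatenate two or more series"),
    ApiEntry("numpy.append", "numpy",
             "append values to the end of an array"),
]


def tokens(text):
    """Lowercased alphabetic tokens of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets (assumed stand-in similarity)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def link_mention(mention, context, scope_libraries):
    """Link a recognized mention to one knowledge-base entry, or None."""
    # Normalize writing forms such as "df.append", "append()" to "append".
    name = mention.rstrip("()").rsplit(".", 1)[-1].lower()
    candidates = [e for e in KNOWLEDGE_BASE
                  if e.fqn.rsplit(".", 1)[-1].lower() == name]
    # Scope filtering: prefer libraries in scope (e.g., the post's tags);
    # fall back to all candidates if the filter removes everything.
    scoped = [e for e in candidates if e.library in scope_libraries] or candidates
    # Mention-entry similarity: compare the context with each entry's text.
    ctx = tokens(context)
    return max(scoped,
               key=lambda e: jaccard(ctx, tokens(e.fqn + " " + e.description)),
               default=None)


if __name__ == "__main__":
    hit = link_mention("df.append", "append rows to the end of a data frame",
                       {"pandas"})
    print(hit.fqn if hit else None)
```

In this sketch the scope filter discards `numpy.append` when only pandas is in scope, and the token overlap then separates the two remaining pandas candidates. A mention with no matching simple name yields `None` rather than a forced link.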



Notes

  1. https://github.com/google/code-prettify

  2. Scrapy, http://scrapy.org/

  3. CRFSuite, http://www.chokkan.org/software/crfsuite/

  4. Brown Clustering, https://github.com/percyliang/brown-cluster

  5. Word2vec, https://code.google.com/archive/p/word2vec/

  6. Sofia-ML, https://code.google.com/archive/p/sofia-ml/

  7. http://www.signll.org/conll/


Author information


Corresponding author

Correspondence to Lingfeng Bao.

Additional information

Communicated by: Denys Poshyvanyk


About this article


Cite this article

Ye, D., Bao, L., Xing, Z. et al. APIReal: an API recognition and linking approach for online developer forums. Empir Software Eng 23, 3129–3160 (2018). https://doi.org/10.1007/s10664-018-9608-7


  • DOI: https://doi.org/10.1007/s10664-018-9608-7
