Abstract
Peer code review is a widely adopted software engineering practice for ensuring code quality and software reliability in both commercial and open-source projects. Because code reviews carry a large effort overhead, project managers often wonder whether their code reviews are effective and whether there are opportunities for improvement. Since project managers at Samsung Research Bangladesh (SRBD) were also intrigued by these questions, this research developed, deployed, and evaluated a production-ready solution based on the Balanced Scorecard (BSC) strategy that SRBD managers can use in their day-to-day management to monitor the code review effectiveness of an individual developer, a particular project, or the entire organization. Following the four-step framework of the BSC strategy, we: 1) defined the operational goals of this research, 2) defined a set of metrics to measure the effectiveness of code reviews, 3) developed an automated mechanism to measure those metrics, and 4) developed and evaluated a monitoring application to inform the key stakeholders. Our automated model for identifying useful code reviews achieves 7.88% and 14.39% improvements in accuracy and minority-class F1 score, respectively, over the models proposed in prior studies. It also outperforms the human evaluators from SRBD whom the model replaces, by margins of 25.32% in accuracy and 23.84% in minority-class F1 score. In our post-deployment survey, SRBD developers and managers indicated that they found our solution useful and that it provided them with important insights to support their decision making.
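For readers unfamiliar with the two evaluation metrics reported above, the following sketch (not the paper's code; the labels are hypothetical) shows how accuracy and the minority-class F1 score are computed for a binary "useful vs. not useful" review classifier:

```python
# Illustrative sketch of the two reported metrics for a binary classifier.
# Label convention (assumed): 1 = useful (majority), 0 = not useful (minority).

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 score computed for one class (here: the minority class)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical example labels:
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))          # 0.75
print(f1_for_class(y_true, y_pred, 0))   # minority-class F1 (2/3 here)
```

Reporting F1 on the minority class separately matters because, with imbalanced data, a classifier can score high accuracy while rarely predicting the minority ("not useful") class correctly.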
Notes
We are unable to make the dataset publicly available due to the restrictions imposed by our NDA with SRBD.
On StackOverflow, each accepted answer earns 15 points, each upvote earns 10 points, and each downvote costs 2 points.
The numbers represent how many interviewees consider this type of comment Useful or Not Useful.
Point biserial correlation
Numbers in parentheses indicate how many CRA users in our evaluation survey mentioned this particular insight; one user may have mentioned multiple insights.
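The point-biserial correlation mentioned in the notes measures the association between a dichotomous variable (e.g., useful / not useful) and a continuous one. A minimal sketch, with made-up data and a hypothetical helper name (in practice one would use `scipy.stats.pointbiserialr`):

```python
import math

def point_biserial(binary, continuous):
    """Point-biserial correlation between a 0/1 variable and a continuous one.

    Equivalent to the Pearson correlation when one variable is dichotomous:
    r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population std.
    """
    n = len(binary)
    g1 = [c for b, c in zip(binary, continuous) if b == 1]
    g0 = [c for b, c in zip(binary, continuous) if b == 0]
    n1, n0 = len(g1), len(g0)
    mean = sum(continuous) / n
    s_n = math.sqrt(sum((c - mean) ** 2 for c in continuous) / n)
    return (sum(g1) / n1 - sum(g0) / n0) / s_n * math.sqrt(n1 * n0 / n ** 2)

# Made-up example: does a continuous feature differ between the two classes?
print(point_biserial([0, 0, 1, 1], [1, 2, 3, 4]))  # ~0.894
```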
Acknowledgment
Work conducted by Dr. Amiangshu Bosu for this research is partially supported by the US National Science Foundation under Grant No. 1850475. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Work conducted for this research is also partially supported by a research grant provided by Samsung Research Bangladesh.
Additional information
Communicated by: Sigrid Eldh and Davide Falessi
About this article
Cite this article
Hasan, M., Iqbal, A., Islam, M.R.U. et al. Using a balanced scorecard to identify opportunities to improve code review effectiveness: an industrial experience report. Empir Software Eng 26, 129 (2021). https://doi.org/10.1007/s10664-021-10038-w