Interrater reliability and agreement of performance ratings: A methodological comparison

Journal of Business and Psychology

Abstract

This paper demonstrates and compares methods for estimating the interrater reliability and interrater agreement of performance ratings. These methods can be used by applied researchers to investigate the quality of ratings gathered, for example, as criteria for a validity study, or as performance measures for selection or promotional purposes. While estimates of interrater reliability are frequently used for these purposes, indices of interrater agreement appear to be rarely reported for performance ratings. A recommended index of interrater agreement, the T index (Tinsley & Weiss, 1975), is compared to four methods of estimating interrater reliability (Pearson r, coefficient alpha, mean correlation between raters, and intraclass correlation). Subordinate and superior ratings of the performance of 100 managers were used in these analyses. The results indicated that, in general, interrater agreement and reliability among subordinates were fairly high. Interrater agreement between subordinates and superiors was moderately high; however, interrater reliability between these two rating sources was very low. The results demonstrate that interrater agreement and reliability are distinct indices and that both should be reported. Reasons are discussed as to why interrater reliability should not be reported alone.
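
As a rough illustration of how the four reliability indices compared in the paper can be computed, the sketch below estimates a pairwise Pearson r, the mean correlation between raters, coefficient alpha (raters treated as items), and the intraclass correlation ICC(2,1) of Shrout and Fleiss (1979) from a ratees-by-raters matrix. This is not the authors' code; the data, function name, and choice of ICC variant are illustrative assumptions, and the T index of interrater agreement is not shown.

```python
import numpy as np

def reliability_estimates(ratings: np.ndarray) -> dict:
    """Four interrater reliability estimates for an (n_ratees, n_raters) matrix."""
    n, k = ratings.shape

    # Pearson r between a single pair of raters (here, the first two columns).
    pearson_r = np.corrcoef(ratings[:, 0], ratings[:, 1])[0, 1]

    # Mean correlation over all rater pairs.
    rater_corrs = np.corrcoef(ratings, rowvar=False)
    mean_r = rater_corrs[np.triu_indices(k, 1)].mean()

    # Coefficient alpha, treating the k raters as "items" of a k-item scale.
    item_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1.0 - item_var / total_var)

    # Intraclass correlation ICC(2,1): two-way random effects, single rater,
    # absolute agreement (Shrout & Fleiss, 1979).
    ms_rows = k * ratings.mean(axis=1).var(ddof=1)   # between-ratee mean square
    ms_cols = n * ratings.mean(axis=0).var(ddof=1)   # between-rater mean square
    ss_total = ((ratings - ratings.mean()) ** 2).sum()
    ms_error = (ss_total - ms_rows * (n - 1) - ms_cols * (k - 1)) / ((n - 1) * (k - 1))
    icc_21 = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

    return {"pearson_r": pearson_r, "mean_r": mean_r, "alpha": alpha, "icc_2_1": icc_21}

# Illustrative data (not the study's data): 100 ratees, 3 raters, 5-point scale.
rng = np.random.default_rng(0)
true_perf = rng.normal(3.0, 1.0, size=(100, 1))
ratings = np.clip(np.rint(true_perf + rng.normal(0.0, 0.7, size=(100, 3))), 1, 5)
print(reliability_estimates(ratings))
```

With raters who differ only by random error, the four estimates come out similar; they diverge when raters differ systematically in mean level (e.g., leniency), which is one reason correlational reliability indices and agreement indices can tell different stories about the same ratings.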

References

  • Berry, K., & Mielke, P. (1988). A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48, 921–933.

  • Berry, K., & Mielke, P. (1990). A generalized agreement measure. Educational and Psychological Measurement, 50, 123–125.

  • Campion, M., & Pursell, E. (1981). Plymouth Fiber Extraboard Validation Report. New Bern, NC: Weyerhaeuser Co.

  • Campion, M., Pursell, E., & Brown, B. (1988). Structured interviewing: Raising the psychometric properties of the employment interview. Personnel Psychology, 41, 25–42.

  • Cronbach, L., Gleser, G., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.

  • Ghiselli, E., Campbell, J., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W. H. Freeman.

  • Guilford, J., & Fruchter, B. (1978). Fundamental statistics in psychology and education. New York: McGraw-Hill.

  • Hays, W. (1988). Statistics. Fort Worth, TX: Holt, Rinehart and Winston.

  • James, L., Demaree, R., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 322–327.

  • Kozlowski, S., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167.

  • Lawlis, F., & Lu, E. (1972). Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 78, 17–20.

  • Rothstein, H. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 85–98.

  • Saal, F., Downey, R., & Lahey, M. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413–428.

  • SAS Institute. (1990). SAS/STAT user's guide (Vol. 2). Cary, NC: SAS Institute.

  • Schneider, B., & Schmitt, N. (1986). Staffing organizations. Glenview, IL: Scott, Foresman.

  • Shrout, P., & Fleiss, J. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.

  • Tinsley, H., & Weiss, D. (1975). Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22, 358–376.

  • Tornow, W. (1993). Perceptions or reality: Is multi-perspective measurement a means or an end? Human Resource Management, 32, 221–229.

  • Winer, B. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.

Additional information

This paper is based, in part, on a thesis submitted to East Carolina University by the second author. Portions of this study were presented at the American Psychological Association meeting in New Orleans, LA, August, 1989. The authors would like to thank Michael Campion and two anonymous reviewers for their comments on earlier drafts of this paper.

About this article

Cite this article

Fleenor, J.W., Fleenor, J.B. & Grossnickle, W.F. Interrater reliability and agreement of performance ratings: A methodological comparison. J Bus Psychol 10, 367–380 (1996). https://doi.org/10.1007/BF02249609
