Abstract
This paper demonstrates and compares methods for estimating the interrater reliability and interrater agreement of performance ratings. These methods can be used by applied researchers to investigate the quality of ratings gathered, for example, as criteria for a validity study, or as performance measures for selection or promotional purposes. While estimates of interrater reliability are frequently used for these purposes, indices of interrater agreement appear to be rarely reported for performance ratings. A recommended index of interrater agreement, the T index (Tinsley & Weiss, 1975), is compared to four methods of estimating interrater reliability (Pearson r, coefficient alpha, mean correlation between raters, and intraclass correlation). Subordinate and superior ratings of the performance of 100 managers were used in these analyses. The results indicated that, in general, interrater agreement and reliability among subordinates were fairly high. Interrater agreement between subordinates and superiors was moderately high; however, interrater reliability between these two rating sources was very low. The results demonstrate that interrater agreement and reliability are distinct indices and that both should be reported. Reasons are discussed as to why interrater reliability should not be reported alone.
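The distinction between agreement and reliability can be illustrated with a minimal sketch. The ratings below are made-up illustrative values, not data from the study, and the function and variable names are hypothetical. The T index is written here in the form T = (N1 - N*P) / (N - N*P), where N1 is the number of agreements, N the number of ratees, and P the chance probability of agreement; taking P = 1/A for two raters on an A-point scale with agreement defined as identical ratings is an assumption of this sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two raters."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_index(x, y, scale_points):
    """Chance-corrected agreement, T = (N1 - N*P) / (N - N*P).

    Assumption: agreement means identical ratings, and the chance
    probability is P = 1 / scale_points for two raters.
    """
    n = len(x)
    n_agree = sum(1 for a, b in zip(x, y) if a == b)
    p_chance = 1.0 / scale_points
    return (n_agree - n * p_chance) / (n - n * p_chance)


# Hypothetical ratings of eight ratees on a 7-point scale.
superior = [1, 2, 3, 4, 5, 3, 2, 4]
# The second rater ranks ratees identically but is one point more lenient.
subordinate = [rating + 1 for rating in superior]

r = pearson_r(superior, subordinate)
t = t_index(superior, subordinate, scale_points=7)
print(f"Pearson r = {r:.3f}, T index = {t:.3f}")
```

In this constructed case the two raters order the ratees identically, so the correlation is perfect, yet they never assign the same score, so the agreement index falls below zero. The reverse pattern, which resembles what the abstract reports for subordinates versus superiors, arises when ratings cluster in a narrow range: raters often match, but the restricted variance keeps the correlation low.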
References
Berry, K., & Mielke, P. (1988). A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48, 921–933.
Berry, K., & Mielke, P. (1990). A generalized agreement measure. Educational and Psychological Measurement, 50, 123–125.
Campion, M., & Pursell, E. (1981). Plymouth Fiber Extraboard Validation Report. New Bern, NC: Weyerhaeuser Co.
Campion, M., Pursell, E., & Brown, B. (1988). Structured interviewing: Raising the psychometric properties of the employment interview. Personnel Psychology, 41, 25–42.
Cronbach, L., Gleser, G., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.
Ghiselli, E., Campbell, J., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W. H. Freeman.
Guilford, J., & Fruchter, B. (1978). Fundamental statistics in psychology and education. New York: McGraw-Hill.
James, L., Demaree, R., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 322–327.
Hayes, W. (1988). Statistics. Fort Worth, TX: Holt, Rinehart and Winston.
Kozlowski, S., & Hattrup, K. (1992). A disagreement about within group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167.
Lawlis, F., & Lu, E. (1972). Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 78, 17–20.
Rothstein, H. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 85–98.
Saal, F., Downey, R., & Lahey, M. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413–428.
SAS. (1990). SAS/STAT user's guide. Vol. 2. Cary, NC: SAS Institute.
Schneider, B., & Schmitt, N. (1986). Staffing organizations. Glenview, IL: Scott, Foresman.
Shrout, P., & Fleiss, J. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Tinsley, H., & Weiss, D. (1975). Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22, 358–376.
Tornow, W. (1993). Perceptions or reality: Is multi-perspective measurement a means or an end? Human Resource Management, 32, 221–229.
Winer, B. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Additional information
This paper is based, in part, on a thesis submitted to East Carolina University by the second author. Portions of this study were presented at the American Psychological Association meeting in New Orleans, LA, August, 1989. The authors would like to thank Michael Campion and two anonymous reviewers for their comments on earlier drafts of this paper.
About this article
Cite this article
Fleenor, J.W., Fleenor, J.B. & Grossnickle, W.F. Interrater reliability and agreement of performance ratings: A methodological comparison. J Bus Psychol 10, 367–380 (1996). https://doi.org/10.1007/BF02249609
DOI: https://doi.org/10.1007/BF02249609