A Statistical Test for Differential Item Pair Functioning

Bechger, Timo M.; Maris, Gunter

doi:10.1007/s11336-014-9408-y

Published: 16 September 2014

Psychometrika, volume 80, pages 317–340 (2015)

Timo M. Bechger¹ & Gunter Maris¹,²

Abstract

This paper presents an IRT-based statistical test for differential item functioning (DIF). The test is developed for items conforming to the Rasch (1960) model, but we outline its extension to more complex IRT models. It differs from existing procedures in that DIF is defined in terms of the relative difficulties of pairs of items rather than the difficulties of individual items. The argument is that the difficulty of an individual item is not identified from the observations, whereas the relative difficulties are. This leads to a test that is closely related to Lord’s (1980) test for item DIF, albeit with a different and more correct interpretation. Illustrations with real and simulated data are provided.
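The identification argument can be illustrated numerically. The following is a minimal Python sketch, not taken from the paper, assuming the standard Rasch form \(P(X=1)=\exp(\theta -\delta )/(1+\exp(\theta -\delta ))\). The ability and difficulty values and the names `rasch_prob`, `thetas`, `deltas`, and `shift` are hypothetical and chosen only for illustration.

```python
import math


def rasch_prob(theta, delta):
    """P(correct) under the Rasch model for ability theta and difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# Hypothetical abilities and item difficulties (illustration only).
thetas = [-1.0, 0.0, 1.5]
deltas = [-0.5, 0.3, 1.2]
shift = 2.7  # any constant added to every ability and every difficulty

original = [[rasch_prob(t, d) for d in deltas] for t in thetas]
shifted = [[rasch_prob(t + shift, d + shift) for d in deltas] for t in thetas]

# The response probabilities, and hence the likelihood, do not change,
# so the data cannot tell deltas apart from deltas + shift.
max_abs_diff = max(
    abs(p - q)
    for row_p, row_q in zip(original, shifted)
    for p, q in zip(row_p, row_q)
)

# The relative difficulties delta_i - delta_j are unaffected by the shift.
rel_original = [[di - dj for dj in deltas] for di in deltas]
rel_shifted = [[(di + shift) - (dj + shift) for dj in deltas] for di in deltas]
max_rel_diff = max(
    abs(a - b)
    for row_a, row_b in zip(rel_original, rel_shifted)
    for a, b in zip(row_a, row_b)
)

print(max_abs_diff, max_rel_diff)
```

Both printed values should be zero up to floating-point rounding: individual difficulties can be moved freely without changing the model, while differences between item difficulties stay fixed.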

Notes

Technically, such groups are called non-equivalent to indicate that assignment to group was not random. We assume non-equivalent groups because (random) equivalent groups would not show any systematic difference, let alone DIF. The distribution of ability in non-equivalent groups may be the same or different.
Bad practice is a frequent source of problems: in particular, the misspecification of the measurement model, giving the appearance of DIF, or capitalization on chance (e.g., Teresi, 2006). We assume here a fitting Rasch model in each group.
Some researchers would report item \(1\) on the argument that DIF items are necessarily a minority (e.g., De Boeck, 2008): DIF items are viewed as outliers with respect to other (non-DIF) items (Magis & De Boeck, 2012). This, however, is an opinion; it cannot be proven correct or incorrect using empirical evidence. Furthermore, as suggested to us in a personal communication by Norman Verhelst, this working principle implies that items that show DIF in a test where they are a minority may not show DIF in another test where they constitute a majority.
The same idea underlies the \(S_i\) test proposed by Glas and Verhelst (1995b).
This follows because \(\gamma _s(c \varvec{\epsilon })=c^s \gamma _s(\varvec{\epsilon })\).
In general, \(\mathrm{Cov}\left( \hat{R}^{(g)}_{ir},\hat{R}^{(g)}_{jr}\right) =\mathrm{Cov}\left( \hat{\delta }_{i,g},\hat{\delta }_{j,g}\right) +\mathrm{Var}\left( \hat{\delta }_{r,g}\right) -\mathrm{Cov}\left( \hat{\delta }_{i,g},\hat{\delta }_{r,g}\right) -\mathrm{Cov}\left( \hat{\delta }_{j,g},\hat{\delta }_{r,g}\right) .\)
Even the properties of abstract IQ test items can be affected by education and/or aging. Fortunately, it is possible to develop an IQ test without concurrent calibration, a possibility that seems to be widely ignored.
If \(\beta _{i,g}=\ln \alpha _{i,g}\), \(\ln \frac{\alpha _{i,g}}{\alpha _{j,g}}=\beta _{i,g}-\beta _{j,g}\). An application of the delta method shows that \(\mathrm{Cov}(\hat{\beta }_{i,g},\hat{\beta }_{j,g})=\mathrm{Cov}(\hat{\alpha }_{i,g},\hat{\alpha }_{j,g})(\hat{\alpha }_{i,g}\hat{\alpha }_{j,g})^{-1}\).
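Two of the identities in these notes can be checked numerically. The following is a minimal Python sketch under stated assumptions, not the authors' code: `elementary_symmetric` computes \(\gamma _0,\dots ,\gamma _n\) with the usual sum-product recursion and is used to check \(\gamma _s(c \varvec{\epsilon })=c^s \gamma _s(\varvec{\epsilon })\), and `cov_relative` evaluates the covariance of two relative difficulties that share the reference item \(r\). The function names, the example values, and the covariance matrix `sigma` are hypothetical.

```python
def elementary_symmetric(eps):
    """Return [gamma_0, ..., gamma_n] for the values in eps."""
    gamma = [1.0] + [0.0] * len(eps)
    for k, e in enumerate(eps, start=1):
        # Update from high to low order so each value enters a product once.
        for s in range(k, 0, -1):
            gamma[s] += e * gamma[s - 1]
    return gamma


def cov_relative(cov, i, j, r):
    """Cov(delta_i - delta_r, delta_j - delta_r) from the covariance matrix."""
    return cov[i][j] + cov[r][r] - cov[i][r] - cov[j][r]


# Homogeneity check: gamma_s(c * eps) equals c**s * gamma_s(eps).
eps = [1.0, 2.0, 3.0]
c = 2.0
gamma = elementary_symmetric(eps)
gamma_scaled = elementary_symmetric([c * e for e in eps])
homogeneity_gap = max(
    abs(gamma_scaled[s] - c ** s * gamma[s]) for s in range(len(gamma))
)

# Covariance check against the quadratic form a' Sigma b with
# a = e_i - e_r and b = e_j - e_r (hypothetical covariance matrix).
sigma = [
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.02],
    [0.00, 0.02, 0.16],
]
i, j, r = 0, 1, 2
a = [1.0, 0.0, -1.0]
b = [0.0, 1.0, -1.0]
quadratic_form = sum(
    a[k] * sigma[k][l] * b[l] for k in range(3) for l in range(3)
)
closed_form = cov_relative(sigma, i, j, r)

print(gamma, homogeneity_gap, closed_form, quadratic_form)
```

The closed-form expression and the quadratic form agree because both expand the covariance of two differences term by term.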

References

Ackerman, T. (1992). A didactic explanation of item bias, impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.
Andersen, E. B. (1972). The numerical solution to a set of conditional estimation equations. Journal of the Royal Statistical Society, Series B, 34, 283–301.
Andersen, E. B. (1973b). A goodness of fit test for the Rasch model. Psychometrika, 38, 123–140.
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69–81.
Angoff, W. H. (1982). Use of difficulty and discrimination indices for detecting item bias. In R. A. Berk (Ed.), Handbook of methods for detecting item bias (pp. 96–116). Baltimore: Johns Hopkins University Press.
Angoff, W. H. (1993). Perspective on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3–24). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bechger, T. M., Maris, G., & Verstralen, H. H. F. M. (2010). A different view on DIF (R&D Report No. 2010–5). Arnhem: Cito.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading: Addison-Wesley.
Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113–141.
Camilli, G. (1993). The case against item bias detection techniques based on internal criteria: Do test bias procedures obscure test fairness issues? In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 397–413). Hillsdale, NJ: Lawrence Erlbaum.
Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.
De Boeck, P. (2008). Random IRT models. Psychometrika, 73(4), 533–559.
DeMars, C. (2010). Type I error inflation for detecting DIF in the presence of impact. Educational and Psychological Measurement, 70, 961–972.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Lawrence Erlbaum Associates.
Engelhard, G. (1989). Accuracy of bias review judges in identifying teacher certification tests. Applied Measurement in Education, 3, 347–360.
Finch, H. W., & French, B. F. (2008). Anomalous type I error rates for identifying one type of differential item functioning in the presence of the other. Educational and Psychological Measurement, 68(5), 742–759.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests. Bern: Verlag Hans Huber. (Introduction to the theory of psychological tests).
Fischer, G. H. (2007). Rasch models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 515–585). Amsterdam, The Netherlands: Elsevier.
Gabrielsen, A. (1978). Consistency and identifiability. Journal of Econometrics, 8, 261–263.
Gierl, M., Gotzmann, A., & Boughton, K. A. (2004). Performance of SIBTEST when the percentage of DIF items is large. Applied Measurement in Education, 17, 241–264.
Glas, C. A. W. (1989). Contributions to estimating and testing Rasch models. Unpublished doctoral dissertation, Cito, Arnhem.
Glas, C. A. W. (1998). Detection of differential item functioning using Lagrange multiplier tests. Statistica Sinica, 8, 647–667.
Glas, C. A. W., & Verhelst, N. D. (1995a). Testing the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 69–95). New York: Springer.
Glas, C. A. W., & Verhelst, N. D. (1995b). Tests of fit for polytomous Rasch models. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 325–352). New York: Springer.
Gnanadesikan, R., & Kettenring, J. R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28, 81–124.
Hessen, D. J. (2005). Constant latent odds-ratios models and the Mantel-Haenszel null hypothesis. Psychometrika, 70, 497–516.
Holland, P., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
Jodoin, G. M., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.
Lord, F. M. (1977). A study of item bias, using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology (pp. 19–29). Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Magis, D., Beland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862.
Magis, D., & De Boeck, P. (2012). A robust outlier approach to prevent Type I error inflation in differential item functioning. Educational and Psychological Measurement, 72, 291–311.
Magis, D., & Facon, B. (2013). Item purification does not always improve DIF detection: A counterexample with Angoff’s Delta plot. Educational and Psychological Measurement, 73, 293–311.
Mahalanobis, P. C. (1936). On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India (Vol. 2, pp. 49–55).
Maris, G., & Bechger, T. M. (2007). Scoring open ended questions. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 663–680). Amsterdam: Elsevier.
Maris, G., & Bechger, T. M. (2009). On interpreting the model parameters for the three parameter logistic model. Measurement, 7, 75–88.
Masters, G. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–175.
Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289–311.
Penfield, R. D., & Camilli, G. (2007). Differential item functioning and bias. In C. R. Rao & S. Sinharay (Eds.), Psychometrics (Vol. 26). Amsterdam: Elsevier.
Ponocny, I. (2001). Nonparametric goodness-of-fit tests for the Rasch model. Psychometrika, 66, 437–460.
Raju, N. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.
Raju, N. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.
Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute of Educational Research. (Expanded edition, 1980. Chicago, The University of Chicago Press).
R Development Core Team. (2010). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from http://www.R-project.org/ (ISBN 3-900051-07-0)
San Martin, E., González, J., & Tuerlinckx, F. (2009). Identified parameters, parameters of interest and their relationships. Measurement, 7, 97–105.
San Martin, E., & Quintana, F. (2002). Consistency and identifiability revisited. Brazilian Journal of Probability and Statistics, 16, 99–106.
San Martin, E., & Roulin, J. M. (2013). Identifiability of parametric Rasch-type models. Journal of Statistical Planning and Inference, 143, 116–130.
San Martín, E., González, J., & Tuerlinckx, F. (2014). On the unidentifiability of the fixed-effects 3PL model. Psychometrika, 78(2), 341–379.
Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.
Soares, T. M., Gonçalves, F. B., & Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3), 348–377.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with CFA and IRT: Towards a unified strategy. Journal of Applied Psychology, 91, 1292–1306.
Stout, W. (2002). Psychometrics: From practice to theory and back. Psychometrika, 67, 485–518.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
Tay, L., Newman, D. A., & Vermunt, J. K. (2011). Using mixed-measurement item response theory with covariates (MM-IRT-C) to ascertain observed and unobserved measurement equivalence. Organizational Research Methods, 14, 147–155.
Teresi, J. A. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44, 152–170.
Thissen, D. (2001). IRTLRDIF v2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. Documentation for computer program [Computer software manual]. Chapel Hill.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group difference in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–114). Hillsdale, NJ: Lawrence Erlbaum.
Thurman, C. J. (2009). A Monte Carlo study investigating the influence of item discrimination, category intersection parameters, and differential item functioning in polytomous items. Unpublished doctoral dissertation, Georgia State University.
Van der Flier, H., Mellenbergh, G. J., Adèr, H. J., & Wijn, M. (1984). An iterative bias detection method. Journal of Educational Measurement, 21, 131–145.
Verhelst, N. D. (1993). On the standard errors of parameter estimators in the Rasch model (Measurement and Research Department Reports No. 93–1). Arnhem: Cito.
Verhelst, N. D. (2008). An efficient MCMC-algorithm to sample binary matrices with fixed marginals. Psychometrika, 73, 705–728.
Verhelst, N. D., & Eggen, T. J. H. M. (1989). Psychometrische en statistische aspecten van peilingsonderzoek (PPON-rapport No. 4). Arnhem: CITO.
Verhelst, N. D., Glas, C. A. W., & van der Sluis, A. (1984). Estimation problems in the Rasch model: The basic symmetric functions. Computational Statistics Quarterly, 1(3), 245–262.
Verhelst, N. D., Glas, C. A. W., & Verstralen, H. H. F. M. (1994). OPLM: Computer program and manual [Computer software manual]. Arnhem.
Verhelst, N. D., Hatzinger, R., & Mair, P. (2007). The Rasch sampler. Journal of Statistical Software, 20, 1–14.
Verhelst, N. D., & Verstralen, H. H. F. M. (2008). Some considerations on the partial credit model. Psicológica, 29, 229–254.
von Davier, M., & von Davier, A. A. (2007). A unified approach to IRT scale linking and scale transformations. Methodology, 3(3), 1614–1881.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wang, W.-C. (2004). Effects of anchor item methods on the detection of differential item functioning within the family of Rasch models. The Journal of Experimental Education, 72(3), 221–261.
Wang, W.-C., Shih, C.-L., & Sun, G.-W. (2012). The DIF-free-then-DIF strategy for the assessment of differential item functioning. Educational and Psychological Measurement, 74, 1–22.
Wang, W.-C., & Yeh, Y.-L. (2003). Effects of anchor item methods on differential item functioning with the likelihood ratio test. Applied Psychological Measurement, 27(6), 479–498.
Williams, V. S. L. (1997). The “unbiased” anchor: Bridging the gap between DIF and item bias. Applied Measurement in Education, 10(3), 253–267.
Wilson, M., & Adams, R. (1995). Rasch models for item bundles. Psychometrika, 60, 181–198.
Wright, B. D., Mead, R., & Draba, R. (1976). Detecting and correcting item bias with a logistic response model. (Tech. Rep.). Department of Education, Statistical Laboratory: University of Chicago.
Zwinderman, A. H. (1995). Pairwise parameter estimation in Rasch models. Applied Psychological Measurement, 19, 369–375.
Zwitser, R., & Maris, G. (2013). Conditional statistical inference with multistage testing designs. Psychometrika. doi:10.1007/s11336-013-9369-6.

Acknowledgments

The authors wish to thank Norman Verhelst, Cees Glas, Paul De Boeck, Don Mellenbergh, and our former student Maarten Marsman for helpful comments and discussions.

Author information

Authors and Affiliations

Cito, Amsterdamseweg 13, Arnhem, The Netherlands
Timo M. Bechger & Gunter Maris
University of Amsterdam, Amsterdam, The Netherlands
Gunter Maris

Corresponding author

Correspondence to Timo M. Bechger.

Appendix

Here is some R-code that can be used after estimates and their asymptotic variance–covariance matrix are available. It assumes two groups, nI items, and a normalization of the form \(\delta _{r,1}=\delta _{r,2}=0\).

par1 and par2 are vectors with parameter estimates, with the \(r\)th entry equal to zero. The variance–covariance matrices are called cov1 and cov2. Their \(r\)th row and column are zero.
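The R listing itself is not included in this text. As a stand-in, the following is a minimal Python sketch, not the authors' code, of one way such inputs could be turned into standardized between-group differences in relative difficulties. It assumes the two groups are independent samples, so that the variances of the two estimates add. The names `par1`, `par2`, `cov1`, and `cov2` follow the description above; the function name `pairwise_dif` and all numeric values are hypothetical.

```python
import math


def pairwise_dif(par1, par2, cov1, cov2):
    """Standardized between-group differences in relative difficulties.

    Entry [i][j] compares delta_i - delta_j across the two groups.
    Assumes independent groups, so the two covariance matrices add.
    """
    n_items = len(par1)
    # Per-item difference between groups and its covariance matrix.
    beta = [a - b for a, b in zip(par1, par2)]
    sigma = [
        [cov1[i][j] + cov2[i][j] for j in range(n_items)]
        for i in range(n_items)
    ]
    d = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue  # an item has no relative difficulty to itself
            # (delta_i1 - delta_j1) - (delta_i2 - delta_j2)
            delta_r = beta[i] - beta[j]
            var = sigma[i][i] + sigma[j][j] - 2.0 * sigma[i][j]
            d[i][j] = delta_r / math.sqrt(var)
    return d


# Hypothetical estimates for three items; item 0 is the reference item r,
# so its entry is zero and its row and column in cov1 and cov2 are zero.
par1 = [0.0, 0.5, -0.3]
par2 = [0.0, 0.9, -0.3]
cov1 = [[0.0, 0.0, 0.0], [0.0, 0.02, 0.005], [0.0, 0.005, 0.03]]
cov2 = [[0.0, 0.0, 0.0], [0.0, 0.02, 0.005], [0.0, 0.005, 0.03]]

d = pairwise_dif(par1, par2, cov1, cov2)
for row in d:
    print([round(x, 3) for x in row])
```

The resulting matrix is antisymmetric, and under the usual asymptotic assumptions a large absolute entry would suggest that the corresponding item pair functions differently in the two groups. In this made-up example only pairs involving the second item stand out.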

About this article

Cite this article

Bechger, T.M., Maris, G. A Statistical Test for Differential Item Pair Functioning. Psychometrika 80, 317–340 (2015). https://doi.org/10.1007/s11336-014-9408-y

Received: 03 September 2012
Accepted: 20 February 2014
Published: 16 September 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s11336-014-9408-y
