Abstract
This study reports a preliminary investigation into the construct validity of an analytic rating scale developed for a school-based English speaking test. Informed by the theory of interpretative validity argument, the study examined the plausibility and accuracy of three warrants deemed essential to the construct validity of the rating scale. Methodologically, it used the Many-Facets Rasch Model (MFRM) and Structural Equation Modeling (SEM) in conjunction to examine the three warrants and their respective rebuttals. Although the MFRM analysis largely supported the first two warrants, the results indicated that the category structure of the rating scale did not function as intended and hence needed further revision. In the SEM analysis, a multitrait-multimethod (MTMM) confirmatory factor analysis (CFA) design was employed, whereby four MTMM models were specified, evaluated, and compared. The results lent support to the third warrant but raised legitimate concerns over common method bias. The study has implications for future revisions of the rating scale and the speaking assessment in the interest of improved validity; it also has methodological implications for developers of performance assessments and validators of rating scales.
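To make the MFRM approach concrete, the following is a minimal sketch of the many-facets rating scale model that underlies this kind of analysis: the log-odds of an examinee receiving category k rather than k-1 is modeled as examinee ability minus task difficulty, rater severity, and the category threshold. All parameter values below are illustrative, not estimates from this study.

```python
import math

def category_probs(ability, task_difficulty, rater_severity, thresholds):
    """Probabilities of each rating category under a many-facets
    rating scale model (Andrich-style thresholds).

    thresholds: list of K step parameters tau_1..tau_K for a
    (K+1)-category scale; tau_0 is fixed at 0.
    """
    # Net logit for this examinee-task-rater combination.
    logit = ability - task_difficulty - rater_severity
    # Cumulative sums of (logit - tau_m) form the log-numerators.
    cum = [0.0]  # category 0
    for tau in thresholds:
        cum.append(cum[-1] + logit - tau)
    exps = [math.exp(c) for c in cum]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: able examinee (1.5 logits), lenient rater (-0.5)
probs = category_probs(ability=1.5, task_difficulty=0.0,
                       rater_severity=-0.5, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])  # highest category is most probable
```

A disordered or rarely modal category in such a model is the kind of evidence that leads to the "category structure did not function as intended" finding reported above.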
Notes
1. CFI: Comparative Fit Index; GFI: Goodness of Fit Index; SRMR: Standardized Root Mean Square Residual; RMSEA: Root Mean Square Error of Approximation.
2. The numbers in brackets indicate the cutoff values for acceptable goodness of fit between the model and the empirical data.
3. Typical annual undergraduate enrollment at FDU is around 3000.
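As a sketch of how the fit indices in note 1 are screened against cutoffs, the snippet below encodes the commonly cited guidelines (CFI and GFI at or above .90, SRMR and RMSEA at or below .08). These are conventional values from the SEM literature, not the specific cutoffs used in this study.

```python
# Conventional cutoffs for SEM goodness-of-fit indices; values are
# widely cited guidelines, not thresholds taken from this chapter.
CUTOFFS = {
    "CFI":   ("min", 0.90),  # Comparative Fit Index >= .90
    "GFI":   ("min", 0.90),  # Goodness of Fit Index >= .90
    "SRMR":  ("max", 0.08),  # Standardized Root Mean Square Residual <= .08
    "RMSEA": ("max", 0.08),  # Root Mean Square Error of Approximation <= .08
}

def failing_indices(indices):
    """Return the subset of fit indices that violate their cutoff."""
    failures = {}
    for name, value in indices.items():
        direction, cutoff = CUTOFFS[name]
        ok = value >= cutoff if direction == "min" else value <= cutoff
        if not ok:
            failures[name] = value
    return failures

# Hypothetical model fit: only RMSEA (0.10) exceeds its cutoff.
print(failing_indices({"CFI": 0.95, "GFI": 0.91,
                       "SRMR": 0.05, "RMSEA": 0.10}))
```

Comparing several MTMM models, as in this study, typically combines such absolute cutoffs with differences in fit between nested models.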
Acknowledgments
The study reported in this chapter was supported by the National Social Sciences Fund of the People’s Republic of China under the project title of “Development and Validation of Standards in Language Testing” (Grant No: 13CYY032), and the Research Project of National Foreign Language Teaching in Higher Education under the project title of “Teacher-, Peer-, and Self-assessment in Translation Teaching: A Many-Facets Rasch Modeling Approach” (Grant No: 2014SH0008A). Part of this research was published in the third issue of Foreign Language Education in China (Quarterly) in 2015.
Copyright information
© 2016 Springer Science+Business Media Singapore
Cite this paper
Fan, J., Bond, T. (2016). Using MFRM and SEM in the Validation of Analytic Rating Scales of an English Speaking Assessment. In: Zhang, Q. (eds) Pacific Rim Objective Measurement Symposium (PROMS) 2015 Conference Proceedings. Springer, Singapore. https://doi.org/10.1007/978-981-10-1687-5_3
DOI: https://doi.org/10.1007/978-981-10-1687-5_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-1686-8
Online ISBN: 978-981-10-1687-5
eBook Packages: Behavioral Science and Psychology (R0)