
Using MFRM and SEM in the Validation of Analytic Rating Scales of an English Speaking Assessment

Conference paper in: Pacific Rim Objective Measurement Symposium (PROMS) 2015 Conference Proceedings

Abstract

This study reports a preliminary investigation into the construct validity of an analytic rating scale developed for a school-based English speaking test. Informed by the theory of interpretative validity argument, the study examined the plausibility and accuracy of three warrants deemed essential to the construct validity of the rating scale. Methodologically, it used the Many-Facets Rasch Model (MFRM) and Structural Equation Modeling (SEM) in combination to examine the three warrants and their respective rebuttals. Although the MFRM analysis largely supported the first two warrants, the results indicated that the category structure of the rating scale did not function as intended and therefore needed further revision. In the SEM analysis, a multitrait-multimethod (MTMM) confirmatory factor analysis (CFA) approach was employed, whereby four MTMM models were specified, evaluated, and compared. The results lent support to the third warrant but raised legitimate concerns over common method bias. The study has implications for future revisions of the rating scale and the speaking assessment in the interest of improved validity, as well as methodological implications for constructors of performance assessments and validators of rating scales.
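For readers less familiar with the two modeling techniques, the following sketches give the standard model forms on which analyses of this kind are typically based. They are illustrative only: the facet labels and symbols are generic conventions, not the specific rating design or parameter estimates of the speaking test studied here.

In a many-facets Rasch analysis of an analytic rating scale, the log-odds of examinee n receiving category k rather than k−1 on criterion i from rater j are commonly modeled as

\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]

where B_n is the ability of examinee n, D_i the difficulty of criterion i, C_j the severity of rater j, and F_k the threshold of category k. The category-structure warrant concerns the F_k terms: thresholds should advance monotonically as k increases.

An MTMM CFA model, in one common (correlated trait, correlated method) specification, decomposes each observed score X_{ij} on trait i measured by method j as

\[
X_{ij} = \lambda^{T}_{ij}\, T_i + \lambda^{M}_{ij}\, M_j + \varepsilon_{ij}
\]

Convergent validity is indicated by substantial trait loadings λ^T, common method bias by substantial method loadings λ^M, and discriminant validity by trait-factor correlations well below unity.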


Notes

  1. CFI: Comparative Fit Index; GFI: Goodness-of-Fit Index; SRMR: Standardized Root Mean Square Residual; RMSEA: Root Mean Square Error of Approximation (standard definitions of two of these indices are sketched after these notes).

  2. The numbers in brackets indicate the thresholds for acceptable goodness of fit between the model and the empirical data.

  3. Typical annual undergraduate enrollment at FDU is around 3000.
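As a point of reference, the conventional definitions of two of the indices in Note 1 are given below; these are the standard textbook formulas, not values or computations specific to this study. With χ²_M and df_M denoting the chi-square statistic and degrees of freedom of the specified model, χ²_B and df_B those of the baseline (null) model, and N the sample size:

\[
\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N-1)}},
\qquad
\mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_B - df_B,\; \chi^2_M - df_M,\; 0)}
\]

Both indices penalize model misfit relative to degrees of freedom; RMSEA additionally shrinks with sample size, while CFI expresses the model's improvement over the baseline model.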


Acknowledgments

The study reported in this chapter was supported by the National Social Sciences Fund of the People’s Republic of China under the project title of “Development and Validation of Standards in Language Testing” (Grant No: 13CYY032), and the Research Project of National Foreign Language Teaching in Higher Education under the project title of “Teacher-, Peer-, and Self-assessment in Translation Teaching: A Many-Facets Rasch Modeling Approach” (Grant No: 2014SH0008A). Part of this research was published in the third issue of Foreign Language Education in China (Quarterly) in 2015.

Author information

Correspondence to Jinsong Fan.



Copyright information

© 2016 Springer Science+Business Media Singapore

About this paper

Cite this paper

Fan, J., Bond, T. (2016). Using MFRM and SEM in the Validation of Analytic Rating Scales of an English Speaking Assessment. In: Zhang, Q. (Ed.), Pacific Rim Objective Measurement Symposium (PROMS) 2015 Conference Proceedings. Springer, Singapore. https://doi.org/10.1007/978-981-10-1687-5_3

