Quantitative Data Analysis for Language Assessment Volume I: Fundamental Techniques, 1st Edition
Contents

Introduction
Vahid Aryadoust and Michelle Raquel
SECTION I
Test development, reliability, and generalizability
SECTION II
Unidimensional Rasch measurement
SECTION III
Univariate and multivariate statistical analysis
Index
Preface

The two volumes of Quantitative Data Analysis for Language Assessment (Fundamental Techniques and Advanced Methods), together with the Companion website, were motivated by the growing need for a comprehensive sourcebook of quantitative data analysis for the language assessment community. As the focus on developing valid and useful assessments continues to intensify in different parts of the world, a robust and sound knowledge of quantitative methods has become an increasingly essential requirement. This is particularly important given that one of the community's responsibilities is to develop language assessments that have evidence of validity, fairness, and reliability. We believe this can be achieved primarily by leveraging quantitative data analysis in test development and validation efforts.
It has been the contributors' intention to write the chapters with an eye toward what professors, graduate students, and test-development companies need. The chapters progress gradually from fundamental concepts to advanced topics, making the volumes suitable reference books for professors who teach quantitative methods. If the content of the volumes is too extensive for a single course, we suggest that professors use them across two semesters or choose the chapters that fit the focus and scope of their courses. For graduate students who have just embarked on their studies or are writing dissertations or theses, the two volumes serve as a cogent and accessible introduction to the methods that are often used in assessment development and validation research. For organizations in the test-development business, the volumes provide unique topic coverage and examples of how the methods are applied to the small- and large-scale language tests that such organizations often deal with.
We would like to thank all of the authors who contributed their expertise in language assessment and quantitative methods. This collaboration has allowed us to emphasize the growing interdisciplinarity of language assessment, which draws knowledge and information from many different fields. We wish to acknowledge that, in addition to editorial reviews, each chapter was subjected to rigorous double-blind peer review. We extend a special note of thanks to a number of colleagues who helped us during the review process:
We hope that readers will find the volumes useful in their research and pedagogy.
Vahid Aryadoust and Michelle Raquel
Editors
April 2019
Conclusion
In sum, Volumes I and II present 23 fundamental and advanced quantitative methods and their applications in language testing research. An important factor to consider in choosing among these methods is the role of theory and the nature of the research questions. Although some researchers may be drawn to advanced methods because they can provide stronger evidence to support validity and reliability claims, in some cases less complex methods better serve their needs. Nevertheless, oversimplifying a research problem can lead to overlooking significant sources of variation in the data and drawing wrong or naïve inferences. The authors of the chapters have therefore emphasized that the first step in choosing a method is to postulate a theoretical framework that specifies the relationships among the variables, processes, and mechanisms of the attributes under investigation. Only after establishing the theoretical framework should one proceed to select quantitative methods to test the hypotheses of the study. To this end, the chapters in the volumes provide step-by-step guidelines to achieve accuracy and
precision in choosing and conducting the relevant quantitative techniques. We are confident that the authors' joint effort has emphasized the research rigor required in the field and highlighted the strengths and weaknesses of the data analysis techniques.
References
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge: Cambridge University Press.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the Test of English as a Foreign Language. New York, NY: Routledge.