

Quantitative Data Analysis for
Language Assessment Volume I

Quantitative Data Analysis for Language Assessment Volume I: Fundamental


Techniques is a resource book that presents the most fundamental techniques
of quantitative data analysis in the field of language assessment. Each chapter
provides an accessible explanation of the selected technique, a review of language
assessment studies that have used the technique, and finally, an example of an
authentic study that uses the technique. Readers also get a taste of how to apply
each technique through the help of supplementary online resources that include
sample datasets and guided instructions. Language assessment students, test
designers, and researchers should find this a unique reference, as it consolidates
theory and application of quantitative data analysis in language assessment.

Vahid Aryadoust is an Assistant Professor of language assessment literacy at the


National Institute of Education of Nanyang Technological University, Singapore.
He has led a number of language assessment research projects funded by, for
example, the Ministry of Education (Singapore), Michigan Language Assessment
(USA), Pearson Education (UK), and Paragon Testing Enterprises (Canada), and
published his research in, for example, Language Testing, Language Assessment
Quarterly, Assessing Writing, Educational Assessment, Educational Psychology,
and Computer Assisted Language Learning. He has also (co)authored a number
of book chapters and books that have been published by Routledge, Cambridge
University Press, Springer, Cambridge Scholar Publishing, Wiley Blackwell, etc.
He is a member of the Advisory Board of multiple international journals including
Language Testing, Language Assessment Quarterly, Educational Assessment,
Educational Psychology, and Asia Pacific Journal of Education. In addition, he
has been awarded the Intercontinental Academia Fellowship (2018–2019) which
is an advanced research program launched by the University-Based Institutes
for Advanced Studies. Vahid’s areas of interest include theory-building and
quantitative data analysis in language assessment, neuroimaging in language
comprehension, and eye-tracking research.

Michelle Raquel is a Senior Lecturer at the Centre of Applied English Studies,


University of Hong Kong, where she teaches language testing and assessment to
postgraduate students. She has extensive assessment development and management
experience in the Hong Kong education and government sector. In particular,
she has either led or been part of a group that designed and administered large-
scale computer-based language proficiency and diagnostic assessments such as the
Diagnostic English Language Tracking Assessment (DELTA). She specializes in
data analysis, specifically Rasch measurement, and has published several articles
in international journals on this topic as well as academic English, diagnostic
assessment, dynamic assessment of English second-language dramatic skills,
and English for specific purposes (ESP) testing. Michelle’s research areas are
classroom-based assessment, diagnostic assessment, and workplace assessment.
Routledge Research in Language Education

The Routledge Research in Language Education series provides a platform for


established and emerging scholars to present their latest research and discuss key
issues in Language Education. This series welcomes books on all areas of language
teaching and learning, including but not limited to language education policy
and politics, multilingualism, literacy, L1, L2 or foreign language acquisition,
curriculum, classroom practice, pedagogy, teaching materials, and language
teacher education and development. Books in the series are not limited to the
teaching and learning of English.

Books in the series include

Interdisciplinary Research Approaches to Multilingual Education


Edited by Vasilia Kourtis-Kazoullis, Themistoklis Aravossitas,
Eleni Skourtou and Peter Pericles Trifonas

From Language Skills to Literacy


Broadening the Scope of English Language Education through Media Literacy
Csilla Weninger

Addressing Difficult Situations in Foreign-Language Learning


Confusion, Impoliteness, and Hostility
Gerrard Mugford

Translanguaging in EFL Contexts


A Call for Change
Michael Rabbidge

Quantitative Data Analysis for Language Assessment Volume I


Fundamental Techniques
Edited by Vahid Aryadoust and Michelle Raquel

For more information about the series, please visit www.routledge.com/Routledge-


Research-in-Language-Education/book-series/RRLE
Quantitative Data Analysis
for Language Assessment
Volume I
Fundamental Techniques

Edited by Vahid Aryadoust


and Michelle Raquel
First published 2019
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
52 Vanderbilt Avenue, New York, NY 10017
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2019 selection and editorial matter, Vahid Aryadoust and
Michelle Raquel; individual chapters, the contributors
The right of Vahid Aryadoust and Michelle Raquel to be identified
as the authors of the editorial material, and of the authors for their
individual chapters, has been asserted in accordance with sections 77
and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or
reproduced or utilised in any form or by any electronic, mechanical,
or other means, now known or hereafter invented, including
photocopying and recording, or in any information storage or
retrieval system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks
or registered trademarks, and are used only for identification and
explanation without intent to infringe.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested
ISBN: 978-1-138-73312-1 (hbk)
ISBN: 978-1-315-18781-5 (ebk)
Typeset in Galliard
by Apex CoVantage, LLC
Visit the eResources: www.routledge.com/9781138733121
Contents

List of figures vii


List of tables ix
Preface xi
Editor and contributor biographies xiii

Introduction 1
VAHID ARYADOUST AND MICHELLE RAQUEL

SECTION I
Test development, reliability, and generalizability 13

1 Item analysis in language assessment 15


RITA GREEN

2 Univariate generalizability theory in language assessment 30


YASUYO SAWAKI AND XIAOMING XI

3 Multivariate generalizability theory in language assessment 54


KIRBY C. GRABOWSKI AND RONGCHAN LIN

SECTION II
Unidimensional Rasch measurement 81

4 Applying Rasch measurement in language assessment:


unidimensionality and local independence 83
JASON FAN AND TREVOR BOND

5 The Rasch measurement approach to differential item


functioning (DIF) analysis in language assessment research 103
MICHELLE RAQUEL
6 Application of the rating scale model and the partial credit
model in language assessment research 132
IKKYU CHOI

7 Many-facet Rasch measurement: implications for


rater-mediated language assessment 153
THOMAS ECKES

SECTION III
Univariate and multivariate statistical analysis 177

8 Analysis of differences between groups: the t-test and


the analysis of variance (ANOVA) in language assessment 179
TUĞBA ELIF TOPRAK

9 Application of ANCOVA and MANCOVA in language


assessment research 198
ZHI LI AND MICHELLE Y. CHEN

10 Application of linear regression in language assessment 219


DAERYONG SEO AND HUSEIN TAHERBHAI

11 Application of exploratory factor analysis in language


assessment 243
LIMEI ZHANG AND WENSHU LUO

Index 262
Figures

1.1 Facility values and distracter analysis 21


1.2 Discrimination indices 22
1.3 Facility values, discrimination, and internal consistency
(reliability) 23
1.4 Task statistics 23
1.5 Distracter problems 24
2.1 A one-facet crossed design example 36
2.2 A two-facet crossed design example 37
2.3 A two-facet partially nested design example 38
3.1 Observed-score variance as conceptualized through CTT 55
3.2 Observed-score variance as conceptualized through G theory 56
4.1 Wright map presenting item and person measures 93
4.2 Standardized residual first contrast plot 96
5.1 ICC of an item with uniform DIF 105
5.2 ICC of an item with non-uniform DIF 106
5.3 Standardized residual plot of 1st contrast 115
5.4 ETS DIF categorization of DIF items based on DIF size
and statistical significance 116
5.5 Sample ICC of item with uniform DIF (positive DIF contrast) 119
5.6 Sample ICC of item with uniform DIF (negative DIF contrast) 119
5.7 Macau high-ability students (M2) vs. HK high-ability
students (H2) sample ICCs of an item with NUDIF
(positive DIF contrast) 121
5.8 Macau high-ability students (M2) vs. HK high-ability
students (H2) sample ICCs of an item with NUDIF
(negative DIF contrast) 121
5.9 Plot diagram of person measures with vs. without DIF items 124
6.1 Illustration of the RSM assumption 136
6.2 Distributions of item responses 140
6.3 Estimated response probabilities for Items 1, 2, and 3 from
the RSM (dotted lines) and the PCM (solid lines) 143
6.4 Estimated standard errors for person parameters and test
information from the RSM (dotted lines) and the PCM
(solid lines) 145
6.5 Estimated response probabilities for Items 6, 7 and 8 from
the RSM (dotted lines) and the PCM (solid lines),
with observed response proportions (unfilled circles) 146
7.1 The basic structure of rater-mediated assessments 154
7.2 Fictitious dichotomous data: Responses of seven test takers
to five items scored as correct (1) or incorrect (0) 155
7.3 Illustration of a two-facet dichotomous Rasch model
(log odds form) 156
7.4 Fictitious polytomous data: Responses of seven test takers
evaluated by three raters on five criteria using a five-category
rating scale 157
7.5 Illustration of a three-facet rating scale measurement model
(log odds form) 157
7.6 Studying facet interrelations within a MFRM framework 160
7.7 Wright map for the three-facet rating scale analysis of the sample
data (FACETS output, Table 6.0: All facet vertical “rulers”) 162
7.8 Illustration of the MFRM score adjustment procedure 167
9.1 An example of boxplots 202
9.2 Temporal distribution of ANCOVA/MANCOVA-based
publications in four language assessment journals 204
9.3 A matrix of scatter plots 212
10.1 Plot of regression line graphed on a two-dimensional chart
representing X and Y axes 221
10.2 Plot of residuals vs. predicted Y scores where the assumption
of linearity holds for the distribution of random errors 224
10.3 Plot of residuals vs. predicted Y scores where the assumption
of linearity does not hold 224
10.4 Plot of standardized residuals vs. predicted values of the
dependent variable that depicts a violation of homoscedasticity 227
10.5 Histogram of residuals 237
10.6 Plot of predicted values vs. residuals 238
11.1 Steps in running EFA 245
11.2 Scatter plots to illustrate relationships between variables 247
11.3 Scree plot for the ReTSUQ data 256
Tables

2.1 Key Steps for Conducting a G Theory Analysis 34


2.2 Data Setup for the p × i Study Example With 30 Items (n = 35) 42
2.3 Expected Mean Square (EMS) Equations (the p × i Study Design) 42
2.4 G-study Results (the p × i Study Design) 43
2.5 D-study Results (the p × I Study Design) 44
2.6 Rating Design for the Sample Application 47
2.7 G- and D-study Variance Component Estimates for the p × r′
Design (Rating Method) 48
2.8 G- and D-study Variance Component Estimates for the p × r
Design (Subdividing Method) 49
3.1 Areas of Investigation and Associated Research Questions 59
3.2 Research Questions and Relevant Output to Examine 63
3.3 Variance Component Estimates for the Four Subscales
(p• × T • × R• Design; 2 Tasks and 2 Raters) 69
3.4 Variance and Covariance Component Estimates for the Four
Subscales (p• × T • × R• Design; 2 Tasks and 2 Raters) 72
3.5 G Coefficients for the Four Subscales (p• × T • × R• Design) 73
3.6 Universe-Score Correlations Between the Four Subscales
(p• × T • × R• Design) 74
3.7 Effective Weights of Each Subscale to the Composite
Universe-Score Variance (p• × T • × R• Design) 74
3.8 Generalizability Coefficients for the Subscales When Varying
the Number of Tasks (p• × T • × R• Design) 75
4.1 Structure of the FET Listening Test 90
4.2 Summary Statistics for the Rasch Analysis 92
4.3 Rasch Item Measures and Fit Statistics (N = 106) 94
4.4 Standardized Residual Variance 96
4.5 Largest Standardized Residual Correlations 98
5.1 Selected Language-Related DIF Studies 109
5.2 Listening Sub-skills in the DELTA Listening Test 112
5.3 Rasch Analysis Summary Statistics (N = 2,524) 113
5.4 Principal Component Analysis of Residuals 114
5.5 Approximate Relationships Between the Person Measures
in PCAR Analysis 115
5.6 Items With Uniform DIF 118
5.7 Number of NUDIF Items 120
5.8 Texts Identified by Expert Panel to Potentially Disadvantage
Macau Students 122
6.1 Item Threshold Parameter Estimates (With Standard Errors in
the Parentheses) From the RSM and the PCM 142
6.2 Person Parameter Estimates and Test Information From the
RSM and the PCM 144
6.3 Infit and Outfit Mean Square Values From the RSM and
the PCM 146
7.1 Excerpt From the FACETS Rater Measurement Report 164
7.2 Excerpt From the FACETS Test Taker Measurement Report 166
7.3 Excerpt From the FACETS Criterion Measurement Report 168
7.4 Separation Statistics and Facet-Specific Interpretations 169
8.1 Application of the t-Test in the Field of Language Assessment 183
8.2 Application of ANOVA in Language Testing and Assessment 189
8.3 Descriptive Statistics for Groups’ Performances on the
Reading Test 193
9.1 Summary of Assumptions of ANCOVA and MANCOVA 201
9.2 Descriptive Statistics of the Selected Sample 209
9.3 Descriptive Statistics of Overall Reading Performance
and Attitude by Sex-Group 210
9.4 ANCOVA Summary Table 210
9.5 Estimated Marginal Means 211
9.6 Descriptive Statistics of Reading Performance and Attitude
by Sex-Group 213
9.7 MANCOVA Summary Table 213
9.8 Summary Table for ANCOVAs of Each Reading Subscale 214
10.1 Internal and External Factors Affecting ELLs’ Language
Proficiency 231
10.2 Correlation Matrix of the Dependent Variable and the
Independent Variables 234
10.3 Summary of Stepwise Selection 235
10.4 Analysis of Variance Output for Regression, Including All
the Variables 235
10.5 Parameter Estimates With Speaking Included 235
10.6 Analysis of Variance Without Including Speaking 236
10.7 Parameter Estimates of the Four Predictive Variables, Together
With Their Variation Inflation Function 236
10.8 Partial Output as an Example of Residuals and Predicted Values
(a Model Without Speaking) 237
11.1 Categories and Numbers of Items in the ReTSUQ 253
11.2 Part of the Result of Unrotated Principal Component Analysis 255
11.3 The Rotated Pattern Matrix of the ReTSUQ (n = 650) 257
Preface

The two volumes of Quantitative Data Analysis for Language Assessment (Funda-
mental Techniques and Advanced Methods), together with the companion web-
site, were motivated by the growing need for a comprehensive sourcebook of
quantitative data analysis for the community of language assessment. As the focus on
developing valid and useful assessments continues to intensify in different parts of the
world, having a robust and sound knowledge of quantitative methods has become
an increasingly essential requirement. This is particularly important given that one
of the community’s responsibilities is to develop language assessments that have evi-
dence of validity, fairness, and reliability. We believe this would be achieved primarily
by leveraging quantitative data analysis in test development and validation efforts.
It has been the contributors’ intention to write the chapters with an eye toward
what professors, graduate students, and test-development companies would need.
The chapters progress gradually from fundamental concepts to advanced topics,
making the volumes suitable reference books for professors who teach quantitative
methods. If the content of the volumes is too heavy for teaching in one course,
we would suggest professors consider using them across two semesters, or alterna-
tively choose any chapters that fit the focus and scope of their courses. For gradu-
ate students who have just embarked on their studies or are writing dissertations
or theses, the two volumes would serve as a cogent and accessible introduction
to the methods that are often used in assessment development and validation
research. For organizations in the test-development business, the volumes provide
a unique topic coverage and examples of applications of the methods in small- and
large-scale language tests that such organizations often deal with.
We would like to thank all of the authors who contributed their expertise in
language assessment and quantitative methods. This collaboration has allowed us
to emphasize the growing interdisciplinarity in language assessment that draws
knowledge and information from many different fields. We wish to acknowledge
that in addition to editorial reviews, each chapter has been subjected to rigorous
double-blind peer review. We extend a special note of thanks to a number of col-
leagues who helped us during the review process:

Beth Ann O’Brien, National Institute of Education, Singapore


Christian Spoden, The German Institute for Adult Education, Leibniz Centre
for Lifelong Learning, Germany
Tuğba Elif Toprak, Izmir Bakircay University, Turkey
Guangwei Hu, Hong Kong Polytechnic University, Hong Kong
Hamdollah Ravand, Vali-e-Asr University of Rafsanjan, Iran
Ikkyu Choi, Educational Testing Service, USA
Kirby C. Grabowski, Teachers College Columbia University, USA
Mehdi Riazi, Macquarie University, Australia
Moritz Heene, Ludwig-Maximilians-Universität München, Germany
Purya Baghaei, Islamic Azad University of Mashad, Iran
Shane Phillipson, Monash University, Australia
Shangchao Min, Zhejiang University, China
Thomas Eckes, Gesellschaft für Akademische Studienvorbereitung und Testentwicklung
e. V., c/o TestDaF-Institut, Ruhr-Universität Bochum, Germany
Trevor Bond, James Cook University, Australia
Wenshu Luo, National Institute of Education, Singapore
Yan Zi, The Education University of Hong Kong, Hong Kong
Yasuyo Sawaki, Waseda University, Japan
Yo In’nami, Chuo University, Japan
Zhang Jie, Shanghai University of Finance and Economics, China

We hope that the readers will find the volumes useful in their research and pedagogy.
Vahid Aryadoust and Michelle Raquel
Editors
April 2019
Editor and contributor
biographies

Vahid Aryadoust is Assistant Professor of language assessment literacy at the


National Institute of Education of Nanyang Technological University, Singa-
pore. He has led a number of language assessment research projects funded
by, for example, the Ministry of Education (Singapore), Michigan Language
Assessment (USA), Pearson Education (UK), and Paragon Testing Enterprises
(Canada), and published his research in Language Testing, Language Assess-
ment Quarterly, Assessing Writing, Educational Assessment, Educational Psy-
chology, and Computer Assisted Language Learning. He has also (co)authored
a number of book chapters and books that have been published by Routledge,
Cambridge University Press, Springer, Cambridge Scholar Publishing, Wiley
Blackwell, etc.
Trevor Bond is an Adjunct Professor in the College of Arts, Society and Educa-
tion at James Cook University Australia and the senior author of the book
Applying the Rasch Model: Fundamental Measurement in the Human Sciences.
He consults with language assessment researchers in Hong Kong and Japan
and with high-stakes testing teams in the US, Malaysia, and the UK. In 2005,
he instigated the Pacific Rim Objective Measurement Symposia (PROMS),
now held annually across East Asia. He is a regular keynote speaker at inter-
national measurement conferences, runs Rasch measurement workshops, and
serves as a specialist reviewer for academic journals.
Michelle Y. Chen is a research psychometrician at Paragon Testing Enterprises.
She received her Ph.D. in measurement, evaluation, and research method-
ology from the University of British Columbia (UBC). She is interested in
research that allows her to collaborate and apply psychometric and statistical
techniques. Her research focuses on applied psychometrics, validation, and
language testing.
Ikkyu Choi is a Research Scientist in the Center for English Language Learn-
ing and Assessment at Educational Testing Service. He received his Ph.D. in
applied linguistics from the University of California, Los Angeles in 2013, with
a specialization in language assessment. His research interests include second-
language development profiles, test-taking processes, scoring of constructed
responses, and quantitative research methods for language assessment data.
Thomas Eckes is Head of the Psychometrics and Language Testing Research
Department, TestDaF Institute, University of Bochum, Germany. His research
focuses on psychometric modeling of language competencies, rater effects in
large-scale assessments, and the development and validation of web-based lan-
guage placement tests. He is on the editorial boards of the journals Language
Testing and Assessing Writing. His book Introduction to Many-Facet Rasch
Measurement (Peter Lang) appeared in 2015 in a second, expanded edition.
He was also guest editor of a special issue on advances in IRT modeling of rater
effects (Psychological Test and Assessment Modeling, Parts I & II, 2017, 2018).
Jason Fan is a Research Fellow at the Language Testing Research Centre (LTRC)
at the University of Melbourne, and before that, an Associate Professor at
College of Foreign Languages and Literature, Fudan University. His research
interests include the validation of language assessments and quantitative
research methods. He is the author of Development and Validation of Stan-
dards in Language Testing (Shanghai: Fudan University Press, 2018) and the
co-author (with Tim McNamara and Ute Knoch) of Fairness and Justice in
Language Assessment: The Role of Measurement (Oxford: Oxford University
Press, 2019, in press).
Kirby C. Grabowski is Adjunct Assistant Professor of Applied Linguistics and
TESOL at Teachers College, Columbia University, where she teaches courses
on second-language assessment, performance assessment, generalizability the-
ory, pragmatics assessment, research methods, and linguistics. Dr. Grabowski is
currently on the editorial advisory board of Language Assessment Quarterly and
formerly served on the Board of the International Language Testing Associa-
tion as Member-at-Large. Dr. Grabowski was a Spaan Fellow for the English
Language Institute at the University of Michigan, and she received the 2011
Jacqueline Ross TOEFL Dissertation Award for outstanding doctoral disser-
tation in second/foreign language testing from Educational Testing Service.
Rita Green is a Visiting Teaching Fellow at Lancaster University, UK. She is an
expert in the field of language testing and has trained test development teams for
more than 30 years in numerous projects around the world including those in
the fields of education, diplomacy, air traffic control, and the military. She is the
author of Statistical Analyses for Language Testers (2013) and Designing Listen-
ing Tests: A Practical Approach (2017), both published by Palgrave Macmillan.
Zhi Li is an Assistant Professor in the Department of Linguistics at the University
of Saskatchewan (UoS), Canada. Before joining UoS, he worked as a language
assessment specialist at Paragon Testing Enterprises, Canada, and a sessional
instructor in the Department of Adult Learning at the University of the Fra-
ser Valley, Canada. Zhi Li holds a doctoral degree in applied linguistics and
technology from Iowa State University, USA. His research interests include
language assessment, technology-supported language teaching and learning,
corpus linguistics, and computational linguistics. His research papers have been
published in System, CALICO Journal, and Language Learning & Technology.
Rongchan Lin is a Lecturer at National Institute of Education, Nanyang Tech-
nological University, Singapore. She has received awards and scholarships
such as the 2017 Asian Association for Language Assessment Best Student
Paper Award, the 2016 and 2017 Confucius China Studies Program Joint
Research Ph.D. Fellowship, the 2014 Tan Ean Kiam Postgraduate Scholarship
(Humanities), and the 2012 Tan Kah Kee Postgraduate Scholarship. She was
named the 2016 Joan Findlay Dunham Annual Fund Scholar by Teachers Col-
lege, Columbia University. Her research interests include integrated language
assessment and rubric design.
Wenshu Luo is an Assistant Professor at National Institute of Education (NIE),
Nanyang Technological University, Singapore. She obtained her Ph.D. in edu-
cational psychology from the University of Hong Kong. She teaches quantita-
tive research methods and educational assessment across a number of programs
for in-service teachers in NIE. She is an active researcher in student motivation
and engagement and has published a number of papers in top journals in this
area. She is also enthusiastic to find out how cultural and contextual factors
influence students’ learning, such as school culture, leadership practices, class-
room practices, and parenting practices.
Michelle Raquel is a Senior Lecturer at the Centre of Applied English Studies,
University of Hong Kong, where she teaches language testing and assessment
to postgraduate students. She has worked in several tertiary institutions in
Hong Kong as an assessment developer and has either led or been part of a
group that designed and administered large-scale diagnostic and language pro-
ficiency assessments. She has published several articles in international journals
on academic English diagnostic assessment, ESL testing of reading and writ-
ing, dynamic assessment of second-language dramatic skills, and English for
specific purposes (ESP) testing.
Yasuyo Sawaki is a Professor of Applied Linguistics at the School of Education,
Waseda University in Tokyo, Japan. Sawaki is interested in a variety of research
topics in language assessment ranging from the validation of large-scale inter-
national English language assessments to the role of assessment in classroom
English language instruction. Her current research topics include examining
summary writing performance of university-level Japanese learners of English
and English-language demands in undergraduate- and graduate-level content
courses at universities in Japan.
Daeryong Seo is a Senior Research Scientist at Pearson. He has led various
state assessments and brings international psychometric experience through
his work with the Australian NAPLAN and Global Scale of English. He has
published several studies in international journals and presented numerous
psychometric issues at international conferences, such as the American Edu-
cational Research Association (AERA). He also served as a Program Chair
of the Rasch special-interest group, AERA. In 2013, he and Dr. Taherb-
hai received an outstanding paper award from the California Educational
Research Association. Their paper is titled “What Makes High School Asian
English Learners Tick?”
Husein Taherbhai is a retired Principal Research Scientist who led large-scale
assessments in the U.S. for states, such as Arizona, Washington, New York,
Maryland, Virginia, Tennessee, etc., and for the National Physical Therapists’
Association’s licensure examination. Internationally, Dr. Taherbhai led the
Educational Quality and Assessment Office in Ontario, Canada, and worked
for the Central Board of Secondary Education’s Assessment in India. He has
published in various scientific journals and has reviewed and presented at the
NCME, AERA, and Florida State conferences with papers relating to language
learners, rater effects, and students’ equity and growth in education.
Tuğba Elif Toprak is an Assistant Professor of Applied Linguistics/ELT at
Izmir Bakircay University, Izmir, Turkey. Her primary research interests are
implementing cognitive diagnostic assessment by using contemporary item
response theory models and blending cognition with language assessment in
her research. Dr. Toprak has been collaborating with international researchers
on several research projects that are largely situated in the fields of language
assessment, psycholinguistics, and the learning sciences. Her current research
interest includes intelligent real-time assessment systems, in which she com-
bines techniques from several areas such as the learning sciences, cognitive
science, and psychometrics.
Xiaoming Xi is Executive Director of Global Education and Workforce at ETS.
Her research spans broad areas of theory and practice, including validity, fair-
ness, test validation methods, approaches to defining test constructs, validity
frameworks for automated scoring, automated scoring of speech, the role of
technology in language assessment and learning, and test design, rater, and
scoring issues. She is co-editor of the Routledge book series Innovations in
Language Learning and Assessment and is on the Editorial Boards of Lan-
guage Testing and Language Assessment Quarterly. She received her Ph.D. in
language assessment from UCLA.
Limei Zhang is a Lecturer at the Singapore Centre for Chinese Language, Nanyang
Technological University. She obtained her Ph.D. in applied linguistics
with emphasis on language assessment from National Institute of Education,
Nanyang Technological University. Her research interests include language
assessment literacy, reading and writing assessment, and learners’ metacogni-
tion. She has published papers in journals including The Asia-Pacific Education
Researcher, Language Assessment Quarterly, and Language Testing. Her most
recent book is Metacognitive and Cognitive Strategy Use and EFL Reading Test
Performance: A Study of Chinese College Students (Springer).
Introduction
Vahid Aryadoust and Michelle Raquel

Quantitative techniques are mainstream components in most of the published


literature in language assessment, as they are essential in test development and
validation research (Chapelle, Enright, & Jamieson, 2008). There are three fami-
lies of quantitative methods adopted in language assessment research: measure-
ment models, statistical methods, and data mining (although admittedly, drawing
definite boundaries between these categories is not always feasible).
Borsboom (2005) proposes that measurement models, the first family of quan-
titative methods in language assessment, fall within the paradigms of classical test
theory (CTT), Rasch measurement, or item response theory (IRT). The common
feature of the three measurement techniques is that they are intended to predict
outcomes of cognitive, educational, and psychological testing. However, they do
have significant differences in their underlying assumptions and applications. CTT
is founded on true scores, which can be estimated by using the error of measure-
ment and observed scores. Internal consistency reliability and generalizability
theory are also formulated based on CTT premises. Rasch measurement and IRT,
on the other hand, are probabilistic models that are used for the measurement of
latent variables – attributes that are not directly observed. There are a number of
unidimensional Rasch and IRT models, which assume the attribute underlying
test performance comprises only one measurable feature. There are also multidi-
mensional models that postulate that tests measure multiple, separable latent
variables. Determining whether a test is unidimensional or multidimen-
sional requires theoretical grounding, the application of sophisticated quantitative
methods, and an evaluation of the test context. For example, multidimensional
tests can be used to provide fine-grained diagnostic information to stakeholders,
and thus a multidimensional IRT model can be used to derive useful diagnostic
information from test scores. In the current two volumes, CTT and unidimen-
sional Rasch models are discussed in Volume I, and multidimensional techniques
are covered in Volume II.
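To make the probabilistic logic of these models concrete, the short sketch below computes the probability of a correct response under the dichotomous Rasch model. It is a generic illustration rather than code from any chapter, and the ability and difficulty values are invented.

```python
import numpy as np

def rasch_probability(theta, b):
    """Probability of a correct response under the dichotomous Rasch model,
    given person ability (theta) and item difficulty (b), both in logits."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Hypothetical values: a person of average ability (0 logits) attempting
# an easy item (-1 logit) and a difficult item (+2 logits).
print(rasch_probability(0.0, -1.0))  # approximately 0.73
print(rasch_probability(0.0, 2.0))   # approximately 0.12
```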
The second group of methods is statistical and consists of the commonly
used methods in language assessment such as t-tests, analysis of variance
(ANOVA), analysis of covariance (ANCOVA), multivariate analysis of covari-
ance (MANCOVA), regression models, and factor analysis, which are cov-
ered in Volume I. In addition, multilevel modeling and structural equation
modeling (SEM) are presented in Volume II. The research questions that
these techniques aim to address range from comparing average performances
of test takers to prediction and data reduction. The third group of models
falls under the umbrella of data mining techniques, which we believe remain
relatively underresearched and underutilized in language assess-
ment. Volume II presents two data mining methods: classification and regres-
sion trees (CART) and the evolutionary algorithm-based symbolic regression,
both of which are used for prediction and classification. These methods detect
the relationship between dependent and independent variables in the form of
mathematical functions which confirm postulated relationships between vari-
ables across separate datasets. This feature of the two data mining techniques,
discussed in Volume II, improves the precision and generalizability of the
detected relationships.
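As a rough illustration of the kind of prediction task these data mining methods address, the sketch below fits a shallow regression tree with scikit-learn to simulated scores. The variables and data are invented for demonstration and do not come from the chapters themselves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Simulated predictors (e.g., vocabulary size and reading speed) and a
# simulated dependent variable (e.g., a comprehension score).
X = rng.normal(size=(200, 2))
y = 50 + 8 * X[:, 0] + 4 * X[:, 1] + rng.normal(scale=5, size=200)

# A shallow tree keeps the fitted prediction rule easy to inspect.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(tree.predict(X[:5]))
```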
We provide an overview of the two volumes in the next sections.

Quantitative Data Analysis for Language Assessment


Volume I: Fundamental Techniques
This volume comprises 11 chapters contributed by a number of experts in
the field of language assessment and quantitative data analysis techniques. The
aim of the volume is to revisit the fundamental quantitative topics that have
been used in the language assessment literature and shed light on their rationales
and assumptions. This is achieved through delineating the technique covered in
each chapter, providing a (brief) review of its application in previous language
assessment research, and giving a theory-driven example of the application of the
technique. The chapters in Volume I are grouped into three main sections, which
are discussed below.

Section I. Test development, reliability, and generalizability

Chapter 1: Item analysis in language assessment (Rita Green)


This chapter deals with a fundamental but, as Rita Green notes, often-
delayed step in language test development. Item analysis is a quantitative method
that allows test developers to examine the quality of test items, i.e., which test
items are working well (assessing the construct they are meant to
assess) and which items should be revised or dropped to improve overall test
reliability. Unfortunately, as the author notes, this step is commonly carried out after
a test has been administered rather than when items have just been developed. The
chapter starts with an explanation of the importance of this method at the test-
development stage. Then several language testing studies that have utilized this
method to investigate test validity and reliability, to improve standard-setting ses-
sions, and to investigate the impact of test format and different testing conditions
on test taker performance are reviewed. The author further emphasizes the need
for language testing professionals to learn this method and its link to language
assessment research by suggesting five research questions in item analysis. The
use of this method is demonstrated by an analysis of a multiple-choice grammar
and vocabulary test. The author concludes the chapter by demonstrating how the
analysis can answer the five research questions proposed, as well as suggestions on
how to improve the test.
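For readers unfamiliar with the statistics involved, the sketch below computes two quantities central to item analysis, facility values (proportion correct) and corrected point-biserial discrimination indices, for a matrix of simulated dichotomous responses. It is a generic illustration, not the chapter’s dataset or analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated scored responses: 100 test takers x 5 items, 1 = correct, 0 = incorrect.
responses = (rng.random((100, 5)) < [0.9, 0.7, 0.5, 0.4, 0.2]).astype(int)
total = responses.sum(axis=1)

# Facility value: the proportion of test takers answering each item correctly.
facility = responses.mean(axis=0)

# Corrected point-biserial: correlate each item with the total score excluding
# that item, so the item does not inflate its own discrimination index.
discrimination = np.array([
    np.corrcoef(responses[:, i], total - responses[:, i])[0, 1]
    for i in range(responses.shape[1])
])

print(np.round(facility, 2))
print(np.round(discrimination, 2))
```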

Chapter 2: Univariate generalizability theory in language


assessment (Yasuyo Sawaki and Xiaoming Xi)
In addition to item analysis, investigating reliability and generalizability is a fun-
damental consideration of test development. Chapter 2 presents and extends the
framework to investigate reliability within the paradigm of classical test theory
(CTT). Generalizability theory (G theory) is a powerful method of investigat-
ing the extent to which scores are reliable, as it is able to account for different
sources of variability and their interactions in one analysis. The chapter provides
an overview of the key concepts in this method, outlines the steps in the analyses,
and presents an important caveat in the application of this method, i.e., concep-
tualization of an appropriate rating design that fits the context. A sample study
demonstrating the use of this method is presented to investigate the dependability
of ratings given on an English as a foreign language (EFL) summary writing task.
The authors compared the results of two G theory analyses, the rating method
and the block method, to demonstrate to readers the impact of rating design on
the results of the analysis. The chapter concludes with a discussion of the strengths
of the analysis compared to other CTT-based reliability indices, the value of this
method in investigating rater behavior, and suggested references should readers
wish to extend their knowledge of this technique.

Chapter 3: Multivariate generalizability theory in language


assessment (Kirby C. Grabowski and Rongchan Lin)
In performance assessments, multiple factors, such as task type, the rating scale
structure, and the rater, contribute to a test taker’s overall score, meaning
that scores are influenced by multiple sources of variance. Although univariate G
theory analysis is able to determine the reliability of scores, it is limited in that it
does not consider the impact of these sources of variance simultaneously. Multi-
variate G theory analysis is a powerful statistical technique, as in addition to results
generated by univariate G theory analysis, it is able to generate a reliability index
accounting for all these factors in one analysis. The analysis is also able to consider
the impact of subscales of a rating scale. The authors begin the chapter with an
overview of the basic concepts of multivariate G theory. Next, they illustrate an
application of this method through an analysis of a listening-speaking test where
they make clear links between research questions and the results of the analysis.
The chapter concludes with caveats in the use of this method and suggested refer-
ences for readers who wish to complement their MG theory analyses with other
methods.
Section II. Unidimensional Rasch measurement

Chapter 4: Applying Rasch measurement in language


assessment: unidimensionality and local independence
(Jason Fan and Trevor Bond)
This chapter discusses the two fundamental concepts required in the application of
Rasch measurement in language assessment research, i.e., unidimensionality and
local independence. It provides an accessible discussion of these concepts in the
context of language assessment. The authors first explain how the two concepts
should be perceived from a measurement perspective. This is followed by a brief
explanation of the Rasch model, a description of how these two measurement
properties are investigated through Rasch residuals, and a review of Rasch-based
studies in language assessment that report the existence of these properties to
strengthen test validity claims. The authors demonstrate the investigation of these
properties through the analysis of items in a listening test using the Partial Credit
Rasch model. The results of the study revealed that the listening test is unidi-
mensional and that the principal component analysis of residuals provides
evidence of local independence of items. The chapter concludes with a discussion
of the practical considerations and suggestions on steps to take should test devel-
opers encounter situations in which these properties of measurement are violated.

Chapter 5: The Rasch measurement approach to differential


item functioning (DIF) analysis in language assessment
research (Michelle Raquel)
This chapter continues the discussion of test measurement properties. Differential
item functioning (DIF) is the statistical term used to describe items that inad-
vertently have different difficulty estimates for different subgroups because they are
affected by characteristics of the test takers such as gender, age group, or ethnicity.
The author first explains the concept of DIF and then provides a brief overview of
different DIF detection methods used in language assessment research. A review of
DIF studies in language testing follows, which includes a summary of current DIF
studies, the DIF method(s) used, and whether the studies investigated the causes
of DIF. The chapter then illustrates one of the most commonly used DIF detection
methods, the Rasch-based DIF analysis method. The sample study investigates the
presence of DIF in a diagnostic English listening test in which students were clas-
sified according to the English language curriculum they had taken, Hong Kong
vs. Macau. The results of the study revealed that although there were a significant
number of items flagged for DIF, overall test results did not seem to be affected.
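A minimal sketch of the logic of a Rasch-based uniform DIF contrast is given below, assuming item difficulties and their standard errors have already been estimated separately for the two groups in a Rasch program; the values here are invented, and the flagging thresholds are only a commonly cited rule of thumb.

```python
import numpy as np

# Hypothetical Rasch item difficulty estimates (logits) and standard errors
# from separate calibrations for two groups of test takers.
b_group1 = np.array([-0.50, 0.20, 1.10])
se_group1 = np.array([0.08, 0.09, 0.11])
b_group2 = np.array([-0.45, 0.65, 1.05])
se_group2 = np.array([0.09, 0.10, 0.12])

dif_contrast = b_group1 - b_group2
t_stat = dif_contrast / np.sqrt(se_group1**2 + se_group2**2)

# A common rule of thumb flags items with |contrast| >= 0.5 logits and |t| >= 2.
for i, (c, t) in enumerate(zip(dif_contrast, t_stat), start=1):
    print(f"Item {i}: contrast = {c:+.2f} logits, t = {t:+.2f}")
```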

Chapter 6: Application of the rating scale model and the partial


credit model in language assessment research (Ikkyu Choi)
This chapter introduces two Rasch models that are used to analyze polytomous
data usually generated by performance assessments (speaking or writing tests) and
questionnaires used in language assessment studies. First, Ikkyu Choi explains
the relationship of the Rating Scale Model (RSM) and the Partial Credit Model
(PCM) through a gentle review of their algebraic representations. This is fol-
lowed by a discussion of the differences of these models and a review of studies
that have utilized this method. The author notes in his review that researchers
rarely provide a rationale for the choice of model, and neither do they compare
models. In the sample study investigating the scale of a motivation questionnaire,
the author provides a thorough and graphic comparison and evaluation of the
RSM and the PCM models and their impact on the scale structure of the ques-
tionnaire. The chapter concludes by providing a justification of why the PCM
was more appropriate for the context, a discussion of the limitations of the parameter
estimation method used by the sample study, and a list of suggested topics for readers
to extend their knowledge further.
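To make the algebra reviewed in the chapter more concrete, the sketch below computes category response probabilities for one polytomous item under the partial credit model; the rating scale model is the special case in which all items share the same set of thresholds. The threshold values are invented for illustration.

```python
import numpy as np

def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.
    `thresholds` holds the item's step difficulties (delta_1, ..., delta_m) in logits."""
    # Cumulative sums of (theta - delta_j), with 0 for the lowest category.
    cumulative = np.concatenate(([0.0], np.cumsum(theta - np.asarray(thresholds))))
    numerators = np.exp(cumulative)
    return numerators / numerators.sum()

# A person at 0 logits responding to an item with three thresholds (four categories).
print(np.round(pcm_category_probabilities(0.0, [-1.0, 0.0, 1.5]), 3))
```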

Chapter 7: Many-facet Rasch measurement: implications


for rater-mediated language assessment (Thomas Eckes)
This chapter discusses one of the most popular item response theory (IRT)-based
methods to analyze rater-mediated assessments. A common problem in speak-
ing and writing tests is that the marks or grades are dependent on human raters
who most likely have their own conceptions of how to mark despite training, which
impacts test reliability. Many-facet Rasch measurement (MFRM) provides a solu-
tion to this problem in that the analysis simultaneously includes multiple facets
such as raters, assessment criteria, test format, or the time when a test is taken.
The author first provides an overview of rater-mediated assessments and MFRM
concepts. The application of this method is illustrated through an analysis of a
writing assessment in which the author demonstrates how to determine rater
severity and the consistency of ratings, and how to generate test scores after adjusting
for differences in ratings. The chapter concludes with a discussion on advances in
MFRM research and controversial issues related to this method.
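In generic terms, the log-odds form of a three-facet rating scale model of the kind illustrated in Figure 7.5 can be sketched as follows; this is a standard textbook formulation rather than a reproduction of the chapter’s own equations, and the facets shown (test taker, criterion, rater) are only one possible configuration.

```latex
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \beta_i - \alpha_j - \tau_k
```

Here θ_n is the proficiency of test taker n, β_i the difficulty of criterion i, α_j the severity of rater j, and τ_k the threshold at which category k becomes more probable than category k − 1.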

Section III. Univariate and multivariate statistical analysis

Chapter 8: Analysis of differences between groups: the t-test


and the analysis of variance (ANOVA) in language assessment
(Tuğba Elif Toprak)
The third section of this volume starts with a discussion of two of the most fun-
damental and commonly used statistical techniques used to compare test score
results and determine whether differences between the groups are due to chance.
For example, language testers often find themselves trying to compare two or
more groups of test takers or to compare pre-test and post-test scores. The
chapter starts with an overview of t-tests and the analysis of variance (ANOVA)
and the assumptions that must be met before embarking on these analyses. The
literature review provides summary tables of recent studies that have employed
each method. The application of the t-test is demonstrated through a sample
study that investigated the impact of English songs on students’ pronuncia-
tion development in which the author divided the students into two groups
(experimental vs. control group) and then compared the groups’ results on a
pronunciation test. The second study utilized ANOVA to determine if students’
academic reading proficiency differed across college years (freshmen, sopho-
mores, juniors, seniors), and to determine which group was significantly different
from the others.
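For readers who want to see what such comparisons look like in code, the sketch below runs an independent-samples t-test and a one-way ANOVA with SciPy on simulated scores. The group structure loosely mirrors the two sample studies (experimental vs. control, and four college years), but the data are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Two-group comparison (e.g., experimental vs. control pronunciation scores).
experimental = rng.normal(loc=75, scale=8, size=40)
control = rng.normal(loc=70, scale=8, size=40)
t_stat, p_value = stats.ttest_ind(experimental, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Four-group comparison (e.g., reading scores across college years).
groups = [rng.normal(loc=m, scale=10, size=30) for m in (60, 64, 68, 72)]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```

Identifying which specific groups differ after a significant ANOVA is typically addressed with a post hoc procedure such as Tukey’s HSD.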

Chapter 9: Application of ANCOVA and MANCOVA in


language assessment research (Zhi Li and Michelle Y. Chen)
This chapter extends the discussion of methods used to compare test results.
Instead of using one variable alone to compare groups, analysis of
covariance (ANCOVA) and multivariate analysis of covariance (MANCOVA)
take additional variables into account when determining whether differences
in group scores are statistically significant. ANCOVA is used when there is only
one dependent variable, while MANCOVA is used when there are two or more
dependent variables included in the comparison. Both techniques
control for the effect of one or more variables that co-vary with the dependent
variables. The chapter begins with a brief discussion of these two methods, the
situations in which they should be used, the assumptions that must be fulfilled
before analysis can begin, and a brief discussion of how results should be reported.
The authors present the results of their meta-analyses of studies that have utilized
these methods and outline the issues related to results reporting in these studies.
The application of these methods is demonstrated in the analyses of the Pro-
gramme for International Student Assessment (PISA) 2009 reading test results
of Canadian children.
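A minimal ANCOVA sketch using the statsmodels formula interface is shown below: it compares two groups on a simulated outcome while adjusting for one covariate. The variable names are invented and do not correspond to the PISA variables analyzed in the chapter.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 120
attitude = rng.normal(size=n)          # covariate (e.g., reading attitude)
group = rng.integers(0, 2, size=n)     # two groups coded 0/1
score = 50 + 5 * group + 6 * attitude + rng.normal(scale=8, size=n)

data = pd.DataFrame({"score": score, "group": group, "attitude": attitude})

# ANCOVA: test the group effect on the outcome after adjusting for the covariate.
model = smf.ols("score ~ C(group) + attitude", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```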

Chapter 10: Application of linear regression in language


assessment (Daeryong Seo and Husein Taherbhai)
There are cases in which language testers need to determine the impact of one
variable on another variable, for example, whether someone’s first language has an impact on
the learning of a second language. Linear regression is the appropriate statistical
technique to use when one aims to determine the extent to which one or more
independent variables linearly impact a dependent variable. This chapter opens
with a brief discussion of the differences between single and multiple linear
regression and a full discussion on the assumptions that must be fulfilled before
commencing analysis. Next, the authors present a brief literature review of factors
that affect English language proficiency, as these determine what variables should
be included in the statistical model. The sample study illustrates the application
of linear regression by predicting students’ results on an English language arts
examination based on their performance in English proficiency tests of reading,
listening, speaking, and writing. The chapter concludes with a checklist of con-
cepts to consider before doing regression analysis.
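As a hedged illustration of multiple linear regression of the kind described here, the sketch below fits an ordinary least squares model with statsmodels on simulated data. The predictor names echo the four skill scores mentioned above, but the values and coefficients are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
data = pd.DataFrame({
    "reading": rng.normal(size=n),
    "listening": rng.normal(size=n),
    "speaking": rng.normal(size=n),
    "writing": rng.normal(size=n),
})
# Simulated outcome (e.g., an English language arts score) as a linear
# combination of the four skill scores plus random error.
data["ela"] = (10 + 4 * data["reading"] + 3 * data["listening"]
               + 2 * data["speaking"] + 3 * data["writing"]
               + rng.normal(scale=5, size=n))

model = smf.ols("ela ~ reading + listening + speaking + writing", data=data).fit()
print(model.summary())
```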
Chapter 11: Application of exploratory factor analysis in
language assessment (Limei Zhang and Wenshu Luo)
A standard procedure in test and survey development is to check whether
a test or questionnaire measures one underlying construct or dimension. Ideally,
test and questionnaire items are constructed to measure a latent construct (e.g.,
20 items to measure listening comprehension), but each item is designed to
measure a different aspect of the construct (e.g., items that measure the ability to
listen for details, ability to listen for main ideas, etc.). Exploratory factor analysis
(EFA) is a statistical technique that examines how items are grouped together
into themes and ultimately measure the latent trait. The chapter commences with
an overview of EFA, the different methods to extract the themes (factors) from
the data, and an outline of steps in conducting an EFA. This is followed by a
literature review that highlights the different ways the method has been applied
in language testing research, with specific focus on studies that confirm the factor
structure of tests and questionnaires. The sample study demonstrates how EFA
can do this by analyzing the factor structure of the Reading Test Strategy Use
Questionnaire used to determine the types of reading strategies that Chinese
students use as they complete reading comprehension tests.
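A minimal EFA sketch is given below, using scikit-learn’s FactorAnalysis with a varimax rotation (available in scikit-learn 0.24 or later); this is only one of several possible tool choices, and the chapter’s own analysis may use different software, extraction methods, and rotation settings. The questionnaire responses are simulated from two known factors so the recovered loading pattern can be checked.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(6)
n = 300

# Simulate two latent strategy factors and ten questionnaire items loading on them.
factors = rng.normal(size=(n, 2))
loadings = np.array([[0.8, 0.1]] * 5 + [[0.1, 0.7]] * 5)
items = factors @ loadings.T + rng.normal(scale=0.5, size=(n, 10))

efa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
print(np.round(efa.components_.T, 2))  # estimated loadings: items x factors
```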

Quantitative Data Analysis for Language Assessment


Volume II: Advanced Methods
Volume II comprises three major categories of quantitative methods in language
testing research: advanced IRT, advanced statistical methods, and nature-inspired
data mining methods. We provide an overview of the sections and chapters below.

Section I. Advanced Item Response Theory (IRT) models in


language assessment

Chapter 1: Mixed Rasch modeling in assessing reading comprehension


(Purya Baghaei, Christoph Kemper, Samuel Greif, and Monique
Reichert)
In this chapter, the authors discuss the application of the mixed Rasch model
(MRM) in assessing reading comprehension. MRM is an advanced psychometric
approach for detecting latent class differential item functioning (DIF) that
combines the Rasch model with latent class analysis. MRM relaxes some of the
requirements of conventional Rasch measurement while preserving most of the
fundamental features of the method, and it classifies test takers into mutually
exclusive classes with qualitatively different features. Baghaei et al. apply the model to a high-stakes
reading comprehension test in English as a foreign language and detect two latent
classes of test takers for whom the difficulty level of the test items differs. They
discuss the differentiating feature of the classes and conclude that MRM can be
applied to identify sources of multidimensionality.
Chapter 2: Multidimensional Rasch models in first language
listening tests (Christian Spoden and Jens Fleischer)
Since the introduction of Rasch measurement to language assessment, a group
of scholars has contended that language is not a unidimensional phenomenon,
and, accordingly, unidimensional modeling of language assessment data (e.g.,
through the unidimensional Rasch model) would conceal the role of many lin-
guistic features that are integral to language performance. The multidimensional
Rasch model could be viewed as a response to these concerns. In this chapter,
the authors provide a didactic presentation of the multidimensional Rasch model
and apply it to a listening assessment. They discuss the advantages of adopting
the model in language assessment research, specifically the improvement in the
estimation of reliability as a result of the incorporation of dimension correlations,
and explain how model comparison can be carried out while elaborating on
multidimensionality in listening comprehension assessments. They conclude the
chapter with a brief summary of other multidimensional Rasch models and their
value in language assessment research.

Chapter 3: The Log-Linear Cognitive Diagnosis Modeling


(LCDM) in second language listening assessment (Elif Toprak,
Vahid Aryadoust, and Christine Goh)
Another group of multidimensional models, called cognitive diagnostic mod-
els (CDMs), combines psychometrics and psychology. One of the differences
between CDMs and the multidimensional Rasch models is that the former family
estimates test takers’ mastery of sub-skills, whereas the latter group provides a gen-
eral ability estimate for each sub-skill. In this chapter, the authors introduce
the Log-Linear Cognitive Diagnosis Modeling (LCDM), which is a flexible CDM
technique for modeling assessment data. They apply the model to a high-stakes
norm-referenced listening test (a practice that is known as retrofitting) to deter-
mine whether they can derive diagnostic information concerning test takers’
weaknesses and strengths. Toprak et al. argue that although norm-referenced
assessments do not usually provide such diagnostic information about the lan-
guage abilities of test takers, providing such information is practical, as it helps
language learners who wish to know this information to improve their language
skills. They provide guidelines on the estimation and fitting of the LCDM, which
is also applicable to other CDM techniques.

Chapter 4: Hierarchical diagnostic classification models in


assessing reading comprehension (Hamdollah Ravand)
In this chapter, the author presents another group of CDM techniques including
the deterministic noisy and gate (DINA) model and the generalized deterministic
noisy and gate (G-DINA) model, which are increasingly attracting more atten-
tion in language assessment research. Ravand begins the chapter by providing
step-by-step guidelines to model selection, development, and evaluation, elabo-
rating on fit statistics and other relevant concepts in CDM analysis. Like Toprak
et al. who presented the LCDM in Chapter 3, Ravand argues for retrofitting
CDMs to norm-referenced language assessments and provides an illustrative
example of the application of CDMs to a non-diagnostic high-stakes test of read-
ing. He further explains how to use and interpret fit statistics (i.e., relative and
absolute fit indices) to select the optimal model among the available CDMs.

Section II. Advanced statistical methods in language assessment

Chapter 5: Structural equation modeling in language assessment


(Xuelian Zhu, Michelle Raquel, and Vahid Aryadoust)
This chapter discusses one of the most commonly used techniques in the field,
whose application in assessment research goes back to at least the 1990s. Instead
of modeling a single relationship between variables, structural equation modeling (SEM)
is used to concurrently model direct and indirect relationships among variables.
The authors first provide a review of SEM in language assessment research and
propose a framework for model development, specification, and validation. They
discuss the requirements of sample size, fit, and model respecification and apply
SEM to confirm the use of a diagnostic test in predicting the proficiency level of
test takers as well as the possible mediating role for some demographic factors in
the model tested. While SEM can be applied to both dichotomous and polyto-
mous data, the authors focus on the latter group of data, while stressing that the
principles and guidelines spelled out are directly applicable to dichotomous data.
They further mention other applications of SEM such as multigroup modeling
and SEM of dichotomous data.
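
For readers new to the distinction between direct and indirect effects, the following Python sketch illustrates the decomposition on simulated data using two ordinary least-squares regressions; it is a deliberately simplified stand-in for a full SEM with latent variables and fit evaluation, and the variable roles and coefficients are assumptions made only for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: a diagnostic score (X), a hypothetical mediator (M),
# and a proficiency outcome (Y); not data from the chapter's study.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(size=n)
Y = 0.4 * X + 0.3 * M + rng.normal(size=n)

# Path a (X -> M), then paths c' (direct X -> Y) and b (M -> Y)
a = sm.OLS(M, sm.add_constant(X)).fit().params[1]
params = sm.OLS(Y, sm.add_constant(np.column_stack([X, M]))).fit().params
direct, b = params[1], params[2]

print("direct effect (c'):   ", round(direct, 3))
print("indirect effect (a*b):", round(a * b, 3))
print("total effect:         ", round(direct + a * b, 3))
```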

Chapter 6: Growth modeling using growth percentiles for


longitudinal studies (Daeryong Seo and Husein Taherbhai)
This chapter presents a method for modeling growth in longitudinal data, the student growth percentile (SGP), which is estimated using quantile regression. A distinctive feature of SGP is that it compares test takers with
those who had the same history of test performance and achievement. This means
that even when the current test scores are the same for two test takers with differ-
ent assessment histories, their actual SGP scores on the current test can be differ-
ent. Another feature of SGP that differentiates it from similar techniques such as
MLM and latent growth curve models is that SGP does not require test equating,
which, in itself, could be a time-consuming process. Oftentimes, researchers and
language teachers wish to determine whether a particular test taker has a chance
to achieve a pre-determined cut score, but a quick glance at the literature shows that commonly used quantitative tools do not provide such information. Seo
and Taherbhai show that through the quantile regression method, one can esti-
mate the propensity of test takers to achieve an SGP score required to reach the
cut score. The technique lends itself to investigation of change in four language
modalities, i.e., reading, writing, listening, and speaking.
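
To illustrate the underlying logic in miniature, the Python sketch below uses quantile regression (via statsmodels) on simulated scores from a single prior test to approximate a growth percentile for one hypothetical test taker. Operational SGP models condition on a longer score history and use more refined estimation than this two-variable example assumes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated prior- and current-test scores (illustrative only)
rng = np.random.default_rng(1)
n = 400
prior = rng.normal(500, 50, n)
current = 0.8 * prior + 100 + rng.normal(0, 30, n)
df = pd.DataFrame({"prior": prior, "current": current})

def growth_percentile(prior_score, current_score):
    """Highest conditional quantile of 'current' (given the same prior score)
    that the observed current score still reaches: an approximate SGP."""
    sgp = 0
    for q in np.arange(0.05, 1.0, 0.05):
        fit = smf.quantreg("current ~ prior", df).fit(q=q)
        predicted = float(fit.predict(pd.DataFrame({"prior": [prior_score]}))[0])
        if current_score >= predicted:
            sgp = int(round(q * 100))
    return sgp

# A hypothetical test taker: prior score 520, current score 530
print("approximate SGP:", growth_percentile(520, 530))
```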

Chapter 7: Multilevel modeling to examine sources of variability in


second language test scores (Yo In’nami and Khaled Barkaoui)
Multilevel modeling (MLM) is based on the premise that test takers’ perfor-
mance is a function of students’ measured abilities as well as another level of
variation such as the classrooms, schools, or cities the test takers come from.
According to the authors, MLM is particularly useful when test takers are from
pre-specified homogeneous subgroups such as classrooms, which have differ-
ent characteristics from test takers placed in other subgroups. The between-
subgroup heterogeneity combined with the within-subgroup homogeneity yields a source of variance in the data which, if ignored, can inflate the chances of a Type I
error (i.e., rejection of a true null hypothesis). The authors provide guidelines
and advice on using MLM and showcase the application of the technique to a
second-language vocabulary test.
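
As a concrete, minimal illustration of the random-intercept idea, the Python sketch below fits a two-level model to simulated vocabulary-test scores clustered in classrooms, using the statsmodels mixed-effects module. The predictor, group sizes, and variance components are invented; the chapter's worked example should be consulted for the full modeling and reporting workflow.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated vocabulary scores for 20 classrooms of 25 test takers each
rng = np.random.default_rng(2)
n_classes, n_per_class = 20, 25
classroom = np.repeat(np.arange(n_classes), n_per_class)
class_effect = rng.normal(0, 5, n_classes)[classroom]   # classroom-level variation
study_hours = rng.normal(10, 3, classroom.size)
score = 50 + 2 * study_hours + class_effect + rng.normal(0, 8, classroom.size)
df = pd.DataFrame({"score": score, "study_hours": study_hours,
                   "classroom": classroom})

# Random-intercept model: scores vary around classroom-specific means,
# so the clustering that would otherwise inflate Type I error is modeled.
model = smf.mixedlm("score ~ study_hours", df, groups=df["classroom"])
print(model.fit().summary())
```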

Chapter 8: Longitudinal multilevel modeling to examine changes in


second language test scores (Khaled Barkaoui and Yo In’nami)
In this chapter, the authors propose that the flexibility of MLM renders it well suited
for modeling growth and investigating the sensitivity of test scores to change
over time. The authors argue that MLM is a hierarchical alternative to linear methods such as analysis of variance (ANOVA) and linear regression, and they present an example of longitudinal second-language data. They encourage MLM users to consider and control for the variability of test forms, which can confound assessments over time, to ensure test equity before using test scores in MLM analysis and to maximize the validity of the uses and interpretations of the test scores.
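
The Python sketch below shows the minimal form such a growth model can take: simulated scores from 100 learners measured on four occasions, with occasions nested within learners and both the intercept and the slope for time allowed to vary across learners. The data are simulated, and the test-form and equating issues the authors emphasize are assumed away here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated longitudinal scores: 100 learners, 4 measurement occasions
rng = np.random.default_rng(3)
n_learners, n_waves = 100, 4
learner = np.repeat(np.arange(n_learners), n_waves)
time = np.tile(np.arange(n_waves), n_learners)
start = rng.normal(50, 6, n_learners)[learner]    # person-specific starting level
growth = rng.normal(3, 1, n_learners)[learner]    # person-specific growth rate
score = start + growth * time + rng.normal(0, 4, learner.size)
df = pd.DataFrame({"score": score, "time": time, "learner": learner})

# Random intercepts and random time slopes across learners
model = smf.mixedlm("score ~ time", df, groups=df["learner"], re_formula="~time")
print(model.fit().summary())
```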

Section III. Nature-inspired data mining methods in


language assessment

Chapter 9: Classification and regression trees in predicting listening


item difficulty (Vahid Aryadoust and Christine Goh)
The first data mining method in this section is classification and regression trees (CART), presented by Aryadoust and Goh. CART is used in much the same way as linear regression or classification techniques in prediction and classification research. CART, however, relaxes the normality and other
assumptions that are necessary for parametric models such as regression analysis.
Aryadoust and Goh review the literature on the application of CART in language
assessment and propose a multi-stage framework for CART modeling that starts
with the establishing of theoretical frameworks and ends in cross-validation.
The authors apply CART to 321 listening test items and generate a number of
IF-THEN rules which link item difficulty to the linguistic features of the items
in a non-linear way. The chapter also stresses the role of cross-validation in
CART modeling and the features of two cross-validation methods (n-fold cross-
validation and train-test cross-validation).
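
To give a feel for the IF-THEN rules CART produces, the Python sketch below grows a small regression tree on simulated item features and prints its rule structure, holding out a test set in the train-test fashion mentioned above. The feature names, the data-generating process, and the tree depth are assumptions for illustration and do not reproduce the chapter's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Simulated listening items with a few linguistic features (illustrative only)
rng = np.random.default_rng(4)
n_items = 300
items = pd.DataFrame({
    "word_count": rng.integers(10, 60, n_items),
    "speech_rate": rng.normal(150, 20, n_items),
    "lexical_density": rng.uniform(0.3, 0.7, n_items),
})
difficulty = (0.02 * items["word_count"] + 0.01 * items["speech_rate"]
              + 2 * (items["lexical_density"] > 0.5) + rng.normal(0, 0.3, n_items))

# Train-test cross-validation: hold out items to check that the rules generalize
X_train, X_test, y_train, y_test = train_test_split(
    items, difficulty, test_size=0.25, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

print("held-out R^2:", round(tree.score(X_test, y_test), 3))
print(export_text(tree, feature_names=list(items.columns)))  # IF-THEN style rules
```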

Chapter 10: Evolutionary algorithm-based symbolic regression to


determine the relationship of reading and lexico-grammatical
knowledge (Vahid Aryadoust)
Aryadoust introduces the evolutionary algorithm-based (EA-based) symbolic
regression method and showcases its application in reading assessment. Like
CART, EA-based symbolic regression is a non-linear data analysis method that
comprises a training and a cross-validation stage. The technique is inspired by
the principles of Darwinian evolution. Accordingly, concepts such as survival of
the fittest, offspring, breeding, chromosomes, and cross-over are incorporated
into the mathematical modeling procedures. The non-parametric nature and
cross-validation capabilities of EA-based symbolic regression render it a powerful
classification and prediction model in language assessment. Aryadoust presents a
prediction study in which he adopts lexico-grammatical abilities as independent
variables to predict the reading ability of English learners. He compares the predictive power of the method with that of a linear regression model and shows that the technique yields more precise solutions.
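
As one possible way to experiment with this family of methods, the sketch below contrasts an evolutionary symbolic regressor with ordinary linear regression on simulated data. It assumes the third-party gplearn package is installed (the chapter itself may rely on different software), and the two predictors and the "true" non-linear relationship are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from gplearn.genetic import SymbolicRegressor   # third-party package, assumed installed

# Simulated lexico-grammatical predictors and a non-linear reading criterion
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(400, 2))            # e.g., vocabulary and grammar measures
y = 2 * X[:, 0] * X[:, 1] + 0.5 * X[:, 0] + rng.normal(0, 0.05, 400)
train, test = slice(0, 300), slice(300, None)   # simple train/cross-validation split

# Evolutionary search over symbolic expressions (selection, crossover, mutation)
sr = SymbolicRegressor(population_size=500, generations=10,
                       function_set=("add", "sub", "mul"), random_state=0)
sr.fit(X[train], y[train])
lin = LinearRegression().fit(X[train], y[train])

print("evolved expression:", sr._program)
print("symbolic regression R^2 (held-out):", round(sr.score(X[test], y[test]), 3))
print("linear regression R^2 (held-out):  ", round(lin.score(X[test], y[test]), 3))
```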

Conclusion
In sum, Volumes I and II present 23 fundamental and advanced quantitative
methods and their applications in language testing research. An important fac-
tor to consider in choosing these fundamental and advanced methods is the
role of theory and the nature of the research questions. Although some may be drawn to advanced methods because they might provide stronger evidence to support validity and reliability claims, in some cases less complex methods may better serve researchers’ needs. Nevertheless, oversimplifying research
problems could result in overlooking significant sources of variation in data
and drawing possibly wrong or naïve inferences. The authors of the chapters
have, therefore, emphasized that the first step to choosing the methods is the
postulation of theoretical frameworks to specify the nature of relationships
among variables, processes, and mechanisms of the attributes under investiga-
tion. Only after establishing the theoretical framework should one proceed to
select quantitative methods to test the hypotheses of the study. To this end, the
chapters in the volumes provide step-by-step guidelines to achieve accuracy and
precision in choosing and conducting the relevant quantitative techniques. We
are confident that the joint effort of the authors has emphasized the research
rigor required in the field and highlighted strengths and weaknesses of the data
analysis techniques.

References
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge: Cambridge University Press.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity
argument for the test of English as a foreign language. New York, NY: Routledge.

Item analysis in language assessment


Alderson, J. C. (2007). Final report on the ELPAC validation study. Retrieved from
www.elpac.info/
Alderson, J. C. (2010). A survey of aviation English tests. Language Testing, 27(1), 5172.
Alderson, J. C. , & Huhta, A. (2005). The development of a suite of computer-based diagnostic
tests based on the Common European Framework. Language Testing, 22(3), 301320.
Alderson, J. C. , Percsich, R. , & Szabo, G. (2000). Sequencing as an item type. Language
Testing, 17(4), 423447.
Anderson, N. J. , Bachman, L. , Perkins, K. , & Cohen, A. (1991). An exploratory study into the
construct validity of a reading comprehension test: Triangulation of data sources. Language
Testing, 8(1), 4166.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge
University Press.
Bhumichitr, D. , Gardner, D. , & Green, R. (2013, July). Developing a test for diplomats:
Challenges, impact and accountability. Paper presented at the Language Testing Research
Colloquium, Seoul, South Korea.
Campfield, D. E. (2017). Lexical difficulty: Using elicited imitation to study child L2. Language
Testing, 34(2), 197221.
Cizek, G. J. , & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating
performance standards on tests. Thousand Oaks, CA: SAGE Publications.
Culligan, B. (2015). A comparison of three test formats to assess word difficulty. Language
Testing, 32(4), 503520.
Currie, M. , & Chiramanee, T. (2010). The effect of the multiple-choice item format on the
measurement of knowledge of language structure. Language Testing, 27(4), 471491.
Field, A. (2009). Discovering statistics using SPSS. London: SAGE Publications.
Fortune, A. (2004). Testing listening comprehension in a foreign language: Does the number of
times a text is heard affect performance. Unpublished MA dissertation, Lancaster University,
United Kingdom.
Green, R. (2005). English Language Proficiency for Aeronautical Communication ELPAC.
Paper presented at the Language Testing Forum (LTF), University of Cambridge.
29 Green, R. (2013). Statistical analyses for language testers. New York, NY: Palgrave
Macmillan.
Green, R. (2017). Designing listening tests: A practical approach. London: Palgrave Macmillan.
Green, R. , & Spoettl, C. (2009, June). Going national, standardized and live in Austria:
Challenges and tensions. Paper presented at the EALTA Conference, Turku, Finland. Retrieved
from www.ealta.eu.org/conference/2009/programme.htm
Green, R. , & Spoettl, C. (2011, May). Building up a pool of standard setting judges: Problems,
solutions and insights. Paper presented at the EALTA Conference, Siena, Italy. Retrieved from
www.ealta.eu.org/conference/2011/programme.html
Henning, G. (1987). A guide to language testing: Development, evaluation, research.
Cambridge: Newbury House Publishers.
Hsu, T. H. L. (2016). Removing bias towards World Englishes: The development of a rater
attitude instrument using Indian English as a stimulus. Language Testing, 33(3), 367389.
Ilc, G. , & Stopar, A. (2015). Validating the Slovenian national alignment to CEFR: The case of
the B2 reading comprehension examination in English. Language Testing, 32(4), 443462.
Jafarpur, A. (2003). Is the test constructor a facet? Language Testing, 20(1), 5787.
Jang, E. E. , Dunlop, M. , Park, G. , & van der Boom, E. H. (2015). How do young students with
different profiles of reading skill mastery, perceived ability, and goal orientation respond to
holistic diagnostic feedback? Language Testing, 32(3), 359383.
Kobayashi, M. (2002). Method effects on reading comprehension test performance: Text
organization and response format. Language Testing, 19(2), 193220.
LaFlair, G. T. , Isbell, D. , May, L. N. , Gutierrez Arvizu, M. N. , & Jamieson, J. (2015). Equating
in small-scale language testing programs. Language Testing, 34(1), 127144.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.
Pallant, J. (2007). SPSS survival manual: A step by step guide to data analysis using SPSS for
Windows (3rd ed.). New York, NY: Open University Press.
Papageorgiou, S. (2016). Aligning language assessments to standards and frameworks. In D.
Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 327340). Berlin:
Walter de Gruyter Inc.
Popham, W. J. (2000). Modern educational measurement: Practical guidelines for educational
leaders. Boston, MA: Allyn & Bacon.
Salkind, N. J. (2006). Tests & measurement for people who (think they) hate tests &
measurement. Thousand Oaks, CA: SAGE Publications.
Sarig, G. (1989). Testing meaning construction: Can we do it fairly? Language Testing, 6(1),
7794.
Shizuka, T. , Takeuchi, O. , Yashima, T. , & Yoshizawa, K. (2006). A comparison of three-and
four-option English tests for university entrance selection purposes in Japan. Language Testing,
23(1), 3557.
Wagner, E. (2008). Video listening tests: What are they measuring? Language Assessment
Quarterly, 5(3), 218243.
Wagner, E. (2010). The effect of the use of video texts on ESL listening test-taker performance.
Language Testing, 27(4), 493513.
Zieky, M. J. , Perie, M. , & Livingston, S. A. (2008). Cut-scores: A manual for setting standards
of performance on educational and occupational tests. Princeton, NJ: Educational Testing
Service.

Univariate generalizability theory in language assessment


Atilgan, H. (2013). Sample size for estimation of g and phi coefficients in generaliz-ability theory.
Eurasian Journal of Educational Research, 51, 215228.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge
University Press.
Bachman, L. F. , Lynch, B. K. , & Mason, M. (1995). Investigating variability in tasks and rater
judgment in a performance test of foreign language speaking. Language Testing, 12(2),
238257.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College
Testing.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer-Verlag, Inc.
Brennan, R. L. , Gao, X. , & Colton, D. A. (1995). Generalizability analyses of Work Keys
listening and writing tests. Educational and Psychological Measurement, 55(2), 157176.
Brown, J. D. (1999). The relative importance of persons, items, subtests and languages to
TOEFL test variance. Language Testing, 16(2), 217238.
Brown, J. D. , & Bailey, K. M. (1984). A categorical instrument for scoring second language
writing skills. Language Learning, 34(4), 2138.
Brown, J. D. , & Hudson, T. (2002). Criterion-referenced language testing. Cambridge:
Cambridge University Press.
Cardinet, J. , Johnson, S. , & Pini, G. (2009). Applying generalizability theory using EduG. New
York, NY: Routledge.
Chiu, C. W. T. (2001). Scoring performance assessments based on judgments: Generaliz-ability
theory. Boston, MA: Kluwer Academic.
Chiu, C. W. T. , & Wolfe, E. W. (2002). A method for analyzing sparse data matrices in the
generalizability theory framework. Applied Psychological Measurement, 26(3), 321338.
Crick, J. E. , & Brennan, R. L. (2001). GENOVA (Version 3.1) [Computer program].
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16,
297334.
Cronbach, L. J. , Gleser, G. C. , Nanda, H. , & Rajaratnam, N. (1972). The dependability of
behavioral measurements: Theory of generalizability for scores and profiles. New York, NY:
Wiley.
Hoyt, W. T. (2010). Inter-rater reliability and agreement. In G. R. Hancock & R. O. Mueller
(Eds.), The reviewers guide to quantitative methods in the social sciences (pp. 141154). New
York, NY: Routledge.
Innami, Y. , & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A synthesis
of generalizability studies. Language Testing, 33(3), 341366.
Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting of
integrated and independent tasks. Language Testing, 23(2), 131166.
Lee, Y.-W. , & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating
prototype tasks and alternative rating schemes. TOEFL Monograph Series No. 31. Princeton,
NJ: Educational Testing Service.
53 Lin, C.-K. (2017). Working with sparse data in rated language tests: Generalizability theory
applications. Language Testing, 34(2), 271289.
Lynch, B. K. , & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement
in the development of performance of ESL speaking skills of immigrants. Language Testing,
15(2), 158180.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment:
Reporting a score profile. Language Testing, 24(3), 355390.
Sawaki, Y. (2013). Classical test theory. In A. Kunnan (Ed.), The companion to language
assessment (pp. 11471164). New York, NY: Wiley.
Sawaki, Y. , & Sinharay, S. (2013). The value of reporting TOEFL iBT subscores. TOEFL iBT
Research Report No. TOEFLiBT-21. Princeton, NJ: Educational Testing Service.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation
modeling. Language Testing, 22(1), 130.
Shavelson, R. J. , & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA:
SAGE Publications.
Xi, X. (2007). Evaluating analytic scores for the TOEFL Academic Speaking Test (TAST) for
operational use. Language Testing, 24(2), 251286.
Xi, X. , & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test.
Language Learning, 61(4), 12221255.
Zhang, S. (2006). Investigating the relative effects of persons, items, sections, and languages
on TOEIC score dependability. Language Testing, 23(3), 351369.

Multivariate generalizability theory in language assessment


Alderson, J. C. (1981). Report of the discussion on the testing of English for specific purposes.
In J. C. Alderson & A. Hughes (Eds.), Issues in language testing. ELT Documents No. 111 (pp.
187194). London: British Council.
Atilgan, H. (2013). Sample size for estimation of G and phi coefficients in generaliz-ability
theory. Eurasian Journal of Educational Research, 51, 215227.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge
University Press.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment
Quarterly, 2, 134.
Bachman, L. F. , Lynch, B. K. , & Mason, M. (1995). Investigating variability in tasks and rater
judgment in a performance test of foreign language speaking. Language Testing, 12(2),
238257.
Bachman, L. F. , & Palmer, A. (1982). The construct validation of some components of
communicative proficiency. TESOL Quarterly, 16, 446465.
Bachman, L. F. , & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University
Press.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: ACT, Inc.
Brennan, R. L. (1992). Generalizability theory. Educational Measurement: Issues and Practice,
11(4), 2734.
Brennan, R. L. (2001a). Generalizability theory. New York, NY: Springer-Verlag.
Brennan, R. L. (2001b). mGENOVA (Version 2.1) [Computer software]. Iowa City, IA: The
University of Iowa. Retrieved from https://education.uiowa.edu/centers/center-advanced-
studies-measurement-and-assessment/computer-programs
Brennan, R. L. (2011). Generalizability theory and classical test theory. Applied Measurement in
Education, 24(1), 121. doi: 10.1080/08957347.2011.532417
Brennan, R. L. , Gao, X. , & Colton, D. A. (1995). Generalizability analyses of work keys
listening and writing tests. Educational and Psychological Measurement, 55(2), 157176.
79 Chiu, C. , & Wolfe, E. (2002). A method for analyzing sparse data matrices in the
generalizability theory framework. Applied Psychological Measurement, 26(3), 321338.
Cronbach, L. J. , Gleser, G. C. , Nanda, H. , & Rajaratnam, N. (1972). The dependability of
behavioral measurement: Theory of generalizability for scores and profiles. New York, NY:
Wiley.
Davies, A. , Brown, A. , Elder, E. , Hill, K. , Lumley, T. , & McNamara, T. (1999). Dictionary of
language testing. Studies in Language Testing, 7. Cambridge: Cambridge University Press.
Frost, K. , Elder, C. , & Wigglesworth, G. (2011). Investigating the validity of an integrated
listening-speaking task: A discourse-based analysis of test takers oral performances. Language
Testing, 29(3), 345369.
Grabowski, K. C. (2009). Investigating the construct validity of a test designed to measure
grammatical and pragmatic knowledge in the context of speaking. Unpublished doctoral
dissertation, Teachers College, Columbia University.
Innami, Y. , & Koizumi, R. (2015). Task and rater effects in L2 speaking and writing: A synthesis
of generalizability studies. Language Testing, 33(3), 341366.
Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting of
integrated and independent tasks. Language Testing, 23(2), 131166.
Lee, Y.-W. , & Kantor, R. (2007). Evaluating prototype tasks and alternative rating schemes for
a new ESL writing test through g-theory. International Journal of Testing, 7(4), 353385.
Liao, Y.-F. (2016). Investigating the score dependability and decision dependability of the GEPT
listening test: A multivariate generalizability theory approach. English Teaching and Learning,
40(1), 79111.
Lin, R. (2017, June). Operationalizing content integration in analytic scoring: Assessing
listening-speaking ability in a scenario-based assessment. Paper presented at the 4th Annual
International Conference of the Asian Association for Language Assessment (AALA), Taipei.
Linacre, M. (2001). Generalizability theory and Rasch measurement. Rasch Measurement
Transactions, 15(1), 806807.
Lynch, B. K. , & McNamara, T. (1998). Using G-theory and many-facet Rasch measurement in
the development of performance assessments of the ESL speaking skills of migrants. Language
Testing, 15(2), 158180.
McNamara, T. (1996). Measuring second language test performance. New York, NY: Longman.
Plakans, L. (2013). Assessment of integrated skills. In C. A. Chapelle (Ed.), The encyclopedia of
applied linguistics (pp. 205212). Malden, MA: Blackwell.
Sato, T. (2011). The contribution of test-takers speech content to scores on an English oral
proficiency test. Language Testing, 29(2), 223241.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment:
Reporting a score profile and a composite. Language Testing, 24(3), 355390.
Shavelson, R. , & Webb, N. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGE
Publications.
Wang, M. W. , & Stanley, J. C. (1970). Differential weighting: A review of methods and empirical
studies. Review of Educational Research, 40(5), 663705.
Webb, N. M. , Shavelson, R. J. , & Maddahian, E. (1983). Multivariate generalizability theory. In
L. J. Fyans (Ed.), New directions in testing and measurement: Generaliz-ability theory (pp.
6782). San Francisco, CA: Jossey-Bass.
80 Weigle, S. (2004). Integrating reading and writing in a competency test for non-native
speakers of English. Assessing Writing, 9, 2755.
Xi, X. (2007). Evaluating analytic scoring for the TOEFL Academic Speaking Test (TAST) for
operational use. Language Testing, 24(2), 251286.
Xi, X. , & Mollaun, P. (2014). Investigating the utility of analytic scoring for the TOEFL Academic
Speaking Test (TAST). ETS Research Report Series, 2006(1), 171.
Zhang, S. (2006). Investigating the relative effects of persons, items, sections, and language on
TOEIC score dependability. Language Testing, 23(3), 351369.

Applying Rasch measurement in language assessment


Aryadoust, V. , Goh, C. C. , & Kim, L. O. (2011). An investigation of differential item functioning
in the MELAB listening test. Language Assessment Quarterly, 8(4), 361385.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what
we count counts. Language Testing, 17(1), 142.
Bachman, L. F. , & Palmer, A. S. (1996). Language assessment in practice: Designing and
developing useful language tests. Oxford: Oxford University Press.
Bachman, L. F. , & Palmer, A. S. (2010). Language assessment in practice: Developing
language assessments and justifying their use in the real world. Oxford: Oxford University
Press.
Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language Testing,
27(1), 101118.
Bejar, I. I. (1983). Achievement testing: Recent advances. Beverly Hills, CA: SAGE
Publications.
Bond, T. , & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the
human sciences. New York, NY: Routledge.
Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.
Chapelle, C. A. , Enright, M. K. , & Jamieson, J. M. (2008). Building a validity argument for the
test of English as a foreign language. New York, NY and London: Routledge, Taylor & Francis
Group.
Chen, W.-H. , & Thissen, D. (1997). Local dependence indexes for item pairs using item
response theory. Journal of Educational and Behavioral Statistics, 22(3), 265289.
Chou, Y. T. , & Wang, W. C. (2010). Checking dimensionality in item response models with
principal component analysis on standardized residuals. Educational and Psychological
Measurement, 70(5), 717731.
Christensen, K. B. , Makransky, G. , & Horton, M. (2017). Critical values for Yens Q 3:
Identification of local dependence in the Rasch model using residual correlations. Applied
Psychological Measurement, 41(3), 178194.
Eckes, T. (2011). Introduction to many-facet Rasch measurement. Frankfurt am Main: Peter
Lang.
Fan, J. , & Ji, P. (2014). Test candidates attitudes and their test performance: The case of the
Fudan English Test. University of Sydney Papers in TESOL, 9, 135.
Fan, J. , Ji, P. , & Song, X. (2014). Washback of university-based English language tests on
students learning: A case study. The Asian Journal of Applied Linguistics, 1(2), 178192.
101 Fan, J. , & Yan, X. (2017). From test performance to language use: Using self-assessment
to validate a high-stakes English proficiency test. The Asia-Pacific Education Researcher,
26(12), 6173.
FDU Testing Team . (2014). The Fudan English test syllabus. Shanghai: Fudan University
Press.
Ferguson, G. A. (1941). The factorial interpretation of test difficulty. Psychometrika, 6(5),
323330.
Ferne, T. , & Rupp, A. A. (2007). A synthesis of 15 years of research on DIF in language
testing: Methodological advances, challenges, and recommendations. Language Assessment
Quarterly, 4(2), 113148.
Field, A. (2009). Discover statistics using SPSS. London: SAGE Publications.
Hamp-Lyons, L. (1989). Applying the partial credit method of Rasch analysis: Language testing
and accountability. Language Testing, 6(1), 109118.
Linacre, J. M. (1998). Detecting multidimensionality: Which residual data-type works best?
Journal of Outcome Measurement, 2, 266283.
Linacre, J. M. (2017a). Winsteps (Version 3.93.0) [Computer software]. Beaverton, OR:
Winsteps.com. Retrieved January 1, 2017 from www.winsteps.com.
Linacre, J. M. (2017b). Facets computer program for many-facet Rasch measurement, version
3.80.0 users guide. Beaverton, OR: Winsteps.com.
Linacre, J. M. (2017c). Winsteps Rasch measurement computer program users guide.
Beaverton, OR: Winsteps.com.
Linacre, J. M. (May, 2017). Personal Communication.
Marais, I. (2013). Local dependence. In K. B. Christensen , S. Kreiner , & M. Mesbah (Eds.),
Rasch models in health (pp. 111130). London: John Wiley & Sons Ltd.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149174.
McNamara, T. (1996). Measuring second language proficiency. London: Longman Publishing
Group.
McNamara, T. , & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement
in language testing. Language Testing, 29(4), 553574.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13103).
New York, NY: American Council on Education/Macmillan Publishing Company.
Min, S. , & He, L. (2014). Applying unidimensional and multidimensional item response theory
models in testlet-based reading assessment. Language Testing, 31(4), 453477.
Pae, H. K. , Greenberg, D. , & Morris, R. D. (2012). Construct validity and measurement
invariance of the Peabody Picture Vocabulary Test III Form A. Language Assessment
Quarterly, 9(2), 152171.
Sawaki, Y. , Stricker, L. J. , & Oranje, A. H. (2009). Factor structure of the TOEFL Internet-
based test. Language Testing, 26(1), 530.
Skehan, P. (1989). Language testing part II. Language Teaching, 22(1), 113.
Stevens, J. (2002). Applied multivariate statistics for the social sciences. Mahwah, NJ:
Lawrence Erlbaum Associates, Inc.
Taylor, L. , & Geranpayeh, A. (2011). Assessing listening for academic purposes: Defining and
operationalising the test construct. Journal of English for Academic Purposes, 10(2), 89101.
Thissen, D. , Steinberg, L. , & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-
categorical-response models. Journal of Educational Measurement, 26(3), 247260.
102 Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of
the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125145.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item
dependence. Journal of Educational Measurement, 30(3), 187213.
Zhang, B. (2010). Assessing the accuracy and consistency of language proficiency
classification under competing measurement models. Language Testing, 27(1), 119140.

The Rasch measurement approach to differential item functioning (DIF)


analysis in language assessment research
Abbott, M. L. (2007). A confirmatory approach to differential item functioning on an ESL reading
assessment. Language Testing, 24(1), 736. doi: 10.1177/0265532207071510
Allalouf, A. (2003). Revising translated differential item functioning items as a tool for improving
cross-lingual assessment. Applied Measurement in Education, 16(1), 5573. doi:
10.1207/S15324818AME1601_3
Allalouf, A. , & Abramzon, A. (2008). Constructing better second language assessments based
on differential item functioning analysis. Language Assessment Quarterly, 5(2), 120141. doi:
10.1080/15434300801934710
Allalouf, A. , Hambleton, R. K. , & Sireci, S. G. (1999). Identifying the causes of DIF in translated
verbal items. Journal of Educational Measurement, 36(3), 185198.
Aryadoust, V. (2012). Differential item functioning in while-listening performance tests: The case
of the International English Language Testing System (IELTS) listening module. International
Journal of Listening, 26(1), 4060. doi: 10.1080/10904018.2012.639649
Aryadoust, V. , Goh, C. C. M. , & Kim, L. O. (2011). An investigation of differential item
functioning in the MELAB listening test. Language Assessment Quarterly, 8(4), 361385. doi:
10.1080/15434303.2011.628632
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment
Quarterly, 2(1), 134. doi: 10.1207/s15434311laq0201_1
Bachman, L. F. , & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University
Press.
Bae, J. , & Bachman, L. F. (1998). A latent variable approach to listening and reading: Testing
factorial invariance across two groups of children in the Korean/English two-way immersion
program. Language Testing, 15(3), 380414. doi: 10.1177/026553229801500304
Banerjee, J. , & Papageorgiou, S. (2016). Whats in a topic? Exploring the interaction between
test-taker age and item content in high-stakes testing. International Journal of Listening, 30(12),
824. doi: 10.1080/10904018.2015.1056876
Bauer, D. J. (2016). A more general model for testing measurement invariance and differential
item functioning. Psychological Methods. doi: 10.1037/met0000077
126 Bollmann, S. , Berger, M. , & Tutz, G. (2017). Item-focused trees for the detection of
differential item functioning in partial credit models. Educational and Psychological
Measurement, 78(5), 781804. doi: 10.1177/0013164417722179
Bond, T. G. , & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the
human sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Borsboom, D. , Mellenbergh, G. J. , & Van Heerden, J. (2002). Different kinds of DIF: A
distinction between absolute and relative forms of measurement invariance and bias. Applied
Psychological Measurement, 26(4), 433450. doi: 10.1177/014662102237798
Bray, M. , Butler, R. , Hui, P. , Kwo, O. , & Mang, E. (2002). Higher education in Macau: Growth
and strategic development. Hong Kong: Comparative Education Research Centre (CERC), The
University of Hong Kong.
Bray, M. , & Koo, R. (2004). Postcolonial patterns and paradoxes: Language and education in
Hong Kong and Macao. Comparative Education, 40(2), 215239.
Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.
Bulut, O. , Quo, Q. , & Gierl, M. J. (2017). A structural equation modeling approach for
examining position effects in large-scale assessments. Large-Scale Assessments in Education,
5(1), 8. doi: 10.1186/s40536-017-0042-x
Carlson, J. E. , & von Davier, M. (2017). Item response theory. In R. E. Bennett & M. vonDavier
(Eds.), Advancing human assessment (pp. 133160). Cham, Switzerland: Springer Open. doi:
10.1007/978-3-319-58689-2
Cheng, Y. , Shao, C. , & Lathrop, Q. N. (2015). The mediated MIMIC model for understanding
the underlying mechanism of DIF. Educational and Psychological Measurement, 76(1), 4363.
doi: 10.1177/0013164415576187
Dorans, N. J. (2017). Contributions to the quantitative assessment of item, test, and score
fairness. In R. E. Bennett & M. vonDavier (Eds.), Advancing human assessment (pp. 204230).
Cham, Switzerland: Springer Open. doi: 10.1007/978-3-319-58689-2
Elder, C. (1996). The effect of language background on Foreign language test performance:
The case of Chinese, Italian, and modern Greek. Language Learning, 46(2), 233282. doi:
10.1111/j.1467-1770.1996.tb01236.x
Elder, C. , McNamara, T. , & Congdon, P. (2003). Rasch techniques for detecting bias in
performance assessments: An example comparing the performance of native and non-native
speakers on a test of academic English. Journal of Applied Measurement, 4(2), 181197.
Engelhard Jr., G. (2013). Invariant measurement: Using Rasch models in the social, behavioral
and health sciences. New York, NY: Routledge.
Engelhard Jr., G. , Kobrin, J. L. , & Wind, S. A. (2014). Exploring differential subgroup
functioning on SAT writing items: What happens when English is not a test takers best
language? International Journal of Testing, 14(4), 339359. doi: 10.1080/15305058.2014.931281
Evans, S. , & Morrison, B. (2012). Learning and using English at university: Lessons from a
longitudinal study in Hong Kong. The Journal of Asia TEFL, 9(2), 2147.
Ferne, T. , & Rupp, A. A. (2007). A synthesis of 15 years of research on DIF in language
testing: Methodological advances, challenges, and recommendations. Language Assessment
Quarterly, 4(2), 113148. doi: 10.1080/15434300701375923
Field, J. (1998). Skills and strategies: Towards a new methodology for listening. ELT Journal,
52(2), 110118.
Field, J. (2005). Intelligibility and the listener: The role of lexical stress. TESOL Quarterly, 39(3),
399423.
127 Filipi, A. (2012). Do questions written in the target language make foreign language
listening comprehension tests more difficult? Language Testing, 29(4), 511532. doi:
10.1177/0265532212441329
Freedle, R. , & Kostin, I. (1997). Predicting black and white differential item functioning in verbal
analogy performance. Intelligence, 24(3), 417444. doi: 10.1016/S0160-2896(97)90058-1
Geranpayeh, A. , & Kunnan, A. J. (2007). Differential item functioning in terms of age in the
Certificate in Advanced English examination. Language Assessment Quarterly, 4(2), 190222.
doi: 10.1080/15434300701375758
Gierl, M. J. , & Khaliq, S. N. (2006). Identifying sources of differential item and bundle
functioning on translated achievement tests: A confirmatory analysis. Journal of Educational
Measurement, 38(2), 164187. doi: 10.1111/j.1745-3984.2001.tb01121.x
Gnaldi, M. , & Bacci, S. (2016). Joint assessment of the latent trait dimensionality and observed
differential item functioning of students national tests. Quality and Quantity, 50(4), 14291447.
doi: 10.1007/s11135-015-0214-0
Harding, L. (2012). Accent, listening assessment and the potential for a shared-L1 advantage: A
DIF perspective. Language Testing, 29(2), 163180. doi: 10.1177/0265532211421161
Hidalgo, M. D. , & Gmez-Benito, J. (2010). Education measurement: Differential item
functioning. In P. Peterson , E. Baker , & B. McGaw (Eds.), International encyclopedia of
education (3rd ed., Vol. 4, pp. 3644). Oxford, UK: Elsevier.
Holland, P. W. , & Thayer, D. T. (1988). Differential item performance and the Mantel Haenszel
procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 7686). Hillsdale, NJ: Routledge.
Huang, X. (2010). Differential item functioning: The consequence of language, curriculum, or
culture? PhD dissertation, University of California.
Huang, X. , Wilson, M. , & Wang, L. (2016). Exploring plausible causes of differential item
functioning in the PISA science assessment: Language, curriculum or culture. Educational
Psychology, 36(2), 378390. doi: 10.1080/01443410.2014.946890
Jang, E. E. , & Roussos, L. (2009). Integrative analytic approach to detecting and interpreting
L2 vocabulary DIF. International Journal of Testing, 9(3), 238259. doi:
10.1080/15305050903107022
Karami, H. , & Salmani Nodoushan, M. A. (2011). Differential item functioning: Current problems
and future directions. International Journal of Language Studies, 5(3), 133142.
Koo, J. , Becker, B. J. , & Kim, Y.-S. (2014). Examining differential item functioning trends for
English language learners in a reading test: A meta-analytical approach. Language Testing,
31(1), 89109. doi: 10.1177/0265532213496097
Kunnan, A. J. (Ed.). (2000). Fairness and validation in language assessment: Selected papers
from the 19th Language Testing Research Colloquium, Orlando, Florida. Cambridge:
Cambridge University Press.
Kunnan, A. J. (2007). Test fairness, test bias, and DIF. Language Assessment Quarterly, 4(2),
109112. doi: 10.1080/15434300701375865
Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch
Measurement Transactions, 16(2), 878.
Linacre, J. M. (2003). PCA: Data variance: Explained, modeled and empirical. Rasch
Measurement Transactions, 13(3), 942943.
Linacre, J. M. (2018a). DIF-DPF-bias-interactions concepts. Winsteps Help for Rasch Analysis.
Retrieved from www.winsteps.com/winman/difconcepts.htm
128 Linacre, J. M. (2018b). Dimensionality: Contrasts & variances. Winsteps Help for Rasch
Analysis. Retrieved from www.winsteps.com/winman/principalcomponents.htm
Linacre, J. M. (2018c). Winsteps (Version 4.2.0) [Computer software]. Beaverton, OR:
Winsteps.com . Retrieved from www.winsteps.com
Linacre, J. M. , & Wright, B. D. (1987). Item bias: Mantel-Haenszel and the Rasch model.
Memorandum No. 39. Retrieved from MESA Psychometric Laboratory, University of Chicago:
www.rasch.org/memo39.pdf
Linacre, J. M. , & Wright, B. D. (1989). Mantel-Haenszel DIF and PROX are equivalent! Rasch
Measurement Transactions, 3(2), 5253.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale,
NJ: Lawrence Erlbaum Associates.
Lund, R. J. (1990). A taxonomy for teaching second language listening. Foreign Language
Annals, 23(2), 105115.
Luppescu, S. (1993). DIF detection examined: Which item has the real differential item
functioning? Rasch Measurement Transactions, 7(2), 285286.
Magis, D. , Raiche, G. , Beland, S. , & Gerard, P. (2011). A generalized logistic regression
procedure to detect differential item functioning among multiple groups. International Journal of
Testing, 11(4), 365386. doi: 10.1080/15305058.2011.602810
Magis, D. , Tuerlinckx, F. , & De Boeck, P. (2015). Detection of differential item functioning
using the lasso approach. Journal of Educational and Behavioral Statistics, 40(2), 111135. doi:
10.3102/1076998614559747
Mantel, N. , & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective
studies of disease. JNCI: Journal of the National Cancer Institute, 22(4), 719748. doi:
10.1093/jnci/22.4.719
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of
Educational Statistics, 7(2), 105118. doi: 10.2307/1164960
Mellenbergh, G. J. (2005a). Item bias detection: Classical approaches. In B. S. Everitt & D. C.
Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 2, pp. 967970). West
Sussex: Wiley.
Mellenbergh, G. J. (2005b). Item bias detection: Modern approaches. In B. S. Everitt & D. C.
Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 2, pp. 970974). West
Sussex: Wiley.
Millsap, R. E. , & Everson, H. T. (1993). Methodology review: Statistical approaches for
assessing measurement bias. Applied Psychological Measurement, 17(4), 297334. doi:
10.1177/014662169301700401
Mullis, I. V. S. , Martin, M. O. , Kennedy, A. M. , & Foy, P. (2007). PIRLS 2006 International
Report: IEAs Progress in International Reading Literacy Study in primary schools in 40
countries. Retrieved from https://timss.bc.edu/PDF/PIRLS2006_international_report.pdf
Nandakumar, R. (1993). Simultaneous DIF amplification and cancellation: Shealy-Stouts test for
DIF. Journal of Educational Measurement, 30(4), 293311.
Oliveri, M. E. , Ercikan, K. , Lyons-Thomas, J. , & Holtzman, S. (2016). Analyzing fairness
among linguistic minority populations using a latent class differential item functioning approach.
Applied Measurement in Education, 29(1), 1729. doi: 10.1080/08957347.2015.1102913
Oliveri, M. E. , Lawless, R. , Robin, F. , & Bridgeman, B. (2018). An exploratory analysis of
differential item functioning and its possible sources in a higher education 129admissions
context. Applied Measurement in Education, 31(1), 116. doi: 10.1080/08957347.2017.1391258
Ownby, R. L. , & Waldrop-Valverde, D. (2013). Differential item functioning related to age in the
reading subtest of the Test of Functional Health Literacy in adults. Journal of Aging Research,
2013, 6. doi: 10.1155/2013/654589
Pae, T.-I. (2004). DIF for examinees with different academic backgrounds. Language Testing,
21(1), 5373. doi: 10.1191/0265532204lt274oa
Pae, T.-I. (2012). Causes of gender DIF on an EFL language test: A multiple-data analysis over
nine years. Language Testing, 29(4), 533554. doi: 10.1177/0265532211434027
Park, G.-P. (2008). Differential item functioning on an English listening test across gender.
TESOL Quarterly, 42(1), 115123. doi: 10.1002/j.1545-7249.2008.tb00212.x
Roussos, L. , & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm. Applied
Psychological Measurement, 20(4), 355371. doi: 10.1177/014662169602000404
Roussos, L. , & Stout, W. (2004). Differential item functioning analysis: Detecting DIF items and
testing DIF hypotheses. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology
for the social sciences (pp. 107115). Thousand Oaks, CA: SAGE Publications. doi:
10.4135/9781412986311
Runnels, J. (2013). Measuring differential item and test functioning across academic disciplines.
Language Testing in Asia, 3(1), 111. doi: 10.1186/2229-0443-3-9
Salubayba, T. (2013). Differential item functioning detection in reading comprehension test
using Mantel-Haenszel, item response theory, and logical data analysis. The International
Journal of Social Sciences, 14 (1), 7682.
Salzberger, T. , Newton, F. J. , & Ewing, M. T. (2014). Detecting gender item bias and
differential manifest response behavior: A Rasch-based solution. Journal of Business Research,
67(4), 598607. doi: 10.1016/j.jbusres.2013.02.045
Sandilands, D. , Oliveri, M. E. , Zumbo, B. D. , & Ercikan, K. (2013). Investigating sources of
differential item functioning in international large-scale assessments using a confirmatory
approach. International Journal of Testing, 13(2), 152174. doi: 10.1080/15305058.2012.690140
Schauberger, G. , & Tutz, G. (2015). Detection of differential item functioning in Rasch models
by boosting techniques. British Journal of Mathematical and Statistical Psychology, 69(1),
80103. doi: 10.1111/bmsp.12060
Shealy, R. , & Stout, W. (1993a). An item response theory model for test bias. In P. W. Holland
& H. Wainer (Eds.), Differential item functioning (pp. 197239). Hillsdale, NJ: Lawrence Erlbaum
Associates.
Shealy, R. , & Stout, W. (1993b). A model-based standardization approach that separates true
bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF.
Psychometrika, 58(2), 159194. doi: 10.1007/BF02294572
Shimizu, Y. , & Zumbo, B. D. (2005). A logistic regression for differential item functioning primer.
Japan Language Testing Association, 7, 110124. doi: 10.20622/jltaj.7.0_110
Shin, S.-K. (2005). Did they take the same test? Examinee language proficiency and the
structure of language tests. Language Testing, 22(1), 3157. doi: 10.1191/0265532205lt296oa
Sireci, S. G. , & Rios, J. A. (2013). Decisions that make a difference in detecting differential item
functioning. Educational Research and Evaluation, 19(23), 170187. doi:
10.1080/13803611.2013.767621
130 Smith, R. M. (1994). Detecting item bias in the Rasch rating scale model. Educational and
Psychological Measurement, 54(4), 886896.
Song, M.-Y. (2008). Do divisible subskills exist in second language (L2) comprehension? A
structural equation modeling approach. Language Testing, 25(4), 435464. doi:
10.1177/0265532208094272
Swaminathan, H. , & Rogers, H. J. (1990). Detecting differential item functioning using logistic
regression procedures. Journal of Educational Measurement, 27(4), 361370. doi:
10.1111/j.1745-3984.1990.tb00754.x
Tennant, A. , & Pallant, J. (2007). DIF matters: A practical approach to test if differential item
functioning makes a difference. Rasch Measurement Transactions, 20(4), 10821084.
Teresi, J. A. (2006). Different approaches to differential item functioning in health applications:
Advantages, disadvantages and some neglected topics. Medical Care, 44(11), S152S170.
Thissen, D. , Steinberg, L. , & Wainer, H. (1993). Detection of differential item functioning using
the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item
functioning (pp. 67113). Hillsdale, NJ: Lawrence Erlbaum Associates.
Uiterwijk, H. , & Vallen, T. (2005). Linguistic sources of item bias for second generation
immigrants in Dutch tests. Language Testing, 22(2), 211234. doi: 10.1191/0265532205lt301oa
University of Macau . (2016, November 18). Registered students. Retrieved from
https://reg.umac.mo/qfacts/y2016/student/registered-students/
Urmston, A. , Raquel, M. R. , & Tsang, C. (2013). Diagnostic testing of Hong Kong tertiary
students English language proficiency: The development and validation of DELTA. Hong Kong
Journal of Applied Linguistics, 14(2), 6082.
Uusen, A. , & Mrsepp, M. (2012). Gender differences in reading habits among boys and girls of
basic school in Estonia. Procedia Social and Behavioral Sciences, 69, 17951804. doi:
doi.org/10.1016/j.sbspro.2012.12.129
van de Vijver, F. J. R. , & Poortinga, Y. H. (1997). Towards an integrated analysis of bias in
cross-cultural assessment. European Journal of Psychological Assessment, 13(1), 2937. doi:
10.1027/1015-5759.13.1.29
Wedman, J. (2018). Reasons for gender-related differential item functioning in a college
admissions test. Scandinavian Journal of Educational Research, 62(6), 959970. doi:
10.1080/00313831.2017.1402365
Wyse, A. (2013). DIF cancellation in the Rasch model. Journal of Applied Measurement, 14(2),
118128.
Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(2),
147170. doi: 10.1177/0265532209349465
Yan, X. (2017). The language situation in Macao. Current Issues in Language Planning, 18(1),
138. doi: 10.1080/14664208.2016.1125594
Yan, X. , Cheng, L. , & Ginther, A. (2018). Factor analysis for fairness: Examining the impact of
task type and examinee L1 background on scores of an ITA speaking test. Language Testing,
128. doi: 10.1177/0265532218775764
Yoo, H. , & Manna, V. F. (2015). Measuring English language workplace proficiency across
subgroups: Using CFA models to validate test score interpretation. Language Testing, 34(1),
101126. doi: 10.1177/0265532215618987
Yoo, H. , Manna, V. F. , Monfils, L. F. , & Oh, H.-J. (2018). Measuring English language
proficiency across subgroups: Using score equity assessment to evaluate test fairness.
Language Testing. doi: 10.1177/0265532218776040
131 Young, M. Y. C. (2011). English use and education in Macao. In A. Feng (Ed.), English
language education across greater China (pp. 114130). Bristol, UK: Multilingual
Matters/Channel View Publications.
Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where
it is now, and where it is going. Language Assessment Quarterly, 4(2), 223233. doi:
10.1080/15434300701375832
Zumbo, B. D. , Liu, Y. , Wu, A. D. , Shear, B. R. , Olvera Astivia, O. L. , & Ark, T. K. (2015). A
methodology for Zumbos third generation DIF analyses and the ecology of item responding.
Language Assessment Quarterly, 12(1), 136151. doi: 10.1080/15434303.2014.972559
Zumbo, B. D. , Liu, Y. , Wu, A. D. , Shear, B. R. , Olvera Astivia, O. L. , & Ark, T. K. (2015). A
methodology for Zumbos third generation DIF analyses and the ecology of item responding.
Language Assessment Quarterly, 12(1), 136151. doi: 10.1080/15434303.2014.972559

Application of the rating scale model and the partial credit model in
language assessment research
Adams, R. J. , Griffin, P. E. , & Martin, L . (1987). A latent trait method for measuring a
dimension in second language proficiency. Language Testing, 4(1), 927.
Agresti, A. (2012). Categorical data analysis (3rd ed.). Hoboken, NJ: John Wiley & Sons Inc.
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 6981.
Andrich, D. (1978a). Application of a psychometric rating model to ordered categories which are
scored with successive integers. Applied Psychological Measurement, 2(4), 581594.
Andrich, D. (1978b). A rating formulation for ordered response categories. Psychometrika,
43(4), 561573.
Aryadoust, V. (2012). How does sentence structure and vocabulary function as a scoring
criterion alongside other criteria in writing assessment? International Journal of Language
Testing, 2(1), 2858.
Aryadoust, V. , Goh, C. C. M. , & Kim, L. O. (2012). Developing and validating an academic
listening questionnaire. Psychological Test and Assessment Modeling, 54(3), 227256.
Bond, T. , & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the
human sciences (3rd ed.). New York, NY: Routledge.
Cai, L. , & Hansen, M. (2013). Limited-information goodness-of-fit testing of hierarchical item
factor models. British Journal of Mathematical and Statistical Psychology, 66(2), 245276.
Chalmers, R. P. (2012). Mirt: A multidimensional item response theory package for the R
environment. Journal of Statistical Software, 48(6), 129.
du Toit, M. (Ed.). (2003). IRT from SSI: Bilog-MG, multilog, parscale, testfact. Lincolnwood, IL:
Scientific Software International, Inc.
Eckes, T. (2012). Examinee-centered standard setting for large-scale assessments: The
prototype group method. Psychological Test and Assessment Modeling, 54(3), 257283.
Eckes, T. (2017). Setting cut scores on an EFL placement test using the prototype group
method: A receiver operating characteristic (ROC) analysis. Language Testing, 34(3), 383411.
Fischer, G. H. , & Molenaar, I. W. (Eds.). (1995). Rasch models: Foundations, recent
developments, and applications. New York, NY: Springer Science & Business Media.
Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing,
13(1), 2351.
151 Glas, C. A. W. (2016). Maximum-likelihood estimation. In W. van der Linden (Ed.),
Handbook of item response theory (Vol. 2, pp. 197216). Boca Raton, FL: CRC Press.
Haberman, S. J. (2006). Joint and conditional estimation for implicit models for tests with
polytomous item scores (ETS RR-06-03). Princeton, NJ: Educational Testing Service.
Haberman, S. J. (2016). Models with nuisance and incidental parameters. In W. van der Linden
(Ed.), Handbook of item response theory (Vol. 2, pp. 151170). Boca Raton, FL: CRC Press.
Hambleton, R. K. , & Han, N. (2005). Assessing the fit of IRT models to educational and
psychological test data: A five-step plan and several graphical displays. In R. R. Lenderking &
D. A. Revicki (Eds.), Advancing health outcomes research methods and clinical applications
(pp. 5777). McLean, VA: Degnon Associates.
Hambleton, R. K. , Swaminathan, H. , & Rogers, H. J. (1991). Fundamentals of item response
theory (Vol. 2). Newbury Park, CA: SAGE Publications.
Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of Applied
Measurement, 1, 152176.
Kunnan, A. J. (1991). Modeling relationships among some test-taker characteristics and
performance on tests of English as a foreign language. Unpublished doctoral dissertation,
University of California, LA.
Lee, Y.-W. , Gentile, C. , & Kantor, R. (2008). Analytic scoring of TOEFL CBT essays: Scores
from humans and E-rater (ETS RR-08-81). Princeton, NJ: Educational Testing Service.
Lee-Ellis, S. (2009). The development and validation of a Korean C-test using Rasch analysis.
Language Testing, 26(2), 245274.
Linacre, J. M. (1994). Many-facet Rasch measurement (2nd ed.). Chicago, IL: MESA Press.
Linacre, J. M. (2017a). Winsteps Rasch measurement computer program [Computer software].
Beaverton, OR: Winsteps.com.
Linacre, J. M. (2017b). Winsteps Rasch measurement computer program users guide .
Beaverton, OR: Winsteps.com.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149174.
Maydeu-Olivares, A. , & Joe, H. (2005). Limited and full information estimation and goodness-
of-fit testing in 2n contingency tables: A unified framework. Journal of the American Statistical
Association, 100(471), 10091020.
McNamara, T. , & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement
in language testing. Language Testing, 29(4), 555576.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied
Psychological Measurement, 16(2), 159176.
Neyman, J. , & Scott, E. L. (1948). Consistent estimation from partially consistent observations.
Econometrica, 16, 132.
Pollitt, A. , & Hutchinson, C. (1987). Calibrating graded assessments: Rasch partial credit
analysis of performance in writing. Language Testing, 4(1), 7297.
R Core Team . (2018). R: A language and environment for statistical computing. Vienna,
Austria: R Foundation for Statistical Computing.
Rose, N. , von Davier, M. , & Xu, X. (2010). Modeling nonignorable missing data with item
response theory (ETS RR-10-11). Princeton, NJ: Educational Testing Service.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika, Monograph Supplement No. 17.
152 Stewart, J. , Batty, A. O. , & Bovee, N. (2012). Comparing multidimensional and continuum
models of vocabulary acquisition: An empirical examination of the vocabulary knowledge scale.
TESOL Quarterly, 46(4), 695721.
Suzuki, Y. (2015). Self-assessment of Japanese as a second language: The role of experiences
in the naturalistic acquisition. Language Testing, 32(1), 6381.
Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language
Testing, 29(3), 325344.
Wells, C. S. , & Hambleton, R. K. (2016). Model fit with residual analysis. In W. van der Linden
(Ed.), Handbook of item response theory (Vol. 2, pp. 395413). Boca Raton, FL: CRC Press.
Wright, B. D. , & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch
Measurement Transactions, 8, 370.
Wright, B. D. , & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.
Many-facet Rasch measurement
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D. (1998). Thresholds, steps and rating scale conceptualization. Rasch Measurement Transactions, 12, 648–649.
Aryadoust, V. (2016). Gender and academic major bias in peer assessment of oral presentations. Language Assessment Quarterly, 13, 1–24.
Baker, B. A. (2012). Individual differences in rater decision-making style: An exploratory mixed-methods study. Language Assessment Quarterly, 9, 225–248.
Barkaoui, K. (2014). Multifaceted Rasch analysis for test evaluation. In A. J. Kunnan (Ed.), The companion to language assessment: Evaluation, methodology, and interdisciplinary themes (Vol. 3, pp. 1301–1322). Chichester: Wiley.
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20, 89–110.
Brennan, R. L. (2011). Generalizability theory and classical test theory. Applied Measurement in Education, 24, 1–21.
Carr, N. T. (2011). Designing and analyzing language tests. Oxford: Oxford University Press.
Casabianca, J. M., Junker, B. W., & Patz, R. J. (2016). Hierarchical rater models. In W. J. van der Linden (Ed.), Handbook of item response theory (Vol. 1, pp. 449–465). Boca Raton, FL: Chapman & Hall/CRC.
Coniam, D. (2010). Validating onscreen marking in Hong Kong. Asia Pacific Education Review, 11, 423–431.
Curcin, M., & Sweiry, E. (2017). A facets analysis of analytic vs. holistic scoring of identical short constructed-response items: Different outcomes and their implications for scoring rubric development. Journal of Applied Measurement, 18, 228–246.
DeCarlo, L. T., Kim, Y. K., & Johnson, M. S. (2011). A hierarchical rater model for constructed responses, with a signal detection rater model. Journal of Educational Measurement, 48, 333–356.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2, 197–221.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25, 155–185.
Eckes, T. (2012). Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly, 9, 270–292.
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating
rater-mediated assessments (2nd ed.). Frankfurt am Main: Peter Lang.
Eckes, T. (2017). Rater effects: Advances in item response modeling of human ratings Part I (Guest Editorial). Psychological Test and Assessment Modeling, 59(4), 443–452.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online training program for L2 writing assessment. Language Testing, 24, 37–64.
Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral,
and health sciences. New York, NY: Routledge.
Engelhard, G., Wang, J., & Wind, S. A. (2018). A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings. Psychological Test and Assessment Modeling, 60(1), 33–52.
Engelhard, G. , & Wind, S. A. (2018). Invariant measurement with raters and rating scales:
Rasch models for rater-mediated assessments. New York, NY: Routledge.
Hsieh, M. (2013). An application of multifaceted Rasch measurement in the Yes/No Angoff standard setting procedure. Language Testing, 30, 491–512.
Jeong, H. (2017). Narrative and expository genre effects on students, raters, and performance criteria. Assessing Writing, 31, 113–125.
Johnson, R. L. , Penny, J. A. , & Gordon, B. (2009). Assessing performance: Designing,
scoring, and validating performance tasks. New York, NY: Guilford.
Knoch, U., & Chapelle, C. A. (2018). Validation of rating processes within an argument-based framework. Language Testing, 35, 477–499.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12, 26–43.
Lamprianou, I. (2006). The stability of marker characteristics across tests of the same subject and across subjects. Journal of Applied Measurement, 7, 192–205.
Lane, S., & Iwatani, E. (2016). Design of performance assessments in education. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 274–293). New York, NY: Routledge.
Lee, Y.-W., & Kantor, R. (2015). Investigating complex interaction effects among facet elements in an ESL writing test consisting of integrated and independent tasks. Language Research, 51(3), 653–678.
Li, H. (2016). How do raters judge spoken vocabulary? English Language Teaching, 9(2), 102–115.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
Linacre, J. M. (2000). Comparing Partial Credit Models (PCM) and Rating Scale Models (RSM).
Rasch Measurement Transactions, 14, 768.
Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch
Measurement Transactions, 16, 878.
Linacre, J. M. (2003). The hierarchical rater model from a Rasch perspective. Rasch
Measurement Transactions, 17, 928.
Linacre, J. M. (2006). Demarcating category intervals. Rasch Measurement Transactions, 19, 1041–1043.
Linacre, J. M. (2010). Transitional categories and usefully disordered thresholds. Online
Educational Research Journal, 1(3).
Linacre, J. M. (2018a). Facets Rasch measurement computer program (Version 3.81)
[Computer software]. Chicago, IL: Winsteps.com.
Linacre, J. M. (2018b). A user's guide to FACETS: Rasch-model computer programs. Chicago, IL: Winsteps.com. Retrieved from www.winsteps.com/facets.htm
Looney, M. A. (2012). Judging anomalies at the 2010 Olympics in men's figure skating. Measurement in Physical Education and Exercise Science, 16, 55–68.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29, 555–576.
Mulqueen, C., Baker, D. P., & Dismukes, R. K. (2002). Pilot instructor rater training: The utility of the multifacet item response theory model. International Journal of Aviation Psychology, 12, 287–303.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4, 386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5, 189–227.
Norris, J., & Drackert, A. (2018). Test review: TestDaF. Language Testing, 35, 149–157.
Randall, J., & Engelhard, G. (2009). Examining teacher grades using Rasch measurement theory. Journal of Educational Measurement, 46, 1–18.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL:
University of Chicago Press (Original work published 1960).
Reed, D. J., & Cohen, A. D. (2001). Revisiting raters and ratings in oral language assessment. In C. Elder et al. (Eds.), Experimenting with uncertainty: Essays in honour of Alan Davies (pp. 82–96). Cambridge: Cambridge University Press.
Robitzsch, A., & Steinfeld, J. (2018). Item response models for human ratings: Overview, estimation methods and implementation in R. Psychological Test and Assessment Modeling, 60(1), 101–138.
Schaefer, E. (2016). Identifying rater types among native English-speaking raters of English essays written by Japanese university students. In V. Aryadoust & J. Fox (Eds.), Trends in language assessment research and practice: The view from the Middle East and the Pacific Rim (pp. 184–207). Newcastle upon Tyne: Cambridge Scholars.
Springer, D. G., & Bradley, K. D. (2018). Investigating adjudicator bias in concert band evaluations: An application of the many-facets Rasch model. Musicae Scientiae, 22, 377–393.
Till, H., Myford, C., & Dowell, J. (2013). Improving student selection using multiple mini-interviews with multifaceted Rasch modeling. Academic Medicine, 88, 216–223.
Wang, J., Engelhard, G., Raczynski, K., Song, T., & Wolfe, E. W. (2017). Evaluating rater accuracy and perception for integrated writing assessments using a mixed-methods approach. Assessing Writing, 33, 36–47.
Wilson, M. (2011). Some notes on the term: Wright map. Rasch Measurement Transactions, 25,
1331.
Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35, 161–192.
Wind, S. A., & Schumacker, R. E. (2017). Detecting measurement disturbances in rater-mediated assessments. Educational Measurement: Issues and Practice, 36(4), 44–51.
Winke, P., Gass, S., & Myford, C. (2013). Raters' L2 background as a potential source of bias in rating oral performance. Language Testing, 30, 231–252.
Wolfe, E. W., & Song, T. (2015). Comparison of models and indices for detecting rater centrality. Journal of Applied Measurement, 16(3), 228–241.
Wright, B. D. , & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.
Wright, B. D. , & Stone, M. H. (1979). Best test design. Chicago, IL: MESA Press.
Wu, M. (2017). Some IRT-based analyses for interpreting rater effects. Psychological Test and Assessment Modeling, 59(4), 453–470.
Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31, 501–527.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111–153). Westport, CT: American Council on Education/Praeger.
Zhang, J. (2016). Same text different processing? Exploring how raters' cognitive and metacognitive strategies influence rating accuracy in essay scoring. Assessing Writing, 27, 37–53.

Analysis of differences between groups


Agresti, A. , & Finlay, B. (2009). Statistical methods for the social sciences (4th ed.). London:
Pearson Prentice Hall.
Allen, I. E., & Seaman, C. A. (2007). Likert scales and data analyses. Quality Progress, 40(7), 64–65.
Argyrous, G. (1996). Statistics for social research. Melbourne: MacMillan Education.
Barkaoui, K. (2014). Examining the impact of L2 proficiency and keyboarding skills on scores on TOEFL-iBT writing tasks. Language Testing, 31(2), 241–259.
Batty, A. O. (2015). A comparison of video- and audio-mediated listening tests with many-facet Rasch modeling and differential distractor functioning. Language Testing, 32(1), 3–20.
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 Word Level and University Word Level Vocabulary Tests. Language Testing, 16(2), 131–162.
Brown, J. D. (2008). Effect size and eta squared. Shiken: JALT Testing & Evaluation SIG Newsletter, 12(2), 38–43.
Casey, L. B., Miller, N. D., Stockton, M. B., & Justice, W. V. (2016). Assessing writing in elementary schools: Moving away from a focus on mechanics. Language Assessment Quarterly, 13(1), 42–54.
Chae, S. (2003). Adaptation of a picture-type creativity test for pre-school children. Language Testing, 20(2), 179–188.
Chang, Y. (2006). On the use of the immediate recall task as a measure of second language reading comprehension. Language Testing, 23(4), 520–543.
Chapelle, C. A., Chung, Y., Hegelheimer, V., Pendar, N., & Xu, J. (2010). Towards a computer-delivered test of productive grammatical ability. Language Testing, 27(4), 443–469.
Cheng, L., Andrews, S., & Yu, Y. (2010). Impact and consequences of school-based assessment (SBA): Students' and parents' views of SBA in Hong Kong. Language Testing, 28(2), 221–249.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Coniam, D. (2000). The use of audio or video comprehension as an assessment instrument in the certification of English language teachers: A case study. System, 29(1), 1–14.
Connolly, T. , & Sluckin, W. (1971). An introduction to statistics for the social sciences. London:
MacMillan Education.
Coyle, Y., & Gómez Gracia, R. (2014). Using songs to enhance L2 vocabulary acquisition in preschool children. ELT Journal, 68(3), 276–285.
Currie, M., & Chiramanee, T. (2010). The effect of the multiple-choice item format on the measurement of knowledge of language structure. Language Testing, 27(4), 471–491.
Dancey, P. , & Reidy, J. (2011). Statistics without maths for psychology. New York, NY:
Prentice Hall/Pearson.
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117–135.
Davis, S. F. , & Smith, R. A. (2005). An introduction to statistics and research methods:
Becoming a psychological detective. Upper Saddle River, NJ: Pearson/Prentice Hall.
East, M. (2007). Bilingual dictionaries in tests of L2 writing proficiency: Do they make a difference? Language Testing, 24(3), 331–353.
Elgort, I. (2012). Effects of L1 definitions and cognate status of test items on the Vocabulary Size Test. Language Testing, 30(2), 253–272.
Fidalgo, A., Alavi, S., & Amirian, S. (2014). Strategies for testing statistical and practical significance in detecting DIF with logistic regression models. Language Testing, 31(4), 433–451.
Gebril, A., & Plakans, L. (2013). Toward a transparent construct of reading-to-write tasks: The interface between discourse features and proficiency. Language Assessment Quarterly, 10(1), 9–27.
Gordon, R. A. (2012). Applied statistics for the social and health sciences. New York, NY:
Routledge.
Grabe, W. (2009). Reading in a second language: Moving from theory to practice. Cambridge:
Cambridge University Press.
Green, R. (2013). Statistical analyses for language testers. New York, NY: Palgrave Macmillan.
He, L., & Shi, L. (2012). Topical knowledge and ESL writing. Language Testing, 29(3), 443–464.
Ilc, G., & Stopar, A. (2015). Validating the Slovenian national alignment to CEFR: The case of the B2 reading comprehension examination in English. Language Testing, 32(4), 443–462.
Kiddle, T., & Kormos, J. (2011). The effect of mode of response on a semidirect test of oral proficiency. Language Assessment Quarterly, 8(4), 342–360.
Kim, A. Y. (2015). Exploring ways to provide diagnostic feedback with an ESL placement test: Cognitive diagnostic assessment of L2 reading ability. Language Testing, 32(2), 227–258.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95(2), 163–182.
Knoch, U., & Elder, C. (2010). Validity and fairness implications of varying time conditions on a diagnostic test of academic English writing proficiency. System, 38(1), 63–74.
Kobayashi, M. (2002). Method effects on reading comprehension test performance: Text organization and response format. Language Testing, 19(2), 193–220.
Kormos, J. (1999). Simulating conversations in oral proficiency assessment: A conversation analysis of role plays and non-scripted interviews in language exams. Language Testing, 16(2), 163–188.
Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing, 31(2), 177–204.
Lee, H., & Winke, P. (2012). The differences among three-, four-, and five-option-item formats in the context of a high-stakes English-language listening test. Language Testing, 30(1), 99–123.
Lehmann, E. L., & Romano, J. P. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3), 1138–1154.
Leong, C., Ho, M., Chang, J., & Hau, K. (2012). Differential importance of language components in determining secondary school students' Chinese reading literacy performance. Language Testing, 30(4), 419–439.
Levine, T. R., & Hullett, C. R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28(4), 612–625.
Li, X., & Brand, M. (2009). Effectiveness of music on vocabulary acquisition, language usage, and meaning for mainland Chinese ESL learners. Contributions to Music Education, 73–84.
Mann, W., Roy, P., & Morgan, G. (2016). Adaptation of a vocabulary test from British Sign Language to American Sign Language. Language Testing, 33(1), 3–22.
Neri, A., Mich, O., Gerosa, M., & Giuliani, D. (2008). The effectiveness of computer assisted pronunciation training for foreign language learning by children. Computer Assisted Language Learning, 21(5), 393–408.
Nitta, R., & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre-task planning effects on paired oral test performance. Language Testing, 31(2), 147–175.
Ockey, G. (2007). Construct implications of including still image or video in computer-based listening tests. Language Testing, 24(4), 517–537.
O'Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance. Language Testing, 19(3), 277–295.
Ott, R. L. , & Longnecker, M. (2001). An introduction to statistical methods and data analysis
(5th ed.). Belmont, CA: Duxbury Press.
Paquette, K. R., & Rieg, S. A. (2008). Using music to support the literacy development of young English language learners. Early Childhood Education Journal, 36(3), 227–232.
Richardson, J. T. (2011). Eta squared and partial eta squared as measures of effect size in educational research. Educational Research Review, 6(2), 135–147.
Saricoban, A., & Metin, E. (2000). Songs, verse and games for teaching grammar. The Internet TESL Journal, 6(10), 1–7.
Sasaki, M. (2000). Effects of cultural schemata on students' test-taking processes for cloze tests: A multiple data source approach. Language Testing, 17(1), 85–114.
Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465–493.
Schmitt, N., Ching, W., & Garras, J. (2011). The Word Associates Format: Validation evidence. Language Testing, 28(1), 105–126.
Schoonen, R., & Verhallen, M. (2008). The assessment of deep word knowledge in young first and second language learners. Language Testing, 25(2), 211–236.
Seaman, M. A. , Levin, J. R. , & Serlin, R. C. (1991). New developments in pairwise multiple
comparisons: Some powerful and practicable procedures. Psychological Bulletin, 110(3), 577.
Shin, S. Y., & Ewert, D. (2015). What accounts for integrated reading-to-write task scores? Language Testing, 32(2), 259–281.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
Vermeer, A. (2000). Coming to grips with lexical richness in spontaneous speech data. Language Testing, 17(1), 65–83.
Wigglesworth, G., & Elder, C. (2010). An investigation of the effectiveness and validity of planning time in speaking test tasks. Language Assessment Quarterly, 7(1), 1–24.

Application of ANCOVA and MANCOVA in language assessment research


* Bae, J., & Lee, Y.-S. (2011). The validation of parallel test forms: Mountain and beach picture series for assessment of language skills. Language Testing, 28(2), 155–177. doi: 10.1177/0265532210382446
* Barkaoui, K. (2014). Examining the impact of L2 proficiency and keyboarding skills on scores on TOEFL-iBT writing tasks. Language Testing, 31(2), 241–259. doi: 10.1177/0265532213509810
* Becker, A. (2016). Student-generated scoring rubrics: Examining their formative value for improving ESL students' writing performance. Assessing Writing, 29, 15–24. doi: 10.1016/j.asw.2016.05.002
* Bochner, J. H., Samar, V. J., Hauser, P. C., Garrison, W. M., Searls, J. M., & Sanders, C. A. (2016). Validity of the American Sign Language Discrimination Test. Language Testing, 33(4), 473–495. doi: 10.1177/0265532215590849
* Bridgeman, B., Trapani, C., & Bivens-Tatum, J. (2011). Comparability of essay question variants. Assessing Writing, 16(4), 237–255. doi: 10.1016/j.asw.2011.06.002
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum Associates.
* Diab, N. M. (2011). Assessing the relationship between different types of student feedback and the quality of revised writing. Assessing Writing, 16(4), 274–292. doi: 10.1016/j.asw.2011.08.001
* Erling, E. J., & Richardson, J. T. E. (2010). Measuring the academic skills of university students: Evaluation of a diagnostic procedure. Assessing Writing, 15(3), 177–193. doi: 10.1016/j.asw.2010.08.002
Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. doi: 10.3758/BF03193146
Field, A. (2009). Discovering statistics using SPSS. London: SAGE Publications.
* Green, A. B. (2005). EAP study recommendations and score gains on the IELTS Academic Writing Test. Assessing Writing, 10(1), 44–60. doi: 10.1016/j.asw.2005.02.002
* Huang, S.-C. (2010). Convergent vs. divergent assessment: Impact on college EFL students' motivation and self-regulated learning strategies. Language Testing, 28(2), 251–271. doi: 10.1177/0265532210392199
* Huang, S.-C. (2015). Setting writing revision goals after assessment for learning. Language Assessment Quarterly, 12(4), 363–385. doi: 10.1080/15434303.2015.1092544
Huberty, C. J., & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analyses. Psychological Bulletin, 105(2), 302–308. doi: 10.1037/0033-2909.105.2.302
IBM Corporation (2016). SPSS for Windows (Version 24). Armonk, NY: IBM Corporation.
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., & Levin, J. R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68(3), 350–386. doi: 10.3102/00346543068003350
* Khonbi, Z., & Sadeghi, K. (2012). The effect of assessment type (self vs. peer vs. teacher) on Iranian University EFL students' course achievement. Language Testing in Asia, 2(4), 47–74. doi: 10.1186/2229-0443-2-4-47
* Kobrin, J. L., Deng, H., & Shaw, E. J. (2011). The association between SAT prompt characteristics, response features, and essay scores. Assessing Writing, 16(3), 154–169. doi: 10.1016/j.asw.2011.01.001
* Ling, G. (2017a). Are TOEFL iBT writing test scores related to keyboard type? A survey of keyboard-related practices at testing centers. Assessing Writing, 31, 1–12. doi: 10.1016/j.asw.2016.04.001
* Ling, G. (2017b). Is writing performance related to keyboard type? An investigation from examinees' perspectives on the TOEFL iBT. Language Assessment Quarterly, 14(1), 36–53. doi: 10.1080/15434303.2016.1262376
Logan, S., & Johnston, R. (2009). Gender differences in reading ability and attitudes: Examining where these differences lie. Journal of Research in Reading, 32(2), 199–214. doi: 10.1111/j.1467-9817.2008.01389.x
Mair, P., & Wilcox, R. (2017). WRS2: A collection of robust statistical methods [R package version 0.92]. Retrieved from http://CRAN.R-project.org/package=WRS2
McKenna, M. C., Conradi, K., Lawrence, C., Jang, B. G., & Meyer, J. P. (2012). Reading attitudes of middle school students: Results of a U.S. survey. Reading Research Quarterly, 47(3), 283–306. doi: 10.1002/rrq.021
Nicol, A. A. M. , & Pexman, P. M. (2010). Presenting your findings: A practical guide for creating
tables. Washington, DC: American Psychological Association.
* Ockey, G. J. (2009). The effects of group members' personalities on a test taker's L2 group oral discussion test scores. Language Testing, 26(2), 161–186. doi: 10.1177/0265532208101005
Organisation for Economic Co-operation and Development (OECD) . (2000). Measuring student
knowledge and skills: The PISA 2000 assessment of reading, mathematical and scientific
literacy. Paris: OECD Publishing.
Organisation for Economic Co-operation and Development (OECD) . (2009). PISA 2009
assessment framework key competencies in reading, mathematics and science programme for
international student assessment. Paris: OECD Publishing.
Organisation for Economic Co-operation and Development (OECD) . (2012). PISA 2009
technical report. Paris: OECD Publishing.
Organisation for Economic Co-operation and Development (OECD) . (2014). PISA 2012
technical report. Paris: OECD Publishing.
Petscher, Y. (2010). A meta-analysis of the relationship between student attitudes towards reading and achievement in reading. Journal of Research in Reading, 33(4), 335–355. doi: 10.1111/j.1467-9817.2009.01418.x
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research. Studies in Second Language Acquisition, 35(4), 655–687. doi: 10.1017/S0272263113000399
Plonsky, L., & Oswald, F. L. (2014). How big is big? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912. doi: 10.1111/lang.12079
* Riazi, A. M. (2016). Comparing writing performance in TOEFL-iBT and academic assignments: An exploration of textual features. Assessing Writing, 28, 15–27. doi: 10.1016/j.asw.2016.02.001
* Ross, J. A., Rolheiser, C., & Hogaboam-Gray, A. (1999). Effects of self-evaluation training on narrative writing. Assessing Writing, 6(1), 107–132. doi: 10.1016/S1075-2935(99)00003-3
* Shohamy, E., & Inbar, O. (1991). Validation of listening comprehension tests: The effect of text and question type. Language Testing, 8(1), 23–40. doi: 10.1177/026553229100800103
* Sundeen, T. H. (2014). Instructional rubrics: Effects of presentation options on writing quality. Assessing Writing, 21, 74–88. doi: 10.1016/j.asw.2014.03.003
Tabachnick, B. G. , & Fidell, L. S. (2013). Using multivariate statistics (4th ed.). Boston, MA:
Allyn and Bacon.
von Davier, M., Gonzalez, E., & Mislevy, R. (2009). What are plausible values and why are they useful? In M. von Davier & D. Hastedt (Eds.), Issues and methodologies in large-scale assessments (IERI Monograph Series, pp. 9–36). Princeton, NJ: International Association for the Evaluation of Educational Achievement and Educational Testing Service.
* Wagner, E. (2010). The effect of the use of video texts on ESL listening test-taker performance. Language Testing, 27(4), 493–513. doi: 10.1177/0265532209355668
* Windsor, J. (1999). Effect of semantic inconsistency on sentence grammaticality judgements for children with and without language-learning disabilities. Language Testing, 16(3), 293–313. doi: 10.1177/026553229901600304
* Xi, X. (2005). Do visual chunks and planning impact performance on the graph description task in the SPEAK exam? Language Testing, 22(4), 463–508. doi: 10.1191/0265532205lt305oa
* Zeidner, M. (1986). Are English language aptitude tests biased towards culturally different minority groups? Some Israeli findings. Language Testing, 3(1), 80–98. doi: 10.1177/026553228600300104
* Zeidner, M. (1987). A comparison of ethnic, sex and age bias in the predictive validity of English language aptitude tests: Some Israeli data. Language Testing, 4(1), 55–71. doi: 10.1177/026553228700400106

Application of linear regression in language assessment


Abedi, J. (2008). Classification system for English language learners: Issues and recommendations. Educational Measurement: Issues and Practice, 27(3), 17–31.
Aguinis, H., Petersen, S. A., & Pierce, C. A. (1999). Appraisal of the homogeneity of error variance assumption and alternatives to multiple regression for estimating moderating effects of categorical variables. Organizational Research Methods, 2(4), 315–339.
Allison, P. (2012). When can you safely ignore multicollinearity? Retrieved from https://statisticalhorizons.com/multicollinearity
Benoit, K. (2011). Linear regression models with logarithmic transformations. Retrieved from https://pdfs.semanticscholar.org/169c/c9bbbd77cb7cec23481f6ecb-2ce071e4e94e.pdf
Billings, A. B. (2016). Linear regression analysis using PROC GLM. Retrieved from
www.stat.wvu.edu/~abilling/STAT521_ProcGLMRegr.pdf
Bohrnstedt, G. W., & Carter, T. M. (1971). Robustness in regression analysis. Sociological Methodology, 3, 118–146.
Box, G. E. P. (1979). Robustness in the strategy of scientific model building. In R. L. Launer and G. N. Wilkinson (Eds.), Robustness in statistics (pp. 201–236). New York, NY: Academic Press.
Cody, R. P. , & Smith, J. K. (1991). Applied statistics and the SAS programming language.
Englewood Cliffs, NJ: Prentice Hall.
Cummins, J. (2008). BICS and CALP: Empirical and theoretical status of the distinction. In B. Street & N. H. Hornberger (Eds.), Encyclopedia of language and education (2nd ed., Vol. 2, pp. 71–83). New York, NY: Springer Science + Business Media LLC.
Darlington, R. B. (1968). Multiple regression in psychological research and practice. Psychological Bulletin, 69(3), 161–182.
Gottlieb, M. (2004). English language proficiency standards: Kindergarten through grade 12.
Retrieved from www.wida.us/standards/Resource_Guide_web.pdf
Grace-Martin, K. (2012). Assessing the fit of regression models. Retrieved from
www.cscu.cornell.edu/news/statnews/stnews68.pdf
Green, S. (1991). How many subjects does it take to do a regression analysis? Multivariate Behavioral Research, 26(3), 499–510.
Halle, T., Hair, E., Wandner, L., McNamara, M., & Chien, N. (2012). Predictors and outcomes of early versus later English language proficiency among English language learners. Early Childhood Research Quarterly, 27(1), 1–20.
Hosmer, D. W. , & Lemeshow, S. (1999). Applied survival analysis: Regression modeling of
time to event data. New York, NY: John Wiley & Sons, Inc.
Hoyt, W., Leierer, S., & Millington, M. (2006). Analysis and interpretation of findings using multiple regression techniques. Rehabilitation Counseling Bulletin, 49(4), 223–233.
Jaccard, J., Guilamo-Ramos, V., Johansson, M., & Bouris, A. (2006). Multiple regression analyses in clinical child and adolescent psychology. Journal of Clinical Child and Adolescent Psychology, 35(3), 456–479.
James, G. , Witten, D. , Hastie, T. , & Tibshirani, R. (2013). An introduction to statistical
learning: With applications in R. New York, NY: Springer Science + Business Media LLC.
Keith, T. (2006). Multiple regression and beyond. San Antonio, TX: Pearson Education.
Linn, R. L. (2008). Validation of uses and interpretations of state assessments. Washington,
DC: Technical Issues in Large-Scale Assessment, Council of Chief State School Officers.
Nau, R. (2017). Statistical forecasting: Notes on regression and time series analysis. Retrieved from http://people.duke.edu/~rnau/testing.htm
Osborne, J. W., & Waters, E. (2002). Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research and Evaluation, 8(2), 1–5.
Pedhazur, E. J. (1982). Multiple regression in behavioral research: Explanation and prediction.
San Antonio, TX: Harcourt College Publication.
Poole, M., & O'Farrell, P. (1971). The assumptions of the linear regression model. Transactions of the Institute of British Geographers, 52, 145–158.
Schmidt, F. L. (1971). The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31(3), 699–714.
Seo, D., & Taherbhai, H. (2018). What makes Asian English language learners tick? The Asia-Pacific Education Researcher, 27(4), 291–302.
Seo, D., Taherbhai, H., & Franz, R. (2016). Psychometric evaluation and discussions of English language learners' learning strategies in the listening domain. International Journal of Listening, 30(1–2), 47–66.
Shieh, G. (2010). On the misconception of multicollinearity in detection of moderating effects: Multicollinearity is not always detrimental. Multivariate Behavioral Research, 45(3), 483–507.
Shoebottom, P. (2017). The factors that influence the acquisition of a second language. Retrieved from http://esl.fis.edu/teachers/support/factors.htm
Shtatland, E. S., Cain, E., & Barton, M. B. (n.d.). The perils of stepwise logistic regression and how to escape them using information criteria and the output delivery system (Paper 222-26). Harvard Pilgrim Health Care, Harvard Medical School, Boston, MA. Retrieved from www2.sas.com/proceedings/sugi26/p222-26.pdf
Stevens, J. (1992). Applied multivariate statistics for the social sciences (1st ed.). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Van Roekel, D. (2008). English language learners face unique challenges. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.190.2895&rep=rep1&type=pdf
Weisberg, S. (2005). Applied linear regression (3rd ed.). Hoboken, NJ: John Wiley & Sons, Inc.
Yu, C. (2016). Multicollinearity, variance inflation and orthogonalization in regression. Retrieved
from www.creative-wisdom.com/computer/sas/collinear_VIF.html
Zoghil, M., Kazemi, S. A., & Kalani, A. (2013). The effect of gender on language learning. Retrieved from http://jnasci.org/wp-content/uploads/2013/12/1124-1128.pdf

Application of exploratory factor analysis in language assessment


Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Bachman, L. F., Davidson, F., & Milanovic, M. (1996). The use of test method characteristics in the content analysis and design of EFL proficiency tests. Language Testing, 13(2), 125–150.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford University Press.
Bryant, F. B., & Yarnold, P. R. (1995). Principal-components analysis and exploratory and confirmatory factor analysis. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding multivariate statistics (pp. 99–136). Washington, DC: American Psychological Association.
Carr, N. T. (2006). The factor structure of test task characteristics and examinee performance. Language Testing, 23(3), 269–289.
Celce-Murcia, M., & Larsen-Freeman, D. (1999). The grammar book: An ESL/EFL teacher's course (2nd ed.). Boston, MA: Heinle & Heinle Publishers.
Cohen, A. D. (2006). The coming of age of research on test-taking strategies. Language Assessment Quarterly, 3(4), 307–331.
Cohen, A. D. (2013). Using test-wiseness strategy research in task development. In A. J. Kunnan (Ed.), The companion to language assessment: Evaluation, methodology, and interdisciplinary themes (Vol. 2, pp. 893–905). Chichester: John Wiley & Sons Ltd.
Cohen, A. D. , & Upton, T. A. (2006). Strategies in responding to the new TOEFL reading tasks.
Monograph No. 33. Princeton, NJ: ETS. Retrieved from www.ets.org/Media/Research/pdf/RR-
06-06.pdf
Comrey, A. L. , & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Dunteman, G. H. (1989). Principal components analysis. Thousand Oaks, CA: SAGE
Publications.
Everitt, S. (1975). Multivariate analysis: The need for data, and other problems. British Journal of Psychiatry, 126(3), 237–240.
Fang, Z. (2010). A complete guide to the college English test band 4. Beijing: Foreign Language
Teaching and Research Press.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Singapore: SAGE Publications.
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34(10), 906–911.
Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading item difficulty: Implications for construct validity. Language Testing, 10(2), 133–170.
Gagne, E. D. , Yekovich, C. W. , & Yekovich, F. R. (1993). The cognitive psychology of school
learning (2nd ed.). New York, NY: Harper Collins College Publishers.
Green, S. B. , & Salkind, N. J. (2008). Using SPSS for Windows and Macintosh: Analyzing and
understanding data (5th ed.). Upper Saddle River, NJ: Pearson Education Inc.
Guadagnoli, E., & Velicer, W. F. (1988). Relation of sample size to the stability of component patterns. Psychological Bulletin, 103(2), 265–275.
Hair, J. , Black, B. , Babin, B. , Anderson, R. , & Tatham, R. (2010). Multivariate data analysis
(7th ed.). Upper Saddle River, NJ: Prentice-Hall.
Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published research: Common errors and some comment on improved practice. Educational and Psychological Measurement, 66(3), 393–416.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1), 141–151.
Kline, P. (1979). Psychometrics and psychology. London: Academic Press.
Kunnan, A. J. (1995). Test taker characteristics and test performance: A structural modeling
approach. Cambridge: Cambridge University Press.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Mokhtari, K., Sheorey, R., & Reichard, C. A. (2008). Measuring the reading strategies of first and second language readers. In K. Mokhtari & R. Sheorey (Eds.), Reading strategies of first- and second language learners (pp. 85–98). Norwood, MA: Christopher-Gordon Publishers, Inc.
Paris, S. G., & Winograd, P. (1990). How metacognition can promote academic learning and instruction. In B. F. Jones & L. Idol (Eds.), Dimensions of thinking and cognitive instruction (pp. 15–51). Hillsdale, NJ: Lawrence Erlbaum Associates.
Pett, M. , Lackey, N. , & Sullivan, J. (2003). Making sense of factor analysis. Thousand Oaks,
CA: SAGE Publications.
Phakiti, A. (2008). Construct validation of Bachman and Palmer's (1996) strategic competence model over time in EFL reading tests. Language Testing, 25(2), 237–272.
Pressley, M., & Afflerbach, P. (1995). Verbal protocols of reading: The nature of constructively responsive reading. Hillsdale, NJ: Lawrence Erlbaum Associates.
Purpura, J. E. (1999). Learner strategy use and performance on language tests: A structural
equation modeling approach. Cambridge: Cambridge University Press.
Römhild, A. (2008). Investigating the invariance of the ECPE factor structure across different proficiency levels. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 6, 29–55.
Sawaki, Y., Quinlan, T., & Lee, Y.-W. (2013). Understanding learner strengths and weaknesses: Assessing performance on an integrated writing task. Language Assessment Quarterly, 10(1), 73–95.
Song, X., & Cheng, L. (2006). Language learner strategy use and test performance of Chinese learners of English. Language Assessment Quarterly, 3(3), 243–266.
Stevens, J. P. (2002). Applied multivariate statistics for the social sciences (4th ed.). London:
Lawrence Erlbaum Associates.
Swaim, V. S. (2009). Determining the number of factors in data containing a single outlier: A
study of factor analysis of simulated data. Unpublished PhD dissertation, Louisiana State
University, Louisiana.
Tabachnick, B. G. , & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Upper Saddle
River, NJ: Pearson Education Inc.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts
and applications. Washington, DC: American Psychological Association.
Vandergrift, L., Goh, C. C. M., Mareschal, C. J., & Tafaghodtari, M. H. (2006). The metacognitive awareness listening questionnaire: Development and validation. Language Learning, 56(3), 431–462.
Zhang, L. M. (2014). A structural equation modeling approach to investigating test takers' strategy use and their EFL reading test performance. Asian EFL Journal, 16(1), 153–188.
Zhang, L. M. (2017). Metacognitive and cognitive strategy use and reading comprehension: A
structural equation modelling approach. Singapore: Springer Nature Singapore Pte Ltd.
Zwick, W. R., & Velicer, W. F. (1982). Factors influencing four rules for determining the number of components to retain. Multivariate Behavioral Research, 17(2), 253–269.
Zwick, W. R., & Velicer, W. F. (1986). Comparison of five rules for determining the number of components to retain. Psychological Bulletin, 99(3), 432–442.
