
iTEP Academic

VALIDITY &
RELIABILITY REPORT
TABLE OF CONTENTS

About this Report 4

Chapter 1 Introduction 5
Approach and Rationale for the Development of iTEP Academic 6
Chapter 2 Detailed Description of iTEP Academic 7
Theoretical Model for Language Assessment 7
Description of iTEP Academic Sections 8
Grammar Section 8
Listening Section 9
Reading Section 10
Writing Section 11
Speaking Section 12
Test Administration 13
Delivery Method 13
Examinee Experience 13
Scoring/Grading 14
Proficiency Levels 14
Chapter 3 iTEP Academic Development Process, Reliability, and Validity 15
Development Process 15
Reliability 16
Internal Consistency Reliability 16
Test-Retest Reliability 18
Rater Agreement 19
Validity 22
Content Validity 22
Convergent and Discriminant Validity 23
References 25
Appendix A: Examinee Pre-Assessment Modules and Instructions 26
Appendix B: iTEP Ability Guide 29
Appendix C: Summary of Steps to Minimize Content-Irrelevant Test Material 31

LIST OF FIGURES
Figure 1. Continuous Cycle of Item Development 16

LIST OF TABLES
Table 1. Internal Consistency Reliability Estimates for Relevant iTEP Academic Sections 18
Table 2. Test-Retest Reliability Estimates for iTEP Academic 19
Table 3. Raw Rater Agreement Analysis – Speaking Section 20
Table 4. Raw Rater Agreement Analysis – Writing Section 21
Table 5. rWG Rater Agreement Statistics – Speaking Section 22
Table 6. rWG Rater Agreement Statistics – Writing Section 22
Table 7. iTEP Academic Section Intercorrelations 24

Copyright © iTEP International

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, except as may be expressly permitted by the 1976 Copyright Act or by iTEP in writing.

ABOUT THIS REPORT

Purpose
The purpose of this report is to describe the academic details and rationale for the development
of iTEP Academic and to summarize the reliability and validity of the assessment.

Acknowledgments
The data analyzed for this report were provided in raw form by iTEP International to Dr. Stephanie
Seiler, Independent Consultant. The data were not manipulated in any way, unless the data
manipulation was done according to accepted statistical procedures and/or was done according
to other rationale as described in this report. Dr. Seiler conducted all analyses and wrote the
report. Dr. Jay Verkuillen and Dr. Rob Stilson provided statistical consultation.

About the Author


Stephanie Seiler holds a PhD in Industrial and Organizational Psychology from the University
of Illinois at Urbana-Champaign. Dr. Seiler worked in the personnel selection industry for
five years as a Selection and Assessment Manager and then as the Director of Research and
Development for a leading personnel selection company. In these roles, she was responsible for
customer research, consulting, implementation of assessments, and development of new and
innovative assessments. Stephanie has published and presented in the areas of educational and
personnel assessment, innovative assessment methodologies, and research ethics. She is currently
operating as an independent assessment consultant and is a Lead Research Associate and
Product Developer with the University of Illinois at Urbana Champaign with the National Center
for Professional and Research Ethics (NCPRE).

About iTEP International


iTEP International, founded by career international educators, developed the iTEP suite of exam-
inations to provide institutions and individual test-takers with an efficient, secure, accurate, and
affordable on-demand language proficiency assessment.

There are currently seven versions of iTEP available: iTEP Academic, iTEP SLATE (Secondary
Level Assessment Test of English), iTEP Business, iTEP Au Pair, iTEP Intern, iTEP Hospitality, and
iTEP Conversation. All seven exams have the same basic structure, standardized rubric scoring,
and administration procedures. The exams assess all or some of the following five compo-
nents of English language proficiency: Grammar, Listening, Reading, Writing, and/or Speaking.

CHAPTER 1
INTRODUCTION

The International Test of English Proficiency - Academic (iTEP Academic), developed and published
by iTEP International, is a multimedia assessment that evaluates the English language proficiency
of English as a Second Language (ESL) college applicants and students.

iTEP Academic is commonly used for:


• Making admissions decisions
• Placing students within language programs
• Guiding course instruction and curriculum development
• Evaluating pre- and post-course progress
• Determining eligibility for scholarships

iTEP Academic is also used to assess the proficiency of English language teachers and to support
government initiatives and official certifications.

In order to target the level and type of English proficiency needed to be a successful college
student, the content of iTEP Academic is tailored to reflect the academic and life experiences of
individuals who attend or plan to attend college. iTEP Academic does not require any specialized
academic or cultural knowledge, so it is well-suited for testing in any academic discipline. The
assessment evaluates examinees’ ability to apply their English knowledge and skill to process,
learn from, and respond appropriately to new information that is presented in English. iTEP
Academic is delivered over the Internet at secure Certified iTEP Test Centers around the world.
Examinees can schedule a testing date within three business days of contacting the test center.

There are two versions of iTEP Academic:


• iTEP Academic-Core: assesses Grammar, Listening, and Reading and is 50 minutes in length,
with an additional 10 minutes for pre-test preparation. Results are available immediately.
• iTEP Academic-Plus: assesses Grammar, Listening, Reading, Writing, and Speaking and is
80 minutes in length, with an additional 10 minutes for pre-test preparation. Results are
available within one business day.

iTEP automatically emails the examinee’s official score report to the client. An online iTEP client
account provides a variety of tools for managing results.

Approach and Rationale for the Development of iTEP
Academic
College is a social and communicative experience. Whether the student is listening to a lecture,
writing a paper, reading exam instructions, working on a group project, or making a purchase
at the bookstore, the ability to understand and use the college’s primary language is a funda-
mental prerequisite for the student to succeed. Though success in college can also depend on
factors that have little direct link to language, such as intelligence, motivation, self-discipline,
and physical and emotional health, these will have little use for the student if he/she is unable
to process, learn from, and respond to information.

iTEP Academic was designed and developed to provide English language proficiency scores
that are valid for many types of educational decision making. The developers of iTEP Academic
recognized that in order to thoroughly evaluate English proficiency, the assessment needed to
include items that evaluated both written and spoken language, as well as the examinee’s grasp
of English grammar. In addition, iTEP developers made the distinction between receptive lan-
guage skills (i.e., listening and reading) and expressive language skills (i.e., writing and speaking).
Assessment items that measure an examinee’s ability to express ideas in English were developed
for inclusion in iTEP Academic-Plus.

When language proficiency is measured accurately, reliably, and comprehensively, educators
or administrators can use examinees’ scores on the assessment to make more rigorous, evi-
dence-based decisions. iTEP Academic was developed with these goals in mind. Furthermore,
iTEP Academic uses the best technology available and on-demand support to help ensure an
engaging, user-friendly examinee and administrator experience.

iTEP is recognized by the Academic Credentials Evaluation Institute (ACEI) as an approved,
internationally regarded English proficiency exam that meets institutional standards. ACCET,
the U.S.-based Accrediting Council for Continuing Education and Training, has determined that
iTEP satisfies the requirement of a nationally recognized English assessment exam to validate
Intensive English Program (IEP) curricula. In addition, iTEP International is committed to actively
engaging with the international education community through memberships and affiliations
with NAFSA, EnglishUSA, TESOL, ACEI, ACCET, and AISAP.

CHAPTER 2
DETAILED DESCRIPTION
OF iTEP ACADEMIC

Theoretical Model for Language Assessment


Traditionally, language researchers and educators have grouped language skills into four distinct
categories (Listening, Reading, Writing, and Speaking), and from a commonsense perspective
this categorization is no surprise, as each of these elements of communication refers to a dis-
tinct set of activities and knowledge used for distinct purposes. In addition, it is common for
a distinction to be made between language skills and language knowledge (e.g., grammar and
vocabulary) (Bachman, 1990).

On the surface, the Listening, Reading, Writing, and Speaking sections of iTEP Academic align
with the traditional categorization of language skills, and the Grammar section aligns with
the notion of language knowledge. Listening scores reflect the ability to comprehend spoken
language, Grammar scores reflect the knowledge of correct grammar, and so on. Additionally,
practical considerations clearly warrant testing across multiple competency areas. In the case
of admissions or certification, use of multiple measures helps ensure content coverage (mea-
surement breadth) across the most critical elements of language; in the case of placement or
program evaluation, multiple measures help pinpoint different areas of an examinee's strengths
and weaknesses.

The traditional categorization of language into skills and knowledge domains may seem to
suggest that each iTEP Academic section measures an isolated language capability; however,
modern theories of language emphasize the interrelatedness of language knowledge and skill
and the practical fact that any attempt to measure a single component of language will likely be
confounded by other language skills that are necessary to answer the question (e.g., an eval-
uation of reading proficiency requires knowledge of grammar, sentence structure, vocabulary,
etc.). In addition, these theories emphasize that one must consider the context in which the
communication occurs; communication in a casual setting is likely to involve a different set of
competencies—and a different judgment of effectiveness—than communication in an academic
or business setting. These modern theories suggest that in practice, language effectiveness must
be evaluated in the situational context for which the assessment is to be used (Association of
Language Testers in Europe (ALTE), 2011; Bachman, 1990). Plainly stated, a language assessment
should represent the real-world use of language. iTEP Academic aligns with best practices in
language assessment by evaluating one's ability to communicate effectively in the context of
common scenarios that are encountered in college.

Description of iTEP Academic Sections
Grammar Section
The ability to understand and use a language’s grammar rules correctly is an important compo-
nent of effective communication. Grammar does not need to be perfect in order for someone to
comprehend the meaning of a statement, yet as the number of grammatical errors increases, the
likelihood that the information will be conveyed incorrectly also increases. Still higher standards
for grammatical correctness are present within most academic settings.

The iTEP Academic: Grammar section evaluates an examinee’s understanding of and ability to
use proper English grammar. It is comprised of 25 multiple-choice questions, each of which
tests the examinee’s familiarity with a key feature of English structure (e.g., use of the correct
article, verb tense, modifier, or conjunction; identifying the correct sentence structure, pronoun,
or part of speech). The Grammar section includes a range of sentence structures from simple
to more complex, as well as both beginning and advanced vocabulary. The first 13 questions
require the examinee to select the word or phrase that correctly completes a sentence, and
the next 12 questions require the examinee to identify the word or phrase in a sentence that is
grammatically incorrect. Each of the two question types is preceded by an on-screen example.

The Grammar section takes 10 minutes to complete.

Sample Grammar Item

Listening Section
The ability to comprehend spoken information is of central importance within an academic
setting and is vital to navigating the social aspects of college life. The typical model for a college
course, particularly during the first two years of coursework, involves students attending lec-
tures. The iTEP Academic Listening section evaluates an examinee’s proficiency in understanding
spoken English information. In this section, the examinee listens to two types of spoken infor-
mation: (1) a short conversation between two speakers; and (2) a brief lecture on an academic
topic. After listening to the conversation or lecture, the examinee is presented with a question
(orally and in writing) that measures several key indicators of whether the information was
understood. These indicators include: identifying the primary subject of the conversation or
lecture (Main Idea), recalling important points (Catching Details), understanding why a particu-
lar statement was made (Determining the Purpose), inferring information based on contextual
information (Making Implications), and determining the relationship between key pieces of
information (Connecting Content).

To ensure realism in the Listening section, item writers take steps to ensure that the content
reflects a conversational tone. In addition, while the examinee listens to each audio file, a static
image of the speaker(s) is presented on-screen.

The Listening section takes 20 minutes to complete and consists of three parts:
Part 1: Four high-beginning to low-intermediate-difficulty level conversations of two to three
sentences, each followed by one multiple-choice question

Part 2: One two- to three-minute intermediate-difficulty level conversation followed by four multiple
-choice questions

Part 3: One four-minute lecture followed by six multiple-choice questions

Sample Listening Item:


Transcript of audio played to examinee [text is for demonstration in this report and is not presented to the examinee]

Male Student
“Hi Tara. Did you hear that Professor Johnson’s biology class was canceled? He moved the quiz to next week.”

Tara
“No, I didn’t. Thanks for telling me. That will give me more time to write my history report and finish my math
homework.”

Reading Section
Along with listening, the ability to comprehend written information is critical both for effective
learning in an academic setting and for navigating college life in general. Course lectures are
typically paired with required textbooks or other reading materials, and students are frequently
evaluated on their recall and understanding of both the lectures and the readings. Additionally,
the typical examination in lower-level college courses involves written materials such as multi-
ple-choice questions.

The iTEP Academic Reading section evaluates an examinee’s level of reading comprehension by
measuring several key indicators of whether a written passage was understood. These indicators
include: identifying the significant points and main focus of the written passage (Catching Details
and Main Idea, respectively), determining what a word means based on its context (Vocabulary),
and understanding why a particular statement within a larger passage was written by connecting
together relevant information (Synthesis). In addition, the Reading section evaluates the exam-
inee’s understanding of how a paragraph should be constructed in order to properly convey
information (Sequencing). Sequencing items require the examinee to read a paragraph and
determine where a new target sentence should be placed based on the surrounding content.

The Reading section takes 20 minutes to complete and consists of two parts:
Part 1: One intermediate reading level passage about 250 words in length, followed by four multiple-
choice questions

Part 2: One upper reading level paragraph about 450 words in length, followed by six multiple-
choice questions

Sample Reading Item:

Writing Section
In addition to evaluating speaking, iTEP Academic-Plus also evaluates the examinee’s English
Writing ability.

During the Writing section of the assessment, the examinee reads a question and then writes a
response. The responses are submitted for later evaluation by a trained iTEP rater.

The Writing section takes 25 minutes to complete and consists of two parts:
Part 1: The examinee is given five minutes to write a 50-75 word note, geared at the low inter-
mediate level, on a supplied topic.

Part 2: The examinee is given 20 minutes to write a 175-225 word piece expressing and sup-
porting his or her opinion on an upper level written topic.

Sample Writing Item:

Speaking Section
Both writing and speaking in a new language are often considered more advanced skills, devel-
oped after the individual has acquired a basic grasp of the language’s grammar and vocabulary
and learned to apply this knowledge to comprehend written and spoken information. The longer
version of iTEP Academic, iTEP Academic-Plus, evaluates the examinee’s English speaking ability
(along with writing ability as previously described).

During the Speaking section of the assessment, the examinee listens to and reads a prompt
(either a question or a brief lecture), and then prepares an oral response. The examinee then
records his/her response for later evaluation by a trained iTEP rater.

The Speaking section takes five minutes to complete and consists of two parts:
Part 1: The examinee hears and reads a short question geared at low-intermediate level, then
has 30 seconds to prepare a spoken response and 45 seconds to speak.

Part 2: The examinee hears a brief upper-level statement presenting two sides of an issue, then
is asked to express his or her thoughts on the topic with 45 seconds to prepare and 60 seconds
to speak.

Sample Speaking Item:

Test Administration
Delivery Method
iTEP Academic-Plus is administered via the Internet. Items are administered to examinees at
random from a larger item bank, according to programming logic and test development proce-
dures that ensure each examinee receives an overall examination of comparable content and
difficulty to other examinees.

A static paper-and-pencil version of iTEP Academic-Core is also available.

iTEP Academic must be administered at a secure location or a Certified iTEP Test Center.

The examinee inputs responses to the test in the following manner:


• During the Grammar, Listening, and Reading sections, the examinee selects from a list of
multiple choice options for each question.
• Writing samples are keyboarded directly into a text entry field.
• Speaking samples are recorded with a headset and microphone at the examinee’s computer.

Examinee Experience
Prior to the start of the test, the examinee logs in and completes a registration form. The system
guides the examinee through a series of steps to ensure technical compatibility and to prepare
him/her for the format of the assessment.

Each section/skill has a fixed time allotted to it. In the Grammar and Reading sections, examinees
can advance to the next section if there is time remaining, or they are free to use any extra time
to review and revise their answers. In the Listening section, the prompts each play only once
and once submitted, an item response cannot be reviewed or changed. In the Writing section,
there are fixed time limits for each part, but examinees may advance to the next section before
time expires. In the Speaking section, there are fixed time limits for each part and examinees
cannot advance until time expires.

The directions for each section are displayed for a set amount of time, and are also read aloud.
The amount of time instructions are displayed varies according to the amount of text to be read. If
an examinee needs more time to read a particular section’s directions, he or she can access them
by clicking the Help button, which displays a complete menu of directions for all test sections.

Following each section of the test, examinees see a transition screen indicating which section
will be completed next. This transition screen provides a 15-second break between sections,
and displays a progress bar showing completed and remaining test sections. After the last test
section is completed, examinees see a final screen telling them the test is complete and to wait
for further directions from the administrator.

Screenshots of the examinee experience, including pre-assessment modules and instructions,
are shown in Appendix A.

Scoring/Grading
iTEP Academic computes an overall proficiency level from 0 (Beginner) to 6 (Mastery), as well
as individual skill proficiency levels from 0 to 6 for each skill. Linguistic sub-skill scores are also
provided (e.g. parts of speech, synthesis, main idea) in order to give a more detailed picture
of the examinee’s level in the Grammar, Listening, and Reading sections. The Overall score
represents the combination of scores across each skill; for greater accuracy, Overall scores are
reported to one decimal point (e.g., 0.0, 0.1, 0.2, … , 5.9, 6.0).

iTEP Academic is graded as follows:


• The Grammar, Listening, and Reading sections are scored automatically by the computer.
Each response is worth one point. There is no penalty for guessing.
• The Writing and Speaking sections are evaluated by native English-speaking, ESL-trained
professionals, according to a standardized scoring rubric (see Appendix B). Raters attend
refresher training sessions throughout the year to ensure continued adherence to the rubric.
• For computing the Overall score, each test section is weighted equally (see the sketch following this list).
• The official score report presents an individual’s scoring information in both tabular and
graphical formats. The graphical format, or skill profile, is particularly useful for displaying
an examinee’s strengths and weaknesses in each of the skills evaluated.
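As an illustration of the equal-weighting rule noted above, the following minimal sketch combines five section levels into an Overall score reported to one decimal place. The function name, the use of a simple mean, and the rounding behavior are illustrative assumptions, not iTEP's published implementation.

```python
def overall_score(section_levels):
    """Combine equally weighted section levels (each 0-6) into an Overall
    score reported to one decimal place.

    Illustrative sketch only; the exact combination and rounding rules used
    operationally by iTEP are not published in this report.
    """
    if not section_levels:
        raise ValueError("at least one section level is required")
    return round(sum(section_levels) / len(section_levels), 1)

# Hypothetical examinee: Grammar, Listening, Reading, Writing, Speaking levels
print(overall_score([4, 3, 4, 5, 4]))  # -> 4.0
```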

Proficiency Levels
The seven iTEP Academic proficiency levels may be expressed briefly as follows:

Level 0: Beginning

Level 1: Elementary

Level 2: Low Intermediate

Level 3: Intermediate

Level 4: High Intermediate

Level 5: Low Advanced

Level 6: Advanced

iTEP has mapped iTEP Academic Proficiency Levels to the levels described in the Common
European Framework of Reference for Languages (CEFR; See Appendix B).

CHAPTER 3
iTEP ACADEMIC DEVELOPMENT
PROCESS, RELIABILITY, AND
VALIDITY1

Development Process
iTEP International adheres to a continuous cycle of item analysis (see Figure 1) to ensure the
content of the assessment adheres to the reliability and validity goals of the assessment. The
cycle begins with item writing, enters an expert review and content analysis stage, and then works
through a number of statistical analyses to evaluate the difficulty level and other psychometric
properties of the item. Items that do not meet quality standards during the content analysis
and/or statistical analysis phase are either removed from further consideration, or repurposed
if it is determined that minor adjustments will improve the item. Items that meet quality stan-
dards during the content analysis and statistical analysis phases are retained in the assessment.
In order to maintain a secure assessment and minimize the likelihood of an item being shared
among examinees over time, all items used in the assessment are retired after a certain length
of time. Items may also be identified as having “drifted” in difficulty over time, indicating that
the item may have been compromised; these items are retired immediately upon identification.

Figure 1. Continuous Cycle of Item Development (Write Item → Content Analysis → Statistical Analysis → Keep Item, Repurpose Item, or Retire Item)

1 All analysis and evaluation of iTEP Academic as described in Chapter 3 was conducted in accordance with the Standards
for Educational and Psychological Testing (hereafter Standards; American Educational Research Association, American
Psychological Association, & National Council on Measurement in Education, 2014), Uniform Guidelines on Employee
Selection Procedures (Equal Employment Opportunity Commission, 1978), and the Principles for the Validation and Use of
Personnel Selection Procedures (Society for Industrial and Organizational Psychology, 2003)

Reliability
The reliability of an assessment refers to the degree to which the assessment provides stable,
consistent information about an examinee. Demonstrating reliability is important because if a
test is not stable and consistent—whether across the items in the assessment, across repeated
administrations of the assessment, or based on performance scores provided by trained raters—
then the results cannot be relied upon as accurate. Moreover, the reliability of an assessment
theoretically sets a maximum limit for its validity; when an assessment is not consistent, it is
less effective as an indicator of a person’s true ability and will therefore demonstrate lower
correlations with relevant outcomes (such as grades, academic adjustment, or attrition).

Internal Consistency Reliability


Internal consistency reliability refers to the stability of the items within a particular assessment,
or in this case, within each assessment section. When it can be shown that the items are sta-
tistically related to each other, the case can be made that the assessment is consistent in its
measurement. Cronbach’s alpha (Cronbach, 1951) is a commonly-used and accepted classical
test theory (CTT) statistic that is used to estimate internal consistency reliability. The statistic
reflects the average correlation between all items within an assessment or assessment section.
Values of .70 or above have traditionally been considered desirable, with some scholars stating
that test developers should aim to develop tests with values of at least .80 or even .90 and
higher. These benchmarks are general rules and do not take into account other desirable char-
acteristics of an assessment, such as assessment brevity to minimize testing time (Gatewood
& Field, 2001), the breadth of content coverage within the assessment (to ensure that a large
domain of the characteristic being measured is represented) (Loevinger, 1954), and the validity
of the assessment (Nunnally & Bernstein, 1994). Test developers must think critically about the
interrelated factors influencing test reliability and validity and use their best judgment when
deciding what should be considered acceptable (Gatewood & Field, 2001).

Because the calculation of internal consistency reliability requires that the assessment section
contain multiple items, this class of statistics is appropriate for the Grammar, Listening, and
Reading sections of iTEP Academic; calculation of internal consistency reliability is not possible
for the Speaking and Writing sections, as trained raters provide only one summary score for
each of these sections based on the examinee’s overall Speaking or Writing performance.

Within the Grammar, Listening, and Reading sections of iTEP Academic, the set of items adminis-
tered to each examinee are selected at random from a larger item bank; therefore, the traditional
CTT calculation of Cronbach’s alpha is not possible. In order to compute an internal consistency
reliability estimate for each scale, the following procedure was used to derive an estimate that
can be interpreted in a manner similar to Cronbach’s alpha. The procedure relies on statistics
derived from item response theory (IRT), a class of statistical models that are particularly suited
to handling randomly administered items.

1 For each section, compute the IRT common discrimination parameter using the 1-Parameter
Logistic model (1PL). The common a parameter reflects the average extent to which each
item provides statistical information that distinguishes lower-performing examinees from
higher-performing examinees. The a parameter is in concept most similar to an item-total
correlation from classical test theory.
2 Use the a parameter estimate to compute an intraclass correlation coefficient (ICC). This
formula is:
ICC = a² / (a² + π² / 3)

3 The resulting value of the ICC reflects the average internal consistency reliability for any one
item in the sections, and therefore the final internal reliability estimate (α) must be “stepped
up” using the Spearman-Brown prophecy formula to reflect the reliability of the total
sections. The Spearman-Brown prophecy method is the same method that would be used
to examine the impact of shortening or lengthening a test (for example, cutting a 50-item
test in half). The Spearman-Brown prophecy formula is:

α = (K · ICC) / (1 + (K − 1) · ICC)
Where K is a scaling factor reflecting the proportional increase or decrease in the number
of test items. In the current case, K is the number of items in the sections.
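To make the two-step calculation concrete, the following minimal sketch applies the formulas above to the Grammar section values reported in Table 1. It is illustrative only and assumes the common discrimination estimate a has already been obtained from an IRT fit.

```python
import math

def internal_consistency_from_1pl(a, n_items):
    """Estimate section-level internal consistency from a common 1PL
    discrimination parameter.

    Step 1: single-item intraclass correlation, ICC = a^2 / (a^2 + pi^2/3).
    Step 2: step the ICC up to the full section length with the
            Spearman-Brown prophecy formula.
    """
    icc = a ** 2 / (a ** 2 + math.pi ** 2 / 3)
    alpha = (n_items * icc) / (1 + (n_items - 1) * icc)
    return icc, alpha

# Grammar section inputs from Table 1: a = 1.08, 25 items.
icc, alpha = internal_consistency_from_1pl(1.08, 25)
print(round(icc, 2), round(alpha, 2))  # ~0.26 and ~0.90; Table 1 reports
                                       # .25 and .89 from unrounded inputs
```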

The internal consistency reliability results, which can be interpreted as conceptually similar to
Cronbach’s alpha estimates, were computed for a sample of over 17,000 examinees who com-
pleted iTEP Academic between 2014 and 2016.2 The results are provided in Table 1. As shown,
all values exceed the .70 benchmark, and the Grammar estimate exceeds the .80 benchmark.

Table 1. Internal Consistency Reliability Estimates for Relevant iTEP Academic Sections

Section   | Number of Items | Discrimination (a) | Intraclass Correlation (ICC) | Internal Consistency Reliability (α)
Grammar   | 25              | 1.08               | .25                          | .89
Listening | 14              | 0.85               | .21                          | .78
Reading   | 10              | 0.93               | .22                          | .74

Note: The sample size for the analysis was N = 17,731. The internal consistency reliability estimates are not
Cronbach’s alpha values, but can be interpreted in a similar manner to Cronbach’s alpha.

2 All examinee data provided by iTEP was included in the analysis, with the exception of the following: (1) when a unique
identifier indicated that the data were for an examinee re-testing, only the examinee's first testing occasion was included; or
(2) if the examinee timed-out on any section without seeing one or more of the items, the examinee was removed; or (3)
examinees younger than 14 years of age were removed. Examinee non-responses to items that were seen but not answered
were scored as incorrect.

Test-Retest Reliability
Test-retest reliability refers to the stability of test scores across repeated administrations of the
test. A high level of test-retest reliability indicates that the examinee is likely to receive a similar
score every time he or she takes it—assuming the examinee’s actual skill in the domain being
measured has not changed. Test-retest reliability estimates for all iTEP Academic sections, and
the Overall score, were computed using a sample of 198 examinees who took iTEP Academic
twice in an operational environment (i.e., at a testing center for college admissions purposes).
Analyses were restricted to examinees with at least 5 days and less than 2 months between
testing occasions (average time elapsed: 24 days).
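For readers who wish to run the same kind of analysis on their own data, a minimal sketch follows. The data-frame layout, the column names, and the 61-day cutoff used to represent "less than 2 months" are assumptions made for illustration.

```python
import pandas as pd

def test_retest_r(df, score_col):
    """Pearson test-retest correlation for one score column.

    `df` is assumed (hypothetically) to hold one row per repeat examinee with
    columns `<score_col>_t1`, `<score_col>_t2`, and `days_between`. Retests
    are restricted to at least 5 days and less than roughly 2 months (61 days)
    after the first attempt, mirroring the restriction described above.
    """
    window = df[(df["days_between"] >= 5) & (df["days_between"] < 61)]
    return window[f"{score_col}_t1"].corr(window[f"{score_col}_t2"])

# Example usage with a hypothetical data frame of repeat test-takers:
# r_overall = test_retest_r(scores, "overall")
```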

The test-retest values shown in Table 2 reflect the correlation between the Time 1 and Time
2 scores for the sample. Values can range from -1.0 to 1.0, with values at or exceeding .70
typically considered desirable. As can be seen, only the Overall score exceeds this threshold.
However, it should be noted that the sample used to compute the test-retest correlations was
an operational sample, and it could reasonably be assumed that at least some of the sample
had worked diligently to improve their performance between Time 1 and Time 2 testing occa-
sions; given the number of days between test administrations for the sample (up to 2 months;
24 days on average), this seems very likely. Had the test-retest estimates been computed on a
research sample and/or if the sample size of available data allowed for the analysis of a shorter
time period between testing occasions, the correlations would likely be higher. Therefore, the
values given in Table 2 can be considered lower-bound estimates of the true test-retest reliability
of iTEP Academic.

Table 2. Test-Retest Reliability Estimates for iTEP Academic

Section Test-Retest Reliability


Grammar .63
Listening .49
Reading .48
Writing .62
Speaking .64
Overall – Plus .77
Overall – Core .71

Note: The sample size for the analysis was N = 198. The Overall – Core score was approximated by removing
the Speaking and Writing section scores from the Overall scores of examinees who completed the longer iTEP
Academic-Plus.

Rater Agreement
The iTEP Academic Speaking and Writing sections are evaluated by a trained rater and as such,
it is necessary to estimate the accuracy of these judgments—specifically, the extent to which
the scores given by a rater are interchangeable with the scores of another. Evaluations of rater
agreement, as opposed to rater reliability, are more appropriate in cases where the examinee’s
absolute score is of interest rather than the examinee’s rank order position relative to other
examinees (LeBreton & Senter, 2008).

Tables 3 and 4 summarize a raw investigation of rater agreement using a sample of Speaking
and Writing ratings from six examinees obtained from eight raters during a training exercise.
The examinees completed either iTEP Academic or iTEP SLATE.

It should be noted that the results in Tables 3-6 likely reflect a lower-bound estimate of rater
agreement, as the cases used for the training exercise were purposely selected to be more
challenging to rate than a typical case.

Table 3. Raw Rater Agreement Analysis – Speaking Section

Examinee | Average Score | R1  | R2  | R3  | R4  | R5   | R6  | R7  | R8  | Average Deviation | Max. Deviation
E1       | 1.79          | .04 | .54 | .29 | .04 | .29  | .46 | .71 | -   | .34               | .71
E2       | 4.88          | .63 | .38 | .63 | .38 | .13  | .38 | .63 | .88 | .50               | .88
E3       | 2.53          | .28 | .97 | .03 | .22 | 1.53 | .22 | .72 | .28 | .53               | 1.53
E5       | 4.03          | .22 | .72 | .53 | .53 | .22  | .03 | .53 | .47 | .41               | .72
E6       | 3.50          | .50 | .00 | 1.50| -   | -    | -   | .25 | .75 | .60               | 1.50
Average  | 3.34          | .33 | .52 | .59 | .29 | .54  | .27 | .57 | .59 | .47               | 1.07

Note: No Speaking section ratings were provided for Examinee 4 due to a technical issue with the audio recording.
The missing values occurred because the rater(s) did not provide a rating. Average Score: the examinee’s average
rating across all eight raters. Rater Deviations from Average Score: the absolute value of the difference between
each rater’s score and the Average Score for each examinee. Average Deviation: average Rater Deviation for each
examinee. Max Deviation: highest Rater Deviation value that was observed across all eight raters.

Table 4. Raw Rater Agreement Analysis – Writing Section

Examinee | Average Score | R1  | R2  | R3  | R4  | R5  | R6  | R7  | R8  | Average Deviation | Max. Deviation
E1       | 1.79          | .04 | .71 | .29 | .79 | .04 | .29 | .71 | -   | .41               | .79
E2       | 3.81          | .56 | .69 | .06 | .31 | .44 | .56 | .06 | .44 | .39               | .69
E4       | 3.43          | .32 | -   | .07 | .18 | .18 | .32 | .18 | .18 | .20               | .32
E5       | 4.41          | .16 | .66 | .59 | .09 | .09 | .09 | .16 | .09 | .24               | .66
E6       | 3.60          | .15 | .40 | .85 | -   | -   | -   | .15 | .15 | .34               | .85
Average  | 3.41          | .25 | .61 | .37 | .34 | .19 | .32 | .25 | .21 | .32               | .66

Note: No Writing section ratings were provided for Examinee 3. The missing values occurred because the rater(s)
did not provide a rating. Average Score: the examinee’s average rating across all eight raters. Rater Deviations from
Average Score: the absolute value of the difference between each rater’s score and the Average Score for each
examinee. Average Deviation: average Rater Deviation for each examinee. Max Deviation: highest Rater Deviation
value that was observed across all eight raters.

As seen in Table 3, in all but two instances the raters’ Speaking scores for each examinee devi-
ated less than 1 point from the average rating across all raters (as a reminder, section scores
can range from 0 to 6). Across all raters and examinees, the average deviation was .47 points,
and the average maximum deviation was 1.07 points. These results suggest a moderately strong
agreement across raters.

As seen in Table 4, all of the raters’ Writing scores for each examinee deviated less than 1 point
from the average rating across all raters. Across all raters and examinees, the average deviation
was .32 points, and the average maximum deviation was .66 points. These results suggest a
strong agreement across raters.
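The deviation statistics summarized in Tables 3 and 4 are simple to reproduce. The sketch below, using hypothetical ratings, computes an examinee's average score, average deviation, and maximum deviation while skipping missing ratings.

```python
def deviation_stats(ratings):
    """Return (average score, average deviation, max deviation) for one
    examinee, given each rater's score with None marking a missing rating.

    Deviations are the absolute differences between each provided rating and
    the examinee's average score across all raters who provided one.
    """
    provided = [r for r in ratings if r is not None]
    avg = sum(provided) / len(provided)
    deviations = [abs(r - avg) for r in provided]
    return avg, sum(deviations) / len(deviations), max(deviations)

# Hypothetical ratings for one examinee; rater 8 did not provide a score.
print(deviation_stats([1.75, 2.25, 1.5, 1.75, 1.5, 2.25, 2.5, None]))
```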

Using the same data that were used for Tables 3 and 4, rater agreement was also estimated
using a version of the rWG agreement statistic (James, Demaree, and Wolf, 1984). The value of
rWG can theoretically range from 0 to 1, and represents the observed variability in scores among
raters relative to the amount of variability that would be present if all raters had assigned scores
completely at random. The formula for rWG is:
rWG = 1 – (S²X / σ²E)

Where S²X is the observed variance of ratings on the variable across raters and σ²E is the variance
expected if the ratings were completely random.

The specific version of rWG chosen for the analysis uses a value for σ2E that would occur if the
raters’ completely random scores came from a triangular (approximation of normal) distribution
(see LeBreton & Senter, 2008).
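A minimal sketch of the rWG computation follows, using the error variance of 2.1 shown in Tables 5 and 6. The example ratings are hypothetical, and the use of the sample (n − 1) variance for the observed variance is an assumption.

```python
def r_wg(ratings, error_variance=2.1):
    """Within-group agreement: rWG = 1 - (observed variance / error variance).

    The default error variance of 2.1 is the triangular-distribution value
    used in Tables 5 and 6 for the 0-6 rating rubric. Observed variance is
    computed here as the sample variance of the raters' scores; whether
    sample or population variance was used operationally is an assumption.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    observed_variance = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    return 1 - observed_variance / error_variance

# Eight hypothetical ratings on the 0-6 scale with modest spread; prints ~0.93.
print(round(r_wg([4.5, 5.0, 4.5, 5.5, 4.75, 5.0, 4.5, 5.25]), 2))
```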

The closer an rWG value is to 1, the higher the agreement. There is no agreed-upon minimum
value that is considered acceptable for rWG, but as a benchmark, test developers might consider
.80 or .90 to be a minimally acceptable value for an application such as assigning ratings based
on a score rubric. To put these values in perspective, an rWG of .80 would suggest that 20% (1
– .80) of an average rater’s score across examinees was due to error, or factors other than the
examinee’s “true score” on the exercise.

The rWG agreement statistics are presented in Tables 5 and 6.

Table 5. rWG Rater Agreement Statistics – Speaking Section

Examinee Observed Variance Error Variance rWG


E1 .20 2.1 .91
E2 .34 2.1 .84
E3 .58 2.1 .72
E5 .24 2.1 .89
E6 .78 2.1 .63
Average .80

Note: No Speaking section ratings were provided for Examinee 4 due to a technical issue with the audio recording.

Table 6. rWG Rater Agreement Statistics – Writing Section

Examinee Observed Variance Error Variance rWG


E1 .30 2.1 .86
E2 .23 2.1 .89
E4 .06 2.1 .97
E5 .12 2.1 .94
E6 .24 2.1 .89
Average .91

Note: No Writing section ratings were provided for Examinee 3.

The results in Table 5 indicate moderately strong agreement amongst the raters. The minimum
rWG was observed for Examinee 6, with a value of .63. The average rWG across all examinees was
.80, indicating that 20% of the average rater’s score across examinees was due to factors other
than the examinee’s “true score” on the exercise.

The results in Table 6 indicate strong agreement amongst the raters. The minimum rWG was
observed for Examinee 1, with a value of .86. The average rWG across all examinees was .91,

indicating that only 9% of the average rater’s score across examinees was due to factors other
than the examinee’s “true score” on the exercise.

Overall, the results of the rater agreement analyses suggest that ratings provided by any one iTEP
rater are likely to be a reliable indication of an examinee’s actual proficiency on the Speaking
and Writing sections.

Validity
The iTEP Academic examination was designed and developed to provide English language profi-
ciency scores that are valid for many types of educational decision making. The Standards define
validity as “the degree to which accumulated evidence and theory support specific interpretations
of test scores entailed by proposed uses of a test” (AERA et al., 2014, p. 184). In other words, the
term validity refers to the extent to which an assessment measures what it is intended to mea-
sure. Evidence for validity can, and should, come from multiple lines of investigation that together
converge to form a conclusion regarding the relative validity of the assessment, including:

1 Expert judgments regarding the extent to which the content of the assessment reflects the
real-world knowledge, skills, characteristics, or behaviors the assessment is designed to
measure (Content Validity)
2 An examination of the degree to which the assessment (or assessment section) is correlated
with theoretically similar measures and un-correlated with theoretically unrelated measures
(Convergent and Discriminant Validity; traditionally conceived of as the main contributors
to Construct Validity3)
3 An examination of the degree to which the assessment is correlated with the real-world
outcomes it is intended to measure, for example: adjustment to college, grades, or improve-
ment in language proficiency (Criterion Validity)

Content Validity
Content Validity, or content validation, refers to the process of obtaining expert judgments on
the extent to which the content of the assessment corresponds to the real-world knowledge,
skill, or behavior the assessment is intended to measure. For example, an assessment that asks
questions about an examinee’s knowledge of cooking techniques may be judged by experts to
be content valid for measuring that aspect of cooking skill, but it would not be content valid for
measuring the examinee’s athletic ability—even if it turned out that cooking assessment scores
were correlated with athletic ability.

According to the Standards (AERA et al., 2014), evidence for assessment validity based on test
content can be both logical and empirical and can include scrutiny of both the items/prompts
themselves as well as the assessment’s delivery method(s) and scoring.
3 The modern conception of Construct Validity refers not just to Convergent and Discriminant Validity, but to the accumulation
of all forms of evidence in support of an assessment’s validity (AERA et al., 2014).

Content-related validity evidence for iTEP Academic, for the purposes of academic decision-mak-
ing, can be demonstrated via a correspondence between the assessment’s content and relevant
college educational and social experiences. To ensure correspondence, developers conducted
a comprehensive curriculum review and met with educational experts to determine common
educational goals and the knowledge and skills emphasized in curricula across the country. This
information guided all phases of the design and development of iTEP Academic.

Content Validity evidence for iTEP Academic is also demonstrated through the use of trained
item writers who are experts in the field of education and language assessment and who have
substantial experience in item-writing. The content and quality of items submitted by item-writ-
ers is continually supervised, and feedback is provided in order to ensure ongoing adherence to
the content goals of the assessment and to avoid content-irrelevant test material. Some of the
critical steps taken to achieve this objective are summarized in Appendix C.

Finally, Content Validity evidence for iTEP Academic is shown via its correspondence with the
CEFR framework. iTEP mapped iTEP Academic to the CEFR framework through a process of
expert evaluation and judgment on the content of the assessment and associated scores.

Convergent and Discriminant Validity


Convergent and Discriminant Validity evidence is demonstrated through a pattern of high cor-
relations among sections that measure concepts that are known to be closely related, and lower
correlations among sections measuring unrelated concepts (AERA et al., 2014). The intercor-
relations among iTEP Academic sections are shown in Table 7. The examinee data analyzed are
the same as described in the Reliability section.

Table 7. iTEP Academic Section Intercorrelations

Section Listening Reading Writing Speaking Overall


Grammar .59 .57 .66 .57 .83
Listening – .55 .57 .54 .80
Reading – – .56 .50 .79
Writing – – – .82 .86
Speaking – – – – .82

Note: N = 16,425 for correlations involving Speaking or Writing; N = 17,760 for all other correlations.
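The intercorrelations in Table 7 are ordinary Pearson correlations among examinee-level section scores. The brief sketch below shows how such a matrix could be produced; the data-frame layout and column names are assumptions for illustration.

```python
import pandas as pd

SECTIONS = ["grammar", "listening", "reading", "writing", "speaking", "overall"]

def section_intercorrelations(scores: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix among section scores.

    `scores` is assumed (hypothetically) to hold one row per examinee and one
    column per section. Pairs involving Speaking or Writing are based on
    fewer examinees when only iTEP Academic-Plus test-takers have those
    scores; pandas applies pairwise deletion of missing values by default.
    """
    return scores[SECTIONS].corr(method="pearson")

# Example usage:
# corr_matrix = section_intercorrelations(examinee_scores)
```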

The pattern of correlations within iTEP Academic provides preliminary evidence for the con-
vergent and discriminant validity of the assessment. Overall, the relatively strong correlations
between the majority of sections (i.e., in the .50-.60 range) indicate that each scale is likely
measuring related components of language proficiency, and the fact that the correlations do
not approach 1.0 indicates that each section likely measures a distinct element of proficiency.
Compared with the Grammar/Speaking correlation, the higher correlation between Grammar

and Writing is conceptually logical given that more weight is placed on Grammar, by design,
when iTEP raters evaluate examinees’ writing ability than when evaluating their spoken ability.
The strong correlation between Speaking and Writing is also to be expected, given that these
skills are considered more advanced demonstrations of language proficiency that require expres-
sive, as opposed to receptive, language skills.

In addition to the internal examination of convergent and discriminant validity within the iTEP
Academic scales, preliminary analyses conducted by an iTEP partner suggested a .93 correlation
between iTEP scores and TOEFL® scores. The correlation indicates that iTEP scores are closely
aligned with those of other language proficiency tests.

REFERENCES
American Educational Research Association, American Psychological Association, & National Council
on Measurement in Education (2014). Standards for educational and psychological testing.
Washington, DC: American Educational Research Association.

Association of Language Testers in Europe (2011). Manual for language test development and
examining. Cambridge: ALTE.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, and
Department of Justice. (1978). Uniform guidelines on employee selection procedures.

Gatewood, R.D. & Field, H.S. (2001). Human resource selection (5th ed.). Ohio: South-Western.

James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and
without response bias. Journal of Applied Psychology, 69, 85-98.

LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and
interrater agreement. Organizational Research Methods, 11, 815-852.

Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51, 493-504.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill, Inc.

Society for Industrial and Organizational Psychology, Inc. (2003). Principles for the validation and use
of personnel selection procedures (4th ed.). Bowling Green, OH: SIOP.

APPENDIX A: EXAMINEE PRE-ASSESSMENT
MODULES AND INSTRUCTIONS

1. Candidate’s government-issued photo ID is required and will be verified before beginning the test.
2. The iTEP Administrator will verify that all information provided on the Registration Form is identical to the
Candidate’s official ID document(s).
3. Reference materials/tools and other personal effects (e.g. dictionaries, mobile phones, audio recording devices,
pagers, notepaper, etc.) are not permitted in the room during the test.
4. Smoking, eating, or drinking is not permitted in the room during the test.
5. The iTEP Administrator reserves the right to dismiss a Candidate from the test or declare a Candidate’s test results
void if the Candidate violates any of the above conditions or fails to follow the Administrator’s instructions during
the test.
6. If for technical or any other reasons a given test is not able to be completed or results cannot be provided, iTEP
International’s and the iTEP Administrator’s liability shall be limited to providing a refund of fees received for said
test and, at the Candidate’s request, rescheduling a replacement test.

APPENDIX B: iTEP ABILITY GUIDE

iTEP 6.0-5.5 | CEFR C2 | MASTERY
• Listening: Comprehends overall meaning and virtually all details of lectures on diverse topics; understands English spoken in a variety of non-native accents.
• Reading: Comprehends virtually all aspects of a wide variety of academic material for non-specialists; reads at near-native speed; rarely requires use of a dictionary.
• Writing: Writes complex documents such as research reports using appropriate style and vocabulary; grammar and orthographic accuracy is at a near-native level; expresses complex relationships between ideas.
• Speaking: Communicates accurately and effectively on practically all academic and social topics in culturally appropriate ways; pronunciation is close to that of native speakers.

iTEP 5.4-4.5 | CEFR C1 | ADVANCED
• Listening: Identifies attitude and purpose of speakers; grasps main ideas and the majority of supporting details from academic lectures; is challenged by complex social and cultural references.
• Reading: Understands main ideas and most of the details of academic texts, journal articles, and abstracts; requires little extra reading time.
• Writing: Vocabulary is strong in specialty; satisfies demands of most general academic tasks with occasional grammar and style mistakes; exhibits fairly good organization and development.
• Speaking: Vocabulary is strong in specialty; satisfies demands of most general academic tasks with occasional grammar and style mistakes; exhibits fairly good organization and development.

iTEP 4.4-3.5 | CEFR B2 | UPPER INTERMEDIATE
• Listening: Identifies main ideas and details in conversation; occasionally needs to ask for repetition or clarification; begins to determine the attitudes of speakers; understands main ideas from academic lectures, but misses some significant details.
• Reading: Utilizes contextual and syntactic clues to interpret the meaning of complex sentences and new vocabulary; gathers most main ideas from textbooks and articles, but has an uneven grasp of details; misinterprets some abstract content and cultural references.
• Writing: Writes reasonably coherent essays on familiar topics, but with some grammatical weakness; does not have a complete grasp of stylistic features; vocabulary frequently lacks precision and sophistication.
• Speaking: Begins to express abstract concepts, especially on familiar topics; fluency is occasionally hampered by gaps in vocabulary and grammar; expresses viewpoints in fairly long stretches of discourse; sometimes is asked to repeat words or phrases.

iTEP 3.4-2.5 | CEFR B1 | INTERMEDIATE
• Listening: Grasps the general outline of topics discussed in an academic setting; unfamiliarity with complex structures and higher-level vocabulary leaves major gaps in understanding.
• Reading: Limited vocabulary impedes speed; grasps the gist of material on familiar subjects and identifies some significant details; follows step-by-step instructions in exams, labs, and assignments.
• Writing: Communicates basic ideas, but with weak organizational structure and grammatical mistakes that sometimes hinder understanding.
• Speaking: Manages day-to-day communications with peers and instructors, marked by frequent grammar and vocabulary errors; pronunciation requires significant effort from listeners; expresses him/herself with some circumlocution on topics such as family, hobbies, work, etc.

iTEP 2.4-2.0 | CEFR A2 | ELEMENTARY
• Listening: Maintains comprehension during conversations on familiar topics; relies heavily on non-verbal cues and repetition; understands very basic exchanges when spoken slowly using simple vocabulary.
• Reading: Major vocabulary gaps lead to frequent inaccurate or incomplete comprehension and a slow pace; understands simplified material; begins to determine the meaning of words by surrounding context.
• Writing: Limited vocabulary results in repetitive style and simple sentences; considerable effort is required by the reader to identify intended meaning; uses only basic vocabulary and familiar, simple grammatical structures.
• Speaking: Generates simple questions, greetings, and expressions of needs and preferences; pronunciation requires significant effort from listeners and often obscures meaning.

iTEP 1.9-0.0 | CEFR A1 | BEGINNER
• Listening: Understands simple greetings, statements, and questions when spoken with extra clarity; follows simple familiar instructions; frequently requires repetition for comprehension; understands a few isolated words or phrases spoken slowly.
• Reading: Comprehends only highly simplified phrases or sentences; recognizes familiar cohesive devices and basic pronouns; demonstrates understanding of a few simple grammatical and lexical structures; recognizes the alphabet and isolated words.
• Writing: Writes only short, simple sentences, often characterized by errors that obscure meaning; provides personal details with correct spelling and can copy familiar words and phrases; produces isolated words and phrases.
• Speaking: Capable of a short, simple presentation on a familiar topic; responds to simple statements or questions; speech is marked by non-native stress and intonation patterns; communication is understood for short utterances; pauses, false starts, and reformulation are common; communicates with single words and short phrases at “survival level”; intense listener effort is required; produces a few isolated words and phrases; pronunciation is mostly unintelligible.

APPENDIX C: SUMMARY OF STEPS TO
MINIMIZE CONTENT-IRRELEVANT TEST
MATERIAL

• Implement best practices in item writing to reduce the likelihood that “test-wise” test-takers
will be able to select the best answer through cues in the test, without needing to understand
the test item itself (for example, by selecting the lengthiest option or by eliminating options
that say the same thing in different ways)
• Avoid content that may influence test-takers’ performance on the test—items respect peo-
ple’s values, beliefs, identity, culture, and diversity.
• Topics on which a set of items may be based are submitted by item writers to iTEP; iTEP
pre-approves topics prior to item writing
• Assessment content reflects the domain and difficulty of knowledge of someone with the
educational level of a high school junior who expects to attend college. The content reflects
materials that an examinee would be expected to encounter in textbooks, journals, classroom
lectures, extra-curricular activities, and social situations involving students and professors.
Items do not reflect specialized knowledge.
• Write items at an appropriate reading level (no higher than grade 12; lower reading level for
easier items); avoid words that are used with low frequency
• Test items assess comprehension within the item, as opposed to common knowledge.
Passages establish adequate context for the topic, but then go on to introduce material that
is not generally known. Examinees should be able to gain sufficient new information from
the passage to answer the questions.
• Content does not unduly advantage examinees from particular regions of the world.

iTEP International

+1.818.887.3888

www.iTEPexam.com

© 2016 iTEP International, LLC.
