Adaptive Testing & Computer-Based Administration
Introduction
Assessment has always been a central issue in education, and each school has its
own approach to it. Increasingly, schools are turning towards adaptive tests as
part of their whole-school approach, and it is quite likely that computer-based
assessment, diagnosis, prognosis, and placement will eventually replace paper-and-pencil
testing altogether.
Adaptive tests can help schools establish a baseline measure of ability in a way
that is inclusive and accommodates the full range of student abilities.
The idea of adaptive testing dates back to 1905 and the Stanford-Binet Scales.
These were designed to diagnose cognitive development in young children.
Understanding that students who were unable to answer an easy question were
unlikely to be able to answer a difficult one, Binet tailored the tests he gave by
rank ordering the items in terms of difficulty. He used different stopping rules for
ending the test session based on the pattern of a student’s responses. With
developments in technology, adaptive testing became feasible in large-scale
assessment.
The need for more objectivity in testing gradually led pedagogues to the use of
computers as precise measurement tools. The first computer-based tests were known as
computer-aided testing (Larson and Madsen, 1985). Computers were used as
word processors equipped with a dictionary and/or a thesaurus, and students
were able to consult these reference sources on the computer during their writing tests.
Computers were also used for fast computation of grades, assisting testers in
their calculations. Computer experts then computerized paper-and-pencil tests,
turning them into electronic tests. Such tests, however, showed no real
difference from conventional tests other than being administered through a
different medium.
Since computer assisted testing did not offer a real advantage over traditional
paper and pencil tests, further research was conducted. One such project,
financed by the US Department of Education, developed the first computerized
adaptive language test (CALT), issued in 1986 at Brigham Young University.
One major difference between computer assisted testing and computer adaptive
testing is that the latter tailors the test to the student’s level.
What is a Computer Adaptive Test (CAT)?
CAT successively selects questions for the purpose of maximizing the precision of
the exam based on what is known about the examinee from previous
questions. From the examinee's perspective, the difficulty of the exam seems to
tailor itself to their level of ability. For example, if an examinee performs well on
an item of intermediate difficulty, they will then be presented with a more
difficult question. Or, if they performed poorly, they would be presented with a
simpler question. Compared to static multiple choice tests that nearly everyone
has experienced, with a fixed set of items administered to all examinees,
computer-adaptive tests require fewer test items to arrive at equally accurate
scores. (Of course, there is nothing about the CAT methodology that requires the
items to be multiple-choice; but just as most exams are multiple-choice, most
CAT exams also use this format.)
1. The pool of available items is searched for the optimal item, based on the
current estimate of the examinee's ability.
2. The chosen item is presented to the examinee, who then answers it
correctly or incorrectly.
3. The ability estimate is updated, based upon all prior answers.
4. Steps 1–3 are repeated until a termination criterion is met.
Nothing is known about the examinee prior to the administration of the first
item, so the algorithm is generally started by selecting an item of medium, or
medium-easy, difficulty as the first item.
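To make these steps concrete, here is a minimal illustrative sketch in Python. It is not drawn from any of the systems cited in this paper: it assumes a simple Rasch (one-parameter logistic) model, an item bank described only by item difficulties, and a fixed-length stopping rule, and the function and variable names are invented for the example.

```python
import math
import random

def p_correct(theta, b):
    """Rasch (1PL) probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def estimate_theta(responses):
    """Crude maximum-likelihood estimate of theta by grid search over [-4, 4]."""
    grid = [x / 20.0 for x in range(-80, 81)]
    def log_lik(theta):
        return sum(math.log(p_correct(theta, b)) if correct
                   else math.log(1.0 - p_correct(theta, b))
                   for b, correct in responses)
    return max(grid, key=log_lik)

def run_cat(item_difficulties, true_theta, n_items=15):
    """Steps 1-3 of the CAT loop, here with a simple fixed-length stopping rule."""
    theta_hat = 0.0                       # start at medium difficulty
    remaining = list(item_difficulties)
    responses = []
    for _ in range(n_items):
        # Step 1: search the pool for the most informative item at the current estimate
        b = max(remaining, key=lambda d: item_information(theta_hat, d))
        remaining.remove(b)
        # Step 2: "administer" the item (the response is simulated here)
        correct = random.random() < p_correct(true_theta, b)
        responses.append((b, correct))
        # Step 3: update the ability estimate from all prior answers
        theta_hat = estimate_theta(responses)
    return theta_hat

# Example: a 100-item bank with difficulties spread evenly from -3 to +3
bank = [-3 + 6 * i / 99 for i in range(100)]
print(run_cat(bank, true_theta=1.2))
```

In practice the termination rule would be a precision criterion rather than a fixed number of items, as discussed later in this paper.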
Advantages of CAT
The long-term benefits of computer adaptive testing over conventional paper-and-
pencil tests remain to be determined. What are the immediate advantages of
computer adaptive tests?
1. Computer adaptive tests are shorter. Since computer adaptive tests
adapt to the examinee’s level, questions that are far above or below the
examinee’s ability level are not presented. As such, a CAT can be
administered in a shorter period of time while still providing precise
information about the student’s level. Madsen (1991) reported that “over
80% of students required fewer than 50% of the reading items normally
administered on the paper-and-pencil test” (Madsen 1991, 250).
2. Computer adaptive tests create a more positive attitude toward tests.
Again, since computer adaptive testing is shorter, thanks to its ability to
focus on the examinee’s level, students feel less bored with questions that
are too easy and less frustrated with questions that are too difficult, since
such questions are not presented. Madsen (1986) indicated that among
students taking both a paper-and-pencil test and a computer adaptive test,
81% expressed a more positive attitude toward the CAT.
3. Improved test security. Since the ability level of every examinee is
different, and since every examinee is given an individualized test, no
information that would directly help other students can be passed
around.
Disadvantages (Canale 1986; Carton et al. 1991; Lange 1990; Tung 1986)
Other Issues
Pass-Fail
In many situations, the purpose of the test is to classify examinees into two or
more mutually exclusive and exhaustive categories. This includes the common
"mastery test" where the two classifications are "pass" and "fail," but also
includes situations where there are three or more classifications, such as
"Insufficient," "Basic," and "Advanced" levels of knowledge or competency. The
kind of "item-level adaptive" CAT described in this paper is most appropriate for
tests that are not "pass/fail" or for pass/fail tests where providing good feedback
is extremely important. Some modifications are necessary for a pass/fail CAT,
also known as a computerized classification test (CCT). For examinees with true
scores very close to the passing score, computerized classification tests will result
in long tests, while those with true scores far above or below the passing score
will have the shortest exams.
For example, a new termination criterion and scoring algorithm must be applied
that classifies the examinee into a category rather than providing a point
estimate of ability. There are two primary methodologies available for this. The
more prominent of the two is the sequential probability ratio test (SPRT). This
formulates the examinee classification problem as a hypothesis test that the
examinee's ability is equal to either some specified point above the cutscore or
another specified point below the cutscore. Note that this is a point hypothesis
formulation rather than the composite hypothesis formulation that would be more
conceptually appropriate. A composite hypothesis formulation would be that the
examinee's ability is in the region above the cutscore or the region below the
cutscore.
These approaches can also be extended to item selection and classification in
situations with two or more cutscores (the typical mastery test has a single cutscore).
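As a minimal sketch of how the SPRT can serve as the termination and classification rule of a pass/fail CAT with a single cutscore, the fragment below computes Wald's log likelihood ratio under the same simplifying Rasch assumption used earlier; the error rates alpha and beta and the two hypothesized ability points are hypothetical inputs that a test developer would choose.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct answer to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def sprt_decision(responses, theta_lo, theta_hi, alpha=0.05, beta=0.05):
    """Sequential probability ratio test for a pass/fail CAT (CCT).

    theta_lo and theta_hi are the specified points just below and above the
    cutscore; responses is a list of (item_difficulty, correct) pairs already
    administered. Returns "pass", "fail", or "continue".
    """
    log_ratio = 0.0
    for b, correct in responses:
        p_hi, p_lo = p_correct(theta_hi, b), p_correct(theta_lo, b)
        if correct:
            log_ratio += math.log(p_hi / p_lo)
        else:
            log_ratio += math.log((1.0 - p_hi) / (1.0 - p_lo))
    upper = math.log((1.0 - beta) / alpha)   # evidence strongly favours theta_hi
    lower = math.log(beta / (1.0 - alpha))   # evidence strongly favours theta_lo
    if log_ratio >= upper:
        return "pass"
    if log_ratio <= lower:
        return "fail"
    return "continue"   # administer another item

# Example: cutscore at theta = 0, indifference points at -0.5 and +0.5
history = [(-0.2, True), (0.3, True), (0.1, False), (0.4, True)]
print(sprt_decision(history, theta_lo=-0.5, theta_hi=0.5))  # prints "continue"
```

With only a few items answered near the cutscore, the evidence rarely crosses either threshold, which is why examinees close to the passing score tend to receive the longest tests.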
ETS researcher Martha Stocking has quipped that most adaptive tests are
actually barely adaptive tests (BATs) because, in practice, many constraints are
imposed upon item choice. For example, CAT exams must usually meet content
specifications; a verbal exam may need to be composed of equal numbers of
analogies, fill-in-the-blank, and synonym item types. CATs typically have some
form of item exposure constraints, to prevent the most informative items from
being over-exposed. Also, on some tests, an attempt is made to balance surface
characteristics of the items such as gender of the people in the items or the
ethnicities implied by their names. Thus CAT exams are frequently constrained
in which items they may choose, and for some exams the constraints may be
substantial, requiring complex search strategies (e.g., linear programming) to
find suitable items.
One simple exposure-control method is the “randomesque” technique, in which the
next item is chosen at random from among the several most informative items rather
than always taking the single most informative one. This can be used throughout the
test, or only at the beginning. Another method is the Sympson-Hetter method, in which
a random number is drawn from U(0,1) and compared to a k_i parameter determined for
each item by the test user. If the random number is greater than k_i, the next
most informative item is considered.
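A minimal sketch of the Sympson-Hetter selection step is given below. The k_i values are whatever the test developer has calibrated for each item, and the fallback behaviour when every candidate is skipped is an assumption made only for this example.

```python
import random

def select_with_sympson_hetter(candidates, exposure_k):
    """Sympson-Hetter exposure control (sketch).

    candidates: item ids ordered from most to least informative at the current
    theta estimate; exposure_k: dict mapping item id to its exposure-control
    parameter k_i in (0, 1]. An item is administered only if a U(0,1) draw does
    not exceed k_i; otherwise the next most informative item is considered.
    """
    for item in candidates:
        if random.random() <= exposure_k.get(item, 1.0):
            return item
    return candidates[-1]   # fallback: administer the last candidate if all are skipped

# Example: the most informative item "A" is heavily exposure-controlled
print(select_with_sympson_hetter(["A", "B", "C"], {"A": 0.2, "B": 0.8, "C": 1.0}))
```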
Multidimensional
COMPONENTS OF CAT
There are five (5) technical components in building a CAT (the following is
adapted from Weiss & Kingsbury, 1984). It should be noted that this list does not
include practical issues, such as item pretesting or live field release.
A pool of items must be available for the CAT to choose from. Such items can be
created in the traditional way (i.e., manually) or through Automatic Item
Generation. The pool must be calibrated with a psychometric model, which is
used as a basis for the remaining four components. Typically, item response
theory is employed as the psychometric model. One reason item response theory
is popular is because it places persons and items on the same metric (denoted by
the Greek letter theta), which is helpful for issues in item selection.
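As an illustration of such a calibrated pool, the sketch below uses the three-parameter logistic (3PL) item response model, one common choice for CAT item banks; the item parameters shown are invented for the example.

```python
import math

def irt_3pl(theta, a, b, c):
    """Three-parameter logistic IRT model: probability that a person of ability
    theta answers an item with discrimination a, difficulty b and guessing
    parameter c correctly. Persons (theta) and item difficulties (b) are
    expressed on the same metric."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A hypothetical calibrated pool: (discrimination, difficulty, guessing) per item
pool = [(1.2, -0.8, 0.20), (0.9, 0.0, 0.25), (1.5, 1.1, 0.20)]
for a, b, c in pool:
    print(f"P(correct | theta=0.5, b={b:+.1f}) = {irt_3pl(0.5, a, b, c):.2f}")
```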
Starting Point
Scoring Procedure
After an item is administered, the CAT updates its estimate of the examinee's
ability level. If the examinee answered the item correctly, the CAT will likely
estimate their ability to be somewhat higher, and vice versa. This is done by
using the item response function from item response theory to obtain a likelihood
function of the examinee's ability. Two methods for this are called maximum
likelihood estimation and Bayesian estimation.
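The sketch below illustrates the Bayesian variant, an expected a posteriori (EAP) estimate, under a Rasch model and a standard normal prior; the quadrature grid and the response pattern are invented for the example.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap_estimate(responses, prior_mean=0.0, prior_sd=1.0, n_points=81):
    """Expected a posteriori ability estimate over a quadrature grid.

    responses: list of (item_difficulty, correct) pairs already administered.
    A normal prior is combined with the likelihood of the response pattern.
    """
    grid = [prior_mean + prior_sd * (-4 + 8 * i / (n_points - 1))
            for i in range(n_points)]
    posterior = []
    for theta in grid:
        prior = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        lik = 1.0
        for b, correct in responses:
            p = p_correct(theta, b)
            lik *= p if correct else (1.0 - p)
        posterior.append(prior * lik)
    total = sum(posterior)
    return sum(t * w for t, w in zip(grid, posterior)) / total

# Two correct answers and one incorrect push the estimate above the prior mean of 0
print(eap_estimate([(-0.5, True), (0.2, True), (0.8, False)]))
```

Maximum likelihood estimation works similarly but omits the prior; it is undefined until the examinee has answered at least one item correctly and one incorrectly, which is one reason Bayesian estimates are often used early in the test.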
Termination criterion
The CAT algorithm is designed to repeatedly administer items and update the
estimate of examinee ability. This will continue until the item pool is exhausted
unless a termination criterion is incorporated into the CAT. Often, the test is
terminated when the examinee's standard error of measurement falls below a
certain user-specified value; an often-cited advantage of this rule is that
examinee scores are uniformly precise or “equiprecise.” Other termination
criteria exist for different purposes of the test, such as if the test is designed only
to determine if the examinee should "Pass" or "Fail" the test, rather than
obtaining a precise estimate of their ability.
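As a sketch of such a termination rule, the fragment below stops the test once the standard error of measurement, approximated from the Fisher information of the Rasch items administered so far, falls below a user-specified target or a maximum test length is reached; both thresholds are arbitrary example values.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standard_error(theta_hat, administered):
    """Approximate standard error of measurement at the current estimate:
    the reciprocal square root of the summed item information."""
    info = sum(p_correct(theta_hat, b) * (1.0 - p_correct(theta_hat, b))
               for b in administered)
    return 1.0 / math.sqrt(info) if info > 0 else float("inf")

def should_stop(theta_hat, administered, se_target=0.30, max_items=30):
    """Stop when the score is 'equiprecise' (SE below target) or the test is too long."""
    return (standard_error(theta_hat, administered) <= se_target
            or len(administered) >= max_items)

# After eight items the SE is still above the 0.30 target, so testing continues (False)
print(should_stop(0.4, [0.1, 0.3, 0.5, 0.2, 0.6, 0.4, 0.0, 0.7]))
```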
Whereas reliability essentially deals with measurement errors and test scores,
validity goes beyond measurement issues and deals with test interpretation and
use. Reliability is, of course, a prerequisite for validity, but not sufficient in itself
to make a test valid. In examining validity, we raise questions such as: “Does the
test measure the ability and content which are supposed to be measured?”, “Is
the test biased?” “What is the impact of a specific test design?”
The issue of validity related to computer adaptive language testing has been
addressed through its many facets:
Content validity
Concurrent validity
Predictive validity
Construct validity
Face validity
Test bias
Content Validity
Concurrent Validity
Another aspect that may jeopardize concurrent validity for CAT is the fact that
test items may actually function differently, depending on whether a test is
administered in a paper-and-pencil mode or via computer: “A potential threat to
test validity centers around the possibility that test items may actually function
differently depending on the mode of presentation” (Henning 1991, 214).
Predictive Validity
When examining predictive validity, we look at the degree to which a test can
predict students’ future performance (Bachman 1991). In Larson’s report (1989)
of a Spanish computerized adaptive placement exam, the predictive validity was
established by inquiring – once courses were underway – about the proportion of
students who had been placed appropriately:
Only three of the [179] teachers interviewed reported that their students
had been placed too high. The majority of those who indicated their
student(s) had been poorly placed said the placement should have been
one course higher, meaning that, for the most part, the errors in placement
seemed to be conservative (Larson 1989, 284).
Overall, in this study, 79.9 percent of the teachers indicated that their students
had been appropriately oriented, which represents a high predictive coefficient.
Construct Validity
When examining construct validity, we look at how well a test measures the
ability which is supposed to be measured (Bachman 1991). In their report of a
computer adaptive reading proficiency test, Kaya-Carton, et al. indicated that the
first step for establishing construct validity was to provide an operational
definition of what reading proficiency is, in order to isolate the elements to be
measured. However, as mentioned earlier in this paper, since language
proficiency is the result of more than one construct, the unidimensionality of CAT
raises the issue of compatibility between the potential of computer adaptive
testing and the assessment of language performance (Lange 1990; Madsen 1991;
Canale 1986; Tung 1986). Canale suggests that the unidimensionality of Item
Response Theory seriously limits the construct validity of CAT.
Face Validity
Face validity is the public acceptability of a test as valid, a belief which is not
grounded on its theoretical value but rather on its surface credibility (Bachman
1991). Keeping this definition in mind, CAT is actually developing strong face
validity from the very fact that computers are known to be objective and accurate
graders.
The face validity of CAT, however, has been put into question for various reasons
(Henning 1991):
Pacing differences: Most CATs so far are not uniformly paced, and
examinees take various amounts of time to answer test questions. For this
reason, the face validity of CAT becomes questionable: the lack of
homogeneous pacing is sometimes viewed as unfairness instead of
individualization.
Test length: since a test stops when an examinee’s ability level has been
accurately estimated, tests often vary in length across examinees. Again,
this may raise a sense of unfairness.
If one person’s score is based on 25 items while another person’s score on the
same test is based on 35 items encountered, even though the ability estimates
derived can be shown to have identical accuracy attached to them, there is a
potential threat to face validity. Any given examinee could object that the same
number of opportunities (items) was not afforded to every examinee (Henning
1991, 218).
In order to mitigate such objections, students, teachers, and the public in general
should be clearly informed that CAT is criterion-referenced, not norm-referenced.
The difference is crucial. Norm-referenced tests are interpreted in relation to the
results of a group which constitute the reference point as well as the norm;
criterion-referenced tests, however, are interpreted with respect to a specific level
of ability, an approach which is part of computer adaptive language testing based
on the recognition that every foreign language student is potentially different.
Test Bias
Test bias can be detected by investigating errors not directly related to the ability
being tested (Bachman 1991). In computer adaptive testing, test bias has been
reported to stem essentially from computer anxiety. CAT can be biased in favour
of examinees who are already familiar with computers. Henning (1991) points
out, however, that the extent to which computer anxiety differs from test anxiety
is not clear.
Since computer adaptive testing tailors tests to the students’ level, it certainly
allows teachers to have a better assessment of the students’ personal ability.
Computers allow individualized reports, not only in terms of scores but also in
terms of strengths and weaknesses. What did the student accomplish? What was
the student expected to accomplish? To answer these questions, CAT provides
both a summative and a formative evaluation: it not only assesses the students’
progress in terms of grades, but it also evaluates how effective both the students’
learning and the instructor’s teaching have been. If weaknesses are similar
across examinees, instructional procedures may have to be changed. If
weaknesses are specific to some students, the student’s learning approach may
need to be looked at. In either case, computers are certainly excellent tools to
locate such types of information in a short period of time.
Computer adaptive testing is still, as it were, in its infancy and will hopefully
follow the same route as computer assisted instruction (CAI): just as CAI has
moved from electronic workbooks to live-action simulations, CAT may well move
toward “computer adaptive task-based assessment.”
CONCLUSION
CATs also offer the advantage of tailoring tests to the student’s level of ability. The
item sequencing of CAT is based on a continuum of difficulty; items are
calibrated according to their difficulty index (also called the item easiness index),
and CAT tests are administered according to pre-established indices of increment
and decrement. However, CAT still faces challenges, such as the handling of
open-ended questions and concerns about construct validity.
REFERENCES