Language Test Reliability: A Test Should Contain
Reliability: the same results under the same conditions, no matter who takes the test, or where or when it is taken
Validity: the test measures what it is intended to measure
Usability or Practicality: the test is not too difficult to administer and is practical to use
Sources of Variance
Meaningful Variance
To achieve this goal, the items should be related to the purpose of the test and to the students' knowledge of the topic. Thus, meaningful variance on a test will be defined here as that variance which is directly attributable to the testing purposes.
Organizational Competence
    Grammatical Competence
        Vocabulary
        Morphology
        Syntax
        Phonology/graphemes
    Textual Competence
        Cohesion
        Rhetorical organization
Pragmatic Competence
    Illocutionary Competence
        Ideational functions
        Manipulative functions
        Heuristic functions
        Imaginative functions
    Sociolinguistic Competence
        Sensitivity to naturalness
Such effects are undesirable because they create variance in the students' scores that is unrelated to the purpose of the test. In the remainder of this chapter, I will cover ways of estimating the effects of error variance on the overall variance in a set of test scores. Knowing about the relative reliability of a test can help me decide the degree to which I should be concerned about all the potential sources of measurement error presented in Table 8.2.
Reliability of NRTs
In general, test reliability is defined as the extent to which the results can be considered
consistent or stable. For example, if language teachers administer a placement test to
their students on one occasion, they would like the scores to be very much the same if
they were to administer the same test again one week later. Since most language
teachers are responsible language professionals, they want the placement of their
students to be as accurate and consistent as possible so they can responsibly serve their
students’ language learning needs. The degree to which a test is consistent, or reliable,
can be estimated by calculating a reliability coefficient.
1. Test-Retest Reliability:
Of the three basic reliability strategies, test-retest reliability is the one most appropriate
for estimating the stability of a test over time. The first step in this strategy is to
administer whatever test is involved two times to a group of students. The testing
sessions should be far enough apart time-wise so that students are not likely to
remember the items on the test, yet close enough together so that the students have not
changed in any fundamental way. After obtaining the two sets of scores for the group of students, we can determine how similar they are by computing a statistic known as the reliability coefficient; the same logic applies whenever we compare two sets of scores for a single assessment (such as two raters' scores for the same person). This reliability estimate can then be interpreted as the percent of reliable variance on the test.
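As a minimal illustration outside the text (the scores and variable names below are invented), the test-retest estimate is simply the correlation between the two sets of scores:

import numpy as np

# Scores for the same ten students on two administrations of the same test
# (invented illustration data).
admin_1 = np.array([78, 65, 90, 55, 82, 70, 61, 88, 74, 67])
admin_2 = np.array([75, 68, 92, 58, 80, 73, 59, 85, 76, 64])

# The test-retest reliability estimate is the Pearson correlation
# between the two administrations.
r_tt = np.corrcoef(admin_1, admin_2)[0, 1]
print(f"Test-retest reliability: {r_tt:.2f}")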
2. Equivalent-Forms Reliability:
Situation: Testing of same people on different but comparable forms of the test. (Forms
A & B)
Procedure: correlate the scores from the two tests, which yields a coefficient of
equivalence.
3. Internal-Consistency (Split-Half) Reliability:
This approach is very similar to the equivalent-forms technique except that, in this case,
the equivalent forms are created from the single test being analyzed by dividing it into
two equal parts. The test is usually split on the basis of odd- and even-numbered items.
The odd-numbered and even-numbered items are scored separately as though they were
two different forms. A correlation coefficient is then calculated for the two sets of scores.
If all other things are held constant, a longer test will usually be more reliable than a
short one, and the correlation calculated between the odd-numbered and even
numbered items must therefore be adjusted to provide a coefficient that represents the
full-test reliability. This adjustment of the half-test correlation to estimate the full-test reliability is accomplished by using the Spearman-Brown prophecy formula: full-test reliability = (2 × half-test correlation) / (1 + half-test correlation). For this adjustment to make sense, all items in the test should be homogeneous and related to one another.
Split-Half Reliability
In split-half reliability, we randomly divide all the items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half; the split-half reliability estimate is simply the correlation between these two total scores.
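A rough sketch of the whole procedure, using invented 0/1 item data: score the odd- and even-numbered items as if they were separate halves, correlate the two half-test totals, and then apply the Spearman-Brown adjustment to estimate the full-test reliability.

import numpy as np

# scores[s][i] is student s's score on item i (invented 0/1 data).
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
]

# Score the odd- and even-numbered items as though they were two forms.
odd_totals = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
even_totals = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...

# Half-test correlation, then the Spearman-Brown adjustment.
r_half = np.corrcoef(odd_totals, even_totals)[0, 1]
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")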
Cronbach Alpha
In the generalized Spearman-Brown formula, k is the number of items we want to estimate the reliability for divided by the number of items we already have a reliability estimate for; in Cronbach's alpha, k is simply the number of items on the test.
Cronbach's alpha is used when the item scores are other than 0 and 1, such as Likert-scale items, so it is advisable for essay items, problem-solving tasks, and five-point scaled items. It is based on two or more parts of the test and requires only one administration of the test.
Kuder-Richardson Formulas
Kuder and Richardson assumed that all the items in a test are designed to measure a single trait. K-R21 is the most practical, most frequently used, and most convenient method of estimating reliability.
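For dichotomously scored items, the sketch below (again with invented 0/1 data) shows both formulas: K-R20 uses the individual item difficulties, while K-R21 is the short cut that needs only the number of items, the mean, and the variance of the total scores.

import numpy as np

# 0/1 item scores (rows = students, columns = items); invented illustration data.
X = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
])

k = X.shape[1]
totals = X.sum(axis=1)
s2 = totals.var(ddof=1)   # variance of the total scores
M = totals.mean()         # mean of the total scores
p = X.mean(axis=0)        # proportion answering each item correctly
q = 1 - p

# K-R20: uses the difficulty of each individual item.
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / s2)

# K-R21: a short cut that assumes all items are equally difficult.
kr21 = (k / (k - 1)) * (1 - (M * (k - M)) / (k * s2))

print(f"K-R20 = {kr20:.2f}, K-R21 = {kr21:.2f}")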
Rater reliability becomes an issue in situations where raters make judgments and give scores for the language produced by students. Raters usually are necessary when testing students' productive skills (speaking and writing), as in compositions, oral interviews, role plays, and so on. Testers most often rely on interrater and intrarater reliabilities in such situations.
Inter-rater Reliability: the consistency of the scores assigned by two or more different raters to the same performances.
Intra-rater Reliability: the consistency of the scores assigned by a single rater to the same performances on different occasions.
NB: Intra-rater reliability makes it possible to determine the degree to which the results
obtained by a measurement procedure can be replicated.
For any test, the higher the reliability estimate, the lower the error.
The standard error of measurement (SEM) is the standard deviation of the error component in the scores; it reflects the average amount by which observed scores vary around true scores for the people in the sample.
We never know the true score.
All test scores contain some error.
The SEM can be used to estimate a range within which a true score would likely fall.
The higher the reliability, the lower the standard error of measurement, and hence the greater the confidence that we can put in the accuracy of an individual's test score.
By knowing the S.E.M. and by understanding the normal curve, we may determine the likelihood of the true score being within those limits.
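A common estimate is SEM = S√(1 − reliability), where S is the standard deviation of the observed scores. The short sketch below (with an invented standard deviation, reliability, and observed score) shows how the SEM translates into a band around an individual's score:

import math

# Hypothetical test statistics (invented for illustration).
S = 12.0          # standard deviation of the observed scores
reliability = 0.85

# Standard error of measurement.
sem = S * math.sqrt(1 - reliability)

# Band within which the true score would likely fall:
# observed score plus or minus z * SEM (z = 1.96 for roughly 95% confidence,
# assuming normally distributed errors).
observed = 70
z = 1.96
low, high = observed - z * sem, observed + z * sem
print(f"SEM = {sem:.2f}; true score likely between {low:.1f} and {high:.1f}")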
FACTORS AFFECTING THE RELIABILITY OF NRTS
Consider the test-retest and equivalent-forms strategies: both rest on correlation coefficients, which shrink as the spread of scores shrinks. A quick glance back at the K-R20 and K-R21 formulas will also indicate that, as the standard deviation goes down
relative to all other factors, so do these internal-consistency estimates. In short, all the
strategies for reliability discussed in Chapter 8 are fine for NRTs because they are very
sensitive to the magnitude of the standard deviation, and a relatively high standard
deviation is one result of developing a norm-referenced test that effectively spreads
students out into a normal distribution. However, those same reliability strategies may
be quite inappropriate for CRTs because CRTs are not developed for the purpose of
producing variance in scores.
Notice in the previous paragraph that the terms agreement and dependability are used
with reference to CRTs in lieu of the term reliability. In this book, the terms agreement
and dependability are used exclusively for estimates of the consistency of CRTs, while
the term reliability is reserved for NRT consistency estimates. This distinction helps
teachers and testers keep the notions of NRT reliability separate from the ideas of CRT
agreement and dependability. The agreement coefficient provides an estimate of the
proportion of students who have been consistently classified as masters and non-
masters on two administrations of a CRT. To apply this approach, the test should be
administered twice, such that enough time has been allowed between the
administrations for students to forget the test, but not so much time that they have
learned any substantial amount.
Kappa Coefficient
The kappa coefficient (κ) was developed to adjust for this problem of a chance lower limit by estimating the proportion of consistency in classifications beyond that which would occur by chance alone. The adjustment is given in the following formula: κ = (p_o − p_chance) / (1 − p_chance), where p_o is the observed proportion of agreement (the agreement coefficient) and p_chance is the proportion of agreement expected by chance alone.
The kappa coefficient is an estimate of the classification agreement that occurred
beyond what would be expected by chance alone and can be interpreted as a percentage
of agreement by moving the decimal two places to the right. Since kappa represents the
percentage of classification agreement beyond chance, it is usually lower than the
agreement coefficient. Like the agreement coefficient, it has an upper limit of 1.00, but
unlike the agreement coefficient with its chance lower limit, the kappa coefficient has
the more familiar lower limit of .00.
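A minimal sketch of both statistics, using invented master/non-master classifications from two administrations of the same CRT (the variable names are mine): the agreement coefficient is the proportion of students classified the same way twice, and kappa adjusts that proportion for chance agreement.

# Master (1) / non-master (0) classifications for the same ten students on two
# administrations of a CRT, using the same cut-score (invented data).
admin_1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
admin_2 = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]

n = len(admin_1)

# Agreement coefficient: proportion of students classified the same way twice.
p_o = sum(a == b for a, b in zip(admin_1, admin_2)) / n

# Proportion of agreement expected by chance, from the marginal proportions
# of masters on each administration.
m1 = sum(admin_1) / n
m2 = sum(admin_2) / n
p_chance = m1 * m2 + (1 - m1) * (1 - m2)

# Kappa: classification agreement beyond chance.
kappa = (p_o - p_chance) / (1 - p_chance)
print(f"Agreement coefficient = {p_o:.2f}, kappa = {kappa:.2f}")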
Once the tester has the standardized cutpoint score and an internal-consistency
reliability estimate in hand, it is just a matter of checking the appropriate table. In either
table, you can find the value of the respective coefficient by looking in the first column
for the z value closest to the obtained value, and scanning across that row until reaching
the column headed by the reliability coefficient closest to the observed reliability value.
Where the row for the z value meets the column for the reliability coefficient, an
approximate value is given for the threshold agreement of the CRT in question.
Only the phi dependability index is presented here because it is the only squared-
error loss agreement index that can be estimated using a single test administration, and
because Brennan has provided a short-cut formula for calculating this index from raw-score test statistics; the calculation is worked through step by step in the spreadsheet example later in this section.
All the threshold loss and squared-error loss agreement coefficients described
previously have been criticized because they are dependent in one way or another on the
cut-score. Alternative approaches, called domain score estimates of dependability, have
the advantage of being independent of the cut-score. However, in principle, they apply
to domain-referenced interpretations rather than to all criterion-referenced
interpretations. Domain-referenced tests (DRTs) are defined here as a type of CRT that
is distinguished primarily by the ways in which items are sampled. For DRTs, the items
are sampled from a general, but well-defined, domain of behaviors (e.g., overall business
English ability), rather than from individual course objectives (e.g., the course objectives
of a specific intermediate level business English class), as is often the case in what might
be called objectives-referenced tests (ORTs). The results on a DRT can therefore be used
to describe a student’s status with regard to the domain in a manner similar to the way
in which ORT results are used to describe the student’s status on small subtests for each
course objective.
CONFIDENCE INTERVALS
One last statistic in this section on CRT dependability is the confidence interval (CI). The
CI functions for CRTs in a manner analogous to the standard error of measurement
(SEM) that I described in Chapter 8 for NRTs.
The Phi(lambda) Coefficient
Step-by-step, the calculations for the Phi(lambda) value (which ends up in Cell AH42) are:
1. Begin those calculations, working to the right of the first parenthesis, by dividing 1 by the isolated result of the number of items (AF36) minus one, and isolate that result in parentheses.
2. Then multiply the mean of the proportion scores (AH32) times the isolated result of 1
minus the mean of the proportion score (AH32) and isolate the result in parentheses.
3. Subtract the result of Step 2 minus the variance of the proportion scores (AH34) and
isolate that result in parentheses.
4. Then subtract the mean of the proportion scores (AH32) minus the cut-point (AH40)
and isolate the result in parentheses.
5. Multiply the result of Step 4 times itself and isolate the result in parentheses.
6. Add the result of Step 5 to the variance of the proportion scores (AH34) and isolate
the result in parentheses.
7. Divide the result of Step 3 by the result of Step 6 and isolate that result in parentheses.
8. Multiply the result of Step 1 times the result of Step 7, and isolate the result in parentheses.
9. Subtract the result of Step 8 from 1; this is the Phi(lambda) value.
10. The final result of .8247101 shown in Cell AH42 of Screen 9.3 can now be rounded to .82. Naturally, you will want to save these results, probably under a new file name, so you don't lose them if something goes wrong with your computer.
Next, I calculate the Phi(top) value (which ends up in Cell AH43). Step-by-step, the calculations for the Phi(top) value are:
1. Begin by multiplying the number of examinees, 30 in this case, times the variance of
the proportion scores (AH34), and isolate the result in parentheses.
2. Then subtract 1 from the number of examinees and isolate the result in parentheses.
3. Divide the result of Step 1 by the result of Step 2 and isolate the result in parentheses.
4. Multiply the result of Step 3 times the K-R20 (with seven places to the right of the
decimal in AH38) and hit the enter key to get the Phi(top) result in Cell AH43.
Next, I calculate the Phi(error) in its linear algebra equivalent as shown in Cell AH44 of
Screen 9.2, which shows =((AH32*(1-AH32))-AH34)/(AF36-1). Step-by-step, the
calculations for the Phi(error) value are:
1. Multiply the mean of the proportion scores (AH32) times the isolated result of 1
minus the mean of the proportion score (AH32).
2. Subtract the result of Step 1 minus the variance of the proportion scores (AH34) and
isolate that result in parentheses.
3. Then divide the result of Step 2 by the isolated result of the number of items (AF36)
minus one and hit the enter key to get the value of the Phi(error).
To calculate the phi coefficient and CI, I begin by labeling the two in Cells AG45 and
AG46, respectively. I calculate the phi coefficient in Cell AH45 by dividing the Phi(top)
(AH43) by the isolated result of the Phi(top) (AH43) plus the Phi(error) (AH44) and
hitting the enter key using the following: =AH43/(AH43+AH44). As shown in Screen
9.2, to calculate the CI in Cell AH46, I simply take the square root of the Phi(error) and
hit the enter key using the following: =SQRT(AH44). The results of all these calculations
for Phi(top), Phi(error), phi, and CI are shown in Cells AH43 to AH46 in Screen 9.3,
which are very similar to the results obtained in the formulas in the text above. The
slight differences are very minor and are due to rounding.
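Pulling the spreadsheet arithmetic together outside Excel, the sketch below computes the same quantities in Python. The proportion scores, number of items, cut-point, and K-R20 value are invented stand-ins for the cell values in Screens 9.2 and 9.3, and the variable names are mine.

import math

# Proportion scores (each student's raw score divided by the number of items);
# invented data standing in for the proportion-score column.
p_scores = [0.90, 0.85, 0.60, 0.95, 0.70, 0.80, 0.55, 0.75, 0.88, 0.65,
            0.92, 0.58, 0.78, 0.83, 0.68, 0.73, 0.97, 0.62, 0.86, 0.71]

k = 30       # number of items (assumed, standing in for AF36)
cut = 0.70   # cut-point as a proportion (assumed, standing in for AH40)
kr20 = 0.85  # K-R20 for the test (assumed, standing in for AH38)

n = len(p_scores)
Mp = sum(p_scores) / n                          # mean of the proportion scores
Sp2 = sum((p - Mp) ** 2 for p in p_scores) / n  # their variance (population formula;
                                                # the n/(n-1) factor below mirrors the steps)

# Phi(lambda): cut-score-dependent dependability (Steps 1-10 above).
phi_lambda = 1 - (1 / (k - 1)) * (Mp * (1 - Mp) - Sp2) / ((Mp - cut) ** 2 + Sp2)

# Phi: dependability via Phi(top) and Phi(error), as in the spreadsheet steps.
phi_top = (n * Sp2 / (n - 1)) * kr20
phi_error = (Mp * (1 - Mp) - Sp2) / (k - 1)
phi = phi_top / (phi_top + phi_error)

# Confidence interval for CRT proportion scores: the square root of Phi(error).
ci = math.sqrt(phi_error)

print(f"Phi(lambda) = {phi_lambda:.2f}, phi = {phi:.2f}, CI = {ci:.3f}")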
3. A test made up of items that test similar language material will tend to be more
consistent than a test assessing a wide variety of material;
4. A test with items that have relatively high difference indexes, or B-indexes, will tend
to be more consistent than a test with items that have low ones;
5. A test that is clearly related to the objectives of instruction will tend to be more
consistent than a test that is not obviously related to what the students have learned.