CLASSICAL TEST THEORY: An Introduction To Linear Modeling Approach To Test and Item Analysis
those involved in making Kuder-Richardson Formulas [6].

2.1. What is Classical Test Theory?
Classical test theory has been used for decades to determine reliability and other characteristics of measurement instruments. According to [1], classical test theory is a theory about test scores that introduces three concepts: (1) test score (often called the observed score), (2) true score, and (3) error score. Within this framework, various models have been formulated. For example, in what is often referred to as the "classical test model,"

X = T + E    (1)

This is a simple linear model that links the observable test score (X) to the sum of two unobservable variables, true score (T) and error score (E). Because the true score is not directly observable, it must be estimated from the individual's responses on a set of test items. Therefore the equation is not solvable unless some simplifying assumptions are made.

The major assumptions underlying CTT are: true scores and error scores are uncorrelated, the average error score of the examinees is zero, and error scores on parallel tests are uncorrelated. According to [7], the assumption of classical test theory is that each individual examinee has a true score (unobservable) which would be obtained if there were no errors in measurement. However, because the instruments used are imperfect, the score observed for each individual may differ from the individual's true ability. This difference between the observed score and the true score results from measurement error. Error is often assumed to be a random variable having a normal distribution. The implication of classical test theory for examinees is that tests are fallible, imprecise tools. The score an individual would be expected to obtain over repeated independent administrations of the test is called the individual's true score. This means that, even with repeated application of the same test, the true score for an individual will not change. In CTT, the observed score is always the true score influenced by some degree of error; the influence of this error on the observed score can be positive or negative.

[8] stated that, theoretically, the standard deviation of the distribution of random errors for each examinee tells about the magnitude of measurement error. Usually, it is assumed that the distribution of random errors will be the same for all test takers. The standard deviation of errors is used as the basic measure of error in classical test theory. In practice, the reliability of the test and the standard deviation of the observed scores are used to estimate the standard error of measurement. The smaller the standard error of measurement, the more certain is the accuracy with which the attribute is measured, which also tells us that the individual's observed score is close to the true score. Conversely, the larger the standard error of measurement, the less certain is the accuracy with which an attribute is measured.

The standard error of measurement is represented and calculated with the formula:

SEM = S_x \sqrt{1 - R_{xx}}    (2)

Where: SEM = standard error of measurement
S_x = standard deviation of test scores
R_xx = reliability coefficient
A small SEM indicates high reliability.

The standard errors of measurement are used to create confidence intervals around specific observed scores [8]. The upper and lower bounds of the confidence interval approximate the range within which the true score lies.

The observed score in CTT is assumed to be measured with error. However, in developing measures, the goal of CTT is to minimize this error [9]. In that case, the importance of a test's reliability, and of calculating the reliability coefficient, increases. If the reliability coefficient is known, the error variance can be estimated. The square root of the error variance is the standard error of measurement, which helps to define the confidence interval in order to have a more realistic estimation of the true score [10].

2.2. CTT Statistics and Item Analysis
Classical test analysis utilizes traditional item and sample dependent statistics. These include item difficulty and item discrimination estimates, distractor analyses, item-test intercorrelations, and a variety of related statistics. Most of the psychometric analyses have focused on examinee assessment at the test score level, rather than at the item level. Classical test analysis also typically includes a measure of the reliability of scores (e.g., Cronbach's alpha), the difficulty of the test items, and their discrimination. Item analysis is a set of statistical procedures that focus on the selection of items that maximize score reliability. The major classical analysis statistics are: 1. Difficulty (item level statistic); 2. Discrimination (item level statistic); and 3. Reliability (test level statistic).

i Item Difficulty
Item difficulty in classical theory is the first item characteristic to be determined. Item difficulty is simply the proportion of examinees taking the test,
who answered the item correctly. The larger the percentage getting an item correct, the easier the item is; that is, the higher the difficulty value, the easier the item is understood to be. To compute the item difficulty index, divide the number of examinees answering the item correctly by the total number of examinees answering the item. An item answered correctly by 75% of the examinees would have a difficulty index, or p-value, of .75, whereas an item answered correctly by 40% of the examinees would have a lower item difficulty, or p-value, of .40 [11]. The item difficulty is denoted as P and is symbolically given as:

P = \frac{R}{N}    (3)

Where: P = the difficulty of a certain item,
R = the number of examinees who get that item correct, and
N = the total number of examinees.

A general guideline for the interpretation of an item difficulty index is provided in the following table; see, for example, [12], [13], and [14], among others.

Table 1: Item difficulty indices interpretation [14]
Difficulty Index (P)    Interpretation
P ≤ 0.30                Difficult
0.31 ≤ P ≤ 0.70         Moderately difficult
P > 0.70                Easy

ii Item Discrimination
Item discrimination refers to the difference in correct responses between the low and the high scoring students. It is the ability of a test item to discriminate between higher ability and lower ability examinees [12]. For a given item, examinees are split into a group that answered the item correctly and a group that did not. The statistic focuses on determining which examinees get the item right or wrong in a test. In essence, the aim of item discrimination is to eliminate, drop, or modify items that do not function well in the tested group [15]. The discriminating power of an item can be computed using two indices: the item discrimination index, D, and the item discrimination coefficient.

a. Item Discrimination Index (D)
This method can be applied to compute a simple measure of the discriminating power of an item using the extreme groups [11]. In calculating the D index, first rank order the students by their test scores. Next, separate the top 27% of the students and the 27% at the bottom for the analysis. As stated by [13], "27% is used because it has shown that this value will maximize differences in normal distributions while providing enough cases for analysis". The discrimination index, D, is given as

D = P_u - P_l    (4)

With P_u being the proportion of correct responses for the upper group and P_l being the proportion of correct responses for the lower group. Since D is a difference of two proportions, it ranges from -1 to +1. A negative index indicates that a larger proportion of the lower group answered the item correctly, while a positive index indicates that a higher proportion of the upper group got the item correct [15].

b. Discrimination coefficients
There are two indicators of an item's discrimination effectiveness: the point-biserial correlation and the biserial correlation coefficient. The choice of correlation depends on the kind of question we want to answer. One of the major shortcomings of the discrimination index, D, is that only 54% of the examinees (27% upper + 27% lower) are used to compute the item discrimination, and the remaining 46% are ignored. By contrast, the advantage of using discrimination coefficients over the discrimination index is that every examinee taking the test is used to compute the discrimination coefficients. A point-biserial correlation coefficient (r_pbi) is defined by:

r_{pbi} = \frac{M_p - M_q}{S_t} \sqrt{pq}    (5)

Where: M_p = whole-test mean for students answering the item correctly,
M_q = whole-test mean for students answering the item incorrectly,
S_t = standard deviation for the whole test,
p = proportion of students answering correctly,
q = proportion of students answering incorrectly [13].

A point-biserial correlation coefficient (r_pbi) ranges from -1 to +1. A high point-biserial coefficient means that students with higher total scores are the students selecting the correct response, and students selecting incorrect responses to an item are associated with lower total scores. The value of r_pbi therefore indicates how well an item discriminates between high-ability and low-ability examinees. Very low or negative point-biserial coefficients help in identifying defective test items [15].

A summary of the widely used [16] criteria and guidelines for categorizing discrimination indices in item and test analysis is used in this study.
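As an illustration, the three item statistics described in this section (the difficulty index p, the discrimination index D, and the point-biserial coefficient r_pbi) can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure of any cited study: the response matrix is hypothetical, items are scored 0/1, the total score includes the item being analysed, S_t is taken as the population standard deviation, and ties at the 27% cut are broken by examinee order.

```python
import math
from statistics import mean, pstdev


def item_difficulty(item):
    """p = R / N: proportion of examinees answering the item correctly."""
    return sum(item) / len(item)


def discrimination_index(item, totals, fraction=0.27):
    """D = P_u - P_l, using the top and bottom `fraction` of examinees."""
    n_group = max(1, round(fraction * len(totals)))
    # Rank examinees by total score (ascending); ties keep input order.
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    lower, upper = order[:n_group], order[-n_group:]
    p_upper = mean(item[i] for i in upper)
    p_lower = mean(item[i] for i in lower)
    return p_upper - p_lower


def point_biserial(item, totals):
    """r_pbi = (M_p - M_q) / S_t * sqrt(p * q), using every examinee."""
    right = [t for x, t in zip(item, totals) if x == 1]
    wrong = [t for x, t in zip(item, totals) if x == 0]
    if not right or not wrong:
        return float("nan")  # undefined when everyone (or no one) is correct
    p = len(right) / len(item)
    q = 1 - p
    return (mean(right) - mean(wrong)) / pstdev(totals) * math.sqrt(p * q)


# Hypothetical data: 10 examinees (rows) by 4 dichotomous items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
totals = [sum(row) for row in responses]
item0 = [row[0] for row in responses]

p0 = item_difficulty(item0)
d0 = discrimination_index(item0, totals)
r0 = point_biserial(item0, totals)
print(f"p = {p0:.2f}, D = {d0:.2f}, r_pbi = {r0:.2f}")
```

With only ten examinees each extreme group holds three people, so D is coarse here; the point-biserial coefficient uses all ten, which is the advantage noted in the text.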
Table 2.2: Interpretation of Discrimination Indices [18]
Discrimination Index    Quality of an Item
D ≥ 0.40                Item is functioning quite satisfactorily
0.30 ≤ D ≤ 0.39         Good item; little or no revision is required
0.20 ≤ D ≤ 0.29         Item is marginal and needs revision
D ≤ 0.19                Poor item; should be eliminated or completely revised

Figure 1: Classical test Reliability

a. Reliability as Equivalence
[19] pointed out that reliability as equivalence takes two forms: parallel or alternate form and inter-rater form. Estimating reliability using parallel or alternate forms requires developing two forms of a test or instrument using the same content domain, same number of items, same test specifications, and same item format, as well as similar difficulty and discrimination indices.

Other methods are split-half and Kuder-Richardson-20 and 21 (KR-20 and KR-21). In educational research using the classical test approach, however, internal consistency estimates are the easiest to obtain; they indicate the extent to which each item correlates with the other items. This is measured on a scale of 0 to 1. The higher the coefficient, the higher the reliability. Internal consistency is arrived at by using split-half, Kuder-Richardson-20 and 21, and Cronbach's alpha [18].

R_{xx} = \frac{2 r_{hh}}{1 + r_{hh}}    (6)

Where: R_xx = reliability coefficient of the full-length test
r_hh = correlation between the two halves of the test

Procedure:
a) Divide the test into two equal halves
b) Calculate the correlation coefficient between the two halves
c) Calculate the Spearman-Brown reliability estimate
The Spearman-Brown formula gives an estimate of the maximum reliability that can be expected (upper bound estimate).
Where: k = the number of items on the test;
Σ s_i^2 = the sum of the variances of the different parts of the test (item i); and
s_x^2 = the variance of the test scores.
Cronbach's α can be shown to provide a lower bound for reliability under rather mild assumptions. Thus, the reliability of test scores in a population is at least as high as the value of Cronbach's α in that population. A value of 0.7-0.8 is acceptable for Cronbach's α; values substantially lower indicate an unreliable scale [23].

2.3. Item Selection in Classical Test Theory
In classical test theory, item analysis consists of determining sample-specific parameters and eliminating items based on statistical criteria or set standards. A poor item in the entire test is identified by an item difficulty index that is too low (p < 0.30) or too high (p > 0.70), or by a low item discrimination index, such that r_pbi ≤ 0.20 [12]. According to [1], in test development items are selected on the basis of these two characteristics: item difficulty and item discrimination. Items with the highest discrimination values are normally prioritized in item selection; however, the choice of item difficulty and discrimination is usually informed by the purpose of the test and the anticipated ability distribution of the group of people for whom the test is intended. For example, where the purpose of a test is to select a group of high-ability students for the award of a scholarship, items that are quite difficult for the entire population of test takers are generally chosen.
Example norm-referenced achievement tests are designed to differentiate between examinees with

2.5. Limitations of Classical Test Theory
Classical test methods have proven to be very useful and are still widely used among practitioners in the test construction and analysis process, but they have limitations. [1] mentions that the two classical item statistics, item difficulty and item discrimination, which form the cornerstones of many classical test and item analyses, are group dependent (they depend on the sample). Thus, the P and D or r-values depend on the sample of students from which they are obtained. In terms of discrimination indices, higher values will tend to be obtained from heterogeneous samples and lower values from homogeneous samples. Similarly, in terms of item difficulty indices, higher values will be obtained from samples of examinees of above-average ability and lower values from samples of examinees of low or below-average ability [24]. "Such sample dependency relationships reduce the overall utility of these statistics" [4].

Another weakness of classical test theory is that its applications are test dependent or "test-based". Test difficulty directly affects the resultant test scores. Higher knowledge scores are directly associated with tests composed of relatively easy items, and low knowledge scores can be attributed to a test composed of items that are more difficult. The true score model, upon which much of classical test theory is based, permits no consideration of examinee responses to any specific item. Thus, no basis exists to predict how a given examinee will perform on a particular test item [4]. This shows that the estimate of examinee ability depends on the difficulty of the test items.
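The sample dependence described in this section can be made concrete with a small Python sketch. The two response lists below are hypothetical answers to the same item from an above-average and a below-average sample, and the labels follow the interpretation bands of Table 1; the function names are introduced here for illustration only.

```python
def item_difficulty(item_scores):
    """p = R / N for one item scored 0/1."""
    return sum(item_scores) / len(item_scores)


def interpret_difficulty(p):
    """Label a difficulty index using the Table 1 bands."""
    if p <= 0.30:
        return "Difficult"
    if p <= 0.70:
        return "Moderately difficult"
    return "Easy"


# Hypothetical responses to the SAME item from two different samples.
above_average_sample = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
below_average_sample = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

p_high = item_difficulty(above_average_sample)
p_low = item_difficulty(below_average_sample)

print(p_high, interpret_difficulty(p_high))
print(p_low, interpret_difficulty(p_low))
```

The item itself has not changed between the two calls; only the sample has, yet the index and its interpretation both shift.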
designed to differentiate between examinees with [15] wrote that classical test reliability is an indicator
of the quality of a set of test scores; hence, reliability
As mentioned before, the main purpose of the psychometric process and of the use of different measurement approaches or theories is to obtain maximum information about an individual. This valuable information is accessible by different methods, provided that a valid theoretical and mathematical background is used in implementation and reliable conditions are satisfied. CTT is a scientific framework which has a pioneering role in educational measurement and the psychometric process. Essential rules of this theory are discussed and presented in this study. CTT has served the measurement community for decades; due to its weaknesses, IRT has witnessed an exponential growth in recent decades [26]. Therefore, this study presented the main principles of CTT and their effects on the educational measurement process. Besides depicting the simplicity of the CTT model from multiple points of view, various limitations of the model were highlighted. These limitations are detailed at the item, person, and ability level.

Despite the shortcomings attributed to CTT, it is recommended that the classical test theory approach to item analysis should be maintained in test development and evaluation, because of its superiority and simplicity in the investigation of reliability and in minimizing measurement errors. Secondly, achievement tests used in examining students' achievement against educational standards should be made to pass through all the processes of standardization and validation.

[3] Marcoulides, G. A. (1999). Generalizability theory: Picking up where the Rasch IRT model leaves off? In S. E. Embretson and S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know. Mahwah, NJ: Erlbaum, pp. 129-152.

[4] Schumacker, R. E. (2010). Classical Test Analysis. http://appliedmeasurementassociates.com/ama/assets/File/CLASSICAL_TEST_ANALYSIS.pdf. Retrieved on 13 August, 2014.

[5] Allen, M. J., & Yen, W. M. (1979). Introduction to Measurement Theory. Monterey, CA: Brooks/Cole Publishing Company.

[6] Traub, R. E., & Fisher, C. W. (1997). On the Equivalence of Constructed Responses and Multiple-Choice Tests. Applied Psychological Measurement, 1, 355-370.

[7] Magno, C. (2009). Demonstrating the Difference between Classical Test Theory and Item Response Theory using Derived Test Data. The International Journal of Educational and Psychological Assessment, Vol. 1, Issue 1, pp. 1-11.

[8] Kaplan, R. M. & Saccuzo, D. P. (1997). Psychological Testing: Principles, Applications and Issues. Pacific Grove: Brooks Cole Pub. Company.

[9] McBride, N. L. (2001). An Item Response Theory Analysis of the Scales From The International Personality Item Pool and the Neo Personality Inventory-Revised.