LESSON 6: ESTABLISHING TEST VALIDITY AND RELIABILITY
3. External environment - The external environment may include room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, all of which could affect the responses of examinees on a test.
There are different ways of determining the reliability of a test. The specific kind of reliability will depend on (1) the variable you are measuring, (2) the type of test, and (3) the number of versions of the test.
The different types of reliability are described below, together with how each is done and what statistics is used. Notice that statistical analysis is needed to determine test reliability.
1. Test-retest
How it is done: Administer a test at one time to a group of examinees, then administer it again at another time to the "same group" of examinees. For tests that measure stable characteristics, such as standardized aptitude tests, the time interval between the first and second administrations is not more than six months; the retest can be given with a minimum time interval of 30 minutes. The responses in the test should be more or less the same across the two points in time.
Statistics used: Correlate the test scores from the first and the next administration. A significant and positive correlation indicates that the test has temporal stability over time. Correlation refers to a statistical procedure in which a linear relationship is expected between two variables. You may use the Pearson Product Moment Correlation, or Pearson r, because test data are usually on an interval scale (refer to a statistics book for Pearson r).
2. Parallel Forms
How it is done: There are two versions of a test whose items exactly measure the same skill; each version is called a "form." Administer one form at one time and the other form at another time to the "same" group of participants. The responses on the two forms should be more or less the same. Parallel forms are applicable when there are two versions of the test. This is usually done when the test is repeatedly used for different groups, such as entrance examinations and licensure examinations, where different versions of the test are given to different groups of examinees.
Statistics used: Correlate the test results for the first form and the second form. A significant and positive correlation coefficient is expected; it indicates that the responses on the two forms are the same or consistent. Pearson r is usually used.
3. Split-Half
How it is done: Administer a test to a group of examinees. The items are split into halves, usually using the odd-even technique: get the sum of the points in the odd-numbered items and correlate it with the sum of the points in the even-numbered items. Each examinee will thus have two scores coming from the same test. The scores on the two sets should be close or consistent. Split-half is applicable when the test has a large number of items.
Statistics used: Correlate the two sets of scores using Pearson r. After the correlation, apply another formula called the Spearman-Brown coefficient. The correlation coefficients obtained using Pearson r and Spearman-Brown should be significant and positive, meaning that the test has internal consistency reliability.
1. Linear regression
➢ This is demonstrated when you have two measured variables, such as two sets of scores on a test taken at two different times by the same participants.
➢ When the two sets of scores are plotted in a graph (with an X- and a Y-axis), they tend to form a straight line.
➢ The straight line formed by the two sets of scores can produce a linear regression.
➢ When a straight line is formed, we can say that there is a correlation between the two sets of scores. This can be seen in the graph shown.
➢ The graph is called a scatterplot.
➢ Each point in the scatterplot is a respondent with two scores (one for each test).
2. Computation of the Pearson r correlation
➢ The index of the linear regression is called a correlation coefficient.
➢ When the points in the scatterplot tend to fall close to the straight line, the correlation is said to be strong.
➢ When the direction of the scatterplot is directly proportional, the correlation coefficient will have a positive value. If the relationship is inverse, the correlation coefficient will have a negative value.
➢ The statistical analysis used to determine the correlation coefficient is called the Pearson r.
Suppose that a teacher gave a 20-item spelling test of two-syllable words on Monday and again on Tuesday. The teacher wanted to determine the reliability of the two sets of scores by computing the Pearson r.
Formula:
r = (NΣXY − ΣXΣY) / √[(NΣX² − (ΣX)²)(NΣY² − (ΣY)²)]
Applying the formula to the two sets of spelling scores gives r = 0.80.
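The raw-score Pearson r formula can be sketched directly in Python. The Monday and Tuesday scores below are hypothetical, not the data behind r = 0.80:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r = (N*ΣXY - ΣX*ΣY) / sqrt((N*ΣX² - (ΣX)²)(N*ΣY² - (ΣY)²))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Hypothetical 20-item spelling scores for the same five students
monday = [18, 15, 12, 20, 10]
tuesday = [17, 14, 13, 19, 11]
print(round(pearson_r(monday, tuesday), 2))  # 0.99
```

Each student contributes one X score (Monday) and one Y score (Tuesday), exactly as one point in the scatterplot described earlier.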
The value of a correlation coefficient does not exceed 1.00 or −1.00. A value of 1.00 or −1.00 indicates a perfect correlation. In tests of reliability, though, we aim for a high positive correlation, which means that there is consistency in the way the students answered the tests taken.
➢ When the value of the correlation coefficient is positive, it means that the higher the scores
in X, the higher the scores in Y. This is called a positive correlation.
➢ When the value of the correlation coefficient is negative, it means that the higher the scores
in X, the lower the scores in Y, and vice versa. This is called a negative correlation.
➢ When the same test is administered to the same group of participants, usually a positive
correlation indicates reliability or consistency of the scores.
Suppose that five students answered a five-item checklist about their hygiene on a scale of 1 to 5, with the following corresponding scores:
5 - always, 4 - often, 3 - sometimes, 2 - rarely, 1 - never
The teacher wanted to determine whether the items have internal consistency. The internal consistency of the responses in the checklist is 0.10, indicating low internal consistency.
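The text does not name the statistic, but a common index of internal consistency for such checklists is Cronbach's alpha. A minimal sketch with hypothetical ratings follows (the book's own data yield 0.10; the made-up ratings below do not):

```python
from statistics import pvariance  # population variance

def cronbach_alpha(item_columns):
    """Cronbach's alpha: item_columns[i] holds every student's score on item i."""
    k = len(item_columns)
    item_var_sum = sum(pvariance(col) for col in item_columns)
    totals = [sum(scores) for scores in zip(*item_columns)]  # per-student totals
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

# Hypothetical 1-5 checklist ratings: 5 items, each rated by the same 5 students
items = [
    [5, 4, 3, 2, 1],
    [4, 5, 3, 2, 2],
    [5, 4, 4, 3, 1],
    [4, 4, 3, 2, 2],
    [5, 5, 4, 2, 1],
]
print(round(cronbach_alpha(items), 2))  # 0.97
```

Alpha rises toward 1.00 when the items vary together across students, i.e., when the responses are internally consistent.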
The consistency of ratings can also be obtained using a coefficient of concordance. The Kendall's w
coefficient of concordance is used to test the agreement among raters.
Below is a performance task demonstrated by five students and rated by three raters. The rubric used a scale of 1 to 4, where 4 is the highest and 1 is the lowest.

Demonstration   Rater 1   Rater 2   Rater 3   Sum of Ratings   D      D²
A               4         4         3         11               2.6    6.76
B               3         2         3         8                -0.4   0.16
C               4         3         4         11               2.6    6.76
D               3         3         2         8                -0.4   0.16
E               1         1         2         4                -4.4   19.36
For each demonstration, the ratings given by the three raters are first summed (Sum of Ratings). The mean of the sums of ratings is obtained (mean = 8.4). The mean is subtracted from each sum of ratings (D), each difference is squared (D²), and the squared differences are summed (ΣD² = 33.2). These values are substituted in the Kendall's w formula, w = 12ΣD² / [m²n(n² − 1)], where m is the number of raters and n is the number of demonstrations. This gives w = 12(33.2) / [3²(5)(5² − 1)] ≈ 0.37. There is only moderate concordance among the three raters on the five demonstrations because the value is far from 1.00.
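The computation just described can be sketched in Python, using the ratings from the table above:

```python
def kendalls_w(ratings):
    """Kendall's w via the mean-deviation procedure described above.

    ratings: one row of rater scores per demonstration.
    """
    m = len(ratings[0])                          # number of raters
    n = len(ratings)                             # number of demonstrations
    sums = [sum(row) for row in ratings]         # sum of ratings per demonstration
    mean = sum(sums) / n
    sum_d2 = sum((s - mean) ** 2 for s in sums)  # ΣD²
    return 12 * sum_d2 / (m ** 2 * n * (n ** 2 - 1))

# Five demonstrations rated by three raters (values from the table above)
ratings = [
    [4, 4, 3],
    [3, 2, 3],
    [4, 3, 4],
    [3, 3, 2],
    [1, 1, 2],
]
print(round(kendalls_w(ratings), 2))  # 0.37
```

Note that Kendall's w is often defined on rank sums; this sketch follows the lesson's computation on the raw rating sums.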
Face Validity
What it means: The test is presented well, free of errors, and administered well.
How it is established: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
What it means: A measure should predict a future criterion. An example is an entrance exam predicting the grades of the students after the first semester.
How it is established: A correlation coefficient is obtained, where the X-variable is used as the predictor and the Y-variable as the criterion.

Construct Validity
What it means: The components or factors of the test should contain items that are strongly correlated.
How it is established: The Pearson r can be used to correlate the items for each factor. However, there is a technique called factor analysis to determine which items are highly correlated to form a factor.

Convergent Validity
What it means: The components or factors of a test are hypothesized to have a positive correlation.
How it is established: Correlation is done between the factors of the test.

Divergent Validity
What it means: The components or factors of a test are hypothesized to have a negative correlation. An example is to correlate the scores on a test of intrinsic and extrinsic motivation.
How it is established: Correlation is done between the factors of the test.
STEPS
1. Get the total score of each student and arrange the scores from highest to lowest.
2. Get the top 27% (upper group) and bottom 27% (lower group) of the examinees.
3. Count the proportion of examinees in the upper group (pH) and in the lower group (pL) who got each item correct.
4. Compute the Difficulty Index of each item. (It is a measure of the proportion of examinees who answered the item correctly.)
5. Compute the Discrimination Index. (This is a measure of how well an item is able to distinguish between examinees who are knowledgeable and those who are not, or between masters and non-masters.)
FORMULA:
➢ The difficulty index is obtained using the formula:
Item difficulty = (pH + pL) / 2
➢ The discrimination index is obtained using the formula:
Item discrimination = pH − pL
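Both indices can be sketched in Python; the proportions below describe a hypothetical item answered correctly by 8 of 10 upper-group and 3 of 10 lower-group examinees:

```python
def item_difficulty(p_upper, p_lower):
    """Average proportion answering the item correctly in the two groups."""
    return (p_upper + p_lower) / 2

def item_discrimination(p_upper, p_lower):
    """How well the item separates upper-group from lower-group examinees."""
    return p_upper - p_lower

# Hypothetical item: 8 of 10 upper-group, 3 of 10 lower-group got it correct
pH, pL = 8 / 10, 3 / 10
print(round(item_difficulty(pH, pL), 2))      # 0.55
print(round(item_discrimination(pH, pL), 2))  # 0.5
```

A positive discrimination index means more upper-group than lower-group examinees answered the item correctly, which is what a well-functioning item should show.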
INTERPRETATION
The computed difficulty and discrimination indices are compared against an interpretation table that classifies each item, based on its values, from poor to very good.
References:
1. Ubiña-Balagtas, M., David, A. J., Golla, E., Magno, C., & Valladolid, V. (2020). Establishing test validity and reliability. In Assessment in Learning 1 (pp. 96-118). Rex Book Store, Inc.