AIL Unit 3
CONTENT
1. Validity
2. Reliability
3. Fairness
4. Positive Consequences
5. Practicality and Efficiency
MODULE OUTCOMES
In this module you will be able to:
1. discuss validity and reliability;
2. compute and interpret the validity coefficient and reliability coefficient;
3. discuss fairness, positive consequences, and practicality and efficiency.
Lesson 1: VALIDITY
Lesson Outcomes:
At the end of the lesson, the learners must have:
1. discussed validity and its types; and
2. computed and interpreted the validity coefficient
ACTIVATE:
The quality of the assessment instrument and method used in education is very
important since the evaluation and judgment that the teacher gives on a student are
based on the information he obtains using these instruments. Accordingly, teachers
follow a number of procedures to ensure that the entire assessment is valid and reliable.
ACQUIRE:
b. Predictive validity.
A type of validation that refers to a measure of the extent to which a
student's current test result can be used to estimate accurately the outcome of the
student's performance at a later time. It is appropriate for tests designed to assess a
student's future status on a criterion.
3. Construct Validity.
This is used to ensure that the measure actually measures what it is intended
to measure (the construct), and not other variables. Using a panel of experts who are
familiar with the construct is one way this type of validity can be assessed. The
experts examine the items and decide what each specific item is intended to
measure. Students can be involved in this process to obtain their feedback.
1. Validity refers to the decisions we make, and not to the test itself or to the
measurement.
2. Like reliability, validity is not an all-or-nothing concept; it is never totally
absent or absolutely perfect.
3. A validity estimate, called a validity coefficient, refers to a specific type of
validity. It ranges between 0 and 1.
4. Validity can never be finally determined; it is specific to each administration
of the test.
Validity Coefficient
The validity coefficient is the computed value of Pearson Product Moment
Correlation Coefficient rxy.
Pearson Product Moment Correlation Coefficient (rxy) Formula:

rxy = [n(∑xy) − (∑x)(∑y)] / √{[n(∑x²) − (∑x)²][n(∑y²) − (∑y)²]}

where:
rxy – correlation coefficient
n – the number of "pairs" in the data
x – values of the x-variable in the sample
y – values of the y-variable in the sample
In theory, the validity coefficient, like any correlation, ranges from 0 to 1. In
practice, most validity coefficients are small, usually ranging from 0.3 to 0.5; few
exceed 0.6 to 0.7. Hence, there is a lot of room for improvement in most of our
psychological measurements.
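The raw-score Pearson r formula above can be sketched in a few lines of Python. This is an illustrative implementation, not part of the module; the paired scores below are sample data for demonstration.

```python
# A small sketch of the validity coefficient computation using the
# raw-score Pearson r formula:
# rxy = [n(Sxy) - (Sx)(Sy)] / sqrt{[n(Sx^2) - (Sx)^2][n(Sy^2) - (Sy)^2]}

def pearson_r(x, y):
    """Compute the Pearson Product Moment Correlation Coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = ((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2)) ** 0.5
    return num / den

# Illustrative data: current test scores (x) and later criterion scores (y)
x = [36, 26, 38, 15, 17, 28, 32, 35, 12, 35]
y = [38, 34, 38, 27, 25, 26, 35, 36, 19, 38]
print(round(pearson_r(x, y), 2))  # → 0.91
```

A coefficient near 0.91 would be unusually high for a validity study; as noted above, values between 0.3 and 0.5 are more typical in practice.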
Lesson 2: RELIABILITY
Lesson Outcomes:
At the end of the lesson, the learners must have:
1. discussed reliability and its types; and
2. computed and interpreted reliability coefficient of a test.
ACTIVATE:
Reliability and validity are closely related. To better understand this relationship, let's
step out of the world of testing and onto a bathroom scale.
If the scale is reliable it tells you the same weight every time you step on it
as long as your weight has not actually changed. However, if the scale is
not working properly, this number may not be your actual weight. If that is
the case, this is an example of a scale that is reliable, or consistent, but not
valid. For the scale to be valid and reliable, not only does it need to tell you
the same weight every time you step on the scale, but it also has to measure your actual weight.
Switching back to testing, the situation is essentially the same. A test
can be reliable, meaning that the test-takers will get the same score no matter
when or where they take it, within reason of course. But that doesn't mean that
it is valid or measuring what it is supposed to measure. A test can be reliable
without being valid. However, a test cannot be valid unless it is reliable.
In order for assessments to be sound, they must be free from bias
and distortion. Reliability and validity are two concepts that are important for defining
and measuring bias and distortion. Reliability refers to the extent to which assessments
are consistent. Instruments such as classroom tests and national standardized exams
should be reliable: it should not make any difference whether a student takes the
assessment in the morning or afternoon, one day or the next.
ACQUIRE:
Types of Reliability
1. Test-retest Method.
Administer the same test twice to the same group of students, with a time interval
between the two administrations. Test-retest reliability assesses how well a measure
resists extraneous factors, such as changes in the test-takers, over time. The smaller
the difference between the two sets of results, the higher the test-retest reliability.
2. Equivalent or Parallel-Forms Method.
Administer two equivalent forms of the same test to the same group of students and
correlate the two sets of scores.
• Ensure that all questions or test items are based on the same theory and
formulated to measure the same thing.
3. Inter-rater Reliability
It is a measure of reliability used to assess the degree to which different judges
or raters agree in their assessment decisions. Inter-rater reliability is useful because
human observers will not necessarily interpret answers the same way; raters may
disagree as to how well certain responses or material demonstrate knowledge of the
construct or skill being assessed.
• Inter-rater reliability might be employed when different judges are evaluating the
degree to which art portfolios meet certain standards. It is especially useful when
judgments are likely to be more subjective, as when evaluating artwork as
opposed to math problems.
• Split-half Method. Administer the test once and score two equivalent halves of
the test. To split the test into halves that are equivalent, the usual
procedure is to score the even-numbered and the odd-numbered test items
separately. This provides two scores for each student. The two sets of
scores are correlated, and the Spearman-Brown formula is applied to the
resulting correlation coefficient to obtain a measure of internal consistency. It
indicates the degree to which consistent results are obtained from the two
halves of the test. The formula is: rot = 2roe / (1 + roe). The details of this
formula will be discussed in later lessons.
• Kuder-Richardson Formula. Administer the test once, score the total
test, and apply the Kuder-Richardson formula. The Kuder-Richardson 20
(KR-20) formula is applicable only in situations where students' responses are
scored dichotomously, and is therefore most useful with traditional test
items scored as right or wrong, true or false, or yes or no. A KR-20
reliability estimate indicates the degree to which the items in the test
measure the same characteristic; it is a statistical procedure for estimating
coefficient alpha, and a correlation coefficient is given. Another formula for
testing the internal consistency of a test is KR-21, a simpler approximation
that assumes all items are of equal difficulty.
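The KR-20 computation described above can be sketched as follows. The 0/1 response matrix is hypothetical; the formula used is the standard KR-20, [k/(k − 1)] × (1 − ∑pq/s²), where p is the proportion of students answering an item correctly, q = 1 − p, and s² is the variance of total scores.

```python
# A minimal sketch of KR-20 for dichotomously scored (0/1) items.
# The response matrix below is hypothetical.

def kr20(responses):
    """responses: one list of 0/1 item scores per student."""
    n = len(responses)                 # number of students
    k = len(responses[0])              # number of items
    totals = [sum(student) for student in responses]
    mean = sum(totals) / n
    s2 = sum((t - mean) ** 2 for t in totals) / (n - 1)  # variance of totals
    # Sum of p*q over items, where p = proportion correct on the item
    pq = 0.0
    for i in range(k):
        p = sum(student[i] for student in responses) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / s2)

responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))  # → 0.91
```

A real analysis would use far more students and items; this tiny matrix only illustrates the arithmetic.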
When you devise a set of questions or ratings that will be combined into an overall
score, you have to make sure that all of the items really do reflect the same thing. If
responses to different items contradict one another, the test might be unreliable.
• Take care when devising questions or measures: those intended to reflect the
same concept should be based on the same theory and carefully formulated.
RELIABILITY COEFFICIENT
Reliability coefficient is a measure of the amount of error associated with the test
scores.
Description of Reliability Coefficient
a. The reliability coefficient ranges from 0 to 1.0.
b. An acceptable value is 0.60 or higher.
c. The higher the value of the reliability coefficient, the more reliable the overall
test scores.
d. A higher reliability indicates that the test items measure the same thing,
for example, knowledge of solving number problems in algebra.
2. Spearman-Brown Formula
rot = 2roe / (1 + roe)
where:
roe = reliability of the half test (the correlation between the odd and even halves)
rot = reliability of the whole test
3. Kuder-Richardson Formula 20 (KR-20)
KR20 = [k / (k − 1)] [1 − ∑pq / s²]
where:
k = number of items
p = proportion of the students who got the item correct (index of difficulty)
q = 1 − p, the proportion of the students who got the item wrong
s² = variance of the total scores
4. Kuder-Richardson Formula 21 (KR-21)
KR21 = [k / (k − 1)] [1 − x̄(k − x̄) / (ks²)]
where:
k = number of items
x̄ = mean of the total scores
s² = variance of the total scores
APPLY:
Let us discuss the steps in solving the reliability coefficient using the different
methods of establishing the reliability of the given tests using the different examples.
Example 1: Prof. Gwen administered a test to her 10 students in Elementary Statistics
class twice, with a one-day interval. The test given after one day is exactly the same
test given the first time. The scores below were gathered in the first test (FT) and
second test (ST). Using the test-retest method, is the test reliable? Show the
complete solution.
Student FT ST
1 36 38
2 26 34
3 38 38
4 15 27
5 17 25
6 28 26
7 32 35
8 35 36
9 12 19
10 35 38
Using the Pearson r formula, find ∑x, ∑y, ∑xy, ∑x², and ∑y².
rxy = [n(∑xy) − (∑x)(∑y)] / √{[n(∑x²) − (∑x)²][n(∑y²) − (∑y)²]}
rxy = [(10)(9 192) − (274)(316)] / √{[(10)(8 332) − (274)²][(10)(10 400) − (316)²]}
rxy = (91 920 − 86 584) / √[(83 320 − 75 076)(104 000 − 99 856)]
rxy = 5 336 / √[(8 244)(4 144)]
rxy = 5 336 / √34 163 136
rxy = 5 336 / 5 844.92
rxy = 0.91
Analysis:
The reliability coefficient using the Pearson r is 0.91, which means that the test has
a very high reliability. The scores of the 10 students who took the test twice with a
one-day interval are consistent. Hence, the test has a very high reliability.
Example 2: Prof. Glenn administered a test to his 10 students in his Mathematics class
twice, with a one-week interval. The test given after one week is a parallel form of
the test given the first time. The scores below were gathered in the first test (FT)
and second test or parallel test (PT). Using the equivalent or parallel-forms method,
is the test reliable? Show the complete solution, using the Pearson r formula.
Student FT PT
1 12 20
2 20 22
3 19 23
4 17 20
5 25 25
6 22 20
7 15 19
8 16 18
9 23 25
10 21 24
Using the Pearson r formula, find ∑x, ∑y, ∑xy, ∑x², and ∑y².
rxy = [n(∑xy) − (∑x)(∑y)] / √{[n(∑x²) − (∑x)²][n(∑y²) − (∑y)²]}
rxy = [(10)(4 174) − (190)(216)] / √{[(10)(3 754) − (190)²][(10)(4 724) − (216)²]}
rxy = (41 740 − 41 040) / √[(37 540 − 36 100)(47 240 − 46 656)]
rxy = 700 / √[(1 440)(584)]
rxy = 700 / √840 960
rxy = 700 / 917.04
rxy = 0.76
Analysis:
The reliability coefficient using the Pearson r is 0.76, which means that the test has
a high reliability. The scores of the 10 students who took the test twice with a
one-week interval are consistent. Hence, the test has a high reliability.
Example 3: Prof. Edwin Santos administered a test to his 10 students in his Chemistry
class. The test was given only once. The students' scores on the odd items (O) and
even items (E) were gathered below. Using the split-half method, is the test reliable?
Show the complete solution.
Use the formula rot = 2roe / (1 + roe) to find the reliability of the whole test. Find
∑x, ∑y, ∑xy, ∑x², and ∑y² to solve the reliability of the odd and even test items.
Odd (x) Even (y) xy x² y²
15 20 300 225 400
19 17 323 361 289
20 24 480 400 576
25 21 525 625 441
20 23 460 400 529
18 22 396 324 484
19 25 475 361 625
26 24 624 676 576
20 18 360 400 324
18 17 306 324 289
∑x = 200 ∑y = 211 ∑xy = 4 249 ∑x² = 4 096 ∑y² = 4 533
rxy = [n(∑xy) − (∑x)(∑y)] / √{[n(∑x²) − (∑x)²][n(∑y²) − (∑y)²]}
rxy = [(10)(4 249) − (200)(211)] / √{[(10)(4 096) − (200)²][(10)(4 533) − (211)²]}
rxy = (42 490 − 42 200) / √[(40 960 − 40 000)(45 330 − 44 521)]
rxy = 290 / √[(960)(809)]
rxy = 290 / √776 640
rxy = 290 / 881.27
rxy = 0.33
Find the reliability of the whole test using the Spearman-Brown formula:
rot = 2roe / (1 + roe)
rot = 2(0.33) / (1 + 0.33)
rot = 0.66 / 1.33
rot = 0.50
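The split-half computation above can be checked with a short script (not part of the module). The odd and even scores are those from Example 3's table; `pearson_r` implements the raw-score formula used throughout this lesson.

```python
# Verifying Example 3: correlate the odd and even halves, then apply
# the Spearman-Brown formula to estimate whole-test reliability.

def pearson_r(x, y):
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = ((n * sum(a * a for a in x) - sum(x) ** 2)
           * (n * sum(b * b for b in y) - sum(y) ** 2)) ** 0.5
    return num / den

odd  = [15, 19, 20, 25, 20, 18, 19, 26, 20, 18]
even = [20, 17, 24, 21, 23, 22, 25, 24, 18, 17]

r_oe = pearson_r(odd, even)        # correlation of the two halves
r_ot = 2 * r_oe / (1 + r_oe)       # Spearman-Brown whole-test estimate
print(round(r_oe, 2), round(r_ot, 2))  # → 0.33 0.5
```

Computed without intermediate rounding, rot ≈ 0.495, which agrees with the 0.50 obtained above from the rounded half-test correlation.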
Analysis:
The reliability coefficient of the whole test using the split-half method is 0.50. Since
this is below the acceptable value of 0.60, the test's reliability needs improvement.
Example 4: A 40-item test was administered to 15 students. Using the Kuder-Richardson
21 formula, is the test reliable? First, solve the mean and the variance of the
scores using the table below.
Student x x²
1 16 256
2 25 625
3 35 1 225
4 39 1 521
5 25 625
6 18 324
7 19 361
8 22 484
9 33 1 089
10 36 1 296
11 20 400
12 17 289
13 26 676
14 35 1 225
15 39 1 521
n = 15 ∑x = 405 ∑x² = 11 917
Variance Formula: s² = [n(∑x²) − (∑x)²] / [n(n − 1)]
Applying the formula, we have:
s² = [15(11 917) − (405)²] / [15(15 − 1)]
s² = (178 755 − 164 025) / [15(14)]
s² = 14 730 / 210
s² = 70.14
Mean (x̄) Formula: x̄ = ∑x / n
x̄ = 405 / 15
x̄ = 27
Solve for the reliability coefficient using the Kuder-Richardson 21 formula.
KR21 = [k / (k − 1)] [1 − x̄(k − x̄) / (ks²)]
KR21 = [40 / (40 − 1)] [1 − 27(40 − 27) / (40 × 70.14)]
KR21 = (40 / 39) [1 − 27(13) / (40 × 70.14)]
KR21 = 1.03 [1 − 351 / 2 805.60]
KR21 = 1.03 (1 − 0.13)
KR21 = 1.03 (0.87)
KR21 = 0.90
Analysis:
The reliability coefficient using the KR-21 formula is 0.90, which means that the test
has a very high reliability.
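The KR-21 result can likewise be checked with a short script (not part of the module), using the 15 total scores from the table and k = 40 items.

```python
# Verifying the KR-21 computation: mean and variance of the total
# scores, then KR21 = [k/(k-1)] * [1 - mean*(k - mean)/(k * s2)].

scores = [16, 25, 35, 39, 25, 18, 19, 22, 33, 36, 20, 17, 26, 35, 39]
k = 40                                   # number of items
n = len(scores)

mean = sum(scores) / n                   # 27.0
s2 = (n * sum(s * s for s in scores) - sum(scores) ** 2) / (n * (n - 1))
kr21 = (k / (k - 1)) * (1 - mean * (k - mean) / (k * s2))
print(round(s2, 2), round(kr21, 2))  # → 70.14 0.9
```

Without intermediate rounding the coefficient is ≈ 0.897, consistent with the 0.90 obtained by hand.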
ASSESS:
B. Using what you’ve learned about reliability, brainstorm possible answers to this
problem.
Lesson 3: FAIRNESS
ACQUIRE:
Fairness means that the test items should be free from any bias. They should not be
offensive to any examinee subgroup. A test can only be good if it is fair to all the
examinees.
An assessment procedure needs to be fair. This means many things. First,
students need to know exactly what the learning targets are and what method of
assessment will be used. If students do not know what they are supposed to be
achieving, then they could get lost in the maze of concepts being discussed in class.
Likewise, students have to be informed how their progress will be assessed in order to
allow them to strategize and optimize their performance.
Second, assessment has to be viewed as an opportunity to learn rather than an
opportunity to weed out poor and slow learners. The goal should be that of diagnosing
the learning process rather than judging the learning product.
Third, fairness also implies freedom from teacher stereotyping. Some
examples of stereotyping include: boys are better than girls in Mathematics, or girls
are better than boys in language. Such stereotyped images and thinking could lead to
unnecessary and unwanted biases in the way teachers assess their students.
No one wants to use an assessment tool with obvious stereotyping or offensive material,
of course. But it's easy to assess in ways that inadvertently favor some students over others.
Effective assessment processes yield evidence and conclusions that are meaningful,
appropriate, and fair to all relevant subgroups of students (Lane, 2012; Linn, Baker, & Dunbar,
1991). The following tips minimize the possibility of inequities.
1. Don't rush.
Assessments that are thrown together at the last minute invariably include flaws that
greatly affect the fairness, accuracy, and usefulness of the resulting evidence.
2. Have others review your assessment tools before you use them.
This helps ensure that the tools are clear, that they appear to assess what you want
them to, and that they don't favor students of a particular background.
(Excerpted and adapted from Assessing Student Learning: A Common Sense Guide, 3rd
Edition by Linda Suskie. Copyright © 2018, Wiley.)
Fairness in assessment arises from good practice in four phases of testing:
a. Writing assessments
b. Administering assessments
c. Scoring assessments
d. Interpreting assessments
Practices that lead to fairness in these areas are considered separately below.
a. Writing assessments. Base assessments on course objectives. Students expect a
test to cover what they have been learning. They also have a right to a test that neither
“tricks” them into wrong answers nor rewards them if they can get a high score through
guessing or bluffing.
Cover the full range of thinking skills and processes. If instruction has included
higher-order thinking, then an assessment based on that instruction should likewise
prompt students to use the material intellectually, not merely repeat memorized
knowledge. Further, if a teacher's tests cover only memorization, the students will
emphasize only memorization of facts in their preparation.
Cover course content proportionally to coverage in instruction. The content areas on the
test should be representative of what students have studied. The best guide to
appropriate proportions is the relative amounts of instructional time spent on those
topics.
Test what is important for students to know and be able to do rather than isolated trivia.
The best guide to the content most appropriate for the test is to cover what is important
for students to come away with from the course. When writing a test, ask yourself
whether each task is what other teachers would agree is important when teaching that
course. Better yet, ask a colleague to review your draft test.
Avoid contexts and expressions that are more familiar and/or intriguing to some
students than to others. One challenge in writing tests is to make sure none of your
students are advantaged or disadvantaged because of their different backgrounds. For
example, music, sports, or celebrity-related examples might be appealing to some
students but not others. Language or topics should not be used if they are better
known or more interesting to some students than to others. If that proves impossible, then at
least make sure the items that favor some students are balanced with others that favor
the rest.
b. Giving assessments. Make sure students have had equal opportunities to learn the
material on the assessment. Whether or not students have learned as much as they
can, at least they should have had equal chances to do so. If some students are given
extra time or materials that are withheld from others, the others likely will not feel they
have been treated fairly.
Announce assessments in plenty of time for students to prepare for them. Since
students’ learning styles differ, some will keep up to date with their studying and others
will prefer to put in extra effort when it is most needed. Surprise assessments reward
the former and punish the latter. But these styles are not part of the material to be
learned. Not only is it more fair to announce assessments in advance, it also serves as
a motivator for students to study.
Make sure students are familiar with the formats they will be using to respond. If some
students are not comfortable with the types of questions on an assessment, they will not
have an equal chance to show what they can do. If that might be the case, some
practice with the format beforehand is recommended to help them succeed.
Give plenty of time. Most tests in education do not cover content that will eventually be
used under time pressure. Thus, most assessments should reward quality instead of
speed. Only by allowing enough time so virtually all students have an opportunity to
answer every question will the effects of speed be eliminated as a barrier to
performance.
c. Scoring assessments. Make sure the rubric used to score responses awards full
credit to an answer that is responsive to the question asked as opposed to requiring
more information than requested for full credit. If the question does not prompt the
knowledgeable student to write an answer that receives full credit, then it should be
changed. It is unfair to reward some students for doing more than has been requested
in the item; not all students will understand the real (and hidden) directions since they
have not been told.
Base grades on several assessment formats. Since students differ in their preferred
assessment formats, some are advantaged by selected-response tests, others by essay
tests, others by performance assessments, and still others by papers and projects.
Base grades on several assessments over time. As with assessment formats, grades
should also depend on multiple assessments taken at different times.
Make sure that factors which could have resulted in atypical performance for a student
are taken into account, to minimize the weight given to the student's score on that
assessment. If it is known that a student has not done her or his best, then basing a
grade or other important decision on that assessment is not only unfair; it is
inaccurate.
Retrieved from:
https://wps.ablongman.com/ab_slavin_edpsych_8/38/9954/2548362.cw/content/index.html
APPLY:
Example:
Suppose a high-stakes math test that must be passed contained a lot of word
problems based on competitive sports examples that many more boys than girls were
familiar with. The girls may perform lower than the boys because they are less
familiar with the sports contexts of the word problems, not because they are less
skilled in math (Popham 128).
Suppose that in most test items in an exam, men are shown in high-paying,
respected careers, and women are portrayed in low-paying jobs. Many women would be
offended by this and may perform less well than they normally would have (Popham
128 - 129).
ASSESS:
2. If a question asks about attending an amusement park that costs a lot, have
all your students had the chance to visit one? Is it fair to ask such questions
of economically challenged students who have not had that experience? Why?
3. If a question states or insinuates within its content that girls are not good at
sports, some students will be upset while taking the test. Can you take a
test well while being upset and offended? Why?
Lesson 4: POSITIVE CONSEQUENCES, PRACTICALITY, AND
EFFICIENCY
Lesson Outcomes:
At the end of the lesson, the learners must have discussed positive consequences and
practicality and efficiency in assessment.
ACTIVATE:
ACQUIRE:
A. Positive Consequences
1. Ease of administration
The test should be easy to administer. For this purpose, the instructions should be
simple and clear, there should be a small number of subtests, and the time for
administering the test should be appropriate (not too long).
5. Cost of testing
A test should be economical in terms of preparation, administration and scoring.
Limitations
• Untrained teachers may give wrong directions to students while constructing or
administering the tests.
• If the time allowed for taking the test is reduced, the reliability of the test is
reduced.
(Retrieved from: https://acadstuff.blogspot.com/2017/06/practicalityusability-characteristic-of_73.html)
APPLY:
Give one example of a positive consequence in assessment.
Give one example where practicality and efficiency are considered in assessment.
ASSESS:
UNIT REFERENCES
Russell, M.K. & Airasian, P.W. (2011). Classroom Assessment: Concepts and
Applications, 7th ed. McGraw-Hill Education.