
LESSON 6: ESTABLISHING TEST VALIDITY AND RELIABILITY

Desired Significant Learning Outcomes


In this lesson, you are expected to:
➢ Use procedures and statistical analyses to establish test validity and reliability;
➢ Decide whether a test is valid or reliable; and
➢ Decide which test items are easy and which are difficult.

What is test reliability?


Reliability is the consistency of responses to a measure under three conditions:
(1) when the same person is retested on the same test;
(2) when the person is retested on an equivalent measure; and
(3) when responses are similar across items that measure the same characteristic.
In the first condition, consistent responses are expected when the test is given to the same participants at two points in time. In the second condition, reliability is attained if the responses to a test are consistent with the responses to its equivalent form, or to another test that measures the same characteristic, when administered at a different time. In the third condition, there is reliability when a person responds consistently across items that measure the same characteristic.
There are different factors that affect the reliability of a measure. The reliability of a measure can be high or low, depending on the following factors:
1. The number of items in a test - The more items a test has, the higher the likelihood of reliability, because a larger pool of items raises the probability of obtaining consistent scores.

2. Individual differences of participants - Every participant possesses characteristics that affect test performance, such as fatigue, concentration, innate ability, perseverance, and motivation. These individual factors change over time and affect the consistency of answers in a test.

3. External environment - The external environment includes room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, all of which can cause changes in examinees' responses to a test.

What are the different ways to establish test reliability?

There are different ways of determining the reliability of a test. The specific kind of reliability depends on (1) the variable being measured, (2) the type of test, and (3) the number of versions of the test.
The different types of reliability are described below, together with how each is done and the statistical analysis needed to determine it.
1. Test-retest

How is this reliability done? Administer a test at one time to a group of examinees, then administer it again at another time to the same group. For tests that measure stable characteristics, such as standardized aptitude tests, the time interval between the two administrations should be no more than six months; the retest can be given with a minimum interval of 30 minutes. The responses should be more or less the same across the two points in time. Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures (e.g., a typing test or tasks in physical education).

What statistics is used? Correlate the test scores from the first and the second administration. A significant and positive correlation indicates that the test has temporal stability over time. Correlation is a statistical procedure in which a linear relationship is expected between two variables. You may use the Pearson Product Moment Correlation, or Pearson r, because test data are usually on an interval scale (refer to a statistics book for Pearson r).

2. Parallel Forms

How is this reliability done? There are two versions of a test, each called a "form," whose items measure exactly the same skill. Administer one form at one time and the other form at another time to the same group of participants. The responses on the two forms should be more or less the same. Parallel forms are applicable when there are two versions of a test. This is usually done when a test is used repeatedly for different groups, such as entrance and licensure examinations, where different versions are given to different groups of examinees.

What statistics is used? Correlate the results of the first form with those of the second form. A significant and positive correlation coefficient is expected; it indicates that the responses on the two forms are consistent. Pearson r is usually used for this analysis.

3. Split-Half

How is this reliability done? Administer a test to a group of examinees, then split the items into halves, usually using the odd-even technique: get the sum of points on the odd-numbered items and the sum of points on the even-numbered items. Each examinee thus has two scores coming from the same test, and the two sets of scores should be close or consistent. Split-half is applicable when the test has a large number of items.

What statistics is used? Correlate the two sets of scores using Pearson r, then apply the Spearman-Brown formula to estimate the reliability of the whole test. The resulting coefficient should be significant and positive to indicate that the test has internal consistency reliability.

4. Test of Internal Consistency Using Kuder-Richardson and Cronbach's Alpha

How is this reliability done? This procedure determines whether the scores for each item are consistently answered by the examinees. After administering the test to a group of examinees, determine and record the score for each item. The idea is to see whether the responses per item are consistent with one another. This technique works well when the assessment tool has a large number of items. It is also applicable to scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").

What statistics is used? A statistical analysis called Cronbach's alpha or the Kuder-Richardson formula is used to determine the internal consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have internal consistency.

5. Inter-rater Reliability

How is this reliability done? This procedure determines the consistency of multiple raters when rating scales and rubrics are used to judge performance. Reliability here refers to similar or consistent ratings provided by more than one rater or judge using the same assessment tool. Inter-rater reliability is applicable when the assessment requires multiple raters.

What statistics is used? Kendall's coefficient of concordance (Kendall's w) is used to determine whether the ratings provided by multiple raters agree with one another. A significant Kendall's w value indicates that the raters concur with one another in their ratings.
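To make the split-half method concrete, here is a minimal Python sketch of the odd-even split followed by the Spearman-Brown step-up formula, SB = 2r / (1 + r). The item scores and function names are hypothetical illustrations, not data or code from the book.

```python
# Split-half reliability with the Spearman-Brown correction.
# Hypothetical data: each row is one examinee's per-item scores (1 = correct).

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = ((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2)) ** 0.5
    return num / den

def split_half_reliability(item_scores):
    """Odd-even split: correlate odd-item totals with even-item totals,
    then step the half-test correlation up with Spearman-Brown."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return (2 * r_half) / (1 + r_half)              # Spearman-Brown step-up

scores = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(split_half_reliability(scores), 2))
```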

What statistical analyses are used to determine test reliability?

1. Linear regression
➢ Linear regression is demonstrated when two variables are measured, such as two sets of scores on a test taken at two different times by the same participants.
➢ When the two sets of scores are plotted on a graph (with an X- and a Y-axis), they tend to form a straight line.
➢ The straight line formed by the two sets of scores is the linear regression line.
➢ When a straight line is formed, we can say that there is a correlation between the two sets of scores.
➢ The graph is called a scatterplot.
➢ Each point in the scatterplot is a respondent with two scores (one for each test).
2. Computation of the Pearson r correlation
➢ The index of the linear relationship is called the correlation coefficient.
➢ When the points in the scatterplot fall close to the line, the correlation is said to be strong.
➢ When the trend of the scatterplot is directly proportional, the correlation coefficient has a positive value. When the trend is inverse, the correlation coefficient has a negative value.
➢ The statistical analysis used to determine the correlation coefficient is called the Pearson r.
Suppose that a teacher gave a 20-item spelling test of two-syllable words on Monday and again on Tuesday. The teacher wanted to determine the reliability of the two sets of scores by computing the Pearson r.
Formula:

r = [NΣXY - (ΣX)(ΣY)] / √{[NΣX² - (ΣX)²][NΣY² - (ΣY)²]}

Substitute the values in the formula:

r = 0.80
The value of a correlation coefficient does not exceed 1.00 or fall below -1.00. A value of 1.00 or -1.00 indicates a perfect correlation. In tests of reliability, though, we aim for a high positive correlation, which means that there is consistency in the way the students answered the tests.
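As an illustration, the raw-score computation can be carried out as in the Python sketch below. The Monday and Tuesday scores here are hypothetical stand-ins, since the original score table is not reproduced in this document, so the result will not be exactly 0.80.

```python
# Raw-score Pearson r, mirroring the formula above.
# Hypothetical data: 20-item spelling test scores for the same
# five students on Monday (X) and Tuesday (Y).

x = [18, 15, 12, 19, 10]  # Monday scores
y = [17, 16, 11, 20, 9]   # Tuesday scores

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

r = (n * sum_xy - sum_x * sum_y) / (
    ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
)
print(f"r = {r:.2f}")  # a high positive r suggests consistent scores
```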

3. Difference between a positive and a negative correlation

➢ When the value of the correlation coefficient is positive, it means that the higher the scores
in X, the higher the scores in Y. This is called a positive correlation.
➢ When the value of the correlation coefficient is negative, it means that the higher the scores
in X, the lower the scores in Y, and vice versa. This is called a negative correlation.
➢ When the same test is administered to the same group of participants, usually a positive
correlation indicates reliability or consistency of the scores.

4. Determining the strength of a correlation


The strength of the correlation also indicates the strength of the reliability of the test. It is shown by the value of the correlation coefficient: the closer the value is to 1.00 or -1.00, the stronger the correlation. Below is the guide:

0.80 - 1.00 Very strong relationship

0.60 - 0.79 Strong relationship

0.40 - 0.59 Substantial/marked relationship

0.20 - 0.39 Weak relationship

0.00 - 0.19 Negligible relationship

5. Determining the significance of the correlation


The correlation obtained between two variables could be due to chance. To determine whether the correlation reflects a real relationship rather than a chance result, it is tested for significance. When a correlation is significant, it means that the relationship between the two variables is very unlikely to have occurred by chance.

To determine whether a correlation coefficient is significant, it is compared with a critical value, the value expected by chance at a given probability level. When the computed value is greater than the critical value, the correlation is significant, meaning there is at least a 95% probability (at the 0.05 level) that the two variables are truly related.
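One common way to carry out this significance test is to convert r to a t statistic, t = r√(n − 2) / √(1 − r²), and compare it with the critical t value. The Python sketch below assumes a hypothetical case of n = 20 examinees tested at the 0.05 level; the book does not show this computation, so treat it as one standard approach rather than the book's own method.

```python
# Testing whether a correlation is significant: convert r to a t statistic
# and compare it with a critical value from a t-table.
# Assumption: n = 20 examinees (df = 18), two-tailed test at the 0.05 level.

import math

r = 0.80          # correlation coefficient from the reliability analysis
n = 20            # number of examinees
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

critical_t = 2.101  # t-table value for df = 18, alpha = 0.05, two-tailed

if t > critical_t:
    print(f"t = {t:.2f} > {critical_t}: the correlation is significant")
else:
    print(f"t = {t:.2f} <= {critical_t}: not significant")
```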
Another statistical analysis used to determine the internal consistency of a test is Cronbach's alpha. Follow the procedure below to determine internal consistency.

Suppose that five students answered a checklist about their hygiene on a scale of 1 to 5, where the points correspond to the following responses:

5 - always, 4 - often, 3 - sometimes, 2 - rarely, 1 - never

The checklist has five items. The teacher wanted to determine whether the items have internal consistency.

Following the Cronbach's alpha procedure, the computed internal consistency of the responses is 0.10, indicating low internal consistency.
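Here is a minimal Python sketch of the Cronbach's alpha computation, α = [k/(k − 1)][1 − Σs²(item)/s²(total)], using hypothetical checklist responses since the book's data table is not reproduced in this document.

```python
# Cronbach's alpha for a five-item, 1-5 checklist.
# Hypothetical data: one list of item responses per student.

def variance(values):
    """Sample variance with n - 1 in the denominator."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(rows):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / total variance)."""
    k = len(rows[0])                       # number of items
    items = list(zip(*rows))               # responses grouped per item
    item_vars = sum(variance(list(col)) for col in items)
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_vars / total_var)

responses = [
    [5, 4, 3, 4, 5],
    [3, 2, 3, 2, 3],
    [4, 4, 4, 5, 4],
    [2, 3, 2, 2, 1],
    [5, 5, 4, 5, 5],
]
print(round(cronbach_alpha(responses), 2))  # 0.60 and above suggests consistency
```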
The consistency of ratings can also be obtained using a coefficient of concordance. The Kendall's w coefficient of concordance is used to test the agreement among raters.
Below is a performance task demonstrated by five students and rated by three raters. The rubric used a scale of 1 to 4, where 4 is the highest and 1 is the lowest.

Demonstration    Rater 1    Rater 2    Rater 3    Sum of Ratings      D       D²

A                4          4          3          11                  2.6     6.76
B                3          2          3          8                  -0.4     0.16
C                4          3          4          11                  2.6     6.76
D                3          3          2          8                  -0.4     0.16
E                1          1          2          4                  -4.4    19.36

Mean of the sums of ratings: x̄ = 8.4          ΣD² = 33.2


The sum of the ratings for each demonstration is computed first. The mean of the sums of ratings is then obtained (x̄ = 8.4) and subtracted from each sum of ratings to get the difference (D). Each difference is squared (D²), and the sum of the squared differences is computed (ΣD² = 33.2). These values are substituted in the Kendall's w formula, where m is the number of raters and n is the number of demonstrations:

w = 12ΣD² / [m²n(n² - 1)]

The resulting Kendall's w coefficient of about 0.37 indicates the degree of agreement of the three raters across the five demonstrations. There is moderate concordance among the raters because the value is some distance from 1.00.
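The computation above can be reproduced with the short Python sketch below, using the ratings from the table and the formula w = 12ΣD² / [m²n(n² − 1)], which follows the book's procedure of working on the raw rating sums.

```python
# Kendall's coefficient of concordance (w) for the table above:
# three raters scoring five demonstrations on a 1-4 rubric.

ratings = [           # one row per demonstration: [rater1, rater2, rater3]
    [4, 4, 3],
    [3, 2, 3],
    [4, 3, 4],
    [3, 3, 2],
    [1, 1, 2],
]

m = len(ratings[0])                      # number of raters, 3
n = len(ratings)                         # number of demonstrations, 5
sums = [sum(row) for row in ratings]     # sum of ratings per demonstration
mean = sum(sums) / n                     # 8.4
ssd = sum((s - mean) ** 2 for s in sums) # sum of squared deviations, 33.2

w = 12 * ssd / (m ** 2 * n * (n ** 2 - 1))
print(f"w = {w:.2f}")  # about 0.37: moderate agreement among the raters
```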

What is test validity?


A measure is valid when it measures what it is supposed to measure. If a quarterly exam is valid, its contents should directly measure the objectives of the curriculum. If a scale that measures personality is composed of five factors, the items under each of the five factors should be highly correlated. If an entrance exam is valid, it should predict students' grades after the first semester.

What are the different ways to establish test validity?


There are different ways to establish test validity.

Content Validity
Definition: The items represent the domain being measured.
Procedure: The items are compared with the objectives of the program. The items need to directly measure the objectives (for achievement tests) or the definition (for scales). A reviewer conducts the checking.

Face Validity
Definition: The test is presented well, free of errors, and administered well.
Procedure: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
Definition: The measure should predict a future criterion. An example is an entrance exam predicting students' grades after the first semester.
Procedure: A correlation coefficient is obtained in which the X-variable is used as the predictor and the Y-variable as the criterion.

Construct Validity
Definition: The components or factors of the test should contain items that are strongly correlated.
Procedure: Pearson r can be used to correlate the items within each factor. A technique called factor analysis can also determine which items are highly correlated enough to form a factor.

Concurrent Validity
Definition: Two or more measures of the same characteristic are present for each examinee.
Procedure: The scores on the measures should be correlated.

Convergent Validity
Definition: The components or factors of a test are hypothesized to have a positive correlation.
Procedure: Correlation is done between the factors of the test.

Divergent Validity
Definition: The components or factors of a test are hypothesized to have a negative correlation. An example is correlating scores on a test of intrinsic and extrinsic motivation.
Procedure: Correlation is done between the factors of the test.

How do you determine whether an item is easy or difficult?

➢ An item is difficult if the majority of the students are unable to provide the correct answer.
➢ An item is easy if the majority of the students are able to answer it correctly.
➢ An item can discriminate if examinees who score high on the test answer it correctly more often than examinees who score low.

ITEM ANALYSIS: the process of examining students' responses to individual test items in order to assess the quality of those items and of the test as a whole (Mehta, 2011)
● An excellent question can separate the performing from the non-performing students

STEPS
1. Get the total score of each student and arrange the scores from highest to lowest.
2. Take the top 27% (upper group) and the bottom 27% (lower group) of the examinees.
3. For each item, compute the proportion of examinees in the upper group (pH) and in the lower group (pL) who answered it correctly.
4. Compute the difficulty index of each item. (It is a measure of the proportion of examinees who answered the item correctly.)
5. Compute the discrimination index. (This is a measure of how well an item distinguishes between examinees who are knowledgeable and those who are not, or between masters and non-masters.)
FORMULA:
➢ The difficulty index is obtained using the formula:

Item difficulty = (pH + pL) / 2

➢ The discrimination index is obtained using the formula:

Item discrimination = pH - pL

INTERPRETATION

DIFFICULTY INDEX REMARK

0.76 or higher Easy Item

0.25 to 0.75 Average Item

0.24 or lower Difficult Item

INDEX DISCRIMINATION REMARK

0.40 and above Very Good Item

0.30 - 0.39 Good Item

0.20 - 0.29 Reasonably Good Item

0.10 - 0.19 Marginal Item

Below 0.10 Poor Item


EXAMPLE
This is the result of a test given to a total of 10 students. (N=10)

ITEM ANALYSIS COMPUTATION

Index of Difficulty
Item 1: (0.67 + 0.00) / 2 = 0.33   Average item
Item 2: (0.67 + 0.33) / 2 = 0.50   Average item
Item 3: (1.00 + 0.67) / 2 = 0.83   Easy item
Item 4: (0.67 + 0.33) / 2 = 0.50   Average item
Item 5: (1.00 + 0.33) / 2 = 0.67   Average item

Index of Discrimination
Item 1: 0.67 - 0.00 = 0.67   Very good item
Item 2: 0.67 - 0.33 = 0.33   Good item
Item 3: 1.00 - 0.67 = 0.33   Good item
Item 4: 0.67 - 0.33 = 0.33   Good item
Item 5: 1.00 - 0.33 = 0.67   Very good item
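The whole item-analysis computation can be reproduced with the Python sketch below, starting from the pH and pL proportions used in the example; the two interpretation functions simply encode the remark tables above.

```python
# Difficulty and discrimination indices for the five items above,
# starting from the pH and pL proportions in the worked example.
# Printed values may differ from the text by 0.01 because pH and pL
# are already rounded to two decimals.

p_upper = [0.67, 0.67, 1.00, 0.67, 1.00]  # pH per item
p_lower = [0.00, 0.33, 0.67, 0.33, 0.33]  # pL per item

def interpret_difficulty(d):
    """Encodes the difficulty-index remark table."""
    return "Easy" if d >= 0.76 else "Average" if d >= 0.25 else "Difficult"

def interpret_discrimination(d):
    """Encodes the discrimination-index remark table."""
    if d >= 0.40: return "Very Good"
    if d >= 0.30: return "Good"
    if d >= 0.20: return "Reasonably Good"
    if d >= 0.10: return "Marginal"
    return "Poor"

for i, (ph, pl) in enumerate(zip(p_upper, p_lower), start=1):
    difficulty = (ph + pl) / 2   # proportion answering the item correctly
    discrimination = ph - pl     # upper-group minus lower-group proportion
    print(f"Item {i}: difficulty {difficulty:.2f} "
          f"({interpret_difficulty(difficulty)}), "
          f"discrimination {discrimination:.2f} "
          f"({interpret_discrimination(discrimination)})")
```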

References:

1. Ubiña-Balagtas, M., David, A., Golla, E., Magno, C., & Valladolid, V. (2020). Establishing test validity and reliability. In Assessment in Learning 1 (pp. 96-118). Rex Book Store, Inc.

2. Application of IRT Using the Rasch Model in Constructing Measures: https://www.slideshare.net/crimgn/the-application-of-irt-using-the-rasch-model-presnetation

3. Reliability and Validity in Student Assessment: https://www.youtube.com/watch?v=gzv8Cm1jC4M
