Week4 2 Testing
05/10/24 1
Aim is...
Using the information about reliability in
evaluating, interpreting and improving
psychological tests.
Reliability information alone is NOT
enough
Relationship between reliability and
validity
Using the Reliability Coefficient
It provides a relative measure of the accuracy of
test scores.
It doesn’t provide an indication of how accurate
test scores really are, in absolute terms.
A score of 110 on an intelligence test: is it
really higher than the average score (i.e., 100)?
How much variability should we expect on the
basis of measurement error? The reliability coefficient
does not say this in concrete terms!
So, we need to know the size of the standard
error of measurement.
Using the Reliability Coefficient vs
SEM
Reliability coefficients are most useful in
comparing the scores produced by different tests.
The standard error of measurement (SEM) is
more useful when interpreting test scores.
SEM – Standard Error of
Measurement
SEM – measure of how much the
individual’s score is likely to differ from the
individual’s true test score.
SEM depends on 2 factors:
The reliability of the test (rxx)
The variability of test scores (σx, the standard deviation)
SEM – Standard Error of
Measurement
If a spelling test has a reliability coefficient of .84
and a standard deviation of 10, then
SEM = σx × √(1 − rxx) = 10 × √(1 − .84) = 10 × .40 = 4
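The spelling-test example can be worked through in a short sketch. It assumes the standard formula SEM = SD × √(1 − rxx); the function name standard_error_of_measurement is hypothetical and used only for illustration.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Return SEM = SD * sqrt(1 - rxx). Assumes 0 <= rxx <= 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Values from the slide: rxx = .84, SD = 10
sem = standard_error_of_measurement(sd=10, reliability=0.84)
print(round(sem, 2))  # approximately 4
```

With a perfectly reliable test (rxx = 1) the SEM is 0; with rxx = 0 it equals the standard deviation of the test scores.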
SEM – Standard Error of
Measurement
For testing purposes, SEM is more useful
than reliability coefficients.
We can use SEM to create a confidence
interval around a test taker's score.
95% confidence interval = observed score ± (1.96 × SEM)
As reliability increases, SEM decreases
As a test becomes more reliable, we can feel
more confident that an individual’s observed
score is close to the individual’s true score.
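A minimal sketch of how the interval narrows as reliability rises. The SD of 15 and the observed score of 110 are made-up values for illustration, and the helper names sem and confidence_interval are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided critical value for a 95% confidence level


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - rxx)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(observed, sem_value, z=Z_95):
    """Return (lower, upper) bounds: observed score +/- z * SEM."""
    margin = z * sem_value
    return observed - margin, observed + margin


# Hypothetical test with SD = 15 and an observed score of 110
for rxx in (0.70, 0.80, 0.90, 0.95):
    s = sem(15, rxx)
    low, high = confidence_interval(110, s)
    print(f"rxx={rxx:.2f}  SEM={s:.2f}  95% CI=({low:.1f}, {high:.1f})")
```

Each step up in rxx gives a smaller SEM and a tighter interval around the same observed score.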
Confidence Intervals
Example of Confidence Interval:
“Johnny’s FSIQ is 113 (between 108 and 118
with 95% confidence).”
The SEM and confidence intervals remind us
that scores are not perfect.
SEM – Standard Error of
Measurement
If a person's true score is 110 on a test with a standard
error of measurement of 3.7 and a mean of 100, we
would expect 95% of the person's test scores to fall
within
a) 102.75 - 117.25
b) 92.75 - 107.25
c) 90.75 - 120.25
d) 100-110
The interval is centred on the person's true score (110), not on the test mean (100).
3.7 x 1.96 = 7.25
Range is 110 ± 7.25, i.e. 102.75 to 117.25, so the answer is (a).
Reliability and Validity
A reliable test is NOT necessarily valid.
A test can be reliable yet not valid.
Reliability and Validity
Measurement error lowers (attenuates) the observed correlation
between two tests, X and Y
(in other words, it lowers the validity of predictions).
‘Correction for attenuation’
A method of estimating the true correlation between X and Y
from the observed correlation between two unreliable measures of
X and Y.
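The correction for attenuation can be sketched as follows, assuming the standard formula: the observed correlation divided by the square root of the product of the two reliabilities. The function name and the input values are hypothetical.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores on X and Y.

    r_xy: observed correlation between the two tests
    r_xx, r_yy: reliability coefficients of X and Y (must be > 0)
    """
    if r_xx <= 0 or r_yy <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(r_xx * r_yy)


# Made-up values: observed r = .40, reliabilities .70 and .80
r_true = correct_for_attenuation(r_xy=0.40, r_xx=0.70, r_yy=0.80)
print(round(r_true, 3))
```

The corrected value is always at least as large as the observed correlation, because unreliability can only shrink the observed relationship.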
Reliability and Validity
If the reliability of tests is increased,
the validity of tests would also
be expected to increase.
Example (a) shows what an unreliable test would look like. Example (b) shows what a
reliable but invalid test would look like. It is similar to a rifle that has its sights
misaligned. The high degree of reliability is shown by the consistency of the strikes. The
lack of validity is shown by the fact that the shots are missing their target, the
bullseye. For example, a job satisfaction test given to unskilled workers may measure
literacy skills rather than job satisfaction if the test is written in complex language. In
psychometric terms, the test is not measuring what it was intended to measure.
Example (c) is what a valid and reliable test would look like: the shots hit the mark
and they hit it consistently.
Special Issues
Speed test vs. power tests
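Speed: A test in which items are trivially easy but the time limit is too
short to finish, e.g. 60 seconds for a 100-item test.
Power: A test in which items vary in difficulty and time is not the limiting factor.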
Special Issues
Speed test vs. power tests
A pure speed test should have an odd-even split-half
reliability of about 1.0. This value is spuriously high, so split-half
estimates are not appropriate for speed tests.
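A small simulation illustrates why the odd-even coefficient approaches 1.0 on a pure speed test. It assumes every item reached is answered correctly, so a person's odd-half and even-half scores can differ by at most one point. The sample size, the range of items reached, and the helper pearson_r are illustrative assumptions.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)

# Each simulated examinee reaches between 20 and 100 items of a 100-item
# test and, because the items are trivially easy, gets all of them right.
items_reached = [random.randint(20, 100) for _ in range(200)]
odd_scores = [(k + 1) // 2 for k in items_reached]  # items 1, 3, 5, ...
even_scores = [k // 2 for k in items_reached]       # items 2, 4, 6, ...

r_half = pearson_r(odd_scores, even_scores)
print(round(r_half, 4))
```

The near-perfect correlation reflects how the halves were formed, not the consistency of the examinees' working speed.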
Selecting a Reliability
Coefficient
If a test is to be administered multiple times:
Test-Retest Reliability
Tests to be administered one time:
Homogeneous content – coefficient alpha
Heterogeneous content – split-half coefficient
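For single-administration tests, the two coefficients can be sketched as below. The slide gives no formulas, so this assumes the usual definition of coefficient alpha and the Spearman-Brown step-up commonly applied to a half-test correlation. The function names and the response data are made up for illustration.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient alpha from a list of rows (one row per person,
    one column per item)."""
    k = len(item_scores[0])
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full test."""
    return 2 * r_half / (1 + r_half)


# Made-up responses: 5 people x 4 items (1 = correct, 0 = incorrect)
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = coefficient_alpha(data)
print(round(alpha, 3))
print(round(spearman_brown(0.70), 3))
```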
How Reliable Should Tests Be?
Reliability of Composite and
Difference Scores
Composite scores
When scores are combined
to form a composite
For example, IQs are
typically composite scores
The reliability of composite
scores is typically higher
than that of the individual scores
in the composite
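A hedged sketch of why composites tend to be more reliable. It assumes the component tests are standardized (variance 1) and equally weighted, so error variance is k − k·(mean reliability) and total variance is k + (k² − k)·(mean intercorrelation). The function name and the inputs are hypothetical.

```python
def composite_reliability(k, mean_reliability, mean_intercorrelation):
    """Reliability of the sum of k standardized, equally weighted tests."""
    error_variance = k - k * mean_reliability
    total_variance = k + (k * k - k) * mean_intercorrelation
    return 1 - error_variance / total_variance


# Made-up example: three subtests, each with rxx = .70, intercorrelated .50
r_composite = composite_reliability(3, 0.70, 0.50)
print(round(r_composite, 3))
```

Under these assumptions the composite is more reliable than any single subtest whenever the subtests are positively correlated.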
Reliability of Composite and
Difference Scores
Difference scores
Involves calculating the difference between two
scores
The reliability of difference scores is typically
lower than that of the individual scores
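The drop in reliability for difference scores can be sketched with the standard formula for two tests with equal variances: the mean of the two reliabilities minus their intercorrelation, divided by one minus the intercorrelation. The function name and the values are illustrative assumptions.

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of D = X - Y for two tests with equal variances.

    r_xx, r_yy: reliabilities of X and Y
    r_xy: observed correlation between X and Y (must be < 1)
    """
    if r_xy >= 1:
        raise ValueError("r_xy must be less than 1")
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Made-up example: two tests with rxx = .85 each, correlated .70
r_diff = difference_score_reliability(0.85, 0.85, 0.70)
print(round(r_diff, 3))
```

The more highly the two tests correlate, the less reliable their difference becomes, because the shared true-score variance cancels while the measurement errors add.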