Lecture 2

This document discusses the goodness of measures in terms of reliability and validity. Reliability refers to the consistency of a measure and is assessed through test-retest, alternate-form, and internal consistency reliability. Validity refers to whether an instrument measures what it is intended to measure and is assessed through face, content, and construct validity. Together, reliability and validity determine the quality of a measure.


GOODNESS OF MEASURES: RELIABILITY AND VALIDITY


Reliability
• Definition
• The degree of stability exhibited when a
measurement is repeated under identical
conditions.
• Lack of reliability may arise from:
  – divergences between observers or instruments of measurement, or
  – instability of the attribute being measured.
Assessment of reliability
• Reliability is assessed in three forms:
– Test-retest reliability
– Alternate-form reliability
– Internal consistency reliability
Test-retest reliability
• Most common form in surveys
• Measured by having the same respondents
complete a survey at two different points in time
to see how stable the responses are.
• Usually quantified with a correlation coefficient (r
value)
• In general, r values are considered good if r ≥ 0.70.
• If data are recorded by an observer, you can have
the same observer make two separate
measurements.
• The comparison between the two measurements
is intra-observer reliability.
• Be careful about test-retest with items or scales
that measure variables likely to change over a
short period of time, such as happiness, anxiety,
etc.
• If you do it, make sure that you test-retest over
very short periods of time.
• Potential problem with test-retest is the practice
effect.
– Individuals become familiar with the items and
simply answer based on their memory of the
last answer.
• What effect does this have on your reliability
estimates?
• It inflates the reliability estimate.
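The following minimal Python sketch (an illustration, not part of the lecture; the score lists are hypothetical) shows how test-retest reliability is typically quantified with a Pearson correlation coefficient:

# Test-retest reliability as a Pearson correlation (illustrative data).
from scipy.stats import pearsonr

time1 = [12, 15, 9, 20, 17, 14, 11, 18]   # scores at first administration
time2 = [13, 14, 10, 19, 18, 13, 12, 17]  # same respondents at retest

r, p = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f}")  # r >= 0.70 is conventionally considered good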
Alternate-form reliability
• Use differently worded forms to measure the
same attribute.
• Questions or responses are reworded or their
order is changed to produce two items that are
similar but not identical.
• Be sure that the two items address the same
aspect of behavior with the same vocabulary and
the same level of difficulty.
– Items should differ in wording only.
• It is common to simply change the order of the
response alternatives.
– This forces respondents to read the response
alternatives carefully and thus reduces practice
effect.
Example: Assessment of Depression
Circle one item
Version A:
During the past 4 weeks, I have felt downhearted:
Every day 1
Some days 2
Never 3

Version B:
During the past 4 weeks, I have felt downhearted:
Never 1
Some days 2
Every day 3
You could also change the wording of the response
alternatives without changing the meaning.

Example: Assessment of urinary function


Version A:
During the past week, how often did you usually empty
your bladder?
1 to 2 times per day
3 to 4 times per day
5 to 8 times per day
12 times per day
More than 12 times per day
Version B:
During the past week, how often did you usually empty
your bladder?
Every 12 to 24 hours
Every 6 to 8 hours
Every 3 to 5 hours
Every 2 hours
More than every 2 hours
• You could also change the actual wording of the
question.
– Be careful to make sure that the two items are
equivalent.
– Items with different degrees of difficulty do not
measure the same attribute.
– What might they measure?
• Reading comprehension or cognitive
function.
Example: Assessment of loneliness
Version A:
How often in the past month have you felt alone in the world?
Every day
Some days
Occasionally
Never
Version B:
During the past 4 weeks, how often have you felt a sense of
loneliness?
All of the time
Sometimes
From time to time
Never
• Practice effects may occur even when alternate
forms are used.
• Even though you use different questions on the
parallel form, participants may respond similarly
on the second test because they are familiar with
your question format.
Internal consistency reliability
• Applied not to one item, but to groups of items that are
thought to measure different aspects of the same concept.
• Cronbach’s coefficient alpha
– Measures internal consistency reliability among a
group of items combined to form a single scale.
– It is a reflection of how well the different items
complement each other in their measurement of
different aspects of the same variable or quality.
– Interpret like a correlation coefficient (α ≥ 0.70 is good)
Example: Assessment of physical function
(Responses: 1 = Limited a lot, 2 = Limited a little, 3 = Not limited)
• Vigorous activities, such as running, lifting heavy objects, participating in strenuous sports
• Moderate activities, such as moving a table, pushing a vacuum cleaner, bowling, or playing golf
• Lifting or carrying groceries
• Climbing several flights of stairs
• Bending, kneeling, or stooping
• Walking more than a mile
• Walking several blocks
• Walking one block
• Bathing or dressing yourself


Calculation of Cronbach’s coefficient alpha
Example: Assessment of emotional health
(Responses: Yes = 1, No = 0)
During the past month:
• Have you been a very nervous person?
• Have you felt downhearted and blue?
• Have you felt so down in the dumps that nothing could cheer you up?
Results

Patient      Item 1     Item 2     Item 3     Summed scale score
1            0          1          1          2
2            1          1          1          3
3            0          0          0          0
4            1          1          1          3
5            1          1          0          2
% positive   3/5 = .6   4/5 = .8   3/5 = .6
Calculations
Mean summed score = (2 + 3 + 0 + 3 + 2) / 5 = 2
Sample variance = [(2−2)² + (3−2)² + (0−2)² + (3−2)² + (2−2)²] / (5 − 1) = 6/4 = 1.5

Cronbach's coefficient alpha (k = number of items):

α = [1 − Σᵢ (%pos)ᵢ(%neg)ᵢ / Var] × [k / (k − 1)]
  = [1 − ((.6)(.4) + (.8)(.2) + (.6)(.4)) / 1.5] × (3 / 2)
  = 0.86

Conclude that this scale has good reliability (α ≥ 0.70).
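For illustration (not part of the lecture), a short Python sketch that reproduces this calculation from the five-patient data above:

# Cronbach's alpha for dichotomous (0/1) items, using the slide's formula
# (for yes/no items this is equivalent to the Kuder-Richardson formula).
scores = [
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]  # rows = patients, columns = items

k = len(scores[0])                      # number of items
n = len(scores)                         # number of patients
totals = [sum(row) for row in scores]   # summed scale scores
mean = sum(totals) / n
var = sum((t - mean) ** 2 for t in totals) / (n - 1)   # sample variance = 1.5

# Sum of (% positive)(% negative) over the items
pq = 0.0
for j in range(k):
    p = sum(row[j] for row in scores) / n
    pq += p * (1 - p)

alpha = (1 - pq / var) * (k / (k - 1))
print(f"alpha = {alpha:.2f}")   # 0.86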


• If internal consistency is low, you can add more items or re-examine existing items for clarity.
Validity
• Definition
• Validity is often defined as the extent to which an
instrument measures what it purports to measure.
• It is important in determining whether the
statements in the instrument are relevant to the
study.
Assessment of validity
• Validity is measured in three forms
– Face validity
– Content validity
– Construct validity
Face validity
• Face validity is related to checking whether the
instrument looks as if it measures what it is
supposed to measure.
• To establish face validity, investigators seek
experts to review the instrument for grammar,
organization, appropriateness, and confirmation
that it appears to flow logically.
Content validity
• Subjective measure of how appropriate the items
seem to a set of reviewers who have some
knowledge of the subject matter.
– Usually consists of an organized review of the
survey’s contents to ensure that it contains
everything it should and doesn’t include
anything that it shouldn’t.
• There is no universally accepted standard indicator
of content validity.
• However, calculating the content validity index
(CVI) is one of the most popular ways to evaluate
content validity.
• The CVI is based on experts’ rating of item
relevance.
• A panel of 3 to 10 experts is recommended for evaluating the content validity index (CVI) of the instrument.
• In this process, reviewers are informed about the
objective of the study and the operational
definitions of the constructs.
• They are asked to rate each item using a 4-point
scale (1 = not relevant; 2 = somewhat relevant; 3 =
quite relevant; 4 = highly relevant).
• Responses for each item are then dichotomized:
questions that received a 1 or 2 are given a zero,
and items that scored 3 or 4 are given 1 point.
• The Item-level Content Validity Index (I-CVI) is computed by totaling the points for each item and then dividing by the total number of experts.
• For example, if an item is marked as quite relevant or highly relevant by 5 of 6 experts, it receives a total score of 5, which is divided by the number of experts (5/6 = .83).
• The Scale-level Content Validity Index (S-CVI) is then obtained by averaging the I-CVIs of all the items in the instrument.
• Experts proposed that an Item-level Content
Validity Index (I-CVI) of 1.00 is ideal when there
are five or fewer experts, while an I-CVI of .83 or
higher is recommended when there are more than
five experts.
• However, most scholars argue that an I-CVI greater than .78 is acceptable overall.
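As an illustration (not from the lecture; the expert ratings are hypothetical), a minimal Python sketch of the I-CVI and S-CVI computation:

# I-CVI and S-CVI from expert relevance ratings.
# Rows = items, columns = experts; ratings use the 4-point scale above.
ratings = [
    [4, 3, 4, 4, 3, 4],
    [3, 3, 4, 2, 4, 3],
    [2, 4, 3, 3, 4, 4],
]

n_experts = len(ratings[0])
i_cvis = []
for item in ratings:
    relevant = sum(1 for r in item if r >= 3)   # dichotomize: 3 or 4 -> 1 point
    i_cvis.append(relevant / n_experts)

s_cvi = sum(i_cvis) / len(i_cvis)               # S-CVI = average of the I-CVIs
print("I-CVIs:", [round(v, 2) for v in i_cvis]) # [1.0, 0.83, 0.83]
print(f"S-CVI = {s_cvi:.2f}")                   # 0.89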
Construct validity
• Most valuable and most difficult measure of
validity.
• Basically, it is a measure of how meaningful the
scale or instrument is when it is in practical use.
• Convergent validity: two instruments intended to measure the same construct should correlate highly.
• Divergent (discriminant) validity: instruments intended to measure different constructs should not correlate highly.
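For illustration (not from the lecture; all scores are hypothetical), a minimal Python sketch of gathering convergent and divergent evidence with correlations:

# Convergent vs. divergent evidence via Pearson correlations (illustrative data).
from scipy.stats import pearsonr

new_scale  = [10, 14, 9, 18, 12, 16, 8, 15]   # instrument being validated
same_const = [11, 15, 10, 17, 13, 15, 9, 16]  # established measure of the same construct
diff_const = [11, 6, 7, 9, 12, 5, 6, 8]       # measure of an unrelated construct

r_conv, _ = pearsonr(new_scale, same_const)   # high here: convergent evidence
r_div, _ = pearsonr(new_scale, diff_const)    # near zero here: divergent evidence
print(f"convergent r = {r_conv:.2f}, divergent r = {r_div:.2f}")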
