Health
Measurement
Scales
A practical guide
to their development
and use
FIFTH EDITION
David L. Streiner
Department of Psychiatry, University of Toronto, Toronto,
Ontario, Canada
Department of Psychiatry and Behavioural Neurosciences
Department of Clinical Epidemiology and Biostatistics
McMaster University, Hamilton, Ontario, Canada
Geoffrey R. Norman
Department of Clinical Epidemiology and Biostatistics,
McMaster University, Hamilton, Ontario, Canada
and
John Cairney
Department of Family Medicine, McMaster University
Hamilton, Ontario, Canada
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Oxford University Press 2015
The moral rights of the authors have been asserted
First edition published in 1989
Second edition published in 1995
Third edition published in 2003
Reprinted 2004
Fourth edition published in 2008
Fifth edition published in 2015
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2014939328
ISBN 978–0–19–968521–9
Printed in Great Britain by
Clays Ltd, St Ives plc
Oxford University Press makes no representation, express or implied, that the
drug dosages in this book are correct. Readers must therefore always check
the product information and clinical procedures with the most up-to-date
published product information and data sheets provided by the manufacturers
and the most recent codes of conduct and safety regulations. The authors and
the publishers do not accept responsibility or legal liability for any errors in the
text or for the misuse or misapplication of material in this work. Except where
otherwise stated, drug dosages and recommendations are for the non-pregnant
adult who is not breast-feeding
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Dedication
Now that this edition of the book is done,
we will finally emerge from our respective writing places
and rejoin our families. As partial repayment for
the suffering that this will cause,
we dedicate this book to our wives:
Preface to the fifth edition

The most noticeable change in the fifth edition is on the cover—the addition of a third
author, Dr John Cairney. The need for one was obvious to us: Streiner has retired three
times (so far), and Norman is getting to the stage where he is contemplating his first.
This book has been successful beyond what we could ever have imagined, and we want
to see it continue and keep up with developments in the field, but, to paraphrase the
film mogul, Samuel Goldwyn, ‘Include us out’. The solution was to involve a younger
colleague, Cairney. A word about why him.
When the Faculty of Health Sciences at McMaster University began 45 years ago, it
pioneered a new form of teaching, called problem-based learning, with an emphasis on
small-group tutorials. These sessions were led by ‘non-expert tutors’, whose role was
to guide the students’ learning, rather than formal teaching. Consequently, knowledge
of the content area wasn’t required, and indeed, Streiner, a clinical psychologist by
training, tutored sessions in cardiology (after spending a weekend learning where the
heart was located). Although both of us were in at the beginning and helped shape the
learning environment (and, in fact, Norman was, until recently, the Assistant Dean
of Educational Research and Development), we never subscribed to the notion of the
non-expert tutor in the courses we taught; we always believed you needed in-depth
content knowledge in order to lead a class of learners. Eventually, the medical pro-
gramme also learned this profound lesson (thereby ensuring that none of us had to
waste our time and theirs tutoring medical students).
We then moved into more profitable educational ventures, related to teaching what
we knew, like measurement. Over the many years we taught measurement theory and
scale development, we had very few co-tutors. One of the rare exceptions was John,
who grew from filling in for us with the odd lecture (and some were very odd) to now
running the course. So we leave the book in good hands.
But, what has changed in the book? In addition to updating all of the sections to
incorporate new research, we have made some major additions and revisions. Another
very noticeable addition is a flow chart. Based on suggestions we received from read-
ers, this is a map of the road to scale construction, as well as a guide of where to find
the relevant information. The chapter on generalizability theory has been pretty well
rewritten to reconcile the approach we developed with other approaches, as well as
to extend the method to more complicated designs involving stratification and unbal-
anced designs. We have also eliminated some of the more obtuse sections to make it
more accessible to readers. The introduction to the chapter on item response theory
has been extensively revised to hopefully better describe the rationale and technique
of calibrating the items. The chapter on ethics has been expanded to deal with more
problem areas that may arise when researchers are developing scales and establish-
ing their psychometric properties. The ‘Methods of administration’ chapter has been
expanded to deal with technologies that were only just appearing on the scene when
the last revision was written: Web-based administration, the use of smart phones, and
how mobile phones are changing what can and cannot be done in surveys. There
is now a section in the ‘Measuring change’ chapter explaining how to determine if
individual patients have improved by a clinically and statistically significant degree:
a technique called the ‘reliable change index’.
This and previous editions have benefitted greatly from the comments we have
received from readers, and we continue to welcome them. Please write to us at:
[email protected]
[email protected]
[email protected].
D.L.S.
G.R.N.
J.C.
Contents
Appendices
Appendix A Where to find tests 357
Appendix B A (very) brief introduction to factor analysis 375
Author Index 381
Subject Index 391
Chapter 1
Introduction to health
measurement scales
Introduction to measurement
The act of measurement is an essential component of scientific research, whether in
the natural, social, or health sciences. Until recently, however, discussions regard-
ing issues of measurement were noticeably absent in the deliberations of clinical
researchers. Certainly, measurement played as essential a role in research in the
health sciences as it did in other scientific disciplines. However, measurement in the
laboratory disciplines presented no inherent difficulty. Like other natural sciences,
measurement was a fundamental part of the discipline and was approached through
the development of appropriate instrumentation. Subjective judgement played a
minor role in the measurement process; therefore, any issue of reproducibility or val-
idity was amenable to a technological solution. It should be mentioned, however, that
expensive equipment does not, of itself, eliminate measurement errors.
Conversely, clinical researchers were acutely aware of the fallibility of human judge-
ment as evidenced by the errors involved in processes such as radiological diagnosis
(Garland 1959; Yerushalmy 1955). Fortunately, the research problems approached
by many clinical researchers—cardiologists, epidemiologists, and the like—frequently
did not depend on subjective assessment. Trials of therapeutic regimens focused on
the prolongation of life and the prevention or management of such life-threatening
conditions as heart disease, stroke, or cancer. In these circumstances, the measure-
ment is reasonably straightforward. ‘Objective’ criteria, based on laboratory or tissue
diagnosis where possible, can be used to decide whether a patient has the disease, and
warrants inclusion in the study. The investigator then waits an appropriate period of
time and counts those who did or did not survive—and the criteria for death are rea-
sonably well established, even though the exact cause of death may be a little more
difficult to ascertain.
In the past few decades, the situation in clinical research has become more com-
plex. The effects of new drugs or surgical procedures on quantity of life are likely to be
marginal indeed. Conversely, there is an increased awareness of the impact of health
and healthcare on the quality of human life. Therapeutic efforts in many disciplines
of medicine—psychiatry, respirology, rheumatology, oncology—and other health
professions—nursing, physiotherapy, occupational therapy—are directed equally if
not primarily to the improvement of quality, not quantity of life. If the efforts of
these disciplines are to be placed on a sound scientific basis, methods must be devised
to measure what was previously thought to be unmeasurable and to assess in a
reproducible and valid fashion those subjective states, which cannot be converted into
the position of a needle on a dial.
The need for reliable and valid measures was clearly demonstrated by Marshall et al.
(2000). After examining 300 randomized controlled trials in schizophrenia, they found
that the studies were nearly 40 per cent more likely to report that treatment was effect-
ive when they used unpublished scales rather than ones with peer-reviewed evidence
of validity; and in non-drug studies, one-third of the claims of treatment superiority
would not have been made if the studies had used published scales.
The challenge is not as formidable as it may seem. Psychologists and educators have
been grappling with the issue for many years, dating back to the European attempts
at the turn of the twentieth century to assess individual differences in intelligence
(Galton, cited in Allen and Yen 1979; Stern 1979). Since that time, particularly since
the 1930s, much has been accomplished so that a sound methodology for the devel-
opment and application of tools to assess subjective states now exists. Unfortunately,
much of this literature is virtually unknown to most clinical researchers. Health sci-
ence libraries do not routinely catalogue Psychometrika or The British Journal of
Statistical Psychology. Nor should they—the language would be incomprehensible to
most readers, and the problems of seemingly little relevance.
Similarly, the textbooks on the subject are directed at educational or psychological
audiences. The former is concerned with measures of achievement applicable to
classroom situations, and the latter is focused primarily on personality or aptitude
measures, again with no apparent direct relevance. In general, textbooks in these dis-
ciplines are directed to the development of achievement, intelligence, or personality
tests.
By contrast, researchers in health sciences are frequently faced with the desire to
measure something that has not been approached previously—arthritic pain, return
to function of post-myocardial infarction patients, speech difficulties of aphasic stroke
patients, or clinical competence of junior medical students. The difficulties and
questions that arise in developing such instruments range from straightforward (e.g. How
many boxes do I put on the response?) to complex (e.g. How do I establish whether the
instrument is measuring what I intend to measure?). Nevertheless, to a large degree,
the answers are known, although they are frequently difficult to access.
The intent of this book is to introduce researchers in health sciences to these con-
cepts of measurement. It is not an introductory textbook, in that we do not confine
ourselves to a discussion of introductory principles and methods; rather, we attempt
to make the book as current and comprehensive as possible. The book does not
delve as heavily into mathematics as many books in the field; such side trips may
provide some intellectual rewards for those who are inclined, but frequently at the
expense of losing the majority of readers. Similarly, we emphasize applications, rather
than theory so that some theoretical subjects (such as Thurstone’s law of compara-
tive judgement), which are of historical interest but little practical importance, are
omitted. Nevertheless, we spend considerable time in the explanation of the concepts
underlying the current approaches to measurement. One other departure from cur-
rent books is that our focus is on those attributes of interest to researchers in health
sciences—subjective states, attitudes, response to illness, etc.—rather than the topics
A roadmap to the book
[Flow chart: the road to scale construction. If an existing instrument can be used or modified (Appendix A), check whether it has been assessed with comparable groups; if it has, use it; if not, pilot the items with a sample from the target population. If no suitable instrument exists, generate items (Chapters 3 and 4) and test them (Chapters 4–6), revising and re-piloting until the items require no further revision. Then conduct reliability and generalizability studies (Chapters 8 and 9) and ask whether the results suggest the instrument is appropriate.]
will be sampling in our study. Appendix A lists many resources for finding existing
scales. For example, there are many scales that measure depression. The selection of
a particular scale will be in part based upon whether there is evidence concerning
its measurement properties specifically in the context of our target population (e.g. a
depression measure for children under the age of 12). Review of the literature is how
we determine this. It is not uncommon at this stage to find that we have ‘near’ hits
(e.g. a scale to measure depression in adolescents, but one that has not been used to
measure depression in children). Here, we must exercise judgement. If we believe that
only minor modifications are required, then we can usually proceed without entering
into a process of item testing. An example here comes from studies of pain. Pain scales
can often be readily adapted for specific pain sites of clinical importance (hands, feet,
knees, etc.). Absence of a specific scale measuring thumb pain on the right hand for
adults aged 60–65 years, however, is not justification for the creation of a new scale.
There are instances, though, where we cannot assume that a scale that has been
tested in one population can be used in a different population without first
testing this assumption. Here we may enter the process of item generation and testing
if our pilot results reveal that important domains have been missed, or if items need to
be substantially modified to be of use with our study population (on our diagram, this
is the link between the right and left sides, or the pathway connecting ‘No’ regarding
group to the item testing loop).
Once we have created a new scale (or revised an existing one for our population),
we are ready to test reliability (Chapters 8 and 9) and validity (Chapter 10). Of course,
if the purpose of the study is not the design and evaluation of a scale, researchers
typically proceed directly to designing a study to answer a research question or test a
hypothesis of association. Following this path, reporting of psychometric properties of
the new or revised scale will likely be limited to internal consistency only, which is the
easiest, but extremely limited, psychometric property to assess (see the section entitled
‘Kuder–Richardson 20 and coefficient α’ in Chapter 5). However, study findings, while
not explicitly stated as such, can nevertheless provide important information on val-
idity. One researcher’s cross-sectional study is another’s construct validation study: it
all depends on how you frame the question. From our earlier example, if the results
of a study using our new scale to measure depression in children shows that girls have
higher levels of depression than boys, or that children who have been bullied are more
likely to be depressed than children who have not, this is evidence of construct valid-
ity (both findings show the scale is producing results in an expected way, given what
we know about depression in children). If the study in question is a measurement
study, then we would be more intentional in our design to measure different aspects
of both reliability and validity. The complexity of the design will depend on the scale
(and the research question). If, for example, we are using a scale that requires multiple
raters to assess a single object of measurement (e.g. raters evaluating play behaviour in
children, tutors rating students), and we are interested in assessing whether these rat-
ings are stable over time, we can devise quite complex factorial designs, as we discuss
in Chapter 9, and use generalizability (G) theory to simultaneously assess different
aspects of reliability. As another example, we may want to look at various aspects
of reliability—inter-rater, test–retest, internal consistency; it then makes sense to do
it with a single G study so we can look at the error resulting from the various fac-
tors. In some cases, a cross-sectional survey, where we compare our new (or revised)
measure to a standard measure by correlating it with other measures hypothesized to
be associated with the construct we are trying to measure, may be all that is needed.
Of course, much has been written on the topic of research design already (e.g. Streiner
and Norman, 2009) and an extended discussion is beyond the scope of this chap-
ter. We do, however, discuss different ways to administer scales (Chapter 13). Choice
of administration method will be determined both by the nature of the scale and
by what aspects of reliability and validity we are interested in assessing. But before
any study is begun, be aware of ethical issues that may arise; these are discussed in
Chapter 14.
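The point above about internal consistency being the easiest psychometric property to assess can be made concrete. The sketch below is our own illustration with made-up data, not an example from the book; it computes coefficient α directly from its variance formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores).

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up responses: 5 people answering 3 Likert-type items
scores = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
])
print(round(cronbach_alpha(scores), 2))  # 0.94 for this toy data
```

A single administration of the scale is all this requires, which is precisely why α is so often the only property reported, and why, as noted above, it is so limited on its own.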
Of course, as with all research, our roadmap is iterative. Most of the scales that
have stood the test of time (much like this book) have been revised, re-tested, and
tested again. In pursuit of our evaluation of reliability and validity, we often find that
our scale needs to be tweaked. Sometimes, the required changes are more dramatic.
As our understanding of the construct we are measuring evolves, we often need to
revise scales accordingly. At any rate, it is important to dispel the notion that once a
scale is created, it is good for all eternity. As with the research process in general, the
act of research often generates more questions than answers (thereby also ensuring
our continued employment). When you are ready to write up what you have found,
check Chapter 15 regarding reporting guidelines for studies of reliability and validity.
Further reading
Colton, T.D. (1974). Statistics in medicine. Little Brown, Boston, MA.
Freund, J.E., and Perles, B.M. (2005). Modern elementary statistics (12th edn). Pearson, Upper
Saddle River, NJ.
Huff, D. (1954). How to lie with statistics. W.W. Norton, New York.
Norman, G.R. and Streiner, D.L. (2003). PDQ statistics (3rd edn). PMPH USA, Shelton, CT.
Norman, G.R. and Streiner, D.L. (2014). Biostatistics: The bare essentials (4th edn). PMPH USA,
Shelton, CT.
References
Allen, M.J. and Yen, W.M. (1979). Introduction to measurement theory. Brooks Cole, Monterey,
CA.
Garland, L.H. (1959). Studies on the accuracy of diagnostic procedures. American Journal of
Roentgenology, 82, 25–38.
Marshall, M., Lockwood, A., Bradley, C., Adams, C., Joy, C., and Fenton, M. (2000).
Unpublished rating scales: A major source of bias in randomised controlled trials of
treatments for schizophrenia. British Journal of Psychiatry, 176, 249–52.
Streiner, D.L. and Norman, G.R. (2009). PDQ epidemiology (3rd edn). PMPH USA, Shelton,
CT.
Yerushalmy, J. (1955). Reliability of chest radiography in diagnosis of pulmonary lesions.
American Journal of Surgery, 89, 231–40.
Chapter 2
Basic concepts
Critical review
Having located one or more scales of potential interest, it remains to choose whether
to use one of these existing scales or to proceed to development of a new instru-
ment. In part, this decision can be guided by a judgement of the appropriateness of
the items on the scale, but this should always be supplemented by a critical review of
the evidence in support of the instrument. The particular dimensions of this review
are described in the following sections.
the scale was too long, or the responses were not in a preferred format. As we have
indicated, this judgement should comprise only one of several used in arriving at an
overall judgement of usefulness, and should be balanced against the time and cost of
developing a replacement.
Reliability
The concept of reliability is, on the surface, deceptively simple. Before one can obtain
evidence that an instrument is measuring what is intended, it is first necessary to
gather evidence that the scale is measuring something in a reproducible fashion. That
is, a first step in providing evidence of the value of an instrument is to demonstrate
that measurements of individuals on different occasions, or by different observers, or
by similar or parallel tests, produce the same or similar results.
That is the basic idea behind the concept—an index of the extent to which meas-
urements of individuals obtained under different circumstances yield similar results.
However, the concept is refined a bit further in measurement theory. If we were con-
sidering the reliability of, for example, a set of bathroom scales, it might be sufficient
to indicate that the scales are accurate to ±1 kg. From this information, we can easily
judge whether the scales will be adequate to distinguish among adult males (prob-
ably yes) or to assess weight gain of premature infants (probably no), since we have
prior knowledge of the average weight and variation in weight of adults and premature
infants.
Such information is rarely available in the development of subjective scales. Each
scale produces a different measurement from every other. Therefore, to indicate that a
particular scale is accurate to ±3.4 units provides no indication of its value in measur-
ing individuals unless we have some idea about the likely range of scores on the scale.
To circumvent this problem, reliability is usually quoted as a ratio of the variability
between individuals to the total variability in the scores; in other words, the reliability
is a measure of the proportion of the variability in scores that is due to true dif-
ferences between individuals. Thus, the reliability is expressed as a number between
0 and 1, with 0 indicating no reliability, and 1 indicating perfect reliability.
An important issue in examining the reliability of an instrument is the manner in
which the data were obtained that provided the basis for the calculation of a reliabil-
ity coefficient. First of all, since the reliability involves the ratio of variability between
subjects to total variability, one way to ensure that a test will look good is to conduct
the study on an extremely heterogeneous sample, for example, to measure knowledge
of clinical medicine using samples of first-year, third-year, and fifth-year students.
Examine the sampling procedures carefully, and assure yourself that the sample used
in the reliability study is approximately the same as the sample you wish to study.
Second, there are any number of ways in which reliability measures can be obtained,
and the magnitude of the reliability coefficient will be a direct reflection of the
particular approach used. Some broad definitions are described as follows:
1. Internal consistency. Measures of internal consistency are based on a single admin-
istration of the measure. If the measure has a relatively large number of items
addressing the same underlying dimension, e.g. ‘Are you able to dress yourself?’,