Health Measurement Scales
A practical guide to their development and use
FIFTH EDITION

David L. Streiner
Department of Psychiatry, University of Toronto, Toronto,
Ontario, Canada
Department of Psychiatry and Behavioural Neurosciences
Department of Clinical Epidemiology and Biostatistics
McMaster University, Hamilton, Ontario, Canada

Geoffrey R. Norman
Department of Clinical Epidemiology and Biostatistics,
McMaster University, Hamilton, Ontario, Canada
and

John Cairney
Department of Family Medicine, McMaster University,
Hamilton, Ontario, Canada

Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Oxford University Press 2015
The moral rights of the authors have been asserted
First edition published in 1989
Second edition published in 1995
Third edition published in 2003
Reprinted 2004
Fourth edition published in 2008
Fifth edition published in 2015
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2014939328
ISBN 978–0–19–968521–9
Printed in Great Britain by
Clays Ltd, St Ives plc
Oxford University Press makes no representation, express or implied, that the
drug dosages in this book are correct. Readers must therefore always check
the product information and clinical procedures with the most up-to-date
published product information and data sheets provided by the manufacturers
and the most recent codes of conduct and safety regulations. The authors and
the publishers do not accept responsibility or legal liability for any errors in the
text or for the misuse or misapplication of material in this work. Except where
otherwise stated, drug dosages and recommendations are for the non-pregnant
adult who is not breast-feeding
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Dedication
Now that this edition of the book is done,
we will finally emerge from our respective writing places
and rejoin our families. As partial repayment for
the suffering that this will cause,
we dedicate this book to our wives:

Betty, Pam, and Monique


Preface to the fifth edition

The most noticeable change in the fifth edition is on the cover—the addition of a third
author, Dr John Cairney. The need for one was obvious to us: Streiner has retired three
times (so far), and Norman is getting to the stage where he is contemplating his first.
This book has been successful beyond what we could ever have imagined, and we want
to see it continue and keep up with developments in the field, but, to paraphrase the
film mogul, Samuel Goldwyn, ‘Include us out’. The solution was to involve a younger
colleague, Cairney. A word about why him.
When the Faculty of Health Sciences at McMaster University began 45 years ago, it
pioneered a new form of teaching, called problem-based learning, with an emphasis on
small-group tutorials. These sessions were led by ‘non-expert tutors’, whose role was
to guide the students’ learning, rather than formal teaching. Consequently, knowledge
of the content area wasn’t required, and indeed, Streiner, a clinical psychologist by
training, tutored sessions in cardiology (after spending a weekend learning where the
heart was located). Although both of us were in at the beginning and helped shape the
learning environment (and, in fact, Norman was, until recently, the Assistant Dean
of Educational Research and Development), we never subscribed to the notion of the
non-expert tutor in the courses we taught; we always believed you needed in-depth
content knowledge in order to lead a class of learners. Eventually, the medical pro-
gramme also learned this profound lesson (thereby ensuring that none of us had to
waste our time and theirs tutoring medical students).
We then moved into more profitable educational ventures, related to teaching what
we knew, like measurement. Over the many years we taught measurement theory and
scale development, we had very few co-tutors. One of the rare exceptions was John,
who grew from filling in for us with the odd lecture (and some were very odd) to now
running the course. So we leave the book in good hands.
But, what has changed in the book? In addition to updating all of the sections to
incorporate new research, we have made some major additions and revisions. Another
very noticeable addition is a flow chart. Based on suggestions we received from read-
ers, this is a map of the road to scale construction, as well as a guide to where to find
the relevant information. The chapter on generalizability theory has been pretty well
rewritten to reconcile the approach we developed with other approaches, as well as
to extend the method to more complicated designs involving stratification and unbal-
anced designs. We have also eliminated some of the more obtuse sections to make it
more accessible to readers. The introduction to the chapter on item response theory
has been extensively revised to hopefully better describe the rationale and technique
of calibrating the items. The chapter on ethics has been expanded to deal with more
problem areas that may arise when researchers are developing scales and establish-
ing their psychometric properties. The ‘Methods of administration’ chapter has been
expanded to deal with technologies that were only just appearing on the scene when
the last revision was written: Web-based administration, the use of smart phones, and
how mobile phones are changing what can and cannot be done in surveys. There
is now a section in the ‘Measuring change’ chapter explaining how to determine if
individual patients have improved by a clinically and statistically significant degree: a
technique called the ‘reliable change index’.
This and previous editions have benefitted greatly from the comments we have
received from readers, and we continue to welcome them. Please write to us at:
[email protected]
[email protected]
[email protected].
D.L.S.
G.R.N.
J.C.
Contents

1 Introduction to health measurement scales 1
Introduction to measurement 1
A roadmap to the book 3
2 Basic concepts 7
Introduction to basic concepts 7
Searching the literature 7
Critical review 8
Empirical forms of validity 10
The two traditions of assessment 14
Summary 17
3 Devising the items 19
Introduction to devising items 19
The source of items 20
Content validity 25
Generic versus specific scales and the ‘fidelity versus bandwidth’ issue 28
Translation 30
4 Scaling responses 38
Introduction to scaling responses 38
Some basic concepts 38
Categorical judgements 39
Continuous judgements 41
To rate or to rank 64
Multidimensional scaling 65
5 Selecting the items 74
Introduction to selecting items 74
Interpretability 74
Face validity 79
Frequency of endorsement and discrimination 80
Homogeneity of the items 81
Multifactor inventories 91
When homogeneity does not matter 92
Putting it all together 94
6 Biases in responding 100
Introduction to biases in responding 100
The differing perspectives 100
Answering questions: the cognitive requirements 101
Optimizing and satisficing 104
Social desirability and faking good 106
Deviation and faking bad 111
Minimizing biased responding 111
Yea-saying or acquiescence 115
End-aversion, positive skew, and halo 115
Framing 118
Biases related to the measurement of change 119
Reconciling the two positions 121
Proxy reporting 121
Testing the items 122
7 From items to scales 131
Introduction to from items to scales 131
Weighting the items 131
Missing items 134
Multiplicative composite scores 135
Transforming the final score 138
Age and sex norms 143
Establishing cut points 145
Receiver operating characteristic curves 149
Summary 156
8 Reliability 159
Introduction to reliability 159
Basic concepts 159
Philosophical implications 161
Terminology 164
Defining reliability 164
Other considerations in calculating the reliability
of a test: measuring consistency or absolute agreement 167
The observer nested within subject 169
Multiple observations 170
Other types of reliability 171
Different forms of the reliability coefficient 172
Kappa coefficient versus the ICC 178
The method of Bland and Altman 179
Issues of interpretation 180
Improving reliability 185
Standard error of the reliability coefficient and sample size 187
Reliability generalization 192
Summary 196
9 Generalizability theory 200
Introduction to generalizability theory 200
Generalizability theory fundamentals 202
An example 204
The first step—the ANOVA 204
Step 2—from ANOVA to G coefficients 207
Step 3—from G study to D study 212
ANOVA for statisticians and ANOVA for psychometricians 212
Confidence intervals for G coefficients 214
Getting the computer to do it for you 214
Some common examples 215
Uses and abuses of G theory 224
Summary 225
10 Validity 227
Introduction to validity 227
Why assess validity? 227
Reliability and validity 228
A history of the ‘types’ of validity 229
Content validation 232
Criterion validation 233
Construct validation 235
Responsiveness and sensitivity to change 244
Validity and ‘types of indices’ 244
Biases in validity assessment 245
Validity generalization 250
Summary 250
11 Measuring change 254
Introduction to measuring change 254
The goal of measurement of change 254
Why not measure change directly? 255
Measures of association—reliability and sensitivity to change 256
Difficulties with change scores in experimental designs 261
Change scores and quasi-experimental designs 262
Measuring change using multiple observations: growth curves 264
How much change is enough? 268
Summary 269
12 Item response theory 273
Introduction to item response theory 273
Problems with classical test theory 273
The introduction of item response theory 275
A note about terminology 275
Item calibration 276
The one-parameter model 280
The two- and three-parameter models 282
Polytomous models 284
Item information 286
Item fit 287
Person fit 289
Differential item functioning 289
Unidimensionality and local independence 290
Test information and the standard error of measurement 294
Equating tests 295
Sample size 296
Mokken scaling 296
Advantages 297
Disadvantages 299
Computer programs 300
13 Methods of administration 304
Introduction to methods of administration 304
Face-to-face interviews 304
Telephone questionnaires 307
Mailed questionnaires 312
The necessity of persistence 317
Computerized administration 319
Using e-mail and the Web 322
Personal data assistants and smart phones 328
From administration to content: the impact
of technology on scale construction 329
Reporting response rates 331
14 Ethical considerations 340
Introduction to ethical considerations 340
Informed consent 341
Freedom of consent 344
Confidentiality 345
Consequential validation 346
Summary 347
15 Reporting test results 349
Introduction to reporting test results 349
Standards for educational and psychological testing 350
The STARD initiative 352
GRRAS 354
Summary 354

Appendices
Appendix A Where to find tests 357
Appendix B A (very) brief introduction to factor analysis 375
Author Index 381
Subject Index 391
Chapter 1

Introduction to health measurement scales

Introduction to measurement
The act of measurement is an essential component of scientific research, whether in
the natural, social, or health sciences. Until recently, however, discussions regard-
ing issues of measurement were noticeably absent in the deliberations of clinical
researchers. Certainly, measurement played as essential a role in research in the
health sciences as those in other scientific disciplines. However, measurement in the
laboratory disciplines presented no inherent difficulty. Like other natural sciences,
measurement was a fundamental part of the discipline and was approached through
the development of appropriate instrumentation. Subjective judgement played a
minor role in the measurement process; therefore, any issue of reproducibility or val-
idity was amenable to a technological solution. It should be mentioned, however, that
expensive equipment does not, of itself, eliminate measurement errors.
Conversely, clinical researchers were acutely aware of the fallibility of human judge-
ment as evidenced by the errors involved in processes such as radiological diagnosis
(Garland 1959; Yerushalmy 1955). Fortunately, the research problems approached
by many clinical researchers—cardiologists, epidemiologists, and the like—frequently
did not depend on subjective assessment. Trials of therapeutic regimens focused on
the prolongation of life and the prevention or management of such life-threatening
conditions as heart disease, stroke, or cancer. In these circumstances, the measure-
ment is reasonably straightforward. ‘Objective’ criteria, based on laboratory or tissue
diagnosis where possible, can be used to decide whether a patient has the disease, and
warrants inclusion in the study. The investigator then waits an appropriate period of
time and counts those who did or did not survive—and the criteria for death are rea-
sonably well established, even though the exact cause of death may be a little more
difficult to ascertain.
In the past few decades, the situation in clinical research has become more com-
plex. The effects of new drugs or surgical procedures on quantity of life are likely to be
marginal indeed. Conversely, there is an increased awareness of the impact of health
and healthcare on the quality of human life. Therapeutic efforts in many disciplines
of medicine—psychiatry, respirology, rheumatology, oncology—and other health
professions—nursing, physiotherapy, occupational therapy—are directed equally if
not primarily to the improvement of quality, not quantity of life. If the efforts of
these disciplines are to be placed on a sound scientific basis, methods must be devised
to measure what was previously thought to be unmeasurable and to assess in a
reproducible and valid fashion those subjective states that cannot be converted into
the position of a needle on a dial.
The need for reliable and valid measures was clearly demonstrated by Marshall et al.
(2000). After examining 300 randomized controlled trials in schizophrenia, they found
that the studies were nearly 40 per cent more likely to report that treatment was effect-
ive when they used unpublished scales rather than ones with peer-reviewed evidence
of validity; and in non-drug studies, one-third of the claims of treatment superiority
would not have been made if the studies had used published scales.
The challenge is not as formidable as it may seem. Psychologists and educators have
been grappling with the issue for many years, dating back to the European attempts
at the turn of the twentieth century to assess individual differences in intelligence
(Galton, cited in Allen and Yen 1979; Stern 1979). Since that time, particularly since
the 1930s, much has been accomplished so that a sound methodology for the devel-
opment and application of tools to assess subjective states now exists. Unfortunately,
much of this literature is virtually unknown to most clinical researchers. Health sci-
ence libraries do not routinely catalogue Psychometrika or The British Journal of
Statistical Psychology. Nor should they—the language would be incomprehensible to
most readers, and the problems of seemingly little relevance.
Similarly, the textbooks on the subject are directed at educational or psychological
audiences. The former is concerned with measures of achievement applicable to
classroom situations, and the latter is focused primarily on personality or aptitude
measures, again with no apparent direct relevance. In general, textbooks in these dis-
ciplines are directed to the development of achievement, intelligence, or personality
tests.
By contrast, researchers in health sciences are frequently faced with the desire to
measure something that has not been approached previously—arthritic pain, return
to function of post-myocardial infarction patients, speech difficulties of aphasic stroke
patients, or clinical competence of junior medical students. The difficulties and ques-
tions that arise in developing such instruments range from straightforward (e.g. How
many boxes do I put on the response?) to complex (e.g. How do I establish whether the
instrument is measuring what I intend to measure?). Nevertheless, to a large degree,
the answers are known, although they are frequently difficult to access.
The intent of this book is to introduce researchers in health sciences to these con-
cepts of measurement. It is not an introductory textbook, in that we do not confine
ourselves to a discussion of introductory principles and methods; rather, we attempt
to make the book as current and comprehensive as possible. The book does not
delve as heavily into mathematics as many books in the field; such side trips may
provide some intellectual rewards for those who are inclined, but frequently at the
expense of losing the majority of readers. Similarly, we emphasize applications, rather
than theory so that some theoretical subjects (such as Thurstone’s law of compara-
tive judgement), which are of historical interest but little practical importance, are
omitted. Nevertheless, we spend considerable time in the explanation of the concepts
underlying the current approaches to measurement. One other departure from cur-
rent books is that our focus is on those attributes of interest to researchers in health
sciences—subjective states, attitudes, response to illness, etc.—rather than the topics
such as personality or achievement familiar to readers in education and psychology.
As a result, our examples are drawn from the literature in health sciences.
Finally, some understanding of certain selected topics in statistics is necessary to
learn many essential concepts of measurement. In particular, the correlation coefficient
is used in many empirical studies of measurement instruments. The discussion of reli-
ability is based on the methods of repeated measures analysis of variance. Item analysis
and certain approaches to test validity use the methods of factor analysis. It is not by
any means necessary to have detailed knowledge of these methods to understand the
concepts of measurement discussed in this book. Still, it would be useful to have some
conceptual understanding of these techniques. If the reader requires some review of
statistical topics, we have suggested a few appropriate resources in Appendix A.

A roadmap to the book


In this edition, we thought it might be useful to try to provide a roadmap or guide to
summarize the process of scale construction and evaluation (testing). While individ-
ual aspects of this are detailed in the chapters, the overall process can easily be lost in
the detail. Of course, in doing so, we run the risk that some readers will jump to the
relevant chapters, and conclude there is not much to this measurement business. That
would be a gross misconception. The roadmap is an oversimplification of a complex
process. However, we do feel it is a valuable heuristic device which, while necessarily
lacking in detail, may nevertheless help to shed some light on the process of
measurement, a process which, to those who are new to the field, can often seem like
the darkest of black boxes (see Fig. 1.1).
Our road map begins the same way all research begins—with a question. It may
be derived from the literature or, as is sometimes the case in health research, from
clinical observation. Our question in this case, however, is explicitly connected to a
measurement problem: do we have a measure or scale that can be used in pursuit
of answering our research question? As we will discuss in Chapter 2, there are two
important considerations here: are there really no existing measures that we can use?
This of course can arise when the research question involves measuring a novel con-
cept (or an old one) for which a measure does not currently exist. Our position, always,
is not to bring a new scale into the world unless it is absolutely necessary. However, in
situations where one is truly faced with no other option but to create a new scale,
we begin the process of construction and testing—item generation, item testing, and
re-testing in Fig. 1.1. This is reviewed in detail in Chapters 3 to 6. Chapter 7 covers
the transformation of items to scale. While not explicitly included in this diagram, we
assume that once the items are devised and tested, then a scale will be created by com-
bining items and further tested. It is here we concern ourselves with dimensionality
of the scale (whether our new scale is measuring one thing, or many). We discuss one
common test of dimensionality, factor analysis, in Appendix B.
Fig. 1.1 HMS flowchart (reconstructed here as text). A gap is identified from research
and/or clinical practice (Chapter 2), leading to the question: can an existing instrument
be used or modified (Appendix A)? If no, generate items (Chapters 3 and 4), test the
items (Chapters 4–6), pilot them with a sample from the target population, and, if the
items require revision, revise and re-test. If yes, ask whether the instrument has been
assessed with comparable groups: if so, use it; if not, pilot it with the target population
and, if the results suggest it is not appropriate, re-enter the item-revision loop. Once no
further revision is needed, proceed to reliability and generalizability studies
(Chapters 8 and 9), then validity studies (Chapter 10), and finally write it up
(Chapter 15).

A far more common occurrence, however, is that we have a scale (or many scales)
that can potentially be used for our research project. Following the top right side of
the roadmap in Fig. 1.1, the important question now is whether we have a scale that
has already been used in similar populations (and for similar purposes) to the one we
will be sampling in our study. Appendix A lists many resources for finding existing
scales. For example, there are many scales that measure depression. The selection of
a particular scale will be in part based upon whether there is evidence concerning
its measurement properties specifically in the context of our target population (e.g. a
depression measure for children under the age of 12). Review of the literature is how
we determine this. It is not uncommon at this stage to find that we have ‘near’ hits
(e.g. a scale to measure depression in adolescents, but one that has not been used to
measure depression in children). Here, we must exercise judgement. If we believe that
only minor modifications are required, then we can usually proceed without entering
into a process of item testing. An example here comes from studies of pain. Pain scales
can often be readily adapted for specific pain sites of clinical importance (hands, feet,
knees, etc.). Absence of a specific scale measuring thumb pain on the right hand for
adults aged 60–65 years, however, is not justification for the creation of a new scale.
There are instances, though, where we cannot assume that a scale, which has been
tested in one population, can necessarily be used in a different population without first
testing this assumption. Here we may enter the process of item generation and testing
if our pilot results reveal that important domains have been missed, or if items need to
be substantially modified to be of use with our study population (on our diagram, this
is the link between the right and left sides, or the pathway connecting ‘No’ regarding
group to the item testing loop).
Once we have created a new scale (or revised an existing one for our population),
we are ready to test reliability (Chapters 8 and 9) and validity (Chapter 10). Of course,
if the purpose of the study is not the design and evaluation of a scale, researchers
typically proceed directly to designing a study to answer a research question or test a
hypothesis of association. Following this path, reporting of psychometric properties of
the new or revised scale will likely be limited to internal consistency only, which is the
easiest, but extremely limited, psychometric property to assess (see the section entitled
‘Kuder–Richardson 20 and coefficient α’ in Chapter 5). However, study findings, while
not explicitly stated as such, can nevertheless provide important information on val-
idity. One researcher’s cross-sectional study is another’s construct validation study: it
all depends on how you frame the question. From our earlier example, if the results
of a study using our new scale to measure depression in children show that girls have
higher levels of depression than boys, or that children who have been bullied are more
likely to be depressed than children who have not, this is evidence of construct valid-
ity (both findings show the scale is producing results in an expected way, given what
we know about depression in children). If the study in question is a measurement
study, then we would be more intentional in our design to measure different aspects
of both reliability and validity. The complexity of the design will depend on the scale
(and the research question). If, for example, we are using a scale that requires multiple
raters to assess a single object of measurement (e.g. raters evaluating play behaviour in
children, tutors rating students), and we are interested in assessing whether these rat-
ings are stable over time, we can devise quite complex factorial designs, as we discuss
in Chapter 9, and use generalizability (G) theory to simultaneously assess different
aspects of reliability. As another example, we may want to look at various aspects
of reliability—inter-rater, test–retest, internal consistency; it then makes sense to do
it with a single G study so we can look at the error resulting from the various fac-
tors. In some cases, a cross-sectional survey, where we compare our new (or revised)
measure to a standard measure by correlating it with other measures hypothesized to
be associated with the construct we are trying to measure, may be all that is needed.
Of course, much has been written on the topic of research design already (e.g. Streiner
and Norman, 2009) and an extended discussion is beyond the scope of this chap-
ter. We do, however, discuss different ways to administer scales (Chapter 13). Choice
of administration method will be determined both by the nature of the scale and
by what aspects of reliability and validity we are interested in assessing. But before
any study is begun, be aware of ethical issues that may arise; these are discussed in
Chapter 14.
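To make concrete the kind of cross-sectional check described above, here is a minimal sketch in Python using simulated data: a convergent correlation between a hypothetical new depression scale and an established measure, plus a known-groups comparison (bullied versus non-bullied children). All variable names and numbers are our own illustrative assumptions, not anything from the studies cited in this book.

```python
# Minimal sketch (hypothetical data) of two construct-validity checks:
# (1) convergent correlation with an established measure, and
# (2) a known-groups comparison (bullied vs. non-bullied children).
# Note: statistics.correlation requires Python 3.10 or later.
import random
import statistics

random.seed(1)

# Simulate 200 children: a 'true' depression level plus two noisy scale scores.
true_level = [random.gauss(10, 3) for _ in range(200)]
established = [t + random.gauss(0, 2) for t in true_level]
bullied = [random.random() < 0.3 for _ in range(200)]
# Build the expected group difference into the new scale for the illustration.
new_scale = [t + random.gauss(0, 2) + (2.0 if b else 0.0)
             for t, b in zip(true_level, bullied)]

# (1) Convergent validity: the new scale should correlate with the old one.
r = statistics.correlation(new_scale, established)

# (2) Known-groups validity: bullied children should score higher.
mean_bullied = statistics.mean(s for s, b in zip(new_scale, bullied) if b)
mean_other = statistics.mean(s for s, b in zip(new_scale, bullied) if not b)

print(f"r(new scale, established measure) = {r:.2f}")
print(f"mean score, bullied = {mean_bullied:.1f}; not bullied = {mean_other:.1f}")
```

Finding a substantial correlation and the predicted group difference would count as construct-validity evidence in exactly the sense described above; a formal measurement study would go on to test such predictions with proper significance tests and confidence intervals.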
Of course, as with all research, our roadmap is iterative. Most of the scales that
have stood the test of time (much like this book) have been revised, re-tested, and
tested again. In pursuit of our evaluation of reliability and validity, we often find that
our scale needs to be tweaked. Sometimes, the required changes are more dramatic.
As our understanding of the construct we are measuring evolves, we often need to
revise scales accordingly. At any rate, it is important to dispel the notion that once a
scale is created, it is good for all eternity. As with the research process in general, the
act of research often generates more questions than answers (thereby also ensuring
our continued employment). When you are ready to write up what you have found,
check Chapter 15 regarding reporting guidelines for studies of reliability and validity.

Further reading
Colton, T.D. (1974). Statistics in medicine. Little Brown, Boston, MA.
Freund, J.E., and Perles, B.M. (2005). Modern elementary statistics (12th edn). Pearson, Upper
Saddle River, NJ.
Huff, D. (1954). How to lie with statistics. W.W. Norton, New York.
Norman, G.R. and Streiner, D.L. (2003). PDQ statistics (3rd edn). PMPH USA, Shelton, CT.
Norman, G.R. and Streiner, D.L. (2014). Biostatistics: The bare essentials (4th edn). PMPH USA,
Shelton, CT.

References
Allen, M.J. and Yen, W.M. (1979). Introduction to measurement theory. Brooks Cole, Monterey,
CA.
Garland, L.H. (1959). Studies on the accuracy of diagnostic procedures. American Journal of
Roentgenology, 82, 25–38.
Marshall, M., Lockwood, A., Bradley, C., Adams, C., Joy, C., and Fenton, M. (2000).
Unpublished rating scales: A major source of bias in randomised controlled trials of
treatments for schizophrenia. British Journal of Psychiatry, 176, 249–52.
Streiner, D.L. and Norman, G.R. (2009). PDQ epidemiology (3rd edn). PMPH USA, Shelton,
CT.
Yerushalmy, J. (1955). Reliability of chest radiography in diagnosis of pulmonary lesions.
American Journal of Surgery, 89, 231–40.
Chapter 2

Basic concepts

Introduction to basic concepts


One feature of the health sciences literature devoted to measuring subjective states
is the daunting array of available scales. Whether one wishes to measure depression,
pain, or patient satisfaction, it seems that every article published in the field has used a
different approach to the measurement problem. This proliferation impedes research,
since there are significant problems in generalizing from one set of findings to another.
Paradoxically, if you proceed a little further in the search for existing instruments to
assess a particular concept, you may conclude that none of the existing scales is quite
right, so it is appropriate to embark on the development of one more scale to add to
the confusion in the literature. Most researchers tend to magnify the deficiencies of
existing measures and underestimate the effort required to develop an adequate new
measure. Of course, scales do not exist for all applications; if this were so, there would
be little justification for writing this book. Nevertheless, perhaps the most common
error committed by clinical researchers is to dismiss existing scales too lightly, and
embark on the development of a new instrument with an unjustifiably optimistic and
naive expectation that they can do better. As will become evident, the development of
scales to assess subjective attributes is not easy and requires considerable investment
of both mental and fiscal resources. Therefore, a useful first step is to be aware of any
existing scales that might suit the purpose. The next step is to understand and apply
criteria for judging the usefulness of a particular scale. In subsequent chapters, these
will be described in much greater detail for use in developing a scale; however, the
next few pages will serve as an introduction to the topic and a guideline for a critical
literature review.
The discussion that follows is necessarily brief. A much more comprehensive set of
standards, which is widely used for the assessment of standardized tests used in psych-
ology and education, is the manual called Standards for Educational and Psychological
Testing that is published jointly by the American Educational Research Association, the
American Psychological Association, and the National Council on Measurement in
Education (AERA/APA/NCME 1999). Chapter 15 of this book summarizes what scale
developers should report about a scale they have developed, and what users of these
scales should look for (e.g. for reporting in meta-analyses).

Searching the literature


An initial search of the literature to locate scales for measurement of particular
variables might begin with the standard bibliographic sources, particularly Medline.
However, depending on the application, one might wish to consider bibliographic
reference systems in other disciplines, particularly PsycINFO for psychological
scales and ERIC (which stands for Educational Resource Information Center:
<http://eric.ed.gov/>) for instruments designed for educational purposes.
In addition to these standard sources, there are a number of compendia of measur-
ing scales, both in book form and as searchable, online resources. These are described
in Appendix A. We might particularly highlight the volume entitled Measuring
Health: A Guide to Rating Scales and Questionnaires (McDowell 2006), which is a crit-
ical review of scales designed to measure a number of characteristics of interest to
researchers in the health sciences, such as pain, illness behaviour, and social support.

Critical review
Having located one or more scales of potential interest, it remains to choose whether
to use one of these existing scales or to proceed to development of a new instru-
ment. In part, this decision can be guided by a judgement of the appropriateness of
the items on the scale, but this should always be supplemented by a critical review of
the evidence in support of the instrument. The particular dimensions of this review
are described in the following sections.

Face and content validity


The terms face validity and content validity are technical descriptions of the judge-
ment that a scale looks reasonable. Face validity simply indicates whether, on the face
of it, the instrument appears to be assessing the desired qualities. The criterion repre-
sents a subjective judgement based on a review of the measure itself by one or more
experts, and rarely are any empirical approaches used. Content validity is a closely
related concept, consisting of a judgement whether the instrument samples all the
relevant or important content or domains. These two forms of validity consist of a
judgement by experts whether the scale appears appropriate for the intended purpose.
Guilford (1954) calls this approach to validation ‘validity by assumption’, meaning the
instrument measures such-and-such because an expert says it does. However, an expli-
cit statement regarding face and content validity, based on some form of review by an
expert panel or alternative methods described later, should be a minimum prerequisite
for acceptance of a measure.
Having said this, there are situations where face and content validity may not be
desirable, and may be consciously avoided. For example, in assessing behaviour such
as child abuse or excessive alcohol consumption, questions like ‘Have you ever hit your
child with a blunt object?’ or ‘Do you frequently drink to excess?’ may have face valid-
ity, but are unlikely to elicit an honest response. Questions designed to assess sensitive
areas are likely to be less obviously related to the underlying attitude or behaviour and
may appear to have poor face validity. It is rare for scales not to satisfy minimal stand-
ards of face and content validity, unless there has been a deliberate attempt from the
outset to avoid straightforward questions.
Nevertheless, all too frequently, researchers dismiss existing measures on the basis
of their own judgements of face validity—they did not like some of the questions, or
the scale was too long, or the responses were not in a preferred format. As we have
indicated, this judgement should comprise only one of several used in arriving at an
overall judgement of usefulness, and should be balanced against the time and cost of
developing a replacement.

Reliability
The concept of reliability is, on the surface, deceptively simple. Before one can obtain
evidence that an instrument is measuring what is intended, it is first necessary to
gather evidence that the scale is measuring something in a reproducible fashion. That
is, a first step in providing evidence of the value of an instrument is to demonstrate
that measurements of individuals on different occasions, or by different observers, or
by similar or parallel tests, produce the same or similar results.
That is the basic idea behind the concept—an index of the extent to which meas-
urements of individuals obtained under different circumstances yield similar results.
However, the concept is refined a bit further in measurement theory. If we were con-
sidering the reliability of, for example, a set of bathroom scales, it might be sufficient
to indicate that the scales are accurate to ±1 kg. From this information, we can easily
judge whether the scales will be adequate to distinguish among adult males (prob-
ably yes) or to assess weight gain of premature infants (probably no), since we have
prior knowledge of the average weight and variation in weight of adults and premature
infants.
Such information is rarely available in the development of subjective scales. Each
scale produces a different measurement from every other. Therefore, to indicate that a
particular scale is accurate to ±3.4 units provides no indication of its value in measur-
ing individuals unless we have some idea about the likely range of scores on the scale.
To circumvent this problem, reliability is usually quoted as a ratio of the variability
between individuals to the total variability in the scores; in other words, the reliability
is a measure of the proportion of the variability in scores that is due to true dif-
ferences between individuals. Thus, the reliability is expressed as a number between
0 and 1, with 0 indicating no reliability, and 1 indicating perfect reliability.
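In symbols, using standard classical test theory notation (a sketch; the formal development appears in Chapter 8):

\[
\text{reliability} = \frac{\sigma^2_{\text{subjects}}}{\sigma^2_{\text{subjects}} + \sigma^2_{\text{error}}}
\]

So, to make the hypothetical scale above concrete: if it is accurate to ±3.4 units (an error variance of about 3.4² ≈ 11.6) and the true scores in the population of interest vary with a standard deviation of 10 units (again, a number made up for illustration), the reliability is 100/(100 + 11.6) ≈ 0.90.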
An important issue in examining the reliability of an instrument is the manner in
which the data were obtained that provided the basis for the calculation of a reliabil-
ity coefficient. First of all, since the reliability involves the ratio of variability between
subjects to total variability, one way to ensure that a test will look good is to conduct
the study on an extremely heterogeneous sample, for example, to measure knowledge
of clinical medicine using samples of first-year, third-year, and fifth-year students.
Examine the sampling procedures carefully, and assure yourself that the sample used
in the reliability study is approximately the same as the sample you wish to study.
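The following minimal numerical sketch (our own, with made-up variance values) shows why: holding the instrument's error variance fixed while widening the spread of true scores, as in the mixed first-, third-, and fifth-year sample above, drives the coefficient toward 1.

```python
# Illustrative sketch: reliability = subject variance / (subject variance + error variance).
# The error variance (measurement noise) is held fixed while between-subject
# variability grows, mimicking increasingly heterogeneous samples.

error_var = 25.0  # hypothetical error variance of the instrument

for subject_sd in (5.0, 10.0, 20.0):  # increasingly heterogeneous samples
    subject_var = subject_sd ** 2
    reliability = subject_var / (subject_var + error_var)
    print(f"between-subject SD = {subject_sd:4.0f} -> reliability = {reliability:.2f}")

# Prints 0.50, 0.80, and 0.94: the identical instrument looks far more
# reliable when tested on a more heterogeneous sample.
```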
Second, there are any number of ways in which reliability measures can be obtained,
and the magnitude of the reliability coefficient will be a direct reflection of the
particular approach used. Some broad definitions are described as follows:
1. Internal consistency. Measures of internal consistency are based on a single admin-
istration of the measure. If the measure has a relatively large number of items
addressing the same underlying dimension, e.g. ‘Are you able to dress yourself?’,
