Some Notes on the Terms 'Validity' and 'Reliability' [1]
British Educational Research Journal, Vol. 13, No. 1, 1987

MARTYN HAMMERSLEY, School of Education, The Open University
Variations in Definition

Here is a sample of the definitions of the terms 'reliability' and 'validity' to be found in the methodological literature [2]:
(1) Reliability is the agreement between two efforts to measure the same trait through maximally similar methods. Validity is represented in the agreement between two attempts to measure the same trait through maximally different methods. (Campbell & Fiske, 1967: 277)
(2) The validity of a measuring instrument is defined as the property of a measure that allows the researcher to say that the instrument measures what he says it measures... The reliability of a measuring instrument is
defined as the ability of the instrument to measure consistently the phenomenon it is designed to measure. (Black & Champion, 1976: 222 and 234)
(3) Reliability refers to the reproducibility of the measurements. Can we rely on our own ability to obtain very similar data again; that is, how good is our intra-observer reliability? Other observers should also be able to replicate our measurements, which means, in part, that we should have good inter-observer reliability. This is, of course, often difficult, since skill in observation develops through practice... Hollenbeck (1978) concluded that reliability consists of both stability and accuracy. However, this is true only of inter-observer reliability... an observer may be reliable but still have poor accuracy as long as precision (stability) is maintained. Therefore, intra-observer reliability is solely a measure of stability (or precision) whereas accuracy affects validity... However, accuracy will almost certainly affect inter-observer reliability since few observers are likely to have the same biases. An accuracy criterion can be established by using an 'expert' observer or the consensus of several observers. (Lehner, 1979: 130)
(4) Reliability (is) the extent to which repetition of the study would result in the same data and conclusions. (Goode & Hatt, 1952: 153)
(5) The goal of any scientific measurement operation or procedure is to arrive at the best possible estimate of the true value of some dimensional quality of a natural phenomenon. To the extent that this goal is achieved it is said that the measurement is accurate or valid. Accuracy or validity of the results therefore becomes the yardstick for gauging the quality of any measurement procedure. For purposes of clarity, accuracy (or validity) may be defined as the extent to which obtained measures approximate values of the 'true' state of nature... Reliability refers to the capacity of the instrument to yield the same measurement value when brought into repeated contact with the same state of nature. Thus, this meaning of reliability is concerned with the stability of measured values under constant conditions. (Johnston & Pennypacker, 1980: 190 and 191)
(6) Reliability is the accuracy or precision of a measuring instrument... The commonest definition of validity is epitomized by the question: Are we measuring what we think we are measuring? The emphasis in this question is on what is being measured. For example, a teacher has constructed a test to measure understanding of scientific procedures and has included on the test only factual items about scientific procedures. The test is not valid because, while it may reliably measure the pupils' factual knowledge of scientific procedures, it does not measure their understanding of such procedures. In other words, it may measure what it measures quite well, but it does not measure what the teacher intended it to measure. (Kerlinger, 1964: 430 and 444-5)
(7) A measure is reliable to the extent that the average difference between two measurements independently obtained in the same classroom is smaller than the average difference between two measurements obtained in different classrooms... A measure is valid to the extent that differences in scores yielded by it reflect actual differences in behaviour, not differences
in impressions made on different observers. (Medley & Mitzel, 1963: 150)
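Definition (7) is, in effect, a computable criterion. As a minimal sketch of one reading of it, using invented classroom scores rather than data from any of the studies cited, we can compare the average difference between two independent measurements of the same classroom with the average difference between measurements obtained in different classrooms:

```python
from itertools import combinations
from statistics import mean

# Invented scores: two independent measurements (for example, by two
# observers) of the same property in each of four classrooms.
classroom_scores = [(12, 14), (20, 19), (31, 33), (25, 27)]

# Average difference between two measurements of the SAME classroom.
within = mean(abs(a - b) for a, b in classroom_scores)

# Average difference between measurements obtained in DIFFERENT classrooms.
between = mean(
    abs(x - y)
    for pair1, pair2 in combinations(classroom_scores, 2)
    for x in pair1
    for y in pair2
)

print(f"average within-classroom difference:  {within:.2f}")   # 1.75
print(f"average between-classroom difference: {between:.2f}")
# On this definition the measure is reliable to the extent that the
# within-classroom figure is smaller than the between-classroom one.
```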
One source of problems is the lack of a standardized terminology, so that several terms are used to refer to each of the different aspects of the measurement process. Indeed, sometimes different authors use the same term to refer to different things; and even the same author may use a term to denote different things on different occasions. An example is the term 'measure' which can refer to a measuring instrument or to a particular measurement score. For the purposes of my argument here I shall use the following terms and definitions:
measurement: the process by which an observer applies an instrument to objects in order to gauge the presence/magnitude of a property [3];
property: the feature of objects which is to be measured;
instrument: a procedure developed to measure the presence/magnitude of a property in the objects;
objects: the phenomena (people, lessons, tasks, etc.) whose possession of the property is to be assessed;
scores: the results of the measurement process;
occasion: the time and place where the instrument is applied to produce the scores;
observer: the person who carries out the measurement.
Using this terminology, let us look now at the major discrepancies among the definitions of validity and reliability cited above.
(a) Are reliability and validity concerned with all aspects of a study or do they relate only to the process of measurement? While most definitions take the latter position, some imply the former. For example, Goode & Hatt (1952: 153) define reliability as "the extent to which repetition of the study would result in the same data and conclusions". In other words they identify it with replication, and this clearly involves more than measurement.
In the case of the term 'validity', there is the problem of the relationship between two typologies: criterion, predictive, concurrent, content, face, and construct validity on the one hand; internal, external, population and ecological validity on the other. The former refers to measurement, the latter to the whole process of assessing the truth of explanatory claims. In addition, the term validity is sometimes used to refer to the assessment of arguments in terms of whether they conform to legitimate deductive canons.
(b) Are validity and reliability properties of instruments, observers, or of particular scores? Goode & Hatt treat reliability as a feature of data and conclusions. For the most part, though, reliability seems to be viewed as a property of instruments and/or observers. Validity is sometimes ascribed to instruments (Black & Champion, Kerlinger, Medley & Mitzel), sometimes to observers (Lehner), sometimes to scores (Johnston & Pennypacker).
(c) Are validity and reliability to be defined in terms of the relationship between scores and variation in the property being measured? (Call these realist definitions.) Or are they to be defined in terms of the relationships among scores produced by the same and/or different instruments? (Call these nominalist definitions) [4]. Most definitions of validity are realist, claiming, for example, that validity represents the extent to which an instrument measures the property it is intended to measure. However, there are exceptions. For instance, "validity is represented in the agreement between two attempts to measure the same trait through maximally different methods" (Campbell & Fiske).
... property. If they were to be consistent with their nominalist definition, no such problem of interpretation would arise; the conclusion would be that the validity of the scores is low or zero. In a similar way, Kerlinger (1964) moves through various definitions of reliability without addressing the issue of the relationships among them.
Another common practice is to conflate definitions of reliability in terms of consistency of scores with definitions in terms of random error:

Reliability concerns the extent to which measurements are consistent and repeatable. Thus, a highly reliable measure is one that does not fluctuate greatly because of random error. (Zeller & Carmines, 1980: 17)
We have two definitions of reliability here which do not match one another. While random error will produce inconsistency in scores, so will certain kinds of systematic error. For example, where scores produced by two observers are affected by biases which operate in opposite directions, inconsistencies between the scores of the observers for the same objects will result.
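The point can be made with a minimal simulation (the figures are hypothetical, not drawn from any study discussed here): two observers whose biases operate in opposite directions produce inconsistent scores for the same objects even though no random error is present at all.

```python
# Hypothetical 'true' values of the property for five objects.
true_values = [10, 12, 15, 11, 14]

# Two observers with systematic biases operating in opposite directions,
# and no random error whatsoever.
observer_1 = [v + 2 for v in true_values]   # consistently over-records by 2
observer_2 = [v - 2 for v in true_values]   # consistently under-records by 2

# Inter-observer disagreement for the same objects.
print([abs(a - b) for a, b in zip(observer_1, observer_2)])
# -> [4, 4, 4, 4, 4]: each observer is perfectly stable, yet their scores
#    for the same objects disagree throughout.
```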
An Attempt at Clarification

I have tried to show that there is some inconsistency in the usage of the terms 'reliability' and 'validity'. At the risk of adding to the confusion, I want to try to clarify the concepts underlying these terms.
It is important to begin by making a clear distinction between goals and means, between what it is about the measurement process we are trying to assess and the strategies we use to assess it. Only when we are clear about what it is we want to assess can we devise effective strategies for achieving that.
Our primary concern in measurement must surely be whether the set of scores we have produced accurately reflects the presence/magnitude of the target property in the objects we have measured. This is what most writers seem to mean by validity [5]. There are a number of types of threat to measurement validity, but we can distinguish two main sources. If we think of measurement as involving, at its simplest, a relationship between a variable which is not directly observable and one that is, there may be inaccuracies in the recording of scores of the observable variable (we might refer to this as the problem of 'accuracy') and there may be errors arising from imperfect correlation between the observed and the unobserved variables (this is often referred to as the problem of 'construct validity').
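These two sources of error can be separated in a small simulation (an illustrative sketch with invented quantities and distributions, not a model proposed in the article): the observable variable tracks the unobserved property only imperfectly, and the recording of the observable variable then adds a further layer of error.

```python
import random

random.seed(1)  # for a repeatable illustration

def record_score(true_value):
    # 'Construct validity' threat: the observable variable is only an
    # imperfect correlate of the unobserved property.
    observable = true_value + random.gauss(0, 1.0)
    # 'Accuracy' threat: recording the observable variable itself
    # introduces error (modelled here as a further random slip).
    return round(observable + random.gauss(0, 0.5), 1)

true_property = [4, 7, 9, 12, 15]          # not directly accessible in practice
scores = [record_score(v) for v in true_property]
print(scores)                               # the scores we actually obtain
```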
However, validity is not our only goal. We are also often interested in the precision with which any particular score captures the magnitude of the target property in an object. Precision concerns the delicacy of the measurement scale employed. We can measure the length of a large object in terms of metres, centimetres or even millimetres. In that order these scales represent an increasing degree of precision. Note that this is independent of the accuracy of the measurement. On this usage a score may be very precise but highly inaccurate. How precise we want our measurement to be will depend upon our purposes, but it will also depend upon the level of validity which can be obtained at different levels of precision. Other things being equal, the more precise the scale, the more difficult it is to achieve high levels of validity. And, indeed, there is often a temptation to be more precise than the level of validity with which an object can be measured justifies.
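A toy illustration of the independence of precision and accuracy (the figures are invented, not from the article): a reading can be reported on a very fine scale and still lie a long way from the true value.

```python
true_length = 2.000                  # 'true' length of the object, in metres

# An instrument with a constant bias of +0.123 m, read off at three scales.
reading = true_length + 0.123

to_metres      = round(reading, 0)   # 2.0    -- coarse scale
to_centimetres = round(reading, 2)   # 2.12   -- finer scale
to_millimetres = round(reading, 3)   # 2.123  -- finest scale

print(to_metres, to_centimetres, to_millimetres)
# The millimetre reading is the most precise, but it is no more accurate:
# each figure is still roughly 0.12 m away from the true value.
```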
Is reliability a third goal? It may be, but only if defined in realist terms. Achieving consistency of scores across occasions is of no value in itself; it only has value as an indicator of validity. If, on the other hand, we treat reliability as a property of instruments, not of scores, and define it as the ability of an instrument consistently to produce valid scores, then assessing the reliability of instruments and developing reliable instruments are clearly important goals. On this definition we can have a score of high validity without the instrument which produced it being reliable, but we cannot have a reliable instrument producing invalid scores [6]. However, it may be difficult to know that we have a score of high validity without also finding out whether we have a reliable instrument, since the same strategies are involved in assessing both validity and reliability, on the definitions used here.
Validity and appropriate precision of scores, and reliability of instruments, are our goals, then. But of course the central problem in measurement is that generally we have no direct access to the property we are trying to measure, and thus we have no straightforward means of assessing the validity of any particular score. If we did have direct access we would presumably have no need of any measuring instrument. In assessing the validity of scores and the reliability of instruments we have to rely upon comparisons of the scores produced under different circumstances, circumstances systematically varied in order to assess the effects of different types of threat to measurement validity. To the extent that scores are consistent across these different circumstances, we can have increased confidence that they are valid and that the instrument is reliable.
Types and Sources of Error

Mueller, Schuessler & Costner's (1977: 24-6) distinction between random, constant and correlated error is more useful, I believe, than the more common twofold distinction between random and systematic error, since the two kinds of systematic error have different characteristics [7]. Here are the authors' definitions:
Random errors: Random errors behave as if the amount and direction of error were determined by drawing signed numbers from a hat, with one half of the numbers in the hat being positive and one half negative and the average of the numbers being zero. (p. 24)

Constant errors: It is as if the error were determined by drawing numbers from a hat, but the average of the numbers in the hat is not zero; consequently each score is inflated (or deflated) by the same amount on the average. (p. 25)

Correlated errors: Correlated error behaves as if the error were determined by drawing numbers from hats, but a different hat (containing different numbers) was used for males and females, or for rich and poor, or for other differentiated groupings. (p. 25)
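The three 'hats' translate directly into a short simulation (a sketch with invented true values, groupings and hat contents, used only to illustrate the distinction):

```python
import random

random.seed(2)  # for a repeatable illustration

true_values = [10, 12, 15, 11, 14, 13]
groups      = ["rich", "poor", "rich", "poor", "rich", "poor"]

# Random error: numbers drawn from a hat whose contents average zero.
random_scores = [v + random.gauss(0, 1) for v in true_values]

# Constant error: the hat's average is not zero, so every score is
# inflated (or deflated) by the same amount on average.
constant_scores = [v + 2 + random.gauss(0, 1) for v in true_values]

# Correlated error: a different hat is used for different groupings --
# here one hat for 'rich' objects and another for 'poor' ones.
hat_means = {"rich": 3, "poor": -3}
correlated_scores = [v + hat_means[g] + random.gauss(0, 1)
                     for v, g in zip(true_values, groups)]

for label, scores in [("random", random_scores),
                      ("constant", constant_scores),
                      ("correlated", correlated_scores)]:
    print(label, [round(s, 1) for s in scores])
```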
Different sources of error are likely to lead to different types of error, though as yet we know too little to be able to tie particular sources to particular types with any certainty. We can only make suggestions as to likely links. For example:
Sources of error                                Probable types of error

Observer
  observation and coding inaccuracies           random, constant or correlated
  calculation mistakes                          random or constant
  interpretational bias                         constant or correlated

Instrument
  contamination of scores by factors other
  than the property being measured              random, constant or correlated
Comparisons of scores produced under different circumstances may allow us to assess the effects of different sources and types of error:
... seeking to measure than another set of objects, then we can test our instrument by measuring these two sets of objects to discover whether or not the expected difference is to be found.
Conclusion

My concern in this paper has been with the conceptual issues involved in defining validity and reliability. I have proposed definitions of validity, precision and reliability as goals of the measurement process. These are to be distinguished from the strategies which we use to achieve them, which involve the comparison of scores produced under different circumstances. These comparisons allow us to assess the effects of different types and sources of error, and they provide us with a basis for assessing both validity and reliability. Considerable work is still required in developing and applying these strategies. However, a prerequisite for effective work in this area, it seems to me, is to be clear about what it is we are aiming to achieve. I have tried to show that at present our usage of concepts like validity and reliability is vague and inconsistent, and this paper has been directed towards a clarification of these measurement goals, and their relation to strategies designed to assess them.
Correspondence: M. Hammersley, School of Education, The Open University, Walton Hall, Milton Keynes, Bucks MK7 6AA, England.
NOTES

[1] I am obliged to John Scarth, Donald MacKinnon, Barry Cooper and John Bynner for comments on earlier drafts of this article. The errors are of course mine.
[2] This is a haphazard sample, but it does illustrate the range of variation in usage.
[3] I put on one side the question of whether it is legitimate to talk of classification as measurement.
[4] The terms 'realist' and 'nominalist' are used in a variety of ways by philosophers. I use the terms here simply as shorthand.
[5] We probably need to use some adjective like 'measurement' or 'descriptive' validity here to distinguish what we are referring to from logical validity and from internal validity.
[6] Incorrect use of a reliable instrument would produce invalid scores, but it is better to treat this as use of a different instrument.
[7] It is also important to recognise, as Cronbach et al. (1972) emphasise, that what is systematic error given one focus may be variation in the target property from another point of view. Identification of systematic error is relative to the property being measured.
[8] These various comparisons are of course produced simply by combining conventional reliability and validity checks. There are additional possibilities in research employing tests or inventories, such as the use of the split-half technique or Cronbach's coefficient alpha.
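For research employing tests or inventories, the internal-consistency checks mentioned in note [8] are simple to compute. A minimal sketch of Cronbach's coefficient alpha on invented item scores (not data from any study cited here):

```python
from statistics import variance

# Invented scores: rows are respondents, columns are test items.
item_scores = [
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 4, 3, 3],
    [1, 2, 2, 1],
]

k = len(item_scores[0])                                   # number of items
item_vars = [variance(col) for col in zip(*item_scores)]  # per-item variances
total_var = variance([sum(row) for row in item_scores])   # variance of totals

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```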
REFERENCES

BLACK, J. A. & CHAMPION, D. J. (1976) Methods and Issues in Social Research (New York, Wiley).
CAMPBELL, D. T. & FISKE, D. W. (1967) Convergent and discriminant validation by the multitrait-multimethod matrix, in: W. A. MEHRENS & R. L. EBEL (Eds) Principles of Educational and Psychological Measurement (Chicago, Rand McNally).
CRONBACH, L. J. & MEEHL, P. E. (1955) Construct validity in psychological tests, Psychological Bulletin, 52, pp. 281-302.
CRONBACH, L. J. et al. (1972) The Dependability of Behavioural Measurements (New York, Wiley).