0% found this document useful (0 votes)
71 views

Streiner, D.L. (2003) Starting at The Beginning An Introduction To Coefficient Alpha and Internal PDF

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
71 views

Streiner, D.L. (2003) Starting at The Beginning An Introduction To Coefficient Alpha and Internal PDF

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 5
JOURNAL OF PERSONALITY ASSESSMENT, 401), 9-108 Copyrih © 2003, LavrenceEelbaum Assciate. Ine. STATISTICAL DEVELOPMENTS AND APPLICATIONS Starting at the Beginning: An Introduction to Coefficient Alpha and Internal Consistency David L. Streiner Bayerest Cente for Geriatric Care Department of Psychiatry University of Toronto Cronbach's ccs the most widely used index ofthe reliability of a scale. However, its use an in texpretaion ean be subjet toa number of erors. This ment of com other indexes of internal consistency (split-hall tile discusses the historical develop silty and Kuder-Richard: son 20) and discusses four myths associated with ce (a) that isa Fixed property ofthe scale, b) terre at it measvres only the internal consistency ofthe scale, (that higher values ane always pre war lower ones, and () that it isestrieted tthe range of O10 I. It provides sone recom ‘mendations for acceptable values of cin different situations. Perhaps the most widely used measure of the reliability of a scale is Cronbach's (1951). One reason for this is obvious: it is the only reliability index that does not require two ad- ‘ministrations ofthe scale, or two or more raters, and so canbe ‘determined with much less effort than test-retest or interrater reliability. Unfortunately, the ubiquity of its use is matched only by the degree of misunderstanding regarding what does and does not measure. This article is intended to be a ba sie primer about a. It will approach these issues from a con: ‘ceptual and a statistical perspective and illustrate both the strengths and weaknesses of the index. begin by discussing what is meant by reliability in gen- ‘eral and how orand other indexes of “internal consistency” determine this. In classical test theory, a person's total score Iie, the score a person receives on a test or scale. which is sometimes referred to as the observed score) is composed of ‘ovo parts: the true score plus some error associated with the measurement. Thats: this is that» person's total score will vary around the true score to some degree. One way of thinking shout reliability. then. is that ts the ratio ofthe variance ofthe «rac scores tothe toll scores (6, Reliability = Q However, at any given time, a person’s true score will be the same from om testing to another, so that an individhal’s G7, Will ways be zero. Thus, Equation 2 pertains only toa _group of people who differ with respect tothe characteristic being measured Before continuing with issues of measuring reliability, however, it would be Worthwhile to digress fora moment and expand on what is meant by the “true” score. In many re- spects, i's poor choice of words and a potentially mislead- ing term (although one we're stuck with), because “true,” in 100 STREINER 80, However, ifthe tests given in English, which the person leamed only 2 yearsago. the irae" score of 80s ikely notan accurate reflection oF her intelligence. Similarly, personun- dergoing an assessment for purposes of child access and cus- tody may deliberately understate the degree to which he uses, corporal punishiment. Repeat evaluations may yield similar scoreson the estandthe mean willbe a good approximation of thetruescore (because of low random error) but the defensive response style, which produces a bias, means that the true score will not be an accurate one. Finally, a depressed person ‘may have a T score around 75 on numerous administrationsof 4 petsonality test, However if she responds wel to therapy, thenbothnher depression andher tue score should move loser to the average range. The different effects of random and systematic error are captured in Judd, Smith, and Kidder’s (1991) expansion of Equation | SeOrE spy = ScOrecy + Scorese* Scoreps, (3) where CIs the construct of interest, SE the systematic error, and RE is the random error, In this formulation, Scorecy + Scorese is the same as Scorerie in Equation 1, Two advan- tages of expressing the trae score a the sum of the construct and the systematic error is that it illustrates the relationship between reliability and validity. and shows how the different \ypes of error affect each of them: etoitty = 2222s, ” wheres Vatiy = 22. io ‘These last two equations show that random error affects both reliability and validity (because the larger it i, the smaller the ratio between the numerators and denominators), “whereas systematic error affects only validity. Returning to reliability, it can be defined on a conceptual Jevel as the degree to which “mensurements of individuals on audition to these two sources of errr (time and observer), We ‘ean adda third source—that associated withthe homogeneity ‘ot the items that comprise the scale. Iascaletapsa single corstzuctor domain, such as anxiety or mathematical ability, then to ensure content validity, we ‘want the scale to (a) consist of items that sample the entire domain and (b) not include items that tp other abilities oF constructs, For example, ates of mathematics should sample everything a cild is expected to know at a given grade level, but not consist of long, written passages that may reflect the childs reading ability as mauch as his or her math kills. Simi larly, an ansiety inventory should tap all ofthe components of anxiety (¢.g., cognitive, behavioral, affective) but not in- ‘clude items from other realms, such as ego strength or social desirability. Because classical test theory assumes that the items ona Scale are a random sample from the wriverse of all possible items drawn from the domain, then they should be correlated highly with one another. However, this may not always be tue, For example, Person A may endorse two items on an anxiety inventory (eg. “I fel tense most of the time" “Tam afraid to leave the house on my ow"), whereas Person B may say True tothe first but Noto the second, This difference inthe pater of responding Would affect the cor- relationsamong the items, and hence the intemal consistency of the scale. A high degree of internal consistency is desir- able, because it“speaks directly tothe ability of the clinician or the researcher (o interpret the composite score asa reflee- ‘ion ofthe tess items” (Henson, 2001, p. 178) ‘The original method of measuring internal consistency is called “split half reliability. As the name implies, itis caleu- lated by splitting the testin half (¢g..allof the odd numbered items in one half and the even numbered ones in the other) and correlating the two parts, If the scale as a whole is inter- nally consistent, then any two randomly derived halves should contain similar items and thus yield comparable scores. A modification of this was proposed by Rulon (1939), which relies on calculating the variance of the differ- ence score between the (Wo halftess (63) andthe variance ofthe otal score (0% 4) 8rOss people: Reliability =1~ 52. 6 Fut ‘The right most partof the equation (63/0, isthe propor- a est Par one aka ne eee COEFFICIENT ALPHA 101 portional to its length. Splitting a scale in half reduces its length by 50%, and hence underestimates the reliability. This difficulty can be solved relatively easily, though, by using the Spearman-Brown “prophecy” formula that compensates for the reduction in length. The second issue is that there are ‘many ways to split a scale in half in fact, a 12-item scale can be divided 462 ways and each one will result in a somewhat different estimate ofthe reliability.2 This problem was dealt \with for the ease of dichotomous items by Kuder and Rich- aardson (1937), Their famous equation, which is referred to as, KR-20 because it was the 20th one in their article, reads: kR-20=—|1- k Zr. Fal oh | a where kis the number o tems, py the proportion of people ‘who answered positively to item &, qc is the proportion of people who answered negasvely (ie..qu= I~ pu). andy isthe variance ofthe al sores. KR20 canbe thought of ss the mean of al possible splichaf eal. The limitation of handling only dichotomous items was solved by Cronbach (1951), n his generalization of KR-20 into coefficient o which can be written 8) where Zo} is the sum of the variances of ll ofthe items. Co- efficient cthas the same property as KR-20, in terms of being the average of all possible spits > “That pretty much describes what cis and can do. In the next section, I look at the other side ofthe equation and dis- ‘cuss what cis not and cannot, or does not, do. MYTHS ABOUT ALPHA Myth 1: Alpha Is a Fixed Property of a Scale ‘The primary myth surrounding o.(and all other indexes of n liability. for that matter) is that once it is determined in one study, then you know the reliability of the scale under all cir- ‘cumstances. As a number of authors have pointed out, how- ever, reliability isa characteristic ofthe test scores, not ofthe test itself (eg, Caruso, 2000; Pedhazur & Scimelkin, 1991; Yin & Fan, 2000). That is, reliability depends as much on the sample being tested as on the test. This has been reinforced in tho recent guidelines for publishing the results of studies (Wilkinson & The Task Force on Statistical Inference, 1999), which stated ghat, “Its important to remember that test is not reliable or unreliable. Relibilty isa propery of the scores on ates fora particular population of exaninees” {p. 596; emphasis added). The reasons for this flow out of Equations 1,2, and 8, Equation 2 ells us that reliability isthe ratio of the true and total score variances, However, Equation 1 shows that you can never obtain the true score. Conse- quently, any measured value ofthe reliability is an estimate and, as with all estimates of parameters, subject to some de sree of eror. Finally. Equation 8 reflects the fact thatthe rl ability depends on the total score variance, and this is going to differ from one sample of people to another. The more het erogeneous the sample, then the larger the variance ofthe to- tal cores and the higher the reliability. Caruso (2000) did a meta-analysis of reliability studies done with the NEO and found, for example, that the mean reliability of the Agzee- ableness subscale was .79 when it was used in studies with the general population, butonly .62in clinical samples Sin larly, Henson, Kogan, and Vacha-flaase’s (2001) meta-analysis of teacher efficacy scales found that the reli ability estimates for the Internal Failure scale ranged from 5 to 82, and from. to 82 for the General Teaching El cacy seale. The reliabilities were affected by a number of at tvibutes ofthe samples including, not surprisingly, the heter- ‘geneity ofthe teachers. Consequently, scale that may have excellent reliability with one group may have only marginal religilty in another. One implication ofthis is that tis not sufficient to rely on published reports of reliability ifthe scale isto be used with another group of people; it may be necessary to determine it anew ifthe group is sufficiently if ferent, especially with regard to its homogeneity Myth 2: Alpha Measures Only the Internal Consistency of the Scale Its true thatthe higher the correlations among the items of a scale, the higher will be the value of ct But, the converse of ‘this—that @ high value of oLimplies a high degree of intemal consistency—is not always true. The reason is that «is also 102 STREINER three from each subscale), 65 with 12 items, and 75 with 18 items. A scale composed of three orthogonal (i2., uncorrelated) subscales had an ct of .64 with 18 items, He concluded that ‘ita scale has more than 14 items, then it will have an of 70 or beter even if it consists of (wo orthogonal dimensions wth modest (ie, .30) item interconelations. Ifthe dimen- sions fe coreated with each othe, as they usualy ae, hen. ais even greater. (p. 102) In other words, even though a scale may consist of two or ‘more independent constructs, «could be substantial as long as the scale contains enough items. The bottom line is that a high value of cris a prerequisite for internal consistency, but does not guarantee it; long, multidimensional scales will also have high values of Myth 3: Bigger Is Always Better For most indexes of reliability, the higher the value the better. ‘We would like high levels of agreement between independent raters and good stability of scores over time in the absence of change. This is tue, 0, about o, but only up to a point. As T just noted, ccmeasures not only the homogeneity of the items, ‘but also the homogeneity of what is being assessed, In many ‘cases, even seemingly unidimensional constructs can be con- ceptualized having a number of different aspects, Lang (1971), for example, stated that anxiety can be broken down, into three components—cognitive, physiological, and behav- iorsl—whereas Koksal and Power (1990) added a fourth, af- fective, dimension. Moreover, these do not always respond in concert and the correlations among them may be quite mod- est (Antony, 2001). Consequently, any scale tha is designed to measure anxiety ns a whole must by necessity have some degree of heterogeneity among the items. Ifthe anxiety scale hhas three or four subscales, they should each be more homo- geneous than the scale asa whole, but even here, should not be too high (over 90 or so), Higher values may reflect unnec- essary duplication of content across items and point more 10 redundancy than to homogeneity; or, as McClelland (1980) put it, “asking the same question many different ways” (p. 30). In the final section, Twill expand on this a bit more: Myth 4: Alpha Ranges Between 0 and 1 trait for some items, and strongly disagree for other items) 0 ‘minimize Yea-saying bias (e.g., Steiner & Norman, 1995), Needless to say, the scoring for the reversed items should also be reversed. If this isn’t done, the items will be nega- tively correlated, leading toa value of athat is below zero, OF ‘couse, if the items are scored correctly and some correla- tions are still negative, then it points to serious problems in the original construction of the scale. ‘A less frequent cause of a negative value of aris when the variabitity ofthe individual items exceeds their shared vari ance, which may occur when the items are tapping a variety of different constructs (Henson, 2001). Because negative values of care theoretically impossible, Henson recom: ‘mended reporting them as zero, but negative or zero, the con- clusions are the same—the items are most likely not ‘measuring what they purport to USING ALPHA Not al indexes of reliability can be used in all situations. For example, itis impossible (0 assess interrater reliability for self-administered scales and difficult to determine test-retest reliability for conditions that change over brief periods of time (which is not to say that some of our students haven’t tried), Similarly, there are certain types of scales for which c is inappropriate. t should not be used for “power” tests that ‘measure how many items are completed in a fixed period of time (such as the Digit Symbol Coding subtest of the WAIS-Iif, The issue here is that it is assumed that people will differ only in terms of the number of items completed, and that everyone will be correct on most or all of the com- pleted ones. So, for any given person, the correlations be- tween items will depend on how many items were finished, ‘and not the pattern of responding, Closely related to this are many ofthe other subtests of the Wechsler scales and similar types of indexes, where the items are presented in order of difficulty. Again, the expected pat- terof answersis that they will alle correct untilthe difficulty level exceeds the person’s ability and the remaining items ‘would be wrong; oF there should be a number of two-point re. sponses, followed by some one-point answers, and then zeros. fei computed for these types of tests, it will result in a very high value, one that is ony marginally below 1.0. “Third, is inappropriate if the answer to one item depends ‘on the response to a previous one, or when more than one COEFFICIENT ALPHA, 103 than 20 or so items, can be quite respectable, giving the misleading impression that the scale is homogeneous. ‘So, how high should o:be? Inthe first version of his book, ‘Nunnally (1967) recommended .50 to 60 for the erly stages, of research, £80 for basic research tools, and 9002s the "mini- ‘mally tolerable estimate” forclinical purposes, with idea of 95. Heincreasedthe starting levelto.70in later versionsot his book ¢Vunnally, 1978; Nunnally & Bernstein, 1994). In my ‘pinion (and note that this isan opinion, as are all other values, suggested by various authors), he gotitrightforresearch tool but went 100 far for clinical scales. As outlined in Myth 3, ex- ‘cept for extremely narrowly defined traits (and Ican’t think of any). as over 90 most likely indicate unnecessary redundancy rather than a desirable level of internal consistency. CONCLUSIONS Internal consistency is necessary in scales that measure vari- ‘ous aspects of personality (a subsequent article will examine situations where itis nat important). However, Cronbach's 0. must be used and interpreted with some degree of caution. 1, You cannot trust that published estimates of etapply in ll situations. 4 the group for which the scale will. be used is more or less homogenous than the one in the published repon, then will most likely be differ- cent (higher in the first case, lower in the second). 2. Because ais affected by the Jength of the scale, high values do not guarantee internal consistency or unidimensionality, Seales over 20 items or so will hhave acceptable values of o, even though they may consist of two or three orthogonal dimensions. It is necessary to also examine the matrix of correlations of the individual items and to look at the item-total correlations. In this vein, Clark and Watson (1995) recommended a mean interitem correlation within the range of .15 to 20 for scales that measure broad characteristics and between 40 to .50 for those tap- ping narrower ones, 3. Values of ctcan be (o0 high, and point to redundancy among the items, recommend a maximum value of 90. REFERENCES Caruso, 1. (2000), Reliability generalization of the NEO personsity ‘scales Educational and Psychological Measurement, 60, 236-254, ‘Crk... & Watson, D. (1995) Constructing vality: Basi sue in ob- Jetve scale development. Psychological Asem 7, 39-318 ‘Coming, J. M, (1993). What cotton alpha? An examination of teary nd applications. Jounal of Applied Psycholy, 78, 98-104 ‘Cronbach LJ. (1951), Coefficient apha and the eral stractare of tet Prychometria, 16, 207-34 Hamilton, M.A (1967), Development of aratin scale for primary depressive lnss. British Jwomal of Sila Clinical Psychology, 6, 278-296 Henson, RK. (2001). Understanding internal consistency reliability est ‘ts: A conceptual primer on coefictnt alpha Mearurenent and Bra tation in Counseling and Development 34. 171-189, Henson, R-K., Kogan, LR, & Vacha Haase, T-(2001.A reliability gene ‘ization study ofthe Teacher Eiicay Scale and elated instruments. Bd ‘weational and Psychological Measurement, 61, 408-420 Jud, C. Mt, Smith E.R. de Kidder, L-H- (1991). Research methods ins: ‘al relations (6h ed). New York: Harcourt Brate Jovanovich. Kok, F, 4 Power, KG. (1990), Four Systems Anxiety Questonnsie {FSAQ): A selfrepon measure of somatic, cognitive, behavioral, and feeling components Journal of Personality Assessment. 54, 334-545 ude, C.F Richardson, M. W, (1937), The theory ofthe estimation of test eit, Pchomerik, 2, 131-160. Lang, PJ. (1971). The application of psychophysiolosiesl mehous. tS. Garfield & A. Bergin (Bis), Hono of pssehoherapy and behavior ‘ohange ip 75-125). New York: Wiles. Lord, F.M., & Novick, M. R, (968), Stastial theories of mental test scores Reading, MA! Adsison- Wesley. [MeClland, DC: (1980), Motive dispositions: The merits of operant amb Fespondent measures. n L. Wheeler (Ei), Review of ermalty and so ial poyehology (Vo. 1; pp 1). Bevery Hills, CA: Sage ‘Nunnally J. €. (1967). Pasthomeric theory, New Yook: MeGraw Hi Numa, J. €. (1978). Psychometric theory Gnd ed.) New York MeGrawe Hil Nannaly, J C., & Bernstein, tH, (1994) Psychometric theory (el ed) ‘Rew York: MeGaHil Praha 1, & Schmelia,L (191). Measureent, design and analysis: ‘Ar imegrted appr Hille, NJ: Larne Eabaur Assit ine Rulon, P, (1939) simplified procedure for determining the rial of ‘ea of splices, Harvard Educational Revie: 999-103 Sener. D-L., & Norman, G. R. (1998). Health measurement scales: racial guide their development and use (2nd ed) Oxford. Engle Osford University Press. Wechsler, B19) WAPS-I admunistarion and scoring monaal (Sede), San Antoni: TX: Peychologiel Corporation, ‘Wilkinson, L,& The fk Force on Statistical lference. (1999) Statistical ‘methods in psychology journals Guidelines and explanations. American Peychologs, 54, $98-608. ‘Yin, P, & Fen, X. 2000). Assessing the reliability of Beck Depression In ‘venlory Scores Reliab generalization seross stoies. Euational and Peycholopeal Measurenen, 60, 201-223 David L. Streiner Kunin-Lunenfeld Applied Research Unit

You might also like