Running Head: HOW TO DO A META-ANALYSIS

How to do a meta-analysis

Andy P. Field

Department of Psychology, University of Sussex, UK

Raphael Gillett

School of Psychology, University of Leicester

Abstract

Meta-analysis is a statistical tool for estimating the mean and variance of underlying

population effects from a collection of empirical studies addressing ostensibly the same research

question. Meta-analysis has become an increasing popular and valuable tool in psychological

research and major review articles typically employ these methods. This article describes the

process of conducting meta-analysis, from selecting articles, developing inclusion criteria,

calculating effect sizes, conducting the actual analysis (including information on how to do the

analysis on popular computer packages such as SPSS/PASW and R), to estimating the effects of

publication bias. Guidance is also given on how to write up a meta-analysis.

WHAT IS META-ANALYSIS AND HOW DO I DO IT?

Psychologists are typically interested in finding general answers to questions across this

diverse discipline. Some examples are whether cognitive behaviour therapy (CBT) is efficacious for

treating anxiety in children and adolescents (Cartwright-Hatton, Roberts, Chitsabesan, Fothergill, &

Harrington, 2004), whether language affects theory of mind performance (Milligan, Astington, &

Dack, 2007), whether eyewitnesses have biased memories for events (Douglass & Steblay, 2006),

whether temperament differs across gender (Else-Quest, Hyde, Goldsmith, & Van Hulle, 2006), the

neuropsychological effects of sports-related concussion (Belanger & Vanderploeg, 2005) and how

pregnant women can be helped to quit smoking (Kelley, Bond, & Abraham, 2001). These examples

illustrate the diversity of questions posed by psychologists to understand human behaviour.

Although answers to these questions can be obtained in single pieces of research, when these

studies are based on small samples the resulting estimates of effects will be more biased than those

from large-sample studies. Also, replication is an important means to deal with the problems created by

measurement error in research (Fisher, 1935). For these reasons, different researchers often address

the same or similar research questions making it possible to answer questions through assimilating

data from a variety of sources using meta-analysis. A meta-analysis can tell us several things:

1. The mean and variance of underlying population effects. For example, the effects in the

population of doing CBT on anxious children compared to waiting list controls. You can

also compute confidence intervals for the population effects.

2. Variability in effects across studies. Meta-analysis can also be used to estimate the

variability between effect sizes across studies (the homogeneity of effect sizes). Some

meta-analysts report these statistics as a justification for assuming a particular model for

their analysis or to see whether there is variability in effect sizes that moderator variables

could explain (see the section on conducting meta-analysis). However, there is

accumulating evidence that effect sizes should be heterogeneous across studies in the vast

majority of cases (see, for example, National Research Council, 1992), and significance

tests of this variability (homogeneity tests) have low power. Therefore, variability

statistics should be reported, regardless of whether moderator variables have been

measured, because they tell us something important about the distribution of effect sizes

in the meta-analysis, but not as a justification for choosing a particular method.

3. Moderator variables: If there is variability in effect sizes, and in most cases there is

(Field, 2005b), this variability can be explored in terms of moderator variables (Field,

2003b; Overton, 1998). For example, we might find that compared to a waiting list

control, CBT including group therapy produces a larger effect size for improvement in

adolescents with eating disorders than CBT without a group component.

This article is intended as an extended tutorial in which we overview the key stages necessary

when conducting a meta-analysis. The article describes how to do meta-analysis in a step-by-step

way using some examples from the psychological literature. In doing so we look both at the

theory of meta-analysis and at how to use computer programs to conduct one: we focus on

PASW (formerly SPSS), because many psychologists use it, and R (because it is free and does

things that PASW cannot). We have broken the process of meta-analysis into 6 steps: (1) do a

literature search; (2) decide on some inclusion criteria and apply them; (3) calculate effect sizes

for each study to be included; (4) do the basic meta-analysis; (5) consider doing some more

advanced analysis such as publication bias analysis and exploring moderator variables; and (6)

write up the results.

THE EXAMPLE DATA SETS

In this tutorial we use two real data sets from the psychological literature. Cartwright-Hatton,

et al. (2004) conducted a systematic review of the efficacy of CBT for childhood and adolescent

anxiety. This study is representative of clinical research in that relatively few studies had addressed

this question and sample sizes within each study were relatively small. These data are used as our

main example and the most benefit can be gained from reading their paper in conjunction with this

one. When discussing moderator analysis, we use a larger data set from Tanenbaum and Leaper

(2002), who conducted a meta-analysis on whether parents' gender schemas related to their

children’s gender-related cognitions. These data files are available on the website that accompanies

this article (Field & Gillett, 2009).

STEP 1: DO A LITERATURE SEARCH

The first step in meta-analysis is to search the literature for studies that have addressed the same

research question using electronic databases such as the ISI Web of Knowledge, PubMed, and

PsycInfo. This can be done to find articles, but also to identify authors in the field (who might have

unpublished data – see below); in the latter case it can be helpful not just to backward search for

articles but to forward search by finding authors who cite papers in the field. It is often useful to

hand-search relevant journals that are not part of these electronic databases and to use the reference

sections of the articles that you have found to check for articles that you have missed. One potential

bias in a meta-analysis arises from the fact that significant findings are more likely to be published

than non-significant findings both because researchers do not submit them (Dickersin, Min, &

Meinert, 1992) and reviewers tend to reject manuscripts containing them (Hedges, 1984). This is

known as publication bias or the ‘file drawer’ problem (Rosenthal, 1979). This bias is not trivial:

significant findings are estimated to be eight times more likely to be submitted than non-significant

ones (Greenwald, 1975), studies with positive findings are around 7 times more likely to be

published than studies with results supporting the null hypothesis (Coursol & Wagner, 1986) and

97% of articles in psychology journals report significant results (Sterling, 1959). The effect of this

bias is that meta-analytic reviews will over-estimate population effects if they have not included

unpublished studies, because effect sizes in unpublished studies of comparable methodological

quality will be smaller (McLeod & Weisz, 2004) and can be half the size of comparable published

research (Shadish, 1992). To minimise the bias of the file drawer problem the search can be

extended from published papers to relevant conference proceedings, and you can contact people you consider to

be experts in the field to see if they have any unpublished data or know of any data relevant to your

research question that is not in the public domain. This can be done by direct email to authors in the

field, but also by posting a message to a topic specific newsgroup or email listserv.

Moving to our example, Cartwright-Hatton et al. (2004) gathered articles by searching 8

databases: Cochrane Controlled Trials register, Current Controlled Trials, Medline,

Embase/Psychinfo, Cinahl, NHS Economic Evaluation Database, National Technical Information

Service, ISI Web of science. They also searched the reference lists of these articles, and hand

searched 13 journals known to publish clinical trials on anxiety or anxiety research generally.

Finally, the authors contacted people in the field and requested information about any other trials

not unearthed by their search. This search strategy highlights the use of varied resources to ensure

all potentially relevant studies are included and to reduce bias due to the file-drawer problem.

STEP 2: DECIDE ON INCLUSION CRITERIA

The inclusion of badly-conducted research can also bias a meta-analysis. Although meta-

analysis might seem to solve the problem of variance in study quality because these differences will

‘come out in the wash’, even one red sock (bad study) amongst the white clothes (good studies) can

ruin the laundry. Meta-analysis can end up being an exercise in adding apples to oranges unless

inclusion criteria are applied to ensure the quality and similarity of the included studies.

Inclusion criteria depend on the research question being addressed and any specific

methodological issues in the field; for example, in a meta-analysis of a therapeutic intervention like

CBT, you might decide on a working definition of what constitutes CBT, and maybe exclude

studies that do not have proper control groups and so on. You should not exclude studies because of

some idiosyncratic whim: it is important that you formulate a precise set of criteria that is applied

throughout; otherwise you will introduce subjective bias into the analysis. It is also vital to be

transparent about the criteria in your write up, and even consider reporting the number of studies

that were included/excluded at each hurdle in the process.

It is also possible to classify studies into groups, for example methodologically strong or

weak, or the use of waiting list control or other intervention controls, and then see if this variable

moderates the effect size; by doing so you can answer questions such as: do methodologically

strong studies (by your criteria) differ in effect size to the weaker studies? Or, does the type of

control group affect the strength of the effect of CBT?

In their review, Cartwright-Hatton et al. (2004) list a variety of inclusion criteria that will

not be repeated here; reading their paper though will highlight the central point that they devised

criteria sensible to their research question: they were interested in child anxiety, so variables such as

age of patients (were they children), diagnostic status (were they anxious) and outcome measures

(did they meet the required standard) were used as inclusion criteria.

STEP 3: CALCULATE THE EFFECT SIZES

What are Effect Sizes and How Do I calculate them?

Once you have collected your articles, you need to find the effect sizes within them, or

calculate them for yourself. An effect size is usually a standardized measure of the magnitude of

the observed effect (see, for example, Clark-Carter, 2003; Field, 2005c). As such, effect sizes across

different studies that have measured different variables, or have used different scales of

measurement can be directly compared: an effect size based on the Beck anxiety inventory could be

compared to an effect size based on heart rate. Many measures of effect size have been proposed

(see Rosenthal, 1991 for a good overview) and the most common are Pearson’s correlation

coefficient, r, Cohen’s, d, and the odds ratio (OR). However, there may be reasons to prefer

unstandardized effect size measures (Baguley, 2009), and meta-analytic methods exist for analysing

these that will not be discussed in this paper (but see Bond, Wiitala, & Richard, 2003).

Pearson’s correlation coefficient, r, is a standardized form of the covariance between two

variables and is well known and understood by most psychologists as a measure of the strength of

relationship between two continuous variables; however, it is also a very versatile measure of the

strength of an experimental effect. If you had a sports-related concussion group (coded numerically

as 1) and a non-concussed control (coded numerically as 0), and you conducted a Pearson

correlation between this variable and their performance on some cognitive task, the resulting

correlation will have the same p value as a t-test on the same data. In fact there are direct

relationships between r and statistics that quantify group differences (e.g. t and F), associations

between categorical variables (χ2), and the p value of any test statistic. The conversions between r

and these various measures are discussed in many sources (e.g. Field, 2005a, 2005c; Rosenthal,

1991) and will not be repeated here.
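As an illustration, the R lines below implement the most commonly used of these conversions (r from t, from a χ2 based on 1 degree of freedom, and from z; see Rosenthal, 1991); the test statistics and sample size are hypothetical values standing in for numbers read from a primary study.

# Hypothetical test statistics read from a primary study
t_val <- 2.5; df_t <- 28                      # independent t-test and its degrees of freedom
r_from_t <- sqrt(t_val^2 / (t_val^2 + df_t))
chi_sq <- 6.2; N <- 80                        # chi-square (1 df) based on N participants
r_from_chi <- sqrt(chi_sq / N)
z_val <- 2.1                                  # z score (e.g., converted from a reported p value)
r_from_z <- z_val / sqrt(N)
round(c(r_from_t, r_from_chi, r_from_z), 3)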

Cohen (1988, 1992) made some widely adopted suggestions about what constitutes a large or

small effect: r = .10 (small effect, the effect explains 1% of the total variance); r = .30 (medium

effect, the effect accounts for 9% of the total variance); r = .50 (large effect, the effect accounts for

25% of the variance). Although these guidelines can be a useful rule of thumb to assess the

importance of an effect (regardless of the significance of the test statistic), it is worth remembering

that these ‘canned’ effect sizes are not always comparable when converted to different metrics, and

that there is no substitute for evaluating an effect size within the context of the research domain that

it is being used (Baguley, 2009; Lenth, 2001).

Cohen’s d is based on the standardized difference between two means. You subtract the mean

of one group from the other and then standardize this by dividing by σ, the standard deviation, which

is the square root of the sum of squared errors (i.e. take the difference between each score and the

mean, square it, and then add all of these squared values up) divided by the total number of scores:

d = \frac{M_1 - M_2}{\sigma}.

σ can be based on either a single group (usually the control group) or can be a pooled estimate

based on both groups by using the sample size, n, and variances, s, from each:

\sigma = \sqrt{\frac{(n_1 - 1)s_1 + (n_2 - 1)s_2}{n_1 + n_2 - 2}}.

Whether you standardise using one group or both depends on what you are trying to quantify. For

example, in a clinical drug trial, the drug dosage will affect not just the mean of any outcome

variables, but also the variance; therefore, you would not want to use this inflated variance when

computing d and would instead use the control group only (so that d reflects the mean change

relative to the control group).
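To make this concrete, here is a minimal R sketch that computes d from hypothetical summary statistics, once standardising by the pooled estimate and once by the control group alone (group 2 is treated as the control).

# Hypothetical means, variances and group sizes (group 1 = treatment, group 2 = control)
m1 <- 12.4; m2 <- 9.8
s1 <- 16.0; s2 <- 20.3                        # group variances
n1 <- 25;  n2 <- 27
sigma_pooled  <- sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
sigma_control <- sqrt(s2)                     # standardise by the control group only
d_pooled  <- (m1 - m2) / sigma_pooled
d_control <- (m1 - m2) / sigma_control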

If some of the primary studies have employed factorial designs, it is possible to obtain

estimators of effect size for these designs that are metrically comparable with the d estimator for the

two-group design (Gillett, 2003). As with r, Cohen (1988, 1992) has suggested benchmarks of d =

0.20, 0.50 and 0.80 as representing small, medium and large effects respectively.

The odds ratio is the ratio of the odds (the probability of the event occurring divided by the

probability of the event not occurring) of an event occurring in one group compared to another (see

Fleiss, 1973). For example, if the odds of being symptom free after treatment are 10, and the odds

of being symptom free after being on the waiting list are 2 then the odds ratio is 10/2 = 5. This

means that the odds of being symptom free are 5 times greater after treatment, compared to being

on the waiting list. The odds ratio can vary from 0 to infinity, and a value of 1 indicates that the

odds of a particular outcome are equal in both groups. If dichotomised data (i.e. a 2 × 2 contingency

table) need to be incorporated into an analysis based mainly on d or r, then a d-based measure

called d-Cox exists (see Sanchez-Meca, Marin-Martinez, & Chacon-Moscoso, 2003, for a review).
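The arithmetic is easily verified; the R sketch below reproduces the worked example from a hypothetical 2 × 2 table of frequencies chosen to give odds of 10 and 2.

# Hypothetical frequencies: rows = treatment/waiting list, columns = symptom free or not
tab <- matrix(c(40, 4, 20, 10), nrow = 2, byrow = TRUE,
              dimnames = list(c("treatment", "waiting list"),
                              c("symptom free", "not symptom free")))
odds_treatment <- tab[1, 1] / tab[1, 2]            # 40/4  = 10
odds_waitlist  <- tab[2, 1] / tab[2, 2]            # 20/10 = 2
odds_ratio     <- odds_treatment / odds_waitlist   # 10/2  = 5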

There is much to recommend r as an effect size measure (e.g. Rosenthal & DiMatteo, 2001).

It is certainly convenient because it is well understood by most psychologists, and unlike d and the

OR it is constrained to lie between 0 (no effect) and ±1 (a perfect effect). It does not matter what

effect you are looking for, what variables have been measured, or how those variables have been

measured: a correlation coefficient of 0 means there is no effect, and a value of ±1 means that there

is a perfect association. (Note that because r is not measured on a linear scale, an effect such as r

=.4 is not twice as big as one with r = .2). However, there are situations in which d may be

favoured; for example, when group sizes are very discrepant (McGrath & Meyer, 2006) r might be

quite biased because, unlike d, it does not account for these ‘base rate’ differences in group n. In

such circumstances, if r is used it should be adjusted to the same underlying base rate, which could

be the base rate suggested in the literature, the average base rate across studies in the meta-analysis,

or a 50/50 base rate (which maximizes the correlation).
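One common d-to-r conversion (as given in standard meta-analysis texts) makes the base-rate point explicit: the same d corresponds to a smaller r when group sizes are discrepant. The R lines below sketch this with hypothetical group sizes.

# d-to-r conversion: r = d / sqrt(d^2 + a), where a = (n1 + n2)^2 / (n1 * n2)
d_to_r <- function(d, n1, n2) d / sqrt(d^2 + (n1 + n2)^2 / (n1 * n2))
d_to_r(0.5, 50, 50)   # balanced groups: r is about .24
d_to_r(0.5, 90, 10)   # the same d with a 90/10 split: r drops to about .15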

Whichever effect size metric you choose to use, your next step will be to go through the

articles that you have chosen to include and calculate effect sizes using your chosen metric for

comparable effects within each study. If you were using r, this would mean obtaining a value for r

for each effect that you wanted to compare for every paper you want to include in the meta-analysis.

A given paper may contain several rs depending on the sorts of questions you are trying to address

with your meta-analysis. For example, cognitive impairment in PTSD could be measured in a

variety of ways in individual studies and so a meta-analysis might use several effect sizes from the

same study (Brewin, Kleiner, Vasterling, & Field, 2007). Solutions include calculating the average

effect size across all measures of the same outcome within a study (Rosenthal, 1991), comparing

the meta-analytic results when allowing multiple effect sizes from different measures of the same

outcome within a study, or computing an average effect size so that every study contributes only

one effect to the analysis (as in Brewin, et al., 2007).

Articles might not report effect sizes, or may report them in different metrics. If no effect

sizes are reported then you can often use the reported data to calculate one. For most effect size

measures you could do this using test statistics (as mentioned above r can be obtained from t, z, χ2,

and F), or probability values for effects (by converting first to z). If you use d as your effect size

then you can use means and standard deviations reported in the paper. Finally, if you are calculating

odds ratios then frequency data from the paper could be used. Sometimes papers do not include

sufficient data to calculate an effect size, in which case contact the authors for the raw data, or

relevant statistics from which an effect size can be computed. (Such attempts are often unsuccessful

and we urge authors to be sympathetic to emails from meta-analysts trying to find effect sizes.) If a

paper reports an effect size in a different metric to the one that you have chosen to use then you can

usually convert from one metric to another to at least get an approximate effect size. A full

description of the various conversions is beyond the scope of this article, but many of the relevant

equations can be found in Rosenthal (1991). There are also many Excel spreadsheets that are

provided online for computing effect sizes and converting between them; some examples are Wilson

(2004) and DeCoster (1998).

Calculating Effect Sizes for Cartwright-Hatton et al. (2004)

When reporting a meta-analysis it is a good idea to tabulate the effect sizes with other helpful

information (such as the sample size on which the effect size is based, N) and also to present a stem-

and-leaf plot of the effect sizes. For the Cartwright-Hatton et al. data we used r as the effect size

measure but we will highlight differences for situations in which d is used when we talk about the

meta-analysis itself. Table 1 shows a stem and leaf plot of the resulting effect sizes and this should

be included in the write up. This stem and leaf plot tells us the exact effect sizes to 2 decimal places

with the stem reflecting the first decimal place and the leaf showing the second; for example, we

know the smallest effect size was r = .18, the largest was r = .85 and there were effect sizes of .71

and .72. Table 2 shows the studies included in Cartwright-Hatton et al. (2004), with their

corresponding effect sizes (expressed as r) and the sample sizes on which these rs are based.

Insert Tables 1 and 2 Here

STEP 4: DO THE BASIC META-ANALYSIS

Having collected the relevant studies and calculated effect sizes from each study, you must do

the meta-analysis. This section looks first at some important conceptual issues before exploring how

to actually do the meta-analysis.

Initial Considerations

The main function of meta-analysis is to estimate effects in the population by combining the

effect sizes from a variety of articles. Specifically, the estimate is a weighted mean of the effect

sizes. The ‘weight’ that is used is usually a value reflecting the sampling accuracy of the effect size,

which is typically a function of sample size. This makes statistical sense because if an effect size

has good sampling accuracy (i.e. it is likely to be an accurate reflection of reality) then it is

weighted highly, whereas effect sizes that are imprecise are given less weight in the calculations. It

is also usually helpful to construct a confidence interval around the estimate of the population

effect. Data analysis is rarely straightforward, and meta-analysis is no exception because there are

different methods for estimating the population effects and these methods have their own pros and

cons. There are lots of issues to bear in mind and many authors have written extensively about these

issues (Field, 2001, 2003a, 2003b, 2005b, 2005c; Hall & Brannick, 2002; Hunter & Schmidt, 2004;

Rosenthal & DiMatteo, 2001; Schulze, 2004). In terms of doing a meta-analysis, the main issues (as

we see them) are: (1) which method to use, and (2) how to conceptualise your data. Actually, these

two issues are linked.

Which Method Should I Choose?

In essence, there are two ways to conceptualise meta-analysis: fixed- and random-effects

models (Hedges, 1992; Hedges & Vevea, 1998; Hunter & Schmidt, 2000). The fixed-effect model

assumes that studies in the meta-analysis are sampled from a population in which the average effect

size is fixed or one that can be predicted from a few predictors (Hunter & Schmidt, 2000).

Consequently, sample effect sizes should be homogenous because they come from the same

population with a fixed average effect. The alternative assumption is that the average effect size in

the population varies randomly from study to study: studies in a meta-analysis come from

populations that have different average effect sizes; so, population effect sizes can be thought of as

being sampled from a ‘superpopulation’ (Hedges, 1992). In this case, the effect sizes should be

heterogeneous because they come from populations with varying average effect sizes.

The above distinction is tied up with the method of meta-analysis that you choose because

statistically speaking the main difference between fixed- and random-effects models is in the

sources of error. In fixed-effects models there is error because of sampling studies from a

population of studies. This error exists in random-effects models but there is additional error created

by sampling the populations from a superpopulation. As such, calculating the error of the mean

effect size in random-effects models involves estimating two error terms, whereas in fixed-effects

models there is only one. This, as we will see, has implications for computing the mean effect size.

The two most widely-used methods of meta-analysis are that of Hunter and Schmidt (2004), which

is a random-effects method, and that of Hedges and colleagues (e.g. Hedges, 1992; Hedges

& Olkin, 1985; Hedges & Vevea, 1998) who provide both fixed- and random-effects methods.

However, multilevel models can also be used in the context of meta-analysis (see Hox, 2002,

Chapter 8).

Before doing the actual meta-analysis, you need to decide whether to conceptualise your

model as fixed- or random-effects. This decision depends both on the assumptions that can

realistically be made about the populations from which your studies are sampled, and the types of

inferences that you wish to make from the meta-analysis. On the former point, many writers have

argued that real-world data in the social sciences are likely to have variable population parameters

(Field, 2003b; Hunter & Schmidt, 2000, 2004; National Research Council, 1992; Osburn &

Callender, 1992). There are data to support these claims: Field (2005b) calculated the standard

deviations of effect sizes for all meta-analytic studies (using r) published in Psychological Bulletin

1997-2002 and found that they ranged from 0 to 0.3, and were most frequently in the region of

0.10-0.16; Barrick and Mount (1991) similarly found that the standard deviation of effect sizes (rs)

in published data sets was around 0.16. These data suggest that a random-effects approach should

be the norm in social science data.

The decision to use fixed- or random-effects models also depends upon the type of inferences that

you wish to make (Hedges & Vevea, 1998): fixed-effect models are appropriate for inferences that

extend only to the studies included in the meta-analysis (conditional inferences) whereas random-

effects models allow inferences that generalise beyond the studies included in the meta-analysis

(unconditional inferences). Psychologists will typically wish to generalize their findings beyond the

studies included in the meta-analysis and so a random-effects model is appropriate.

The decision about whether to apply fixed- or random-effects methods is not trivial. Despite

considerable evidence that variable effect sizes are the norm in psychological data, fixed-effects

methods are routinely used: a review of meta-analytic studies in Psychological Bulletin found 21

studies using fixed-effects methods (in 17 of these studies there was significant variability in

sample effect sizes) and none using random effects methods (Hunter & Schmidt, 2000). The

consequences of applying fixed-effect methods to random-effects data can be quite dramatic:

significance tests of the estimate of the population effect have Type I error rates inflated from the

normal 5% to 11–28% (Hunter & Schmidt, 2000) and 43-80% (Field, 2003b) depending on the

variability of effect sizes. In addition, when applying two random effect methods to 68 meta-

analyses from 5 large meta-analytic studies published in Psychological Bulletin, Schmidt, Oh and

Hayes (2009) found that the published fixed-effect confidence intervals around mean effect sizes were

on average 52% narrower than their actual width: these nominal 95% fixed-effect confidence

intervals were on average 56% confidence intervals. The consequences of applying random-effects

methods to fixed-effects data are considerably less dramatic: in Hedges' method, for example, the

additional between-study effect size variance used in the random-effects method becomes zero

when sample effect sizes are homogenous yielding the same result as the fixed-effects method.

We mentioned earlier that part of conducting a meta-analysis is to compute statistics that

quantify heterogeneity. These tests can be used to ascertain whether population effect sizes are

likely to be fixed or variable (Hedges & Olkin, 1985). If these homogeneity tests yield non-

significant results then sample effect sizes are usually regarded as roughly equivalent and so

population effect sizes are likely to be homogenous (and hence the assumption that they are fixed is

reasonable). However, these tests should be used cautiously as a means to decide on how to

conceptualise the data because they typically have low power to detect genuine variation in

population effect sizes (Hedges & Pigott, 2001). In general, we favour the view that the choice of

model should be determined a priori by the goal of the analysis rather than being a post hoc decision

based on the data collected.

To sum up, we believe that in most cases a random-effects model should be assumed (and the

consequences of applying random-effects models to fixed-effects data are much less severe than the

other way around). However, fixed-effects analysis may be appropriate when you do not wish to

generalise beyond the effect sizes in your analysis (Oswald & McCloy, 2003); for example, a

researcher who has conducted several similar studies some of which were more successful than

others might reasonably estimate the population effect of her research by using a fixed-effects

analysis. For one thing, it would be reasonable for her to assume that her studies are tapping the

same population, and also, she would not necessarily be trying to generalise beyond her own

studies.

Which Method is Best?

The next decision is whether to use Hunter and Schmidt's (2004) or Hedges and colleagues'

method. These methods will be described in due course, and the technical differences between

them have been summarised by Field (2005b) and will not be repeated here. Field (2001; but see

Hafdahl & Williams, 2009) conducted a series of Monte Carlo simulations comparing the

performance of the Hunter and Schmidt and Hedges and Olkin (fixed- and random-effects) methods

and found that when comparing random-effects methods the Hunter-Schmidt method yielded the

most accurate estimates of population correlation across a variety of situations (a view echoed by

Hall & Brannick, 2002 in a similar study). However, neither the Hunter-Schmidt nor Hedges and

colleagues’ method controlled the Type I error rate when 15 or fewer studies were included in the

meta-analysis, and the method described by Hedges and Vevea (1998) controlled the Type I error

rate better than the Hunter-Schmidt method when 20 or more studies were included. Schulze (2004)

has also done extensive simulation studies and based on these findings recommends against using

Fisher’s z transform and suggests that the ‘optimal’ study weights used in the H-V method can, at

times, be sub-optimal in practice. However, Schulze based these conclusions on using only the

fixed-effects version of Hedges’ method. Field (2005b) looked at Hedges and colleagues’ random-

effects method and again compared it to Hunter and Schmidt’s bare-bones method using a Monte

Carlo simulation. He concluded that in general both random-effects methods produce accurate

estimates of the population effect size. Hedges’ method showed small (less than .052 above the

population correlation) overestimations of the population correlation in extreme situations (i.e.

when the population correlation was large, ρ ≥ .3, and the standard deviation of correlations was

also large, σ_ρ ≥ 0.16; also when the population correlation was small, ρ ≥ .1 and the standard

deviation of correlations was at its maximum value, σ_ρ = 0.32). The Hunter-Schmidt estimates were

generally less biased than estimates from Hedges’ random effects method (less than .011 below the

population value), but in practical terms the bias in both methods was negligible. In terms of 95%

confidence intervals around the population estimate Hedges’ method was in general better at

achieving these intervals (the intervals for Hunter and Schmidt’s method tended to be too narrow,

probably because they recommend using credibility intervals and not confidence intervals — see

below). However, the relative merits of the methods depended on the parameters of the simulation

and in practice the researcher should consult the various tables in Field (2005b) to assess which

method might be most accurate for the given parameters of the meta-analysis that they are about to

conduct. Also, Hunter and Schmidt’s method involves psychometric corrections for the attenuation

of observed effect sizes that can be caused by measurement error (Hunter, Schmidt, & Le, 2006).

Not all studies will report reliability coefficients, so their methods use the average reliability across

studies to correct effect sizes. These psychometric corrections can be incorporated into any

procedure, including that of Hedges’ and colleagues, but these conditions are not explored in the

comparison studies mentioned above.

Methods of Meta-Analysis

Hunter and Schmidt Method (Hunter & Schmidt, 2004)

As already mentioned, this method emphasises isolating and correcting for sources of error

such as sampling error and unreliability of measurement variables. However, Hunter and Schmidt

(2004) spend an entire book explaining these corrections and so for this primer, we will conduct the

analysis in its simplest form. The population effect is estimated using a simple mean in which each

effect-size estimate, r, is weighted by the sample size on which it is based, n:

\bar{r} = \frac{\sum_{i=1}^{k} n_i r_i}{\sum_{i=1}^{k} n_i}.  (1)

Table 2 shows the effect sizes and their sample sizes, and in the final column we have calculated

each effect size multiplied by the sample size on which it is based. The sum of this final column is

the numerator of Equation 1, whereas the sum of sample sizes (column 2 in Table 2) is the denominator of

this equation. Therefore, the population effect can be estimated as:

\bar{r} = \frac{\sum_{i=1}^{k} n_i r_i}{\sum_{i=1}^{k} n_i} = \frac{312.06}{563.00} = .554

By Cohen’s (1988, 1992) criteria, this means that CBT for childhood and adolescent anxiety had a

large effect compared to waiting list controls.

The next step is to estimate the generalizability of this value using a credibility interval.

Hunter and Schmidt (2004) recommend correcting the population effect for artefacts before

constructing these credibility intervals. If we ignore artefact correction, the credibility intervals are

based on the variance of effect sizes in the population. Hunter and Schmidt (2004) argue that the

variance across sample effect sizes consists of the variance of effect sizes in the population and the

sampling error and so the variance in population effect sizes is estimated by correcting the variance

in sample effect sizes by the sampling error. The variance of sample effect sizes is the frequency

weighted average squared error:

\hat{\sigma}_r^2 = \frac{\sum_{i=1}^{k} n_i (r_i - \bar{r})^2}{\sum_{i=1}^{k} n_i}.  (2)

It is also necessary to estimate the sampling error variance using the population correlation

estimate, \bar{r}, and the average sample size, \bar{N} (see Hunter & Schmidt, 2004, p. 88):

\hat{\sigma}_e^2 = \frac{(1 - \bar{r}^2)^2}{\bar{N} - 1}.  (3)

To estimate the variance in population correlations we subtract the sampling error variance

from the variance in sample correlations (see Hunter & Schmidt, 2004, p. 88):

\hat{\sigma}_\rho^2 = \hat{\sigma}_r^2 - \hat{\sigma}_e^2.  (4)

The credibility intervals are based on taking the population effect estimate (Equation 1) and

adding to or subtracting from it the square root of the estimated population variance in Equation 4

multiplied by z_{α/2}, in which α is the desired probability (e.g. for a 95% interval z_{α/2} = 1.96):

95% Credibility Interval_{Upper} = \bar{r} + 1.96\sqrt{\hat{\sigma}_\rho^2},

95% Credibility Interval_{Lower} = \bar{r} - 1.96\sqrt{\hat{\sigma}_\rho^2}.  (5)

A chi-square statistic is used to measure homogeneity of effect sizes. This statistic is based on

the sum of squared errors of the mean effect size: Equation 6 shows how the chi-square statistic is

calculated from the sample size on which the correlation is based (n), the squared errors between

each effect size and the mean, and the variance.

\chi^2 = \sum_{i=1}^{k} \frac{(n_i - 1)(r_i - \bar{r})^2}{(1 - \bar{r}^2)^2}  (6)
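Equations 1-6 are simple enough to compute directly; the R sketch below implements this 'bare-bones' analysis (with no artefact corrections) for placeholder effect sizes and sample sizes, which you would replace with the values from your own studies.

# Placeholder effect sizes (r) and sample sizes (n); replace with your own data
r <- c(.18, .35, .52, .60, .71)
n <- c(40, 55, 62, 48, 70)
k       <- length(r)
r_bar   <- sum(n * r) / sum(n)                             # Equation 1: weighted mean effect size
var_r   <- sum(n * (r - r_bar)^2) / sum(n)                 # Equation 2: variance of sample effect sizes
var_e   <- (1 - r_bar^2)^2 / (mean(n) - 1)                 # Equation 3: sampling error variance
var_rho <- max(var_r - var_e, 0)                           # Equation 4: variance of population effect sizes
cred    <- r_bar + c(-1, 1) * 1.96 * sqrt(var_rho)         # Equation 5: 95% credibility interval
chi_sq  <- sum((n - 1) * (r - r_bar)^2) / (1 - r_bar^2)^2  # Equation 6: homogeneity statistic
p_het   <- pchisq(chi_sq, df = k - 1, lower.tail = FALSE)
round(c(mean = r_bar, lower = cred[1], upper = cred[2], chi.sq = chi_sq, p = p_het), 3)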

Hedges and Colleagues’ Method (Hedges & Olkin, 1985; Hedges & Vevea, 1998)

In this method, if r is being used effect sizes are first converted into a standard normal metric,

using Fisher’s (1921) r-to-Z transformation, before calculating a weighted average of these

transformed scores (in which ri is the effect size from study i), Equation 7:

z_{r_i} = \frac{1}{2}\log_e\left(\frac{1 + r_i}{1 - r_i}\right),  (7)

The transformation back to ri is shown in Equation 8:

r_i = \frac{e^{2 z_{r_i}} - 1}{e^{2 z_{r_i}} + 1}.  (8)
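In R these transformations are one-liners (they are the built-in atanh() and tanh() functions); applying them to the Hunter-Schmidt estimate of .554 from the example above:

r_to_z <- function(r) 0.5 * log((1 + r) / (1 - r))          # Equation 7 (identical to atanh(r))
z_to_r <- function(z) (exp(2 * z) - 1) / (exp(2 * z) + 1)   # Equation 8 (identical to tanh(z))
r_to_z(.554)            # roughly 0.62 in the z metric
z_to_r(r_to_z(.554))    # returns .554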

To remove the slight positive bias found from Fisher-transformed rs, the effect sizes can be

transformed with r − r(1 − r²)/(2(n − 3)) before the Fisher transformation in Equation 7 is applied

(see Overton, 1998). This is done in the PASW syntax files that we have produced to accompany

this paper. Note also that less biased r-to-z transformations have been developed that may explain

some of the differences between the two methods of meta-analysis discussed in this paper (Hafdahl,

2009, in press).

In the fixed-effect model, the transformed effect sizes are used to calculate an average in

which each effect size is weighted by the inverse within-study variance of the study from which it

came (Equation 9). When r is the effect size measure, the weight (wi) is the sample size, ni, minus

three (w_i = n_i − 3), but when d is the effect size measure this weight is the reciprocal of the

variance of d, w_i = \left[\frac{4}{N_i}\left(1 + \frac{d_i^2}{8}\right)\right]^{-1}:

\bar{z}_r = \frac{\sum_{i=1}^{k} w_i z_{r_i}}{\sum_{i=1}^{k} w_i},  (9)

in which k is the number of studies in the meta-analysis. When using r as an effect size measure the

resulting weighted average is in the z-metric and should be converted back to r using Equation 8.

This average, and the weight for each study, is used to calculate the homogeneity of effect

sizes (Equation 10). The resulting statistic Q has a chi-square distribution with k – 1 degrees of

freedom:

Q = \sum_{i=1}^{k} w_i\left(z_{r_i} - \bar{z}_r\right)^2.  (10)

If you wanted to apply a fixed effects model you could stop here. However, as we have

suggested there is usually good reason to assume that a random effects model is most appropriate.

To calculate the random-effects average effect size, the weights use a variance component that

incorporates both between-study variance and within-study variance. The between-study variance is

denoted by τ2 and is added to the within-study variance to create new weights (Equation 11):

w_i^* = \left(\frac{1}{w_i} + \hat{\tau}^2\right)^{-1}.  (11)

The value of wi depends upon whether r or d has been used (see above): when r has been used wi =

ni – 3. The random-effects weighted average in the z metric uses the same equation as the fixed

effects model, except that the weights have changed to incorporate between-study variance

(Equation 12):

\bar{z}_r^* = \frac{\sum_{i=1}^{k} w_i^* z_{r_i}}{\sum_{i=1}^{k} w_i^*},  (12)

The between-studies variance can be estimated in several ways (see Friedman, 1937; Hedges

& Vevea, 1998; Overton, 1998; Takkouche, Cadarso-Suarez, & Spiegelman, 1999); however,

Hedges and Vevea (1998, Equation 10) use Equation 13, which is based on Q (the weighted sum of

squared errors in Equation 10), k, and a constant, c, such that:

\hat{\tau}^2 = \frac{Q - (k - 1)}{c},  (13)

where the constant, c, is defined in Equation 14:

c = \sum_{i=1}^{k} w_i - \frac{\sum_{i=1}^{k} w_i^2}{\sum_{i=1}^{k} w_i},  (14)

If the estimate of between-studies variance, \hat{\tau}^2, yields a negative value then it is set to zero

(because the variance between studies cannot be negative). The estimate \hat{\tau}^2 is substituted in

Equation 11 to calculate the weight for a particular study, and this in turn is used in Equation 12 to

calculate the average correlation. This average correlation is then converted back to the r metric

using Equation 8 before being reported.

The final step is to estimate the precision of this population effect estimate using confidence

intervals. The confidence interval for a mean value is calculated using the standard error of that

mean. Therefore, to calculate the confidence interval for the population effect estimate, we need to

know the standard error of the mean effect size (Equation 15). It is the square root of the reciprocal

of the sum of the random-effects weights (see Hedges & Vevea, 1998, p. 493):

SE\left(\bar{z}_r^*\right) = \sqrt{\frac{1}{\sum_{i=1}^{k} w_i^*}}.  (15)

The confidence interval around the population effect estimate is calculated in the usual way

by multiplying the standard error by the two-tailed critical value of the normal distribution (which is

1.96 for a 95% confidence interval). The upper and lower bounds (Equation 16) are calculated by

taking the average effect size and adding or subtracting its standard error multiplied by 1.96:

95% CI_{Upper} = \bar{z}_r^* + 1.96\,SE\left(\bar{z}_r^*\right),

95% CI_{Lower} = \bar{z}_r^* - 1.96\,SE\left(\bar{z}_r^*\right).  (16)

These values are again transformed back to the r metric using Equation 8 before being reported.
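The whole procedure (Equations 9-16) can be written out in a few lines of R; the sketch below uses placeholder effect sizes and sample sizes and omits the small-bias correction to the Fisher transformation mentioned earlier, so it illustrates the logic rather than replacing the syntax files that accompany this article.

# Placeholder effect sizes (r) and sample sizes (n); replace with your own data
r <- c(.18, .35, .52, .60, .71)
n <- c(40, 55, 62, 48, 70)
k <- length(r)
z <- atanh(r)                                   # Fisher r-to-z (Equation 7)
w <- n - 3                                      # fixed-effects weights (inverse within-study variance)
z_fixed <- sum(w * z) / sum(w)                  # Equation 9: fixed-effects mean in the z metric
Q       <- sum(w * (z - z_fixed)^2)             # Equation 10: homogeneity statistic
c_const <- sum(w) - sum(w^2) / sum(w)           # Equation 14: the constant c
tau2    <- max((Q - (k - 1)) / c_const, 0)      # Equation 13: between-study variance
w_star  <- 1 / (1 / w + tau2)                   # Equation 11: random-effects weights
z_rand  <- sum(w_star * z) / sum(w_star)        # Equation 12: random-effects mean in the z metric
se      <- sqrt(1 / sum(w_star))                # Equation 15: standard error of the mean
ci      <- z_rand + c(-1, 1) * 1.96 * se        # Equation 16: 95% confidence interval (z metric)
round(tanh(c(estimate = z_rand, lower = ci[1], upper = ci[2])), 3)   # back to r (Equation 8)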

Doing Meta-Analysis on a Computer

In reality, you will not do the meta-analysis by hand (although we believe that there is no

harm in understanding what is going on behind the scenes). There are some stand-alone packages

for conducting meta-analyses such as Comprehensive Meta-Analysis, which implements many

different meta-analysis methods, converts effect sizes, and creates plots of study effects. Hunter and

Schmidt (2004) provide specialist custom-written software for implementing their full method on

the CD-ROM of their book. There is also a program called Mix (Bax, Yu, Ikeda, Tsuruta,

& Moons, 2006), and the Cochrane Collaboration provides software called Review Manager for

conducting meta-analysis (The Cochrane Collaboration, 2008). Both of these packages have

excellent graphical facilities.

For those who want to conduct meta-analysis without the expense of buying specialist

software, meta-analysis can also be done using R (R Development Core Team, 2008), a freely

available package for conducting a staggering array of statistical procedures. R is based on the S

language and so has much in common with the commercially-available package S-Plus. Scripts for

running a variety of meta-analysis procedures on d are available in the ‘meta’ package that can be

installed into R (Schwarzer, 2005). Likewise, publication bias analysis can be run in R. The

implementation of some of these programs will be described in due course. In addition, Excel users

can use a plug-in called MetaEasy (Kontopantelis & Reeves, 2009).
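As a brief illustration of this route, the sketch below uses the metacor() function from the 'meta' package to analyse correlations on the Fisher z scale; the effect sizes are placeholders, and the function names and arguments reflect the package as we know it, so they should be checked against the documentation of the version you install.

# install.packages("meta")          # run once if the package is not already installed
library(meta)
r <- c(.18, .35, .52, .60, .71)     # placeholder effect sizes
n <- c(40, 55, 62, 48, 70)          # placeholder sample sizes
m <- metacor(cor = r, n = n, sm = "ZCOR")   # meta-analysis of correlations via Fisher's z
summary(m)                          # fixed- and random-effects estimates, Q and tau-squared
forest(m)                           # forest plot of the individual and pooled effects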

PASW does not, at present, offer built in tools for doing meta-analysis but the methods

described in this paper can be conducted using custom-written syntax. To accompany this article we

have produced syntax files to conduct many of the meta-analytic techniques discussed in this paper

(although the Hunter-Schmidt version is in only its simplest form). Other PASW syntax files for r

and also d can be found on Lavesque (2001) and Wilson (2004). All of the data and syntax files

accompanying this paper can be downloaded from our webpage (Field & Gillett, 2009). The PASW

syntax files are:

1. Basic meta-analysis: the files Meta_Basic_r.sps, Meta_Basic_d.sps and Meta_Basic_D_h.sps

can be used to perform a basic meta-analysis on effect sizes expressed as r, d, and the difference

between proportions (D or h) respectively in PASW. The output provides an estimate of the

average effect size of all studies, or any subset of studies, a test of homogeneity of effect size

that contributes to the assessment of the goodness of fit of the statistical model, elementary

indicators of, and tests for, the presence of publication bias, and parameters for both fixed-

effects and random-effects models.

2. Moderator variable analysis: the files Meta_Mod_r.sps, Meta_Mod_d.sps and

Meta_Mod_D_h.sps can be used for analysing the influence of moderator variables on effect

sizes expressed as r, d, and the difference between proportions (D or h) respectively in PASW.

Each of these files is run using a shorter launch syntax file (i.e. Meta_Mod_r.sps is

launched by using the syntax file Launch_Meta_Mod_r.sps). The programs use weighted

multiple regression to provide an evaluation of the impact of continuous moderator variables on

effect sizes, an evaluation of the impact of categorical moderator variables on effect sizes, tests

of homogeneity of effect sizes that contribute to the assessment of the goodness of fit of a

statistical model incorporating a given set of moderator variables, and estimates for both fixed-

effects and random-effects models.

3. Publication bias analysis: the files Pub_Bias_r.R, Pub_Bias_d.R and Pub_Bias_D_h.R can be

used to produce funnel plots and a more sophisticated publication bias analysis on effect sizes

expressed as r, d, and the difference between proportions (D or h) respectively using the

software R. Each file computes an ordinary unadjusted estimate of effect size and four adjusted

estimates of effect size that indicate the potential impact of severe and moderate one- and two-

tailed bias, for both fixed-effects and random-effects models (see below). Note that these files

include Vevea and Woods’ (2005) scripts for R.

In this section, we will do the basic analysis that we described in the previous section using

the effect sizes from Cartwright-Hatton et al. (2004) expressed as r. Before we begin, you need to

create a folder in the ‘My Documents’ folder on your hard drive (usually denoted ‘C’) called ‘Meta-

Analysis’ (for the first author, the complete file path would, therefore, be ‘C:\Users\Dr. Andy

Field\My Documents\Meta-Analysis’vi. This folder is needed for some of our files to work.

In the PASW data editor, create two new variables, the first for the effect sizes, r, and the

second for the total sample size on which each effect size is based, n; it is also good practice to

create a variable in which you identify the study from which each effect size came. You can

download this data file (Cartwright-Hatton et al. (2004) data.sav) from the accompanying website.

Once the data are entered, simply open the syntax file and in the syntax window click on the ‘Run’

menu and then select ‘All’. The resulting output is in Figure 1. Note that the analysis calculated

both fixed- and random-effects statistics for both methods. This is for convenience but, given we

have made an a priori decision about which method to use and whether to apply a fixed- or

random-effects analysis, we would interpret only the corresponding part of the output. In this case,

we opted for a random-effects analysis. This output is fairly self-explanatory; for example, we can

see that, for Hedges and Vevea's method, the Q statistic (Equation 10 above) is highly

significant, χ2 (9) = 41.27, p < .001. Likewise, the population effect size once returned to the r

metric and its 95% confidence interval are: .61 (CI.95 = .48 (lower), .72 (upper)). We can also see

that this population effect size is significant, z = 7.57, p < .001.

At the bottom of the output are the corresponding statistics from the Hunter-Schmidt method

including the population estimate, .55, the sample correlation variance from Equation 2, .036, the

sampling error variance from Equation 3, .009, the variance in population correlations from

Equation 4, .027, the upper and lower bounds of the credibility interval from Equation 5, .87 and

.23, and the chi-square test of homogeneity from Equation 6 and its associated significance, χ2 (9) =

41.72, p < .001. The output also contains important information to be used to estimate the effects of

publication bias but we will come back to this issue in due course.

Based on both homogeneity tests, we could say that there was considerable variation in effect

sizes overall. Also, based on the estimate of population effect size and its confidence interval we

could conclude that there was a strong effect of CBT for childhood and adolescent anxiety disorders

compared to waiting list controls. To get some ideas about how to write up a meta-analysis like this

see Brewin et al. (2007).

Insert Figure 1 Here

STEP 5: DO SOME MORE ADVANCED ANALYSIS

Moderator Analysis

Theory behind Moderator Analysis

The model for moderator effects is a mixed model (which we mentioned earlier): it assumes a

general linear model in which each z-transformed effect size can be predicted from the transformed

moderator effect (represented by β1):

z_{r_i} = \beta_0 + C_i\beta_1 + e_i  (17)

The within-study error is represented by e_i, which will on average be zero with a

variance of 1/(n_i − 3). To calculate the moderator effect, β1, a generalised least squares (GLS)

estimate is calculated. For the purposes of this tutorial it is not necessary to know the mathematics

behind the process (if you are interested then read Field, 2003b; Overton, 1998). The main thing to

understand is that the moderator effect is coded using contrast weights that relate to the moderator

effect (like contrast weights in ANOVA). In the case of a moderator effect with two levels (e.g.

whether the CBT used was group therapy or individual therapy) we could give one level codes of –

1, and the other level codes of 1 (you should use 0.5 and -0.5 if you want the resulting beta to

represent the actual difference between the effect of group and individual CBT). As such, when we

run a moderator analysis using PASW we have to define contrast codes that indicate which groups

are to be compared.
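For readers who want to see this logic outside the macros, the sketch below runs a fixed-effects version of the weighted regression in R on placeholder data; note that the standard errors printed by lm() are not the correct meta-analytic ones (they would need to be rescaled so that the mean square error is fixed at 1), which is one reason to use the supplied syntax files for actual inference.

# Placeholder data: effect sizes, sample sizes and a two-level moderator coded 1 / -1
r        <- c(.18, .35, .52, .60, .71, .25)
n        <- c(40, 55, 62, 48, 70, 66)
contrast <- c(1, 1, 1, -1, -1, -1)
z <- atanh(r)                           # Fisher-transformed effect sizes
w <- n - 3                              # fixed-effects weights; a random-effects (mixed) model
                                        # would use 1 / (1 / w + tau2) instead, as in Equation 11
fit <- lm(z ~ contrast, weights = w)    # weighted regression of effect size on the moderator
coef(fit)                               # beta1 estimates the moderator effect in the z metric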

A Cautionary Tale: The Risk of Confounded Inference Caused by Unequal Cell Sizes

For theoretical and practical reasons, the primary studies in a meta-analysis tend to focus on some

combinations of levels of variables more than on others. For example, white people aged around 20

are more commonly used as participants in primary studies than black people aged around 50. The
occurrence of unequal cell sizes can introduce spurious correlations between otherwise independent

variables. Consider a meta-analysis of 12 primary studies that investigated the difference between

active and passive movement in spatial learning using the effect size measure d. Two moderator

variables were identified as potentially useful for explaining differences among studies: (a) whether

a reward was offered for good performance, and (b) whether the spatial environment was real or

virtual. However, only 8 out of the 12 studies provided information about the levels of the

moderator variables employed in their particular cases. Table 3 presents the original dataset with

full information about all 12 studies that was not available to the meta-analyst. The design of the

original dataset is balanced, because cell sizes are equal. Table 3 also displays a reduced dataset of

8 studies, which has an unbalanced design because cell sizes are unequal.

Insert Table 3 Here

Three things are apparent in the original balanced dataset. First, the mean effect of a real

environment is greater than that of a virtual one (.6 versus .2). Second, there is no difference between

the mean effect when reward is present and when it is absent (.4 versus .4). Third, the correlation between the reward and

environment factors is r = .0, as would be expected in a balanced design.

However, the meta-analyst must work with the reduced unbalanced dataset because key

information about levels of moderator variables is missing. In the reduced dataset, the correlation

between the reward and environment factors equals r = -.5. Crucially, the non-zero correlation

allows variance from the environment variable to be recruited by the reward variable. In other

words, the non-zero correlation induces a spurious difference between the reward level mean effects

(.3 versus .5). The artefactual difference is generated because high-scoring real environments are

under-represented when reward is present (.60 versus .18, .20, .22), while low-scoring virtual

environments are under-represented when reward is absent (.20 versus .58, .60, .62).
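To make the arithmetic concrete, the following R sketch reconstructs a reduced dataset consistent with the effect sizes quoted above (the full Table 3 is not reproduced here) and shows both the induced correlation between the moderators and the spurious difference between the reward means.

# Reduced (unbalanced) dataset reconstructed from the values quoted in the text
d        <- c(.60, .18, .20, .22, .58, .60, .62, .20)
reward   <- c(  1,   1,   1,   1,   0,   0,   0,   0)   # 1 = reward present, 0 = absent
env_real <- c(  1,   0,   0,   0,   1,   1,   1,   0)   # 1 = real environment, 0 = virtual
cor(reward, env_real)         # -0.5: the two moderators are now confounded
tapply(d, reward, mean)       # 0.5 (reward absent) versus 0.3 (present): a spurious difference
tapply(d, env_real, mean)     # 0.2 (virtual) versus 0.6 (real)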

Although the pool of potential moderator variables is often large for any given meta-analysis,

not all primary studies provide information about the levels of such variables. Hence, in practice,

only a few moderator variables may be suitable for analysis. In our example, suppose that too few

studies provided information about the environment (real or virtual) for it to be suitable for use as a

moderator variable. In that event, the spurious difference between the reward levels would remain,

and would be liable to be misinterpreted as a genuine phenomenon. In our example, we have the

benefit of knowing the full data set and therefore being able to see that the missing data were not

random. However, a meta-analyst just has the reduced data set and has no way of knowing whether

missing data are random or not. As such, missing data do not invalidate a meta-analysis per se,

and does not mean that moderator analysis should not be done when data are missing in studies for

certain levels of the moderator variable. However, it does mean that when studies at certain levels

of the moderator variable are under- or un-represented your interpretation should be restrained and

the possibility of bias made evident to the reader.

Moderator Analysis Using PASW

The macros we have supplied allow both continuous and categorical predictors (moderators)

to be entered into the regression model that a researcher wishes to test. To spare the researcher the

complexities of effect coding, the levels of a categorical predictor are coded using integers 1, 2, 3,

... to denote membership of category levels 1, 2, 3, ... of the predictor. The macros yield multiple

regression output for both fixed-effects and random-effects meta-analytic models.

The Cartwright-Hatton et al. data set is too small to do a moderator analysis, so we will turn

to our second example of Tanenbaum and Leaper (2002). Tanenbaum and Leaper were interested in

whether the effect of parent’s gender schemas on their children’s gender-related cognitions was

moderated by the gender of the experimenter. A PASW file of their data can be downloaded from

the website that accompanies this article (in this case Tanenbaum & Leaper (2002).sav). Load this

data file into PASW and you will see that the moderator variable (gender of the experimenter) is

represented by a column labelled ‘catmod’ in which male researchers are coded with the number 2

and females with 1. In this example we have just one column representing our sole categorical

moderator variable, but we could add in other columns for additional moderator variables.

The main PASW syntax file (in this case Meta_Mod_r.sps) is run using a much simpler

launch file. From PASW, open the syntax file Launch_Meta_Mod_r.sps. This file should appear in

a syntax window and comprises three lines:

cd "%HOMEDRIVE%%HOMEPATH%\My Documents\Meta-Analysis".

insert file="Meta_Mod_r.sps".

Moderator_r r=r n=n conmods=( ) catmods=(catmod).

The first line simply tells PASW where to find your meta-analysis files (remember that in Vista this

line will need to be edited to say ‘Documents’ rather than ‘My Documents’). The second line

references the main syntax file for the moderator analysis. If this file is not in the ‘...\My

Documents\Meta-Analysis’ directory then PASW will return a ‘file not found’ error message. The

final line is the most important because it contains parameters that need to be edited. The four

parameters need to be set to the names of the corresponding variables in the active data file:

• r = the name of the variable containing the effect sizes. In the Tenenbaum data file, this variable is named 'r', so we would edit this to read r=r; if you had labelled this column 'correlation' then you would edit the text to say r=correlation, etc.

• n = the name of the sample size variable. In the Tenenbaum data file, this variable is named 'n', so we would edit this to read n=n; if you had labelled this column 'sample_size' in PASW then you would edit the text to say n=sample_size, etc.

• conmods = the names of any variables in the data file that represent continuous moderator variables, e.g., conmods=(arousal accuracy). We have no continuous moderators in this example, so we leave the inside of the brackets blank, e.g., conmods=( ).

• catmods = the names of any categorical moderator variables, e.g., catmods=(gender religion). In the Tenenbaum data file, we have one categorical predictor, which we have labelled 'catmod' in the data file; hence, we edit the text to say catmods=(catmod).

On the top menu bar of this syntax file, click ‘Run’ and then ‘All’. (The format of the launch file for

d as an effect size is much the same except that there are two variables for sample size representing

the two groups, n1 and n2, which need to be set to the corresponding variable names in PASW, e.g.

n1=n_group1 n2 = n_group2.)

Insert Figure 2 Here

Figure 2 shows the resulting output. Tenenbaum and Leaper used a fixed-effects model and the first part of the output replicates what they report (with the 95% confidence interval reported in parentheses throughout): there was an overall small to medium effect, r = .16 (.14, .18), and the gender of the researcher significantly moderated this effect, χ2(1) = 23.72, p < .001. The random-effects model tells a different story: there was still an overall small to medium effect, r = .18 (.13, .22); however, the gender of the researcher did not significantly moderate this effect, χ2(1) = 1.18, p = .28. Given the heterogeneity in the data, the random-effects analysis is probably the one that should have been done.
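For readers who want to cross-check this kind of moderator analysis outside PASW, the sketch below uses the R package 'metafor' (which is not part of the materials supplied with this article). It assumes that the correlations, sample sizes and moderator codes from the Tenenbaum and Leaper file have been read into R as vectors named r, n and catmod (these names are our assumption), and its results will not necessarily match the supplied macros exactly because the weighting and variance estimators may differ.

# A cross-check of a categorical moderator (meta-regression) analysis using 'metafor'.
# r, n and catmod are assumed to hold the correlations, sample sizes and the
# experimenter-gender codes (1 = female, 2 = male) from the Tenenbaum data file.
library(metafor)
dat <- data.frame(r = r, n = n, catmod = catmod)
dat <- escalc(measure = "ZCOR", ri = r, ni = n, data = dat)      # adds Fisher z (yi) and its variance (vi)
rma(yi, vi, mods = ~ factor(catmod), data = dat, method = "DL")  # random-effects meta-regression
rma(yi, vi, mods = ~ factor(catmod), data = dat, method = "FE")  # fixed-effects analogue

In this output, the QM statistic is the test of the moderator.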

Estimating Publication Bias

Earlier on we mentioned that publication bias can exert a substantial influence on meta-

analytic reviews. Various techniques have been developed to estimate the effect of this bias, and to

correct for it. We will focus on only a selection of these methods. The earliest and most commonly reported estimate of publication bias is Rosenthal's (1979) fail-safe N. This was an elegant and

easily understood method for estimating the number of unpublished studies that would need to exist

to turn a significant population effect-size estimate into a non-significant one. To compute

Rosenthal’s fail-safe N, each effect size is first converted into a z-score and the sum of these scores

is used in the following equation:

N_{fs} = \frac{\left( \sum_{i=1}^{k} z_{i} \right)^{2}}{2.706} - k,        (26)

in which k is the number of studies in the meta-analysis and 2.706 is the square of 1.645, the critical value of z for a one-tailed test at p = .05.

For Cartwright-Hatton et al.'s data, the PASW basic analysis syntax gives a fail-safe N of 915 (see Figure 1). In other words, there would need to be 915 unpublished studies, averaging a null result and not included in the meta-analysis, to make the population effect size non-significant.
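To make the calculation concrete, here is a minimal R sketch of equation (26) using the effect sizes and sample sizes in Table 2. It is not the supplied PASW syntax, and because it derives each study's z from the Fisher-transformed correlation it will give a value close to, but not exactly, the 915 reported above.

# Rosenthal's fail-safe N (equation 26) for the Table 2 data.
r <- c(.55, .50, .18, .85, .62, .71, .58, .65, .60, .72)   # correlations
n <- c(50, 76, 93, 43, 33, 45, 70, 65, 41, 47)             # sample sizes
z <- atanh(r) * sqrt(n - 3)      # a z-score for each study (Fisher z test of r = 0)
k <- length(r)
N_fs <- sum(z)^2 / 2.706 - k     # 2.706 = 1.645^2, the one-tailed .05 criterion
N_fs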

However, the fail-safe N has been criticised because of its dependence on significance testing.

Because significance testing of the population effect-size estimate is not recommended, other methods have been devised. For example, when using d as an effect size measure, Orwin (1983) suggests a variation on the fail-safe N that estimates the number of unpublished studies required to bring the average effect size down to a predetermined value. This predetermined value could be 0 (no effect at all), but could also be some other value that is meaningful within the specific research context: for example, how many unpublished studies would there need to be to reduce the population effect size estimate from 0.67 to 0.2, a small effect by Cohen's (1988) criterion? However, any fail-safe

N method addresses the wrong question: it is usually more interesting to know the bias in the data

one has and to correct for it than to know how many studies would be needed to reverse a

conclusion (see Vevea & Woods, 2005).
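Orwin's variant is not implemented in the files supplied with this article, but it has a simple closed form. A sketch, assuming the unretrieved studies average an effect of zero, with \bar{d} the mean observed effect, d_c the criterion value, and k the number of studies:

N_{fs} = \frac{k\left( \bar{d} - d_{c} \right)}{d_{c}}

For illustration (the value of k here is hypothetical): with k = 10 studies and a mean effect of 0.67, bringing the average down to 0.2 would require 10(0.67 - 0.2)/0.2 ≈ 24 additional studies averaging no effect.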

A simple and effective graphical technique for exploring potential publication bias is the funnel plot (Light & Pillemer, 1984). A funnel plot displays effect sizes plotted against the sample size, standard error, conditional variance or some other measure of the precision of the estimate. An unbiased sample would ideally show a cloud of data points that is symmetric around the population effect size and has the shape of a funnel. This funnel shape reflects the greater variability in effect sizes from studies with small sample sizes (i.e., less precision). A sample with publication bias will lack symmetry because studies based on small samples that showed small effects will be less likely to be published than studies based on samples of the same size that showed larger effects (Macaskill, Walter, & Irwig, 2001). Figure 3 shows an example of a funnel plot showing approximate symmetry around the population effect size estimate. When you run the PASW syntax for the basic meta-analysis, funnel plots are produced; however, the y-axis is scaled the opposite way to normal conventions. For this reason, we advise that you use these plots only as a quick way to look for publication bias, and use our publication bias scripts in R to produce funnel plots for presentation purposes (see below). Funnel plots should really be used only as a first step before further analysis because factors other than publication bias can cause asymmetry. Some examples are true heterogeneity of effect sizes (in intervention studies this can happen because the intervention is more intensely delivered in smaller, more personalised studies), English-language bias (studies with smaller effects are often found in non-English language journals and get overlooked in the literature search) and data irregularities, including fraud and poor study design (Egger, Smith, Schneider, & Minder, 1997).

Insert Figure 3 Here

Attempts have been made to quantify the relationship between effect size and its associated

variance. An easy method to understand and implement is Begg and Mazumdar’s (1994) rank

correlation test for publication bias. This test is Kendall’s tau applied between a standardized form

of the effect size and its associated variance. The resulting statistic (and its significance) quantifies the association between the effect sizes and their variances (and hence their sample sizes): publication bias is indicated by a strong, significant correlation. This test has good power for large meta-analyses but can lack power

for smaller meta-analyses, for which a non-significant correlation should not be seen as evidence of

no publication bias (Begg & Mazumdar, 1994). This statistic is produced by the basic meta-analysis

syntax file that we ran earlier. In your PASW output you should find that Begg and Mazumdar’s

rank correlation for the Cartwright-Hatton et al. data is highly significant, τ (N = 10) = -.51, p < .05,

indicating significant publication bias. Similar techniques are available based on testing the slope of

a regression line fitted to the funnel plot (Macaskill, et al., 2001).
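To show what the test involves, the sketch below computes a version of Begg and Mazumdar's rank correlation in R for the Table 2 data, working with Fisher-transformed correlations. It is not the supplied PASW syntax; note also that correlating the standardized effect sizes with the variances rather than with the sample sizes reverses the sign of tau, so the sign may differ from the PASW output.

# Begg and Mazumdar's (1994) rank correlation test, sketched in R.
r <- c(.55, .50, .18, .85, .62, .71, .58, .65, .60, .72)   # Table 2 correlations
n <- c(50, 76, 93, 43, 33, 45, 70, 65, 41, 47)             # Table 2 sample sizes
z <- atanh(r)                         # Fisher z transform of each correlation
v <- 1 / (n - 3)                      # approximate sampling variance of each z
z_bar  <- sum(z / v) / sum(1 / v)     # fixed-effects weighted mean
z_star <- (z - z_bar) / sqrt(v - 1 / sum(1 / v))   # standardized deviates
cor.test(z_star, v, method = "kendall")            # Kendall's tau and its p value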

Funnel plots and the associated measures of the relationship between effect sizes and their

associated variances offer no means to correct for any bias detected. Two main methods have been

devised for making such corrections. Trim and fill (Duval & Tweedie, 2000) is a method in which a

biased funnel plot is truncated and the number (k) of missing studies from the truncated part is

estimated. Next, k artificial studies are added to the negative side of the funnel plot (and therefore have small effect sizes), so that, in effect, the analysis now contains k additional studies whose effect sizes are as small in magnitude as the k largest effect sizes are large. A new estimate of the population effect size is then

calculated including these artificially small effect sizes. This is a useful technique but, as Vevea

and Woods (2005) point out, it relies on the strict assumption that all of the ‘missing’ studies are

those with the smallest effect sizes; as such it can lead to over-correction. More sophisticated

correction methods have been devised based on weight function models of publication bias. These

methods use weights to model the process through which the likelihood of a study being published

varies (usually based on a criterion such as the significance of a study). The methods are quite

technical and have typically been effective only when meta-analyses contain relatively large numbers of studies (k > 100). Vevea and Woods' (2005) recent method, however, can be applied to smaller meta-analyses and gives the meta-analyst relatively more flexibility to specify the likely conditions of publication bias in their particular research scenario. Vevea and Woods specify four typical weight functions, which they label 'moderate one-tailed selection', 'severe one-tailed selection', 'moderate two-tailed selection', and 'severe two-tailed selection'; however, they

recommend adapting the weight functions based on what the funnel plot reveals (see Vevea &

Woods, 2005).
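If you want to try trim and fill yourself, one option is the 'meta' package used in the next section; the following sketch (ours, not part of the supplied scripts) applies it to the Table 2 data.

# Trim and fill via the 'meta' package (a sketch, not part of the supplied scripts).
library(meta)
r <- c(.55, .50, .18, .85, .62, .71, .58, .65, .60, .72)   # Table 2 correlations
n <- c(50, 76, 93, 43, 33, 45, 70, 65, 41, 47)             # Table 2 sample sizes
m  <- metacor(cor = r, n = n, sm = "ZCOR")   # meta-analysis of correlations (Fisher z)
tf <- trimfill(m)                            # estimate and impute the 'missing' studies
summary(tf)                                  # pooled estimates adjusted for the imputed studies
funnel(tf)                                   # funnel plot including the imputed studies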

Estimating and Correcting for Publication Bias Using a Computer

We have already calculated the fail-safe N, Begg and Mazumdar's rank correlation and some

crude funnel plots in our basic analysis. However, for the more sophisticated meta-analyst we

recommend producing funnel plots with confidence intervals superimposed, and correcting

population effect size estimates using Vevea and Woods’ methods (above). Vevea and Woods

(2005) have produced code for implementing their sensitivity analysis in S-PLUS, and this code

will also run in R. We have produced script files for R that feed data saved from the initial PASW meta-analysis into Vevea and Woods' S-PLUS code, and use the package 'meta' to produce funnel

plots too (you can also use Mix or Review Manager to produce funnel plots).

To do this part of the analysis you will need to download R if you do not already have it and

install it. To get a funnel plot, you will also need to install the ‘meta’ package that we mentioned

earlier. To do this, in R, go to the ‘Packages’ menu and select ‘Install Packages …’. You will be

asked to select a 'CRAN Mirror' and you should choose the one in the nearest geographical location to you. Having done this, select 'meta' from the list (if it is not listed, try changing to a different CRAN mirror). We then need to load the 'meta' package by going to the 'Packages' menu, then

selecting ‘Load Package …’ and selecting ‘meta’. You will now be able to use the commands in

this package.
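If you prefer to work from the R console rather than the menus, the equivalent commands are simply:

install.packages("meta")   # you may be asked to choose a CRAN mirror
library(meta)              # load the package for the current session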

We can now run a publication bias analysis on the Cartwright-Hatton et al. data. To do this,

go to the ‘File’ menu in R and select ‘Open script …’. Find your meta-analysis directory (remember

that you created this folder earlier), and select the file Pub_Bias_r.R (remember that if d is your

effect size then you must select the file Pub_Bias_d.R). This script will open in a new window

within R. In this new window, simply click with the right mouse button and select ‘Select all’ and

then click with the right mouse button again and select 'Run line or selection' (this process can be

done more quickly by using the keyboard shortcut of Ctrl + A followed by Ctrl + R).

The resulting funnel plot (Figure 4) shows the effect size plotted against the standard error for

each study, and a reference line representing the 95% confidence interval. If the data were unbiased,

this plot would be funnel shaped around the dotted line and symmetrical. The resulting plot is

clearly neither symmetrical nor funnel shaped (and shows one effect size that appears to be very discrepant from the rest), and so shows clear evidence of bias.

Insert Figures 4 and 5 Here

Figure 5 shows the output for Vevea and Woods’ (2005) sensitivity analysis for both fixed-

and random-effects models. We will interpret only the random-effects model. The unadjusted

population effect size estimate is first given (with its variance component) and also the value when

this estimate is converted back into r. These values correspond approximately to the values that we

have already calculated from our PASW analysis. However, the adjusted parameter estimates

provide the values of the population effect size estimate corrected for four different selection bias

models outlined by Vevea and Woods (2005). The four different selection bias models represent a

range of situations differing in the extent and form of the selection bias. As such, they are a

reasonable starting point in the absence of any better information about the selection bias model

most appropriate for your data (based on, for example, the funnel plot). However, Vevea and

Woods (2005) recommend applying a greater variety of selection models, or applying selection

models specifically tailored to the data within the particular meta-analysis. The important thing in

terms of interpretation is how the population effect size estimate changes under the different

selection bias models. For Cartwright-Hatton et al.’s data, the unadjusted population effect size (as

r) was .61 as calculated above using the PASW syntax. Under both moderate one- and two-tailed

selection bias, the population effect size estimate is unchanged to 2 decimal places (see the column

labelled r in Figure 5 for the random-effects model). Even applying a severe selection bias model,

the population effect size drops only to .60. As such, we can be confident that the strong effect of

CBT for childhood and adolescent anxiety disorders is not compromised even when applying

corrections for severe selection bias.

STEP 6: WRITE IT UP

Rosenthal (1995) wrote an excellent article on best practice in reporting meta-analytic

reviews. Based largely on his advice we recommend the following: First, you should always be

clear about your search and inclusion criteria, which effect size measure you are using (and any

issues you had in computing these), which meta-analytic technique you are applying to the data and

why (especially whether you are applying a fixed- or random-effects method). Rosenthal

recommends stem and leaf plots of the computed effect sizes because this is a concise way to

summarise the effect sizes that have been included in your analyses. If you have carried out a

moderator analysis, then you might also provide stem and leaf plots for sub-groups of the analysis

(e.g., see Brewin et al., 2007). Other plots that should be considered are forest plots and bean plots. You should always report statistics relating to the variability of effect sizes (these should

include the actual estimate of variability as well as statistical tests of variability), and obviously the

estimate of the population effect size and its associated confidence interval (or credibility interval).

You should, as a matter of habit, also report information on publication bias, and preferably a

variety of analyses (for example, the fail-safe N, a funnel plot, Begg & Mazumdar’s rank

correlation, and Vevea and Woods’ sensitivity analysis).
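A quick way to generate the recommended stem and leaf display is R's built-in stem() function; the sketch below uses the Table 2 effect sizes, although the exact layout will differ slightly from Table 1.

# A stem and leaf display of the effect sizes (cf. Table 1).
r <- c(.55, .50, .18, .85, .62, .71, .58, .65, .60, .72)
stem(r)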

SUMMARY

This article has tried to offer a comprehensive overview of how to conduct a meta-analytic

review, including new files for an easy implementation of the basic analysis in PASW and R. To sum up, the analysis begins by collecting articles that bear on the research question you are trying to answer. This will include emailing people in the field for unpublished studies, electronic

searches, searches of conference abstracts and so on. Once the articles are selected, inclusion

criteria need to be devised that reflect the concerns pertinent to the particular research question

(which might include the type of control group used, clarity of diagnosis, the measures used or other

factors that ensure a minimum level of research quality). The included articles are then scrutinised

for statistical details from which effect sizes can be calculated; the same effect size metric should be

used for all studies (see the aforementioned electronic resources for computing these effect sizes).

Next, decide on the type of analysis appropriate for your particular situation (fixed vs. random

effects, Hedges' method or Hunter and Schmidt's, etc.) and then apply this method (possibly

using the PASW resources produced to supplement this article). An important part of the analysis is

to describe the effect of publication bias descriptively (e.g., funnel plots, the rank correlation or the fail-safe N) and to re-estimate the population effect under various publication bias models using Vevea and Woods' (2005) method. Finally, the results need to be written up such that the reader has

clear information about the distribution of effect sizes (e.g., a stem and leaf plot), the effect size

variability, the estimate of the population effect and its 95% confidence interval, the extent of

publication bias (e.g., funnel plots, the rank correlation or the fail-safe N) and the influence of

publication bias (Vevea and Woods’ adjusted estimates).

References

Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of
Psychology, 100, 603-617.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44(1), 1-26.
Bax, L., Yu, L. M., Ikeda, N., Tsuruta, H., & Moons, K. G. M. (2006). Development and validation of
MIX: Comprehensive free software for meta-analysis of causal research data. BMC Medical
Research Methodology, 6(50).
Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for
publication bias. Biometrics, 50(4), 1088-1101.
Belanger, H. G., & Vanderploeg, R. D. (2005). The neuropsychological impact of sports-related
concussion: A meta-analysis. Journal of the International Neuropsychological Society,
11(4), 345-357.
Bond, C. F., Wiitala, W. L., & Richard, F. D. (2003). Meta-analysis of raw mean differences.
Psychological Methods, 8(4), 406-418.
Brewin, C. R., Kleiner, J. S., Vasterling, J. J., & Field, A. P. (2007). Memory for emotionally
neutral information in posttraumatic stress disorder: A meta-analytic investigation. Journal
of Abnormal Psychology, 116(3), 448-463.
Cartwright-Hatton, S., Roberts, C., Chitsabesan, P., Fothergill, C., & Harrington, R. (2004).
Systematic review of the efficacy of cognitive behaviour therapies for childhood and
adolescent anxiety disorders. British Journal of Clinical Psychology, 43, 421-436.
Clark-Carter, D. (2003). Effect size: The missing piece in the jigsaw. The Psychologist, 16(12),
636-638.
Cohen, J. (1988). Statistical power analysis for the behavioural sciences (2nd edition). New York:
Academic Press.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.
Coursol, A., & Wagner, E. E. (1986). Effect of positive findings on submission and acceptance
rates: A note on meta-analysis bias. Professional Psychology, 17, 136-137.
DeCoster, J. (1998). Microsoft Excel spreadsheets: Meta-analysis. Retrieved 1st October 2006,
from http://www.stat-help.com/spreadsheets.html
Dickersin, K., Min, Y.-I., & Meinert, C. L. (1992). Factors influencing publication of research
results: follow-up of applications submitted to two institutional review boards. Journal of
the American Medical Association, 267, 374–378.
Douglass, A. B., & Steblay, N. (2006). Memory distortion in eyewitnesses: A meta-analysis of the
post-identification feedback effect. Applied Cognitive Psychology, 20(7), 859-869.
Duval, S. J., & Tweedie, R. L. (2000). A nonparametric "trim and fill" method of accounting for
publication bias in meta-analysis. Journal of the American Statistical Association, 95(449),
89-98.
Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a
simple, graphical test. British Medical Journal, 315(7109), 629-634.

Else-Quest, N. M., Hyde, J. S., Goldsmith, H. H., & Van Hulle, C. A. (2006). Gender differences in
temperament: A meta-analysis. Psychological Bulletin, 132(1), 33-72.
Field, A. P. (2001). Meta-analysis of correlation coefficients: A Monte Carlo comparison of fixed-
and random-effects methods. Psychological Methods, 6(2), 161-180.
Field, A. P. (2003a). Can meta-analysis be trusted? Psychologist, 16(12), 642-645.
Field, A. P. (2003b). The problems in using Fixed-effects models of meta-analysis on real-world
data. Understanding Statistics, 2, 77-96.
Field, A. P. (2005a). Discovering statistics using SPSS (2nd Edition). London: Sage.
Field, A. P. (2005b). Is the meta-analysis of correlation coefficients accurate when population
correlations vary? Psychological Methods, 10(4), 444-467.
Field, A. P. (2005c). Meta-analysis. In J. Miles & P. Gilbert (Eds.), A handbook of research
methods in clinical and health psychology (pp. 295-308). Oxford: Oxford University Press.
Field, A. P., & Gillett, R. (2009). How to do a meta-analysis. Retrieved 5th July, 2009, from
http://www.statisticshell.com/How_To_Do_Meta-Analysis.html
Fisher, R. A. (1921). On the probable error of a coefficient of correlation deduced from a small
sample. Metron, 1, 3-32.
Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver & Boyd.
Fleiss, J. L. (1973). Statistical methods for rates and proportions. New York: John Wiley & Sons.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis
of variance. Journal of the American Statistical Association, 32, 675-701.
Gillett, R. (2003). The metric comparability of meta-analytic effect-size estimators from factorial
designs. Psychological Methods, 8, 419-433.
Greenwald, A. G. (1975). Consequences of prejudice against null hypothesis. Psychological
Bulletin, 82(1), 1-19.
Hafdahl, A. R. (2009). Improved Fisher-z estimators for univariate random-effects meta-analysis of
correlations. British Journal of Mathematical and Statistical Psychology, 62(2), 233-261.
Hafdahl, A. R. (in press). Random-effects meta-analysis of correlations: Evaluation of mean
estimators. British Journal of Mathematical and Statistical Psychology.
Hafdahl, A. R., & Williams, M. A. (2009). Meta-analysis of correlations revisited: Attempted
replication and extension of Field's (2001) simulation studies. Psychological Methods,
14(1), 24-42.
Hall, S. M., & Brannick, M. T. (2002). Comparison of two random-effects methods of meta-
analysis. Journal of Applied Psychology, 87(2), 377-389.
Hedges, L. V. (1984). Estimation of effect size under non-random sampling: the effects of
censoring studies yielding statistically insignificant mean differences. Journal of
Educational Statistics, 9, 61-85.
Hedges, L. V. (1992). Meta-analysis. Journal of Educational Statistics, 17(4), 279-296.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic
Press.

Hedges, L. V., & Pigott, T. D. (2001). The power of statistical tests in meta-analysis. Psychological
Methods, 6(3), 203-217.
Hedges, L. V., & Vevea, J. L. (1998). Fixed- and random-effects models in meta-analysis.
Psychological Methods, 3(4), 486-504.
Hox, J. J. (2002). Multilevel Analysis, Techniques and Applications. Mahwah, NJ: Lawrence
Erlbaum Associates.
Hunter, J. E., & Schmidt, F. L. (2000). Fixed effects vs. random effects meta-analysis models:
Implications for cumulative research knowledge. International Journal of Selection and
Assessment, 8(4), 275-292.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in
research findings (2nd ed.). Newbury Park, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Le, H. (2006). Implications of direct and indirect range restriction
for meta-analysis methods and findings. Journal of Applied Psychology, 91(3), 594-612.
Kelley, K., Bond, R., & Abraham, C. (2001). Effective approaches to persuading pregnant women
to quit smoking: A meta-analysis of intervention evaluation studies. British Journal of
Health Psychology, 6, 207-228.
Kontopantelis, E., & Reeves, D. (2009). MetaEasy: A meta-analysis add-in for Microsoft Excel.
Journal of Statistical Software, 30(7).
Lavesque, R. (2001). Syntax: Meta-Analysis. Retrieved 1st October 2006, from http://www.spsstools.net/
Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. American
Statistician, 55(3), 187-193.
Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.
Macaskill, P., Walter, S. D., & Irwig, L. (2001). A comparison of methods to detect publication bias
in meta-analysis. Statistics in Medicine, 20(4), 641-654.
McGrath, R. E., & Meyer, G. J. (2006). When effect sizes disagree: The case of r and d.
Psychological Methods, 11(4), 386-401.
McLeod, B. D., & Weisz, J. R. (2004). Using dissertations to examine potential bias in child and
adolescent clinical trials. Journal of Consulting and Clinical Psychology, 72(2), 235-251.
Milligan, K., Astington, J. W., & Dack, L. A. (2007). Language and theory of mind: Meta-analysis
of the relation between language ability and false-belief understanding. Child Development,
78(2), 622-646.
National Research Council. (1992). Combining information: Statistical issues and opportunities for
research. Washington, D.C.: National Academy Press.
Orwin, R. G. (1983). A fail-safe N for effect size in meta-analysis. Journal of Educational Statistics, 8, 157-159.
Osburn, H. G., & Callender, J. (1992). A note on the sampling variance of the mean uncorrected
correlation in meta-analysis and validity generalization. Journal of Applied Psychology,
77(2), 115-122.

Oswald, F. L., & McCloy, R. A. (2003). Meta-analysis and the art of the average. In K. R. Murphy
(Ed.), Validity generalization: A critical review (pp. 311-338). Mahwah, NJ: Lawrence
Erlbaum.
Overton, R. C. (1998). A comparison of fixed-effects and mixed (random-effects) models for meta-
analysis tests of moderator variable effects. Psychological Methods, 3(3), 354-379.
R Development Core Team. (2008). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin,
86(3), 638-641.
Rosenthal, R. (1991). Meta-analytic procedures for social research (2nd ed.). Newbury Park, CA:
Sage.
Rosenthal, R. (1995). Writing Meta-analytic Reviews. Psychological Bulletin, 118(2), 183-192.
Rosenthal, R., & DiMatteo, M. R. (2001). Meta-analysis: Recent developments in quantitative
methods for literature reviews. Annual Review of Psychology, 52, 59-82.
Sanchez-Meca, J., Marin-Martinez, F., & Chacon-Moscoso, S. (2003). Effect-size indices for
dichotomized outcomes in meta-analysis. Psychological Methods, 8(4), 448-467.
Schmidt, F. L., Oh, I. S., & Hayes, T. L. (2009). Fixed-versus random-effects models in meta-
analysis: Model properties and an empirical comparison of differences in results. British
Journal of Mathematical & Statistical Psychology, 62, 97-128.
Schulze, R. (2004). Meta-analysis: a comparison of approaches. Cambridge, MA: Hogrefe &
Huber.
Schwarzer, G. (2005). Meta. Retrieved 1st October 2006, from http://www.stats.bris.ac.uk/R/
Shadish, W. R. (1992). Do family and marital psychotherapies change what people do? A meta-
analysis of behavioural outcomes. In T. D. Cook, H. Cooper, D. S. Cordray, H. Hartmann,
L. V. Hedges, R. J. Light, T. A. Louis & F. Mosteller (Eds.), Meta-Analysis for
Explanation: A Casebook (pp. 129-208). New York: Sage.
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from
tests of significance - or vice versa. Journal of the American Statistical Association,
54(285), 30-34.
Takkouche, B., Cadarso-Suarez, C., & Spiegelman, D. (1999). Evaluation of old and new tests of
heterogeneity in epidemiologic meta-analysis. American Journal of Epidemiology, 150(2),
206-215.
Tenenbaum, H. R., & Leaper, C. (2002). Are parents' gender schemas related to their children's
gender-related cognitions? A meta-analysis. Developmental Psychology, 38(4), 615-630.
The Cochrane Collaboration. (2008). Review Manager (RevMan) for Windows: Version 5.0.
Copenhagen: The Nordic Cochrane Centre.
Vevea, J. L., & Woods, C. M. (2005). Publication bias in research synthesis: Sensitivity analysis
using a priori weight functions. Psychological Methods, 10(4), 428-443.
Wilson, D. B. (2004). A spreadsheet for calculating standardized mean difference type effect sizes.
Retrieved 1st October 2006, from http://mason.gmu.edu/~dwilsonb/ma.html

Acknowledgements

Thanks to Jack Vevea for amending his S-PLUS code from Vevea & Woods (2005) so that it

would run using R, and for responding to numerous emails from me about his weight function

model of publication bias.

Author note

Correspondence concerning this article should be addressed to either Andy P. Field,

Department of Psychology, University of Sussex, Falmer, Brighton, East Sussex, BN1 9QH,

[email protected]; or Raphael Gillett, School of Psychology, Henry Wellcome Building,

University of Leicester, Lancaster Road, Leicester LE1 9HN, [email protected]

Table 1: Stem and leaf plot of all effect sizes (rs)

Stem Leaf
.0
.1 8
.2
.3
.4
.5 0, 5, 8
.6 0, 2, 5
.7 1, 2
.8 5
.9

Table 2: Calculating the Hunter-Schmidt Estimate

Study N r N×r

Barrett (1998) 50 0.55 27.30

Barrett et al. (1996) 76 0.50 38.12

Dadds et al. (1997) 93 0.18 16.49

Flannery-Schroeder & Kendall (2000) 43 0.85 36.45

Hayward et al. (2000) 33 0.62 20.43

Kendall (1994) 45 0.71 31.85

Kendall et al. (1997) 70 0.58 40.90

Shortt et al. (2001) 65 0.65 42.03

Silverman et al. (1999) 41 0.60 24.69

Spence et al. (2000) 47 0.72 33.80

Total 563 312.06
