
Journal of Memory and Language 112 (2020) 104092


Best practice guidance for linear mixed-effects models in psychological science

Lotte Meteyard a,*, Robert A.I. Davies b

a School of Psychology & Clinical Language Sciences, University of Reading, Berkshire RG6 6AL, UK
b Department of Psychology, Lancaster University, Lancaster LA1 4YF, UK

ARTICLE INFO

Keywords:
Linear mixed effects models
Hierarchical models
Multilevel models

ABSTRACT

The use of Linear Mixed-effects Models (LMMs) is set to dominate statistical analyses in psychological science and may become the default approach to analyzing quantitative data. The rapid growth in adoption of LMMs has been matched by a proliferation of differences in practice. Unless this diversity is recognized, and checked, the field shall reap enormous difficulties in the future when attempts are made to consolidate or synthesize research findings. Here we examine this diversity using two methods – a survey of researchers (n = 163) and a quasi-systematic review of papers using LMMs (n = 400). The survey reveals substantive concerns among psychologists using or planning to use LMMs and an absence of agreed standards. The review of papers complements the survey, showing variation in how the models are built, how effects are evaluated and, most worryingly, how models are reported. Using these data as our departure point, we present a set of best practice guidance, focusing on the reporting of LMMs. It is the authors' intention that the paper supports a step-change in the reporting of LMMs across the psychological sciences, preventing a trajectory in which findings reported today cannot be transparently understood and used tomorrow.

Introduction

Linear Mixed-effects Models (LMMs) have become increasingly popular as a data analysis method in the psychological sciences. They are also known as hierarchical or multilevel or random effects models (Snijders & Bosker, 2011). LMMs are warranted when data are collected according to a multi-stage sampling or repeated measures design. That is, when there are likely to be correlations across the conditions of an experiment because the conditions include the same participants or participants who have some association with each other. Multi-stage sampling can arise naturally when collecting data about the behavior or attributes of participants recruited, e.g., as students from a sample of classes in a sample of schools, or as patients from a sample of clinics in a sample of regions. Repeated measures occur when participants experience all or more than one of the manipulated experimental conditions, or when all participants are presented with all stimuli. Such investigations are common in psychology. These designs yield data-sets that have a multilevel or hierarchical structure. Participant-level observations, e.g., an individual's measured skill level or score, can be grouped within the classes or schools, clinics or regions from which the participants are recruited. Trial-level observations, e.g., the latency of response to a stimulus word, can be grouped by the participants tested or by the stimuli presented (Baayen, Davidson, & Bates, 2008). We expect that the responses made by a participant to some stimuli will be correlated, or that responses from children in the same class or school or region will be correlated, or that responses to the same stimulus item across participants will be correlated. The hierarchical structure in the data (the ways in which data can be grouped) is associated with a hierarchical structure in the error variance. LMMs allow this structure to be explicitly modelled.

We review current practice for LMMs in the psychological sciences. To begin, we present an example of a mixed-effects analysis (Section 1.1), with the aim of clearly illustrating how random effects relate to fixed effects. Researchers who are comfortable in their conceptual understanding of LMMs may wish to skip this part. Following the example, we present data from a survey of researchers (Section 2) and a review of reporting practices in papers published between 2013 and 2016 (Section 3). Our observations reveal significant concerns in the community over the implementation of LMMs, and a worrying range of reporting practices in published papers (Section 4). Using the available literature, we then present best practice guidance (Section 4.1) with a bullet-point summary (Section 5). To preempt two key conclusions, researchers should be reassured that there is no single correct way to implement an LMM, and that the choices they make during analysis will comprise one path, however justified, amongst multiple alternatives. This being so, to ensure the future utility of our findings, the community must adopt a standard format for reporting complete model outputs (see the example tables in Appendix 5). All appendices and data are available at osf.io/bfq39.


Corresponding author.
E-mail address: [email protected] (L. Meteyard).

https://doi.org/10.1016/j.jml.2020.104092
Received 18 March 2019; Received in revised form 14 January 2020; Accepted 15 January 2020
Available online 29 January 2020
0749-596X/ © 2020 Elsevier Inc. All rights reserved.

Fig. 1. Illustrations for Participant Intercepts for Naming Accuracy. a shows each participant’s mean accuracy across all the naming trials they completed, with the
group mean as the rightmost column. b shows the participant’s accuracy scaled as standard deviations from the group mean (centered at zero) – the Random
Intercepts by Participant.

An example

Our example is introductory but it is not intended as a step-by-step tutorial. We provide an explanation of mixed-effects models without recourse to algebra or formulae. In particular, we discuss random intercepts and random slopes in the context of this example, and how these can be fit alone (intercepts or slopes only) or together (intercepts and slopes) for a given fixed effect predictor. In our experience as researchers and teachers, this is the biggest conceptual hurdle to understanding and working with LMMs.

A subset of data from Meteyard and Bose (2018) has been used, and scripts and data are available from osf.io/bfq39 – Files – LMMs_BestPractice_Example.R and NamingData.txt for those wishing to recreate the analysis and graphs.[1] For those wishing to see the model output, osf.io/bfq39 – Files – LMMs_BestPractice_Example_withOutput is available as both an R script and a text file.

[1] We have left annotations and comments between the two authors in the script, to illustrate the work-in-progress nature of a coding script.

To collect the data, ten individuals with aphasia completed a picture naming task. Stimuli comprised 175 pictures from the Philadelphia Naming Test (Roach, Schwartz, Martin, Grewal, & Brecher, 1996). The experiment tested how cues presented with the pictures affected naming accuracy, and each picture was presented with four different cues. Thus, each participant was presented with each picture four times, and the study conformed to a repeated measures design. The four cues were: a word associated to the naming target (towel - bath); an unassociated word (towel - rice); the phonological onset (towel – 't'); and a tone. Given previous findings, we predicted that a phonological onset cue or an associated word cue would improve naming accuracy, relative to an unassociated word cue or the tone cue. The experiment also tested how the properties of the target name affected naming accuracy. Here we will look at the length of the word (in phonemes) and the frequency of the word (log lemma frequency). We predicted that words with more phonemes (longer words) would be harder to name, as reflected in reduced response accuracy, whereas words with higher frequency would be easier to name, as seen in increased accuracy.

In conventional mixed-effects modelling terms, given this design, we have three fixed effects. Cue type is a factor with four levels (the different cues). Length and frequency are two continuous predictors that have a value associated to each target picture name. The random effects are associated with the unexplained[2] differences between the participants (10 participants, each of whom completed 700 trials) and the items (175 items, each associated with 40 observed responses). Participants and items were sampled from respective person or picture populations: each participant and each item can be seen to be a sampling unit from a wider population. Intuitively, responses by a participant will tend to be correlated because one person may be more or less accurate than another, on average. Responses to each item will tend to be correlated because one picture will be more or less difficult than another. For simplicity, we are going to illustrate random effects for participants only. Graphs are generated from mixed-effects models with all fixed effects predictors but only the random effect under consideration (see Figs. 1–4). This is so we can consider each case in isolation.

[2] We do not have a complete explanation for why different participants or items are associated with variation in responses, so there will always be some error variance associated with participants and items that is 'unexplained'. Any explanations or predictions that we do have can be included in the model as fixed effect predictors.

The simplest possible random effect to include in the mixed-effects model would be the random effect of participant on intercepts, in an intercepts only model. What does that mean? To start, we can calculate the average accuracy (grand mean) across all participants' responses. However, the participants differ in the severity of their aphasia, and this variation leads to differences between participants in their average naming accuracy (Fig. 1a). To account for this, we can model the random variance in intercepts due to unexplained differences between participants: the random intercepts by participants. Fig. 1a shows the raw data, with each participant's accuracy (averaged across all the trials they completed) and the grand mean. It is clear that some participants are above the mean and some are below it. Because we are modelling how each participant deviates from the grand mean, it is convenient to scale the units for these differences as standard deviations from the grand mean, centered at zero. Fig. 1b shows the random intercepts for participants (extracted from a mixed-effects model that included the fixed effects plus just the random intercepts by participant) where zero represents the grand mean. This plot shows the difference between each participant's accuracy and the grand mean accuracy. The model output tells us that the variance associated with random intercepts is 1.58 (SD = 1.257). So, on average, participant-level intercepts vary around the grand mean by 1.257 SD units. Given that the units for measurement of accuracy go from 0 to 1, we can interpret this as quite a large amount of variation across participants. This is clearly illustrated in Fig. 1a and b.
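To make the intercepts-only model concrete, the sketch below shows how such a model can be specified with the lme4 package in R. It is a minimal illustration, not the authors' published script (which is available at osf.io/bfq39): the column names, and the use of lmer() rather than a logistic variant, are our assumptions here.

library(lme4)

naming <- read.delim("NamingData.txt")  # data from osf.io/bfq39; column names below are assumed

# Fixed effects for cue type, length and frequency; random intercepts by participant.
m_intercepts <- lmer(Accuracy ~ CueType + LengthPhonemes + Frequency +
                       (1 | Participant), data = naming)

summary(m_intercepts)            # fixed effect estimates plus random effect variances
ranef(m_intercepts)$Participant  # per-participant deviations from the grand intercept (cf. Fig. 1b)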


Our readers will know that participants may differ not only in the level of performance (average accuracy of response) but also in the ways in which they are influenced by the experimental conditions or by the stimulus attributes. We can account for random differences between sampled participants in their response to cue type by specifying a model term corresponding to random slopes for the effect of cue type, that is, to deviations between participants in the coefficient for cue type. We can calculate the average naming accuracy within each cue condition across participants. To get the fixed effect result, we can then (as in an ANOVA) compare the four cue types to each other and see on average the effect of cue type on naming accuracy. Fig. 2a shows the average response accuracy per condition, illustrating this fixed effect for cue type.

Fig. 2a. Average accuracy across the four Cue Type conditions. Accuracy values are fitted values taken from a mixed effect model fit to the data with all three fixed effect predictors and random slopes for Cue Type by Participant: Accuracy ~ Cue Type + Length Phonemes + Frequency + (0 + Cue Type | Participant). Note random intercepts for Participants were not included in this model, to illustrate a slopes only model. Error bars are 95% confidence intervals.

The trends in the plot suggest that cues which share the target onset (known as a 'phonological cue' in aphasia research) increase accuracy relative to the other three cue types. When we model random slopes for cue type over participants (i.e. slopes only, without random intercepts), we aim to gauge how the effect of cue type differs across participants. In this experiment, cue type is a factor with four levels, so we are concerned with the variation among participants in how the average accuracy of response differs between the four conditions.

Fig. 2b shows the individual participant data for each condition. It is clear that not all participants show the same effect of phonological cueing. For example, participants who are highly accurate across all conditions (Participants A to D) show ceiling effects, so there is not much scope for phonological cueing to improve naming further. So, what is the spread (variance) of deviations between participants around the average effect of cue type? Fig. 2c shows the participant random slopes estimated for the effect of cue type. This shows how within each cue type condition the effect for different participants varies around the mean accuracy of responses under that condition.

The model output tells us the variance in slopes associated with each cue type (Shared onset SD = 1.268, Associated word SD = 1.259, Tone SD = 1.310 and Non-associated word SD = 1.254). So, on average, within each condition, participants vary around the mean by ~1.3 units. The model output also tells us how the by-subjects deviations in the slopes of the effects of cue type conditions are correlated with each other, with high positive correlations (0.94–0.99). A per-subject deviation in response to one condition will, predictably, correlate with the deviation for the same participant in response to other conditions. This is perhaps unsurprising given how much the participants vary between each other, a variation that is driven principally by the severity of their aphasia. Put another way, the main explanation of participants' performance across the different cue conditions comes from accounting for the differences between the participants. This is a nice example of the variance-covariance structure in the data – i.e. where variation arises and how it is related across groupings in the data.

For the continuous fixed effect predictors, the term 'random slope' will make more intuitive sense, and here we will model both random intercepts and random slopes for the effect of length across participants. For a more complete account of the data, we will also ask the model to fit the covariance for intercepts and slopes – that is, to model them as correlated. For example, participants who are more accurate (higher intercept) may show a stronger effect of length (steeper slope), resulting in a positive correlation between intercept and slope. First, to see how word length affects naming accuracy, we look at the slope of naming accuracy when we plot it against length, illustrating the average effect of length (see Fig. 3a). By fitting random intercepts and random slopes for word length over participants, we model both the differences between participants in overall accuracy (see Fig. 1) and the between-participant differences in the slope for the effect of length. To illustrate this, we have plotted the fitted values from a model with random intercepts and with random slopes for word length over participants. Fig. 3b shows the separate estimated slope of the length effect for each participant. More accurate participants have higher intercepts, and participants show differences in how steep or shallow the slope for length is. Steeper slopes mean a stronger effect of length on naming accuracy. Finally, we plotted the same data as the random effects – that is, the per-subject deviations from the average intercept and from the average slope (Fig. 3c). From this plot, we can see that deviation in overall accuracy (i.e. random variation in intercepts by participant) is much greater than in slopes for length.

The model output gives us the variance associated with participant intercepts (SD = 1.259), with values consistent with those we have seen previously for participant intercepts. It also gives us the variance associated with the effect of Length in Phonemes, with an SD = 0.135. This is much smaller than the variation associated with participant intercepts, and that tells us that (as seen in Fig. 3c) participants show much more variation in the overall accuracy of their naming than they do in how their naming accuracy is affected by the lengths of the words they name. The correlation (covariance) between the slopes for Length and the participant intercepts is positive and relatively low (0.34). Thus, there is some tendency for participants with higher accuracy (higher intercepts) to show greater effects of Length (steeper slopes).

Finally, the same process can be applied to the effect of target name frequency, and this is illustrated by Fig. 4a–c. Here we can see more variation between participants in how frequency affects naming accuracy.

The model output gives us the variance associated with participant intercepts (SD = 1.268), again consistent with previous estimates. It also gives us the variance associated with the effect of Frequency, with an SD = 0.200. This is larger than the variation associated with the effect of Length, as illustrated in the spread of dots (per-subject deviations) in Fig. 4c. The correlation between the random slopes for the effect of Frequency and the participant random intercepts is large and negative (−0.84). Participants with higher accuracy (higher intercepts) show a reduced effect of Frequency (shallower slopes); this is clearly reflected in Fig. 4b.

We hope that this example has done two things. First, clearly explained the concept of random and fixed effects. Second, highlighted just how informative the random effects can be.


[Fig. 2b here: per-participant boxplots of naming accuracy (model fit) in each Cue Type condition (Sh.Ons, Assoc, Tone, NonAssoc), one panel per participant (A–J).]
Fig. 2b. Effect of Cue Type by Participant. Sh.Ons = Shared onset cue (phonological cue), Assoc = Associated word cue, NonAssoc = Non associated word Cue. Each
panel represents the data from a single participant, showing their naming accuracy (across all trials in that Cue Type condition) as a boxplot. Accuracy values are
fitted values taken from a mixed effect model fit to the data with all three fixed effect predictors and random slopes for Cue Type by Participant: Accuracy ~ Cue
Type + Length Phonemes + Frequency + (0 + Cue Type | Participant). Note random intercepts for Participants were not included in this model, to illustrate a
slopes only model.

The ascendency of mixed models

LMMs have grown very popular in modern Psychology because they enable researchers to estimate (fixed) effects while properly taking into account the random variance associated with participant, items or other sampling units. From under 100 Pubmed citations in 2003, the number of articles referring to LMMs rose to just under 700 by 2013 (see Fig. 5), the starting year in our review of LMM practice. This popularity is associated with an increasing awareness of the need to use LMMs. However, the growth in popularity has been associated with a diversity among approaches that will incubate future difficulties. In simple terms, variation in current reporting practices will make meta-analysis or systematic review of findings near impossible.


[Fig. 2c here: the effect of Cue Type by Participant, plotted as deviations from each condition mean (panels: Shared Onset, Associated Word, Tone, Non associated word; x-axis: Participant A–J; y-axis: Deviation from Condition Mean (0)).]
Fig. 2c. Effect of Cue Type by Participant, as deviations from the condition mean (Random slopes for Cue Type by Participant). Each panel shows the values for a Cue
Type condition (Shared onset, Tone, Associated word, Non Associated word). In each panel, participant’s accuracy is scaled as standard deviations from the condition
mean (centered at zero). These are the Random Slopes by Participant. These are taken from a mixed effect model fit to the data with all three fixed effect predictors
and random slopes for Cue Type by Participant: Accuracy ~ Cue Type + Length Phonemes + Frequency + (0 + Cue Type | Participant). Note random intercepts for
Participants were not included in this model, to illustrate a slopes only model.

The present article examines the diversity in modelling practice and outlines the features of a reproducible approach in using and reporting mixed-effects models.

Historically, the dominant approach for repeated measures data in psychology has been to aggregate the observations. Typically, in Psycholinguistics, a researcher would calculate the mean latency of response for each participant, by averaging over the RTs of each stimulus, to get the average RT by-participants within a condition for a set of stimuli (e.g., per cue type, if our example were a naming latency study). In a complementary fashion, mean RTs for each stimulus would be calculated by averaging over the RTs of each participant, to get the average RT by-items within a condition (e.g., each cue type condition). The means of the by-participants or by-items latencies would be compared using Analysis of Variance (ANOVA) in, respectively, by-participants (F1 or F_s) or by-items (F2 or F_i) analyses. If s/he was seeking to correlate the average latency of responses by-items with variables indexing stimulus properties, or by-participants with variables indexing participant attributes, s/he would use multiple regression to estimate the effects of item or participant attributes on the averaged latencies. A series of analyses dating back over 50 years have shown that these approaches suffer important limitations (Baayen et al., 2008; Clark, 1973; Coleman, 1964; Raaijmakers, Schrijnemakers, & Gremmen, 1999).

As Clark (1973; after Coleman, 1964) noted, researchers seeking to estimate experimental effects must do so in analyses that account for random differences in outcome values both between participants and between items. The random differences can include by-participants or by-items deviations from the average outcome (e.g., fast or slow responding participants, see Fig. 1), or from the average slopes of the experimental effects (e.g., individual differences in the strength of an experimental effect, see Figs. 2–4). The presence of random differences in the intercept or in the slope of the experimental effect between-items meant, Clark (1973) observed, that the at-the-time common practice of using only by-subjects' ANOVAs to test differences between conditions in mean outcomes was likely to be associated with an increased risk of committing a Type I error. Such errors arise in Null Hypothesis Significance Testing (NHST) where the researcher calculates a test statistic (e.g., t corresponding to a difference between conditions) and compares its value with a distribution of hypothetical test statistics generated under the null hypothesis assumption of no difference. A p value indicates the proportion of test statistic values, in the null hypothesis distribution, equal to or more extreme than the test statistic calculated given the study data (Cassidy, Dimova, Giguère, Spence, & Stanley, 2019). Errors arise when a researcher rejects the null hypothesis when there is no substantial underlying difference in outcomes between conditions. Ignoring random variation in outcomes among stimulus items can mean that significant effects are observed and interpreted as experimental effects, when they are in fact due to uncontrolled variation amongst items (e.g., effects seen in by-participant average RTs are in fact driven by a 'fast' or 'slow' item influencing the means). This was termed the language-as-fixed-effect fallacy.

Clark's (1973) remedy was to calculate F1 and F2 and then combine them into a quasi-F ratio (minF') that afforded a test of the experimental effect incorporating both by-participants and by-items error terms.

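To make the traditional procedure and Clark's remedy concrete, the sketch below assumes a trial-level data frame d with columns rt, condition, participant and item (our names, not the paper's); the minF' value and its denominator degrees of freedom follow the approximation given by Clark (1973).

# By-participants (F1) analysis: aggregate over items, then run a repeated
# measures ANOVA (participant should be coded as a factor).
f1_data <- aggregate(rt ~ participant + condition, data = d, FUN = mean)
summary(aov(rt ~ condition + Error(participant/condition), data = f1_data))
# The by-items (F2) analysis repeats this with 'item' as the aggregation unit.

# Clark's (1973) quasi-F ratio, combining the two F values:
min_f_prime <- function(F1, F2, df_err1, df_err2) {
  value    <- (F1 * F2) / (F1 + F2)
  df_denom <- (F1 + F2)^2 / (F1^2 / df_err2 + F2^2 / df_err1)
  c(minF = value, df_denominator = df_denom)
}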

[Fig. 3 here: (a) the average (group) effect of Length in Phonemes on accuracy (model fit); (b) the effect of Length in Phonemes by Participant; (c) Participant intercepts and slopes for Length, as deviations from the mean (0).]

Fig. 3. Illustrations for Participant Intercepts and Slopes for Length in Phonemes. Accuracy values are taken from a mixed effect model fit to the data with all three
fixed effect predictors, random intercepts by Participant and correlated random slopes for Length by Participant: Accuracy ~ Cue Type + Length
Phonemes + Frequency + (1 + Length Phonemes | Participant). a shows the average effect of Length in Phonemes, a negative slope showing that words that are
longer are harder to name. b shows the effect of Length for each individual participant (steeper or shallower slopes) and the overall differences in accuracy between
participants (higher or lower intercepts). c shows the Random Intercepts and Slopes for Length. In c the left panel shows the Participant Intercepts, scaled as
deviations from the grand mean Intercept (as in b). The right panel of c shows the Participant Slopes for the effect of Length scaled as deviations from the average
effect of Length.

Analyses have shown that minF' analyses perform well in the sense that Type I errors are committed at a rate corresponding to the nominal .05 or .01 significance threshold (Baayen et al., 2008; Barr, Levy, Scheepers, & Tily, 2013). However, such analyses suffer from two critical limitations. Firstly, use of the approach is restricted to situations where data have been collected in a balanced fashion across the cells of the experimental design. Most researchers know that balanced data collection is rare. Experimenters can make mistakes and observations are missed or lost. Participants make errors and null responses may be recorded. Perhaps critically, in practice, Raaijmakers et al. (1999; Raaijmakers, 2003) showed how the use of minF' declined and was replaced by the reporting of separate F1 and F2 analyses, despite the associated risk of elevated Type I error rates (see also Baayen et al., 2008).

The minF', F1 and F2 analyses are also restricted to situations where data have been collected according to a factorial design. That is, comparing outcomes recorded for different levels of a categorical factor or different conditions of an experimental manipulation. However, researchers often seek to examine the relationships between continuous outcome and continuous experimental variables. Cohen (1983) demonstrated that the cost of dichotomizing continuous variables is to substantially reduce the sensitivity of analyses. This may be especially important where the relationship between outcome and experimental variables cannot be assumed to take a monotonic function (Cohen, Cohen, West, & Aiken, 2003). In such circumstances, researchers have tended to estimate the effects of continuous experimental variables using multiple regression, e.g., predicting by-item mean reading response latencies from a set of predictors capturing different word properties (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004). However, Lorch and Myers (1990) demonstrated that such by-items regression analyses reverse the language-as-fixed-effect problem by failing to take into account random between-participants differences.

Lorch and Myers (1990) recommended that the researcher conduct a two-step analysis, firstly, conducting a regression analysis separately for each participant, e.g., predicting a participant's response latencies from variables indexing stimulus properties and then, secondly, conducting an analysis of the per-participant coefficient estimates. This approach, sometimes known as slopes-as-outcomes analysis, has been used in some highly cited experimental Psychology studies (see examples by Balota et al., 2004; Kliegl, Nuthmann, & Engbert, 2006; Zwaan, Magliano, & Graesser, 1995) though perhaps more often in educational and other areas of social science research (see, e.g., citations of Burstein, Miller, & Linn, 1981; see discussion in Kreft & de Leeuw, 1998). However, the approach does not take into account variation between participants in the uncertainty about coefficient estimates (e.g., if one participant has fewer observations than another).

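A minimal sketch of this two-step, slopes-as-outcomes procedure, using the same assumed trial-level data frame d as above (the predictor name frequency is again illustrative):

# Step 1: fit a separate regression for each participant.
per_ppt_fits <- lapply(split(d, d$participant),
                       function(x) lm(rt ~ frequency, data = x))
slopes <- sapply(per_ppt_fits, function(fit) coef(fit)["frequency"])

# Step 2: analyse the per-participant coefficients, e.g., test them against zero.
t.test(slopes)

# By contrast, a mixed-effects model such as
#   lmer(rt ~ frequency + (1 + frequency | participant), data = d)
# estimates the per-participant slopes with partial pooling, weighting each
# participant by how much data they contribute.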

[Fig. 4 here: (a) the average (group) effect of Frequency on accuracy (model fit); (b) the effect of Frequency by Participant; (c) Participant intercepts and slopes for Frequency, as deviations from the mean (0).]

Fig. 4. Illustrations for Participant Intercepts and Slopes for Frequency. These figures parallel those seen in Fig. 3. Accuracy values are taken from a mixed effect
model fit to the data with all three fixed effect predictors, random intercepts by Participant and correlated random slopes for Frequency by Participant: Accu-
racy ~ Cue Type + Length Phonemes + Frequency + (1 + Frequency | Participant). a shows the average effect of Frequency, a positive slope showing that words
with higher Frequency are easier to name. b shows the effect of Frequency for each individual participant (steeper or shallower slopes) and the overall differences in
accuracy between participants (higher or lower intercepts). c shows the Random Intercepts and Slopes for Frequency. In c the left panel shows the Participant
Intercepts, scaled as deviations from the grand mean Intercept (as in Fig. 1b). The right panel of Fig. 3c shows the Participant Slopes for the effect of Frequency scaled
as deviations from the average effect of Frequency.

That is, in a two-step analysis it is not possible to distinguish variation between per-participant coefficients and error variance (Snijders & Bosker, 2011). As well as avoiding the language-as-fixed-effect fallacy, LMMs are also a solution to the limitations of slopes-as-outcomes analyses, as they 'shrink' – or pool – estimates towards sampling unit means (e.g., participant means) when there are fewer data points for that grouping (e.g., more missing data points for a participant, Gelman & Hill, 2007).

Introductions to LMMs (e.g., Snijders & Bosker, 2011) often discuss random differences between sampling units (e.g., between participants) either as error variance that must be controlled, or as phenomena of scientific interest (e.g., Baayen et al., 2008; Bolker et al., 2009; Gelman & Hill, 2007; Kliegl, Wei, Dambacher, Yan, & Zhou, 2011). Either way, LMMs allow this variation to be modelled by the experimenter as random effects. This means specifying that the measured outcome deviates, per sampling unit, from the average of the data set (random intercepts, see Fig. 1) or from the average slope of the experimental or covariate effect of interest (random slopes, Figs. 2–4). Random intercepts and random slopes variance estimates can tell us how much of the overall error variance is accounted for by variation between sampling units, e.g., the differences in overall RT between participants or between items. They can also tell us what the estimated difference is for a given sampling unit, e.g., by how much does a participant's overall RT differ from the grand mean RT?

It is worth highlighting that if these systematic differences in hierarchically structured data-sets are not properly accounted for, then false positive results become worryingly high (e.g., a Type I error rate as high as 80%: Aarts, Verhage, Veenvliet, Dolan, & van der Sluis, 2014; see also Clark, 1973; Rietveld & van Hout, 2007) and the power of summary statistics to detect experimental effects is reduced (Aarts et al., 2014). More generally, an analysis that fails to account for potential differences between sampling units in the slopes of experimental variables can mis-estimate the robustness of observed effects (Gelman, 2014). For example, one half of participants may show an effect in a positive direction and half show an effect in a negative direction. If this variation is not captured, the estimated direction of the average effect across all participants can be misleading (for an excellent exploration of this, see Jaeger, Graff, Croft, & Pontillo, 2011). Given these numerous analytic advantages, LMMs have been rapidly adopted, particularly in subject areas such as psycholinguistics (Baayen et al., 2008; Baayen, 2008).

So what is the problem?

The problem for researchers is that there are multiple analytic decisions to be made when using LMMs. This issue is not new to their advent in experimental Psychology. Simmons, Nelson, and Simonsohn (2011) demonstrated the decisive impact on results of 'researcher degrees of freedom'.


[Fig. 5 here: line plot of the number of Pubmed citations per year, 2000–2018.]
Fig. 5. Number of Pubmed citations for ‘Linear Mixed Models’ by year. Generated using the tool available at http://dan.corlan.net/medline-trend.html, entering
“Linear Mixed Models” as the phrase search term and using data from 2000 to 2018.

Silberzahn and Uhlmann (2015) showed that the same data can reasonably be analysed in a variety of different ways by different research groups. Neither demonstration depended on the use of LMMs. The proliferation of alternate findings that arise from variation in choices at each point in a sequence of analytic decisions is crystallized by Gelman and Loken (2013) in the metaphor 'the garden of forking paths'. Multiple analytic steps make variation in observed results more likely, even when reasonable assumptions and decisions have been made at each step (Gelman & Loken, 2013; Silberzahn & Uhlmann, 2015). The same concerns have arisen in fields other than experimental Psychology, for example, following the rapid expansion in neuroimaging studies in which complex analyses with multiple analytic steps are the norm (Carp, 2012a, 2012b; Poldrack & Gorgolewski, 2014; Wager, Lindquist, & Kaplan, 2007). Thus, this paper reports on the use of LMMs in the context of ongoing concerns regarding statistical best practices across the cognitive and neurosciences (e.g., Carp, 2012a, 2012b; Chabris et al., 2012; Cumming, 2013a, 2013b; Ioannidis, 2005; Kriegeskorte, Simmons, Bellgowan, & Baker, 2009; Lieberman & Cunningham, 2009; Pashler & Wagenmakers, 2012; Simmons et al., 2011; Vul, Harris, Winkielman, & Pashler, 2009). As we shall report, the decisions that researchers must make when conducting LMMs appear to be associated with a heightened sense of uncertainty and insecurity.

Researchers' concerns may stem, in part, from the fact that the rapid adoption of LMMs has not been complemented by the adoption of common standards for how they are applied and, critically, how they are reported. There are many excellent introductory texts available for LMMs (e.g., Baayen et al., 2008; Baayen, 2008; Bates, 2007; Bolker et al., 2009; Bryk & Raudenbush, 1992; Gelman & Hill, 2007; Goldstein, 2011; Hox, 2010; Judd, Westfall, & Kenny, 2012; Kreft & de Leeuw, 1998; Pinheiro & Bates, 2000; Snijders & Bosker, 2011; see also Appendix 3). The caveat here is that even some texts that are designed to be introductory require a higher level of mathematical literacy than is required for or delivered by a majority of undergraduate psychology courses (e.g., fluency in linear and matrix algebra). It is also not clear how many undergraduate courses teach LMMs. Therefore, students may be required to read research papers that they are not equipped to understand. It may be feared that educational resources are sufficient to motivate the use of LMMs but are not sufficient to enable their appropriate application by researchers. Established researchers may baulk at the time needed to undergo retraining in software applications and statistics, and to have to allocate more time in the future as software and analytic practices update.

The development of appropriate training for current or developing researchers is an important concern for the future but we are optimistic that this challenge can be met over time. There are a growing number of LMM tutorials available for different disciplines which include examples and technical descriptions of software use (Baayen et al., 2008; Brauer & Curtin, 2018; Brysbaert, 2007; Chang & Lane, 2016; Cunnings, 2012; Field & Wright, 2011; Field, Miles, & Field, 2009; Jaeger, 2008; Kliegl, 2014; Magezi, 2015; Murayama, Sakaki, Yan, & Smith, 2014; Rabe-Hesketh & Skrondal, 2012; Rasbash et al., 2000; Schluter, 2015; Th. Gries, 2015; Tremblay & Newman, 2015; West & Galecki, 2011; Winter, 2013). From the authors' own experiences, as interested but not mathematically expert readers, the most friendly and relevant tutorials for language researchers can be found in Brysbaert (2007), Cunnings (2012) and Winter (2013). Once the reader is comfortable, we strongly recommend the recent paper by Brauer and Curtin (2018).

The trouble is not that researchers are not doing what experts advise but, rather, it is the ways in which researchers have responded to the evolution of recommendations in what is, in part, a methodological field with active areas of development. Critically, the literature on LMMs is fairly consistent in terms of recommendations for best practice but there has been some diversity in the guidance available to researchers (e.g., compare Barr et al., 2013; Bates, Kliegl, Vasishth, & Baayen, 2015; Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017). Thus, a critique by Barr et al. (2013) of the application of relatively simple mixed-effects models, including random intercepts but not random slopes, led to a wider sense of unease about the replicability of previously published results. It appeared to many that the frequent reporting of findings from LMMs including just random intercepts would be associated with an inflated risk of false positives.

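To make the contrast between these recommendations concrete (the counter-argument is taken up in the next paragraph), the two positions can be sketched in lme4-style formulas for a design with one within-participant, within-item condition; the variable names are assumed, and the parsimonious variant shown is only one of several simplifications discussed in that literature.

# "Maximal" random effects structure in the spirit of Barr et al. (2013):
m_maximal <- lmer(rt ~ condition + (1 + condition | participant) +
                    (1 + condition | item), data = d)

# A more parsimonious structure, dropping intercept-slope correlations via
# the double-bar syntax (note: in lme4 this fully separates the variances
# only for numeric predictors):
m_parsimonious <- lmer(rt ~ condition + (1 + condition || participant) +
                         (1 + condition || item), data = d)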

However, latterly, it has been argued that the risk of false positives must be balanced with the risk of false negatives through the inclusion of a parsimonious selection of random effects (Matuschek et al., 2017). This apparent diversity in recommendations could be a source of the uncertainty in approach or diversity in practice that (to anticipate) our observations uncover. But we would read the succession of publications as marking a progression in our understanding of the most useful application of mixed-effects models. Many agree that LMMs are appropriate to many experimental data analysis problems. Many assume that random effects should be incorporated in the LMMs that are fitted. As we will argue, the key issue is that since LMMs are an explicit modelling approach, they require a different attitude than has been ingrained, perhaps, through the long tradition of the application of ANOVA to the analysis of data from factorial design studies.

What we hope to make clear is that there is no single correct way in which LMM analyses should be conducted, and this has important implications for how the reporting of LMMs should be approached. Researchers will, quite reasonably, be guided in their approach to analyses by the research question, the structure of the data as it arises from study design and the process of observation, and the constraints associated with the use of statistical computing software. The problem is that variation in practice – especially reporting practice – can have a direct and damaging impact on our ability to aggregate data and to accumulate knowledge. Replicability and reproducibility are critical for scientific progress, so the way in which researchers have implemented LMM analysis must be entirely transparent. We also hope that the sharing of analysis code and data becomes widespread, enabling the periodic re-analysis of raw data over multiple experiments as studies accumulate over time.

Present study

We examine the diversity in practices adopted by different researchers when reporting LMMs, and the uncertainty that that diversity appears to engender. We completed a survey of current LMM practice in psychology. This consisted of two parts, a questionnaire sent out to researchers and a review of papers that used LMM analyses. We found widespread concern and uncertainty about the implementation of LMMs alongside a range of reporting practices that frequently omitted key information. The survey demonstrates the assimilation of a data analysis method in our discipline in 'real time'. To address these concerns, we present a set of best practice guidance with a focus on clear and unambiguous reporting of mixed-effects model analysis.

Questionnaire

Method

Participants
163 individuals completed the questionnaire: 94 females, 63 males and 6 who did not disclose their gender. Mean age was 36 years (standard deviation, SD = 9.26, range = 23–72). Just under 40% of respondents reported their discipline as Psycholinguistics, 16% Linguistics, 11% Psychology, 5% Cognitive Science/Psychology, 4% Language Acquisition and 3% Neuroscience; 15% of individuals reported more than one discipline. A number of other disciplines were reported by individuals (e.g., Anthropology, Clinical Psychology, Reading). Mean number of years working in a given discipline was 10.38 (SD = 8.16, range = 0.5–30). Data on academic position and institution can be found in Table 1. We recognize that this sample is biased towards those already using mixed-effects models (rather than reading about them), and this was reflected in the high proportion who stated they already used them (see Section 2.2.1 below) and the small number who stated they were planning to use mixed-effects models (Section 2.2.3).

Table 1
Reported position and institution type for questionnaire respondents (% of total).

Position                                      %       Institution                 %
Undergraduate                                 0.00    University UK               25.77
Postgraduate MSc                              1.23    University Other            59.51
Postgraduate PhD                              24.54   Research Institute UK       0.61
Postdoctoral researcher                       26.38   Research Institute Other    9.82
Lecturer/Assistant Professor                  24.54   Institution Other           4.29
Reader/Senior Lecturer/Associate Professor    11.04
Professor                                     9.20
Other                                         3.07

Design and procedure
A qualitative questionnaire was used, with both open and closed questions. Ethical approval for the study was granted by the University of Reading School of Psychology & Clinical Language Sciences Research Ethics Committee. The online questionnaire was implemented in LimeSurvey (LimeSurvey Project Team/Schmitz, 2015). Individuals were invited to complete the questionnaire via email lists and personal emails to academic contacts of the authors, with a request to forward on to any interested parties. All responses to the questionnaire were anonymous. The questionnaire began with a brief introduction to the study. Consent was provided by checking a box to indicate agreement with a statement of informed consent.

The questionnaire elicited answers to questions focusing on the use and reporting of Linear Mixed-effects Models (LMMs). Appendix 1 provides the full questionnaire. Questions covered demographic information, use and reporting of LMMs, challenges encountered when using LMMs, and concerns respondents had for their own research and their field.

A period of approximately one month was allowed for responses to be collected. Data collection was stopped once we had reached the current sample size, as the sign-up rate to complete the questionnaire had slowed. The sample size was judged adequate for our purposes (frequency and thematic analysis of question responses) and we judged that a substantial increase in numbers was unlikely if we left more time.

Analysis
The complete data can be found at osf.io/bfq39 – Files – Mixed models survey results_analysis.xlsx. For closed questions, the percentage of responses falling into a given category was calculated. For open-ended questions, thematic analysis was completed to identify the most common responses across individuals (Braun & Clarke, 2006). Individual responses to each question (e.g., challenges to using LMMs) were collated as rows in a spreadsheet and given a thematic label to code the response (e.g., software, convergence, lack of standard procedures etc.). Responses were then reviewed and sorted, combining responses that fell into the same thematic label. We were interested in reporting the most common responses, so the total number of responses that fell into a given theme was counted as a percentage of the total responses to that question. For questions where categorical responses were made (e.g., reporting software used, listing training and resources), we generated lists of unique responses and the frequency (% of total) with which each one was reported. The results of the above analyses are presented together.

Results

Usage of mixed-effects models
The great majority of respondents (91%) had used mixed-effects models for data analysis. The mean year of first using LMMs was 2010 (SD = 3.94 years, range = 1980–2014). We asked respondents to estimate how often they used mixed-effects models; the mean was 64% of data analysed (SD = 31%, range = 0–100).


Training & software
The majority of respondents had attended a workshop, course or training event to learn mixed-effects models (68%), 30% had learnt from colleagues, 21% from internet resources, 12% using specific books or papers, 10% were self-taught and 9% had learnt from a statistics advisor or mentor. Appendix 2 provides a comprehensive list of the specific authors, papers, books, websites and other resources used by respondents. Readers may find this useful for their own training needs.

The majority of individuals used the statistical programming language and environment, R (90%) (R Core Team, 2017), with 20% mentioning the lme4 package (Bates, Maechler, Bolker, & Walker, 2015). Other named R packages were gamm4 (Wood & Scheipl, 2016), languageR (Baayen, 2013), lmerTest (Kuznetsova, Brockhoff, & Christensen, 2016), mgcv (e.g., Wood, 2011), and nlme (Pinheiro, Bates, DebRoy, Sarkar, & R Core Team, 2016). The next most frequently used software was SPSS (8%; IBM Corp, 2013). A number of other software applications were named by one or two people: MLwiN (Rasbash, Charlton, Browne, Healy, & Cameron, 2009), Matlab (MATLAB and Statistics Toolbox Release 2012b), Mplus (Muthén & Muthén, 2011), Stan (Stan Development Team, 2016), JASP (JASP Team, 2016), S-PLUS (e.g. Venables, 2014), SAS and ESS (Rossini, Heiberger, Sparapani, Maechler, & Hornik, 2004).

Planned use
For individuals who had not yet used mixed-effects models (15 respondents), 11 were planning to use them and four were not. For those planning to use them, reasons included exploration of a larger number of predictor variables (5 responses), to look at change over time or longitudinal data (2 responses) and for better statistical practice (e.g., control of individual differences, inclusion of random effects; 2 responses).

Challenges to using mixed-effects models
The most frequently reported concern was a lack of consensus or established, standardized procedures (26% of respondents; e.g. "it's quite difficult… to understand what standard practice is"). Related to this were responses that described a lack of training or clear guidelines for analysis, interpretation or reporting results (13%; e.g., "minimal training/knowledge available in my lab", "Presentation of data for publication") and the relative novelty of the analysis (7%; "it is relatively new so recommended practices are in development and not always fully agreed upon").

A number of responses highlighted a lack of knowledge (18%; e.g., "I do not know enough about them", "some reviewers request these models but researchers are not all skilled in these techniques", "complex math behind it not easy to grasp", "not enough people who know it"). A broad challenge in applying conceptual knowledge was seen in responses covering difficulties in selecting or specifying models (25%; e.g., "Model specification - knowing what to include and what not to include"), models which fail to converge or in which assumptions are violated (14%; e.g., "How to deal with models that fail to converge"), understanding and interpreting random effects structures (16%; e.g., "Determining what constitutes an appropriate slope term in the random effects structure"), identifying interactions (4%; "Working out significant interactions") or interpreting results generally (7%; "difficulty in interpreting the results"). Other specific points included models being overly flexible or complex (e.g., "The potential complexity of the models that goes substantially beyond standard procedures", "Mixed models are so flexible that it can be difficult to establish what is the best suited model for a given analysis") and challenges in checking and communicating model reliability (e.g., "Knowing how to test whether a model violates the assumptions of the specific model type being used"). The most frequently reported concerns are reported in Table 2.

Table 2
Most frequently reported challenges and concerns in using LMMs.*

Reported challenge                                                          %
Lack of standardized procedures                                             26
Selecting and specifying models                                             25
Researcher reports lack of knowledge                                        18
Understanding and interpreting random effects                               14
Lack of training/guidelines for analysis, interpretation and reporting     13
Use of new and unfamiliar software                                          12

General concern over use of LMM for own analysis                            75
Reporting results                                                           15
Model selection                                                             14
Learning and understanding analysis                                         14
Lack of established standards                                               11

General concern over use of LMMs for discipline                             73
Lack of standards                                                           23
LMMs used when not fully understood                                         23
Misuse of models                                                            17
Reporting is inconsistent and lacks detail                                  17
Peer review of LMMs is not robust                                           10

* Identified by thematic analysis.

Technical challenges were highlighted, specifically the use of new or unfamiliar software (12%; e.g., "software package (R) I was unfamiliar with") and the reliability of analysis code (e.g., "Some of the code might also not be reliable. For example, people reported differences when running the same analysis in different versions of the same software"). A number of individuals reported specific difficulties with model coding and fitting (e.g., coding of categorical variables, setting up contrasts, structuring data appropriately, forward and backward model fitting and post-hoc analyses).

A number of responses reflected unease at the shift from traditional factorial designs and ANOVA or other inferential statistical tests (e.g., "[lack of] convincing evidence that mixed models provide information above and beyond F1 and F2 tests"). For example, susceptibility to p-value manipulation or difficulties in establishing p-values (4%; "too many people still believe that we are fishing for p-values if we do not use classical anovas"), knowing how to map models onto study design (4%; "Knowing when it's appropriate to use them", "to understand the influence on future study designs"), difficulties with small samples, sparse data and calculating effect sizes or power.

Concerns using mixed-effects models for own data and in the wider discipline
Around 75% of respondents had concerns over using LMMs in their own data analysis. For these respondents, the most salient concerns were reporting results (15%; e.g., "Do you report your model selection criteria and if so, in what level of detail… perhaps several models fail to converge before you arrive at one that does?"), selecting the right model (14%; e.g., "model selection"), learning how to do the analysis and fully understanding it (14%; e.g., "I do not have enough knowledge to correctly apply the technique"), a lack of established standards (11%; e.g., "the lack of standardized methods is a problem"), models that do not converge (9%; e.g., "How to deal with convergence issues") and the review process when submitting LMM analysis for publication (9%; e.g., "experimental psychology reviewers are often suspicious of them"). Other concerns broadly reflected those already identified as challenges above. See Table 2 for the most frequently reported concerns.

Around 73% of respondents had concerns over the use of LMMs in their discipline or field. Here, the key concerns were a lack of standards (23%, e.g., "lack of established standards"), use of models without them being fully understood (23%; e.g., "Overzealous use of random effects without thinking about what they mean"), frank misuse of models (17%; e.g., "Misapplication of mixed models by those not at the forefront of this area"), reports of model fitting being inconsistent and not detailed enough (17%; e.g., "not describing the analysis in enough detail"), a lack of familiarity and understanding of the models (10%; e.g., "lack of knowledge about their implementation") and the review process not being robust (10%; e.g., "Reviewers often can't evaluate the analyses"). Additional concerns were over researchers being able to misuse the flexibility of mixed-effects models (5%; e.g., "increased 'researcher degrees of freedom'") or "p-hack" the data (3%; e.g., "It's easier to p-hack than an ANOVA"), and the breadth of approaches to making decisions during model fitting (4%; e.g., "The variety of approaches people take for deciding on model structure"). There was also concern over why LMMs were deemed better than factorial ANOVA approaches (3%; e.g., "Why are they privileged over simpler methods?") and that it was difficult to compare them against these traditional approaches (2%; e.g. "less accessible to readers/reviewers without experience… than traditional analyses"). See Table 2.

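Several of the difficulties quoted in this section, convergence failures in particular, have common first-line responses in lme4; the sketch below is illustrative (model and variable names assumed) rather than a recommendation drawn from the survey itself.

# When a fit produces convergence warnings, typical first steps include
# trying a different optimizer or raising the iteration limit...
m <- lmer(rt ~ condition + (1 + condition | participant), data = d,
          control = lmerControl(optimizer = "bobyqa",
                                optCtrl = list(maxfun = 1e5)))
# ...or simplifying the random effects structure (dropping correlations or
# slope terms), as several respondents described doing.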

misuse the flexibility of mixed-effects models (5%; e.g., “increased ‘researcher degrees of freedom’”) or “p-hack” the data (3%; e.g., “It's easier to p-hack than an ANOVA”), and the breadth of approaches to making decisions during model fitting (4%; e.g., “The variety of approaches people take for deciding on model structure”). There was also concern over why LMMs were deemed better than factorial ANOVA approaches (3%; e.g., “Why are they privileged over simpler methods?”) and that it was difficult to compare them against these traditional approaches (2%; e.g., “less accessible to readers/reviewers without experience… than traditional analyses”). See Table 2.

Current practice

For respondents who were currently using mixed-effects models, 70% did not specify variance-covariance structures for the models. On reflection, participants may not have understood this question given that it was not accompanied by an explanation of these terms. We asked people to provide a typical model formula from their analyses. Two individuals stated that they used SPSS, and therefore did not specify model formulae. Of those who did provide an example, only three explicitly mentioned model comparison and model checking. See Table 3 for a summary of random effects from model examples. 100% specified random intercepts for subjects/participants and 92% specified random intercepts for items/stimulus materials or trials. Random slopes to allow the effect of interest to vary across subjects and/or items were less common (62%).

When included, random slopes were often qualified on the basis of experimental design and only included when appropriate for the data structure (e.g., random slopes for within-subject factors; Barr et al., 2013). Where multiple predictor factors were included, interactions between factors for random slopes were typically included. It is notable that some respondents stated that they did not include interaction terms for random slopes, excluded these first if the model failed to converge, or simplified random effects until the model converged. Some removed the modelling of correlations between random effects for the same reason. See Table 3.
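To make these descriptions concrete, the sketch below shows, in lme4 syntax with simulated data and hypothetical variable names (rt, condition, subject, item), the random effects structures respondents most often described: intercepts for subjects and items, a by-subject slope for the effect of interest, and the double-bar syntax some respondents used to drop correlations between random effects when simplifying a model.

    # A sketch only: simulated data with hypothetical names
    library(lme4)
    set.seed(1)
    dat <- data.frame(
      subject   = factor(rep(1:30, each = 20)),
      item      = factor(rep(1:20, times = 30)),
      condition = rep(c(-0.5, 0.5), length.out = 600)
    )
    dat$rt <- 500 + 30 * dat$condition + rnorm(600, sd = 50)

    # Random intercepts for subjects and items (specified by 100% and 92% of respondents)
    m1 <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = dat)

    # Adding a by-subject random slope for the effect of interest (62% of respondents)
    m2 <- lmer(rt ~ condition + (1 + condition | subject) + (1 | item), data = dat)

    # Dropping the intercept-slope correlation, as some respondents did when models
    # failed to converge (the double-bar syntax assumes a numeric predictor)
    m3 <- lmer(rt ~ condition + (1 + condition || subject) + (1 | item), data = dat)

Fitted objects can then be inspected with summary(), which reports the fixed effect estimates alongside the random effect variances.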
Comparison to traditional approaches

Around 61% of respondents had compared the results of LMM analyses to the results of more traditional analyses (i.e. ANOVA or other factorial inferential statistics; 15% responded N/A). Of those, 33% reported that results had been comparable, 46% reported that results were not comparable and 21% responded with N/A. An open question asked for respondents’ evaluation of this comparison. The most frequent response was that results were comparable (26%; e.g., “Largely methods correspond to each other”). A number of responses identified that mixed-effects models were preferred or gave a better, more detailed fit to the data (28%; e.g., “I think we got a better fit for our data using LMEs instead of the traditional ANOVAs/Regression models”). However, it is not clear whether results were comparable in terms of the size of numeric effects or coefficients. Responses instead focused on whether results were significant. LMMs were reported to be more sensitive/less conservative, demonstrating significance for small effects (16%; e.g., “differences can occur if effects are just above or below p = .05”, “mixed models seems less conservative than for example (repeated measures) anova”). However, mixed-effects models were also found to be more conservative, depending on how the random effects structures were specified (8%; e.g., “Mixed models are typically more conservative, but not always”). Traditional F1/F2 tests were sometimes used to confirm or interpret effects in the mixed-effects models (4%; “I look if both analyses point to the same effects of the experimental manipulations”) and in one instance F1/F2 tests were reported to be “much easier and less time-consuming” than LMMs. See Table 3.

Table 3
Current practice.

  Current practice                                      %
  Do you specify variance-covariance structures?
    Yes                                                 30
    No                                                  70
  Random effect structures from model examples:
    Random intercepts for subjects                      100
    Random intercepts for stimuli/trials               92
    Random slopes for effect to vary across subjects   62
  Comparison to factorial analysis (ANOVA)
  Do you compare LMMs to factorial analysis?
    Yes                                                 61
    No                                                  24
    N/A                                                 15
  Were results comparable?
    Yes                                                 33
    No                                                  46
    N/A                                                 21
  How do LMMs compare to factorial analysis?*
    LMM are better fit to data                          28
    Largely comparable                                  26
    LMMs are more sensitive/less conservative           16
    LMMs are more conservative                          8

  * Identified by thematic analysis.

Reporting & preferred reporting

Respondents were asked for their typical practice when reporting models; this question was multiple choice, and a summary of responses is given in Table 4. The vast majority reported p-values and model fitting (88% and 80% respectively), but other options were chosen much less often: model likelihood was reported by 50% of respondents; confidence intervals by 37%; specification/reporting of model iterations by 36%; and F-tests between models by 31%.

For preferred reporting format, the majority were in favour of a table (53%), followed by written information in the text (19%) and then plots (15%). The main reasons for selecting tables were ease of reading and clarity. Written text could provide details and facilitate interpretation. Plots were deemed important for more complex models and to visualize the model structure. Some individuals stated that reporting format should depend on the data and model complexity (7%).

Table 4
Current practice in reporting mixed models (% total).*

  What is reported       % Yes   % No
  p-values               88      12
  Model fitting          80      20
  Likelihood             49      51
  Confidence intervals   37      63
  Iterated models        36      64
  F-tests                31      69

  * Ordered by frequency of response high to low, rounded up to nearest %; 147 responses. Respondents were asked simply to indicate whether they reported model fitting and model likelihood; for detailed discussion of these parts of LMM analysis see Sections 4.1.4 to 4.1.6.

Sharing of code and data

We asked respondents to state whether they would share data and code, with 70% responding that they would share both (e.g., “Yes. Science should be open in its practice”). Table 5 summarizes the responses. Some respondents specified that they would share data only after publication, on request, after an embargo or when they had finished using it (9%; e.g., “I would be willing to share data on personal request”). Reasons for sharing included being open and transparent or a duty to share work that had been publicly funded (e.g., “yes, always. No-brainer: tax-payer-funded scientist”). A number of respondents identified a general benefit to the field and to improve standards. For example, to contribute to meta-analysis or further data exploration
(e.g., “… to facilitate additional research and replication of previous results. This data would also be extremely helpful for meta-analyses and for future research to be able run power analyses based on previous findings”). Analytic rigour was also mentioned, for example having a more open discussion about how models are used, checking model fitting, correcting errors, and having more experienced people look at the data (e.g., “We definitely need transparency and standards here because most of us are not statisticians”). Around 3% would not share data and 3% were unsure. Reasons included not wanting to be ‘scooped’, and being unsure if data sharing was allowed on ethical grounds. One respondent asked “Why should I share my data?”.

Around 15% responded that they would share code, with no statement about data sharing. Reasons for sharing code included it being good practice and good for learning, as well as comparing analyses (e.g., “Good practice, other researchers can look at what you did and learn something, or point out errors”, “I think it is helpful to share code. This will hopefully lead to a more open discussion of the choices we all make when doing this type of analyses”). Two individuals stated that they would not share code due to their inexperience. A few respondents mentioned difficulty in sharing code that could often be ‘messy’ and that it would be time consuming to prepare code for publication.

We asked respondents to state if they would like to access data and code in published reports. Around 74% would like access to both, with a further 6% specifying yes but that they would be unlikely to use it. Reasons for accessing were broadly similar to those identified above, with mention of transparency, improved standards, for learning, for meta-analysis, analytic rigour and checking reported data. Some respondents reported that current data sharing practices were already sufficient (e.g., sharing data on request, depositing in centralised archives, e.g., “Doesn't have to be in published reports. Can be in a database accessed via the publisher or institute”), or that this was a wider issue and not specific to LMMs (e.g., “I don't see the access to data and code being a mixed effects specific issue. This is for any paper, regardless of the statistical technique used”). A smaller number specified that they would like access to code (9%) with no statement about data. Some respondents qualified that data and code should be part of supplementary materials or a linked document, rather than in the publication itself. Finally, a few people did not want access to code (3%), data (2%) or both (3%), or they were unsure (2%). See Table 5.

Table 5
Sharing of code and data.*

  Would you share data and code?                   %
    Share both data and code                       70
    Share code                                     15
    Specified sharing of data after publication    9
    Would not share either                         3
    Unsure                                         3
  Would you like access to data and code?          %
    Access to both                                 74
    Access to both but unlikely to use it          6
    Access to code                                 9
    Did not want access to either                  3
    Did not want access to code                    3
    Did not want access to data                    2
    Unsure                                         2

  * Identified from thematic analysis.

Discussion

Most respondents had concerns over the use of LMMs in their own analyses and in their discipline more widely. Concerns were driven by the perceived complexity of LMMs, with responses detailing a lack of knowledge (own knowledge, that of reviewers or other researchers). Our interpretation is that this knowledge deficit (perceived or real) drives the other concerns: namely, difficulties in learning and understanding the analysis process, and difficulties in building, selecting and interpreting LMMs. For some, these difficulties are compounded by having to learn about new software applications (for an overview of software applications and their comparability see McCoach et al., 2018). Software applications undergo changes and updates which may change the results of a fitted model, as illustrated in the grey literature around lme4 (e.g. internet discussion boards such as stackoverflow.com; Nava, 2017). Such back-end changes – typically not salient to the average psychology researcher – will add to the sense that mixed-effects models are complex and problematic. Respondents were concerned by not knowing what to report or how to report results from LMMs. This point feeds into reports of LMMs being received skeptically by reviewers, as inconsistent formatting and presentation of analyses will exacerbate difficulties in the review process. It may be that two individuals trained in LMMs complete analyses that are true to their original training, but which – for similar data – differ in implementation and are reported differently in publications. Given that reviewers are sampled from the community of active researchers, lack of knowledge in reviewers was also a concern. At present, we are using a method of analysis that the community feels is not well understood, not clearly reported and not robustly reviewed. Little wonder that some see it as overly flexible and yet another way of fishing for results.
Most researchers report p-values for model coefficients and some detail of model fitting for LMMs; fewer provide details of iterated models or likelihood comparisons between different models. This means that, in general, the number of decisions being made during model fitting and the process of model selection is not transparently reported in manuscripts. This lack of transparency should not be seen as deliberate obfuscation: most respondents were willing to share analysis code and data, and felt that it was important to do so.

Alternate choices taken at multiple analytic steps can foster the emergence of different results for the same data (Gelman & Loken, 2013; Silberzahn & Uhlmann, 2015), giving the impression of unprincipled flexibility. The rapid uptake of LMMs has been driven, in part, by the need to explicitly account for both subject- and item-related random error variance (Baayen et al., 2008; Brysbaert, 2007; Locker, Hoffman, & Bovaird, 2007), and part of the anxiety over model building arises when one moves from factorial ANOVA into LMMs (Boisgontier & Cheval, 2016). Although ANOVA and LMM share a common origin in the general linear model, they are very different in terms of execution. In LMMs, the analysis process is similar to regression (Bickel, 2007). A model equation for the data is specified and reliable analysis requires larger data sets (e.g., trial-level data or large samples of individuals, Baayen, 2008; Luke, 2017; Maas & Hox, 2004, 2005; Pinheiro & Bates, 2000; Westfall, Kenny, & Judd, 2014). Nested models may be compared or ‘built’ to find the best fit to the data. The process feels notably different to producing a set of summary statistics (e.g., averaging responses to all items for a subject), which are then put through a factorial analysis (such as ANOVA). Survey responses reflected this uneasy shift. Of respondents who had compared LMMs to ANOVA, a third found comparable results but nearly half found results that were not comparable. For those who had compared the two analyses, LMMs were reported to be a better fit to the data, but could be both more or less conservative, especially when effects were marginally significant under ANOVA. It is worth noting that LMMs are not a new level of complexity for statistics in cognitive science (compare: structural equation modelling, Bowen & Guo, 2011; growth curve modelling, Nagin & Odgers, 2010), especially when compared against advances in brain imaging analysis and computational modelling. However, the perceived complexity is demonstrated by survey responses repeatedly referring to a lack of knowledge and established standards.

The survey data clearly demonstrates that researchers are uncomfortable with the use of LMMs. This is despite a number of excellent texts (see Appendix 2, and references given in the Introduction) and an explosion of online tutorials and support materials. To evaluate whether there is a problem in how LMMs are actually implemented and
communicated, we completed a review of published papers using LMM analysis.

Review of current practice in use and reporting of LMMs

Our objective was to review current practice in the use and reporting of LMM/GLMMs in linguistics, psychology, cognitive science and neuroscience. This complements the survey by adding objective data on how LMMs are used and reported.

Method

We completed a review of published papers using LMM analysis, taking a sample rather than exhaustively searching all papers. This approach was chosen to make the review manageable. To start, the first author used Google Scholar to find papers citing Baayen et al. (2008), widely seen as a seminal article whose publication was instrumental to the increased uptake of LMM analysis (cited over 3500 times to date). To keep the review contemporary, papers were chosen from a four-year period spanning 2013, 2014, 2015 and 2016. Papers had to be in the field of language research, psychology or neuroscience (judged on the basis of title, topic and journal). From each year, the first 100 citations fitting the above criteria were extracted from Google Scholar, when limited by year, giving 400 papers in total. The first search was completed on 30th May 2017, giving a total of 3524 citations for Baayen et al. (2008) with 2360 citations between 2013 and 2016. Therefore, we sample ~17% of the papers fitting our criteria, published in that four-year period. Sixteen papers were excluded as they did not contain an LMM analysis (e.g., citing Baayen et al. in the context of a review, or when referring to possible methods). One paper was not accessible. Three papers were initially reviewed to establish the criteria for classifying papers, with an Excel spreadsheet created with a series of drop-down menus for classification. To check coding and classifications, the second author looked at one reported model from 80 papers (20% of the total papers coded; 20 papers from each year). Initial agreement was 77%, with differences resolved by discussion. The spreadsheet with all the data and classifications from the review can be found here (https://osf.io/bfq39/; Files – Baayen Papers Rev with coding check.xlsx). Classification criteria are summarized in Table 6, and a fuller description of these can be found in Appendix 3.

Table 6
Classification criteria for review and associated data table.

  Field / Topic: Psychology; Linguistics & Phonetics; Neuroscience; Psycholinguistics.
  Model Type (Table A4.1): LMM; GLMM; LMM & GLMM; GAMM; Other.
  Approach (Table A4.2): ANOVA testing for fixed effects via LRTs/model comparison; ANOVA testing with random effects of interest; regression with random effects controlling for subject/item variance; regression with multiple predictors and control variables; regression with random effects of interest; repeated measures/control for hierarchical sampling; repeated measures with random effects of interest.
  Model comparison (Table A4.3): LRTs; AIC/BIC; LRTs & AIC/BIC; descriptive.
  Statement on model selection (Table A4.4): what detail is given by the authors on how different models have been compared, or a final model selected?
  Convergence / random effects simplification: what detail is given by the authors of any convergence issues and what was done to deal with this (e.g., model simplification)?
  Model equation (Table A4.5): yes, reported; not reported; given for some and not others.
  Dependent variable: RT; errors/categorical variable; RT & errors; eye movement data; brain imaging data; other.
  Fixed effects 1 (Table A4.6): IV; IV & control variables.
  Fixed effects 2: main effects; main effects & interactions.
  Random effects approach, if mentioned (Table A4.7): LRTs; LRT & AIC/BIC; LRTs/AIC for slopes; maximal structure; LRTs backwards from maximal; LRTs upwards from minimal; LRTs against null.
  Random effect intercepts modelled: subject; item/other; subject & item/other; subject, item & other; item & other.
  Random effect slopes modelled: FE over subject; FE over item/other; FE over subject & items/other; FE over subject with interactions; FE over items/other with interactions; FE over subject & items/other with interactions.
  Random effect covariances modelled: yes, reported as modelled; not modelled; unclear whether modelled or not.
  Reporting format (Table A4.8): text only; text & tables; text, tables & figures; tables & figures; text & figures; figures; tables.
  Reporting fixed effects (Table A4.9): coefficients; coefficients, t & p; coefficients, SE/CI; coefficients, SE/CI, t/z; coefficients, SE/CI, t/z & p; coefficients, SE/CI, p; coefficients, p; t/z, p; p. An additional note was made if condition means were reported, not coefficients.
  Reporting random effects (Table A4.10): variance; variance & covariance; or not reported.
  Model fit reported (Table A4.11): R²; model estimate correlation with data; R² & estimate correlations; AIC/BIC; log likelihood; other (define); no.
  P-values, if mentioned (Table A4.12): assume t > 1.96 / 2; MCMC; LRTs; F tests; Satterthwaite; Kenward-Roger.
  Appendices for full reporting, if mentioned (Table A4.13): yes.

Results

The complete data set can be found at osf.io/bfq39 and tables with counts in Appendix 4. Here we will summarise the data by walking through the stages of LMM analysis: model selection, evaluating significance and reporting results. Tables presenting counts in Appendix 4 follow the order below.

Model selection
The majority of papers used LMM (n = 193), GLMM (n = 88) or a combination of both LMM and GLMM (n = 95). Generalized Additive Models (GAMs) were rare in our sampled papers (n = 5; see Table A4.1 in Appendix 4).
The majority of papers approached the use of LMMs as a variant on regression with random effects controlling for participant and item variation (n = 272), but a number also used LMMs as a replacement for ANOVA (n = 61). It was relatively rare for studies to look at the random effects as data of interest (n = 13; see Table A4.2). The classic use of LMMs for hierarchical sampling designs was present relatively infrequently (n = 26), which may be a result of the sampling process. LMMs have been used for a number of years in educational and organisational research to address questions concerning hierarchical sampling designs (Gelman & Hill, 2007; Scherbaum & Ferreter, 2009; Snijders, 2005). Baayen et al. (2008) – our seed paper – presents LMMs as a method to control for by-participant and by-item variation in experimental cognitive science.
Reporting the model selection process was infrequent (typically present in ~20–25 papers in each year; Table A4.3) and a wide variety of practices were present. Manuscripts reported “best fit” models following Likelihood Ratio Tests (LRTs) or Akaike Information Criterion or Bayesian Information Criterion (AIC/BIC) comparisons (n = 23) or minimal model approaches in which models were simplified by removing fixed or random effects that were not significant (n = 31). Models were also selected by moving from maximal to minimal models (n = 6) or minimal to maximal models (n = 8), or using backwards fitting (n = 7).
Model comparisons for fixed effects were not present in all manuscripts (typically present in ~50–60 papers in each year; Table A4.4). This may be because researchers using experimental designs are modelling all fixed effects together (as in an ANOVA) rather than using model comparison to select them. When comparisons were present, the majority reported LRTs (n = 129), with fewer reporting AIC or BIC (n = 12) or a combination of LRTs and AIC/BIC (n = 20). Some papers described the model comparison process but did not provide data for the comparisons (n = 54).
Model comparisons for random effects were also not present in all manuscripts (Table A4.5). The numbers that did test for the inclusion of random effects increased over time (2013 = 16, 2014 = 33, 2015 = 43, 2016 = 42). When a specific approach was reported, there was a clear preference for using a maximal random effects structure (Barr et al., 2013; n = 86), followed in frequency by a preference for using Likelihood Ratio Tests to determine random effects structures (LRTs, n = 25). Less common was a combination of starting with a maximal structure and then using LRTs to simplify (n = 11) or starting with a minimal structure and using LRTs to add more complex random effects (n = 7).
Reporting of convergence issues was increasingly common over the four-year period (2013 = 2, 2014 = 8, 2015 = 14, 2016 = 21; Table A4.4), and a variety of methods were reported for dealing with this. For example, simplification by removing slopes (n = 9), correlations between intercepts and slopes (n = 2) or both slopes and correlations (n = 4). Some manuscripts reported the “fullest model that converged” without specific detail on how simplification took place (n = 14).
Fixed effect predictors (Table A4.5) were most often modelled as main effects and interactions (n = 287) as compared to main effects alone (n = 94); the inclusion of control variables was also common (n = 109). The vast majority of models included random intercepts for both participants and items (Table A4.6, n = 277), with a good number that included intercepts for participants only (n = 64). Random slopes were present in around half the papers (2013 = 41, 2014 = 50, 2015 = 67, 2016 = 58; Table A4.6). Most commonly, random slope terms were included to capture variation in fixed effect predictors varying as main effects over participants (n = 78) or over both participants and items (n = 94). It was less common to include the variation of fixed effect interactions as slopes over subjects and/or items (n = 36). Where random slopes were modelled, it was rare for manuscripts to explicitly report whether correlations or covariances between intercepts and slopes had been modelled (~10–15 papers per year) and this information was often unclear or difficult to judge (n = 63).
A simple way to report the structure of a model is to provide the model equation (Table A4.7); this was given in a minority of papers with a clear increase over time (2013 = 7, 2014 = 6, 2015 = 26, 2016 = 22, total n = 61). However, the majority of papers did not provide this information (n = 317).
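Such an equation need not take much space. As an illustration (with hypothetical variables and indices, not an equation drawn from any reviewed paper), a trial-level model with random intercepts for subjects and items and a by-subject frequency slope can be written as:

\[
\mathrm{RT}_{ij} = \beta_0 + \beta_1\,\mathrm{Frequency}_j + u_{0i} + w_{0j} + u_{1i}\,\mathrm{Frequency}_j + \epsilon_{ij}
\]
\[
(u_{0i}, u_{1i}) \sim \mathcal{N}(0, \Sigma_u), \qquad w_{0j} \sim \mathcal{N}(0, \sigma_w^2), \qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]

where $i$ indexes subjects, $j$ indexes items, $\Sigma_u$ is the covariance matrix of the by-subject intercepts and slopes, and $\sigma^2$ is the residual variance.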
Evaluating significance
We classified 10 different combinations or approaches to evaluating significance for fixed effects (see Table A4.8). It is worth noting that only around half the papers reported the method used (n = 207), so we can assume that researchers employed methods that were defaults for software packages. The main methods reported were: MCMC
bootstrapping procedures available in R (n = 71); assuming t was normally distributed and taking t > 1.96 or t > 2 as significant (n = 52); or taking p-values for fixed effects from Likelihood Ratio Tests (LRTs) comparing models with and without the effect of interest (n = 40). Other options for evaluating significance involved using approximations for calculating degrees of freedom (e.g., Satterthwaite, n = 20; number of observations minus fixed effects, n = 2), or using F tests calculated over the model output (n = 23) (see further discussion in Sections 4.1.5 and 4.1.6).
It was very rare for measures of model fit to be reported (Table A4.9), with most papers not providing this information (n = 330). When model fit information was provided, it was most often the Log Likelihood or AIC/BIC value (n = 35), which are informative relative to another model of the same data. R² was provided in few cases (n = 8).

Reporting results
Manuscripts typically used text, tables and figures to report model output (n = 151), although other options were evenly split over text and tables (n = 85) and text and figures (n = 94), with several only reporting model output in the text (n = 52; Table A4.10). Thus, many papers do not provide a summary of model output in a table, as you would expect for an analysis using multiple regression.
We saw every possible variation in reporting fixed effects (Table A4.11). The majority reported fixed effect coefficients, standard errors or confidence intervals, test statistics (t/z) and p-values (n = 128). It was also common to report the coefficients and the standard error or confidence intervals with a test statistic but no p-value (n = 52), a p-value but no test statistic (n = 39), coefficients without standard errors or confidence intervals (n = 73), or to provide only a test statistic or a p-value (n = 43).
Most studies did not report random effects at all (Table A4.12, n = 304), with only 51 papers reporting variances and 23 reporting variances and correlations or covariances.
A small number of papers used appendices to provide a complete report on model selection, fitting and code used for analysis (n = 25, Table A4.13).
Discussion

The variation in practice evident from the review of papers mirrors the uncertainty reported by surveyed researchers. Naturally, some of the variation will be attributable to what is appropriate to the data and the hypotheses (e.g., the use of LMMs or GLMMs, the modelling of main effects only or interactions). What concerns us is the evidence for unnecessary or arbitrary variation in the reporting of LMMs. Because it is arbitrary, this variation will make analyses difficult to parse and it will incubate an irreducible difficulty (given low rates of data or code sharing) for the aggregation or summary of psychological findings. This difficulty will, necessarily, impede the development of theoretical accounts or practical applications.

Prior to completing this work, we hypothesized that models were being used in different ways by the research community – as an alternative to multiple regression or as an alternative to ANOVA. We found some support for this: the vast majority of models (70%) were framed as regression analyses, and around 15% as ANOVA analysis. We also found other approaches, for example, whether the random effects were reported as data of interest, or whether the study was explicitly controlling for a hierarchical sampling procedure. Around 56% of the papers reported some form of model comparison but did not always then give informative detail. For model selection, 24% provided explicit detail on the approach taken for fixed effects and around 35% provided detail on how the random effects structure had been chosen. The review of papers clearly shows both diversity of practice and a lack of transparency and detail in reporting. This makes the diversity confusing rather than a source of information. In this context, it is not surprising researchers report confusion and lack of knowledge.

Of particular interest was the variation in how significance was established. Only half the papers reported the method used, yet we encountered 10 different methods for testing significance in use. Depending on the study (e.g., confirmatory hypothesis testing or data exploration), researchers will have different needs for their analysis (Cunnings, 2012). When replacing ANOVA or ANCOVA, researchers might want something similar to an F test that provides a p-value for the main effect or the interaction effect. This can be achieved by testing to see if the inclusion of a predictor improves model fit (e.g., Frisson, Koole, Hughes, Olson, & Wheeldon, 2014; Trueswell, Medina, Hafri, & Gleitman, 2013). Alternatively, an ANOVA can be used to get F-tests for predictors. Here, the ANOVA summarises the variation across levels of a predictor in the model, and therefore how much variation in the outcome that predictor accounts for (e.g., if there is zero variation across experimental conditions, that manipulation does not change the outcome; Gelman & Hill, 2007). It is interesting to note that Gelman and Hill (2007) suggested the latter use of ANOVA not as a final analysis step in establishing significance, but as a tool for data exploration to inform which predictors are interesting when building models.
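Both routes are straightforward to implement. The sketch below, with hypothetical data and variable names, shows a likelihood ratio test for an interaction via model comparison, and ANOVA-style F-tests via the lmerTest package (one common implementation, which approximates degrees of freedom with the Satterthwaite method by default):

    # Route 1: test the A:B interaction by comparing nested models, fitted with ML
    library(lme4)
    m0 <- lmer(rt ~ A + B + (1 | subject) + (1 | item), data = dat, REML = FALSE)
    m1 <- lmer(rt ~ A * B + (1 | subject) + (1 | item), data = dat, REML = FALSE)
    anova(m0, m1)  # likelihood ratio test: chi-square statistic and p-value

    # Route 2: ANOVA-style F-tests for each term in a single model
    library(lmerTest)
    m2 <- lmer(rt ~ A * B + (1 | subject) + (1 | item), data = dat)
    anova(m2)      # F-tests with Satterthwaite-approximated degrees of freedom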
We found 63 papers that evaluated significance by using F tests or model comparison (~30% of the papers that reported a specific method of testing significance). However, it was not the case that LMMs framed as ANOVA always used this method for evaluating significance: such cases were evenly split across analyses framed as ANOVA (n = 30) and those framed as regression (n = 31, see Table A4.14). Where the analysis was framed as regression, we expected that it would draw on the power of LMMs to account for nested sampling groups (e.g., geographic or genealogical relationships between different languages, Jaeger et al., 2011), modelling the influence of individual differences (e.g., such as age, Davies, Arnell, Birchenough, Grimmond, & Houlson, 2017), change over time in repeated measures data (e.g., Walls & Schafer, 2006), or accounting for multiple predictor variables (Baayen, 2010; Davies et al., 2017). What researchers might want here is more similar to regression, exploring model building and comparison (e.g., Goldhammer et al., 2014) and coefficients for predictor variables. The vast majority of manuscripts were framed as regression and reported the significance of coefficients (n = 122). Interestingly, it was almost never the case that papers reported both whether a coefficient was significant and whether the inclusion of that predictor improved model fit (n = 2).

General discussion

Linear Mixed-effects Models (LMMs) have, for good reason, become an increasingly popular method for analyzing data across many fields, but our findings outline a problem that may have far-reaching consequences for psychological science even as the use of these models grows in prevalence. We present a snapshot of what psychological researchers think about mixed-effect models, and what they do when they publish reports based on their results. A survey of researchers reveals that we are concerned about applying LMMs in our own analyses, and about the use of LMMs across the discipline. These widely-held concerns are linked to uncertainty about how to fit, understand and report the models. We may understand the reasons why we should use them but many among us are unclear how to proceed, as writers or as reviewers, in the absence of clear guidance, and in the face of marked inconsistencies in reporting practices. These concerns are mirrored in a striking diversity apparent in the ways in which researchers specify models, present effects estimates, and communicate the results of significance tests.

We observe that it is the reporting of models that is the principal point of failure. We find substantial, seemingly arbitrary, variation across studies in the information communicated about models and the estimates derived from them. We predict that this variation will make analyses difficult to parse, and thus will seed an irreducible difficulty for the future for the accumulation of psychological evidence. We saw that model equations were very rarely reported, though this is a simple means to communicate the precise structure of both fixed and random effects. Papers using LMM analysis do not always provide a complete summary of the model results. Fixed effect coefficients were not always reported with standard errors or confidence intervals. Random effects were hardly reported at all. These are all essential data for meta-analysis and power analysis. Curiously, then, the reporting of LMMs often ignores the key reason for using the analysis in the first place: an explicit accounting for the variance associated with groupings (sampling units) in the data. Random effect variances and covariances allow us to see just how much of the variance in the data can be attributed to, for example, individual variation (e.g., fast or slow participants) and the predicted effects (e.g., do fast participants always show a smaller effect?). If we care about psychological mechanisms, these are valuable observations that are simply not being reported.

The need for common standards was raised in relation to core aspects of working with LMM analysis, including model building, model comparison, model selection, and the interpretation of results. There are varying ways to build any statistical model, for example, in linear regression (e.g., stepwise model selection, simultaneous entry of covariate predictors), and so there are varying ways to build an LMM. There is no one approach that will suit all circumstances, therefore researchers should report and justify the process they took. A number of recent studies have shown how the results for experimental data can vary substantially depending on alternate more-or-less reasonable-seeming decisions taken during data analysis (Gelman & Loken, 2013; Silberzahn & Uhlmann, 2015; Simmons et al., 2011; see, also, Patel, Burford, & Ioannidis, 2015). The more complex the analysis pipeline, the greater the possible number of analyses, and the greater the likelihood of widespread but undocumented variation in practice. We do not identify the existence of alternate analytic pathways as inherently troubling – the path we take during analysis is always one amongst many. The difficulty for scientific reasoning stems from the occlusion of approaches, decisions and model features by inconsistent or incomplete reporting.

In general, maybe we as a field can live with a balance in which data are sacred but analyses are contingent. On publication, we share the
data and analysis as transparently as possible, and seek to guarantee its fidelity. We do not assume that an analysis as-published will be the last word on the estimation of effects carried in the data. We allow that alternate analyses may, in future, lead to revision in estimated effects. This approach would be supported by a reduced reliance on significance cut-offs and a greater focus on effect sizes themselves. A more systematic exploration of the sensitivity of results to analytic choices may permit the field to build in robustness to results reporting. In a helpful recent discussion, Gelman and Hennig (2017) explore the ways in which researchers can usefully move to considering statistical analyses in terms of transparency, consensus, impartiality, correspondence to observable reality, and stability. Consistent with our analysis of the application and reporting of mixed-effects models in psychological science, Gelman and Hennig (2017) advocate, moreover, the broader acknowledgement of multiple perspectives, the ways in which different decisions can be made given differing perspectives or in different contexts, and the rigorous explanations of our choices given the possibility of alternate approaches. It may be that we shall see, increasingly, that analyses addressing scientific hypotheses are supplemented by examinations of the stability of estimates over reasonable variants in approach. We are certain, however, that transparency in reporting will be foundational to progress.

Best practice guidance

In the following sections, we present short discussions and recommendations for practice for the key areas highlighted by the survey and review results. We offer, in Table 7, advice concerning best practice in reporting LMMs.

Table 7
Best practice guidance for reporting LMMs.

Preparation for modelling
  Software: Report the software and version of software used for modelling.
  Power analysis (Section 4.1.2): Report any a-priori power analyses, including effect sizes for fixed effects and variances for random effects.

The model
  Assumptions of LMM (Section 4.1.3): Report what data cleaning has been completed: outlier/data removal, transformations (e.g., centering or standardizing variables) or other changes prior to or following analysis (e.g., Baayen & Milin, 2015). Report whether models meet assumptions for LMMs. Report if transformations were carried out in order to meet assumptions (e.g., log transformation of reaction time to meet the assumption that residuals are normally distributed).
  Selection of fixed and random effects (Sections 4.1.4 and 4.1.5): Random effects are explicitly specified according to sampling units (e.g., participants, items), the data structure (e.g., repeated measures) and anticipated interactions between fixed effects and sampling units (e.g., intercepts only or intercepts and slopes). Fixed effects and covariates are specified from explicitly stated research questions and/or hypotheses. Report the size of the sample analysed in terms of total number of data points and of sampling units (e.g., number of participants, number of items, number of other groups specified as random effects, such as classes of children).
  Model comparison* (Section 4.1.5): A clear statement of the methods by which models are compared/selected; e.g., simple to complex, covariates first, random effects first, fixed effects first, etc. Report the comparison method (LRT, AIC, BIC) and justify the choice. A complete report of all models compared (e.g., in appendices/supplementary data/analysis scripts) with model equations and the result of comparisons. An example table reporting model comparisons can be found in Appendix Table A5.1.
  Convergence issues (Section 4.1.5): If models fail to converge, the approach taken to manage this should be comprehensively reported. This should include the formula for each model that did or did not converge and a rationale for (a) the simplification method used and (b) the final model reported. This may be most easily presented in an analysis script.

The results (Sections 4.1.6 and 4.1.7)
  Model*: Provide equation(s) that transparently define the reported model(s). An elegant way to do this is providing the model equation with the table that reports the model output (see Appendix Table A5.2).
  Model output*: Final model(s) reported in a table that includes all parameter estimates for fixed effects (coefficients, standard errors and/or confidence intervals, associated test statistics and p-values if used), random effects (standard deviation and/or variance for each random effect, correlations/covariances if modelled) and some measure of model fit (e.g., R-squared, correlation between fitted values and data) (see Appendix Table A5.2).
  Data and code: Share the coding script used to complete the analysis. Wherever possible, share the data that generated the reported results.

* Example tables here are adapted from the excellent examples in Stevenson, Hickendorff, Resing, Heiser, & de Boeck, 2013 (Table 2), Goldhammer et al., 2014 (Table 1) and Li, Bicknell, Liu, Wei, & Rayner, 2014.
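The first recommendation in Table 7 is easy to satisfy in R; a minimal sketch:

    # Record the analysis environment for the report or supplementary materials
    R.version.string        # version of R used to fit the models
    packageVersion("lme4")  # version of the modelling package
    sessionInfo()           # full platform and package listing

Because back-end changes to packages such as lme4 can alter fitted results, archiving this output alongside the analysis script allows readers to reconstruct the environment in which the reported models were estimated.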

Preparation for using LMMs

A number of researchers are moving from analysing factorial design data with ANOVA to analysing factorial design data with LMMs. In this context, the sample of experimental stimuli or trial types needs to be carefully considered to furnish the sensitivity sufficient to detect experimental or observed effects (see below), and the computational engine (most often, maximum likelihood estimation) for LMMs assumes a large sample size (Maas & Hox, 2004, 2005). It is our view that some issues with convergence are likely caused by researchers using LMMs to analyse relatively small sets of data. With smaller samples, it is less likely that a viable solution can be found to fit the proposed model to the data. It is worth highlighting that the literature on mixed-effects models defines ‘small’ as 50 or fewer sampling units (Bell, Morgan, Kromery, & Ferron, 2010; Maas & Hox, 2004, 2005). A researcher may be interested in the effect of frequency, testing this with 10 high frequency and 10 low frequency words. In an ANOVA, the participant average RT for the high and low frequency words would be calculated. In an LMM, this would be the coefficient for frequency (e.g. Fig. 4a). However, a random effect may also be fit to model how this effect differs for each participant (e.g. Fig. 4b). In this case, the model only has available 20 data points per participant (10 high and 10 low) and this may simply be insufficient to complete the computation (Bates et al., 2015). With more complex random effect structures (e.g., maximal structures for some designs, after Barr et al., 2013) and perhaps no change in how researchers plan experiments, it is unsurprising that convergence issues have become increasingly common.

In short, plan to collect data for as many stimuli and as many participants as possible. This comes with the caveat that with very large sample sizes, smaller effects can become ‘significant’ even though they may not be meaningful. We direct researchers to the discussion in Section 4.1.6 below, and the very sensible advice from the American
Statistical Association (Wasserstein & Lazar, 2016) to move away from cut-offs for interpreting p values. Where smaller sample sizes are unavoidable (e.g., recruitment of hard-to-reach or specialist populations, difficulty generating large samples of stimuli), researchers should – of course – acknowledge this limitation. They should also examine (see osf.io/bfq39/files/LMMs_BestPractice_Example_withOutput) the random effects and consider their validity. Convergence issues may mean that the fitting of random effects for some terms is not possible. Random effect variances that are close to zero indicate there is little variance to be accounted for in the data. Random intercepts and slopes that show high or near perfect correlations may indicate over-fitting.

Power analysis for LMMs

It will surprise no-one that power analysis for LMMs is complicated. This is principally because study design features like the use of repeated measures require multiple level sampling (e.g., of participants, of stimuli) and entail a hierarchical or multilevel structure in the data (grouping trial-level observations, say, under participants or stimuli) (Scherbaum & Ferreter, 2009; Snijders & Bosker, 1993; Snijders, 2005). If, for example, a researcher presents all 20 stimuli to each of 20 participants, in each condition of a factorial design, the data sample can be characterized in terms of the lowest level of sampling (the individual observations, n = 400, of each response by a participant to a stimulus) but also in terms of higher-level groupings, or sampling units (the number of participants and the number of items), while the mixed-effects model may incorporate terms to estimate effects or interactions between effects within and across levels of the hierarchical data structure (i.e. effects due to participant attributes, stimulus properties, or trial conditions). In addition, for LMMs, we can usefully consider the power to accurately estimate fixed effect coefficients, random effect variances, averages for particular sampling units or interactions across those units (Scherbaum & Ferreter, 2009; Snijders, 2005). From hereon we will focus only on power to detect fixed effect predictors.

For fixed effects, power in LMMs does not increase simply as the total sample of observations increases. Observed outcome values within a grouping (e.g., trial response values for a given participant) may be more or less correlated. If this correlation (the intra-class correlation for a given grouping) is high, adding more individual data points for a grouping does not add more information (Scherbaum & Ferreter, 2009). In other words, if the responses across trials from a particular participant are highly correlated, the stronger explanatory factor is the participant, not the individual trials or conditions (as we saw in the example in Section 1.1). Getting the participant to do more trials does not increase power. This also means that accurate power estimation for LMMs requires us to estimate or know the variation within and between sampling units, e.g., for trials within subjects (Scherbaum & Ferreter, 2009; Snijders & Bosker, 1993). This is one of the reasons why reporting random effect variances is so important for the field.

The general recommendation is to have as many sampling units as possible, since this is the main limitation on power (Snijders, 2005), where sampling units consist of the sets by which the lowest level of observations (e.g., trial-level observations) are grouped, where groupings can be expected to cause correlations in the data (Bell et al., 2010; Maas & Hox, 2005; Maas & Hox, 2004). Fewer sampling units will mean that effects estimates are less reliable (underestimated standard errors, greater uncertainty over estimates; Bell et al., 2010; Maas & Hox, 2004, 2005). When looking across a range of simulation studies, Scherbaum and Ferreter (2009) concluded that increasing numbers of sampling units is the best way to improve power (this held for the accuracy of estimating fixed effect coefficients, random effect variances and cross-level interactions). For psychological research, this means 30–50 participants, and 30–50 items or trials for each of those participants completing each condition (i.e. a total sample of 900–2500 data points; Scherbaum & Ferreter, 2009). For example, assuming typical effect sizes of 0.3–0.4 (scaled in standard deviations), Brysbaert and Stevens (2018) recommend a minimum of 40 participants and 40 items (1600 data points). It bears repeating that any power analysis is dependent on the effect sizes under consideration, so there is no simple rule (e.g., “just use 40 participants and 40 items”). In parallel, it is an empty critique to say that studies are ‘underpowered’ unless we can guess the likely effect sizes. For example, with LMM analysis a typical factorial experiment in psychology with 30 participants responding to 30 stimuli has power of 0.25 for a small effect size (0.2) and 0.8 for a medium effect size (0.5, see Fig. 2 in Westfall et al., 2014). To achieve a power of 0.95 for this number of participants and stimuli, you need a minimum effect size of around 0.6. Recall that 0.4 is a typical effect size for psychological studies (Brysbaert & Stevens, 2018). Adding more participants alone does not remedy this problem (Luke, 2016), as power asymptotes due to the variation in stimuli (Westfall et al., 2014). This links back to the issue identified above: the higher-level groupings (sampling units) in the data influence variation (responses for the same participant are correlated, responses for the same items are correlated), so ideally, the numbers for all sampling units should be increased. Ultimately, these considerations may change the design of the study.

Brysbaert and Stevens (2018) provide an easy to read tutorial on conducting power analysis to detect fixed effects. They show how to use the online application from Westfall et al. (2014, jakewestfall.shinyapps.io/two_factor_power/) as well as power analysis using simulated data in R. For the online application from Westfall et al. (2014), researchers need (a) an estimate of the effect size for the fixed effect, (b) estimates for the variance components – i.e. the proportion of the total variance that each random effect in the model accounts for – and (c) the number of participants and the number of items. For power analysis from simulation, researchers would ideally use pilot data or data from a published study. It is also possible, with some skill, to generate data sets that give an ‘idealised’ experiment outcome (e.g. a significant effect of some reasonable size) and base power analysis on that. It is worth stressing that without the full reporting of random effects in publications and more common sharing of data we are severely limiting our ability to conduct useful a-priori power analysis. Appendix 2 lists packages available for LMM power analyses, but we strongly recommend Brysbaert and Stevens (2018) as a starting point.
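For readers who want a concrete starting point for the simulation route, the sketch below uses the simr package (one of the options listed in Appendix 2); the pilot model, data set (pilot_dat), effect size and simulation settings are hypothetical placeholders, to be replaced with values from pilot or published data.

    library(lme4)
    library(simr)

    # A model fitted to (hypothetical) pilot data, with condition coded numerically
    m_pilot <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = pilot_dat)

    # Set the fixed effect to the smallest effect size of interest (e.g., 20 ms)
    fixef(m_pilot)["condition"] <- 20

    # Simulate to estimate power for the condition effect
    powerSim(m_pilot, test = fixed("condition"), nsim = 200)

    # Explore power after increasing the number of subjects to 60
    m_more <- extend(m_pilot, along = "subject", n = 60)
    powerSim(m_more, test = fixed("condition"), nsim = 200)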
Assumptions for LMMs

Researchers should check whether the assumptions of LMMs have been met. For LMMs, we take the same assumptions as for regression (linearity, random distribution of residuals, homoscedasticity; Maas & Hox, 2004, 2005), except that LMMs are used precisely because the independence assumption is violated: we know that data are grouped in some way, so observations from those groups are correlated. For LMMs, we assume that residual errors and random effects deviations are normally distributed (Crawley, 2012; Field & Wright, 2011; Pinheiro & Bates, 2000; Snijders & Bosker, 2011). The simplest way to check these assumptions is to plot residuals and plot random effects. The script associated with Section 1.1 provides some R code for plotting random effects. For plotting residuals and checking model assumptions, we refer readers to the excellent tutorial by Winter (2013). It has been shown that non-normally distributed random effects do not substantially affect the estimation of fixed effect coefficients but do affect the reliability of the variance estimates for the random effects themselves (Maas & Hox, 2004).
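A minimal set of checks, assuming a fitted lme4 model m with a grouping factor named subject, might look as follows; this sketches the plots described above and is no substitute for the fuller treatment in Winter (2013).

    # Residuals vs fitted values: look for non-linearity and heteroscedasticity
    plot(fitted(m), resid(m), xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)

    # Normality of residuals
    qqnorm(resid(m)); qqline(resid(m))

    # Normality of the random effect deviations (here, by-subject intercepts)
    u0 <- ranef(m)$subject[, "(Intercept)"]
    qqnorm(u0); qqline(u0)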
Selecting random effects

The literature suggests that two approaches can sensibly be taken. Researchers may choose to select random effects according to experimental design (Barr et al., 2013; Brauer & Curtin, 2018), and this can result in a maximal to minimal-that-converges modelling process (more on this below). Alternatively, researchers can select random effects that improve model fit (Bates et al., 2015; Linck & Cunnings, 2015; Magezi, 2015). This results in a minimal to maximal-that-improves-fit process. In both cases, the random effects part of the model is built first. Once it is established, fixed effects are added.
Selecting random effects according to experimental design has been recommended for confirmatory hypothesis testing (Barr et al., 2013), and this is the most common situation for researchers in experimental psychology. The steps are to identify the maximal random effects structure that is possible for the design, and then to see if this model converges (whether the model can be fit to the data). Brauer and Curtin (2018) helpfully summarise Barr et al. (2013) with three rules for selecting a maximal random effects structure: add (1) random intercepts for any unit (e.g., subjects or items) that causes non-independence in the data; (2) random slopes for any within-subject effects; and (3) random slopes for interactions that are completely within-subjects.

Many readers will have found that complex random effect structures may prevent the model from converging. This often occurs because the random effects specified in the model are not present in the data (Bates et al., 2015; Matuschek et al., 2017). For example, when a random effect is included to estimate variance associated with differences between participants in the effect of a within-subjects interaction between variables, while in the data the interaction does not substantially vary between participants, researchers would commonly find that the random effect cannot be estimated. Solutions to convergence problems may include the simplification of model structure (Brauer & Curtin, 2018; Matuschek et al., 2017), using Principal Components Analysis to determine the most meaningful slopes (Bates et al., 2015), switching to alternate optimization algorithms (see comments by Bolker, 2015), or indeed to alternate programming languages or approaches (e.g., Bayes estimation, Eager & Roy, 2017). We strongly recommend the summary provided by Brauer and Curtin (2018), where a step-by-step guide is provided for dealing with convergence issues and, in particular, steps to take for simplification from a maximal model.
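Two of these remedies can be sketched briefly in lme4 (the model, data and variable names below are hypothetical):

    library(lme4)

    # Switch optimizer and raise the iteration limit for a model that does not converge
    m_max <- lmer(rt ~ A * B + (1 + A * B | subject) + (1 | item), data = dat,
                  control = lmerControl(optimizer = "bobyqa",
                                        optCtrl = list(maxfun = 2e5)))

    # Principal components analysis of the random effects (Bates et al., 2015):
    # components with near-zero variance flag terms that could be removed
    summary(rePCA(m_max))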
Alternatively, researchers may select random effects that improve model fit (Linck & Cunnings, 2015; Magezi, 2015). Matuschek et al. (2017) demonstrated that models are more sensitive (in the detection of fixed effects) if random effects are specified according to whether Likelihood Ratio Test (LRT) model comparisons warrant their inclusion, that is, according to whether or not the random effects improve model fit. Matuschek et al. (2017) contend that we cannot know in advance whether a random effect structure is supported by the data, and that in the long run, fitting models with random effects selected for better model fit means that the researcher can effectively manage both Type I and Type II error rates. So, under this process, the random effects are built up successively and tested at each point to see if they improve model fit, beginning with intercepts, slopes for main effects, then intercepts and slopes, and then interactions between main effects. Researchers may find that certain random effect terms do not improve model fit, or that the model does not converge with some terms. In the model output, random effect variances may be estimated as close to zero. Either outcome suggests the random effect being modelled is not present in the sample. Where covariances are modelled (correlations between intercepts and slopes), perfect correlations between random effect terms can indicate over-fitting. That is, all the variance explained by the random slope is in fact already explained by fitting the random intercept (leading to a perfect correlation between these terms). In this case, it is unlikely that the inclusion of the slope would improve model fit.
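As an illustration of this build-up-and-test process, the sketch below (using the same hypothetical names as above) adds a by-subject slope and asks whether the data support it.

```r
# Sketch: testing whether a random slope improves model fit.
library(lme4)

m0 <- lmer(rt ~ A + (1 | subject) + (1 | item), data = dat)      # intercepts only
m1 <- lmer(rt ~ A + (1 + A | subject) + (1 | item), data = dat)  # add by-subject slope

# Compare the REML fits directly; refit = FALSE stops anova() from
# refitting the models with maximum likelihood.
anova(m0, m1, refit = FALSE)

# Inspect the random effects: a variance estimated near zero, or a
# correlation of +1 or -1 between intercept and slope, suggests the
# term is not supported by the sample.
VarCorr(m1)
```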
Our focus on random effects reflects the novelty of this requirement for psychological research, and the conceptual and computational challenges involved: what effects can be specified? (Barr et al., 2013); what effects allow a model to converge? (Eager & Roy, 2017). More broadly, however, our discussion reflects a general point about model specification and selection: why should we want to build all models in the same way? The two options we have outlined above for selecting random effects are both reasonable and well-motivated. It should be left up to individual researchers to choose the approach they prefer and to give the rationale for that choice, together with a clear statement of the criteria used when selecting model parameters; these criteria should be principally driven by the research questions.
A pragmatic approach to life with multiple models. It would be productive for the field if we acknowledge that the approach we take during analysis is typically to choose one course given alternatives. We should ask the questions "How was your study designed?" and "What do you want to know from the data?" and "Given that, why have you taken the approach you have taken?". So, it is inevitable that researchers will end up building and testing multiple models when working with LMMs. In the context of testing data from an experimental design (e.g., the kind of factorial design that would traditionally be analysed using an ANOVA), it is sensible for the fixed effects to be defined around the experimental conditions (see, e.g., Barr et al., 2013; Schad, Vasishth, Hohenstein, & Kliegl, 2018). However, researchers may have fixed effect variables that they wish to analyse in addition to the experimental conditions. These could be added after the experimental conditions, added at the same time, or tested for inclusion. Naturally, the approach taken will depend on the hypotheses. As we have stated above, there is no single correct approach that will apply across all situations.

There are several approaches to model selection. In a controlled experimental study, the hypotheses about the fixed effects may be entirely specified in terms of the expected impact of the experimental conditions, and these could then be entered all at once (as for ANOVA). Alternatively, researchers may be interested in finding the simplest explanation for the data. In this case, they might start with the most complex model, incorporating all possible effects implied by the experimental design, and remove terms that do not influence model fit (i.e., where a simpler model may explain the data comparably to a more complex model). The approach taken by a researcher should be justified with respect to their research questions and assumptions.
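For example (a sketch only, with the same hypothetical names as above), the two strategies might look like this:

```r
library(lme4)

# (a) Enter all fixed effects implied by the design at once, as for ANOVA:
m_design <- lmer(rt ~ A * B + (1 | subject) + (1 | item),
                 data = dat, REML = FALSE)

# (b) Start with the most complex model and simplify: drop a term and
# retain the simpler model if it explains the data comparably well.
m_simpler <- update(m_design, . ~ . - A:B)  # remove the interaction
anova(m_simpler, m_design)
```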
Model comparison. Model comparison can be completed using information criteria (e.g., the Akaike Information Criterion, AIC, and the Bayesian Information Criterion, BIC; see discussions in Aho, Derryberry, & Peterson, 2014) and Likelihood Ratio Tests (LRTs). LRTs apply when models are nested (the terms of the simpler model appear in the more complex model) and the models are compared in a pairwise fashion (see discussions in Luke, 2016; Matuschek et al., 2017). If not nested, models can be evaluated by reference to information criteria. Aho et al. (2014) argue that AIC and BIC may be differently favoured in different inferential contexts (e.g., in their account, whether analyses are exploratory (AIC) or confirmatory (BIC)), and we highlight, for interested readers, a rich literature surrounding their use (e.g., Burnham & Anderson, 2004; see, also, McElreath, 2015). However, LRT model comparisons are often useful as a simple means to evaluate the relative utility of models differing in discrete components (models varying in the presence vs. absence of hypothesized effects). The LRT statistic is formed as twice the log of the ratio of the likelihood of the more complex (larger) model divided by the likelihood of the less complex (smaller) model (Pinheiro & Bates, 2000).

It can be understood as a comparison of the strength of the evidence, given the same data, for the more complex versus the simpler model. The likelihood comparison yields a p-value (e.g., using the anova() function in R) because the LRT statistic has an approximately χ2 distribution, assuming the null hypothesis is true (that the simpler model is adequate), with degrees of freedom equal to the difference between the models in the number of terms.
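In R, the statistic and its p-value can be computed directly from two fitted models, which makes the logic explicit; m_simple and m_complex below stand for hypothetical nested models fitted by maximum likelihood.

```r
# The LRT statistic by hand (m_simple and m_complex are nested ML fits).
ll_simple  <- logLik(m_simple)
ll_complex <- logLik(m_complex)

lrt <- as.numeric(2 * (ll_complex - ll_simple))        # twice the log likelihood ratio
df  <- attr(ll_complex, "df") - attr(ll_simple, "df")  # difference in number of terms
pchisq(lrt, df = df, lower.tail = FALSE)               # approximate p-value

# anova(m_simple, m_complex) reports the same test, alongside AIC and BIC.
```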
When comparing models using LRTs, successive models should differ in either their fixed effects or their random effects but not both. This is because (a) models tested with LRTs must be nested and (b) a change in the random effect structure will change the values of the fixed effects (and vice versa). Models can be generated using maximum likelihood (ML) or restricted maximum likelihood (REML). Both methods solve model fitting by maximizing the likelihood of the data given the model. When comparing models that differ in their fixed effects, it is recommended to use ML estimation for the models. This is because REML likelihood values depend on the fixed effects in the model (Faraway, 2016; Zuur, Ieno, Walker, Saveliev, & Smith, 2009). When comparing models that differ in their random effects, it is recommended to use REML estimation for the models. This is because ML estimates of random variance components tend to be underestimated in comparison with REML estimates (Zuur et al., 2009).
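The sketch below illustrates both recommendations. Note that, at least in current versions of lme4, anova() silently refits REML models with ML before comparing them, so refit = FALSE is needed to keep the REML fits when comparing random effects (names again hypothetical).

```r
library(lme4)

# Models differing in fixed effects: fit and compare with ML (REML = FALSE).
f1 <- lmer(rt ~ A     + (1 | subject) + (1 | item), data = dat, REML = FALSE)
f2 <- lmer(rt ~ A + B + (1 | subject) + (1 | item), data = dat, REML = FALSE)
anova(f1, f2)

# Models differing in random effects: fit with REML (the default) and
# compare without refitting.
r1 <- lmer(rt ~ A + B + (1 | subject) + (1 | item), data = dat)
r2 <- lmer(rt ~ A + B + (1 + A | subject) + (1 | item), data = dat)
anova(r1, r2, refit = FALSE)
```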
Researchers may be concerned whether there need to be corrections for multiple comparisons when multiple models are being compared using LRTs. If a complex model is being built and LRTs are being used at each step to judge the inclusion or exclusion of a particular effect, should there be an adjustment to the alpha level to reflect the volume of comparisons being made? The problem can be framed in terms of the simplification of a model where greater complexity is rejected because the more complex model is found, by means of the LRT comparison, to fit the data no better than the simpler model. A simpler model, in that circumstance, will be associated with too narrow confidence limits and too small p-values, however good the overall fit, because degrees of freedom corresponding to the dismissed complexity (the rejected larger model) are then not accounted for in the estimation of standard errors for the simpler model (cf. Harrell, 2015). More generally, p-values depend upon the researcher following their intentions: adhering to prior sampling targets, or completing as many statistical comparisons as were planned (Kruschke, 2013, 2014). Therefore, our advice would be that, firstly, researchers should be explicit about the models they fit and evaluate. Secondly, if researchers plan to perform significance tests, they should consider the utility of pre-registering experimental data collection and analysis plans (Nosek, Ebersole, DeHaven, & Mellor, 2018).
Using multiple models to test for robust effects. It is worth considering how variation in data preparation and model building can be harnessed to clarify the stability or sensitivity of effect estimates. Steegen et al. (2016) described multiverse analyses, in which all possible data sets are constructed from a sampling of the alternative ways in which raw data can be prepared for analysis (e.g., with variations on outlier exclusion, variable coding) and the analysis of interest is then performed across these data sets. P-value plots can be used to show how effects vary across differently collated data sets, indicating the robustness of results, or potential holes in theory or measurement. Patel et al. (2015) describe the "vibration of effects" or VoE, which shows the variation in effect estimates across different models. This is particularly applicable in cases where there are many ways to specify models, and many possible variables or covariates of interest. VoE analysis shows how the influence of a variable changes across models and as more covariates (adjustment variables) are included.
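A minimal VoE-style loop might look like the following sketch, which refits a hypothetical model across covariate sets and collects the estimate for the variable of interest; the covariates age and frequency are placeholders, and nothing here is prescribed by Patel et al. (2015) beyond the general idea.

```r
# Sketch: how much does the estimate for A "vibrate" across specifications?
library(lme4)

covariate_sets <- c("", "+ age", "+ age + frequency")  # hypothetical covariates
estimates <- sapply(covariate_sets, function(extra) {
  f <- as.formula(paste("rt ~ A", extra, "+ (1 | subject) + (1 | item)"))
  fixef(lmer(f, data = dat))["A"]
})

estimates          # the coefficient for A under each specification
range(estimates)   # the spread indicates how (un)stable the effect is
```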
Reporting model building. The problem we have identified, the arbitrary variation in reporting and analytic practice, is not insoluble. When multiple models have been fit to reach a final 'best model', best practice is to report the process of comparison. Appendix Table A5.1 offers a format for reporting LRT model comparisons concisely. When multiple, equally plausible, models of the data are possible, a fruitful approach is to examine the variation in estimates across a series of models and report this as a test of the robustness of effects (Patel et al., 2015).
In an era of online publication, it is straightforward for appendices and supplementary materials to house additional information. The provision of analysis scripts and data with publication is a straightforward means to repeat or modify analyses if researchers (and reviewers) so wish. With the increasing use of pre-registration, researchers will specify in advance the modelling approach they will use. This may include an actual model to be fit (i.e. a model equation), but at minimum it should include the dependent variable(s), fixed effects, covariates, a description of how random effects were chosen, and the method by which model selection will take place (e.g. simple to complex, covariates first, etc.). To be truly comprehensive, it should also have an a priori power analysis (see Section 4.1.2); this alone would mean the model (or alternative models) are well specified beforehand.
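As an illustration, a pre-registered plan might state the intended model in lme4 syntax. The formula below is a hypothetical example of such a model equation, not a model taken from this paper.

```r
# Hypothetical pre-registered model equation:
# DV: rt; fixed effects: experimental condition, plus trial order as a covariate;
# random effects: by-subject and by-item intercepts and condition slopes;
# model selection: simple-to-complex, random effects tested with LRTs.
rt ~ condition + trial_order +
  (1 + condition | subject) + (1 + condition | item)
```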
Testing the significance of fixed effects

Researchers familiar with ANOVA will know that significance tests typically require the specification of model and error (denominator) degrees of freedom. Computing degrees of freedom for significance tests in LMMs is a non-trivial problem (Baayen, 2008; Bates, 2006; Luke, 2016). For models with a hierarchical structure it is not clear how to define the denominator degrees of freedom (e.g., by number of observations, number of participants, or number of random effects). As Luke (2016) notes, researchers may prefer to use model comparison with LRTs to evaluate the significance of a fixed effect as this method does not require computation of denominator degrees of freedom. The lme4 package in R (Bates et al., 2015) provides a summary guide to how p-values can be obtained for fitted models (search for help("pvalues") when lme4 is installed), with a number of different options for confidence intervals, model comparison and two named methods for computing degrees of freedom (Kenward-Roger, Satterthwaite). Clearly, one reason why multiple methods for computing p-values appear in the literature is that a variety of options are available.
Luke (2016) used simulations to compare different methods for computing significance in LMMs. In pairwise model comparisons, observed likelihood ratios are associated with p-values under the assumption that the distribution of the LRT statistic approximates the χ2 distribution. Alternatively, the t statistics associated with model coefficients can be treated as z scores, where t > 1.96 effects can be taken to be significant (at the .05 alpha level). Luke (2016) found that interpreting t as z is anti-conservative, especially for small samples of participants and items and, critically, that this risk is independent of the total number of observations because one cannot compensate for small numbers of participants with large numbers of items. In our literature review, LRTs and t-as-z approaches were the most commonly used in published manuscripts. Luke (2016) reports that Satterthwaite and Kenward-Roger approximations, when applied to models estimated with REML, yield relatively robust significance tests across different sample sizes. Following Luke (2016), we recommend the use of these methods when p-values are needed for fixed effects. If researchers want to complete the equivalent of ANOVA omnibus and follow-up tests, they can perform an LRT when a fixed effect is added to the model (omnibus test) and then compute contrasts (the follow-up tests) from the model (see Schad, Vasishth, Hohenstein & Kliegl, 2018, for detailed guidance on performing contrasts in R). In summary, once the final model is established, it can be estimated with REML, and significance tests for model coefficients can be performed using Satterthwaite or Kenward-Roger approximate degrees of freedom.
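In practice this recommendation can be implemented with the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2016), as in the sketch below; summary() then reports Satterthwaite degrees of freedom by default, and Kenward-Roger can be requested, although argument names may vary across package versions (model and variable names hypothetical).

```r
# Sketch: p-values for fixed effects with approximate degrees of freedom.
library(lmerTest)  # provides an lmer() whose summary supports df approximations

m_final <- lmer(rt ~ A * B + (1 + A | subject) + (1 | item), data = dat)  # REML default

summary(m_final)                        # t tests with Satterthwaite df
anova(m_final, ddf = "Kenward-Roger")   # F tests with Kenward-Roger df
```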

Alternatively, some researchers argue for abandoning dichotomous "above or below 0.05" thresholds (Amrhein, Greenland, & McShane, 2019; Wasserstein & Lazar, 2016; Wasserstein, Schirm, & Lazar, 2019). This is in line with a now substantial body of work arguing for a change in how Null Hypothesis Significance Testing (NHST) and frequentist statistics are used: for example, reporting means or coefficient estimates and confidence intervals but not p-values (Cumming, 2013a, 2013b), or interpreting p-values as just another piece of information about the likelihood of the result (Wasserstein, Schirm & Lazar, 2019). We strongly advise readers to familiarize themselves with the American Statistical Association's statement on p-values (Wasserstein & Lazar, 2016).
An increasing number of researchers advocate the adoption of Bayesian analysis methods (Kruschke, 2013; McElreath, 2015) in which estimates for fixed effects coefficients and random effects variances (or covariances) are associated with posterior distributions that allocate varying probabilities to different potential effect values. Researchers familiar with lme4 model syntax (Baayen et al., 2008; Bates et al., 2015) can apply the same syntax to fit Bayesian mixed-effects models (using the brms library, Bürkner, 2017). With Bayesian models, researchers can identify the credible interval encompassing the plausible estimates for an effect (see Vasishth, Nicenboim, Beckman, Li, & Kong, 2018, for a helpful recent tutorial; see Nicenboim & Vasishth, 2018, for an example report in this journal) instead of seeking to test (only) the existence of the effect (Kruschke, 2013). Bayesian model fitting encourages the incorporation of prior beliefs about the varying plausibility of potential estimates for target effects. For example, researchers interested in the effect of word attributes on response latency in reading tasks would, perhaps, suppose a priori that the coefficient for a hypothesized effect in this domain would be captured by an estimate associated with a normal probability distribution centered on 0, with a standard deviation of 10. This quantifies the belief that psycholinguistic effects vary in size and direction, are of the order of tens of milliseconds, and that some hypothesized effects may tend to zero. Relevant to earlier discussion, recent work has shown that problems encountered with convergence for more complex mixed-effects models can be avoided through using Bayesian model fitting given the specification of prior information (Eager & Roy, 2017). Essentially, this is because the incorporation of prior information directs model fitting processes away from extreme values (e.g. random effects variances close to zero) that can cause problems for convergence. Regardless of whether models are fit with frequentist or Bayesian methods, reporting of the modelling process needs to be entirely transparent.
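For example, the prior just described could be specified in brms roughly as follows; this is a sketch in which dat and the variable names are hypothetical, and prior choices should be justified for each application.

```r
# Sketch: a Bayesian mixed-effects model with a Normal(0, 10) prior on the
# fixed-effect coefficients (brms syntax; Bürkner, 2017).
library(brms)

m_bayes <- brm(
  rt ~ A + (1 + A | subject) + (1 + A | item),
  data  = dat,
  prior = prior(normal(0, 10), class = "b")
)
summary(m_bayes)  # posterior estimates with 95% credible intervals
```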
Reporting

The standard for publication should be that other researchers can reproduce the study itself, as well as the study's results, on the basis of the reported method, analysis approach and data (if available) (e.g., Open Science Collaboration, 2015). It is our judgment that many issues arise because of 'under-reporting' – that is, insufficient information provided in publications on the analysis steps (Gelman & Loken, 2013; Silberzahn & Uhlmann, 2015; Simmons et al., 2011) and, for LMMs more specifically, incomplete reporting of model results. Table 7 provides guidance for the reporting of LMMs (more specific guidance on Generalised Linear Mixed-effects Models can be found in Bolker et al., 2009).

We have been asked what to do about the extensive documentation required by what we see as best practice, comprehensive reporting. The simple solution is for researchers to share their data analysis scripts with publication. Scripts show exactly what decisions have been taken and exactly how models were selected and compared. When provided with data, they allow any other researcher to replicate entirely the reported results. Researchers using R may also consider making their whole analysis reproducible (Marwick, Boettiger, & Mullen, 2018). This can be achieved with tools such as Docker, which creates a container (a stand-alone application, Gallagher, 2017). This recreates the complete environment of the original analysis (for a tutorial, see Powell, 2019). The R package holepunch will create a Dockerfile, description and image on GitHub for a particular analysis that can then be run independently (Ram, 2019). For long-term storage of scripts and analysis information there are a number of options where journal space is tight – many institutions provide data storage and archive facilities for their researchers, and the Open Science Framework provides facilities for data storage and archive, as well as pre-registration and project documentation.

Knowing in advance that an analysis script will be shared on publication will likely make researchers more systematic and attentive to their code and annotations in the first place. It should also encourage more supportive discussion (rather than criticism) around analysis processes and best practice methods, and give the less experienced an easy way to learn from experts.
Conclusion

We completed a survey of current practice and a review of published papers for LMMs. Concerns raised in the survey were broadly corroborated by data from a review of published papers. In response to this, we have reviewed current guidelines for the implementation and reporting of LMMs, and provided a summary of best practice. A summary of that summary is provided below. The survey highlighted that many researchers felt they had a lack of knowledge, or were unable to properly deal with the complexity of LMMs. We hope this paper has gone some way to remedying this deficit (perceived or real), and encouraging researchers to spend time preparing analyses in such a way that fully transparent reporting is painless.

Bullet points for best practice

• Plan to collect data for as many stimuli and as many participants as possible.
• Complete power analysis prior to data collection. This will require that you specify the model and consider plausible effect sizes.
• Acknowledge that the choices you make during analysis are considered, justified and one path amongst many.
• During analysis, check that assumptions of LMMs have been met.
• If using LMMs to control for unexplained variance (e.g. when replacing ANOVA), fit random effects first.
• Provide a clear rationale for selection of fixed effects and any model comparison or model selection process.
• Appendix 5 provides example tables for concisely reporting model comparison and model outputs (https://osf.io/bfq39/files/).
• Provide the model equation(s) for the final model or models to be reported.
• If reporting p-values, estimate the final model or models to be reported using REML and report Satterthwaite or Kenward-Roger approximate degrees of freedom for p-values for fixed effect coefficients.
• Report point estimates, standard errors and confidence intervals for the fixed effect coefficients.
• Report random effect variances from the final model in full.
• Whenever possible, share analysis code and data on publication.

CRediT authorship contribution statement

Lotte Meteyard: Conceptualization, Methodology, Funding acquisition, Project administration, Investigation, Supervision, Formal analysis, Writing - original draft, Writing - review & editing. Robert A.I. Davies: Conceptualization, Methodology, Formal analysis, Validation, Resources, Writing - original draft, Writing - review & editing.

Acknowledgements

This work was supported by a British Academy Skill Acquisition Grant SQ120069 and University of Reading 2020 Research Fellowship to LM. We would like to thank all the students, researchers and colleagues who have discussed mixed models with the authors over the last few years, and those who responded to the survey.

Our thanks go also to Elizabeth Volke, who helped LM prepare the survey and collate results whilst visiting the School of Psychology & Clinical Language Sciences, University of Reading, on an Erasmus studentship. LM would like to thank Morgan and Marcella Meteyard Whalley for enabling the time and mental space to get this project started and finished. We are grateful to reviewers and colleagues whose comments improved the manuscript considerably.

Appendix A. Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jml.2020.104092.

References

Aarts, E., Verhage, M., Veenvliet, J. V., Dolan, C. V., & van der Sluis, S. (2014). A solution to dependency: Using multilevel analysis to accommodate nested data. Nature Neuroscience, 17(4), 491–496.
Aho, K., Derryberry, D. W., & Peterson, T. (2014). Model selection for ecologists: The worldview of AIC and BIC. Ecology, 95(3), 631–636.
Amrhein, V., Greenland, S., & McShane, B. (2019). Retire statistical significance. Nature, 567, 305–307.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge University Press.
Baayen, R. H. (2010). A real experiment is a factorial experiment? The Mental Lexicon, 5(1), 149–157.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.
Baayen, R. H. (2013). languageR: Data sets and functions with "Analyzing Linguistic Data: A practical introduction to statistics". R package version 1.4.1. http://CRAN.R-project.org/package=languageR.
Baayen, R. H., & Milin, P. (2015). Analyzing reaction times. International Journal of Psychological Research, 3(2), 12–28.
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133(2), 283.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.
Bates, D. M. (2006). [R] lmer, p-values and all that. Post on the R-help mailing list, May 19th, available at: https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html.
Bates, D. M. (2007). Linear mixed model implementation in lme4. Manuscript, University of Wisconsin - Madison, January 2007.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01.
Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv preprint arXiv:1506.04967.
Bell, B. A., Morgan, G. B., Kromery, J. D., & Ferron, J. M. (2010). The impact of small cluster size on multilevel models: A Monte Carlo examination of two-level models with binary and continuous predictors. JSM Proceedings, Survey Research Methods Section, 1(1), 4057–4067.
Bickel, R. (2007). Multilevel analysis for applied research: It's just regression!. Guilford Press.
Boisgontier, M. P., & Cheval, B. (2016). The ANOVA to mixed model transition. Neuroscience & Biobehavioral Reviews, 68, 1004–1005.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: A practical guide for ecology and evolution. Trends in Ecology & Evolution, 24, 127–135.
Bolker, B. (2015). GLMM. Retrieved August 01, 2016, from http://glmm.wikidot.com/faq.
Bowen, N. K., & Guo, S. (2011). Structural equation modeling. Oxford University Press.
Brauer, M., & Curtin, J. J. (2018). Linear mixed-effects models and the analysis of nonindependent data: A unified framework to analyze categorical and continuous independent variables that vary within-subjects and/or within-items. Psychological Methods, 23, 389–411.
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77–101.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. London, UK: Sage.
Brysbaert, M. (2007). The language-as-fixed-effect-fallacy: Some simple SPSS solutions to a complex problem. London: Royal Holloway, University of London.
Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1(1).
Bürkner, P. C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80, 1–28.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33, 261–304.
Burstein, L., Miller, M. D., & Linn, R. L. (1981). The use of within-group slopes as indices of group outcomes. Center for the Study of Evaluation, Graduate School of Education, UCLA, Los Angeles, California. CSE Report 171.
Carp, J. (2012a). The secret lives of experiments: Methods reporting in the fMRI literature. Neuroimage, 63(1), 289–300.
Carp, J. (2012b). On the plurality of (methodological) worlds: Estimating the analytic flexibility of fMRI experiments. Frontiers in Neuroscience, 6, 149.
Cassidy, S. A., Dimova, R., Giguère, B., Spence, J. R., & Stanley, D. J. (2019). Failing grade: 89% of introduction-to-psychology textbooks that define or explain statistical significance do so incorrectly. Advances in Methods and Practices in Psychological Science, 2(3), 233–239. https://doi.org/10.1177/2515245919858072.
Chabris, C. F., Hebert, B. M., Benjamin, D. J., Beauchamp, J., Cesarini, D., van der Loos, M., ... Laibson, D. (2012). Most reported genetic associations with general intelligence are probably false positives. Psychological Science, 1(23), 1314–1323.
Chang, Y.-H., & Lane, D. M. (2016). Generalizing across stimuli as well as subjects: A non-mathematical tutorial on mixed-effects models. The Quantitative Methods for Psychology, 12(3), 201–219. https://doi.org/10.20982/tqmp.12.3.p201.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7(3), 249–253.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple correlation/regression analysis for the behavioral sciences. UK: Taylor & Francis.
Coleman, E. B. (1964). Generalizing to a language population. Psychological Reports, 14(1), 219–226.
Crawley, M. J. (2012). The R book. John Wiley & Sons.
Cumming, G. (2013a). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge.
Cumming, G. (2013b). The new statistics: Why and how. Psychological Science, 25(1), 7–29.
Cunnings, I. (2012). An overview of mixed-effects statistical models for second language researchers. Second Language Research, 28(3), 369–382.
Davies, R. A. I., Arnell, R., Birchenough, J., Grimmond, D., & Houlson, S. (2017). Reading through the life span: Individual differences in psycholinguistic effects. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43(8), 1298–1338.
Eager, C., & Roy, J. (2017). Mixed effects models are sometimes terrible. arXiv preprint arXiv:1701.04858.
Faraway, J. J. (2016). Extending the linear model with R: Generalized linear, mixed effects and nonparametric regression models. Chapman and Hall/CRC.
Field, A., Miles, Z., & Field, Z. (2009). Discovering statistics using R. London, UK: Sage Publications.
Field, A., & Wright, D. B. (2011). A primer on using multilevel models in clinical and experimental psychopathology research. Journal of Experimental Psychopathology, 2(2), 271–293.
Frisson, S., Koole, H., Hughes, L., Olson, A., & Wheeldon, L. (2014). Competition between orthographically and phonologically similar words during sentence reading: Evidence from eye movements. Journal of Memory and Language, 73, 148–173.
Gallagher, S. (2017). Mastering docker. USA: Packt Publishing.
Gelman, A. (2014). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management, 41(2), 632–643. https://doi.org/10.1177/0149206314525208.
Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180, 967–1033.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge, UK: Cambridge University Press.
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University. Retrieved from http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf.
Goldhammer, F., Naumann, J., Stelter, A., Tóth, K., Rölke, H., & Klieme, E. (2014). The time on task effect in reading and problem solving is moderated by task difficulty and skill: Insights from a computer-based large-scale assessment. Journal of Educational Psychology, 106(3), 608–626.
Goldstein, H. (2011). Multilevel statistical models, Vol. 922. John Wiley & Sons.
Harrell, F. E., Jr (2015). Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis. Springer.
Hox, J. (2010). Multilevel analysis: Techniques and applications. New York, NY: Routledge.
IBM Corp (2013). IBM SPSS Statistics for Windows, Version 22.0. Armonk, NY: IBM Corp.
Ioannidis, J. P. (2005). Why most published research findings are false. Chance, 18(4), 40–47.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434–446.
Jaeger, T. F., Graff, P., Croft, W., & Pontillo, D. (2011). Mixed effect models for genetic and areal dependencies in linguistic typology. Linguistic Typology, 15(2), 281–320.
JASP Team (2016). JASP (Version 0.8.0.0) [Computer software].
Judd, C. M., Westfall, J., & Kenny, D. A. (2012). Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. Journal of Personality and Social Psychology, 103, 54.
Kliegl, R., Nuthmann, A., & Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135, 12–35.
Kliegl, R., Wei, P., Dambacher, M., Yan, M., & Zhou, X. (2011). Experimental effects and individual differences in linear mixed models: Estimating the relationship between spatial, object, and attraction effects in visual attention. Frontiers in Psychology, 1, 238.
Kliegl, R. (2014). Reduction of complexity of linear mixed models with double-bar syntax. RPubs.com/Reinhold/22193.
Kreft, I. G., & de Leeuw, J. (1998). Introducing multilevel modeling. London, UK: Sage.
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S., & Baker, C. I. (2009). Circular analysis in systems neuroscience: The dangers of double dipping. Nature Neuroscience, 12(5), 535–540.
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142, 573–603.
Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2016). lmerTest: Tests in Linear Mixed Effects Models. R package version 2.0-30. http://CRAN.R-project.org/package=lmerTest.

Li, X., Bicknell, K., Liu, P., Wei, W., & Rayner, K. (2014). Reading is fundamentally similar across disparate writing systems: A systematic characterization of how words and characters influence eye movements in Chinese reading. Journal of Experimental Psychology: General, 143(2), 895.
Lieberman, M. D., & Cunningham, W. A. (2009). Type I and Type II error concerns in fMRI research: Re-balancing the scale. Social Cognitive and Affective Neuroscience, 4(4), 423–428.
LimeSurvey Project Team & Schmitz, C. (2015). LimeSurvey: An open source survey tool. LimeSurvey Project, Hamburg, Germany. URL http://www.limesurvey.org.
Linck, J. A., & Cunnings, I. (2015). The utility and application of mixed-effects models in second language research. Language Learning, 65(S1), 185–207.
Locker, L., Hoffman, L., & Bovaird, J. A. (2007). On the use of multilevel modeling as an alternative to items analysis in psycholinguistic research. Behavior Research Methods, 39(4), 723–730.
Lorch, R. F., & Myers, J. L. (1990). Regression analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1), 149.
Luke, S. G. (2017). Evaluating significance in linear mixed-effects models in R. Behavior Research Methods, 49, 1494–1502.
Maas, C. J. M., & Hox, J. J. (2004). The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Computational Statistics & Data Analysis, 46, 427–440.
Maas, C. J. M., & Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3), 86–92.
Magezi, D. A. (2015). Linear mixed-effects models for within-participant psychology experiments: An introductory tutorial and free, graphical user interface (LMMgui). Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00002.
Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). PeerJ Preprints, 6, e2192v2.
MATLAB and Statistics Toolbox Release 2012b, The MathWorks, Inc., Natick, Massachusetts, United States.
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing Type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315.
McCoach, B. D., Rifenbark, G. G., Newton, S. D., Xiaoran, L., Kooken, J., Yomtov, D., ... Bellara, A. (2018). Does the package matter? A comparison of five common multilevel modeling software packages. Journal of Educational and Behavioral Statistics, 43(5), 594–627.
McElreath, R. (2015). Statistical rethinking: A Bayesian course with examples in R and Stan. Chapman and Hall/CRC.
Meteyard, L., & Bose, A. (2018). What does a cue do? Comparing phonological and semantic cues for picture naming in aphasia. Journal of Speech, Language, and Hearing Research, 61(3), 658–674.
Muthén, L. K., & Muthén, B. O. (2011). Mplus user's guide (sixth ed.). Los Angeles, CA.
Murayama, K., Sakaki, M., Yan, V. X., & Smith, G. M. (2014). Type I error inflation in the traditional by-participant analysis to metamemory accuracy: A generalized mixed-effects model perspective. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(5), 1287.
Nagin, D. S., & Odgers, C. L. (2010). Group-based trajectory modeling in clinical research. Annual Review of Clinical Psychology, 6, 109–138.
Nava & Marius (2017). Glmer mixed models inconsistent between lme4 updates. Retrieved July 11, 2019, from https://stackoverflow.com/questions/20963216/glmer-mixed-models-inconsistent-between-lme4-updates.
Nicenboim, B., & Vasishth, S. (2018). Models of retrieval in sentence comprehension: A computational evaluation using Bayesian hierarchical modeling. Journal of Memory and Language, 99, 1–34.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Pashler, H., & Wagenmakers, E. J. (2012). Editors' introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.
Patel, C. J., Burford, B., & Ioannidis, J. P. (2015). Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. Journal of Clinical Epidemiology, 68(9), 1046–1058.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. New York, NY: Springer-Verlag.
Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., & R Core Team (2016). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-128. URL: http://CRAN.R-project.org/package=nlme.
Poldrack, R. A., & Gorgolewski, K. J. (2014). Making big data open: Data sharing in neuroimaging. Nature Neuroscience, 17(11), 1510.
Powell, D. (2019). Conducting reproducible research with Docker (Part 1 of 3). Retrieved from http://www.derekmpowell.com/posts/2018/02/docker-tutorial-1/.
R Core Team (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Raaijmakers, J. G. (2003). A further look at the "language-as-fixed-effect fallacy". Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale, 57(3), 141.
Raaijmakers, J. G., Schrijnemakers, J. M., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory and Language, 41(3), 416–426.
Ram, K. (2019). holepunch. Retrieved 15 August, 2019, from https://karthik.github.io/holepunch/index.html.
Rasbash, J., Charlton, C., Browne, W. J., Healy, M., & Cameron, B. (2009). MLwiN Version 2.1. Centre for Multilevel Modelling, University of Bristol.
Rabe-Hesketh, S., & Skrondal, A. (2012). Multilevel and longitudinal modeling using Stata (3rd ed.). College Station, TX: Stata Press.
Rasbash, J., Browne, W., Goldstein, H., Yang, M., Plewis, I., Healy, M., ... Lewis, T. (2000). A user's guide to MLwiN. London: Institute of Education.
Rietveld, T., & van Hout, R. (2007). Analysis of variance for repeated measures designs with word materials as a nested random or fixed factor. Behavior Research Methods, 39(4), 735–747.
Roach, A., Schwartz, M. F., Martin, N., Grewal, R. S., & Brecher, A. (1996). The Philadelphia naming test: Scoring and rationale. Aphasiology, 24, 121–133.
Rossini, A. J., Heiberger, R. M., Sparapani, R. A., Maechler, M., & Hornik, K. (2004). Emacs speaks statistics: A multiplatform, multipackage development environment for statistical analysis. Journal of Computational and Graphical Statistics, 13(1), 247–261. https://ess.r-project.org/.
Schad, D. J., Vasishth, S., Hohenstein, S., & Kliegl, R. (2018). How to capitalize on a priori contrasts in linear (mixed) models: A tutorial. arXiv preprint arXiv:1807.10451.
Scherbaum, C. A., & Ferreter, J. M. (2009). Estimating statistical power and required sample sizes for organisational research using multilevel modeling. Organizational Research Methods, 12(2), 347–367.
Schluter, D. (2015). Fit models to data. Retrieved August 1, 2016, from https://www.zoology.ubc.ca/~schluter/R/fit-model/.
Silberzahn, R., & Uhlmann, E. L. (2015). Many hands make tight work. Nature, 526(7572), 189.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Snijders, T. A. (2005). Power and sample size in multilevel linear models. In B. S. Everitt, & D. C. Howell (Vol. Eds.), Encyclopedia of statistics in behavioral science: Vol. 3 (pp. 1570–1573). Chichester: Wiley.
Snijders, T. A., & Bosker, R. J. (2011). Multilevel analysis (2nd ed.). London, UK: Sage.
Snijders, T. A., & Bosker, R. J. (1993). Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18(3), 237–259.
Stan Development Team (2016). Stan modeling language users guide and reference manual, Version 2.14.0. http://mc-stan.org.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.
Stevenson, C. E., Hickendorff, M., Resing, W. C., Heiser, W. J., & de Boeck, P. A. (2013). Explanatory item response modeling of children's change on a dynamic test of analogical reasoning. Intelligence, 41(3), 157–168.
Th. Gries, S. (2015). The most under-used statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora, 10(1), 95–125.
Tremblay, A., & Newman, A. J. (2015). Modeling nonlinear relationships in ERP data using mixed-effects regression with R examples. Psychophysiology, 52(1), 124–139.
Trueswell, J. C., Medina, T. N., Hafri, A., & Gleitman, L. R. (2013). Propose but verify: Fast mapping meets cross-situational word learning. Cognitive Psychology, 66(1), 126–156.
Vasishth, S., Nicenboim, B., Beckman, M. E., Li, F., & Kong, E. J. (2018). Bayesian data analysis in the phonetic sciences: A tutorial introduction. Journal of Phonetics, 71, 147–161.
Venables, W. N. (2014). S-PLUS and S. Wiley StatsRef: Statistics Reference Online.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274–290.
Wager, T. D., Lindquist, M., & Kaplan, L. (2007). Meta-analysis of functional neuroimaging data: Current and future directions. Social Cognitive and Affective Neuroscience, 2(2), 150–158.
Walls, T. A., & Schafer, J. L. (2006). Models for intensive longitudinal data. New York, NY: Oxford University Press.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond "p < 0.05". Editorial. The American Statistician, 73(sup1), 1–19.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.
West, B. T., & Galecki, A. T. (2011). An overview of current software procedures for fitting linear mixed models. The American Statistician, 65(4), 274–282.
Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020.
Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv preprint arXiv:1308.5499.
Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B), 73(1), 3–36.
Wood, S. N., & Scheipl, F. (2016). gamm4: Generalized additive mixed models using 'mgcv' and 'lme4'. R package version 0.2-4. http://CRAN.R-project.org/package=gamm4.
Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. Springer Science & Business Media.
Zwaan, R. A., Magliano, J. P., & Graesser, A. C. (1995). Dimensions of situation model construction in narrative comprehension. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 386–397.
