Data Visualization and Analysis in Second Language Research
The Second Language Acquisition Research Series presents and explores issues
bearing directly on theory construction and/or research methods in the study
of second language acquisition. Its titles (both authored and edited volumes)
provide thorough and timely overviews of high-interest topics and include
key discussions of existing research findings and their implications. A special
emphasis of the series is reflected in the volumes dealing with specific data col-
lection methods or instruments. Each of these volumes addresses the kinds of
research questions for which the method/instrument is best suited, offers
extended description of its use, and outlines the problems associated with its
use. The volumes in this series will be invaluable to students and scholars
alike and perfect for use in courses on research methodology and in individual
research.
DATA VISUALIZATION
AND ANALYSIS IN
SECOND LANGUAGE
RESEARCH
Guilherme D. Garcia
First published 2021
by Routledge
605 Third Avenue, New York, NY 10158
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2021 Taylor & Francis
The right of Guilherme D. Garcia to be identified as author of this work
has been asserted by him in accordance with sections 77 and 78 of the
Copyright, Designs and Patents Act 1988
All rights reserved. No part of this book may be reprinted or reproduced
or utilised in any form or by any electronic, mechanical, or other means,
now known or hereafter invented, including photocopying and
recording, or in any information storage or retrieval system, without
permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or
registered trademarks, and are used only for identification and
explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested
CONTENTS
List of Figures x
List of Tables xii
List of Code Blocks xiii
Acknowledgments xvi
Preface xviii
Part I
Getting Ready 1
1 Introduction 3
1.1 Main Objectives of This Book 3
1.2 A Logical Series of Steps 5
1.2.1 Why Focus on Data Visualization Techniques? 5
1.2.2 Why Focus on Full-Fledged Statistical Models? 6
1.3 Statistical Concepts 7
1.3.1 p-Values 7
1.3.2 Effect Sizes 9
1.3.3 Confidence Intervals 10
1.3.4 Standard Errors 11
1.3.5 Further Reading 12
2 R Basics 14
2.1 Why R? 14
2.2 Fundamentals 16
Part II
Visualizing the Data 61
3 Continuous Data 63
3.1 Importing Your Data 65
3.2 Preparing Your Data 66
3.3 Histograms 68
3.4 Scatter Plots 70
3.5 Box Plots 75
3.6 Bar Plots and Error Bars 77
3.7 Line Plots 80
3.8 Additional Readings on Data Visualization 82
3.9 Summary 82
3.10 Exercises 83
4 Categorical Data 86
4.1 Binary Data 88
Part III
Analyzing the Data 107
Glossary 253
References 257
Subject Index 261
Function Index 263
ACKNOWLEDGMENTS
I wrote most of this book during the COVID-19 pandemic in 2020—in part
thanks to a research grant I received from Ball State University (ASPiRE Junior
Faculty Award). I was fortunate to be engaged in a project that not only inter-
ested me a lot but could be executed from home—and that kept me sufficiently
busy during that stressful year. Thankfully, writing this book was perfectly
compatible with the need for social distance during the pandemic.
My interest in quantitative data analysis concretely started during my first
years as a graduate student at McGill University, back in 2012. During my
PhD (2012–2017), I developed and taught different workshops involving
data analysis and the R language in the Department of Linguistics at McGill
as well as in the Department of Education at Concordia University. After
joining Ball State in 2018, I also organized different workshops on data analysis.
This book is ultimately the result of those workshops, the materials I developed
for them, and the feedback I received from numerous attendees from different
areas (linguistics, psychology, education, sociology, and others).
Different people played a key role in this project. Morgan Sonderegger was
the person who first introduced me to R through graduate courses at McGill.
Natália Brambatti Guzzo provided numerous comments on the book as a
whole. This book also benefitted from the work of three of my students,
who were my research assistants over the summer of 2020: Jacob Lauve,
Emilie Schiess, and Evan Ward. They not only tested all the code blocks in
the book but also had numerous suggestions on different aspects of the manu-
script. Emilie and Jacob were especially helpful with the packaging of some sec-
tions and exercises. Different chapters of this book benefitted from helpful
feedback from Jennifer Cabrelli, Ronaldo Lima Jr., Ubiratã Kickhöfel Alves,
Morgan Sonderegger, Jiajia Su, Lydia White, and Jun Xu.
The chapter on Bayesian data analysis is an attempt to incentivize the use of
Bayesian data analysis in the field of second language research (and in linguistics
more generally). My interest in Bayesian statistics began in 2015, and a year
later I took a course on Bayesian data analysis with John Kruschke at Univer-
sität St. Gallen. The conversations we had back then were quite helpful, and his
materials certainly influenced my views on statistics and data visualization as a
whole.
Finally, I wish to thank the team at Routledge, especially Ze’ev Sudry and
Helena Parkinson, with whom I exchanged dozens of emails throughout
2019 and 2020. I’m grateful for their patience and work while copy-editing
and formatting this book.
PREFACE
In this book, we will explore quantitative data analysis using visualization tech-
niques and statistical models. More specifically, we will focus on regression analysis.
The main goal here is to move away from t-tests and ANOVAs and towards full-
fledged statistical models. Everything we do will be done using R (R Core Team
2020), a language developed for statistical computing and graphics. No back-
ground in R is necessary, as we will start from scratch in chapter 2 (we will
also go over the necessary installation steps).
The book is divided into three parts, which follow a logical sequence of
steps. In Part I, we will get you started with R. In Part II, we will explore
data visualization techniques (starting with continuous data and then moving
on to categorical data). This will help us understand what’s going on in our
data before we get into the statistical analysis. Finally, in Part III, we will
analyze our data using different types of statistical models—Part III will also
introduce you to Bayesian statistics. Much like Part II, Part III also covers con-
tinuous data before categorical data.
It’s important to note that this is not a book about statistics per se, nor is it
written by a statistician: I am a linguist who is interested in phonology and
second language acquisition and who uses statistical methods to better under-
stand linguistic data, structure, and theory. Simply put, then, this is a book
about the application of statistical methods in the field of second language acqui-
sition and second language research.
Intended Audience
This book is for graduate students in linguistics as well as second language acqui-
sition researchers more generally, including faculty who wish to update their
quantitative methods. Given that we will focus on data visualization and statistical
models, qualitative methods and research design will not be discussed.
The data used in the book will focus on topics which are specific to
linguistics and second language research. Naturally, that doesn’t mean your
research must be aligned with second language acquisition for this book to
be beneficial.
Background Knowledge
As mentioned earlier, this book does not expect you to be familiar with R. You
are expected to be somewhat familiar with basic statistics (e.g., means, standard
deviations, medians, sampling, t-tests and ANOVAs). We will briefly review
the most crucial concepts in §1.3 later, and they will be discussed every
now and then, but some familiarity is desirable. Bear in mind that the
whole idea here is to move away from t-tests and ANOVAs towards more
comprehensive, flexible, and up-to-date methods, so you should not worry
too much if your memory of t-tests and ANOVAs is not so great.
Many suggestions can be easily found online, and combining those resources
with the present book will certainly strengthen your understanding of the
topics we will explore.
As you implement the R code in this book, you might run into warning or
error messages in R. Most of the time, warning messages are simply informing
you about something that just happened, and you don’t need to do anything.
Error messages, on the other hand, will interrupt your code, so you will need to
fix the problem before proceeding. Appendix A (Troubleshooting) can help
with some of those messages. Alternatively, simply googling a message will
help you figure out what’s going on—the R community online is huge and
very helpful. All the code in this book has been tested for errors on computers
running Mac OS, Linux Ubuntu, and Windows 10, so you should not have
any substantial issues.
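To make the distinction concrete, here is a minimal sketch (using only base R functions): a warning still returns a result, whereas an error stops the computation.
R code
# A warning: R completes the operation but flags a potential issue
log(-1)    # returns NaN along with the warning "NaNs produced"

# An error: R stops and returns nothing. The line below is commented
# out so that the script still runs; uncommenting it would produce
# "Error: object 'missingObject' not found"
# mean(missingObject)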
Words in a monospaced font will look different from typical words. These will be used to represent code in general, as it's easier to
read code with such fonts. As a result, all the code blocks in the book will
contain monospaced fonts as well. Words in bold represent terms that may
not be familiar and which are explained in the glossary at the end of this
book. Lastly, a list with the main statistical symbols and acronyms used in
this book can be found in Appendix C.
When we discuss coding in general, we tend to use words and constructions
which are not necessarily used in everyday language and which may be new to
people who have never seen, studied, or read about code in general. For
example, files that contain code are typically referred to as scripts (as mentioned
earlier). Scripts contain several lines of code as well as comments that we may
want to add to them (general notes). When we run a line of code we are essen-
tially asking the computer to interpret the code and do or return whatever the
code is requesting. You will often see the word directory, which simply means
“folder where the relevant files are located”.
In our scripts, we will almost always create variables that hold information for
us (just like variables in math), so that later we can call these variables to access the
information they hold. For example, when we type and run city = "Tokyo" in
R, we create a variable that holds the value Tokyo in it. This value is simply a
word in this case, which we can also refer to as a string. Later, if we just run city,
we will get Tokyo, that is, its content, printed on the screen. When we import
our data into R, we will always assign our data to a variable, which will allow us
to reference and work on our dataset throughout our analysis.
A line of code is our way of asking R to do something for us: instead of click-
ing on buttons, we will write a line of code. For example, we may want to store
information (e.g., city) or evaluate some expression (e.g., 5 + 3). In the latter
case, R will print the answer on the console—that is, it will display the answer
in a specific area of your screen. If R finds a problem, it will throw an error
message on the screen.
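As a brief preview (we will do all of this properly starting in chapter 2), the two examples just mentioned look like this in an R script:
R code
city = "Tokyo"   # store information in a variable
city             # running this line prints "Tokyo" in the console

5 + 3            # evaluate an expression; the console shows [1] 8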
Lines of code throughout this book will involve commands or functions.
Functions always have a name and brackets and are used to perform a task
for us: for example, the function print() will print an object on the screen.
The object we want to print needs to go inside the function as one of its argu-
ments. Each function expects different types of arguments (they vary by func-
tion), and if we fail to include a required argument in a function, we will
usually get an error (or a warning). This is all very abstract right now, but
don’t worry: we will explore all these terms in detail starting in chapter 2,
which introduces the R language—and as you work your way through the
book they will become familiar (and concrete).
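Here is a minimal sketch of these ideas (print() and sqrt() are both base R functions):
R code
print("Tokyo")   # the string "Tokyo" is an argument of print()
sqrt(16)         # sqrt() expects a number as its argument; returns 4

# Omitting a required argument usually results in an error, so the
# line below is commented out:
# sqrt()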
Finally, all the important points discussed earlier can be found in the summary
here. I hope that your own research can benefit from this book and that you
enjoy learning how to visualize and analyze your data using R.
PART I
Getting Ready
1
INTRODUCTION
We will explore three types of models—yes, only three. They will handle
three different types of data and will equip you with the most fundamental
tools to analyze your data. Most of what we do in L2 research involves contin-
uous data (e.g., scores, reaction time), binary data (e.g., yes/no, correct/
incorrect), and ordinal data (e.g., Likert scales). By the end of this book, we
will have examined models that can handle all three situations.
You may have noticed that almost all statistical analyses in our field rely on p-
values. This is a central characteristic of Frequentist statistics (i.e., traditional sta-
tistical methods). But there is another possibility: Bayesian data analysis, where
p-values do not exist. This book has a chapter dedicated to Bayesian statistics
(chapter 10) and discusses why second language research can benefit from
such methods.
This book assumes that you have taken some introductory course on research
methods. Thus, concepts like mean, median, range, and p-values should sound
familiar. You should also know a thing or two about ANOVAs and t-tests.
However, if you think you may not remember these concepts well enough,
§1.3 will review the important bits that will be relevant to us here. If you think
you don’t have enough experience running t-tests, that’s OK: if this book is suc-
cessful, you will not use t-tests anyway.
1.3.1 p-Values
The notion of a p-value is associated with the notion of a null hypothesis (H0),
for example that there’s no difference between the means of two groups. p-values
are everywhere, and we all know about the magical number: 0.05. Simply put,
p-values mean the probability of finding data at least as extreme as the data we have—
assuming that the null hypothesis is true.2 Simply put, they measure the extent to
which a statistical result can be attributed to chance. For example, if we
compare the test scores of two groups of learners and we find a low (=significant)
p-value (p = 0.04), we reject the null hypothesis that the mean scores between
the groups are the same. We then conclude that these two groups indeed come
from different populations, that is, their scores are statistically different because
the probability of observing the difference in question if chance alone generated
the data is too low given our arbitrarily set threshold of 5%.
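If you want to see where a p-value comes from in practice, here is a minimal sketch you can revisit once R is installed (chapter 2). The scores are simulated, and the group sizes and means are arbitrary; we will move away from t-tests later in the book, but they are a familiar place to observe a p-value.
R code
set.seed(7)                               # make the simulation reproducible
groupA = rnorm(30, mean = 75, sd = 10)    # 30 simulated test scores
groupB = rnorm(30, mean = 82, sd = 10)    # 30 scores from another group

t.test(groupA, groupB)                    # the output includes a p-value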
How many times have you heard or read that a p-value is the probability that
the null hypothesis is true? This is perhaps one of the most common
hypotheses which are compatible with the data and which are more likely to be
true than an implausible hypothesis. Nuzzo (2014, p. 151) provides a useful
example: if our initial hypothesis is unlikely to begin with (say, 19-to-1 odds
against it), even if we find p = 0.01, the probability that our hypothesis is
true is merely 30%.
Problems involving p-values have been known for decades—see, for example,
Campbell (1982, p. 698). For that reason, many journals these days will require
more detailed statistical results: whereas in the past a p-value would be sufficient
to make a point (i.e., old statistics), today we expect to see effect sizes and con-
fidence intervals as well (i.e., new statistics). That’s one of the reasons that we
will focus our attention on effect sizes throughout this book.
make you question whether all the hours invested in recording all of your classes
were worth it.
yes: the true difference, as we saw earlier, is 1.98 points, and 1.98 is in the inter-
val [1.13, 3.00].
Confidence intervals are important because they provide one more level of
information that complements the effect size. Because we are always dealing
with samples and not entire populations, clearly we can’t be so sure that the
answer to our questions are as accurate as a single number, our effect size. Con-
fidence intervals add some uncertainty to our conclusions—this is more realis-
tic. Once we have our 95% confidence interval, we can examine how wide it
is: the wider the interval, the more uncertainty there is. Wider intervals can
mean different things; perhaps we don’t have enough data, or perhaps there’s
too much variation in our data.
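These ideas can be sketched in a few lines of R (the sample below is simulated; recall that roughly 1.96 standard errors around the mean gives an approximate 95% confidence interval under normality):
R code
set.seed(7)
scores = rnorm(40, mean = 80, sd = 10)    # a hypothetical sample

se = sd(scores) / sqrt(length(scores))    # standard error of the mean
mean(scores) + c(-1.96, 1.96) * se        # approximate 95% confidence interval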
deviation of these means, you’ll get the standard error of the sample mean,
which quantifies the variation in the means from multiple samples. The
larger the sample size of our samples (here n = 5), the lower the standard
error will tend to be.5 The more data you collect from a population, the
more accurate your estimate will be of the true mean of that population,
because the variation across sample means will decrease.
Given the example from the previous paragraph, you might think that the
only way to estimate the standard error is to collect data multiple times. Fortu-
nately, that is not the case. We saw in the first paragraph of this section that we
can estimate the standard error even if we only have one sample by dividing the
standard deviation of the sample by the square root of the sample size
(SE = s/√n). This calculation works well assuming that we have a sufficiently
large sample size. If that’s not the case, we can alternatively bootstrap our stan-
dard error. Bootstrapping involves randomly (re)sampling from our own sample
(instead of the population). This allows us to have a sampling distribution of the
sample means even if our sample size is not ideal. We then take the standard
deviation of that distribution, as described earlier. Finally, note that we can
also calculate the standard error of other statistics (e.g., the median) by using
the same methods.
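Both approaches can be sketched in a few lines of R. The sample below is simulated, and 1,000 resamples is an arbitrary (but common) choice:
R code
set.seed(7)
x = rnorm(20, mean = 80, sd = 10)   # a single (small) hypothetical sample

# Analytic standard error: s divided by the square root of n
sd(x) / sqrt(length(x))

# Bootstrapped standard error: resample from x with replacement,
# compute the mean of each resample, then take the SD of those means
bootMeans = replicate(1000, mean(sample(x, replace = TRUE)))
sd(bootMeans)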
In summary, while the standard deviation tells us about the variation of raw
data, the standard error (from the mean) tells us about the estimated variation of
means. Both statistics are informative, so we could in principle have both of
them in a figure.
Notes
1. In reality, of course, sampling is often not random (e.g., many of our participants are
students on campus or live nearby).
2. And assuming that the p-values were computed appropriately.
3. Note that we cannot prove that the null hypothesis is true in Frequentist statistics. As
a result, the absence of a difference between two groups doesn’t prove that there is no
difference.
4. You may remember that in a normal (Gaussian) distribution, 95% of the area under
the curve lies within 1.96 standard deviations from the mean.
5. But note that we divide s by √n, not by n. Therefore, while it is true that SE is inversely proportional to √n, doubling your sample size will not halve your SE.
2
R BASICS
2.1 Why R?
R (R Core Team 2020) is a computer language based on another computer
language called S. It was created in New Zealand by Ross Ihaka and Robert
Gentleman in 1993 and is today one of the most (if not the most) powerful
tools used for data analysis. I assume that you have never heard of or used R
and that therefore R is not installed on your computer. I also assume that
you have little or no experience with programming languages. In this
chapter, we will discuss everything you need to know about R to understand
the code used in this book. Additional readings will be suggested, but they are
not required for you to understand what is covered in the chapters to come.
You may be wondering why the book does not employ IBM’s SPSS, for
example, which is perhaps the most popular statistical tool used in second lan-
guage research. If we use Google Scholar citations as a proxy for popularity, we
can clearly see that SPSS was incredibly popular up until 2010 (see report on
http://r4stats.com/articles/popularity/). In the past decade, however, its
popularity has seen a steep decline. Among its limitations are a subpar graphics
system, slow performance across a wide range of tasks, and its inability to handle
large datasets effectively.
There are several reasons that using R for data analysis is a smart decision.
One reason is that R is open-source and has a substantial online community.
Being open-source, different users can contribute packages to R, much like
different Wikipedia users can contribute new articles to the online encyclope-
dia. A package is basically a collection of tools (e.g., functions) that we can
use to accomplish specific goals. As of October 2020, R had over 15,000
packages, so chances are that if you need to do something specific in your
analysis, there is a package for that—naturally, we only need a fraction of
these packages. Having an active online community is also important, as
users can easily and quickly find help in forum threads.
Another reason that R is advantageous is its power. First, because R is a lan-
guage, we are not limited by a set of preestablished menu options or buttons. If
we wish to accomplish a goal, however specific it may be, we can simply create
our own functions. Typical apps such as SPSS have a more user-friendly Graph-
ical User Interface (GUI), but that can certainly constrain what you can do with
the app. Second, because R was designed specifically for data analysis, even the
latest statistical techniques will be available in its ecosystem. As a result, no
matter what type of model you need to run, R will likely have it in the
form of a package.
R is also fast, and most people would agree that speed is important
when using a computer to analyze data. It is accurate to say that R is faster
than any tool commonly used in the field of SLA. This difference in speed is
easy to notice when we migrate from software such as SPSS, which has a
GUI, to a computer language, which instead has a command line. In R, we
will rarely use our mouse: almost everything we do will be done using the
keyboard.
Last but not least, R makes it easy to reproduce all the steps in data analysis,
an advantage that cannot be overstated as we move towards open science.
Reproducibility is also pedagogically valuable—whenever you see a block of
code in this book, you can run it in R, and you will be able to follow all
the steps in the analysis exactly. This efficiency in reproducing analytical steps
is possible because R, being a language, relies on lines of code as opposed to clicks on buttons and menus. Having a detailed script that contains all the steps you took in your analysis also means that you can go back to your study a year later and be sure to understand exactly what your analysis was doing.
In some ways, using R is like using the manual mode on a professional
camera as opposed to using your smartphone: you may have to learn a few
things about photography first, but a manual mode puts you in charge,
which in turn results in better photos.
2.2 Fundamentals
Installing R
1. Go to https://cloud.r-project.org
2. Choose your operating system
If you use Linux: choose your distro and follow the instructions
If you use Mac OS: download the latest release (a .pkg file)
If you use Windows: click on base and download R (an .exe file)
For help, visit https://cran.r-project.org/bin/windows/base/rw-FAQ.html
Installing RStudio
1. Go to https://rstudio.com and click on “Download RStudio”
2. Choose the free version and click “Download”
3. Under “Installers”, look for your operating system
You should now have both R and RStudio installed on your computer. If you
are a Mac user, you may also want to install XQuartz (https://www.xquartz.org)
—you don’t need to do it now, but if you run into problems generating figures
or using different graphics packages later on, installing XQuartz is the solution.
Because we will use RStudio throughout the book, in the next section, we
will explore its interface. Finally, RStudio can also be used online at http://rstudio.cloud for free (as of August 2020), which means you technically don't
need to install anything. That being said, this book (and all its instructions) is
based on the desktop version of RStudio, not the cloud version—you can
install R and RStudio and then later use RStudio online as a secondary tool.
For reference, the code in this book was last tested using R version 4.0.2
(2020-06-22)—“Taking Off Again” and RStudio Version 1.3.1073 (Mac
OS). Therefore, these are the versions on which the coding in this book is based.
2.2.2 Interface
Once you have installed both R and RStudio, open RStudio and click on File > New File > R Script. Alternatively, press Ctrl + Shift + N (Windows) or
Cmd + Shift + N (Mac)—keyboard shortcuts in RStudio are provided in
Appendix B. You should now have a screen that looks like Fig. 2.1. Before
we explore RStudio’s interface, note that the interface is virtually the same
for Mac, Linux, and Windows versions, so while all the examples given in
this book are based on the Mac version of RStudio, they also apply to any
Linux and Windows versions of RStudio. As a result, every time you see a key-
board shortcut containing Cmd, simply replace that with Ctrl if you are using a
Linux or Windows version of RStudio.
What you see in Fig. 2.1 is that RStudio’s interface revolves around different
panes (labeled by dashed circles). Panes B, C, and D were visible when you first
opened RStudio—note that their exact location may be slightly different on
your RStudio and your operating system. Pane A appeared once you created
a new R script (following the earlier steps). If you look carefully, you will
note that a tab called Untitled1 is located at the top of pane A—immediately
below the tab you see a group of buttons that include the floppy disk icon
for saving documents. Much like your web browser, pane A supports multiple
tabs, each of which can contain a file (typically an R script). Each script can
contain lines of code, which in turn means that each script can contain an anal-
ysis, parts of an analysis, or multiple analyses. If you hit Cmd + Shift + N to
create another R Script, another tab will be added to pane A. Next, let’s
examine each pane in detail.
Pane A is probably the most important pane in RStudio. This is the
pane where we will write our analysis and our comments, that is, this is RStu-
dio’s script window. By the end of this book, we will have written and run
several lines of code in pane A. For example, click on pane A and write 2 +
5. This is your first line of code, which is why you see 1 on the left margin of
pane A. Next, before you hit enter to go to the next line, run that line of
code by pressing Cmd + Enter. You can also click on the Run button to the
left of Source in Fig. 2.1. You should now see the result of your calculation
in pane B: [1] 7.
Pane B is RStudio’s console, that is, it is where all your results will be printed.
This is where R will communicate with you. Whereas you will write your ques-
tions (in the form of code) in pane A, your answers will appear in pane B—
when you ran line 1 earlier, you were asking a simple math question in pane
A and received the calculated answer in pane B. Finally, note that you can
run code directly in pane B. You could, for example, type 2 + 5 (or 2+5 without spaces) in pane B and hit Enter, which would produce the same
output as before. You could certainly use pane B for quick calculations and
simple tasks, but for an actual analysis with several lines of code and comments,
you certainly want the flexibility of pane A, which allows you to save your script
much like you would save a Word document.
Pane C has three or four visible tabs when you first open RStudio—it can vary
depending on the version of RStudio you are using, but we will not get into
that in this book. The only tab we care about now is Environment. Pane C is
where your variables will be listed (e.g., the objects we import and/or create in
our analyses, such as our datasets). If that does not mean much to you right
now, don’t worry: it will be clear why that is important very soon.
Pane D is important for a number of useful tasks—there are five tabs in pane
D in Fig. 2.1. The first tab (Files) lists the files and folders in the current direc-
tory. This is important for the following reason: R assumes that you are
working in a particular folder on your computer. That folder is called
the working directory. If you wish to load some data, R will assume that the
data is located in the current working directory. If it cannot find it there, it
will throw an error on your console (pane B). You could either move your
data to the working directory, or you could tell R exactly where your data
file is located—we will see how to do that soon (§2.4).
The next tab in pane D is Plots. This is where all our figures will appear once
we start working on data visualization in chapter 3. We then have Packages,
which is a list of packages that we can use in our analysis. As mentioned
before, R has thousands and thousands of packages, but most of what we
will do in this book will just require a handful of powerful packages. We
don’t actually need to use this tab to install or load packages; we can do all
that from panes A or B by writing some lines of code (§2.2.3).
Finally, pane D also has a tab called Help and a tab called Viewer. Whenever
you ask for help in R by running help(…), you will be taken to the Help tab.
The Viewer tab is mostly used because RStudio allows us to work with several
types of documents, not just R scripts. For example, you can use RStudio as
your text editor and choose to produce a PDF file. That file would appear
in the Viewer tab (or in a separate window)—we won’t explore this pane in
this book.
RStudio panes are extremely useful. First, they keep your work environment
organized, since you do not need multiple windows floating on your screen. If
you use R’s native editor, panes A and B are actually separate windows, as is the
window for your figures. RStudio, in contrast, organizes everything by adding
panes to a single environment. A second advantage of RStudio’s panes is that
you can hide or maximize all panes. If you look again at Fig. 2.1, you will
notice that panes A–D have two icons at the top right-hand corner. One
icon looks like a box, and one looks like a compressed box. If you click on
them, the pane will be maximized or minimized, respectively. For
example, if you choose to maximize pane D, it will be stretched vertically,
thus hiding pane C—you can achieve the same result by minimizing pane
C. This flexibility can be useful if you wish to focus on your script (pane A)
and hide pane B while you work on your own comments, or you could hide
pane D because you are not planning to generate figures at the moment. These
possibilities are especially important if you are working from a small screen.
If you like to customize your working environment, you can rearrange the panes in Fig. 2.1. Simply go to RStudio > Preferences (or hit Cmd + ,) on a Mac. On Linux or Windows, go to Tools > Global Options…. Next, click on Pane Layout. I do not recommend making any
changes now, but later on you may want to make some adjustments to
better suit your style. Finally, you can also change the appearance of
RStudio by going to RStudio > Preferences again and clicking on Appearance (on a Mac) or by going to Tools > Global Options… > Appearance (on
Windows). Many people (myself included) prefer to write code on a darker
background (one great example being the “Monokai” theme). If you choose
Monokai from the list under Editor theme, you can see what that looks like.
As with the panes, you may want to wait until you are comfortable with
RStudio to decide whether you want to change its looks.
Now that you are familiar with RStudio’s panes, we can really start looking
into R. The next section will help you learn the fundamental aspects of the R
language. We will explore everything you need to know about R to use this
book efficiently.
2.2.3 R Basics
For this section, we will create one R script that contains a crash course in R,
with code and comments that you will add yourself. First, let’s create a folder
for all the files we will create in this book. Call it bookFiles. Inside that folder,
create another folder called basics—this is where we will work in this chapter.
Second, make sure you create a new script (you can use the one we created
earlier if you haven’t closed it yet). In other words, your screen should look
like Fig. 2.1. Third, save your empty script. Go to File > Save (or hit Cmd
+ S on your keyboard). Choose an intuitive name for your file, such as
rBasics, and save it inside basics, the folder you just created. RStudio will
save it as rBasics.R—all R scripts are .R files. You will probably want to
save your script every time you add a new line of code to it, just in case.
Finally, all the code in question should be added to the newly created
rBasics.R, so at the end of this section you will have a single file that you
can go back to whenever you want to review the fundamentals.
The first thing we will do is write a comment at the top of our script (you
can delete 2 + 5 from your script if it’s still there). Adding comments to our
scripts is important not only when we are learning R but also later, when
we are comfortable with the language. In a couple of months, we will likely
not remember what our lines of code are doing anymore. Even though
some lines are self-explanatory, our code will become more complex as we
explore later chapters in this book. Having informative comments in our scripts
will help us understand what our past self did—and it will also help others
reproduce our analyses. Comments in R must begin with a hashtag (#).
Let’s add a general comment at the top of our script (line 1), which will be
a “title” for our script: # R Basics. If you try to run this line, R will simply print "# R Basics" in your console (pane B).
FILE ORGANIZATION
There are ten code blocks in this chapter, which will be added to five dif-
ferent scripts—rBasics.R was our first script, so we will create four other
scripts in this chapter. All five scripts should be located in your basics
folder, which in turn is located in the bookFiles folder we created earlier.
Remember to save your scripts after making changes to them (just like
you would save any other file on your computer). Consult Appendix D if
you wish to see how all the files in this book are organized or to which
scripts you should add different code blocks.
Comments can also help with the overall organization of our R scripts. Some
of the scripts we will create in this chapter (and in later chapters) will have mul-
tiple code blocks in them. For instance, code blocks 1, 2, and 3 should all go in
rBasics.R. To keep the blocks visually separated inside the script, it’s a good
idea to add comments to the script to identify each code block for future ref-
erence. A suggestion is shown here, where a “divider” is manually created with
a # and a series of equal signs (=====)—recall that everything that is to the
right of a hashtag is a comment and so is not interpreted by R. We will discuss
file organization again later (§2.4.2) and once more in chapter 3.
R code
1 # ========== CODE BLOCK X, CH. X
2
3 # code block content goes here
4
5 # ========== END OF CODE BLOCK
6
7 # ========== CODE BLOCK X, CH. X
8
9 # code block content goes here
10
11 # ========== END OF CODE BLOCK
2.2.3.1 R as a Calculator
We saw earlier that R can be used as a powerful calculator. All the common
mathematical operations are easy to remember: division (5 / 2), multiplication
(5 * 2), addition (5 + 2), and subtraction (5 – 2). Other math operations
include exponentiation (5 ** 2 or 5 ^ 2), modulus5 (5 %% 2 = 1), and
integer division (5 %/% 2 = 2). Let’s add all these operations to our script
and check their outputs. In Fig. 2.2, you can see our new script, rBasics.R,
starting with a comment R basics in line 1. Line 2 is empty to create some
white space, and line 3 creates a divider.
In line 5, we have a short comment explaining what the following line (6)
does. This is obvious here, but it can save you some time in situations where
the meaning of a function is not as apparent to you. Note that the cursor is
currently in line 6. You can press Cmd + Enter anywhere in a given line
and RStudio will run that line (it will also move to the next line automatically
for you). Alternatively, you can also run multiple lines by selecting them and
then pressing Cmd + Enter. In Fig. 2.2, I have already run all lines, as you can
see in the console—note that the output of a comment is the comment itself.
In the console, we will see, for example, that 5 %% 2 is 1 (the remainder of
5 ÷ 2) and that 5 %/% 2 is 2 (the integer that results from 5 ÷ 2). You don't
have to reproduce the script in the figure exactly; this is just an example
of how you could add some math operations and some comments to an
R script.
As with any decent calculator, you can also run more complex calculations. For example, choose any available line in rBasics.R and type 4 * sqrt(9) / (2 ** 5) - pi and run it (your output should be −2.766593)—as you open brackets, notice that RStudio will automatically close them for you. You can also type it in your console and hit Enter to run the code. Here we see two functions for the first time, namely, sqrt(9) (√9) and pi (π).
FIGURE 2.2 Starting Your First Script with RStudio: Math Operations
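If you would rather not copy the script from the figure by hand, the lines below cover the same operations (this is only one way to lay out the script; expected outputs are given as comments):
R code
# R basics

# ==========
5 / 2      # division: 2.5
5 * 2      # multiplication: 10
5 + 2      # addition: 7
5 - 2      # subtraction: 3
5 ^ 2      # exponentiation: 25 (same as 5 ** 2)
5 %% 2     # modulus: 1, the remainder of 5 / 2
5 %/% 2    # integer division: 2

4 * sqrt(9) / (2 ** 5) - pi   # a more complex calculation: -2.766593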
When you create a variable, you will notice that nothing is actually printed
in your console (except the variable assignment itself). But if you now look at
pane C (Environment), you will notice x is there, holding the value 10 (i.e., the
result of 5 * 2). From now on, you can simply type x and run it, and R will
print the value it holds (10). If you now type x = x + 10 in the next line of
your script, you are telling R to “update” the value of x, which will
now be 20 (x = 10 + 10). If you want to go back to what x was before,
you can rerun the line where you specify x = 5 * 2 and voilà. By having dif-
ferent lines of code, you can go back to any stage of your analysis at any time. If
you are used to Cmd + Z in other applications, having a script gives you all the
redos you will ever need, with the advantage that you can choose which part of
your analysis you wish to jump to by rerunning specific lines of code—the
History tab in pane C will list all the lines of code we have already run. This is illustrated in lines 1 and 2 in code block 1.
Assigning simple operations or single numbers to variables is useful, but we
need to be able to assign multiple values to a variable too. The problem is that
if you run a line such as y = 1, 2, 3, you will get an error. To assign all three
numbers to y, we need to use the c() function, which concatenates or combines
different elements. If you run y = c(1, 2, 3) (line 4 in code block 1), you will
now see that a new variable (y) has been added to your environment (pane C).
As a result, if you simply type y in your script and run the line, R will print 1
2 3. As always, you can also type y in the console and hit Enter. Our new
object, y, is a vector, which we’ve just created with the c() function.
Vectors are extremely important in R, as they allow us to hold however
many values we need in a single variable. What if we want to have a vector
that contains different words as opposed to numbers? Let’s create a new vari-
able, myWords, and assign three values to it: English, Spanish, French.
Because these are words, we need quotation marks around each one, so you
need to type myWords = c("English", "Spanish", "French")—this is demon-
strated in line 6 in code block 1.
myWords is a vector that contains strings, or characters, as opposed to y,
which contains only numbers. myWords also has a better name than y, as it
conveys what the contents of the variable are—you should always try to have
intuitive names for variables (x, for example, is not intuitive at all). Variable
names cannot start with numbers, have spaces, or have special symbols that
may already have a meaning in R—a hashtag, for example. But you can still
use underscores and periods and, crucially, lower- and uppercase letters
(variable names are case-sensitive). myWords uses camelcase, where beginnings
of non-initial words have capital letters instead of spaces. This is generally a
good way to have short, intuitive, and easy-to-read variable names. Finally,
you could rename y by reassigning it to a new variable: myNumbers = y.
Now you have both variables in your environment, but you can forget about
y and only use myNumbers instead (line 11 in code block 1).
Vectors have an important requirement: all the elements they contain must
belong to the same class. As a result, you can have a vector with just numbers
(like x) or a vector with just strings (myWords), but you cannot have a vector
that contains numbers and strings at the same time. If you try doing that
(mixedVector in line 8 in code block 1), R will force your numbers to be
strings by adding quotes around the numbers (“1”, “2”, “English”). It does
that because it cannot force strings to be numbers.
Now that we know what vectors are, we need to understand how we can
access their contents. Naturally, you could simply type myWords and run it,
and RStudio would print “English” “Spanish” “French” in your console.
But what if you wanted to access just the first element in the vector? You
can easily do that using what we call slice notation. To access the first
element in myWords, simply type myWords[1]. The number within the
square brackets represents the index of the element you want to access. What
if you wanted both the first and the third elements in myWords? You can
treat these two positions (indices) as numbers inside a vector and use that
vector within the square brackets: myWords[c(1, 3)]. If you wanted the
second and third elements, you could type myWords[c(2, 3)].
Besides accessing elements in a vector, we can also use different functions to
ask a wide range of questions about the contents of said vector—lines 18–25 in
code block 1. For example, length(myWords) will return 3, which is the
number of items the vector contains. str(myWords) will return the class of
the vector—I return to this function in §2.3. Because myWords contains
strings, it is a character vector, abbreviated as chr in R if you run line 19 in code
block 1. If you were to run str(myNumbers), you would get num, because
all the members in myNumbers belong to the class numeric. Since we’re
dealing with a single vector, the same result can be achieved by running class(). Finally, another useful function is summary(), which will return different
outputs depending on the object you wish to analyze. For the vectors in ques-
tion, it will tell us about the length of the vector as well as its class. Later, when
we explore more complex objects, summary() will provide more information.
In code block 1, we also see the function rev(), which reverses a vector. If
myNumbers contains 1 2 3, then rev(myNumbers) will print 3 2 1. This also
works with myWords, so running line 22 in code block 1 will give us French
Spanish English. In line 23 in code block 1, we see rep(myWords, times = 2).
Can you guess what this function is doing? It repeats the vector n times (here, n
= 2). The output looks exactly like this: [1] “English” “Spanish” “French”
“English” “Spanish” “French”. Notice that all our outputs begin with [1].
That simply tells us that the first word that we see in that row is item
number 1 in the vector. In other words, this is just R helping us visually
count. While this may seem unnecessary for the vectors we are examining
here, it can certainly be helpful when we deal with longer vectors, which
involve multiple lines.
R code
1 x = 5 * 2 # x is now 10
2 x = x + 10 # x is now 20; run line 1 and it will be 10 again
3
4 y = c(1, 2, 3) # a vector that concatenates three numbers
5
6 myWords = c("English", "Spanish", "French")
7
8 mixedVector = c(1, 2, "English")
9
10 # Renaming a variable:
11 myNumbers = y
12
13 # Slice notation:
14 myWords[1] # = "English"
15 myWords[c(1, 3)] # = "English" "French"
16
17 # Some useful functions:
18 length(myWords)
19 str(myWords)
20 class(myWords)
21 summary(myWords)
22 rev(myWords)
23 rep(myWords, times = 2)
24 mean(myNumbers)
25 sd(myNumbers)
The last two functions in code block 1 are mean() and sd(), for calculating
the mean and the standard deviation of a numeric vector. Of course, these
functions cannot be computed for character vectors such as myWords.
We have just explored vectors that contain two classes of objects, namely,
character (strings) and numeric (numbers). But these are not the only
classes of objects we will encounter. For example, we have the logical class.
As an example, let’s say you want to find out whether 5 is greater than 2.
You could type 5 > 2 and run it in R, and the answer (output) would be
TRUE (note that all letters are capitalized). Likewise, if you type 5 == 5,
the answer is TRUE: 5 equals 5. If you type 5 %in% c(1,3,5), the answer is
also TRUE—you can read the %in% operator as “is present in”. Both
TRUE and FALSE belong to the logical class, and so does NA (“Not Avail-
able”, i.e., a missing value). Thus, we could add a vector to our collection
of variables which contains elements from the logical class: myLogicals =
c(TRUE, TRUE, FALSE, TRUE, NA). As we explore the chapters in this
book, we will come across these (and other) classes, and you will see how
they can be helpful in a typical analysis.
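In code form (you can add these lines to rBasics.R as well):
R code
5 > 2                 # TRUE: 5 is greater than 2
5 == 5                # TRUE: note the double equal sign for comparisons
5 %in% c(1, 3, 5)     # TRUE: 5 "is present in" the vector

myLogicals = c(TRUE, TRUE, FALSE, TRUE, NA)
class(myLogicals)     # "logical"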
2.2.3.3 Lists
Thus far we have discussed vectors that contain numbers, strings, and logical
values. As mentioned earlier, however, a vector cannot contain different
classes of objects in it. Lists, on the other hand, can. As a result, we could
have a list with all three classes we have examined earlier. To create a list,
we use the list() function. Let’s create a list with different classes of objects
in it, and let’s call it myList1—see line 2 in code block 2 (remember to add
this code block to rBasics.R). myList1 contains three numbers (1, 2, 3),
three strings ("English", "Spanish", "French"), and three logicals (TRUE, NA, FALSE).
Notice that the numbers in the list are the same numbers in myNumbers
and the strings in the list are the same strings in myWords. We could also
create a list by typing list(myNumbers, myWords, c(TRUE, NA, FALSE))—
see line 5 in code block 2, where this list is assigned to a new variable,
myList2. These are seemingly identical lists, but if you run both of them and
then call them separately, you will see that they are in fact structurally different.
In myList1, items were manually and individually added to the list, whereas in
myList2, all items are grouped together in separate vectors. As a result, if you
run summary() on both lists (lines 7 and 8 in code block 2), you will see that
myList1 has nine entries, each of which contains a single item (Length = 1).
myList2, on the other hand, has only three entries, each of which contains
three items (Length = 3). To actually see what the lists look like, run lines
10 and 11 in code block 2.
R code
1 # Each item is manually added to the list:
2 myList1 = list(1, 2, 3, "English", "Spanish", "French", TRUE, NA, FALSE)
3
4 # We use existing vectors to fill the list:
5 myList2 = list(myNumbers, myWords, c(TRUE, NA, FALSE))
6
7 summary(myList1)
8 summary(myList2)
9
10 myList1 # run this line to see what myList1 looks like
11 myList2 # then run this line: do you see the difference?
12
13 # Slice notation in lists:
14 myList2[[2]][3] # second entry, third item (= third item in myWords)
15
16 # Assign names to list entries:
17 names(myList2) = c("Numbers", "Languages", "Logicals")
18
19 # Now we can access all languages using our new names:
20 myList2[["Languages"]] # As opposed to myList2[[2]]
21 myList2$Languages # The same can be achieved with "$"
import it into R. That being said, data frames can also be useful when we want
to create new data to explore the predictions of a statistical model—we will do
this later on in Part III in this book. Let’s take a look at a simple example that
builds on the vectors we have already created. Here, we will make a data frame
from myList2.
To create a data frame in R, we use the data.frame() command, as shown in
code block 3—remember to add this code block to rBasics.R. We then add
column names and contents (every column in a data frame must have the same
number of rows)—you can choose any name you want, but they must not start
with special symbols, and they should not have spaces in them. Alternatively,
because we want to have a data frame that has the exact content of myList2,
we can use the as.data.frame() function (line 7 in code block 3)—but first
we need to give our list entries names (line 6). We have already defined
myNumbers and myWords in the same script (check to see whether these
two objects/variables are in your Environment pane in RStudio). If you call
(i.e., run) a variable, say, ABC, which no longer exists, you will get an error
along the lines of Error: object ‘ABC’ not found. To avoid that, make sure
you are still using the same script (rBasics.R) and that you have not closed
RStudio in the meantime. If you have closed it, then rerun the lines of code
where the variables are assigned and everything should work.
Our new variable, df, created in lines 1–3 or (6–)7 of code block 3, is a data frame
that contains three columns. Each column has a different class of object, and each
column has a name. We can now use a number of functions to better understand
our data frame. For example, you may remember str() and summary() from code
block 1. Both functions are very useful: str() will tell us the class of each column. If
you run str(df), you will see all three columns preceded by $. We have already used
dollar signs with lists to access different entries (as an alternative to using double
square brackets). Data frames work the same way: always remember that if you
want to access a column within a data frame, you first need to type the name of
the data frame, then a dollar sign, and then the name of the column you want
to access.
In df, column 1 is called Numbers and is a numeric column/variable (num);
column 2, called Languages, is a character column (chr) with three different
values in it; and column 3, called Logicals, contains values from the logical
class (logi). Thus, if you wanted to print the Languages column on your
screen, you would run df$Languages. Likewise, if you wanted to “extract”
the column in question and assign it to a new variable, you would run
newVariable = df$Languages—here we are simply copying (not removing)
the column from df.
We can also use slice notation with data frames. For example, if you
wanted to access the content of the Languages column, besides typing
df$Languages, you could also type df[,2]. Because data frames have two
dimensions (rows and columns), we have to specify both of them inside the square brackets, separated by a comma: df[,2] means “all rows, second column”.
R code
1 df = data.frame(Numbers = myNumbers,
2 Languages = myWords,
3 Logicals = c(TRUE, NA, FALSE))
4
5 # Faster/simpler way in this case:
6 names(myList2) = c("Numbers", "Languages", "Logicals") # name list entries
7 df = as.data.frame(myList2) # create data frame from list
8
9 # Check class of each column:
10 str(df)
11
12 # Bird's-eye view of data frame:
13 summary(df)
14
15 # Calculate the mean, median, and standard deviation of Numbers column:
16 mean(df$Numbers) # slice notation: mean(df[, "Numbers"]) or mean(df[,1])
17 median(df$Numbers) # slice notation: median(df[, "Numbers"]) or median(df[,1])
18 sd(df$Numbers) # slice notation: sd(df[, "Numbers"]) or sd(df[,1])
19
20 # Visualizing data frame:
21 head(df, n = 2) # top 2 rows
22 tail(df, n = 2) # bottom 2 rows
23
24 # Exporting a data frame as a csv file:
25 write.csv(df, file = "df.csv",
26 row.names = FALSE,
27 quote = FALSE)
One very important function that we have not discussed yet is head(). If we
run head(df), R will print the first six rows in df (this is the default number of
rows printed). Because df only has three rows, all three will be printed—line 21
explicitly adds the argument n = 2 to head(), so it will only print the top two
rows. Another function, tail(), does exactly the same thing starting from the
bottom of the data frame—see line 22 in code block 3. tail() will print the
bottom rows of our data here (again, the default is n = 6).
Both head() and tail() are helpful because we will rarely want to print an
entire data frame on our screen. To some people, this is likely one of the
most striking interface differences between software like Excel and R. In R,
we do not stare at our dataset at all times. Doing so is hardly ever informative,
since some data frames will have too many rows and/or too many columns.
Ideally, you want to know (i) what columns/variables you have as well as
their classes (str()) and (ii) what your data frame looks like (head()). If you
do want to have a spreadsheet-like view of your data frame, you can click
on your data frame object in the Environment pane. You can also use the
View() function (upper case V), which will open a new tab in pane A and
show you the entire spreadsheet.
Finally, you may want to export the data frame we have just created as a csv
file. To do that, use the write.csv() function. In lines 25–27 in code block 3,
note that the function has four arguments: first, we tell write.csv() what
object (i.e., variable) we want to save. Next, we give the file a name (here,
we want to save it as df.csv). Third, we use row.names = FALSE to tell R
that we do not want an additional column with line numbers in our file.
Lastly, we use quote = FALSE because we do not want R to use quotes
around values (by default, quote = TRUE, which means every value in
every row and column will be surrounded by quotes if you open the file in
a text editor). The file will be saved in your current working directory—see
the glossary. Later, we will see more concise options to export a data frame
as a csv file.
Data frames are crucial, and chances are you will use them every single time
you analyze your data in R. Thus, understanding how they work is key. In
§2.5, we will explore another type of data structure, tibbles, which are very
similar to data frames. Tibbles make our lives a bit easier, so they will be our
focus throughout the book. Don’t worry—everything we just discussed is
transferrable to tibbles. Remember that most of the time you will not create
a data frame within R. Rather, you will load your data into R, and the
dataset will be interpreted as a data frame (or as a tibble, as we’ll see in §2.5).
always the case—see §2.4.1. Most of the time, your dataset is a spreadsheet
somewhere on your computer. Maybe you have an Excel file somewhere,
and that is the file you wish to analyze. In this section, we will see how to
load your file into R so you can start analyzing your data.
Before we actually import your data into R, however, we will discuss two
important components of quantitative data analysis. First, we will check
whether your data file is actually ready to be imported. Second, we will
explore a powerful tool in RStudio that will help you keep all your projects
organized.
DATA FILE
We will use a simple csv file to practice reading your data into R. Make sure
you have downloaded sampleData.csv so you can follow every step.
2.4.2 R Projects
Whether you use SPSS or R, every research project that we develop has a
number of files. Examples include folders for papers, reading materials,
abstracts, and data files. Hopefully, all these folders are located in a single folder
that gathers all the files that are related to a given research project. File organi-
zation is a good habit to cultivate, and RStudio offers us an incredibly handy
tool for that: a file extension called .Rproj.
To understand what R Projects are, follow these steps. In RStudio, go to
File > New Project…. You will then have some options, two of which are
New Directory and Existing Directory. As the names suggest, you should
pick the former if you don’t have a folder for a project yet and the latter in
case you already have a folder where you want to place your data analysis
files. We already created a directory earlier called basics, and that’s where
we will save our R Project. Therefore, choose Existing Directory and click
on browse to locate the basics folder. Finally, click on Create Project. Your
project will inherit the same name as the directory in which you create it, so
it will be called basics.Rproj. We will use this R Project for all the coding
in the remainder of this chapter.
Once you have created your R Project, you will notice that RStudio
will reappear on your screen. Only three panes will be visible (no script is
open), so you can see your console, your environment, and pane D (from
Fig. 2.1), where your Files tab is located. In that tab, you can see the contents
of your newly created directory, where your R Project is located—you should
be able to see only one file in the directory: basics.Rproj. You can confirm that
this is the only file in the folder if you open that folder on your computer.
An Rproj file has no content in and of itself. It only exists to “anchor” your
project to a given directory. Therefore, you could have multiple R Projects
open at the same time, each of which would be self-contained in a separate
RStudio session, so you would end up with multiple RStudios open
on your computer. Each project would know exactly what directory to
point to—that is another advantage of working with projects as opposed to
single scripts. You do not necessarily need to use R Projects, but they can cer-
tainly help you manage all the files in your project. This book will use R Proj-
ects several times, and you’re encouraged to do the same (your future self will
thank you). I return to this point in chapter 3 (e.g., Fig. 3.1). Finally, you can
place your rBasics.R file (created earlier for code blocks 1, 2, and 3) in the same
directory as basics.Rproj, so there will be two files in the directory—you can
delete df.csv, created in code block 3, since we won’t use that file anymore.
The file sampleData.csv simulates a very simple study design. For example,
we could be examining the impact of two pedagog-
ical approaches (target and control) on students’ learning (as measured by test
scores). We will only use sampleData.csv to practice importing files into R—
in later chapters we will examine more realistic hypothetical data.
Place the file sampleData.csv (§2.4.1) in the directory where your .Rproj
file is, which means your directory basics will now have three files (four if
you count df, created in lines 23–25 in code block 3): one .R script
(rBasics.R), one .csv, and one .Rproj. Next, start a new script by clicking
on File > New File > R Script (the same steps from §2.2.2), or press Cmd
+ Shift + N to achieve the same result. Save your new script as dataImport.
R, so that the file name is self-explanatory. You should now have four files
in your directory.
There are several options to import sampleData.csv into R. One option is
to use the function read.csv()—you may remember that we used write.csv()
in code block 3 to export our data frame.13 In your script (dataImport.R),
write read.csv(“sampleData.csv”) and run the line to see what happens.
You will notice that the entire dataset is printed in your console. But we
want to assign our data to a variable, so that we can analyze it later. Let’s
name our variable ch2.
When you run ch2 = read.csv(“sampleData.csv”), R will do two things:
first, import the data file; second, assign it to a variable named ch2. As a result,
even though the dataset is not printed in the console, a variable has been added
to your environment. This is exactly what we want. Imagine reading a dataset
with 1,000 rows and having the entire dataset printed in your console (!). Being
able to see an entire dataset is only useful if the dataset is small enough (and that
is almost never the case). Notice that ch2 is not a file—it’s a variable inside
RStudio. In other words, ch2 is a “virtual copy” of our data file; if we
change it, it will not affect sampleData.csv. As a result, the actual data file
will be safe unless we manually overwrite it by saving ch2 using write.csv
(ch2, file = “sampleData.csv”), for example.
Once our variable ch2 has been created, we can use different functions to
explore it. If you go back to code block 3, you will see a number of functions
that we can now use to examine ch2. For example, we can run summary(ch2)
to have a sense of what values each column contains as well as some basic sta-
tistics for numeric columns. We could also run str(ch2) to see the class of each
variable (column). If you run it, you will notice that we have two columns that
are chr: participant, which contains ten unique values (subject_1, subject_2,
etc.), and group, which contains two unique values (control, target). Thus, we
have ten participants and two groups in the data. We also have three num var-
iables: testA, testB, and testC. Let’s suppose that these are equivalent to a pre-
test, a post-test, and a delayed post-test, for example.
Naturally, you could apply functions directly to specific columns. Recall that
in a data frame, every column is a vector that can be accessed using a dollar sign
R code
1 # Import data file and assign it a variable: ch2
2 ch2 = read.csv("sampleData.csv")
3
4 # Summarizing the data:
5 summary(ch2)
6
7 # Checking variable classes:
8 str(ch2)
9
10 # Visualizing data frame:
11 head(ch2, n = 3)
12 # View(ch2) # opens new tab in pane A with dataset (remove first "#" to run this line)
or slice notation. For example, if you wanted to calculate the mean of testA,
you could run mean(ch2$testA). Finally, we can visualize the top and
bottom rows of our data frame by using head() and tail(). By default, these
functions will print six rows of data, but you can change that (see line 11 in
code block 4).
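To make the dollar sign and slice notation concrete, here is a short sketch
using functions we have already discussed (you could add these lines to
dataImport.R):

ch2$testA           # The testA column as a vector (dollar sign)
ch2[1:3, "testA"]   # Slice notation: rows 1-3 of column testA
ch2[2, ]            # Slice notation: the entire second row
mean(ch2$testA)     # Mean score in testA
tail(ch2, n = 3)    # Bottom three rows of the data frame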
tidyverse, which we will use extensively, is not a single package but rather a
collection of packages. Don’t worry: by the end of the book, you will certainly
be very familiar with tidyverse. Finally, you may recall that data tables were
mentioned earlier (§2.3). If you’d like to use data tables instead of data
frames (e.g., because you have too much data to process), you should definitely
check the tidytable package (Fairbanks 2020). This package offers the speed of
data tables with the convenience of tidyverse syntax, so you don’t have to learn
anything new.
To install tidyverse, we will use the function install.packages().15 During
the installation, you might have to press “y” in your console. Once the instal-
lation is done, we need to load the package using the function library(). The
top of your script (dataPrep.R) should look like code block 5. Technically,
these lines of code don’t need to be at the top of the document; they must,
however, be placed before any other lines that require them—overall, it is
best to source, install, and load packages in the preambles of files. Finally,
once a package is installed, you can delete the line that installs it (or add a
hashtag to comment it out)16—this will avoid rerunning the line and reinstall-
ing the package by accident. We are now ready to use tidyverse.
When you install and load tidyverse, you will notice that this package is
actually a group of packages. One of the packages inside tidyverse is dplyr
(Wickham et al. 2020), which is used to manipulate data; another is called
tidyr (Wickham and Henry 2019), which helps us create organized data;
another package is called ggplot2, which is used to create figures. We will
explore these packages later—you don’t need to load them individually if
you load tidyverse.
R code
1 # Script preamble: where you source scripts and load packages
2 source("dataImport.R") # Runs everything in dataImport.R
3 install.packages("tidyverse") # Comment this out once package is installed
4 library(tidyverse) # Loads package
Throughout this book, we will rely on the concept of tidy data (Wickham
et al. 2014). Simply put, a tidy dataset is a table where every variable forms a
column and each observation forms a row. Visualize ch2 again by running
head(ch2)—shown in Table 2.1. Note that we have three columns with test
scores, which means our data is not tidy. This is not ideal because if we
wanted to create a figure with “Test” on the x-axis and “Score” on the y-
axis, we would run into problems. A typical axis contains information from
one variable, that is, one column, but “Test” depends on three separate
columns at the moment. We need to convert our table from a wide format
to a long format. Wide-to-long transformations are very common, especially
because many survey tools (e.g., Google Forms) will produce outputs in a
wide format.
The data frame we want has a column called test and another column called
score—shown in Table 2.2. The test column will hold three possible values,
testA, testB, and testC; the score column will be a numeric variable that
holds all the scores from all three tests. Let’s do that using tidyverse, more spe-
cifically, a function called pivot_longer(). The discussion that follows will
include code block 6, which you should place in dataPrep.R—see Table
D.1 in Appendix D. You don’t need to skip ahead to the code block yet;
we will get there shortly.
R code
1 # From wide to long using pivot_longer():
2 long = ch2 %>%
3 pivot_longer(c(testA, testB, testC),
4 names_to = "test",
5 values_to = "score")
6
7 # From wide to long using gather():
8 long = ch2 %>%
9 gather(key = test,
10 value = score,
11 testA:testC)
12
13 head(long, n = 3) # Check result to see "test" and "score" columns
14
15 # From long to wide using pivot_wider():
16 wide = long %>%
17 pivot_wider(names_from = test,
18 values_from = score)
19
20 # From long to wide using spread():
21 wide = long %>%
22 spread(test, score)
23
24 head(wide, n = 3) # Equivalent to ch2 (original data)
R code
1 head(ch2, n = 3) # BEFORE---output printed below:
2 # participant group testA testB testC
3 # 1 subject_1 control 4.4 4.9 8.4
4 # 2 subject_2 control 9.7 9.4 4.4
5 # 3 subject_3 control 5.1 7.9 7.9
6
7 head(long, n = 3) # AFTER---output printed below:
8 # participant group test score
9 # 1 subject_1 control testA 4.4
10 # 2 subject_2 control testA 9.7
11 # 3 subject_3 control testA 5.1
CODE BLOCK 7 Data Frame before and after Wide-to-Long Transformation Using
tidyverse
Function     Description
arrange()    Orders data by a given variable
filter()     Filters data (creates a subset, i.e., removes rows)
group_by()   Groups data based on a specific variable
mutate()     Creates a new column
select()     Selects columns in data (i.e., includes/removes columns)
summarize()  Summarizes data (typically used with group_by())
R code
1 # Mean scores by group:
2 long %>%
3 group_by(group) %>%
4 summarize(meanScore = mean(score)) %>%
5 arrange(desc(meanScore))
6
7 # Mean scores by participant:
8 long %>%
9 group_by(participant) %>%
10 summarize(meanScore = mean(score)) %>%
11 arrange(desc(meanScore))
12
13 # Removing (filtering out) controls:
14 targets = long %>%
15 filter(group == "target") %>%
16 droplevels()
17
18 # Removing column "group":
19 targets = long %>%
20 filter(group == "target") %>%
21 select(-group)
22
23 # Long-to-wide + new column (using pivot_wider()):
24 testDiff = long %>%
25 pivot_wider(names_from = test,
26 values_from = score) %>%
27 mutate(difference = testC - testA) # Final score minus initial score
When printed, tibbles show only the top rows and as many columns as fit on
your screen (and the tidyverse function glimpse() will transpose the dataset
and print all columns as rows). Data frames, on the other
hand, will print everything, which is not very useful (that’s why we often use
head()). There are other small differences between these two objects, but
they do not matter for now—just remember that even if you start out with a
data frame, the output of summarize() in tidyverse will be a tibble. Through-
out the book we will use tibbles, but you can treat tibbles and data frames as
synonyms.
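If you want to see the difference in printing behavior for yourself, here is a
minimal sketch—the data below is invented purely for illustration (as_tibble()
comes with tidyverse):

library(tidyverse)
bigDf = data.frame(x = 1:100, y = rnorm(100))  # A 100-row data frame
bigTb = as_tibble(bigDf)                       # The same data as a tibble
bigDf  # Prints all 100 rows
bigTb  # Prints only the top 10 rows (plus each column's class)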
Next, let’s see how we can remove certain rows, that is, filter our data. In
lines 14–16 of code block 8, we are choosing to keep only participants in the
target group in our data. We do that with filter(group == “target”), which
is equivalent to filter(group != “control”): == means “is the same as”; !=
means “is different from”. The other function used in lines 14–16 is
droplevels(). Here is why we use it: remember that group is a factor with
two levels, control and target.
participants, we still have two levels, but one of them (control) has no data
points. Because of that, we may want to drop empty levels—this is like asking
R to forget that there ever were two groups in the data. That’s what droplevels()
does—and that’s why it must come after we filter the data (order matters).
Instead of dropping the levels of group, we could actually remove the
column from our data. If every participant now is in the target group, then
we do not care about group anymore (it’s redundant information). That’s
what lines 19–21 are doing in code block 8 with select(-group) (the minus
sign subtracts the column from the tibble). If we also wanted to remove the
participant column (for some reason), we would type select(-c(participant,
group)). We need to concatenate them with c() because we now have more
than one item in there.
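Putting that together, here is a sketch of the multi-column version (this is
not part of code block 8):

# Removing two columns at once with -c():
targets = long %>%
  filter(group == "target") %>%
  select(-c(participant, group))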
Finally, let’s see how to perform a more complex set of tasks. In lines 24–27
of code block 8, we are first transforming the test and score columns to a wide
format (long-to-wide transformation)—see code block 6. Then, in line 27, we
are creating a new column (mutate()) called difference, which will hold the
difference (i.e., progress) between testC and testA. Notice that line 27 only
works because its input is the output of pivot_wider(), so the actual input
here is a tibble with testA, testB, and testC as separate columns again
(much like ch2)—run testDiff to see what it looks like.
Being able to perform multiple actions using %>% can be incredibly useful,
as we will see throughout this book. Crucially, we now have a variable in our
script (testDiff) that contains a column with the difference between the two
tests. As a result, you could use that particular column to create figures later on.
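For instance, one possible sketch of such a figure is given below—ggplot2 is
only introduced in §2.6, so feel free to come back to these lines after reading
that section:

# A box plot of improvement (testC - testA) by group:
ggplot(data = testDiff, aes(x = group, y = difference)) +
  geom_boxplot() +
  theme_classic()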
2.6 Figures
Once we have imported our data file and prepared our data for analysis, we can
start generating some plots. Plots can help us visualize patterns of interest in the
data before any statistical analysis. Indeed, we can often decide what goes into a
statistical analysis by carefully examining different plots.
Before we generate our first plot, let’s first create another script and save it as
eda.R (exploratory data analysis).19 This will be our third script (fourth,
if you count rBasics.R) in the R Project created earlier (dataImport.R,
dataPrep.R, eda.R)—see Appendix D. At the top of the script, we will
source dataPrep.R, the previous step in our workflow. Remember, by using
source(), we don’t actually need to open other scripts—just use “”
around the file name before running source(). In fact, when you start typing
the name of the script you want to source within source(), RStudio will
show you a list of files that match what you are typing.20 The next time you
open your R Project, you can go directly to eda.R and run the line that
sources dataPrep.R. You no longer have to run lines to import and prepare
your data since those are run automatically when you source the scripts that
contain those lines. Finally, because you have loaded tidyverse in dataPrep.R,
you don’t need to load it again in your eda.R.
In this section, we will generate a simple plot using one of the packages
inside tidyverse called ggplot2. There are several packages that focus on data
visualization in R, and you may later decide that you prefer another package
over ggplot2—an incomplete list of relevant packages can be found at
http://cloud.r-project.org/web/views/Graphics.html. However, ggplot2 is
likely the most comprehensive and powerful package for plots out there. R
also has its own base plotting system (which doesn’t require any additional
packages). Because every package will have some learning curve, the key is
to select one package for data visualization and learn all you need to know
about it. In this book, we will use ggplot2.
FIGURE 2.3 Bar Plot with Standard Error Bars Using ggplot2
R code
1 source("dataPrep.R") # This will load all the steps from earlier
2
3 # Your first ggplot:
4 ggplot(data = long, aes(x = group, y = score)) +
5 stat_summary(geom = "bar", # this will add the bars
6 alpha = 0.3,
7 color = "black",
8 width = 0.5) +
9 stat_summary() + # this will add the error bars
10 labs(y = "Mean score",
11 x = "Group") +
12 theme_classic()
If we were to start a new R session, running line 1 of code block 9 (which
sources dataPrep.R) would import the data, load the necessary packages, and
prepare the data—and we’d be ready to go. This
automates the whole process of analyzing our data by splitting the task into sep-
arate scripts/components (which we created earlier). Chances are we won’t
even remember what the previous tasks are a month from now, but we can
always reopen those scripts and check them out.
In line 4, we have our first layer, where we point ggplot2 to our data var-
iable (long), and indicate what we want to have on our axes (aes(x = group, y
= score)). The aes() argument is the aesthetics of our plot—it has nothing to do
with the actual formatting (looks) of the plot, but with the contents of the axes.
In line 5, we have our second layer—to improve readability, note that we can
break lines after each +. We use a plus sign to tell ggplot2 that we’re adding
another layer to the plot (i.e., we are not done yet).
Line 5 uses a very important function: stat_summary(). This function is
great because it provides a combination of data visualization and basic statistics.
Inside stat_summary(), we have geom = “bar”, which is simply telling
ggplot2 that we want bars; alpha = 0.3, which is an optional argument, is
adding some transparency to our bars (by default, bars are filled with solid
dark gray). Transparency goes from alpha = 0 (transparent) to alpha = 1
(solid). The last two arguments of stat_summary() are color = “black” and
width = 0.5, both of which are optional; they define the color of the
borders of our bars and their widths. We then have another stat_summary()
in line 9. This time, because we are not specifying that we want bars,
ggplot2 will assume its default value, which is a point range (a dot for the
mean and a line representing the standard error). Notice that we can add mul-
tiple layers of code to build a highly customizable figure.
Finally, the last two layers in our figure are labs() and theme_classic(). The
former lets us adjust the labels in a figure, that is, rename the axes’ labels in the
figure. The latter is applying a theme to our figure; theme_classic() optimizes
the figure for publication by removing the default light gray background, for
example. Try running the code without that line to see what it looks like,
but remember to also delete the + at the end of the previous line.
Note that we never actually told ggplot2 that we wanted to add standard
errors to our bars. Instead, we simply typed stat_summary(). This is because
ggplot2 (and R, more generally) will assume default values for certain argu-
ments. One example is the function head(), which will display the first six
rows of your data unless you explicitly give it another number (e.g., head
(ch2, n = 10)). Like head(), stat_summary() assumes that you want the
mean and its standard error displayed as a point range. If you want something
else, you have to explicitly indicate that within the function—that’s what
line 5 does in code block 9. When we run stat_summary(), what ggplot2 is
actually interpreting is stat_summary(fun.data = mean_se, geom = “pointrange”).
You can also adjust text size within ggplot()—an option we will explore later in the book (in
chapter 5). In later chapters, code blocks that generate plots will have a
ggsave() line, so you can easily save the plot.
As mentioned earlier, you can run the ggsave() line after running the lines
that generate the actual plot (you may have already noticed that when you
press Cmd + Enter, RStudio takes you to the next line of code, so you can
press Cmd + Enter again). Alternatively, you can select all the lines that gen-
erate the plot plus the line containing ggsave() and run all of them together.
Either way, you will now have a file named plot.jpg in the figures directory
(folder) of your R Project.24
To learn more about ggsave(), run ?ggsave()—the Help tab will show you
the documentation for the function in pane D. Formats such as pdf or png are
also accepted by ggsave()—check device in the documentation. We will
discuss figures at length throughout the book, starting in §2.6.2.
Finally, you can also save a plot by clicking on Export in the Plots tab in pane
D and then choosing whether you prefer to save it as an image or as a PDF
file. This is a user-friendly way of saving your plot, but there are two
caveats. First, the default dimensions of your figure may depend on your
screen, so different people using your script may end up with a slightly different
figure. Second, because this method involves clicking around, it’s not easily
reproducible, since there are no lines of code in your script responsible for
saving your figure.
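For that reason, a single explicit line in your script is the safer habit—a
sketch is shown below (the dimensions are illustrative, not prescriptive):

# Reproducible saving: dimensions and resolution are fixed in code
ggsave(file = "figures/plot.jpg", width = 6, height = 4, dpi = 300)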
Take the famous bar plot that only shows mean values and nothing else.
Most of the time, you don’t need a figure to communicate only means, natu-
rally. If you plan to use bar plots, you should always add error bars (if they make
sense) and ideally have at least one more dimension to visualize. In a study that
explores different teaching methods and their impact on learning English as a
second language (e.g., teaching in-person vs. online), you will likely have
more than one native language in your pool of participants (by design).
Your bar plot could then have along its x-axis the languages in question, and
the fill of the bars could represent the teaching methods of interest—see
Fig. 2.4.
In Fig. 2.4, the y-axis shows the change in score (e.g., pre- and post-test)
of the hypothetical study mentioned
earlier; the x-axis shows the different native languages of the participants in
said study; and the fill of the bars discriminates the two methods under exam-
ination. Note that the font sizes are appropriate (not too small, not too large).
In addition, the key is positioned at the top of the figure, not at the side. This
allows us to better use the space (i.e., increase the horizontal size of the figure
itself). No colors are needed here, since the fill of the bars represents a factor
with only two levels (“In-person” and “Online”). Finally, the bars represent
the mean improvement, but the plot also shows standard errors from the
mean. If the plot only showed means, it would be considerably less informa-
tive, as we would know nothing about how certain we are about the means in
question (i.e., we would not incorporate information about the variance in the
data).
The take-home message from Fig. 2.4 is that an effective figure doesn’t
need to be fancy or colorful. The reader, in this case, is not distracted by
small font sizes or too many colors that don’t represent any specific values
in the data. Instead, bars are grouped together in an intuitive way: each
language group contains one bar for in-person classes and one bar for
online classes, which allows the reader to easily compare the effects of the
methods in question for each language group. Imagine now that the x-axis
represented instead the teaching methods and the fill color represented the
different native languages. In that case, the reader would need to look across
categories on the x-axis for each language to visually estimate the effect of
the main factor (method).
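Here is a hedged sketch of how a figure like Fig. 2.4 could be coded. The
tibble d and its columns (L1, Method, Improvement) are invented for
illustration—they are not part of our files:

# Hypothetical data: three L1s, two teaching methods, improvement scores
d = tibble(L1 = rep(c("German", "Italian", "Japanese"), each = 20),
           Method = rep(c("In-person", "Online"), times = 30),
           Improvement = rnorm(60, mean = 2, sd = 1))

ggplot(data = d, aes(x = L1, y = Improvement, fill = Method)) +
  stat_summary(geom = "bar", position = position_dodge(width = 0.9),
               alpha = 0.3, color = "black") +  # Bars show the means
  stat_summary(position = position_dodge(width = 0.9)) +  # Error bars
  labs(y = "Mean improvement") +
  theme_classic() +
  theme(legend.position = "top")  # Key at the top, as in Fig. 2.4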
Given the discussion in §2.6.2.1, Fig. 2.4 implies that the statistical
analysis will assess the effects of native language and method on learners’
improvement. On top of that, the figure suggests that teaching method
matters (in-person being more effective than online teaching in this hypothet-
ical example) and that its effects are relatively similar across all three languages
in question. All these pieces of information will help the reader understand
what is likely going on before the actual statistical analysis is presented and
discussed.
Both t-tests and ANOVAs are used when you have a continuous and nor-
mally distributed (Gaussian) dependent variable (score in long). A t-test requires
one independent26 variable (e.g., group in long), while an ANOVA can have
multiple variables. We will later refer to these as the response variable and the
predictor or explanatory variable, respectively, to better align our terminology
with the statistical techniques we will employ.
Our tibble, long, has a continuous response variable (score). However, our
scores do not follow a normal distribution (an assumption in parametric tests).
You can see that by plotting a histogram in R: ggplot(data = long, aes
(x = score)) + geom_histogram()—we will go over histograms in
more detail in chapter 3, more specifically in §3.3. Clearly, we cannot assume
normality in this case.27 Here, however, we will—for illustrative purposes
only. R, much like any other statistical tool, will run our test and will not
throw any errors: it is up to us to realize that the test is not appropriate
given the data.
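If you would like a numeric check to complement the histogram, one common
option (not used elsewhere in this book, and to be interpreted with care) is a
Shapiro–Wilk test:

shapiro.test(long$score)  # A low p-value suggests a departure from normality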
To run a t-test in R, we use the t.test() function—code block 10 (line 4).
This will run a Welch two-sample t-test—which, by default, is two-tailed (i.e., is
non-directional) and assumes unequal variances (i.e., it does not assume that
the two groups have the same variance).28 If we have a directional hypothesis,
we can run a one-tailed t-test by adding alternative = “greater” if our
assumption is that the difference in means is greater than zero. We could also
use alternative = “less” if we assume the difference is less than zero. By default, t-tests assume
alternative = “two.sided” (i.e., two-tailed).
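For example, a directional version of the test in line 4 of code block 10
could look like the sketch below. Note that the direction refers to the first
factor level minus the second—here, control minus target, given alphabetical
ordering:

# One-tailed Welch t-test (is control minus target greater than zero?):
t.test(score ~ group, data = long, alternative = "greater")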
To run an ANOVA, we use the aov() function (code block 10 (line 13)).
Naturally, you can assign these functions to a variable, so the variable will
“save” your output as an object in your environment (see line 16). Like any
other function in R, both of them require certain arguments. In addition,
you must remember a specific syntax to be able to run these functions:
Y ~ X, where Y is our response variable and X is our explanatory variable. In
code block 10, the t-test and ANOVA are run directly. If you examine line
13, you will notice that we are running the ANOVA inside the function
summary(). This is because aov() will not give us the output that we typically
want—run the line without summary() to see the difference.
The output of a typical t-test in R will include a t-value, degrees of freedom,
and a p-value. It will also state the alternative hypothesis (i.e., that the true dif-
ference in means is not equal to zero) and the 95% confidence interval (CI).
Finally, it will show us the means of each group being compared in the data
in question. The output of a typical ANOVA, on the other hand, will not
print a p-value—that’s why summary() is used in code block 10, which pro-
vides not only a p-value but also an F-value. It will, however, print the sum
of squares, the degrees of freedom, and the residual standard error. This
should all be familiar, but don’t worry too much if it isn’t—we won’t be
running t-tests and ANOVAs in this book.
You may have already figured out that we won’t be able to answer important
questions with a simple t-test without making some adjustments to our data.
First of all, if we run a t-test on the scores of the two groups of participants,
we can’t accommodate the three different tests in our analysis. In other words,
we are treating the scores as if they all came from the same type of test. This is
exactly what Fig. 2.3 conveys: it ignores that we have three different tests and
only focuses on the two groups of participants that we have in our study.
Clearly, this is not ideal.
Imagine if we had a column in our data that subtracted the score of the pre-
test from the delayed post-test. This new column would tell us how much
improvement there was between these two tests—positive values would mean
a student improved. The problem, however, is that to create such a column,
we need pre-test and delayed post-test scores to be their own columns, which
means our long tibble will not work. We have two options: go back to ch2,
which is a wide tibble, or long-to-wide transform long. Since ch2 should be
loaded in your environment, we can add our new column to it using the
mutate() function in lines 7–8 in code block 10. This technique simplifies
our data to some extent: we go from two tests to the difference between
them—note that we have not removed any columns from ch2. This, in turn,
allows us to indirectly “compress” two levels of the variable test into a t-test that
already has its explanatory variable defined (group) and therefore wouldn’t
have room for yet another variable (test). Our response variable now contains
information about any potential improvement between testA and testC.
If we wanted to check whether the mean scores are different between tests
without simplifying our data, we could run a one-way ANOVA, which would
then ignore group as a variable. If we instead wanted to consider both group
and test, we would run a two-way ANOVA—both examples are provided
in code block 10.
R code
1 source("eda.R")
2
3 # T-test: t(27.986) = -0.25, p > 0.1
4 t.test(score ~ group, data = long)
5
6 # T-test comparing score differences (C-A) between groups:
7 ch2 = ch2 %>%
8 mutate(AtoC = testC-testA)
9
10 t.test(AtoC ~ group, data = ch2)
11
12 # One-way ANOVA: F(2,27) = 15.8, p < 0.001
13 summary(aov(score ~ test, data = long))
14
15 # Two-way ANOVA:
16 twoWay = aov(score ~ test + group, data = long)
17 summary(twoWay)
18
19 TukeyHSD(twoWay, which = "test")
20 # $test
21 # diff lwr upr p adj
22 # testB-testA 2.58 1.3159753 3.8440247 0.0000802
23 # testC-testA 0.32 -0.9440247 1.5840247 0.8055899
24 # testC-testB -2.26 -3.5240247 -0.9959753 0.0004168
25
26 plot(TukeyHSD(twoWay, which = "test"))
In the output of our two-way ANOVA (summary(twoWay)), each
variable of interest has its own line. Line 19 runs our post-hoc test using
TukeyHSD(). The which argument inside TukeyHSD() exists because we
have two variables of interest here, but only one has more than two levels
(test). In lines 22–24, we see all three pairwise comparisons (testB − testA,
testC − testA, and testC − testB), their differences (diff), the lower (lwr)
and upper (upr) bounds of their 95% CI, and their adjusted p-value (to
reduce the probability of Type I error as a result of multiple comparisons).
Because the output of our post-hoc test has a lot of numbers in it, it may be
better to visualize the comparisons of interest. This brings us to line 26, which
plots the results using the plot() function—note that here we are not using
ggplot2. Line 26 will generate a figure that contains the output shown in
lines 20–24, but unlike line 19, it will not print an output. This is because
plot(), being the first function from left to right, dictates what R will do.29
In the figure generated by line 26 in code block 10, all three comparisons
are shown along the y-axis.30 Two of the three comparisons are significant—
those printed in lines 22 and 24 of code block 10. The x-axis shows the actual
2.10 Summary
In this chapter, we covered a lot. We also created multiple scripts. First, we
created rBasics.R, where we explored some general functions in R. We
then created an R Project (basics.Rproj) as well as the following scripts:
dataImport.R, dataPrep.R, eda.R, and stats.R. These different scripts simulate
the common data analysis steps in a typical study. Always go back to Appendix
D if you need to review which files have been created, their location, and
where code blocks should be placed.
2.11 Exercises
It’s a good idea to create a new script for the exercises here, and to do the same
for the exercises in later chapters. This will make it more convenient to access
them in the future.
R code
1 # Creating a variable to hold five languages:
2 lgs = c(Spanish, English, French, German, Italian)
3
4 # Access second entry in lgs:
5 lgs(2)
6
7 # Create new variable for Romance languages in lgs:
8 rom = lgs[1,3,5]
PROBLEMATIC CODE A
2. Add a new column to long (using mutate()). The column will contain the
scores of all three tests in question, but let’s divide each score by 10 so we
end up with a column where scores range from 0 to 1 (instead of 0 to 10).
Call the new column scoreProp.
3. Using the ch2 dataset, create a summary using the summarize() function
where you have the mean and standard deviation for each of the three tests
by group. Do you need a wide or a long table here?
Notes
1. Python, another computer language, can be faster for some tasks and has a less steep
learning curve relative to R, given its more consistent syntax. However, while
Python is used for data analysis these days, R was designed specifically with that
goal in mind. As a result, R is arguably the most specialized tool for statistical
computing.
2. This analogy may soon become less accurate given the advances in computational
photography.
3. Linux repositories usually have a slightly older version of R, so I’d recommend fol-
lowing the instructions on R’s website.
4. Because line 1 (“R Basics”) has just a comment without any commands to be inter-
preted under it, you will have to select the line first, and then run it to force RStudio
to print the comment in your console. By default, RStudio won’t print a comment
with no commands under it if you just hit Cmd + Enter without selecting the line
first. You can select a line with your keyboard by using the arrow keys to move your
cursor to the beginning or end of a line and then pressing Cmd + Shift + →
or Cmd + Shift + ←, respectively.
5. This operation will return the remainder of a division—cf. integer division, which
discards the remainder of a division and returns the quotient.
6. You can use Cmd + Z in RStudio’s script window (pane A).
7. The line numbers in a code block are not supposed to match the line numbers in
RStudio (each code block starts with line 1). Line numbers are used in code
blocks to facilitate our discussion on the code itself.
8. Another similar class that relates to numbers in R, integer, is often confused with
numeric—the same happens with the double class. Don’t worry: R will handle
these different classes automatically, and for our purposes here we do not need to
differentiate them. Indeed, you can successfully use R for years without even real-
izing that these three separate classes exist.
9. The operator in question is especially useful when you wish to examine only a subset
of the data. We will practice this in the exercises at the end of this chapter.
10. You can also run help(matrix).
11. Notice how indentations are added as we break a line after a comma (e.g., lines 1–3).
This is RStudio helping us organize our code by aligning the different arguments of
a function (data.frame() here). We will see this throughout the book. If you select
your code and press Cmd + I, RStudio will reindent all the lines to keep everything
organized for you.
12. To learn how many arguments a function accepts/requires, you can press Tab after
you have entered the first argument and a comma: write.csv(df,). This will open a
floating menu for you with different possible arguments for the function you are
using. Not all functions will explicitly list all possible arguments, but this will help
you understand what different functions can do.
13. If your region uses commas as decimal separators, you can use read.csv2()—this
function assumes that your columns are separated by semicolons, not commas.
14. Note that we spell ls() as L-S (the first character is not a number).
15. Conversely, you can use the function remove.packages() to uninstall a package. To
cite a package, you can run citation(“name_of_package”) in R.
16. Pressing Cmd + Shift + C will comment a line out.
17. pivot_longer() and pivot_wider() are updated versions of gather() and spread(),
which accomplish the same goals.
18. In code block 8, we are not assigning some of our actions to variables. You may or
may not want to assign certain actions to variables, depending on whether you wish
to analyze them further in your script.
19. Approach promoted by American mathematician John Tukey (1915–2000).
20. Overall, pressing Tab will auto-complete your commands in RStudio.
21. Because this is a hypothetical study, it does not matter what these scores represent.
22. This folder will help us keep our files organized by separating figures from other
files, such as R scripts.
23. To avoid pixelated figures, use at least 300 dpi—this resolution is what most publish-
ers require. If you like to zoom in on your figures, 1000 dpi will be more than
enough.
24. You could also specify another folder, say “~/Desktop/plot.jpg”, or any other path
that works for you—the path in question will save the file on the desktop of a Mac.
25. Note that every time you run the function, the set of numbers generated will be dif-
ferent. If you want the numbers to be replicable, however, you need to be able to gen-
erate the same numbers multiple times. To do that, use the function set.seed(x) right
before you run rnorm()—where x is any number of your choice (it acts like a tag, or
an ID). The same applies to sample(), which is also used here.
26. This is not a very accurate term, as in reality independent variables are rarely
independent.
27. Most studies in second language research do not verify statistical assumptions—see
Hu and Plonsky (2020) for a recent review. Typical solutions include the removal
of outliers and the use of non-parametric tests. Better solutions would be, in
order of complexity, data transformation (e.g., log-transform reaction time data),
the use of robust regression (instead of removing outliers), or the use of different dis-
tributions in Bayesian models that better fit the distribution at hand (e.g., chapter
10).
28. That’s why you will see degrees of freedom with decimal places in your output here.
If you want an integer as your degrees of freedom, add var.equal = TRUE to your
t.test() function.
29. Alternatively, using %>%, we could run TukeyHSD(twoWay, which = “test”) %>% plot().
30. If you don’t see all three comparisons on the y-axis, try clicking on Zoom in your
plot window (R will adjust how much text is displayed on the axes of a plot based
on the resolution of your display).
31. I recommend that you wait until chapter 10 to install brms, bayesplot, and
rstanarm, as you will need to follow some specific instructions. If you create a vector
with all the package names in it, you can run the function only once:
install.packages(c(“arm”, “lme4”, “MuMIn”, …)).
32. How often you need to update a package will depend on the package and on your
needs. To check for package updates, click on the Packages tab in pane D and then
on “Update”.
PART II
3
CONTINUOUS DATA
In this part of the book, we will expand the discussion on figures started in
§2.6. We will examine different types of figures and how to actually create
them using R. Everything we examine in this part will come up again in
Part III, where we will statistically analyze the patterns that we see in the
data. To get started with data visualization in R, our focus in this chapter
will be continuous variables, which are in general easier to code than categor-
ical variables (our focus in chapter 4).
Continuous variables are quite common in second language research. Exam-
ples include test scores, reaction times, age of acquisition, number of hours
exposed to a given language (e.g., per week), and so on. It’s important to
note that no continuous variable is truly continuous. We know that test
scores are typically bound, ranging from 0 to 100, for example. Likewise, reac-
tion times can’t be negative, so they have a lower bound. Thus, none of these
variables actually go from negative infinity (−∞) to positive infinity (+∞).
However, we assume (for practical purposes) that they are continuous
enough to be treated as continuous in our statistical analyses as long as they
involve a range with values that are not discrete.
Let’s take age as an example. If your study asks participants about their age,
you may end up with a continuous variable if their responses result in an
actual range, say, from 19 to 65, where you have different participants spread
across that range. However, you could be in a situation where your participants
all fall into just three ages: 19, 24, or 28 years old. In that case, you’d still
have a range (technically), but it would make more sense to treat “age” as a dis-
crete (ordered) variable, not as a continuous variable per se, given that you only
have three different ages in your pool of participants.
DATA FILE
We will use a csv file to make figures in R. Make sure you have downloaded
feedbackData.csv so you can follow every step. This file simulates a hypo-
thetical study on two different types of feedback, namely, explicit correc-
tion and recast. The dataset contains scores for a pre-, post-, and
delayed post-test. Three language groups are examined: speakers of
German, Italian, and Japanese. Other variables are also present in the
data—we will explore this dataset in detail later.
At this point, before proceeding to the next sections, you should create a
new folder inside bookFiles and name it plots. Next, create a new R
Project inside plots—the new R Project will be called plots.Rproj and will
be your second R Project (basics.Rproj was created in chapter 2). These
two R Projects are in two separate directories (basics and plots, respectively),
so this will keep things organized. The folder we just created, plots, will be
our working directory for chapters 3, 4, and 5, given that all three focus on
visualization. This file organization is shown in Fig. 3.1. A complete list
with all the files used in this book can be found in Appendix D—make
sure you keep track of all the necessary files by visiting the appendix every
now and then.
Once you’ve created your new project, plots.Rproj, open it and create the
first script of the project—let’s call it continuousDataPlots.R. All the code
blocks in the sections that follow should go in that script. Fig. 3.1 also
lists two other scripts under plots, namely, categoricalDataPlots.R and
optimizingPlots.R, which will be created later on, in chapters 4 and 5,
respectively.
bookFiles
    basics (ch. 2)
        basics.Rproj
        rBasics.R
        dataImport.R
        dataPrep.R
        eda.R
        stats.R
        figures/...
    plots (ch. 3, 4, 5)
        plots.Rproj
        feedbackData.csv
        rClauseData.csv
        continuousDataPlots.R
        categoricalDataPlots.R
        optimizingPlots.R
        figures/...
FIGURE 3.1 File Organization for the R Projects Used in This Book
Lastly, we will again create a folder called figures, which is where
we will save all the figures created in this part of the book—see Fig. 3.1.
In this hypothetical study, two groups of English learners received feedback
over a period of time (e.g., one semester). One group received only recasts as feed-
back; the other group received only explicit correction. Throughout the seme-
ster, all participants were assessed ten times using two separate tasks. Their
scores in each assessment can be found in the task_… columns. For
example, task_B3 provides scores for assignment 3 from task B—for our pur-
poses here, it doesn’t matter what tasks A and B were. Participants’ native lan-
guages were German, Italian, or Japanese. Participants also differed in terms of
how many hours per week they used/studied English outside of the classroom
(Hours column). Finally, AgeExp lists how old participants were when they
first started studying English. Recall that if you run str(feedback), you will
see the number of levels in each of the factors1 just described (e.g., L1 is a
factor with 3 levels, German, Italian, and Japanese). Importantly, you will
also see that ID is a factor with 60 levels, which means we have 60 participants
in the study—one row per participant in the data.
As imported, feedback is not tidy, because scores are spread across ten
columns—and this would clearly be a
problem, as we’ll see later. The bottom line is: you want to work with tidy
(long) data.
In code block 11, we have a series of steps that mirror what we often have to
do in our own research. The first line, you may recall from chapter 2, simply
“cleans” the environment—this is useful if you have another script open with
many variables which you don’t want to use anymore (this is not the case here
because this is the only script open (continuousDataPlots.R) and the only
script in the directory where we also find plots.Rproj, as discussed earlier).
Line 2 loads tidyverse, which itself loads ggplot2 (among other packages).
Line 5 imports the data using read_csv() and assigns it to a variable (feedback).
Note that read_csv() is not read.csv(): even though they accomplish pretty
much the same task, read_csv() comes from tidyverse and creates a tibble—
recall the brief discussion in chapter 2. It’s really up to you if you prefer to
use data frames (read.csv()) or tibbles (read_csv()). Finally, just as read_csv()
can replace read.csv(), so too write_csv() can replace write.csv().
Lines 8–9 transform all character columns into factor columns, so that dif-
ferent values in the columns are treated as levels; line 12 prints the top rows of
the dataset,2 line 15 prints all column names (variables), and line 18 lists the var-
iables and their respective classes. As we’ll be using this script for all the plots in
this chapter, you won’t have to reload the package or re-import the data (unless
you close RStudio and reopen it, of course).
The key in code block 11 is lines 21–25, which are responsible for our wide-
to-long transformation and which create a new variable, longFeedback. The
code should be familiar, as we discussed this type of transformation when we
examined code block 6.
What’s new in code block 11 is line 25. If you run lines 21–24 (removing the
%>% at the end of line 24), you will notice that we end up with two columns:
Task and Score (defined in lines 22 and 23). The former will contain information
about the task itself and the item: task_A1, task_A2, and so on. Line 25 separates
this column into two columns, namely, Task and Item—in other words, line 25
separates the column we created in line 22 into two columns. The sep = 6 argu-
ment inside the separate() function specifies where we want to split the column:
the sixth character from the left edge of the word. Because our Task column
always contains values with seven characters (t₁ a₂ s₃ k₄ _₅ A₆ 1₇), we can
easily separate this column by referring to a fixed position within the value.
Finally, line 27 visualizes longFeedback to make sure it looks good (you can
also use View(longFeedback)). It should resemble Table 3.2—and it does.
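To see separate() in isolation, here is a self-contained sketch (the tibble
toy is invented for illustration):

library(tidyverse)
toy = tibble(Task = c("task_A1", "task_B3"))
toy %>% separate(col = Task, into = c("Task", "Item"), sep = 6)
# Result: Task = "task_A"/"task_B" and Item = "1"/"3"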
Remember: when we wide-to-long transform our data, all other columns
remain intact. The main structural difference is that now we have a much
longer dataset: while feedback had 60 rows (one per participant), longFeedback
now has 600 rows (ten scores per participant). All the figures discussed in the
next sections will rely on longFeedback.
3.3 Histograms
Histograms are useful to visualize the distribution of a given continuous var-
iable. For example, in longFeedback, we may want to see the distribution of
scores—recall that you can access that column by running longFeedback
$Score. If we create a histogram of that variable, we will see that the mean
R code
1 rm(list = ls()) # To remove variables from your environment (you shouldn't have any)
2 library(tidyverse)
3
4 # Read file as tibble:
5 feedback = read_csv("feedbackData.csv")
6
7 # Transform chr to fct:
8 feedback = feedback %>%
9 mutate_if(is.character, as.factor)
10
11 # Print top rows:
12 head(feedback) # or simply run "feedback"
13
14 # Print all column names:
15 names(feedback)
16
17 # List variables:
18 str(feedback)
19
20 # Wide-to-long transform:
21 longFeedback = feedback %>%
22 pivot_longer(names_to = "Task",
23 values_to = "Score",
24 cols = task_A1:task_B5) %>%
25 separate(col = Task, into = c("Task", "Item"), sep = 6)
26
27 head(longFeedback)
28
29 nrow(feedback) # 60 rows before transformation
30 nrow(longFeedback) # 600 rows after transformation
R code
1 # Histogram:
2 ggplot(data = longFeedback, aes(x = Score)) +
3 geom_histogram(bins = 15,
4 color = "white",
5 binwidth = 4,
6 alpha = 0.5) +
7 labs(x = "Scores",
8 y = NULL) +
9 theme_classic()
10
11 # Save plot in figures folder:
12 # ggsave(file = "figures/histogram.jpg", width = 4, height = 2.5, dpi = 1000)
In Fig. 3.3, small ticks along both axes mark where individual data points fall
in the figure—this is known as a rug plot layer. For German speakers, for example, we
can see a higher concentration of data points between 4 and 8 weekly hours of
study relative to the Japanese group.
Fig. 3.3 shows that more hours of study correlate with higher scores for
all three groups—that is, a positive correlation/effect. Indeed, if you run a
Pearson’s product-moment correlation test in R, you will see that r = 0.24,
p < 0.001—notice that this correlation does not distinguish the different
native languages in question, so here our (micro) statistical analysis would not
be aligned with the data visualization (cf. §2.6.2.1). You can run a correlation
test in R by running cor.test(longFeedback$Hours, longFeedback$Score).
Let’s now inspect code block 13, which generates Fig. 3.3. Line 3,
geom_point(), specifies that we want to create a scatter plot—as usual, alpha
= 0.1 simply adds transparency to the data points—try removing the + at the
end of line 3 and running only lines 2 and 3 to see what happens (it’s useful
to see what each line is doing). Line 4, stat_smooth(), is adding our trend
lines using a particular method (linear model, lm, our focus in chapter 6)—
the default color of the line is blue, so here we’re changing it to black. Line 5
creates facets. The command facet_grid() is extremely useful, as it allows us
to add one more dimension to our figure (here, native language, L1). Notice
that we use a tilde (~) before the variable by which we want to facet our data.
As long as we have a categorical/discrete variable, we can facet our data by it.
We can even add multiple facets to our figures. For example, facet_grid(A ~ B)
will generate a figure with variable A as rows and variable B as columns. In
longFeedback, we could facet our data by both L1 and Task—which have
three and two levels, respectively. The result would be a 3-by-2 figure. Alternatively, if
you prefer a wide figure (with 6 horizontal facets), you can type facet_grid
(~ A + B), and if you want a tall figure (with 6 vertical facets), you can type
facet_grid(A + B ~ .)—don’t forget the “.” on the right-hand side. The argu-
ment labeller = tells ggplot2 that we want to label our facets (i.e., we want to
print the name of the factor, not just the names of the levels). In our code,
we’re asking ggplot2 to label both the variable (the name of the factor), for
example, L1, and the values (the content of the levels), for example, German.
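For instance, a sketch of a multi-facet version of our scatter plot, with Task
as rows and L1 as columns:

ggplot(data = longFeedback, aes(x = Hours, y = Score)) +
  geom_point(alpha = 0.1) +
  facet_grid(Task ~ L1, labeller = "label_both") +  # 2-by-3 grid of facets
  theme_classic()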
Line 8 in code block 13, geom_rug(), adds the marginal ticks along the axes.
Notice that we can also add transparency to our rugs, by adding alpha = 0.2 to
geom_rug()—you could also add color (the default is black). Whenever you
want to learn what arguments a given function takes, you can run help
(geom_rug) or ?geom_rug() (for the example in question).
The trends shown in Fig. 3.3 ignore the different tasks and assignments in the
hypothetical study in question. In other words, we are looking at scores as a
whole. But what if the relationship between Score and Hours is different
depending on the task (A or B)? Let’s adjust our figure to have two different
trend lines based on the two levels of Task in the data. The new plot is
shown in Fig. 3.4—note that we’re using different line types for each task
R code
1 # Scatter plot:
2 ggplot(data = longFeedback, aes(x = Hours, y = Score)) +
3 geom_point(alpha = 0.1) +
4 stat_smooth(method = lm, color = "black") +
5 facet_grid(~L1, labeller = "label_both") +
6 labs(x = "Weekly hours of study",
7 y = "Score") +
8 geom_rug(alpha = 0.2) +
9 theme_classic()
10
11 # Save plot:
12 # ggsave(file = "figures/scatterPlot1.jpg", width = 6, height = 2.3, dpi = 1000)
FIGURE 3.4 A Scatter Plot with Multiple Trend Lines Using ggplot2
(solid line for A, dashed line for B). As we can see, the correlation between
scores and hours of study doesn’t seem to be substantially affected by Task.
The code to generate Fig. 3.4 can be found in code block 14. Pay attention
to line 4, where we have stat_smooth(). One of the arguments in that func-
tion is aes()—the same argument we use in our first layer to specify our axes.
There we define that we want trend lines to be of different types based on a var-
iable, namely, Task. Notice that we are specifying that locally. In other words,
only stat_smooth() has the information that we want to differentiate the two
tasks. The other plotting layers, such as geom_point(), will therefore disregard
Task. To specify an argument globally, you can add it to the first layer, in line 2,
inside aes() in the ggplot() function (we already do that by telling ggplot()
what our axes are; the other layers simply inherit that information).
Here’s an example to help you understand the effects of local and global
specifications in a figure using ggplot2. Imagine that we wanted to produce
Fig. 3.4, but using colors instead of line types. You could have as your
first layer ggplot(data = feedback, aes(x = PreTest, y = PostTest, color =
Task)). Because that color specification would be global, every layer in the
plot (points and trend lines alike) would inherit it.
R code
1 # Scatter plot:
2 ggplot(data = longFeedback, aes(x = Hours, y = Score)) +
3 geom_point(alpha = 0.1) +
4 stat_smooth(aes(linetype = Task), method = lm, color = "black") +
5 facet_grid(~L1, labeller = "label_both") +
6 labs(x = "Weekly hours of study",
7 y = "Score") +
8 geom_rug(alpha = 0.2) +
9 scale_color_manual(values = c("black", "gray")) +
10 theme_classic()
11
12 # Save plot:
13 # ggsave(file = "figures/scatterPlot2.jpg", width = 6, height = 2.3, dpi = 1000)
CODE BLOCK 14 Producing a Scatter Plot with Multiple Trend Lines Using ggplot2
Let’s now make the size of each point represent a continuous variable (Age).
Our new plot is shown in
Fig. 3.5.
Fig. 3.5 is essentially plotting three continuous variables (mean Score, Hours,
Age) and separating them based on a fourth discrete variable using facets (L1).
As it turns out, Age doesn’t seem to affect the other variables in Fig. 3.5—
which is not very surprising. The different point sizes used to represent different
ages can be added to our plot by adding aes(size = Age) to geom_point()
(local specification) or by adding it to the first layer, which is what code
block 15 shows. Clearly, this figure is not very informative, mostly because
Age doesn’t seem to make a difference here: we see larger circles at the top
and at the bottom.
FIGURE 3.5 A Scatter Plot with Point Size Representing a Continuous Variable
R code
1 # Group and summarize data:
2 ageSummary = longFeedback %>%
3 group_by(ID, L1, Hours, Age) %>%
4 summarize(meanScore = mean(Score))
5
6 head(ageSummary)
7
8 # Scatter plot: representing Age with size
9 ggplot(data = ageSummary, aes(x = Hours, y = meanScore, size = Age)) +
10 geom_point(alpha = 0.2) +
11 facet_grid(~L1, labeller = "label_both") +
12 labs(x = "Mean weekly hours of study",
13 y = "Mean score") +
14 scale_color_manual(values = c("black", "gray")) +
15 theme_classic()
16
17 # Save plot:
18 # ggsave(file = "figures/scatterPlot3.jpg", width = 6, height = 3, dpi = 1000)
CODE BLOCK 15 Producing a Scatter Plot with Three Continuous Variables Using
ggplot2.
Code block 15 shows how Fig. 3.5 is generated as well as how the data to
be plotted is prepared (summarized). In lines 2–4, we create a new variable,
ageSummary, which groups our data by four variables, ID, L1, Hours, and
Age—line 3. Line 4 then summarizes the data by creating a column that calcu-
lates the mean Score. Note that all four grouping variables are constant within each participant:
Learner_1 always has the same value for L1 (Japanese), Hours (8.1), and Age
(38). In other words, there’s no nesting involved in our grouping: we’re
simply grouping by all four variables because we want our resulting tibble
(ageSummary) to have these variables as columns, otherwise we wouldn’t be
able to refer to these variables in our figure—run line 6 to visualize the top
rows of ageSummary to make sure you understand what lines 2–4 are doing.
We will discuss data summarization again in chapter 4.
The take-home message here is that you can present different variables using
different layers in your figure. Layers can display multiple dimensions of data
using shapes, colors, line types, sizes, facets, and so on. Naturally, which
options will work depends on what your figure looks like: if you’re creating
a scatter plot with no trend lines, using linetype to represent a given variable
will be semantically vacuous.
Finally, take your time to inspect code blocks 13, 14, and 15. The fact that
they look very similar is good news: ggplot2 always works the same way, layer
by layer, so it’s a consistent package in terms of its syntax. In the next sections,
we will examine other useful plots for continuous data, and you’ll have the
chance to see more layers using ggplot2—they will again look similar.
3.5 Box Plots
FIGURE 3.6 The Structure of a Box Plot (Median, Q1 (25%), Q3 (75%), IQR, Outliers)
The box in a box plot represents the interquartile range (IQR). Simply put, the box
represents the middle 50% of the data points in any given dataset, that is, the
“bulk” of the data. Within the box,
we can see a vertical line representing the median (second quartile, or 50th per-
centile), so we know that 50% of the data is below the line and 50% of the data
is above the line. The whiskers represent the remainder of the data excluding
outliers. Technically speaking, the lower whisker (on the left) extends to
Q1 − 1.5 × IQR, whereas the upper whisker (on the right) extends to
Q3 + 1.5 × IQR. Thus, the box and the whiskers account for approximately 99.3% of
the data (assuming normally distributed data). You
may not have outliers, of course (Fig. 3.7 doesn’t), but if you do, they will
be displayed as individual points in your box plot.
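If you want to check the arithmetic yourself, here is a minimal sketch (assuming longFeedback from code block 11 is loaded); quantile() and IQR() are base R functions:

# A minimal sketch: the quartiles, IQR, and whisker limits behind a box plot
quartiles = quantile(longFeedback$Score, probs = c(0.25, 0.5, 0.75))
iqr = IQR(longFeedback$Score)

lowerWhisker = quartiles[[1]] - 1.5 * iqr # Q1 - 1.5 x IQR
upperWhisker = quartiles[[3]] + 1.5 * iqr # Q3 + 1.5 x IQR

# Any Score outside [lowerWhisker, upperWhisker] would be drawn as an outlier:
sum(longFeedback$Score < lowerWhisker | longFeedback$Score > upperWhisker)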
You don’t have to remember the exact percentage points involved in a box
plot. Just remember that a box plot shows the spread of the data as well as its
median, so it provides a very informative picture of our datasets. Focus on the
box first, and then check how far the whiskers go.
Let’s now inspect Fig. 3.7. Here we have a 2-by-3 plot, and we have four
different dimensions (variables): on the y-axis we have Score, a continuous var-
iable; on the x-axis we have Feedback, a factor with two levels; and then we
have two sets of facets, one for L1 (columns) and another for Task (rows). In
addition to box plots, note that we also have semitransparent data points on the
background, so the reader has access to the actual data points as well as the box
plots showing the spread of the data.
Overall, we can see in Fig. 3.7 that Recast seems to yield higher scores rel-
ative to Explicit correction. As a rule of thumb, the more two box plots
overlap, the less likely it is that they are statistically distinct. For example, if
you inspect the facet for Japanese speakers, you will notice that box plots for
Explicit correction and Recast overlap almost completely.3 As a result, these
learners don’t appear to be affected by Feedback the same way as the other two
groups in the figure.
Code block 16 has some new lines of code. First, in line 3, we have
geom_jitter() with some transparency. This function spreads data points that
would otherwise be stacked on top of each other. Look back at our x-axis.
Notice that we have a discrete variable there. As a result, all the data points
would be vertically aligned—which wouldn’t make for a nice figure to look
at. By using geom_jitter(), we ask ggplot2 to spread those data points.
Next, in line 4, we have our main function: geom_boxplot(). Here, we’re
using alpha = 0, which means the box plots will be completely transparent
(they’re gray by default). Remember once more: to create a box plot, you
must have a discrete or categorical variable on your x-axis and a continuous
variable on your y-axis. We then use facet_grid() in line 5 to create our
two facets—at this point you should be familiar with lines 6 and 7 as well as
line 10. Finally, note that we’re not specifying the label for the x-axis here
because the name of the variable is already appropriate (Feedback). ggplot2
will label the axes by default with their respective variable names, so we only
need to specify our own labels if we’re not happy with the default labels—
this was the case with Hours in previous plots, which wasn’t self-explanatory.
Likewise, ggplot2 will automatically order the levels of a factor alphabetically
along an axis. As a result, Explicit correction is shown before Recast. We
could, however, reorder the levels based on Score in this particular case—
this is especially useful when several levels are present on the x-axis. To do
that, we would use the fct_reorder() function inside aes(): …, aes(x =
fct_reorder(Feedback, Score), y = Score).
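For instance, a minimal sketch adapting code block 16 could look like this (fct_reorder() comes with forcats, part of tidyverse, and sorts the levels by the median of Score by default):

# A minimal sketch: reordering the x-axis levels by Score
ggplot(data = longFeedback, aes(x = fct_reorder(Feedback, Score), y = Score)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(alpha = 0) +
  facet_grid(Task ~ L1, labeller = "label_both") +
  theme_classic() +
  labs(x = "Feedback", y = "Score") # otherwise the x label shows the full call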
R code
1 # Boxplot + jitter + facets:
2 ggplot(data = longFeedback, aes(x = Feedback, y = Score)) +
3 geom_jitter(alpha = 0.1) +
4 geom_boxplot(alpha = 0) +
5 facet_grid(Task ~ L1, labeller = "label_both") +
6 theme_classic() +
7 labs(y = "Score")
8
9 # Save plot:
10 # ggsave(file = "figures/boxPlot.jpg", width = 7, height = 4, dpi = 1000)
CODE BLOCK 16 Producing a Box Plot with Jittered Data Points Using ggplot2
3.6 Bar Plots and Error Bars
Bar plots are another common option when you have a categorical or discrete
variable on the x-axis. Fig. 3.8, for example, plots the mean Score by Feedback
and L1. In addition to mean values, Fig. 3.8 also displays the standard error
from the mean using error bars. Error bars showing standard errors are very
important, as they allow us to estimate our level of uncertainty around the
means (see §1.3.4). You should not have bar plots without standard errors
when you show variable (experimental) data: the bars themselves don’t
provide a lot of information (only the means in this case).
You may have noticed that the bar plot in Fig. 3.8 is structurally equivalent
to the box plot in Fig. 3.7, since both figures plot the same variables. But what
differences do you notice? Box plots show us the spread of the data as well as
the median. Bar plots (and error bars) show us the mean and its standard error.
Look carefully at the standard error bars in Fig. 3.8, which are very small. You
will notice that there’s very little overlap between Explicit correction and
Recast. In contrast, our box plots in Fig. 3.7 show massive overlap. This differ-
ence makes intuitive sense if you know what each figure is showing. But
assume that your readers don’t understand box plots that well and that you
are not showing them bar plots with error bars—your paper only has box
plots. Readers may conclude that there’s likely no effect of Feedback based
on the box plots. If you had shown them the bar plots, on the other hand,
they would likely conclude that Feedback has an effect, given that the error
bars don’t overlap in most facets in Fig. 3.8.
The take-home message here is this: these plots are showing the same data,
and the patterns they display are not contradictory. They are simply two differ-
ent perspectives of the same variables and effects. Finally, bear in mind that Fig.
3.8 is not aesthetically optimal for a couple of reasons. First, it’s quite large,
even though there’s a lot of empty space in the figure—we only have two
bars per facet. We could certainly improve it by removing Task from a facet
and using different bar colors (fill) to represent them (e.g., two shades of
gray). In that case, we’d end up with four bars per facet, and our figure
would only have one row of facets (as opposed to two). Second, the standard
errors are very small, which makes it hard to inspect potential overlaps. We
could start our y-axis from 60, which would show us a “zoomed” version of
the figure. That may not be optimal, however, since it tends to overemphasize
small differences. Alternatively, we could remove the bars and keep only the
error bars (which would automatically adjust the y-axis to only include the rel-
evant range for the standard errors). Ultimately, a bar plot is probably not the
best option here (aesthetically speaking).
The code provided in code block 17 should be familiar by now (recall the
first plot we examined back in §2.6.1). The two main layers of the figure in
question are the bars (line 3) and the error bars (line 4). Both layers are the
result of stat_summary(), a very useful function that combines plotting and
some basic statistics. Here, we are not manually calculating means or standard
errors: both are calculated on the go by stat_summary() as we plot the data. If
we flipped the order of these layers, the error bars would be behind the bars—
which wouldn’t be a problem here because our bars are completely transparent.
Finally, you could use error bars to represent the standard deviation in the
data instead of the standard error. You can do that by replacing line 4 with
stat_summary(fun.data = mean_sdl, geom = “errorbar”, width = 0.2).
When we display the standard error, we are focusing on our uncertainty
about the sample mean; when we display the standard deviation, we are focus-
ing on the variation of the actual data being plotted.
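For instance, a minimal sketch adapting code block 17 could look like this (mean_sdl ships with ggplot2 but requires the Hmisc package; mult = 1 is set here to display ±1 SD instead of the default ±2, cf. note 4):

# A minimal sketch: error bars showing the standard deviation instead of the SE
ggplot(data = longFeedback, aes(x = Feedback, y = Score)) +
  stat_summary(geom = "bar", alpha = 0.3, color = "black", width = 0.5) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "errorbar", width = 0.2) +
  facet_grid(Task ~ L1, labeller = "label_both") +
  theme_classic() +
  labs(y = "Score")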
You may be wondering: is it possible to combine the level of information of
a box plot with the mean and standard error of a bar plot with error bars? The
answer is yes. The great thing about ggplot2 (well, R in general) is that we have
total control over what happens to our data and figures. Fig. 3.9 combines a
box plot with two different error bars: solid error bars represent the SE from
the mean, and dashed error bars represent SDs in the data.4 The great thing
R code
1 # Bar plot:
2 ggplot(data = longFeedback, aes(x = Feedback, y = Score)) +
3 stat_summary(geom = "bar", alpha = 0.3, color = "black", width = 0.5) +
4 stat_summary(geom = "errorbar", width = 0.2) +
5 facet_grid(Task ~ L1, labeller = "label_both") +
6 theme_classic() +
7 labs(y = "Score")
8
9 # Save plot:
10 # ggsave(file = "figures/barPlot.jpg", width = 7, height = 4, dpi = 1000)
FIGURE 3.9 A Box Plot with Error Bars for Standard Errors and ±2 Standard
Deviations
about such a plot is that it gives us a lot of information about the data. Is it over-
kill? Probably, so you may want to comment out lines 6–8 and only show box
plots and SEs from the mean. Box plots that also display means and standard
errors are often everything you need to see about your data.
The code for Fig. 3.9 can be found in code block 18. The layers of the figure
are straightforward: we have one layer for the box plot (line 3), one layer for
the standard error bars (lines 4–5), and one layer for the standard deviation
bars (lines 6–8)—this time our box plot is gray, so it acts like a faded back-
ground image. Notice that position is added to both error bars such that SEs
move left (position = –0.2) and SDs move right (position = 0.2). If you don’t
do that, both error bars will be centered, which means they will be positioned
on top of each other. Also notice the linetype argument in line 8 to make the
SD error bars dashed.
The nice thing about Fig. 3.9 is that it shows us that although the box plots
overlap quite a bit, the means and standard errors do not (except for the Japa-
nese speakers). Overall, the figure also shows us that the means and the medians
in the data are very similar (common with data points that are normally
distributed).
In summary, bar plots are simple but effective if you want to focus on means.
If you combine them with error bars and facets, they can certainly be informa-
tive. We also saw earlier that error bars can certainly be combined with box
plots. The key is to generate figures that hit the sweet spot in terms of infor-
mation and clarity.
R code
1 # Box plot + SE + SD:
2 ggplot(data = longFeedback, aes(x = Feedback, y = Score)) +
3 geom_boxplot(alpha = 0, color = "gray50") +
4 stat_summary(geom = "errorbar", width = 0.1,
5 position = position_nudge(x = -0.2)) +
6 stat_summary(fun.data = mean_sdl, geom = "errorbar", width = 0.3,
7 position = position_nudge(x = +0.2),
8 linetype = "dashed") +
9 facet_grid(~ L1, labeller = "label_both") +
10 theme_classic() +
11 labs(y = "Score")
12
13 # Save plot:
14 # ggsave(file = "figures/boxSeSdPlot.jpg", width = 7, height = 3, dpi = 1000)
3.7 Line Plots
Let’s now examine the example in Fig. 3.10, where we have scores on the y-axis
and weekly hours of study (Hours) on the x-axis.
Line plots resemble scatter plots structurally, since both axes are continuous. In
Fig. 3.10, we see three lines, one by group (L1). The lines represent the mean
scores of each group as a function of the number of weekly hours of study. We
see a similar trend for all groups: on average, more weekly hours lead to higher
scores.
If you examine code block 19, you will notice that the function
stat_summary() is again the key to our figure. In line 3, we specify that
we want to plot means with lines by adding geom = “line”. The third step
is to add aes(linetype = L1), which by now you should be familiar with:
this uses different types of lines depending on the value of the variable
L1—recall that this variable must be a factor. Because the factor in question
has three levels (German, Italian, and Japanese), three lines are plotted.
R code
1 # Line plot:
2 ggplot(data = longFeedback, aes(x = Hours, y = Score)) +
3 stat_summary(geom = "line", aes(linetype = L1)) +
4 labs(y = "Score", x = "Weekly hours of study") +
5 theme_classic()
6
7 # Save plot:
8 # ggsave(file = "figures/linePlot.jpg", width = 6, height = 2.5, dpi = 1000)
Line plots can be especially useful if you collect longitudinal data. However,
as we can see in Fig. 3.10, they can also be useful for showing trends across a
variable that doesn’t involve time (most of us collect extra-linguistic
information that falls into that category). Naturally, if you’re looking at
ten different languages, a line plot may
not be appropriate (too many lines). But you should create the figure and
examine the trends nonetheless. Whether the figure will be in your actual
paper is a completely different story.
3.9 Summary
In this chapter, we discussed some general guidelines for visualizing data and
examined the most common plots for continuous variables using ggplot2.
All the figures discussed earlier were generated by using different layers of
code and should be saved in the figures folder (to keep scripts and figures sep-
arated within an R Project). The first layer identifies which data object should
be plotted and specifies the axes. Once the first layer is defined, we can use dif-
ferent layers to create and adjust our figures. Here is a summary of the main
functions we used.
• Scatter plots: geom_point(). Scatter plots are used when you have con-
tinuous variables on both axes in your figure. These plots are ideal for
displaying correlations. Refer to code block 13.
• Trend lines: stat_smooth(method = lm). Trend lines can be added to a
figure (e.g., scatter plot) by adding the function stat_smooth() (or
geom_smooth()) to our code. Refer to code block 13.
• Box plots: geom_boxplot(). Box plots are used when your x-axis is cat-
egorical or discrete. These plots are very informative as they show the
spread of the data as well as the median. Refer to code block 16. We also saw
how to include means and error bars on box plots—refer to code block 18.
• Bar plots: stat_summary(). Like box plots, bar plots are used when your
x-axis is categorical or discrete. Even though there is a function called
geom_bar(), you can generate bar plots using stat_summary(). You
should add error bars to your bar plot whenever it’s appropriate. Refer to
code block 17.
• Line plots: stat_summary(). Line plots are useful to show how a given
continuous variable (y-axis) changes as a function of another (ideally)
continuous variable (x-axis). Refer to code block 19.
• Facets: facet_grid(). Facets in ggplot2 allow us to add more variables to
our figures, which will be plotted on different panels. Adding a facet to our
bar plot in Fig. 3.8 would have made it more informative (e.g., facet_grid(~proficiency)).
Refer to code block 14, for example.
• aes() allows us to specify the variables that will be displayed along the y-
and x-axes. Crucially, it also allows us to specify how we wish to plot
additional dimensions. For example, if we wish to plot L1 using bars with
different colors (each color representing an L1), we would add
aes(…, fill = L1) to our code—of course, if you do that you shouldn’t
facet by L1. If we add that line of code to our first layer, its scope will be
global, and all other layers will inherit that specification. If we instead have
a line plot, we must replace fill with color, since lines have no fill argu-
ment. Thus, by enriching the argument structure of aes() we can add more
dimensions to our plots without using facets (you can naturally use both).
Refer to code block 19, for example, where we use different line types for
the different languages in our dataset; a short sketch contrasting global and
local aes() specifications follows this list.
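To make that distinction concrete, here is a minimal sketch using longFeedback. Both calls produce the same line plot in this case, but the difference matters once a figure has multiple layers:

# Global: linetype = L1 in the first layer; all other layers inherit it
ggplot(data = longFeedback, aes(x = Hours, y = Score, linetype = L1)) +
  stat_summary(geom = "line") +
  theme_classic()

# Local: only this stat_summary() layer maps linetype to L1
ggplot(data = longFeedback, aes(x = Hours, y = Score)) +
  stat_summary(geom = "line", aes(linetype = L1)) +
  theme_classic()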
3.10 Exercises
R code
1 ggplot(data = longFeedback %>%
2 filter(Age > 40, Task = "task_A"),
3 aes(x = L1, y = Score)) +
4 stat_summary() +
5 facet_grid(~Task)
PROBLEMATIC CODE B
1. Does Task affect participants’ scores? Remember that stat_summary()
is compatible with bar plots, box plots, error bars, and so on. Try answering
this question with box plots and different fill colors for Task. Then try
again with bars (and error bars) to see how they compare.
2. Using longFeedback, create a histogram showing the distribution of
Score—see Fig. 3.2. This time, however, add fill = Feedback to aes() in
the first layer of your plot. The resulting plot will have two histograms
overlaid in a single facet.
3. We often want to plot parts of our data, but not the entire dataset. Take our
line plot in Fig. 3.10: we might want to remove one of the L1 groups from
our data to have a plot with only two groups. We have already discussed
how we can subset our data using the filter() function (Table 2.3). The nice
thing is that you can use that function inside ggplot(). Go ahead and create
Fig. 3.10 without the Japanese group. Simply add %>% after the data
variable in the first layer of the plot in code block 19—indeed, you could
do a lot of complicated operations using %>% inside ggplot().
4. Reproduce Fig. 3.3, but this time only plot scores greater than 60 and
Hours greater than 8. You can add multiple filters either by separating the
conditions with a comma or “&”. Hint: If you feel stuck, do the next
question first and then come back.
5. Observe problematic code B. RStudio will try to help you with syntax
errors when you run the code, but it won’t tell you much more than that.
Your task is to make the code work: it should generate a plot with mean
scores by group (and associated error bars). Hint: You may have to inspect
the data to figure out some of the issues in the code.
Notes
1. Once you run str(feedback), you will likely see that some variables are character var-
iables (chr), not factors (fct). To see them as factors, run lines 8–9 from code block
11.
2. Because feedback is a tibble, we could just run feedback here, without head().
3. Remember, however, that box plots do not show us the mean scores of the groups,
which is what we will focus on in our statistical analysis in chapter 6.
4. By default, mean_sdl plots ±2 standard deviations from the mean. You can change
that by adding fun.args = list(mult = 1) to stat_summary() (see lines 6–8 in code
block 18).
5. You will often see line plots where the x-axis is categorical, in which case lines are
used to connect discrete points and to convey some trend. Whether or not this is rec-
ommended depends on a number of factors, including the nature of your data and
what you want to show.
4
CATEGORICAL DATA
DATA FILE
Make sure you have downloaded rClauseData.csv. This file simulates a
hypothetical study on relative clauses in second language English.
This type of syntactic ambiguity has been well studied across languages
(e.g., Cuetos and Mitchell 1988, Fodor 2002, Fernández 2002, Goad et al.
2020), and different languages have been shown to favor different interpreta-
tions in such cases. English, for example, tends to favor LOW attachment, that
is, NP2. Spanish, in contrast, has been shown to favor HIGH attachment, that
is, NP1. Therefore, while an English speaker would be more likely to assume
that it’s the nurse who likes to dance in (1b), a Spanish speaker would be
more likely to assume that it’s the daughter.
Our data comes from a hypothetical auditory experiment comparing speak-
ers of English (controls) and speakers of Spanish learning English. Participants
heard different sentences that were ambiguous (e.g., (1b)) and were subse-
quently asked “who likes to dance?” Crucially, the experiment also investigates
the possible role of a pause, given that prosody has been shown to affect speak-
ers’ interpretations (Fodor 2002). In (2), we see all three relevant conditions
(# = pause).
Getting Started
Create a new script called categoricalDataPlots.R. Because our topic of dis-
cussion is still data visualization, you can save your script in the same directory
used in chapter 3—we will also use the figures folder created in the previous
chapter, since we’re still in the same R Project. Thus, the scripts from chapters
3 and 4 (and 5) will belong in the same R Project, namely, plots.Rproj—refer
to Fig. 3.1. Finally, in categoricalDataPlots.R, load tidyverse, and import
rClauseData.csv (you can use either read.csv() or read_csv())—assign it to
variable rc (relative clause).
If you visualize the top rows of rc using head(rc), you will see that our
dataset has 12 variables: ID (participants’ ID); L1 (participants’ native language,
English or Spanish); Item (each sentence used in the experiment); Age (partic-
ipants’ age); AgeExp (learners’ age of exposure to English); Hours (learners’
weekly hours of English use); Proficiency (participants’ proficiency: Int or
Adv for learners or Nat for native speakers of English); Type (Filler or
Target items); Condition (presence of a pause: High, Low, or NoBreak); Cer-
tainty (a 6-point scale asking participants how certain they are about their
responses); RT (reaction time); and Response (participants’ choice: High or
Low). There are numerous NAs in rc. This is because native speakers are
coded as NA for AgeExp and Hours, for example. In addition, Condition
and Response are coded as NA for all the fillers2 in the data (given that we
only care about target items). So don’t worry when you see NAs.
Some of the variables in rc are binary, namely, L1, Type, and, crucially,
Response—our focus later (§4.1). To examine how we can prepare our data
and plot some patterns, we will examine how Proficiency and Condition
may affect speakers’ responses—clearly the focus of the hypothetical study in
question.
4.1 Binary Data
The error bars in Fig. 4.1 give us some idea of what to expect when we
analyze this data later (in chapter 7): error bars that don’t overlap (or that
overlap only a little) may reveal a statistical effect. For example, look at the
bar for Adv in condition High, and compare that bar to Adv in condition
NoBreak. These two bars overlap a lot (they have almost identical values
along the y-axis). This suggests that advanced learners’ preferences are not dif-
ferent between the High and NoBreak conditions. Here’s another example: if
we had to guess and had no access to any statistical method, we could say that
“it’s more likely that native speakers are different from advanced learners in the
Low condition, than that advanced learners are different from intermediate
learners in the same condition”. The take-home message here is that Fig. 4.1
already tells us a lot about the patterns in our data. It helps us understand
what’s going on, and it also helps us move forward with our analysis.
Let’s now examine the code block that prepared our data and generated Fig.
4.1. Recall that we’re currently working with the categoricalDataPlots.R
script. Code block 20 therefore mirrors what the top of that script should
look like. Lines 2–7 are familiar at this point, as they simply load tidyverse
and our data (rClauseData.csv) and visualize the top rows of the data—if
you haven’t closed RStudio since chapter 3, tidyverse is already loaded, but
you can run line 2 anyway. The code is divided into two parts: first, we
prepare the data (lines 10–17), and then we plot it (lines 25–30). Once you
prepare the data, the plotting itself should be familiar.
Lines 10–17 create a new variable, which will be our new tibble with per-
centages. First, line 10 takes our data as the starting point (rc). Then, line 11
filters the data by removing all the fillers and focusing only on target items
(recall that the column Type has two values, Filler and Target). Third,
line 12 groups the data by a number of variables, namely, ID, Proficiency,
Condition, and Response—these variables will be present in our new tibble,
called props. Imagine that R is simply “separating” your data based on some
variables, such that every participant will have his/her own dataset, and
within each dataset R will further separate the data based on the variables we
want. Fourth, line 13 counts the responses, that is, how many High and Low
responses there are for each participant and condition.
At this point we have raw numbers only (counts), not percentages. In other
words, props right now contains a column n where we have the number of
times participants chose High and Low for each condition—and we also have
a column for proficiency levels, as discussed earlier. Our next step is to calculate
the percentages. The issue here is that we can calculate percentages in different
ways. What we want is the percentage that represents the number of times
someone chose High or Low for a given condition. That’s what line 14 does: it
groups the data again, this time by Proficiency, ID, and Condition. Why
don’t we have Response here? Assume a participant had five responses for
High and five responses for Low. We want to divide each count by both
R code
1 # Remember to add this code block to categoricalDataPlots.R
2 library(tidyverse)
3
4 # Import data:
5 rc = read_csv("rClauseData.csv")
6
7 head(rc)
8
9 # Prepare data: percentages
10 props = rc %>%
11 filter(Type == "Target") %>%
12 group_by(Proficiency, ID, Condition, Response) %>%
13 count() %>%
14 group_by(Proficiency, ID, Condition) %>%
15 mutate(Prop = n / sum(n)) %>%
16 filter(Response == "Low") %>%
17 ungroup()
18
19 # Visualize result (starting with ID s1):
20 props %>%
21 arrange(ID) %>%
22 head()
23
24 # Figure (bar plots + error bars):
25 ggplot(data = props, aes(x = Proficiency, y = Prop)) +
26 stat_summary(geom = "errorbar", width = 0.2) +
27 stat_summary(geom = "bar", alpha = 0.3, color = "black") +
28 facet_grid(~ Condition, labeller = "label_both") +
29 labs(y = "% of low attachment") +
30 theme_classic()
31
32 # ggsave(file = "figures/mainBinaryPlot.jpg", width = 7, height = 2.5, dpi = 1000)
CODE BLOCK 20 Preparing Binary Data for Bar Plot with Error Bars
FIGURE 4.2 From rc to props: One Tibble per Participant (s1, s2, …, sn)
Our new tibble, props, contains three rows per participant (the proportions of
the Low responses we care about, one per condition). If you run lines 20–22 in code block
20, you will see that our participant (s1) in Fig. 4.2 fills the top three rows of
the output (line 21 is sorting the data by ID). In column Prop, the values for s1
should be 0.5, 0.8, and 0.5—which are the values underlined in Fig. 4.2.
Finally, because each participant now has three rows in our new data, and
because we have 30 participants, props has 90 rows, right? Well, not exactly.
You’ll notice that props has 88 rows. Why is that? When we process our
data the way we did, NAs are not counted. As a result, if a given participant
only had High responses for a particular condition, that means our participant
will have fewer than three rows in props—this is the case for participants s23
and s25, who have no Low responses for the High condition. That’s why
we ended up with 88 instead of 90 rows. There is a way to include empty
counts, but we are not going to worry about that since it won’t affect our
analysis.5
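For completeness, here is one possible sketch of that approach, following note 5 (the exact call, particularly the use of nesting(), is just one way to do it):

# A minimal sketch: keeping zero counts with tidyr's complete()
propsFull = rc %>%
  filter(Type == "Target") %>%
  group_by(Proficiency, ID, Condition, Response) %>%
  count() %>%
  group_by(Proficiency, ID, Condition) %>%
  mutate(Prop = n / sum(n)) %>%
  ungroup() %>%
  complete(nesting(Proficiency, ID), Condition, Response,
           fill = list(n = 0, Prop = 0)) %>%
  filter(Response == "Low")

nrow(propsFull) # 90 rows: 30 participants x 3 conditions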
Once we prepare our data, creating Fig. 4.1 is straightforward, given what
we discussed in chapter 3. After all, we now have a continuous variable,
Prop, in props. Lines 25–30 are all we need to produce Fig. 4.1.
The nice thing about having ID as one of our grouping variables earlier is that
we can now calculate not only the mean proportion of Low responses in the
data but also the standard error across participants. You can go back to code
block 20 and remove ID from lines 12 and 14. If you check the number of
rows now, instead of 88 rows we have only 9—we have three proficiency
levels and three conditions, and we’re only looking at Low responses. The lim-
itation here is that we wouldn’t know how individual participants behave: we’d
merely have the overall proportion of Low responses for all advanced learners
for each condition, for example. That’s why we should include ID as a group-
ing variable.
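For reference, a sketch of that by-group version of the pipeline (compare it with code block 20) would be:

# A minimal sketch: the same pipeline without ID, yielding group-level rows
groupProps = rc %>%
  filter(Type == "Target") %>%
  group_by(Proficiency, Condition, Response) %>%
  count() %>%
  group_by(Proficiency, Condition) %>%
  mutate(Prop = n / sum(n)) %>%
  filter(Response == "Low") %>%
  ungroup()

nrow(groupProps) # 9: 3 proficiency levels x 3 conditions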
R code
1 # Figure (bar plots + error bars + by-participant lines):
2 ggplot(data = props, aes(x = Condition, y = Prop)) +
3 stat_summary(geom = "line", alpha = 0.2, aes(group = ID)) +
4 stat_summary(geom = "errorbar", width = 0.2) +
5 stat_summary(geom = "bar", alpha = 0.1, color = "black") +
6 facet_grid(~ Proficiency, labeller = "label_both") +
7 labs(y = "% of low attachment") +
8 theme_classic()
9
10 # ggsave(file = "figures/varBinaryPlot.jpg", width = 7, height = 2.5, dpi = 1000)
Fig. 4.3 is very similar to Fig. 4.1. Like before, we have bars representing
means and error bars representing standard errors from the means. But now
we also have lines that represent the average response by participant, so we
can see how the variation across participants generates the error bars that we
saw before in Fig. 4.1.6 For example, compare condition High between Adv
and Int and you will see why the error bars have different heights: in the
High condition, advanced learners’ responses are more similar to one another
than intermediate learners’ responses are. Finally, notice that Fig. 4.3 has Condition on its x-axis and
Proficiency across facets, so it’s structurally different from Fig. 4.1. Stop for a
minute and consider why our x-axis can’t represent Proficiency in Fig. 4.3.
The modification is necessary because a single participant sees all conditions
in the experiment, but a single participant can’t possibly have more than one
proficiency level at once. Therefore, if we had Proficiency along our x-axis,
we wouldn’t be able to draw lines representing individual participants.
The code used to generate Fig. 4.3 is shown in code block 21. The key here is
to examine line 3, which uses stat_summary()—but this time specifies a different
geom. Transparency (alpha) is important to keep things clear for the reader—
both lines and bars have low alpha values in the figure. Crucially, we specify
aes(group = ID) inside our stat_summary(), which means each line will repre-
sent a different ID in the dataset (props). If you use color = ID in addition to
group = ID, each participant will have a different color (and a key will be added
to the right of our figure)—clearly this is not ideal, given the number of partic-
ipants. Finally, note that if we hadn’t added ID as a grouping variable in code
block 20, we wouldn’t have that column now, and Fig. 4.3 wouldn’t be possible.
4.2 Ordinal Data
When we create a bar plot with error bars for an ordinal variable, we are
accepting that using means and standard errors may not be the best way to
examine our data. One way to be more conservative would be to bootstrap
our standard errors (§1.3.4). stat_summary() assumes by default that you
want geom = “pointrange” and that your variable is normally distributed. If
your variable is not normally distributed, you can use stat_summary(fun.
data = mean_cl_boot). This function bootstraps error bars (see §1.3.4)—
you will notice, if you compare both methods, that this increases the size of
your error bars.7 The key is to assess how reliable the figure is given the statis-
tical patterns in the data, which we will examine in chapter 8. If the appropriate
statistical model is consistent with the patterns shown in a bar plot (even in one
with traditional standard errors), then our figure is fine.
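A minimal sketch of that comparison (assuming rc is imported and Certainty is still numeric, i.e., before we recategorize it in code block 22) could look like this:

# A minimal sketch: default vs. bootstrapped error bars
# (mean_cl_boot ships with ggplot2 but requires the Hmisc package)
ggplot(data = rc %>% filter(Type == "Target"),
       aes(x = Proficiency, y = Certainty)) +
  stat_summary() + # default: mean and SE-based point range
  stat_summary(fun.data = mean_cl_boot, color = "gray50",
               position = position_nudge(x = 0.2)) + # bootstrapped point range
  facet_grid(~ Condition, labeller = "label_both") +
  theme_classic()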
Our current hypothetical study is clearly not centered around Certainty—our
main variables of interest were introduced and visualized in §4.1. As a result,
certainty levels here are merely extra pieces of information regarding our par-
ticipants’ behavior, and for that bar plots with standard errors are enough. What
else could we do with ordinal data, though?
There’s another common plot that is more aligned with the ordinal nature of
our variable Certainty. It’s also a bar plot, but it’s a different kind of bar plot.
Fig. 4.4 plots certainty levels with stacked bars. Different shades of gray repre-
sent different points on our 6-point scale. The height of each bar tells us how
representative each certainty level is for each proficiency level and for each
condition. This type of figure allows us to see where participants’ Certainty
responses are clustered. For example, native speakers seem more certain in
the Low condition than in the other two conditions. Learners, in contrast,
seem more certain in the High condition than in the other two conditions—
we can easily see that by looking for the highest concentration of dark bars
in the plot, because darker shades of gray here represent a higher level of cer-
tainty. This makes our figure intuitive, and we don’t have to worry about the
non-normality of our variable. We do, however, have to compute the percent-
ages being plotted, which resembles what we did for our binary data in Fig. 4.1.
Code block 22 provides all the code you need to generate Fig. 4.4. The first
thing we need to do is check whether R understands that our Certainty var-
iable is, in fact, an ordered factor (i.e., not just a continuous variable). Recall
that by running str(rc) we have access to the overall structure of our data,
including the class of each variable. If you run line 2, you will see that R
treats Certainty as a variable containing integers (int). This is expected: R
can’t know whether your numbers are true numbers or whether they come
from a scale with discrete points. Lines 5–6 fix that by recategorizing the var-
iable: now Certainty is an ordered factor.
Lines 9–14 prepare our data by computing counts and proportions for each
point on our scale. Most of it will be familiar from code block 20, which we
discussed at length in §4.1. This time we don’t include ID as one of our vari-
ables of interest because we’re not planning to plot by-participant trends (and
we have no error bars in Fig. 4.4)—in chapter 5, we will revisit Fig. 4.4 to
improve different aspects of our data presentation.
Bar plots are extremely useful to display ordinal variables—they are certainly
not the only way to plot such variables, but they do communicate our results in
a clear and familiar way. As we saw earlier, you can treat your scale as a con-
tinuous variable, in which case traditional bar plots and standard errors may be
used (with care) to plot the mean certainty in your data. Alternatively, you can
compute percentages for responses along your scale and use those percentages
to create a stacked bar plot where fill colors can help us focus on clusters of
R code
1 # Check if Certainty is an ordered factor:
2 str(rc)
3
4 # Make Certainty an ordered factor:
5 rc = rc %>%
6 mutate(Certainty = as.ordered(Certainty))
7
8 # Prepare our data (generate percentages):
9 cert = rc %>%
10 filter(Type == "Target") %>%
11 group_by(Proficiency, Condition, Certainty) %>%
12 count() %>%
13 group_by(Proficiency, Condition) %>%
14 mutate(Prop = n / sum(n))
15
16 # Create bar plot:
17 ggplot(data = cert,
18 aes(x = Proficiency,
19 y = Prop,
20 fill = as.ordered(Certainty))) +
21 geom_bar(stat = "identity", width = 0.5, color = "black") +
22 facet_grid(~ Condition, labeller = "label_both") +
23 theme_classic() +
24 scale_fill_brewer("Certainty", palette = "Greys", direction = 1)
25
26 # ggsave(file = "figures/certaintyPlot.jpg", width = 7, height = 2.5, dpi = 1000)
responses along the scale (as opposed to using means). If your scalar data plays a
key role in your study, the latter method is recommended as you don’t have to
worry about the distribution of your ordinal data—which is almost never normal.
As we saw earlier, visualizing ordinal data is not so different from visualizing
binary data. Both tend to require some data preparation before running ggplot
()—in that sense, both data types can be more demanding than continuous
data. Fortunately, you can use the code examined earlier as a template for
future reference—adjusting some code is almost always easier than creating
the same code from scratch.
4.3 Summary
In this chapter, we discussed how to prepare our categorical data for plotting.
As we saw, when dealing with binary data we will often transform our data into
counts or percentages, which in turn allows us to employ the same visualization
techniques discussed in chapter 3. Here’s a brief summary of the chapter.
4.4 Exercises
R code
1 # Plotting by-speaker results:
2
3 ggplot(data = props %>% filter(Proficiency %in% "Nat", "Int"),
4 aes(x = reorder(ID, Prop), y = Prop, color = proficiency)) +
5 stat_summary()
6 labs(x = "Speaker", y = "Proportion of Low responses") +
7 theme_classic()
PROBLEMATIC CODE C
1. Observe problematic code C, which is supposed to plot each speaker’s
proportion of Low responses for native speakers and intermediate learners.
Your task is to make the code work. Hint: there is more than one issue in
the code.
2. Generate Fig. 4.1 without native speakers (by filtering the data in the plot
itself ). Hint: you can use the inequality operator (!=).
3. Create a box plot version of Fig. 4.1 and add means and standard errors to
it—you should use props, created in code block 20. Hint: if you simply add
stat_summary() to your figure, it will plot means and standard errors (in
the form of point ranges).
Notes
1. Unless you have coded yes as 1 and no as 0, in which case the proportion of yes
responses is easy to calculate.
2. Fillers in this study could be sentences that are not semantically ambiguous or sen-
tences where the focus is not the head of a relative clause and so on.
3. R orders level labels alphabetically. We will see how to change that in chapter 5.
4. We’re also grouping by Proficiency, but because a participant cannot have two dif-
ferent proficiency levels, this will not affect how we calculate the percentages.
5. By default, only non-zero counts are included in the resulting tibble. If your dataset
has zero counts for certain combinations, not including them may impact calculations
based on the Prop column (e.g., means), which may in turn affect any figures that
depend on these calculations. To avoid this problem, you should use the complete
(x, y) function, where x represents the response variable of interest and y represents
a list where we specify how we want to fill in zero counts and proportions. For
example, y could be fill = list(n = 0, Prop = 0) in this case.
6. This is an example where we are using lines even though our x-axis is a discrete var-
iable (cf. §3.7). Here, however, lines play a different role, connecting the means of
each participant across conditions—otherwise it would be difficult to visually trace
the trends in question.
7. If the error bars don’t change much, you could keep the traditional error bars.
5
AESTHETICS: OPTIMIZING
YOUR FIGURES
Thus far we have examined several plots created with ggplot2. We have focused
our discussion on the content of each figure, so we ignored aesthetic character-
istics such as font size, label orders, and so on. These characteristics are the
focus of this chapter. Fortunately, ggplot2 gives us total control over the aesthet-
ics of our plots, which is important to generate publication-quality figures.
Before we get started, go ahead and create a new script, optimizingPlots.R,
which will be our third script inside the directory plots in Fig. 3.1. Our ulti-
mate goal in this chapter is to create Fig. 5.1. In doing that, we will explore a
number of aesthetic modifications that can be made using ggplot2. Plotting
ordinal data is a little trickier than plotting continuous data, so this figure
will be good practice for us.
You can compare Fig. 5.1 with our first attempt to plot certainty levels back
in Fig. 4.4. Here, we have important differences. First, our bars are now hor-
izontal, not vertical, so they mirror the horizontal scale in the experiment.
Second, we no longer have a key to the right of the figure. Instead, we
moved the certainty points onto the bars themselves. As a result, you don’t
have to keep looking back and forth to remember which level each color rep-
resents. Note, too, that the font color of the scale points alternates between
black and white in order to maximize the contrast between the certainty
values and the fill of the bars. Third, we finally have actual percentages on
our y-axis—only now we have flipped our axes, so our y-axis is our x-axis
and vice versa. Fourth, we have changed the labels for Proficiency (y-axis),
which are no longer abbreviated. We have also rotated those labels, so they
are parallel with the plot’s border and thus occupy less space on the page.
The labels for Condition also have a minor change: condition NoBreak is
now No break, with a space. Finally, we now have a horizontal dashed line sep-
arating native speakers from learners—and, you may not have noticed this, but
we now have a different font family (Verdana) and size. As you can see, a lot is
different here. We will spend the remainder of this chapter examining in detail
how to implement all these changes in R. Most of the changes we will discuss
are applicable to all the other plots we have seen so far, as long as the data and
plot you have in mind are compatible with the specification you add to your
code.
We start with some data preparation, shown in code block 23. Because this is
a new script, we first need to load some packages. This time, we’ll need
more than just tidyverse, which means you will have to install two more
packages: scales and extrafont. You can install both packages by running
install.packages(c(“scales”, “extrafont”)). The first package, scales, allows
us to adjust our axes based on different scales (it will give us the percentages
along the axis). The second package, extrafont, will allow us to change the
font family in our figures—most of the time this isn’t necessary, since the
default font family is more than appropriate, but some journals will require a
specific font for your figures.
Once you have installed our two new packages, you’re ready to run lines 1–4
in code block 23—line 4 simply loads all the fonts available on your computer
by using extrafont, so a list of fonts will be printed in your console. You can
check which fonts you can use by running line 7. Lines 9–14 are familiar at this
point: we’re importing our data and making the variable Certainty an ordered
factor.
Lines 17–22 compute proportions. Simply put, we want to know what per-
centage of responses we have for all six points along our certainty scale. Cru-
cially, we want the percentages to be computed based on participants’
proficiency levels and the three experimental conditions (High, Low,
R code
1 library(tidyverse)
2 library(scales) # To get percentages on axis
3 library(extrafont)
4 loadfonts()
5
6 # Check which fonts are available:
7 fonts()
8
9 # Import data:
10 rc = read_csv("rClauseData.csv")
11
12 # Make Certainty an ordered factor:
13 rc = rc %>%
14 mutate(Certainty = as.ordered(Certainty))
15
16 # Prepare our data (generate percentages):
17 cert = rc %>%
18 filter(Type == "Target") %>%
19 group_by(Proficiency, Condition, Certainty) %>%
20 count() %>%
21 group_by(Proficiency, Condition) %>%
22 mutate(Prop = n / sum(n))
23
24 cert = cert %>%
25 ungroup() %>%
26 mutate(color = ifelse(Certainty %in% c(4, 5, 6), "white", "black"),
27 Proficiency = factor(Proficiency,
28 levels = c("Int", "Adv", "Nat"),
29 labels = c("Intermediate", "Advanced", "Native")),
30 Condition = factor(Condition,
31 levels = c("High", "Low", "NoBreak"),
32 labels = c("High", "Low", "No break")))
CODE BLOCK 23 Plotting Certainty Levels with Adjustments: Preparing the Data
NoBreak); that’s why we’re using group_by() once again—we do that in lines
19–20 to generate a count and again in 21–22 to compute proportions (this
process should be familiar by now). The result of lines 17–22 is that we
now have a new variable, cert, which contains five columns: Proficiency,
Condition, Certainty, n, and Prop.
Lines 24–32 are the chunk of code that is actually new in code block 23.
Here we’re making some adjustments to cert. First, we ungroup the variables
(which we grouped twice in lines 17–22). We then create one column (color)
by using a conditional statement. The column color will have the following
values: “white” if a given certainty level is 4, 5, or 6 and “black” otherwise.
We then modify the labels of two existing columns (Proficiency and Condition).
In lines 27–29, we are asking R to take the existing levels of Proficiency and
label them with non-abbreviated proficiency levels. In lines 30–32, we are
simply adding a space to NoBreak in Condition. This is all we need to do to
get started with our figure.
Code block 24 is longer than usual, mostly because of our several adjust-
ments. The layers here are (mostly) ordered from major (content-related) to
minor (form-related). You can probably guess what every line of code in code
block 24 is doing. That’s the advantage of having access to the actual code. You
don’t ever have to start the figure from scratch: you can adapt the code pro-
vided here to suit your data.
Our first layer (lines 2–6) for Fig. 5.1 points ggplot2 to our data object (cert),
as usual, but this time we have four arguments inside aes(). In addition to our
typical axes, we also establish label = Certainty and fill = Certainty. The
former is responsible for labeling the points along the scale; the latter
changes the fill color of the bars depending on the certainty value (this
should be familiar from Fig. 4.4). Next, in lines 7–8, we add our bars. We
specify stat = “identity” because we are providing the y-axis ourselves, and
we want the bars to plot those values (and not, say, calculate the values some
other way). There we also adjust the width of the bars and the color for
their borders. Line 8 is important: we’re asking ggplot2 to reverse the order
of the scale—this will ultimately guarantee that our bars go from left (1) to
right (6), and not the other way around, which would be counterintuitive.
Next, we flip the entire figure in line 9 by using coord_flip()—this is what
makes our y-axis become our x-axis and vice versa. Naturally, you can flip
other plots in ggplot2 by adding this layer to your code.
R code
1 # Create bar plot for ordinal data:
2 ggplot(data = cert,
3 aes(x = Proficiency,
4 y = Prop,
5 label = Certainty,
6 fill = Certainty)) +
7 geom_bar(stat = "identity", width = 0.5, color = "black",
8 position = position_stack(reverse = TRUE)) +
9 coord_flip() +
10 geom_text(aes(color = color),
11 position = position_fill(vjust = 0.5, reverse = TRUE),
12 fontface = "bold") +
13 facet_grid(~ Condition, labeller = "label_both") +
14 geom_vline(xintercept = 2.5, linetype = "dashed", color = "gray") +
15 scale_fill_brewer("Certainty", palette = "Greys", direction = 1) +
16 scale_color_manual(values = c("black", "white")) +
17 scale_y_continuous(labels = percent_format(), breaks = c(0.25, 0.5, 0.75)) +
18 theme_classic() +
19 theme(legend.position = "none",
20 text = element_text(family = "Verdana", size = 10),
21 axis.ticks.y = element_blank(),
22 axis.text = element_text(color = "black"),
23 axis.text.y = element_text(angle = 90, hjust = 0.5)) +
24 labs(y = "Percentage of responses",
25 x = NULL)
26
27 # ggsave(file = "figures/certaintyPlot2.jpg", width = 7, height = 3.5, dpi = 1000)
Once our axes are flipped, lines 10–12 adjust the text in our labels, which
represent the certainty levels in Fig. 5.1. First, we add aes(color = color),
which specifies that the color of the text depends on the value of the
column color in our data. Line 11 then adjusts the vertical position of the
labels1—try removing vjust from line 11 to see what happens. Line 12
defines that the labels along the scale should be bold.
Line 13 is familiar: we’re adding another dimension to our figure, namely,
Condition—and we’re asking ggplot2 to label the values of the variable. Next,
line 14 adds the horizontal dashed line separating native speakers from learners.
The function geom_vline() draws a vertical line across the plot, but because we
have flipped the axes, it will draw a horizontal line instead. We still have to give
it an xintercept value (i.e., where we want the line to cross the x-axis), even
though the axes have been flipped. Because our axis contains three discrete
values, xintercept = 2.5 will draw a line between levels 2 and 3, that is,
between advanced learners and native speakers.
Lines 15–17 use similar functions. Line 15, scale_fill_brewer(), uses a pre-
defined palette (shades of gray) to fill the bars (recall that the fill color will depend
on the variable Certainty, which we defined back in line 6).2 The argument
direction merely defines that we want the colors to follow the order of the
factor values (direction = –1 would flip the order of colors). As a result,
darker colors will represent a higher degree of certainty. Line 16 manually
defines the colors of the labels, and line 17 defines not only that we want percent-
ages as labels for our former y-axis (now x-axis) but also that we want to have
only three breaks (i.e., we’re omitting 0% and 100%, mostly because the x-axis
looks cluttered across facets if we don’t omit those percentage points).
Everything from line 19 to line 23 is about formatting. First, we get rid of our
key (the one we have in Fig. 4.4), since now we have moved the actual certainty
values onto the bars for better clarity. Second, in line 20, we choose Verdana as
our font family, and we set the font size we desire.3 Line 21 removes the ticks
along our y-axis, line 22 changes the color of our axis text (it’s typically dark
gray, not black), and line 23 changes the angle of the labels along our y-axis.
In theory, we could remove the percentage points from the x-axis in
Fig. 5.1. The way the figure is designed, we just have to look at the sizes
and colors of the bars, and we will easily see where participants are more or
less certain. The actual percentage point of each bar is clearly secondary, and
by definition, all bars must add up to 100%. Finally, note that the y-axis is
not labelled. The idea here is that its meaning is obvious, so no labels are nec-
essary (you may disagree with me).
You might be wondering about the title of the figure. Even though we can
include a title argument in labs(), I personally never do that. Titles are usually
defined not in the plot itself, but rather in the caption of the plot, which is
defined later (e.g., in the actual paper)—this is the case for all figures in this
book.
5.2 Exercises
Notes
1. vjust adjusts the vertical justification of the labels (options: 0, 0.5, 1)—0.5 centers the
labels.
2. ggplot2 has its own default colors, but you will likely want to change them. There
are numerous palettes available when you use scale_fill_brewer(). One example is
the palette OrRd, which goes from orange to red and can be quite useful for plotting
scalar data. In addition to Greys, popular palettes include Blues and Greens.
3. Technically, you can change the font size simply by changing the width and height of
the plot when you save it in line 27. For example, if you set width to 5 and height to
2, your font size will be much larger—too large, in fact.
PART III
Analyzing the Data
6
LINEAR REGRESSION
In Part II we visualized different datasets and explored different figures that com-
municate the patterns in our data. The next step is to focus on the statistical anal-
ysis of our data. Even though we discussed trends in the data as we created
different figures, we still don’t know whether such trends are statistically relevant.
In this chapter, we will focus on linear regressions. To do that, we will return
to continuous (response) variables, which we visually examined in chapter 3.
Linear regressions will be the first of the three statistical models we discuss in
this book. Later, in chapters 7 and 8, we will explore categorical response var-
iables, which we visually examined in chapter 4. As you can see, we will follow
the same structure used in Part II, where we started our discussion with con-
tinuous data and later examined categorical data.
For the present chapter, we will start with an introduction to linear regres-
sions (§6.1). Then, in §6.2, we will statistically analyze the data discussed in
chapter 3. Finally, we will spend some time on more advanced topics (§6.3).
Given that our focus in this book is on regression analysis, this chapter will be
the foundation for chapters 7 and 8. Before we get started, however, it’s time to
discuss file organization one more time. So let’s make sure all our files are orga-
nized in a logical way.
In Fig. 3.1 (chapter 3), the folder bookFiles contains two subfolders, basics
and plots. Recall that the plots folder contains all the files we used in Part II
of this book. We will now add a third subfolder to bookFiles—let’s call it
models. This folder will in turn contain another subfolder, Frequentist. Inside
Frequentist, we will add all the files for chapters 6–9, and we’ll also create
another figures folder, just like we did in the basics and plots directories.
Later, when we discuss Bayesian models (chapter 10), we will add another
subfolder to models, namely, Bayesian—remember that you can check the file
structure used in this book in Appendix D.
For this chapter, we will first create an R Project called Frequentist.RProj
(inside the Frequentist folder/directory), and then we will add three new
scripts to our Frequentist directory—simply create three new empty scripts
once you’ve created the R Project and save them as dataPrepLinearModels.
R, plotsLinearModels.R, and linearModels.R. To recap, you should
have the following folder structure: bookFiles > models > Frequentist.
Inside Frequentist, you should have one folder (figures) and four files: an
R Project (Frequentist.RProj) and three scripts (dataPrepLinearModels.R,
plotsLinearModels.R, and linearModels.R). At this point you can
probably guess where this is all going: we will prepare our data in
dataPrepLinearModels.R, plot some variables in plotsLinearModels.R,
and work on our statistical analysis in linearModels.R. As mentioned earlier,
here we will use the same dataset from chapter 3, namely, feedbackData.
csv. Go ahead and add a copy of that file to the newly created folder
Frequentist.
Recall that our feedbackData.csv is a wide table, so we need to transform
the data before plotting or analyzing it (the same process we did for figures
in Part II). Fortunately, we have already done that in code block 11. So all
you need to do is copy that code block, paste it into dataPrepLinearModels.
R, and then save the file. Next, open plotsLinearModels.R and add source
(“dataPrepLinearModels.R”) to line 1. Once you run that line, you
will load the script in question, which will in turn load the data file
feedbackData.csv as feedback and create a long version of feedback,
longFeedback. All the figures shown later should be created inside
plotsLinearModels.R, not linearModels.R, which is reserved only for the
actual statistical models. The code blocks that generate figures will save them
in the figures folder as usual—feel free to change the location if you prefer
to save the figures elsewhere on your computer.
Next, open linearModels.R and add source(“dataPrepLinearModels.R”) to
line 1 as well, so both plotsLinearModels.R and linearModels.R will source the
script that imports and prepares the data. All the code blocks that run models
shown later should be added to linearModels.R. This should be familiar at
this point.1
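Schematically, and assuming code block 11 has been copied into dataPrepLinearModels.R, the top of both scripts looks like this:

# Top of plotsLinearModels.R and of linearModels.R:
source("dataPrepLinearModels.R") # imports feedback and creates longFeedback

# After sourcing, both objects are available:
head(feedback)
head(longFeedback)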
To make things easier, make sure you have all three scripts we just created
open in RStudio (in three different tabs), so you can easily go from data prep-
aration to plots to models. This is a very effective way to work with data: we
separate these three essential components into three separate scripts in a
self-contained project in the same directory. In this chapter, we will focus
on linearModels.R, but it’s nice to be able to generate our figures in
plotsLinearModels.R—you can naturally copy some of the code blocks from
chapter 3 into plotsLinearModels.R.
DATA FILE
In this chapter (§6.2) we will use feedbackData.csv again. Recall that this file
simulates a hypothetical study on two different types of feedback, namely,
explicit correction and recast. The dataset contains scores for a pre-, post-,
and delayed post-test. Three language groups are examined, speakers of
German, Italian, and Japanese.
6.1 Introduction
In this chapter, we will first review what linear models are and what they do
(this section). We will then see them in action with different examples: you
will learn how to run them in R, how to interpret them, and how to report
their results. Once you finish reading the chapter, you may want to reread
this introduction to review key concepts—a summary is provided at the end
of the chapter, as usual.
Observe the scatter plot on the left in Fig. 6.1. There seems to be a relation-
ship between the two variables in question: as we increase the value along our
x-axis, we also see some increase in values on our y-axis. This pattern indicates
some positive effect of x on y. On the right, we are fitting a line to the data.
The goal here is simple: we want to use some explanatory variable (x-axis)
to predict our response variable (y-axis). Simply put, we want to fit a line to
the data that “best captures the pattern we observe”. But how can we do
that? And why would we want to do that? Let’s focus on the how first.
What would a perfect line look like in Fig. 6.1? Well, it would be a line that
touches all data points in the data. Clearly, such a line is impossible here: our data points are too spread out, so no straight line will be perfect. Here the best line is simply the line that reduces the overall distance between itself and all the data points. We can measure the dashed lines in Fig. 6.1 and use that as a proxy for how much our line deviates from the actual data points. That distance, represented by $\hat{e}$ in the figure, is known as the residual. Simply put, then, the residual is the difference between what our line predicts and the data observed—it's our error: $\hat{e} = \hat{y} - y$. We represent the true value of the error as $\epsilon$ and use $\hat{e}$ for its estimated value.2 $\hat{y}$ is the value we predict (i.e., our fitted line), and y is the actual data point in the figure.
If we sum all the residuals in Fig. 6.1, that is, all the vertical dashed lines, we will get zero ($\sum \hat{e} = 0$), since some distances will be negative and some will be positive (some data points are below the line and some are above it). If the sum of the residuals always equals zero, it's not a very informative metric, because any line will give us the same sum. For that reason, we square all the residuals. That way, each squared distance $\hat{e}^2$ will be a positive number. So now we can sum all these squared residuals ($\sum \hat{e}^2$), and that will give us a number representing how good our line is. If we try drawing ten different lines, the line that has the lowest sum of squared residuals is the best fit. The thick inclined line in Fig. 6.1 is the best line we can fit to our data here.
The horizontal solid line simply marks the mean of the variable on the y-axis ($\bar{y} = 86$) and serves as our baseline, since it ignores the variable on the x-axis.
Our question, then, is how much better is the thick inclined line relative to our
baseline?
One way to summarize a good fit is to use the coefficient of determination, or $R^2$. This number tells us what proportion of the variance observed in the data is predictable from our variable on the x-axis of Fig. 6.1. $R^2$ ranges from 0 (no relationship between x and y) to 1 (perfect relationship/fit). In practice,
relationship between x and y) to 1 (perfect relationship/fit). In practice,
perfect fits don’t exist in linguistics, since patterns in language and speakers’
behaviors are always affected by numerous factors—many of which we don’t
even consider in our studies. So the question is what constitutes a good fit,
what is a good $R^2$. There's no objective answer to that question. For example, for the fit in Fig. 6.1, $R^2 = 0.15$. This means that the variable on
the x-axis explains 15% more variation in the data relative to our flat line at
the mean. You may think that 0.15 is a low number—and you’re right.
However, bear in mind that we are only considering one explanatory variable.
If you can explain 15% of the variation with a single variable, that’s definitely
not a bad fit considering how intricate linguistic patterns are. The conclusion
here is that x is useful in Fig. 6.1. In other words, using x is better than
simply using the mean of y (our baseline)—among other things, this is what
a statistical model will tell us.
You don't need to know how we calculate $R^2$, since R will do that for you. But the intuition is as follows (feel free to skip to the next paragraph if you already know how to calculate $R^2$): we compare the sum of squares for the flat line to the sum of squares for the fitted line. First, we assess our flat line at the mean by calculating how far it is from the actual data points: $\sum (y - \bar{y})^2$, which is the same as $\sum \hat{e}^2$. This is our total sum of squares, or $SS_t$—the same process discussed earlier for the residuals, only now we're using a flat line at the mean as our reference. Next, we do the same process, this time comparing our fitted line (in Fig. 6.1) to the actual data points (i.e., the sum of squares). This is our regression sum of squares (the residuals discussed earlier), or $SS_r$. We can then define $R^2$ as follows: $R^2 = \frac{SS_t - SS_r}{SS_t}$. Now think about it: if $SS_r$ is zero, that means that we have a perfect line, since there's no deviation between the line and the actual data points. As a result, $R^2 = \frac{SS_t - 0}{SS_t} = 1$. Conversely, if our line is just as good as drawing a flat line at the mean of y, then $SS_t$ and $SS_r$ will have the same value, so the numerator will equal zero: $R^2 = \frac{SS_t - SS_r}{SS_t} = \frac{0}{SS_t} = 0$ (no relationship between x and y). Finally, you may be familiar with r, the correlation coefficient, which ranges from −1 to +1. $R^2$ is simply that number squared. Consequently, a correlation of ±0.5 is equivalent to $R^2 = 0.25$.
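If you want to convince yourself of this relationship, here is a minimal sketch with made-up data (none of these objects come from our datasets):

R code
# Sketch: the squared correlation coefficient equals R-squared (made-up data)
set.seed(42)
x = rnorm(100)
y = 0.5 * x + rnorm(100)
cor(x, y)^2                   # r squared
summary(lm(y ~ x))$r.squared  # same value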
Now that we know the intuition behind the notion of a “best” line, let’s
discuss why that’s relevant. Once we have a line fit to our data, we can
(i) examine whether x and y are correlated, (ii) determine whether that corre-
lation is statistically significant, and (iii) predict the y value given a new x value
(e.g., machine learning). For example, suppose we are given x = 90—a datum
not present in our data in Fig. 6.1. Given the line we have, our y will be
approximately 80 when x = 90. In other words, fitting a line allows us to
predict what new data will look like.
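Here is a minimal sketch of that idea, again with made-up data (the numbers below are hypothetical, not the values plotted in Fig. 6.1):

R code
# Sketch: fitting a line and predicting y for a new x (made-up data)
set.seed(1)
x = runif(100, min = 60, max = 100)
y = 30 + 0.55 * x + rnorm(100, sd = 5)      # hypothetical linear pattern
fit = lm(y ~ x)
predict(fit, newdata = data.frame(x = 90))  # predicted y when x = 90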
Besides its predictive power, a linear model (represented here by our line)
also allows us to estimate the relationship(s) between variables. For example,
in our dataset from chapter 3, we could ask ourselves what the relationship is
between the number of hours of study (Hours variable in feedbackData.csv)
and participants’ scores (Score). We will examine this particular example in
§6.2.
When can we use linear models to analyze our data? If we have two contin-
uous variables such as those shown in Fig. 6.1, the relationship between them
must be linear—that’s why you want to plot your data to inspect what kind of
potential relationship your variables have. If your two variables have a non-
linear relationship, fitting a straight line to the data will naturally be of little
use (and the results will be unreliable to say the least). For example, if your
data points show a U-shaped trend, a linear model is clearly not the right
way to analyze the data.
Two important assumptions on which a linear model relies are (i) that resid-
uals should be normally distributed and (ii) that variance should be constant.
Let's unpack both of them. Our residuals represent the error in our model—typically represented as $\epsilon$. Some residuals will be positive, some will be negative, as discussed earlier. Some will be small (closer to our line), and some will be farther away. Indeed, if we created a histogram of the residuals, the distribution should look approximately normal if the first assumption holds.
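A quick sketch of how you might inspect both assumptions is shown below (made-up data again; with real data you would use your own model object):

R code
# Sketch: checking residual normality and constant variance (made-up data)
set.seed(7)
x = rnorm(200)
y = 2 * x + rnorm(200)
fit = lm(y ~ x)
hist(resid(fit))      # should look roughly bell-shaped if residuals are normal
plot(fit, which = 1)  # residuals vs. fitted values: spread should be constant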
In 6.1 and 6.2 (the linear model in its true and estimated forms, $y = \beta_0 + \beta x + \epsilon$ and $\hat{y} = \hat{\beta}_0 + \hat{\beta}x$), x represents the value(s) of our predictor variable (the x-axis in Fig. 6.1). The slope ($\hat{\beta}$) of our line tells us how much $\hat{y}$ changes as we change one unit of x. In other words, $\hat{\beta}$ is the effect size of x—our only predictor (explanatory variable) here.4 Crucially, when we run a model, our objective is to estimate both $\beta_0$ and $\beta$, that is, we want to fit a line to our data. In the
sections that follow, we will explore multiple examples of linear models. We
will see how to run them using R, how to interpret their results, and how to
use them to predict patterns in our data. We will also discuss how you can
present and report the results of a linear model.
We are now ready to run our model, since we already have a sense of what
we see in the data: we want to check whether the effect of Hours on Score (the
slope of our trend line) is statistically real. To run a linear model in R we use
the function lm().
To answer our question earlier, we run lm(Score ~ Hours, data = longFeedback)5—note that our model mirrors Fig. 6.2: in both cases, we
want to predict scores based on the number of hours of study (see §2.6.2.1).
In code block 25, which should go at the top of our linearModels.R script,
line 10 runs our model and assigns it to a variable, fit_lm1. This will be our
first model fit to the data: we are modeling each participant’s score as a function
of Hours in our hypothetical study. Line 11 displays the results using the
display() function from the arm package (Gelman and Su 2018). Line 12 is
the most common way to print the output of our model—summary().
Finally, because we don’t get confidence intervals by default in our output,
line 13 prints them for us.
Let’s inspect the most important information in the main output of our
model, pasted in code block 25 as a comment (lines 15–25). The first
column in our output, Estimate, contains the coefficients in our model. These
are our effect sizes ($\hat{\beta}_0$ and $\hat{\beta}$). Our intercept, $\hat{\beta}_0 = 65.09$, represents the predicted score of a participant when he/she studies zero hours per week, that is, Hours = 0. This should make sense if you go back to 6.1 or 6.2: $\hat{\beta}_0 + \hat{\beta} \cdot 0 = \hat{\beta}_0$.
The 95% confidence intervals of the estimates are listed in lines 27–30 in code
block 25 and could also be reported.
Line 19 tells us the effect of Hours: $\hat{\beta} = 0.92$. This means that for every additional hour of study (per week), a participant's score is predicted to increase 0.92 points. In other words, if a student studies 0 hours per week, his/her predicted score is approximately 65 ($\hat{\beta}_0$). If that same student studied 10 hours per week, then his/her predicted score would be 9.2 points higher: $\hat{\beta}_0 + \hat{\beta}x \rightarrow 65 + 10 \times 0.92 = 74.2$. You can see this in Fig. 6.2: it's where the dashed lines cross.
Chances are you will only look at two columns in the output of a model when
you use the summary() function: the estimate column, which we have just dis-
cussed, and the last column, where you can find the p-values for the estimates. In
R, p-values are given using scientific notation. For example, our intercept has a
p-value of 2e-16, that is, $2 \times 10^{-16}$ (a 2 preceded by sixteen zeros). This is clearly
below 0.05, our alpha value. Indeed, we can see that both the intercept and
the effect of Hours are significant. Let’s understand what that means first and
then examine the other columns in our output in lines 15–25.
What does it mean for an intercept to be significant? What’s the null hypoth-
esis here? The null hypothesis is that the intercept is zero—H0: β0 = 0. Remem-
ber: H0 is always based on the assumption that an estimate is zero, whether it’s
the intercept or the estimate of a predictor variable. In other words, it assumes
that the mean score for learners who study zero hours per week is, well, zero.
R code
1 # Remember to add this code block to linearModels.R
2
3 source("dataPrepLinearModels.R")
4 # install.packages("arm") # If you haven't installed this package yet
5 library(arm) # To generate a cleaner output: display()
6
7 head(longFeedback)
8
9 # Simple linear regression: continuous predictor:
10 fit_lm1 = lm(Score ~ Hours, data = longFeedback)
11 display(fit_lm1)
12 summary(fit_lm1)
13 confint(fit_lm1)
14
15 # Output using summary(fit_lm1)
16 # Coefficients:
17 # Estimate Std. Error t value Pr(>|t|)
18 # (Intercept) 65.0861 1.5876 40.997 < 2e-16 ***
19 # Hours 0.9227 0.1493 6.181 1.18e-09 ***
20 # ---
21 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
22 #
23 # Residual standard error: 9.296 on 598 degrees of freedom
24 # Multiple R-squared: 0.06005, Adjusted R-squared: 0.05848
25 # F-statistic: 38.2 on 1 and 598 DF, p-value: 1.18e-09
26
27 # Confidence intervals using confint(fit_lm1):
28 # 2.5 % 97.5 %
29 # (Intercept) 61.9681988 68.204066
30 # Hours 0.6295188 1.215872
CODE BLOCK 25 Simple Linear Regression and Output with Estimates: Score ~ Hours
But the estimate for the intercept is $\hat{\beta}_0 = 65$—clearly not zero. Thus, it shouldn't
be that surprising that the intercept is significant. Think about it: how likely is it
that the participants’ scores should be zero if they studied zero hours per week?
Not very likely. After all, even the worst scores would probably not be zero. We
therefore have to reject H0 here. To refresh our memories: the p-value here
(< 0.001) represents the probability of finding data at least as extreme as the
data we have assuming that the null hypothesis is true, so it’s the probability
of finding $\hat{\beta}_0 = 65$ when we assume it's actually zero ($\beta_0 = 0$). As we can see,
the probability is exceptionally low—practically zero.
The p-value for our predictor, Hours, is also significant. The null hypothesis
is again that β = 0, which in practical terms means that we assume the trend line
in Fig. 6.2 is flat. The fact that $\hat{\beta}$ is positive and significant here tells us that the
slope of our line is above zero, that is, not flat. Therefore, the number of hours
a learner studies per week statistically affects his/her scores, and we again have
to reject H0. How much does Hours affect Score? 0.92 points for every weekly
hour of study, our $\hat{\beta}$. These are all the estimates in our model, of course, so
now let’s take a look at the other columns in our output.
It turns out that all four columns in our model’s output are connected. For
example, if you divide the estimate by the standard error, the result will be the
t-value⁶ for the estimate. So our third column is simply our first column divided
by our second column. And |t|-values above 1.96 will be significant (assuming
α = 0.05). For Hours, for example, our t-value is 6.2. Therefore, we already
know that this predictor has a significant effect even without looking at the
p-value column. Furthermore, because we know that the result here is statisti-
cally significant, we also know that the confidence interval for our predictor
doesn't include zero, by definition. This is the reason that if you run display(fit_lm1) (line 11; display() comes from the arm package), the output only shows you estimates
and standard errors: that’s all you really need. But bear in mind that even
though you only need these two columns, the vast majority of people in
our field will still want to see p-values—and most journals will likely require
them.
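A quick sketch with the numbers from code block 25 makes the point:

R code
# Sketch: t value = estimate / standard error (numbers from code block 25)
0.9227 / 0.1493   # ~6.18, the t value reported for Hours
65.0861 / 1.5876  # ~41, the t value reported for the intercept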
If you remember our brief review of standard errors back in §1.3.4, you may
remember that we can manually calculate any 95% confidence interval our-
selves by using the estimate and its standard error, so technically we don’t
even need the confidence intervals in our output. Take the estimated effect
of Hours, $\hat{\beta} = 0.92$, and its standard error, 0.15. To manually calculate the lower and upper bounds of the 95% confidence interval, we subtract from and add to the estimate its standard error times 1.96: 95% CI = 0.92 ± (1.96 × 0.15). More generally, then, 95% CI = $\hat{\beta} \pm (1.96 \times SE)$. Alternatively, we can (and should) use the confint() function in R to have confidence intervals calculated for us—note that these will not be exactly identical to the intervals manually calculated, as confint() uses a profile method to calculate intervals, which can yield slightly different bounds.
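Here is that calculation as a sketch, using the numbers from code block 25 (the last line assumes fit_lm1 is still in memory):

R code
# Sketch: manually computing the 95% CI for Hours (numbers from code block 25)
0.9227 - 1.96 * 0.1493  # lower bound: ~0.63
0.9227 + 1.96 * 0.1493  # upper bound: ~1.22
confint(fit_lm1)        # nearly identical bounds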
REPORTING RESULTS
A linear model shows that weekly hours of study has a significant effect on
learners' scores ($\hat{\beta} = 0.92$, 95% CI = [0.63, 1.22], p < 0.001, $R^2 = 0.06$). These results indicate that one additional weekly hour of study had an average positive impact of 0.92 points on learners' scores.
While the first question can be answered with a simple “yes” if we find significant results, the second question focuses on
the size of the effect—a much more relevant aspect of our hypothetical study
here. Think about it this way: something can be statistically significant but
practically meaningless. Our focus in linear models should be first on the
effect, not on whether there’s a difference. That’s what we did earlier for
our continuous predictor (Hours), and that’s exactly how we will approach
our next model—only now with a categorical predictor, Feedback.
Fig. 6.3 shows how scores differ for both feedback groups. We can see that
the mean score for the recast group is slightly higher than the mean score for
the explicit correction group. If we were to draw a line between the two
means (inside the error bars), the line would be positively inclined (from
left to right), that is, we would have a positive slope between the two levels of
Feedback. In other words, even though here we have a categorical variable
on the x-axis (cf. Hours in Fig. 6.2), we can still think of our model the
same way we did for the discussion in §6.2.1.
To answer our main question earlier (does Feedback affect scores?) we can run
lm(Score ~ Feedback, data = longFeedback)—here again our model mirrors
Fig. 6.3: in both cases we want to predict scores based on the two feedback
groups in question (see §2.6.2.1). In code block 26, line 2 runs our model
and assigns it to a variable, fit_lm2. This will be our second model fit to the
data: we are modeling each participant’s score as a function of which feedback
his/her group received in our hypothetical study. Lines 3–5 should be familiar
from code block 25—this time, we will use display()9 to generate our output,
so you can decide which one you prefer (summary() or display()).
Let’s inspect the most important information in the output of our model,
pasted in code block 26 as a comment (lines 7–14). Before, when we ran
fit_lm1, the intercept meant “the predicted score when Hours = 0”. What
does it mean now? Essentially, the same thing: the predicted score when
TABLE 6.1 Treatment (Dummy) Coding for Feedback

Feedback              Recast
Explicit correction   0
Explicit correction   0
Recast                1
Recast                1
Explicit correction   0
Recast                1
…                     …
Feedback = 0. But what’s zero here? R will automatically order the levels of
Feedback alphabetically: Explicit correction will be our reference level, and
Recast will be coded as either 0 or 1 accordingly. This type of contrast
coding is known as treatment coding, or dummy coding—see Table 6.1 and
Table E.1 in Appendix E for an example of dummy coding for a factor with
three levels. Therefore, our intercept here represents the group of participants
who received explicit correction as feedback. You can see that because Feedback only has two levels here, and Recast is the level listed in line 11.¹⁰
We can mathematically represent our linear model as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_{recast} \cdot x$, where x = 0 (explicit correction) or x = 1 (recast). Note that because we only have two levels in Feedback, we only need one $\hat{\beta}$, since we can use 0 or 1—that is, only one new column in Table 6.1 is sufficient to capture two levels. If Feedback had three levels, we would have two $\hat{\beta}$s (n − 1)—that is, two new columns would be needed in Table 6.1, as the sketch below illustrates.
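If you want to see this coding directly, one option is to inspect the contrasts R assigns to a factor. A quick sketch (assuming longFeedback is loaded; the three-level factor is hypothetical):

R code
# Sketch: R's treatment (dummy) coding for a two-level factor
contrasts(factor(longFeedback$Feedback))
# A hypothetical three-level factor gets two indicator columns (n - 1):
contrasts(factor(c("Explicit correction", "Recast", "Peer")))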
Imagine we had three types of feedback in the data, namely, Explicit correction, Recast, and Peer (for student peer feedback). Our model would be represented as follows: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_{recast} \cdot x_1 + \hat{\beta}_{peer} \cdot x_2$—notice that we only have two $\hat{\beta}$s. For a participant in the Peer group, we'd have $\hat{y} = \hat{\beta}_0 + \hat{\beta}_{recast} \cdot 0 + \hat{\beta}_{peer} \cdot 1$. For a participant in the Recast group, we'd have $\hat{y} = \hat{\beta}_0 + \hat{\beta}_{recast} \cdot 1 + \hat{\beta}_{peer} \cdot 0$.
How can we tell whether our intercept is significant if there are no p-values
in our output shown in code block 26? Remember: all we need are estimates
and their standard errors, so the minimalism of the output here shouldn’t be a
problem. If we divide $\hat{\beta}_0$ by its standard error, we will clearly get a number that is higher than 1.96. Therefore, given such an extreme t-value (73.17 / 0.55 = 133.04), we will have a highly significant p-value. The null hypothesis is that the intercept is zero ($H_0$: $\beta_0 = 0$)—this is always the null hypothesis. In other words, it assumes that the mean score for the explicit correction group is zero. We have to reject $H_0$ here because our estimate is statistically not zero ($\hat{\beta}_0 = 73.2$).
The same can be said about the effect of Feedback: 2.89 / 0.77 = 3.75. We
therefore conclude that the effect of feedback is also significant. This is like
saying that the difference between the two groups (= 2.9) is statistically signif-
icant. In summary: because our standard errors here are less than 1, and because
our estimates are both greater than 2, both |t|-values will necessarily be greater
than 1.96. Naturally, you can always run summary() on your model fit if you
want to have p-values explicitly calculated for you.
Our $R^2$ for this model is 0.02—so a model with Feedback explains less variation in scores relative to a model with Hours as its predictor. Needless to say, 0.02 is a low $R^2$, but the question you should ask yourself is how much you
R code
1 # Simple linear regression: categorical predictor
2 fit_lm2 = lm(Score ~ Feedback, data = longFeedback)
3 display(fit_lm2)
4 summary(fit_lm2)
5 confint(fit_lm2)
6
7 # Output using display(fit_lm2):
8 # lm(formula = Score ~ Feedback, data = longFeedback)
9 # coef.est coef.se
10 # (Intercept) 73.17 0.55
11 # FeedbackRecast 2.89 0.77
12 # ---
13 # n = 600, k = 2
14 # residual sd = 9.48, R-Squared = 0.02
15
16 # Confidence intervals using confint(fit_lm2):
17 # 2.5 % 97.5 %
18 # (Intercept) 72.096209 74.245791
19 # FeedbackRecast 1.367016 4.406984
CODE BLOCK 26 Simple Linear Regression and Output with Estimates: Score ~ Feedback
care about this particular metric in your research. If your main objective is to show that feedback has an effect, then a low $R^2$ is not your top priority. If, on the other hand, you want to argue that feedback is a powerful predictor to explain learners' scores, then you shouldn't be too excited about such a low $R^2$. Ultimately, a simple model with a single predictor is unlikely to
explain much of the data when it comes to human behavior. There are so
many factors at play when we deal with language learning that a realistic
model will necessarily have to be more complex than the model we are exam-
ining here.
As you can see, the effect size here tells us the difference it makes to go from
explicit correction to recast. That the effect size is small is not inconsistent with
the literature: Russell and Spada (2006, p. 153), for example, conclude that
“[w]hile the results from this meta-analysis indicate that CF [corrective feed-
back] is useful overall, it was not possible to determine whether particular
CF variables make a difference”.
Finally, bear in mind that both models we have run thus far have essentially the same underlying structure: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta} x_i + \hat{e}_i$, as per 6.1 or 6.2. Next, we will run a model with two predictors, that is, two $\hat{\beta}$s.
REPORTING RESULTS
A linear model shows that feedback has a significant effect on learners’
scores ($\hat{\beta} = 2.9$, 95% CI = [1.4, 4.4], p < 0.001, $R^2 = 0.02$).¹¹ Learners who received recast as feedback had an average score 2.9 points higher than those who received explicit correction.
FIGURE 6.4 Participants’ Scores by Weekly Hours of Study and by Feedback Type
Our intercept ($\hat{\beta}_0$) is 64. What does that mean now that we have two predictors? Well, it means exactly the same thing as before: $\hat{\beta}_0$ here represents the predicted score when our other terms ($\hat{\beta}$s) are zero: Feedback = Explicit correction
and Hours = 0. This should make sense—we’re merely combining the interpre-
tation of the intercepts of the two models discussed earlier.
The estimate for Feedback = Recast, presented in line 11, represents the
change in score when we go from the explicit correction group to the recast
group. Likewise, the estimate for Hours represents the change in score for
every hour a participant studies every week. All estimates are statistically signif-
icant (p < 0.001), which should not be too surprising given what we have dis-
cussed so far.
We can represent our model mathematically as $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \hat{e}_i$. Here, we have one intercept ($\hat{\beta}_0$) and two predictors ($\hat{\beta}_1$ and $\hat{\beta}_2$). Recall that $\hat{y}_i$ represents the predicted response (i.e., the score) of the ith participant. Let's say we have a participant who is in the recast group and who studies English for 7 hours every week. To estimate the score of said participant, we can replace the variables in our model as follows: $\hat{y}_i = 64.2 + 2.55 \times 1 + 0.88 \times 7 = 72.91$. Notice that our $\beta_1$ here represents Feedback, and it's set to 1 if Feedback = Recast and to 0 if Feedback = Explicit correction. Hopefully now it's clear why the intercept is our predicted response when all other variables are set to zero: if we choose a hypothetical participant who studies zero hours per week and who is in the explicit correction group, his/her predicted score will be $\hat{y}_i = 64.2 + 2.55 \times 0 + 0.88 \times 0 = 64.2$, and that is exactly $\hat{\beta}_0$.
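We can check these hand calculations with predict(). A minimal sketch, assuming fit_lm3 from code block 27 is already in memory:

R code
# Sketch: verifying the hand calculations above (assumes fit_lm3 exists)
predict(fit_lm3, newdata = tibble(Feedback = "Recast", Hours = 7))
# ~72.9, i.e., 64.2 + 2.55 * 1 + 0.88 * 7
predict(fit_lm3, newdata = tibble(Feedback = "Explicit correction", Hours = 0))
# ~64.2, i.e., the intercept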
REPORTING RESULTS
A linear model shows that both feedback (recast) and weekly hours of study
have a significant positive effect on learners' scores. Learners in the recast group had higher mean scores than learners in the explicit correction group ($\hat{\beta} = 2.55$, 95% CI = [1.06, 4.03], p < 0.001). In addition, learners with a higher number of weekly study hours also had higher mean scores ($\hat{\beta} = 0.88$, 95% CI = [0.59, 1.18], p < 0.0001).
Note that our (adjusted) $R^2$ is now above 0.07.¹³ This is still a low number:
by using Feedback and Hours we can only explain a little over 7% of the var-
iation in scores in the data.
Here’s a question: looking at the estimates in code block 27, can we say that
feedback is more important than hours of study, given that their coefficients are
$\hat{\beta} = 2.55$ and $\hat{\beta} = 0.88$, respectively? The answer is no, we can't say that. We
have to resist the temptation to directly compare these two numbers: after all,
our variables are based on completely different scales. While Feedback can be
either 0 or 1, Hours can go from 0 to 10 (and higher). Therefore, these two
estimates are not comparable right now. We will discuss how to make them
comparable later in this chapter (§6.3.2). For now, we can’t say which one
has a stronger effect on scores.¹⁴
R code
1 # Multiple linear regression: categorical and continuous predictor
2 fit_lm3 = lm(Score ~ Feedback + Hours, data = longFeedback)
3 display(fit_lm3)
4 summary(fit_lm3)
5 confint(fit_lm3)
6
7 # Output excerpt using summary(fit_lm3)
8 # Coefficients:
9 # Estimate Std. Error t value Pr(>|t|)
10 # (Intercept) 64.2067 1.5955 40.243 < 2e-16 ***
11 # FeedbackRecast 2.5449 0.7547 3.372 0.000794 ***
12 # Hours 0.8846 0.1484 5.960 4.32e-09 ***
13 # ---
14 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
15 #
16 # Residual standard error: 9.217 on 597 degrees of freedom
17 # Multiple R-squared: 0.07762, Adjusted R-squared: 0.07453
18 # F-statistic: 25.12 on 2 and 597 DF, p-value: 3.356e-11
19
20 # Confidence intervals using confint(fit_lm3):
21 # 2.5 % 97.5 %
22 # (Intercept) 61.0733247 67.340123
23 # FeedbackRecast 1.0627199 4.027164
24 # Hours 0.5931205 1.176145
CODE BLOCK 27 Multiple Linear Regression and Output with Estimates: Score ~ Feedback + Hours
If you look back at Fig. 6.4, you will notice that the two trend lines in the
figure seem to be slightly different. More specifically, the line for the explicit
correction group is more inclined than the line for the recast group. If that’s
true, what does it mean? It means that the effect of Hours is stronger for the par-
ticipants in the explicit correction group. If statistically real, this effect would tell
us that the two variables in question interact. In other words, the effect of Hours
depends on whether a learner is in the explicit correction group or in the recast
group. This is what we will examine next.
FIGURE 6.5 A Figure Showing How Feedback and Hours may Interact
R code
1 # Interaction linear regression: categorical and continuous predictor
2 # Note the * in the model specification below:
3 fit_lm4 = lm(Score ~ Feedback * Hours, data = longFeedback)
4 display(fit_lm4)
5 summary(fit_lm4)
6 round(confint(fit_lm4), digits = 2) # if you wish to round CIs (see lines 22-27)
7
8 # Output excerpt using summary(fit_lm4)
9 # Coefficients:
10 # Estimate Std. Error t value Pr(>|t|)
11 # (Intercept) 60.2775 2.4074 25.039 < 2e-16 ***
12 # FeedbackRecast 9.2825 3.1888 2.911 0.00374 **
13 # Hours 1.2724 0.2317 5.491 5.93e-08 ***
14 # FeedbackRecast:Hours -0.6547 0.3011 -2.174 0.03008 *
15 # ---
16 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
17 #
18 # Residual standard error: 9.188 on 596 degrees of freedom
19 # Multiple R-squared: 0.08488, Adjusted R-squared: 0.08027
20 # F-statistic: 18.43 on 3 and 596 DF, p-value: 1.918e-11
21 #
22 # Confidence intervals using confint(fit_lm4):
23 # 2.5 % 97.5 %
24 # (Intercept) 55.55 65.01
25 # FeedbackRecast 3.02 15.55
26 # Hours 0.82 1.73
27 # FeedbackRecast:Hours -1.25 -0.06
CODE BLOCK 28 Interaction Linear Regression and Output with Estimates: Score ~ Feedback * Hours
If you inspect Fig. 6.5 again, you will notice that the line representing the recast group is less inclined than the line representing the explicit correction group. I mentioned earlier that this difference in the
figure suggested that the effect of hours was weaker for the recast group—
and that’s why we now have a negative estimate for the interaction in the
model. Let’s see that in action by considering two examples: participant A
will be in the recast group and reports studying 5 hours per week. Participant
B is in the explicit correction group and also reports studying 5 hours per week.
The predicted score for both participants is calculated in 6.3.
$$\text{Participant A:}\quad \hat{y}_A = \hat{\beta}_0 + \hat{\beta}_1 x_{1A} + \hat{\beta}_2 x_{2A} + \hat{\beta}_3 x_{1A} x_{2A}$$
$$\hat{y}_A = 60.28 + 9.28 \times 1 + 1.27 \times 5 + (-0.65 \times 1 \times 5) = 72.65 \quad (6.3)$$

$$\text{Participant B:}\quad \hat{y}_B = \hat{\beta}_0 + \hat{\beta}_1 x_{1B} + \hat{\beta}_2 x_{2B} + \hat{\beta}_3 x_{1B} x_{2B}$$
$$\hat{y}_B = 60.28 + 9.28 \times 0 + 1.27 \times 5 + (-0.65 \times 0 \times 5) = 66.64$$
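To see the crossover concretely, here is a sketch (assuming fit_lm4 is in memory) that predicts scores for both groups at 5 and at 20 weekly hours of study:

R code
# Sketch: group predictions at 5 vs. 20 weekly hours (assumes fit_lm4 exists)
predict(fit_lm4, newdata = tibble(
  Feedback = rep(c("Recast", "Explicit correction"), times = 2),
  Hours = c(5, 5, 20, 20)))
# At 5 hours the recast group is ahead (~72.65 vs. ~66.64); at 20 hours the
# explicit correction group overtakes it (~85.73 vs. ~81.91)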
Given the negative interaction, participant B is predicted to surpass participant A if they both study 20 hours per week. By inspecting Fig. 6.5, it's
much easier to see this interaction in action. A typical table with the model esti-
mates is provided in Table 6.2—the table merely copies the output in code
block 28.16 You’ll notice that we have no confidence intervals in the table.
However, recall that we can calculate the 95% confidence intervals using the
standard errors. Whether you need to explicitly report confidence intervals
will depend in part on your readership (and on the author guidelines of the
journal of your choice). Finally, notice that the table has no vertical lines divid-
ing the columns and that numeric columns are right-aligned. As a rule of
thumb, your tables should follow a similar pattern.
REPORTING RESULTS
A linear model shows a significant effect of Feedback ($\hat{\beta} = 9.28$, p < 0.01) and Hours ($\hat{\beta} = 1.27$, p < 0.0001) as well as a significant interaction between the two variables ($\hat{\beta} = -0.65$, p < 0.05). The negative estimate
for the interaction tells us that the effect of Hours is weaker for participants
in the recast group. As a result, our model predicts that although partici-
pants in the recast group have higher scores on average, they can be sur-
passed by participants in the explicit correction group given enough
weekly hours of study, as shown in the trends in Fig. 6.5.
The earlier discussion is way more detailed than what you need in reality.
We normally don’t explain what an interaction means: instead, as mentioned
earlier, we simply say that said interaction is significant. This is in part
because we assume that the reader will know what that means—you should
certainly consider whether that assumption is appropriate.
Thus far, we have run four models. A natural question to ask is which one
we should report. Clearly, we don’t need all four: after all, they all examine the
same question (which variables affect Score). So let’s see how we can compare
all four models.
R code
1 # Compare models:
2 anova(fit_lm3, fit_lm4)
3
4 # Output:
5 # Analysis of Variance Table
6 #
7 # Model 1: Score ~ Feedback + Hours
8 # Model 2: Score ~ Feedback * Hours
9 # Res.Df RSS Df Sum of Sq F Pr(>F)
10 # 1 597 50712
11 # 2 596 50313 1 399.1 4.7276 0.03008 *
12 # ---
13 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
CODE BLOCK 29 Comparing Two Models with anova()
The residual sum of squares (RSS) in this output is a number telling us how much our model deviates from the actual data being
modeled. Thus, the lower the RSS, the better. Model 2 has a lower RSS
(50313 vs. 50712), which means it has a better fit relative to model 1. In
sum, we can say that there is a significant effect of the interaction between Feedback and Hours (F(1, 596) = 4.73, p = 0.03).¹⁷
The earlier comparison shows that fit_lm4 offers the best fit of the data given
the models we have discussed so far. We have already seen how to report the
results from the model in text and in Table 6.2. Now let’s see how to present
model estimates using a plot. In our plot, we will have error bars that represent
estimates and their respective 95% confidence intervals. The result is shown in
Fig. 6.6—you can see how the figure is created in code blocks 30 and 31.
The y-axis in Fig. 6.6 lists the predictors in our model (fit_lm4)—following
the same order that we have in Table 6.2. In the figure, we can easily see the
estimates (they are in the center of the error bars) and the confidence intervals
(the error bars themselves). The figure also prints the actual estimates under
each error bar, but that is likely not necessary, given the x-axis. Being able
to see confidence intervals in a figure can be more intuitive, since we can actu-
ally see how wide each interval is. Recall that 95% confidence intervals that
cross (or include) zero mean that the estimate in question is not significant (p
> 0.05). Here, none of the error bars cross zero on the x-axis—although the
interaction comes close, so it’s not surprising that its p-value in our model is
closer to 0.05 in Table 6.2. We could remove the intercept from our figure
to reduce the range on the x-axis and better visualize the predictors of inter-
est—it’s not uncommon for researchers to remove intercepts from statistical
tables, especially if their meaning is not practically relevant.
Code blocks 30 and 31 show how Fig. 6.6 is created.18 First, we have to
prepare the data (code block 30) by gathering estimates and their confidence
intervals. Only then can we actually create the figure (code block 31). All
the lines of code in both code blocks show you how to create the figure man-
ually, just in case you were wondering how this could be done—spend some
time inspecting the lines of code that create lm4_effects, as some of it will
R code
1 # Remember to add this code block to plotsLinearModels.R
2 # For the code to work, you must have already run fit_lm4
3 source("dataPrepLinearModels.R")
4
5 # Plot estimates and confidence intervals for fit_lm4
6 # Prepare data:
7 lm4_effects = tibble(Predictor = c("(Intercept)",
8 "Feedback (recast)",
9 "Hours",
10 "Feedback (recast):Hours"),
11 Estimate = NA,
12 l_CI = NA,
13 u_CI = NA)
14
15 # Add coefficients and confidence intervals (CIs):
16 lm4_effects = lm4_effects %>%
17 mutate(Estimate = c(coef(fit_lm4)[[1]], # Intercept
18 coef(fit_lm4)[[2]], # Feedback
19 coef(fit_lm4)[[3]], # Hours
20 coef(fit_lm4)[[4]]), # Interaction
21 l_CI = c(confint(fit_lm4)[1:4]), # lower CI
22 u_CI = c(confint(fit_lm4)[5:8])) # upper CI
23
24 # Visualize tibble:
25 lm4_effects
CODE BLOCK 30 Preparing the Data for Plotting Model Estimates and Confidence
Intervals
be familiar. However, if you’re in a hurry, you could simply install the sjPlot
package, load it, and then run lines 22–23 in code block 31. Yes: with only two
lines you produce a figure that takes around 30 lines of code to produce man-
ually (!). Naturally, the easy way will not produce exactly the same figure, but it
will be essentially what you see in Fig. 6.6.
We have seen earlier that we can present the results of a statistical model
using a table or a figure. We could also have the results in the body of the
text and omit any tables or figures, but that's probably the worst option.
Tables and figures are much better at presenting results in an organized way.
The question you should ask yourself now is which option best suits your
taste, needs, and readership: a table such as Table 6.2 or a figure such as Fig.
6.6. The vast majority of papers show model estimates in tables, not figures.
However, in many cases figures will certainly provide a more intuitive way
to discuss the results of your model. Indeed, much like box plots are underused
in data visualization in our field, the same can be said for figures to represent
model estimates, unfortunately.
R code
1 # Plot estimates and confidence intervals for fit_lm4 (add to plotsLinearModels.R)
2 # Plot estimates and 95% confidence intervals:
3 ggplot(data = lm4_effects, aes(x = Predictor, y = Estimate)) +
4 geom_errorbar(aes(ymin = l_CI, ymax = u_CI), width = 0.2) +
5 coord_flip() +
6 theme_classic() +
7 scale_x_discrete(limits = c("Feedback (recast):Hours",
8 "Hours",
9 "Feedback (recast)",
10 "(Intercept)")) +
11 geom_text(aes(label = round(Estimate, digits = 2)),
12 position = position_nudge(x = -0.3, y = 0)) +
13 geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.1) +
14 labs(x = NULL)
15
16 # Save plot in figures folder:
17 # ggsave(file = "figures/model-estimates.jpg", width = 6, height = 2.5, dpi = 1000)
18
19 # Alternatively, install sjPlot package:
20 # library(sjPlot)
21 #
22 # plot_model(fit_lm4, show.intercept = TRUE) +
23 # theme_classic()
CODE BLOCK 31 Plotting Model Estimates and 95% Confidence Intervals
R code
1 # Rescale variables:
2 longFeedback = longFeedback %>%
3 mutate(Feedback.std = arm::rescale(Feedback),
4 Hours.std = arm::rescale(Hours))
5
6 # Check means and SDs for rescaled columns:
7 longFeedback %>%
8 summarize(meanFeedback = mean(Feedback.std),
9 sdFeedback = sd(Feedback.std),
10 meanHours = mean(Hours.std),
11 sdHours = sd(Hours.std))
12
13 # Rerun model:
14 fit_lm4_std = lm(Score ~ Feedback.std +
15 Hours.std +
16 Feedback.std * Hours.std,
17 data = longFeedback)
18
19 # Output:
20 display(fit_lm4_std)
21
22 # coef.est coef.se
23 # (Intercept) 74.68 0.38
24 # Feedback.std 2.52 0.75
25 # Hours.std 4.81 0.77
26 # Feedback.std:Hours.std -3.33 1.53
27
28 # Comparing predictions:
29 predict(fit_lm4_std, newdata = tibble(Hours.std = 0,
30 Feedback.std = 0.5))
31 predict(fit_lm4, newdata = tibble(Hours = mean(longFeedback$Hours),
32 Feedback = "Recast"))
CODE BLOCK 32 Rescaling Variables and Comparing Predictions
The mean of Hours is easily calculated ($\bar{x} = 10.3$). But it's harder to picture what zero means
for Feedback.std—let’s discuss this estimate next and then come back to the
intercept.
FEEDBACK.STD. Our effect here is $\hat{\beta} = 2.52$. What does that mean? In a nutshell, a change of 1 unit in Feedback.std increases the predicted score by 2.52 points. Another way to say this is: the difference between explicit correction and recast, assuming that Hours is kept at its mean value, is 2.52 points. In a rescaled variable, “a unit” is represented by 2 standard deviations. Feedback.std goes from −0.5 to +0.5, and its standard deviation is 0.5. Therefore, 1 unit = 2 × 0.5 = 1.0. Simply put, a unit for Feedback.std is the “distance” between Explicit correction and Recast. The effect size in question ($\hat{\beta} = 2.52$) is what happens to Score if we go from −0.5 to +0.5—see Fig. 6.7.
FIGURE 6.7 Rescaling a Binary Variable: Feedback.std (the intercept $\hat{\beta}_0 = 74.68$ sits midway between Explicit correction (−0.5) and Recast (+0.5), and $\hat{\beta} = 2.52$ is the distance between the two levels)
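As a side note, for a numeric predictor arm::rescale() amounts to centering and dividing by 2 standard deviations. A sketch, assuming longFeedback is loaded and arm is installed:

R code
# Sketch: manual equivalent of arm::rescale() for a numeric variable
x = longFeedback$Hours
manual = (x - mean(x)) / (2 * sd(x))
head(cbind(manual, arm::rescale(x)))  # the two columns should match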
For example, for a participant in the explicit correction group (Feedback.std = −0.5) who studies a mean number of weekly hours (Hours.std = 0), the predicted score would be $\hat{y}_A = 74.68 + 2.52 \times (-0.5) = 73.4$.
Finally, lines 29–32 use the predict() function to illustrate the equivalence of
variables and their rescaled counterparts. Asking for the predicted score using
fit_lm4_std, we have Hours.std = 0 and Feedback.std = 0.5. That is equivalent
to asking for the predicted score using fit_lm4 assuming Hours = 10.3 (its mean)
and Feedback = Recast. Hopefully it’s clear now that these two are saying exactly
the same thing, only using different scales.
Now that we have rescaled our variables, which predictor is stronger:
Feedback or Hours? Clearly, Hours trumps Feedback, given that its effect
size is nearly twice as large as that of Feedback. A 1-unit change in Feedback.
std (i.e., going from explicit correction to recast) increases the score of a par-
ticipant by 2.52 points, whereas a 1-unit change in Hours.std (i.e., about 5
hours) increases the score of a participant by 4.81 points. However, bear in
mind that this interpretation is ignoring the interaction between the two var-
iables. In other words, Hours having a larger effect size than Feedback does not
mean that any number of hours will suffice for a learner in the explicit correc-
tion group to surpass a learner in the recast group (recall the discussion we had
earlier). Ultimately, to predict actual scores we need to consider the whole
model, not just one of its estimates. This is yet another reason that inspecting
Fig. 6.5 can help us understand interactions.
In summary, when we rescale our variables, the estimates in our model are
directly comparable, since they all rely on the same unit (2 standard deviations).
Once zero represents the means of our predictors, the intercept also takes on a new
meaning. Note, too, that you can generate Fig. 6.6 using fit_lm4_std instead of
fit_lm4 in code blocks 30 and 31, so your figure for the model estimates would
show rescaled estimates that are comparable. A potential disadvantage of rescal-
ing variables is that our interpretation of the effect sizes no longer relies on the
original units—for Hours.std we are no longer talking about actual hours, but
rather about 2 standard deviations of the variable, so we are required to know
what the standard deviation is for said variable if we want to translate 1 unit into
actual hours for the reader. You could, of course, provide both sets of estimates, using the rescaled version to draw direct comparisons across predictors and the non-rescaled version to interpret individual effects in their original units, as we did with all previous models discussed in this chapter. In general, SLA
studies report unscaled estimates—the vast majority of studies don’t rescale vari-
ables, in part because most studies don’t employ full-fledged statistical models
currently.
6.4 Summary
In this chapter we discussed the basic characteristics and assumptions of linear
models, and we saw different examples of such models using our hypothetical
study on the effects of feedback. We also discussed how to report and present
our results, how to compare different models, and how to rescale our predictor
variables so that we can directly compare the magnitude of their effects. It’s
time to review the most important points about linear models.
6.5 Exercises
feedback groups). How do they differ? Hint: Predicted scores are not
affected by your choice of reference level for Feedback.
2. How do the predicted scores help us see the interaction between Hours
and Feedback?
Notes
1. You could, of course, source continuousDataPlots.R instead, which is the script
that contains code block 11. That script, however, is in a different folder/
working directory, which means you would need the complete path to it.
2. Like the true mean of a population (μ), we don't know the true value of $\epsilon$, but we can estimate it ($\hat{e}$).
3. Don’t try to run this code just yet. First, we need to run a model and assign it to a
variable. Then, you should replace MODEL in the code with said variable.
4. Naturally, we can have multiple variables in a model, in which case we will have
multiple $\hat{\beta}$s. We will discuss such examples later.
5. Technically, what we're running is lm(Score ~ 1 + Hours, data = longFeedback),
where 1 represents the intercept. R simply assumes the intercept is there, so we can
leave 1 out.
6. t-values give us the magnitude of a difference in units of standard error, which means
t- and p-values are connected to each other.
7. I will often propose that we manually calculate confidence intervals in this book as an
exercise to review SEs and CIs. However, you should simply use confint() instead.
8. In general, we should report the adjusted R2, which takes into account the number
of predictors that a model has. This will be especially important later when we add
multiple predictors to our models.
9. Remember to install and load the arm package. You should already have this
package if you added code block 25 to linearModels.R.
10. As already mentioned, R will order the levels alphabetically and choose the first one
as the intercept. We can change that by running longFeedback = longFeedback %>% mutate(Feedback = relevel(as.factor(Feedback), ref = "Recast")). Then rerun
the model and check the output again—note that here we first make Feedback a
factor, and then we change its reference level.
11. You could simply report $\hat{\beta} = 2.9$, SE = 0.77, but this may be too minimalistic for
journals in our field.
12. Note that it makes sense to use facets for our categorical variable and leave the x-axis
for our continuous variable.
13. By adding more variables to a model, our R2 is going to increase. That’s why we
should consider the adjusted R2, which has been adjusted for the number of predic-
tors in the model.
14. How do you think the different $R^2$ values from our two previous models could help
us guess the relative importance of the predictors in question?
15. We could use shape or color, for example, to make the levels of Feedback look dif-
ferent, but it’s not easy to visually see that difference given the number of data points
we have. Clearly, the focus of Fig. 6.5 is to show the different trend lines.
16. If you use LaTeX to produce your academic documents, you can use packages such
as xtable and memisc in R to generate tables ready for publication.
17. Notice that this is what we typically report for ANOVAs. That makes sense, given
that we are comparing variances across models. And much like the ANOVAs you
may be familiar with, we could compare multiple models at once, not just two.
18. Unlike the previous code blocks, which you should place inside linearModels.R,
you should place these two code blocks inside plotsLinearModels.R, so as to
keep plots and models in separate scripts.
19. Some people scale variables by dividing by 1 standard deviation. To understand why
we will divide by 2 standard deviations instead, see Gelman (2008b).
20. See Appendix A on the use of “::” in functions. Here, the function rescale() is
present in the arm package and also in the scales package (which we use for
adding percentages to our axes, among other things).
21. Note that the significance of the intercept may change, given that its meaning is dif-
ferent now.
7
LOGISTIC REGRESSION
All the models we have explored thus far involve a continuous response variable
(Score in chapter 6). But in second language research, and linguistics more gen-
erally, we are often interested in binary response variables. These variables are
essentially 0/1, but we will usually label them incorrect/correct, no/yes, non-
target-like/target-like. If you recall our data on relative clauses in chapter 4
(rClauseData.csv), the Response variable was also binary: Low or High (and
we had some NAs as well). Clearly, this is a very different situation when com-
pared to continuous variables: while scores can have multiple values along a
continuum (e.g., 78.4), binary variables cannot—by definition.
In this chapter we will learn how to analyze binary response variables by
running logistic regressions—see Jaeger (2008) on why you should not use
ANOVAs for categorical response variables. Logistic regressions are underlyingly
very similar to linear regressions, so a lot of our discussion will be familiar from
chapter 6. However, there are important differences, given the different nature of
our response variable. As with chapter 6, we will first go over some basics (§7.1)
and then we’ll see how to run and interpret logistic models in R (§7.2). You may
want to go back to §7.1 after reading §7.2 to consolidate your understanding of
logistic models—you shouldn’t be surprised if you need to reread about the same
statistical concept several times!
Before we get started, let’s create some scripts. You should place these scripts
in the Frequentist folder we created in chapter 6, so all the files associated with
our models will be in the same directory. You should have rClauseData.csv in
there too, since we’ll be running our models on the dataset used in chapters 4
and 5. We will create three scripts: dataPrepCatModels.R, plotsCatModels.R,
and logModels.R—we already have our R Project, Frequentist.RProj, in the
same directory, so we can use a single R Project to manage all the files con-
nected to the Frequentist statistical models in this book. These scripts follow
the same rationale we applied to chapter 6: one script will prepare the data,
another will plot the data, and the third script will run the models. Could
you do all three tasks in a single script? Absolutely. Should you? Probably
not. As our analyses become more complex, our scripts will get substantially
longer. Separating our different tasks into shorter scripts is a healthy habit.
DATA FILE
In this chapter (§7.2) we will use rClauseData.csv again. This file simulates a
hypothetical study on relative clauses in second language English.
7.1 Introduction
Observe Fig. 7.1, which plots participants’ reaction time (x-axis) against their
native language (L1, on the y-axis)—these data come from rClauseData.csv.
Recall that in the hypothetical study in question, Spanish speakers are second
language learners of English. Here our response variable is binary (L1), and
our predictor variable is continuous (reaction times)—so this is the opposite
of what we had in chapter 6. The intuition is simple: second language learners
are expected to take longer to process sentences in the L2, so we should see
slower reaction times for learners relative to native speakers—see, e.g.,
Herschensohn (2013) for a review of age-related effects in second language
acquisition. That’s what we observe in the box plot: Spanish speakers (L2ers)
have longer reaction times than English speakers (native group).
TABLE 7.1 Probabilities, Odds, and Log-odds

P      Odds   ln(odds)
0.10   0.11   −2.20
0.20   0.25   −1.39
0.30   0.43   −0.85
0.40   0.67   −0.41
0.50   1.00    0.00
0.60   1.50    0.41
0.70   2.33    0.85
0.80   4.00    1.39
0.90   9.00    2.20
FIGURE 7.3 Odds and Log-odds (odds run from 0 to ∞ with 1 as the neutral point, log-odds from −∞ to +∞ with 0 as the neutral point; values below the neutral point are odds against, values above it are odds in favor)
The logit function and its inverse let us go from log-odds to probabilities—and vice versa (as we can see in Table 7.1). We
are essentially transforming our binary variable into a continuous variable (log-
odds) using the logit function, so that later we can go from log-odds to prob-
abilities (our curve in Fig. 7.2). Ultimately, it’s much easier to work with a
straight line than with an S-shaped curve (also known as sigmoid curve). In
summary, while a linear regression assumes that our predicted response is a
linear function of the coefficients, a logistic regression assumes that the pre-
dicted log-odds of the response is a linear function of the coefficients.
$$\ln(\text{odds}) = \ln\left(\frac{P}{1 - P}\right) = \text{logit}(P) \quad (7.1)$$

$$P = \text{logit}^{-1}(X_i\beta) = \frac{e^{\ln(\text{odds})}}{1 + e^{\ln(\text{odds})}} \quad (7.3)$$
If you look closely at 7.1 you will notice that logistic regressions will give us
estimates in ln(odds), that is, logit(P). To get odds from log-odds, we exponenti-
ate the log-odds—see 7.2. To get probabilities from log-odds, we use the
inverse logit function ($\text{logit}^{-1}$), which is shown in 7.3. That's why understanding log-odds is crucial if you want to master logistic regressions: when we run our models, $\hat{\beta}$ values will all be in log-odds—we'll practice these transforma-
tions several times in §7.2. Don’t worry: while it takes some time to get
used to ln(odds), you can start by using Table 7.1 as a reference. If you have
positive log-odds, that means the probability is above 50%. If you have negative
log-odds, the probability is below 50%. We will see this in action in §7.2.
Hopefully it’s clear at this point why we want to use log-odds instead of
probabilities. Although probabilities are more intuitive and easier to under-
stand, log-odds are linear. If you look once more at the right plot in Fig.
7.2, you will notice that how much the y-axis changes in an S-shaped curve
depends on where you are on the x-axis. For example, the change in P(L2er
= 1) from 1 to 2 seconds on the x-axis is nearly zero, but the change is
huge if you go from 5 to 6 seconds (the curve is steepest right around 5–8
seconds). So while probabilities are indeed more intuitive, they are also depen-
dent on where we are on a curve, and that’s not ideal since our answer will
change depending on the value on the x-axis. Log-odds, on the other hand,
offer us an estimate that is constant across all values on the x-axis.
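To make these transformations concrete, here is a quick sketch (assuming the arm package is installed; the value 1.39 is taken from Table 7.1):

R code
# Sketch: converting a log-odds value to odds and to a probability
log_odds = 1.39
exp(log_odds)                        # odds: ~4 (cf. Table 7.1)
exp(log_odds) / (1 + exp(log_odds))  # probability: ~0.80
arm::invlogit(log_odds)              # same probability via the inverse logit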
R code
1 # Remember to add this code block to dataPrepCatModels.R
2 library(tidyverse)
3
4 # Import our data:
5 rc = read_csv("rClauseData.csv")
6
7 # Remove fillers:
8 rc = rc %>%
9 filter(Type != "Filler") %>%
10 droplevels()
11
12 # Create 0/1 column for response and for L1:
13 rc = rc %>%
14 mutate(Low = ifelse(Response == "Low", 1, 0),
15 L2er = ifelse(L1 == "Spanish", 1, 0))
16
17 # Make certainty ordered factor (for next chapter):
18 rc = rc %>%
19 mutate(Certainty = factor(Certainty, ordered = TRUE))
20
21 # Mutate certainty into three categories: 1-2, 3-4, 5-6:
22 rc = rc %>%
23 mutate(Certainty3 = ifelse(Certainty < 3, "Not certain",
24 ifelse(Certainty > 4, "Certain",
25 "Neutral")))
26
27 # Adjust order of levels:
28 rc = rc %>%
29 mutate(Certainty3 = factor(Certainty3,
30 levels = c("Not certain", "Neutral", "Certain"),
31 ordered = TRUE))
32
33 # Make character columns factors:
34 rc = rc %>% mutate_if(is.character, as.factor)
R code
1 source("dataPrepCatModels.R")
2
3 library(arm) # to use invlogit function in line 25
4
5 fit_glm1 = glm(L2er ~ RT, data = rc, family = "binomial")
6 summary(fit_glm1)
7
8 # Output excerpt using summary()
9 # Coefficients:
10 # Estimate Std. Error z value Pr(>|z|)
11 # (Intercept) -8.270 0.618 -13.38 <2e-16 ***
12 # RT 1.381 0.098 14.09 <2e-16 ***
13 # ---
14 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
15 #
16 # Null deviance: 1145.73 on 899 degrees of freedom
17 # Residual deviance: 398.01 on 898 degrees of freedom
18 # AIC: 402.01
19
20 # From log-odds to odds:
21 exp(coef(fit_glm1)[["(Intercept)"]])
22 exp(coef(fit_glm1)[["RT"]])
23
24 # From log-odds to probabilities (interpret with caution!):
25 invlogit(coef(fit_glm1)[["(Intercept)"]])
26
27 # Probability change: from 1s to 10s:
28 predict(fit_glm1,
29 newdata = tibble(RT = seq(from = 1, to = 10)), type = "response")
CODE BLOCK 34 Simple Logistic Regression and Output with Estimates: L2er ~ RT
Let’s now interpret our estimates in lines 11–12. First, we can see that both
our intercept and reaction time (RT) (i.e., both $\hat{\beta}_0$ and $\hat{\beta}$) are significant (p <
0.001). What does that mean? The intercept tells us the predicted log-odds of
being a learner assuming a reaction time of 0 seconds. This should sound famil-
iar from chapter 6—except for the log-odds bit. Intercepts are always the pre-
dicted response when all other variables are 0. The estimate for RT, in turn,
tells us how much the log-odds of being a learner change as we increase 1
unit of RT (i.e., 1 second).
The very first thing you should note (besides the significance) for each esti-
mate is the sign: the intercept is negative here, while RT is positive. If the inter-
cept is negative, it basically means that the probability of being a learner if your
reaction time is 0 is lower than 50%. If RT is positive, it means that as you
increase reaction time, the probability of being a learner also increases. The
nice thing about focusing on the sign is that you don’t need to think about
log-odds to understand the overall pattern here: reaction time is positively
correlated with the probability of being a learner, which makes sense given Fig.
7.1, where learners had slower reaction times than native speakers.
Next, let’s focus on the actual effect sizes. If you look back at Table 7.1, you
will notice that we go from −2.20 to 2.20 (log-odds). That range covers prob-
abilities between 0.10 and 0.90. Now look at the estimate for our intercept: it’s
less than −8 (!). That's a very small number: if −2.20 log-odds is equivalent to a probability of 0.10, we already know the probability of being a learner if your reaction time is 0 will be tiny—if you run line 25 you will find out how tiny it is. You can also manually calculate that probability using 7.3: $P = \frac{e^{-8.27}}{1 + e^{-8.27}} = 0.000256$. This is equivalent to running exp(-8.27)/(1 + exp(-8.27)) or invlogit(-8.27) in R. The specific estimate for the intercept
here is not very meaningful, because we don’t expect a participant to
have a reaction time of 0 seconds. Let’s move on to the effect of RT, our
focus here.
The estimate for RT is $\hat{\beta} = 1.38$ (log-odds). Let's first interpret this in terms of odds by taking the exponential of the estimate ($e^{\hat{\beta}}$), or exp(1.38) in R—line 22 of code block 34 calculates that for us (line 21 does the same for the inter-
cept). An increase (positive sign) of 1.38 log-odds is equivalent to an increase
by a factor of 3.97. In other words, as you increase reaction time by 1 unit
(1 second), the odds of being a learner go up by a factor of almost 4—which
is a lot!
How about the change in probability? Here, the answer can be a little tricky.
Remember that a probability curve is not a straight line, which means the
change in probability is not constant across all values of the predictor vari-
able—we discussed this earlier in reference to the right plot in Fig. 7.2. To
make things more concrete, then, let’s pick actual reaction times to see how
much the probability of being a learner changes as a function of our predictor
variable here.
Take a look at lines 28–29 in code block 34. The function in question,
predict(), should be familiar from chapter 6: we can use it to predict new
data given a model fit. This time, we’re using an additional argument, namely, type = "response"—which will give us probabilities (as opposed to
log-odds). Here, our new data is a sequence of reaction times: from 1 to 10
seconds, with 1-second intervals (i.e., 1, 2, …, 9, 10). What we’re doing is
simple: given the model we just ran, what’s the probability that a participant
is a learner if his or her reaction time is 1 second? We ask that for all ten reac-
tion times in question.
Run lines 28–29. You will see that, for a reaction time of 5 seconds, the
probability of being a learner is 20%. For 6 seconds, it’s 50% (P(L2er = 1) =
0.50). Notice that this change in 1 second caused a positive difference of
30%. Now look at 9 and 10 seconds, which result in 0.98 and 0.99. Here,
the change in probability is minuscule in comparison to the change between
5 and 6 seconds. This makes sense: our probabilities do not follow a linear trend, so the biggest change in probability will occur in the middle of our S-shaped curve. Fig. 7.4 plots the predicted probabilities from lines 28–29. That’s why simply taking the probability of the effect size can be misleading: it all depends on where we look along the x-axis.

FIGURE 7.4 Predicted Probabilities for Ten Reaction Times (Using Model Fit)
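To see this nonlinearity numerically, here is a minimal sketch that prints the ten predicted probabilities alongside their second-by-second changes (it assumes fit_glm1 from code block 34 and the tidyverse are already loaded, as above):

R code
# Predicted probabilities for RT = 1-10 seconds and the change per second:
p = predict(fit_glm1, newdata = tibble(RT = 1:10), type = "response")
round(p, 2)       # probabilities at 1, 2, ..., 10 seconds
round(diff(p), 2) # the per-second change is largest near the middle of the S-curve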
Let’s now examine the remainder of the output given in code block 34.
Lines 16 and 17 tell us the null deviance and the residual deviance of the
model, respectively. The null deviance basically tells us how accurate our pre-
dictions would be if we only had the intercept in the model (i.e., no predictor
variable at all). The residual deviance then tells us what happens once we
include our predictor variable, RT. Here, the deviance goes from 1145 to
398—this is a substantial difference, which in turn indicates that reaction
time indeed helps our model’s accuracy at predicting the probability that a
given participant is a learner, that is, P(L2er = 1).
Finally, in line 18, our output also gives us the AIC, or Akaike information
criterion (Akaike 1974). Here, AIC = 402, which in and of itself doesn’t tell us much (unlike R² in linear models). However, this number will be useful
once we start comparing models: the lower the AIC of a model, the better
the fit. So if we run three different models on the same data, the one with
the lowest AIC will have the best fit of the three.
AIC helps us select models by estimating the loss of information involved
in the process of fitting a model. Think of a model as a representation of our
data—a representation that is never perfect. We want our models to be max-
imally good, but at the same time we want them to be as simple as possible
to avoid overfitting the data. Here’s why: a model with too many variables
will fit the data better than a model with fewer variables, but in doing that
our model may be picking up noise that is specific to our data and that will
not be present in a different sample of participants, for example. If we design
a model that works perfectly for the data we have, it may fail to work properly
on data we don’t have yet, but which may be collected later. Ideally, we want
our models to capture the patterns in our data but also to predict future pat-
terns. Consequently, what we want is a compromise between a model that is
too good (and which will therefore overfit the data) and a model that is too
simple (and which will therefore underfit the data). AIC values help us
decide which model offers the best compromise. We will return to this discus-
sion shortly.
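As a concrete sketch of that comparison (fit_glm2, fit_glm3, and fit_glm4 are fit later in this chapter; AIC() is built into R):

R code
# Comparing models fit to the same data by AIC:
AIC(fit_glm2, fit_glm3, fit_glm4) # lowest AIC = best compromise between fit and simplicity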
One additional way to see how good or accurate our fit is would be to
compare its predictions to the actual data. In other words, we could
compare the value of L2er in rc to what the model predicts for each reaction
time in the data. That’s what we’ll do next.
Code block 35, which you should add to logModels.R, calculates the accu-
racy of our model, fit_glm1. Lines 2–9 create a new tibble, fit_glm1_accuracy.
First, we take rc (line 2), and select the only two variables that we care about
right now (line 3), namely, L2er and RT. Next, in lines 4–9, we create three
different columns. Column 1, called pL2er, will contain the predictions (in
probabilities) of fit_glm1 for all the reaction times in the data (line 5). Line
7 then dichotomizes these predictions in the form of a new column,
L2erBin: every time a predicted probability is above 0.5, we classify it as 1
(meaning: if your probability of being a learner is above 50%, you’re a
learner). When the probability is under 50%, we classify L2erBin as 0.
And for situations where the predicted probability is exactly 50%, we’ll
have NAs.
After we have dichotomized the predictions of our model, our next task is
simple: we create a new column, Accuracy, and for every time our dichoto-
mized prediction matches the actual value of L2er, Accuracy = 1 (that is,
R code
1 # Measuring accuracy of fit_glm1:
2 fit_glm1_accuracy = rc %>%
3 dplyr::select(L2er, RT) %>%
4 mutate(pL2er = predict(fit_glm1,
5 newdata = tibble(RT = RT),
6 type = "response"),
7 L2erBin = ifelse(pL2er > 0.5, 1,
8 ifelse(pL2er < 0.5, 0, NA)),
9 Accuracy = ifelse(L2erBin == L2er, 1, 0))
10
11 fit_glm1_accuracy %>%
12 summarize(Correct = sum(Accuracy)/nrow(.))
13
14 # A tibble: 1 x 1
15 # Correct
16 # <dbl>
17 # 1 0.908
our model is correct for that prediction). Finally, in lines 11–12, we count how
many times we have 1s in our Accuracy column and divide that number by the
number of rows in the data (nrow(.)). Line 17 has the answer: our model’s pre-
dictions were correct more than 90% of the time, which is impressive consid-
ering that we only have a single predictor variable.
There are two reasons that we shouldn’t be too impressed with our accuracy
here. First, the data being modeled is hypothetical. Your actual data may look
very different depending on a number of factors (who your participants are,
how you designed your experiment, etc.). Second, we used all of our data to
run a model and then gave it the same data to measure its accuracy. This is
like giving students a test today for them to study for a test next week and
then giving them the exact same test next week (!). No one would be surprised
if they did well.
Is it a problem that we’re testing our model’s accuracy with the same data we
used to run the model? It depends. In machine learning, for example, we train
the model with one dataset and then test it on a different dataset. In that case,
yes, this would be a problem. But not all statistical analyses have machine learn-
ing in mind, so you may not care that your model is trained on the same data
with which you’re testing it. If your intention is to have a percentage to say
how much of your data your model accounts for, then calculating its accuracy
like we did is a perfectly reasonable option to entertain.
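If you do want an out-of-sample estimate, here is a minimal sketch of a train/test split; the 80/20 split and the seed are my own arbitrary choices, not part of the analysis above:

R code
# Out-of-sample accuracy via a simple 80/20 train/test split:
set.seed(1)                                         # arbitrary seed for reproducibility
trainRows = sample(nrow(rc), size = 0.8 * nrow(rc)) # 80% of rows for training
train = rc[trainRows, ]
test = rc[-trainRows, ]

fit_train = glm(L2er ~ RT, data = train, family = "binomial")
pTest = predict(fit_train, newdata = test, type = "response")
mean(ifelse(pTest > 0.5, 1, 0) == test$L2er)        # accuracy on unseen rows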
REPORTING RESULTS
A logistic regression confirms that reaction time is a significant predictor of whether or not a participant is a learner (L2er): β̂ = 1.38; p < 0.0001. The estimate indicates that for every additional second of a participant’s reaction time, his/her odds of being an L2er go up by a factor of 3.97 (e^|1.38|). The model in question accurately predicts whether or not a participant is an L2er 90% of the time in the data modeled.
Recall that you can also report the standard error of the estimates or the 95%
confidence intervals of the estimates. If you use the display() function to print
the output of fit_glm1, you will notice again that only estimates and standard
errors are provided, given that these two numbers are sufficient to calculate (or
infer) the z-value,4 the p-value, and the 95% confidence intervals.
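For example, a quick sketch using fit_glm1 (confint() and coef(summary()) are standard R functions):

R code
# 95% confidence intervals for the estimates in fit_glm1:
confint(fit_glm1)             # profile-likelihood intervals
est = coef(summary(fit_glm1)) # matrix with Estimate and Std. Error columns
est[, "Estimate"] - 1.96 * est[, "Std. Error"] # approximate lower bounds
est[, "Estimate"] + 1.96 * est[, "Std. Error"] # approximate upper bounds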
Now that we have examined our first logistic regression in detail, we won’t
need to repeat all the steps from fit_glm1 when discussing our next models. For
example, we won’t calculate the accuracy of the model—but you can easily
adapt the code in code block 35. Remember: you can always come back to
our first model if you feel like reviewing some of the details involved in
interpreting a model’s estimates. But don’t worry: we will run and interpret
three more models in this chapter.
baseline (no break at all). Look at Fig. 7.5 again: the NoBreak condition is below the 50% mark. If you recall Table 7.1, a 50% probability is equivalent to 0 log-odds. Because NoBreak is below 50%, we expect a negative effect size (β) for this particular condition. And because we’re using NoBreak as our reference level, it will be our intercept. As a result, we predict our intercept will be negative: β̂₀ < 0. Finally, given the error bars in our figure, it shouldn’t surprise us if the negative effect in question is significant (H₀: β̂₀ = 0).
R code
1 # Set NoBreak as our reference for Condition:
2 rc = rc %>%
3 mutate(Condition = relevel(as.factor(Condition), ref = "NoBreak"))
4
5 fit_glm2 = glm(Low ~ Condition, data = rc, family = "binomial")
6 summary(fit_glm2)
7
8 # Coefficients:
9 # Estimate Std. Error z value Pr(>|z|)
10 # (Intercept) -0.3778 0.1175 -3.214 0.001309 **
11 # ConditionHigh -0.6682 0.1765 -3.787 0.000153 ***
12 # ConditionLow 0.9100 0.1677 5.427 5.73e-08 ***
13 # ---
14 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
15 #
16 # (Dispersion parameter for binomial family taken to be 1)
17 #
18 # Null deviance: 1231.1 on 899 degrees of freedom
19 # Residual deviance: 1144.6 on 897 degrees of freedom
20 # AIC: 1150.6
and 0.63, respectively. These numbers make sense once we look at the percent-
ages in Fig. 7.5.
Much like the models in chapter 6, we could present our results in a table
(e.g., Table 6.2) or in a figure (e.g., Fig. 6.6)—we will explore both options
later for more complex models. Notice that in the present example we are
only entertaining a single predictor of low attachment, namely, Condition.
In other words, we are completely ignoring other variables that can potentially
be important here—indeed, treating all participants, both native speakers and
learners, as a single group makes very little sense given what we know about
second language acquisition. But don’t worry: this model is here simply to
show you how a simple logistic regression is interpreted when we have a cat-
egorical predictor (cf. fit_glm1 in §7.2.1). Before we move to our next
example, let’s see how we could report these results.
REPORTING RESULTS
A logistic model shows that Condition has a significant effect on the probability of choosing low attachment. Relative to our baseline, NoBreak, both high and low breaks affect our participants’ preference for low attachment. Having a high break has a negative effect (β̂ = −0.67; p < 0.001), and having a low break has a positive effect (β̂ = 0.91; p < 0.0001).7
chosen match the proficiency levels, such that a darker shade represents a more
proficient group (native speakers being represented by the darkest shade). In addi-
tion, the order of presentation of the proficiency levels also makes sense, that is,
intermediate before advanced learners. Finally, even though the fill color of the bars represents proficiency, we don’t see a key in the figure. Instead, the actual
levels of Proficiency are displayed at the base of each bar. The combination of
the labels and intuitive shades of gray makes it easy to quickly understand the pat-
terns in the figure—the code to generate Fig. 7.6 is shown in code block 37, since
it has some specific lines that we haven’t discussed yet.
We already know the overall pattern involving Condition, but Fig. 7.6 shows
the importance of Proficiency on top of Condition. First, let’s focus on native
speakers, that is, the darkest bars. Recall that English is expected to favor low
attachment in general. As a result, it’s not a surprise that in the NoBreak (i.e., neutral) condition native speakers prefer low attachment more than 50% of
the time. If you move left, to Condition = High, the bar goes down for
native speakers, and if you go right, to Condition = Low, it goes up—both rel-
ative to NoBreak, which is our baseline here again.
Learners in the data clearly disprefer low attachment overall—recall that
Spanish is expected to favor high attachment in general: we can see that for
NoBreak the bars for intermediate and advanced learners are both below
50%. In fact, the only time a bar is above 50% for a non-native group is
when Condition = Low and Proficiency = Adv. In other words, advanced
English learners seem to prefer low to high attachment given the right condi-
tion (a low break in the stimuli).
As we transition from our figure to our model, it’s important to remember
what exactly is our intercept here, that is, what our reference level is for both
predictor variables. As per our discussion earlier, we will set our reference level
for Proficiency to native speakers (Nat)—we have already set our reference
level for Condition (NoBreak). Therefore, our intercept here represents
native speakers in the NoBreak condition.
Before we actually run our model, 7.5 shows how we can represent it mathematically. We have two β̂ for Condition and two β̂ for Proficiency. Because
R code
1 # Remember to add this code block to plotsCatModels.R
2 source("dataPrepCatModels.R")
3 library(scales) # to add percentages on y-axis using percent_format() below
4
5 # Order levels of Proficiency: Int, Adv, Nat
6 rc = rc %>%
7 mutate(Proficiency = factor(as.factor(Proficiency), levels = c("Int", "Adv", "Nat")))
8
9 # Make figure:
10 ggplot(data = rc, aes(x = Condition, y = Low,
11 fill = Proficiency, label = Proficiency)) +
12 geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray") +
13 stat_summary(geom = "bar",
14 alpha = 0.5, width = 0.5,
15 color = "black",
16 position = position_dodge(width = 0.5)) +
17 stat_summary(geom = "errorbar", width = 0.2,
18 position = position_dodge(width = 0.5)) +
19 theme_classic() +
20 geom_text(data = rc %>% filter(Low == 0) %>% mutate(Low = Low + 0.04),
21 position = position_dodge(width = 0.5),
22 size = 3, fontface = "bold") +
23 scale_x_discrete(limits = c("High", "NoBreak", "Low")) +
24 scale_fill_manual(values = c("white", "gray60", "gray50")) +
25 scale_y_continuous(labels = percent_format()) + # This requires the scales package!
26 labs(y = "% of low responses") +
27 theme(legend.position = "none")
28
29 # ggsave(file = "figures/condition-prof-low-barplot.jpg",
30 # width = 8, height = 2.5, dpi = 1000)
CODE BLOCK 37 Code for Fig. 7.6: Bar Plot and Error Bars (Three Variables)
both variables are categorical, we can turn them on and off again—the same situation we discussed in §7.2.2. For example, suppose a participant is an intermediate learner, and we’re modeling a stimulus in the high break condition. In that case, our model would be defined as P(Low = 1) = logit⁻¹(β̂₀ + β̂_high · 1 + β̂_low · 0 + β̂_int · 1 + β̂_adv · 0). Naturally, if the item we’re modeling is in the NoBreak condition and our participant is a native speaker, our model would simply be P(Low = 1) = logit⁻¹(β̂₀).

$$P(\text{Low} = 1) = \text{logit}^{-1}\Big(\hat{\beta}_0 + \overbrace{\hat{\beta}_{high}\,x_{i\,high} + \hat{\beta}_{low}\,x_{i\,low}}^{\text{Condition}} + \overbrace{\hat{\beta}_{int}\,x_{i\,int} + \hat{\beta}_{adv}\,x_{i\,adv}}^{\text{Proficiency}}\Big) \qquad (7.5)$$
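Once fit_glm3 is available (it is fit in code block 38 below), we can check 7.5 by plugging in the 1s and 0s ourselves—a minimal sketch:

R code
# Applying 7.5 by hand: P(Low = 1) for an intermediate learner (Int) in the
# High condition, using the coefficients from fit_glm3 (code block 38):
b = coef(fit_glm3)
invlogit(b[["(Intercept)"]] + b[["ConditionHigh"]] * 1 + b[["ProficiencyInt"]] * 1)
# ~0.14, given the output in code block 38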
Code block 38 fits our model (lines 2–3 set Nat as the reference level for Proficiency).8 Let’s first look at our intercept (β̂₀ = 0.53; p < 0.001). We
haven’t discussed null hypotheses in a while, but you should remember that
the null hypothesis for every estimate in our model is that its value is zero.
The only difference here is that zero means “zero log-odds”, which in turn
means 50% probability (see Table 7.1). This makes sense: a 50% probability
is no better than chance. The intercept here tells us the log-odds of choosing
low attachment for native speakers in the condition NoBreak. The estimate
is positive, which means P(Low = 1) > 0.5. If you look back at Fig. 7.6, you
will see that this makes perfect sense: the bar for native speakers in the
NoBreak condition is above 50%.
If we simply look at the signs of our estimates, we can see that relative to our
intercept: (i) high breaks lower the log-odds, and (ii) low breaks raise the
log-odds of choosing a low attachment response—these effects shouldn’t
be surprising given our discussion about fit_glm2. Let’s focus specifically
on Proficiency effects. Relative to native speakers (our intercept), both
advanced and intermediate learners choose low attachment less frequently—
we can see that in Fig. 7.6. Therefore, it’s not surprising that the log-odds of
both Adv and Int are negative in fit_glm3.
Finally, we could also generate a table with the P(Low = 1) for all possible
combinations of variables: all three conditions for each of the three proficiency
groups in our data. The table would contain nine rows and could be generated
with the predict() function. The predicted probabilities in question would
mirror the percentages we observe in Fig. 7.6.
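Here is a minimal sketch of that table (it assumes fit_glm3 from code block 38; crossing() comes with the tidyverse):

R code
# Predicted P(Low = 1) for all nine Condition x Proficiency combinations:
newGrid = crossing(Condition = c("NoBreak", "High", "Low"),
                   Proficiency = c("Nat", "Adv", "Int"))
newGrid %>%
  mutate(pLow = predict(fit_glm3, newdata = newGrid, type = "response"))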
How does the model in question compare with fit_glm2? Intuitively,
fit_glm3 is expected to be better, since it includes a predictor that statistically
affects our response variable. If you look back at code block 36, you will see
that the AIC for fit_glm2 is 1150.6. For fit_glm3, we see in code block 38
that the AIC is 1067.3. The lower AIC value for fit_glm3 tells us that it is
the better fit of the two models in question. We could also repeat the steps
in code block 35 to determine the accuracy of both models and compare
them directly.
R code
1 # Remember to add this code block to logModels.R (run code block 36 to relevel Condition)
2 rc = rc %>%
3 mutate(Proficiency = relevel(as.factor(Proficiency), ref = "Nat"))
4
5 fit_glm3 = glm(Low ~ Condition + Proficiency, data = rc, family = "binomial")
6 summary(fit_glm3)
7
8 # Coefficients:
9 # Estimate Std. Error z value Pr(>|z|)
10 # (Intercept) 0.5302 0.1609 3.295 0.000986 ***
11 # ConditionHigh -0.7443 0.1868 -3.984 6.78e-05 ***
12 # ConditionLow 1.0118 0.1779 5.688 1.28e-08 ***
13 # ProficiencyAdv -1.2142 0.1814 -6.695 2.16e-11 ***
14 # ProficiencyInt -1.6020 0.1878 -8.532 < 2e-16 ***
15 # ---
16 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
17 #
18 # (Dispersion parameter for binomial family taken to be 1)
19 #
20 # Null deviance: 1231.1 on 899 degrees of freedom
21 # Residual deviance: 1057.3 on 895 degrees of freedom
22 # AIC: 1067.3
REPORTING RESULTS
A logistic model shows that both Condition and Proficiency are significant pre-
dictors of participants’ preference for low attachment in their responses.
More specifically, both high (β̂ = −0.74; p < 0.001) and low (β̂ = 1.01; p < 0.0001) breaks in the stimuli significantly affect the probability that a participant will choose low attachment in the data (relative to the NoBreak condition, our intercept in the model). As for proficiency, both advanced (β̂ = −1.21; p < 0.0001) and intermediate (β̂ = −1.60; p < 0.0001) learners choose low attachment significantly less frequently
than native speakers, consistent with previous studies and with what we
observe in Fig. 7.6.
Fortunately, we will never have to use 7.6 manually: R will do all that for us.
$$P(\text{Low} = 1) = \text{logit}^{-1}\Big(\hat{\beta}_0 + \overbrace{\hat{\beta}_{H}x_{iH} + \hat{\beta}_{L}x_{iL}}^{\text{Condition}} + \overbrace{\hat{\beta}_{I}x_{iI} + \hat{\beta}_{A}x_{iA}}^{\text{Proficiency}} + \underbrace{\hat{\beta}_{H \cdot I}x_{iH}x_{iI} + \hat{\beta}_{L \cdot I}x_{iL}x_{iI} + \hat{\beta}_{H \cdot A}x_{iH}x_{iA} + \hat{\beta}_{L \cdot A}x_{iL}x_{iA}}_{\text{Interaction}}\Big) \qquad (7.6)$$
Code block 39 shows the output of fit_glm4. You will notice that the ref-
erence level of Condition is NoBreak, and the reference level of Proficiency is
Nat. If your output looks different, you should rerun code blocks 36 and 38,
which relevel both variables—all three code blocks should be in logModels.R.
As usual, let’s start general and get more specific as we discuss our effects
here. First, notice that all main effects are significant, including the intercept.
Second, two interactions are also significant (lines 12 and 13 in our code
block). So we were right to suspect that an interaction between Condition
and Proficiency existed in our data given Fig. 7.6—one more reason that visu-
alizing our patterns is essential before we start exploring our models. Next,
notice that the AIC of this model is 1060.3, the lowest number yet. This
already tells us that fit_glm4 is the best model so far (excluding fit_glm1,
which modeled a different variable, i.e., L2er).
INTERCEPT. The intercept here is again the predicted log-odds of Low = 1
when all other variables are set to zero. What does that mean? Well, here,
that’s when Condition = NoBreak and Proficiency = Nat, since that’s what
the intercept represents in our model. Notice that the estimate is positive (β̂₀ = 0.53; p < 0.05), which means the probability of choosing low attachment for native speakers in the NoBreak condition is above chance (50%)—
again, this should not be surprising given our discussion so far (and given
Fig. 7.6).
CONDITIONHIGH. This is the predicted log-odds of choosing low attachment
in condition High assuming that Proficiency = Nat. That the estimate is negative (β̂ = −1.02) is not surprising, since we know that native speakers disprefer low attachment in the high condition (relative to the NoBreak condition).
CONDITIONLOW. This is the predicted log-odds of choosing low attachment
in condition Low assuming that Proficiency = Nat. That the estimate is positive
is not surprising either, since we know that native speakers prefer low attach-
ment in the low condition (relative to the NoBreak condition). As you can see,
so far we’ve been interpreting our estimates assuming the native speaker
group—recall our discussion back in chapter 6.
PROFICIENCYINT. This is the predicted log-odds of choosing low attachment
for intermediate learners assuming the NoBreak condition. Notice a pattern
here? When we interpret one variable, we assume the other one is held at
R code
1 fit_glm4 = glm(Low ~ Condition * Proficiency, data = rc, family = "binomial")
2 summary(fit_glm4)
3
4 # Coefficients:
5 # Estimate Std. Error z value Pr(>|z|)
6 # (Intercept) 0.5322 0.2071 2.570 0.01018 *
7 # ConditionHigh -1.0218 0.2921 -3.498 0.00047 ***
8 # ConditionLow 1.5585 0.3808 4.092 4.27e-05 ***
9 # ProficiencyInt -1.2860 0.2981 -4.314 1.60e-05 ***
10 # ProficiencyAdv -1.5268 0.3060 -4.990 6.05e-07 ***
11 # ConditionHigh:ProficiencyInt -0.2169 0.4754 -0.456 0.64822
12 # ConditionLow:ProficiencyInt -1.0459 0.4812 -2.173 0.02975 *
13 # ConditionHigh:ProficiencyAdv 1.0719 0.4309 2.488 0.01286 *
14 # ConditionLow:ProficiencyAdv -0.3227 0.4862 -0.664 0.50679
15 # ---
16 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
17 #
18 # (Dispersion parameter for binomial family taken to be 1)
19 #
20 # Null deviance: 1231.1 on 899 degrees of freedom
21 # Residual deviance: 1042.3 on 891 degrees of freedom
22 # AIC: 1060.3
zero, which means we assume the level represented by the intercept. The estimate is negative (β̂ = −1.29), which simply captures the observation that
intermediate learners disprefer low attachment relative to native speakers in
the NoBreak condition—make sure you return to Fig. 7.6 to see how our
interpretation matches the patterns in the figure.
PROFICIENCYADV. This is the predicted log-odds of choosing low attachment
for advanced learners assuming the NoBreak condition. The estimate is also
negative, which captures the observation that advanced learners disprefer low
attachment relative to native speakers in the NoBreak condition.
CONDITIONLOW:PROFICIENCYINT. Examine only the bars for Condition =
Low in Fig. 7.6. Now examine the bar for intermediate learners, and
compare it to that of native speakers. If you compare the difference between
intermediate learners and native speakers in the low condition and in the
NoBreak condition, you will notice that the difference is more pronounced
in the low condition. Our model is basically telling us that this difference in
magnitude is significant (β̂ = −1.05; p < 0.05). Again: this is telling us that
the relationship between Condition and Proficiency is not constant in the
data. More specifically, we see that the difference in response patterns
between intermediate learners and native speakers is not the same when we
compare the NoBreak and Low conditions. Indeed, the difference intensifies
when Condition = Low, since the main effect for ProficiencyInt is already neg-
ative (see earlier).
REPORTING RESULTS
A statistical model confirms that participants’ preference for low attachment is significantly affected by Condition, Proficiency, and the interaction of both variables—estimates are provided in Table 7.2. Consistent with the literature, native speakers of English favor low attachment in the NoBreak (β̂₀ = 0.532; p = 0.01) and Low (β̂ = 1.559; p < 0.001) conditions. For the High condition, on the other hand, native speakers disfavor low attachment (β̂ = −1.022; p < 0.001). This shows that a prosodic effect can impact speakers’ interpretation of relative clause heads in ambiguous sentences. Finally, learners in this (hypothetical) study show the effects of their L1 in their response patterns: they overall prefer low attachment less than the native speakers in the study (see estimates in Table 7.2).
TABLE 7.2 Model Estimates and Associated Standard Errors, z-values, and p-values (AIC = 1060.3)
can use different line types in our geom_pointrange() to represent the two
levels of Sig. Finally, in line 34, we manually choose which line types we
want to represent both No and Yes—in that order, since R orders levels
alphabetically.
Let’s now briefly go over the goodness of fit of fit_glm4. We have already
established that the model’s AIC is the lowest of the three models so far, which
tells us that it has a better fit than the other two models. In addition, we can use
the anova() function to compare the two models using a likelihood ratio test:
anova(fit_glm3, fit_glm4, test = "LRT"), which will reveal that fit_glm4 has
a significantly better fit (i.e., it has lower deviance from the data relative to
fit_glm3). As a result, this would be the model to report. None of this is sur-
prising: the p-values we get from summary(fit_glm4) are computed using
Wald tests, which basically compare a model with predictor X to a model
without said predictor to check whether X is actually statistically relevant.
We’ve seen that the interaction between Condition and Proficiency is signifi-
cant, so it makes sense that a model without said interaction would be less
optimal than a model with the interaction.
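Concretely, a short sketch (assuming fit_glm3 and fit_glm4 from above):

R code
# Likelihood ratio test between the additive and interaction models:
anova(fit_glm3, fit_glm4, test = "LRT")
# A significant result means the interaction model (fit_glm4) has
# significantly lower deviance, matching its lower AIC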
But how accurate is fit_glm4? In other words, what percentage of the data
modeled does the model predict accurately? You may recall that we examined
the accuracy of fit_glm1, which predicted the probability of being a learner
based on a participant’s reaction time—see code block 35. The answer here
is 70%: fit_glm4 can accurately predict 70% of the responses in the data—
R code
1 # Plot estimates and confidence intervals for fit_glm4
2 # Prepare data:
3 glm4_effects = tibble(Predictor = c("(Intercept)", "High",
4 "Low", "Adv", "Int",
5 "High:Adv", "Low:Adv",
6 "High:Int", "Low:Int"),
7 Estimate = c(coef(fit_glm4)[[1]], # Intercept
8 coef(fit_glm4)[[2]], # High
9 coef(fit_glm4)[[3]], # Low
10 coef(fit_glm4)[[4]], # Adv
11 coef(fit_glm4)[[5]], # Int
12 coef(fit_glm4)[[6]], # High:Adv
13 coef(fit_glm4)[[7]], # Low:Adv
14 coef(fit_glm4)[[8]], # High:Int
15 coef(fit_glm4)[[9]]), # Low:Int
16 l_CI = c(confint(fit_glm4)[1:9]), # lower CI
17 u_CI = c(confint(fit_glm4)[10:18])) # upper CI
18
19 # Add binary column for sig vs. not sig:
20 glm4_effects = glm4_effects %>%
21 mutate(Sig = ifelse(l_CI > 0 & u_CI > 0 | l_CI < 0 & u_CI < 0,
22 "yes", "no"))
23
24 ggplot(data = glm4_effects, aes(x = Predictor, y = Estimate)) +
25 geom_pointrange(aes(ymin = l_CI, ymax = u_CI, linetype = Sig)) +
26 coord_flip() + theme_classic() +
27 scale_x_discrete(limits = rev(c("(Intercept)", "High", "Low",
28 "Int", "Adv", "High:Int", "Low:Int",
29 "High:Adv", "Low:Adv"))) +
30 geom_text(aes(label = round(Estimate, digits = 2)),
31 position = position_nudge(x = -0.3, y = 0)) +
32 geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.1) +
33 labs(x = NULL) + theme(legend.position = "none") +
34 scale_linetype_manual(values = c("dashed", "solid"))
35
36 # Save plot:
37 # ggsave(file = "figures/model-estimates.jpg", width = 6, height = 4.5, dpi = 1000)
38
39 # Alternatively: install sjPlot package
40 # library(sjPlot)
41 # plot_model(fit_glm4, show.intercept = TRUE) + theme_classic()
this doesn’t necessarily mean that if we ran the model on new data it would
have the same accuracy (see discussion on the accuracy of fit_glm1 in
§7.2.1). The code to calculate the accuracy of fit_glm4 is shown in code
block 41.
Finally, we should also spend some time on model diagnostics once we have
run and selected our model of choice. In chapter 6, we saw that plotting the
residuals of a model is essential for model diagnostics. For logistic regressions,
however, only plotting residuals will not be very helpful. Instead, you should
use the binnedplot() function in the arm package. For example, for
fit_glm4, we could use the function in question to plot expected values
against average residuals. This is known as a binned residual plot and can be gen-
erated by running binnedplot(predict(fit_glm4), residuals(fit_glm4)).
Most points in the plot should fall within the confidence limits (gray lines)
shown—see Gelman and Hill (2006, pp. 97–98) for more details on binned plots.
7.3 Summary
In this chapter we discussed the basic characteristics and assumptions of logistic
models, and we examined different examples of such models using our hypo-
thetical study on relative clauses. We have also discussed how to report and
present our results, how to compare different models, and how to calculate
the accuracy of a model using the data modeled as the model’s input. It’s
time to review the most important points about logistic models.
• We can choose our models based on the lowest AIC value they achieve: lower AIC values indicate better fits.
• It’s also possible to compute the accuracy of a model, which tells us how much of the data modeled is accurately predicted by the model being fit to the data.
• Finally, we saw how to report our results in text and to present our estimates in a table or in a figure.
• Most of the coding used for linear regressions is applicable to logistic regressions, since both types of models belong to the same family (generalized linear models). Indeed, even the interpretation of our model estimates is very similar to what we have in linear regressions—once you get used to log-odds in logistic models, of course. In other words, if you know how to interpret an interaction in a linear regression, you also know how to do that in a logistic regression as long as you understand log-odds. Likewise, if you know how to rescale variables for a linear regression, you also know how to do that for a logistic regression.
• Finally, you should use binned residual plots for model diagnostics. Such plots are a quick and intuitive way to check whether the model you plan to report is actually working as it should.
7.4 Exercises
2. If an estimate is 0.57 and its standard error is 0.18, what is its 95% con-
fidence interval? Is the estimate statistically significant?
plot or a bar plot with error bars with Response on the x-axis and then use
coord_flip() to make your response variable appear on the y-axis.
2. Run a logistic model that accompanies the figure you just created—that
is, where you add reaction time as a predictor. Would we want to rescale
RT? Why? Why not? Do the results surprise you given the figure you just
created?
Notes
1. What 1 and 0 represent is entirely up to you. You could assign 0 to Spanish speakers,
in which case our figure would plot the probability that participant i is a native
speaker of English. We will return to this discussion later, in §7.2.
2. More specifically, we’ll be using the Bernoulli distribution, which in our data repre-
sents the probability of being a learner for each individual data point (i.e., each trial)
in the data. A binomial distribution is the sum of multiple trials that follow a Ber-
noulli distribution.
3. If you don’t remember the hypothetical study in question, return to chapter 4, where
the data is first presented.
4. Not a t-value, which we saw in our linear models in chapter 6.
5. You should be able to produce the figure in question by going back to chapter 4—
remember to add the code to plotsCatModels.R and to load the scales package to
add percentages to the y-axis. The code for the figure can be found in the files
that accompany this book.
6. Notice that the labels on the x-axis have been reordered and that the x-axis here plots
variable Low, created in code block 33. When we have a column with 0s and 1s,
stat_summary() will automatically give us proportions.
7. We could also report confidence intervals by using the confint() function, or the
change in odds by taking the exponential of the log-odds, as we did in our discussion
earlier.
8. We have already changed the reference level of Condition back in code block 36,
which at this point you have already run.
8
ORDINAL REGRESSION
DATA FILE
In this chapter (§8.2), we will use rClauseData.csv again. This file simulates a
hypothetical study on relative clauses in second language English.
8.1 Introduction
The type of ordinal model we will examine in this chapter is essentially a logistic
regression adapted to deal with three or more categories in a response variable.
More specifically, we will focus on proportional odds models—see Fullerton and Xu
(2016, ch. 1) for a typology of ordered regression models. While in a logistic
regression we want to model the probability that a participant will respond
“yes” (as opposed to “no”), in an ordinal logistic regression we want to model
a set of probabilities. For example, let’s assume that we have three categories in
a scale where participants had to choose their level of certainty: NOT CERTAIN,
NEUTRAL, and CERTAIN. What we want to model is the probability that a given
participant is at least NOT CERTAIN—which in this case simply means “not
certain”, since there’s no other category below NOT CERTAIN. We also want to
model the probability that a participant is at least NEUTRAL, which in this case
includes the probability of NOT CERTAIN and NEUTRAL—so we are essentially
asking for the cumulative probability of a given category in our scale. Because
our hypothetical scale here only has three categories, we only need two proba-
bilities: to find out the probability that a participant is CERTAIN we simply subtract
the cumulative probability of being (at least) NEUTRAL from 1.
The structure of an ordinal model is shown in 8.1, where we only have a single predictor variable.

$$P(Y_i \le j) = \text{logit}^{-1}(\hat{\tau}_j - \hat{\beta}_1 x_i) \qquad (8.1)$$

Here, for a scale with J categories, we want to model the cumulative probability that a response is less than or equal to category j = 1, …, J − 1. Notice that our intercept, now represented as τ̂ⱼ (the Greek letter tau), also has a specification for category j. That’s because in an ordinal regression we have multiple intercepts (J − 1, to be exact), which represent cut-points or thresholds in our scale. Therefore, if our scale has three categories, our model will give us two thresholds. We normally don’t care about τ, but we will see how it affects our interpretation soon. Finally, notice the minus sign before β̂₁ in 8.1. The reason for subtracting our coefficients from τ̂ⱼ is simple: an association with higher categories on our scale entails smaller cumulative probabilities for lower categories on our scale.
The intuition behind an ordinal model is relatively straightforward, especially
if you’re mostly interested in a broad interpretation of the results. If you wish to
have a more detailed interpretation of coefficients and thresholds, then it can
take a little longer to get used to these models. The good news is that a
broad interpretation is all you need 99% of the time.
The variable Certainty comes from a 6-point certainty scale used by partic-
ipants to indicate how certain they were about their responses. We will simplify
our scale, though: instead of 6 categories, let’s derive a 3-point scale from it.
Our new scale will mirror the scale discussed earlier, so it will be Not
certain if Certainty < 3, Certain if Certainty > 4, and Neutral otherwise.
We are essentially compressing our scale into three blocks: [1 2] [3 4] [5 6]. The
good news is that this is already done in code block 33, which should
be located in dataPrepCatModels.R. Our new variable/column is called
Certainty3, so it’s easy for us to remember that this is the version of Certainty
that has 3, not 6, categories. Finally, notice that lines 28–31 in code block 33
adjust the order of the levels of Certainty3 such that Not Certain < Neutral <
Certain—by default, R would order them alphabetically, but we want to make
sure R understands that Certain is “greater than” Neutral.
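If you are curious how such a recode might look, here is a hypothetical sketch (the actual code lives in code block 33; case_when() comes with the tidyverse):

R code
# One way to derive Certainty3 from the 6-point Certainty scale:
rc = rc %>%
  mutate(Certainty3 = case_when(Certainty < 3 ~ "Not certain", # 1-2
                                Certainty > 4 ~ "Certain",     # 5-6
                                TRUE ~ "Neutral"))             # 3-4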
Let’s take a look at how the certainty levels of the two groups involved in the
study (English and Spanish) vary depending on the condition of our experiment
(Low, High, and NoBreak). For now, we will focus only on our target group,
namely, the learners (Spanish speakers). So focus on the right facet of Fig. 8.2.
We can see that in the high condition (at the bottom of the y-axis), Spanish
speakers are certain of their responses more than 50% of the time. In contrast,
for the low condition these learners are certain of their responses less than 15%
of the time. This is in line with the typological differences between the two
languages: while in English low attachment is typically preferred, in Spanish
it’s high attachment that is favored. Therefore, we can see some influence of
the participants’ L1 in their certainty levels (recall that the experiment was in
English, not Spanish).
Fig. 8.2 seems to have no x-axis label. The reason is simple: the actual per-
centages have been moved to the bars, so it’s very easy to see the values. In
addition, the shades of gray make it intuitive to see which bars represent
Certain and which represent Not certain. The code to produce Fig. 8.2 is shown in code block 42—make sure you place this code inside plotsCatModels.R.

FIGURE 8.2 Certainty Levels (%) by Condition for English and Spanish Groups
Fig. 8.2 clearly shows a potential effect of Condition on Certainty3, so that
will be the focus of our first model: we want to know how the certainty level
of our Spanish-speaking participants is affected by the experimental condi-
tions in question. Our model is therefore defined as Certainty3 ~ Condition in R.
Code block 43 runs our ordinal model—you can add this code block to
your ordModels.R script. First, we load the script that prepares our data,
dataPrepCatModels.R. Next, we load the ordinal package (make sure you
have installed it). Lines 7–8 relevel Condition so that NoBreak is our reference
level. Line 10 runs the actual model by using the clm() function (cumulative link
model).2 Notice that line 10 runs the model on a subset of the data (only Spanish
speakers).3 We’re doing that here because we want to start with a simple
model, which ignores the differences between the two groups and only
focuses on the differences between the conditions for a single group. Thus,
fit_clm1 represents only the right facet in Fig. 8.2.
The output of our model, fit_clm1, is shown in lines 14–24. We have our
coefficients in lines 16–17 and our thresholds τ̂ in lines 23–24. Recall from the
R code
1 # Remember to add this code block to plotsCatModels.R
2 # Calculate proportions:
3 propsOrd = rc %>%
4 group_by(L1, Condition, Certainty3) %>%
5 count() %>%
6 group_by(L1, Condition) %>%
7 mutate(Prop = n / sum(n),
8 Dark = ifelse(Certainty3 == "Certain", "yes", "no"))
9
10 # Make figure:
11 ggplot(data = propsOrd, aes(x = Condition, y = Prop, fill = Certainty3)) +
12 geom_bar(stat = "identity", color = "black") +
13 geom_text(aes(label = str_c(Prop*100, "%"), color = Dark),
14 fontface = "bold", size = 3,
15 position = position_stack(vjust = 0.5)) +
16 facet_grid(~L1, labeller = "label_both") +
17 scale_fill_manual(values = c("white", "gray80", "gray50")) +
18 scale_color_manual(values = c("black", "white"), guide = FALSE) +
19 scale_y_reverse() +
20 coord_flip(ylim = c(1,0)) +
21 theme_classic() +
22 theme(axis.text.x = element_blank(),
23 axis.ticks.x = element_blank(),
24 legend.position = "top") +
25 labs(y = NULL, fill = "Certainty:")
26
27 # ggsave(file = "figures/certainty3.jpg", width = 7, height = 3.5, dpi = 1000)
discussion earlier that in an ordinal model we have multiple thresholds (at least
two): fit_clm1 has two, given that Certainty3 has three categories ( J = 3).
The broad interpretation of our estimates (β̂) is straightforward: condition High (line 16) increases the probability of being certain relative to condition NoBreak (β̂ = 1.08). Conversely, condition Low (line 17) decreases the probability of being certain relative to condition NoBreak (β̂ = −0.88). Both conditions have a significant effect on participants’ certainty levels, which in turn means that both the effect of High and the effect of Low are significantly different from the effect of NoBreak. Remember: all the estimates in a logit ordered model are given in log-odds (see 8.1), just like in a logistic regression. We could therefore be more specific and say that when Condition = High, the odds of being certain about one’s response go up by a factor of e^|β̂| = e^|1.08| = 2.94 (“being certain” here simply means “leaning towards the right end point of the scale”). Let’s
see how we could report these results first, and then we’ll explore the interpre-
tation of the model in greater detail.
Most of the functions we discussed back in chapters 6 and 7 are applicable
here too. For instance, you can get the confidence intervals of our estimates
by running confint(fit_clm1), and you can also manually calculate them
using the standard errors in our output (§1.3.4). Likewise, you don’t actually
need p-values here, since we have both the estimates and their respective stan-
dard errors.
R code
1 rm(list=ls())
2 source("dataPrepCatModels.R")
3
4 library(arm) # to use invlogit()
5 library(ordinal)
6
7 rc = rc %>%
8 mutate(Condition = relevel(factor(Condition), ref = "NoBreak"))
9
10 fit_clm1 = clm(Certainty3 ~ Condition, data = rc %>% filter(L1 == "Spanish"))
11
12 summary(fit_clm1)
13
14 # Coefficients:
15 # Estimate Std. Error z value Pr(>|z|)
16 # ConditionHigh 1.0776 0.1904 5.661 1.51e-08 ***
17 # ConditionLow -0.8773 0.1920 -4.569 4.90e-06 ***
18 # ---
19 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
20 #
21 # Threshold coefficients:
22 # Estimate Std. Error z value
23 # Not certain|Neutral -0.5706 0.1372 -4.159
24 # Neutral|Certain 0.9632 0.1420 6.784
Notice that we didn’t care about τ̂ in our broad interpretation earlier, nor did we report what the threshold values are. Let’s now spend some time going over a more detailed interpretation of our model, which will require our τ̂ values.
REPORTING RESULTS
An ordinal model confirms that the different conditions in our experiment significantly affect learners’ certainty levels. The high attachment condition significantly increases the probability of a learner being more certain about his or her response (β̂ = 1.08; p < 0.001) relative to the condition with no break in it. Conversely, the low attachment condition significantly decreases the probability of a learner being more certain about his or her response (β̂ = −0.88; p < 0.001) relative to the condition with no break in it.4
What intercepts mean is essentially constant for all models we discuss in this
book. In linear and logistic regressions, the intercept indicated the predicted ŷ value when all predictors are zero—the difference, of course, was that in a
logistic regression the predicted value is given in log-odds. As mentioned
earlier, the ordinal models we are dealing with now are essentially logistic
regressions.
If you look back at the output of fit_clm1 in code block 43, you will notice that the values of our thresholds are τ̂₁ = −0.57 and τ̂₂ = 0.96. These values are shown in Fig. 8.3, which overlays our scale categories on top of a normal (Gaussian) distribution. Recall that because we have three categories, we only get two τ̂ values in our model. As you go back to the output of our model, notice that τ̂₁ is the threshold between Not certain and Neutral. Therefore, τ̂₁ represents the log-odds of being in category Not certain or less (the “or less” doesn’t apply here, since this is the first category in the scale).
But which condition are we talking about when we interpret the meaning of τ̂₁ here? The answer is NoBreak, our reference level. Think about it: an intercept gives us the predicted response value assuming that all other terms are set to zero. For categorical predictors, we choose one level of our factor to be our reference. For Condition, that level is NoBreak (lines 7–8 in code block 43). So τ̂₁ here essentially gives us the log-odds that a Spanish-speaking participant will choose at most Not certain for the NoBreak condition. If you take the inverse logit of τ̂₁, you will get P = 0.36 (invlogit(-0.57) in R using the arm package). That probability is the shaded area under the curve in Fig. 8.3a.
How about τ̂₂? Our second intercept/threshold here simply tells us the cumulative probability of choosing Neutral (or less). The probability is therefore cumulative because it includes Neutral and all the categories below Neutral, that is, Not certain. As you can see in Fig. 8.3b, the probability we get by taking the inverse logit of τ̂₂ is P = 0.72 (again: this is all for condition NoBreak).
Given what we now know, what’s the probability of choosing Certain for the NoBreak condition? Because probabilities add up to 1, the entire area under the curve in Fig. 8.3 is 1. Therefore, if we simply subtract 0.72 from 1, we will get the probability of choosing Certain (assuming the NoBreak condition): 1 − 0.72 = 0.28 (note that this is non-cumulative). How about the non-cumulative probability of choosing Neutral? We know that the cumulative probability of choosing Neutral is 0.72, and we also know that the cumulative probability of choosing Not certain is 0.36. Therefore, the probability of choosing Neutral must be 0.72 − 0.36 = 0.36. Look at Fig. 8.2 and check whether these probabilities make sense.
In summary, by taking the inverse logit of the τ̂ values in our model, we get predicted cumulative probabilities for condition NoBreak, our reference level here. If we wanted to calculate cumulative probabilities for conditions High and Low, we would have to plug in 1s and 0s to the equation, and our figures would look different since the areas under the curve would depend not just on τ̂₁ and τ̂₂ but also on the estimates of our predictors. The take-home message is that τ̂ values depend on where we are on the scale (which category we wish to calculate cumulative probabilities for). In contrast, β̂ estimates are the same for each and every category in the scale for which we calculate cumulative probabilities (taking the inverse logit of the log-odds predicted by the model).
Let’s calculate by hand the cumulative probabilities for condition Low using the estimates in fit_clm1. In 8.2, we see the cumulative probability for the first category j in our scale (Not certain), which is 0.58. In 8.3, we see the cumulative probability for the second category j (Neutral), which is 0.86.

$$P(Y_i \le \text{Not certain} \mid \text{Low}) = \text{logit}^{-1}(\hat{\tau}_1 - \hat{\beta}_{low}) = \text{logit}^{-1}(-0.57 - (-0.88)) = 0.58 \qquad (8.2)$$

$$P(Y_i \le \text{Neutral} \mid \text{Low}) = \text{logit}^{-1}(\hat{\tau}_2 - \hat{\beta}_{low}) = \text{logit}^{-1}(0.96 - (-0.88)) = 0.86 \qquad (8.3)$$
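A quick sketch to verify these numbers in R (invlogit() comes from the arm package, loaded at the top of ordModels.R):

R code
# Cumulative probabilities for Condition = Low, by hand (cf. 8.2-8.3):
invlogit(-0.5706 - (-0.8773))    # P(at most Not certain | Low) = 0.58
invlogit(0.9632 - (-0.8773))     # P(at most Neutral | Low) = 0.86
1 - invlogit(0.9632 - (-0.8773)) # non-cumulative P(Certain | Low) = 0.14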
Naturally, we won’t calculate anything by hand, but I hope the earlier cal-
culations helped you see how an ordinal model works (intuitively). Instead, we
can simply use the predict() function once more. Code block 44 shows how
you can generate all predicted probabilities for all three conditions for all
three categories in our scale—notice that we are generating non-cumulative
probabilities (type = "prob" in line 6). If you look at line 19, you will see
all three probabilities manually calculated earlier: 0.58, 0.28, and 0.14.5 You
can also see that if you add the first two columns for condition Low, you
will get the cumulative probability calculated earlier as well.
Finally, make sure you go back to Fig. 8.2 to compare the actual proportions
in the data to the predicted probabilities of the model. They should look very
similar, since we’re predicting the same dataset used in our model. We will now turn to a more complex model, which contains an interaction between Condition and L1.
R code
1 # Generate predicted probabilities:
2 predictedProbs1 = as_tibble(predict(fit_clm1,
3 newdata = tibble(Condition = c("High",
4 "Low",
5 "NoBreak")),
6 type = "prob")$fit)
7
8 # Add column for Condition:
9 pred_fit_clm1 = predictedProbs1 %>%
10 mutate(Condition = c("High", "Low", "NoBreak"))
11
12 # Print tibble:
13 pred_fit_clm1
14
15 # A tibble: 3 x 4
16 # Not certain Neutral Certain Condition
17 # <dbl> <dbl> <dbl> <chr>
18 # 1 0.161 0.310 0.529 High
19 # 2 0.576 0.287 0.137 Low
20 # 3 0.361 0.363 0.276 NoBreak
to this rule. For example, we could argue that learners and native speakers have
fundamentally different grammars. And if we want to equate a statistical model
to a grammar, then we could have one model per grammar, that is, one model
for learners and one model for native speakers. Naturally, doing that will make
it harder to compare the two groups. The bottom line is that your approach
will depend on your goals and on your research questions.
If you look back at Fig. 8.2, you will know the answer to the question “Do
Condition and L1 interact?” The answer is clearly “yes”, given the trends in the
figure: if we only consider the Certain category, English speakers (controls) are
more certain in the Low condition (50%) than Spanish speakers are (14.5%). In
our next model, which we will call fit_clm2, we will include Condition and
L1, as well as their interaction as predictors.
To interpret the results from fit_clm2 in code block 45, it will be useful to go back to Fig. 8.2 once more. Our model has five coefficients (β̂)—and two cut-points (τ̂). Thus, we have a total of seven parameters. The only non-significant estimate is that of L1Spanish (β̂ = −0.17; p = 0.46). Before we actually interpret the results, notice that we see Spanish, but not English, and we see High
and Low, but not NoBreak. This should already clarify what the reference
levels are (i.e., the levels we don’t see in the output)—and should be familiar
from previous chapters.
What does L1Spanish represent here? In a nutshell, it represents the effect of
being a Spanish speaker (vs. an English speaker) for the condition NoBreak.
Look back at Fig. 8.2 and focus on the NoBreak condition in the plot. You
will notice that both groups behave similarly as far as their certainty levels
R code
1 fit_clm2 = clm(Certainty3 ~ Condition * L1, data = rc, link = "logit")
2 summary(fit_clm2)
3
4 # Coefficients:
5 # Estimate Std. Error z value Pr(>|z|)
6 # ConditionHigh -0.6426 0.2644 -2.430 0.01508 *
7 # ConditionLow 0.7430 0.2697 2.755 0.00587 **
8 # L1Spanish -0.1696 0.2302 -0.737 0.46133
9 # ConditionHigh:L1Spanish 1.7096 0.3262 5.241 1.60e-07 ***
10 # ConditionLow:L1Spanish -1.6126 0.3316 -4.864 1.15e-06 ***
11 # ---
12 # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
13 #
14 # Threshold coefficients:
15 # Estimate Std. Error z value
16 # Not certain|Neutral -0.7263 0.1941 -3.742
17 # Neutral|Certain 0.7754 0.1944 3.989
go. In other words, only by looking at the High and Low conditions do we see
more substantial differences between the two groups of participants. So it’s not
surprising that we don’t have a significant coefficient for L1Spanish.
The coefficient for L1Spanish is negative (the null hypothesis, as usual, is that β̂ = 0). A negative estimate indicates that the predictor in question increases the probability of a response on the lower end of the scale (i.e., Not certain). Look again at Fig. 8.2. You will notice that Spanish speakers have 27.5% of Certain in their responses, while English speakers have 36% of Certain in their responses. We can see that Spanish speakers are therefore leaning more towards the lower end of the scale relative to English speakers (for the NoBreak condition)—hence the negative estimate. However, as we see in the output of fit_clm2, that difference is not statistically significant—so we fail to reject the null hypothesis that β̂ = 0 (log-odds).
Let’s now spend some time interpreting the estimates that are significant in
our model. Remember: by looking at the sign of the coefficients, we can see
whether a predictor moves participants’ certainty levels left or right along
our scale. But to calculate exact probabilities we need to include the τ̂ values estimated in our model. First, let’s focus on the effect sizes. Later we will
plot the exact percentages in a clear and intuitive way.
CONDITIONHIGH. This is the effect (in log-odds) of High relative to our
baseline, that is, NoBreak, for English speakers only. The effect is negative,
so when we have high attachment stimuli (vs. stimuli with no breaks), partic-
ipants’ certainty levels go down (i.e., they move towards the lower end of the
scale). Again: look back at Fig. 8.2 and focus on the L1: English facet only.
You will notice that for the NoBreak condition, we have 36% of Certain
responses. Now look at the High condition: we have 15% of Certain. Cer-
tainty is going down in the High condition, and that’s why our estimate is
negative here.
CONDITIONLOW. Here we see the opposite effect: a positive estimate.
Looking at Fig. 8.2, this again makes sense: while for No break we have
36% of Certain, for Low we have 50%. So in this case certainty is going up rel-
ative to NoBreak. Therefore, our estimate is positive. Here, the Low condition
increases English speakers’ certainty levels by a factor of 2.10 (e^|0.743|)—note that we are talking about the scale as a whole, so we can’t be precise about which categories of the scale we are referring to.6
CONDITIONHIGH:L1SPANISH. Interactions are always trickier to interpret (you
may remember this from the model in code block 39 back in chapter 7).
Figures are especially important to visualize interactions and understand what
interaction terms mean in a model. So let’s look back at Fig. 8.2, this time
focusing on the High condition at the bottom of the figure. For English speak-
ers, we have 45% of Not certain responses, and 15% of Certain. For Spanish
speakers, we have only 15.5% of Not certain responses and 52.5% of
Certain. Clearly, these two language groups differ when it comes to the
High condition: Spanish speakers are much more certain—which makes
sense, given that Spanish favors high attachment. Likewise, if you only look
at Spanish speakers and compare High and NoBreak, you will again see that
there’s a big difference in percentage between the two conditions. Simply
put, this is what the interaction is telling us: the effect of Condition depends
in part on whether you are an English speaker or a Spanish speaker in our
hypothetical study. Notice that the estimate of our interaction here is positive (β̂ = 1.71; p < 0.001). This makes sense: after all, the High condition (vs. NoBreak) makes Spanish (vs. English) speakers much more certain. Another way of looking at it is to visually compare NoBreak and High across both groups in Fig. 8.2. You will notice that they follow opposite patterns—hence
the significant interaction in our model.
CONDITIONLOW:L1SPANISH. The interpretation of this estimate should now
be clear. Here we have a negative effect (β̂ = −1.61, p < 0.001), which cap-
tures the observation in Fig. 8.2 that Spanish speakers are less certain than
English speakers in the Low condition (vs. in NoBreak stimuli).
What do the cut-points tell us? Remember: like any intercept, the two τ̂
values here assume that all predictors are set to zero (i.e., Condition =
NoBreak and L1 = English). If you take the inverse logit of these two
values, you will get cumulative probabilities of being Not certain (or less)
and Neutral (or less)—the same meaning discussed for fit_clm1 earlier. If
you run invlogit() on τ̂1, you will get 0.33; if you run it on τ̂2, you will get
0.68. Therefore, there’s a 68% probability that English speakers in the
NoBreak condition will be Neutral or less about their responses. Here, again, if
you want to calculate the actual probabilities for Spanish speakers or for other
conditions in the study, you will need to go through the estimates like we did
for fit_clm1. We won’t do that here because it’s much easier (and faster!) to
simply use the predict() function.
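If you want to replicate those numbers, here is a minimal sketch. The cut-point values below are hypothetical—back-solved from the probabilities just mentioned—so check them against your own fit_clm2 output:

R code

library(arm)  # provides invlogit()

# Hypothetical cut-points, back-solved from the probabilities above:
tau_1 = qlogis(0.33)  # ~ -0.71
tau_2 = qlogis(0.68)  # ~  0.75

invlogit(tau_1)  # 0.33 = P(Not certain or less)
invlogit(tau_2)  # 0.68 = P(Neutral or less)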
We can see here that the interpretation of interactions is the same in ordinal
regressions as it is in linear or logistic regressions. That makes sense: these are all
generalized linear models, which means they belong to the same family of
models. In other words, if you know how to interpret an interaction, you
will know how to do it across all the models covered in this book (and
many others in the same family, such as Poisson regressions).
Our final step in this chapter is to plot our model’s predictions. What are the
predicted probabilities for each category on our scale for each group for each
condition? In a way, we want to reproduce Fig. 8.2, only this time instead
of percentage points (proportions), we will have predicted probabilities
coming from our model (fit_clm2). Remember: these will likely be similar,
because we are basically testing our model with the exact same dataset it
used to “study” our variables.
We will not plot the model’s estimates—but you will practice doing that at
the end of this chapter. Instead, we first calculate predicted probabilities (non-
cumulative) using predict(). Then we create a new tibble to store the values,
and use that to create a figure using ggplot2. The figure we want to plot is
Fig. 8.4.
If you compare Fig. 8.4 with Fig. 8.2, you will notice they are very similar,
even though they are showing two different things (predicted probabilities vs.
actual values in our data). The code to generate Fig. 8.4 is shown in code
block 46 (which you should add to plotsCatModels.R).7 The figure itself has a
series of aesthetic adjustments, which you can study in lines 23–37—play
around with the code so you can see what each line is doing. Most of the
code should be fairly familiar at this point, and you should also be able to adapt
it to your needs (should you wish to plot predicted probabilities in the future).
8.3 Summary
In this chapter, we examined how to model scalar (or ordinal) response var-
iables. The model we used is essentially a logistic regression, but there are
some important differences (the main difference is that we now deal with
more than one intercept)—Agresti (2010) offers comprehensive coverage
of ordinal models. Following are some of the key points discussed earlier.
Remember: a lot of what we discussed in previous chapters is also applicable
to ordinal regressions. For example, we didn’t rescale our variables in
fit_clm2 (since they were all binary in our model), but you may want to
rescale yours if your model includes continuous predictors.
R code
1 # Generate predicted probabilities:
2 predictedProbs2 = as_tibble(predict(fit_clm2,
3 newdata = tibble(Condition = rep(c("NoBreak",
4 "Low",
5 "High"),
6 times = 2),
7 L1 = rep(c("English", "Spanish"),
8 each = 3)),
9 type = "prob")$fit)
10
11 # Add L1 and Condition columns to predictions:
12 pred_fit_clm2 = predictedProbs2 %>%
13 mutate(L1 = rep(c("English", "Spanish"), each = 3),
14 Condition = rep(c("NoBreak", "Low", "High"), times = 2))
15
16 # Wide-to-long transform tibble:
17 longPredictions = pred_fit_clm2 %>%
18 pivot_longer(names_to = "Certainty",
19 values_to = "Probability",
20 cols = `Not certain`:Certain)
21
22 # Plot model's predictions (load scales package first):
23 ggplot(data = longPredictions, aes(x = Certainty, y = Probability)) +
24 geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray80") +
25 geom_line(aes(group = L1, linetype = L1)) +
26 geom_point(aes(fill = L1), shape = 21, size = 3) +
27 facet_grid(~Condition, labeller = "label_both") +
28 coord_cartesian(ylim = c(0,1)) +
29 scale_y_continuous(labels = percent_format()) +
30 scale_x_discrete(limits = c("Not certain", "Neutral", "Certain")) +
31 scale_fill_manual(values = c("white", "black"), guide = FALSE) +
32 scale_linetype_manual(values = c("dashed", "solid")) +
33 theme_classic() +
34 labs(linetype = "L1:") +
35 theme(legend.position = c(0.9,0.8),
36 legend.background = element_rect(color = "black"),
37 axis.text.x = element_text(angle = 25, hjust = 1))
38
39 # ggsave(file = "figures/ord-predictions.jpg", width = 7, height = 4, dpi = 1000)
• Just like linear and logistic regressions, we can use the predict() function to
check the intuition behind the output of the model and to plot predicted
probabilities for new (or existing) data.
• The code in Appendix F demonstrates how to run two simultaneous
ordinal models, one for each language group—you can adapt the code to
run linear and logistic models as well.
8.4 Exercises
The resulting figure will be better aligned with fit_clm1. Finally, change the
font family to Times. You will need to adjust two layers of the figure. This is
because geom_text() is independent and requires its own font specification
if you want the percentages inside the bars to use Times as well—you should
add family = “Times” to geom_text(). To change the font family for
everything else in the figure, check chapter 5.
2. In this chapter, we ran our models on a 3-point scale, but our dataset also has
a 6-point scale (Certainty column), which is its raw/original ordinal
response variable. By doing this, we simplified or compressed our scale (from
6 to 3 points). Run a new model, fit_clm3, which will be equivalent to
fit_clm2. This time, however, use Certainty as the response variable. How
do the results compare (estimates and p-values) to those in fit_clm2?
3. Create a figure showing the estimates for fit_clm2 using the sjPlot package.
Notes
1. Another option would be to use probit models, which we won’t discuss in this book.
2. See also the polr() function from the MASS package (Venables and Ripley 2002).
3. Appendix F demonstrates how to run two simultaneous models, one for each lan-
guage group, and print a table comparing estimates from both models.
4. So while a negative sign indicates a shift to the left on our scale (Not certain), a pos-
itive sign indicates a shift to the right (Certain).
5. Discrepancies are due to rounding in the manual calculation.
6. This is usually what we want. Most of the time our questions focus on whether we go
up or down our scale as a whole; that’s why you normally don’t need to interpret the
τ̂ values estimated.
7. We have already loaded the scales package in plotsCatModels.R, which you will
need to run the code.
9
HIERARCHICAL MODELS
9.1 Introduction
So far we have examined three types of models: linear, logistic, and ordinal
regressions—all of which are part of the same family of models (generalized
linear models). In chapter 6, for example, we predicted the score of our partic-
ipants on the basis of two variables, namely, Hours and Feedback. One impor-
tant problem of all the models we ran in chapters 6, 7, and 8 is that they all
assume that data points are independent. In reality, however, data points are
rarely independent.
Take our longFeedback dataset (created in dataPrepLinearModels.R).
There, each participant either belonged to the Recast group or to the Explicit
correction group. Each participant completed two tasks, and each task had five
assessments throughout the semester. As a result, every single participant in our
hypothetical study completed ten assignments in total. The scores for these ten
assignments are clearly not independent, since they all come from the same par-
ticipant. Likewise, the five assignments from each task are not independent
either, since all five come from the same task—this grouping is illustrated in
Fig. 9.1. We could deal with this lack of independence by averaging the
scores for all five assignments given earlier (per task) or by averaging all ten
assignments, such that each participant would have one (mean) score. The
issue is that every time we take the average of different values we lose important
information.
In short, our models so far completely ignore that our observations come
from a grouped structure, for example, Fig. 9.1. Consequently, they are
blind to the fact that participants (or groups of participants) behave differently
or that experimental items may elicit different types of responses. You can see
[Figure 9.1: The grouped structure of our hypothetical feedback study—feedback groups crossed with tasks A and B; assignments A1–A5 nested within each task]
why that’s a major issue in studies with human subjects such as those we carry
out in second language research.
Before we continue, it’s essential to understand what a “grouped structure”
means, which brings us to the difference between nested and crossed (i.e., non-
nested) factors. Observe Fig. 9.1, which illustrates the design of our hypothet-
ical study on the effects of feedback on learning. Here, both Explicit correction
and Recast have two tasks, namely, tasks A and B. Task A is the same regardless
of which feedback group we consider. Note that the arrows from the feedback
groups to the tasks are crossing. In this design, Task is crossed with Feedback:
participants in both feedback groups will see the same tasks.
Now look at the participants’ IDs at the top and the tasks at the bottom of
Fig. 9.1. Each task has five assignments (the Item column in longFeedback), A1
to A5. But the assignments from task A are different from the assignments from
task B (even though they have the same label across tasks). Therefore, Item is
nested inside Task. Likewise, our participant variable, ID, is nested inside Feed-
back: a given participant could only be part of one feedback group. As we con-
sider grouped factors and how to account for them in our models, it’s
extremely important to fully understand not only our data but also the study
design in question.
Take the participant whose ID is Learner_21, a 20-year-old male (see
longFeedback). This participant is in the Explicit correction group, as we
can see in Fig. 9.1, so by definition he is not in the Recast group, given the
nested design between Feedback and ID. However, he completed both tasks
and all their respective assignments just like any other participant in the
Recast group, given the crossed design between Feedback and Task.
To deal with grouped data, we could employ repeated measures analysis of
variance (ANOVAs), which are very commonly used in the literature. In this
book, however, we won’t do that. Instead, we will explore a more powerful
method, namely, hierarchical models, also known as multilevel models. Unlike
repeated measures ANOVAs, these models allow for multiple grouping vari-
ables, and they can also handle unbalanced data quite well.
y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i \qquad (9.1)
The model in 9.1 ignores the complex structure of our data and the variation
coming from different participants, items, and so on. Such a model can give us
estimates for a given predictor b^ that may be incorrect once we take into
account how much participants and items vary in our data. In particular,
Type I or Type II errors become more likely: we could reject the null hypoth-
esis when it’s in fact true (Type I error), or we could fail to reject the null
hypothesis when it’s in fact false (Type II error).
If we assume that the model in question is defined as Score ~ Task, then
our intercept represents the average score for task A, and our estimate repre-
sents the slope from task A to task B (i.e., the difference between the two
tasks). This model fit is shown in Fig. 9.2a—all three plots in Fig. 9.2
exhibit the empirical patterns in longFeedback. We can improve our model
In 9.2, we have a new term, αj[i], which works as our by-participant intercept.
Here, j represents ID, and j[i] picks out the participant responsible for the ith
observation. Thus, αj[i] represents how far from β0 each participant is—that is,
it is the offset of each participant relative to β0. In the visual representation of
this type of model shown in Fig. 9.2b, every participant has an individual
intercept, but the slope of all the lines is the same, so all the lines are parallel
(i.e., we assume that the effect of task B is constant across participants).
Finally, we can also make our slopes vary by participant. This is shown
in 9.3, where we have one offset for the intercept and one offset for our pre-
dictor estimate, γj[i]. As we can see in Fig. 9.2c, participants seem to vary
in both intercept and slope; both kinds of variation are accounted for in
model 9.3.
We will see how to run hierarchical models in the next section, as usual, but
for now let’s spend some more time going over Fig. 9.2. As already mentioned,
when we run a model, we are mostly interested in our fixed effects—here,
Task is a fixed effect, while our varying intercept and slope are our random
effects. However, individual variation is a key characteristic of data in second
language studies. This is yet another reason that we should favor hierarchical
models over simple (non-hierarchical) models.
Did you notice that the solid thick line in Fig. 9.2 is essentially the same
across all three figures? In other words, our fixed effects here, β̂0 and β̂1, are
basically the same whether or not we add random effects to our model. This
is because the data in longFeedback is balanced, and no one participant
behaves too differently relative to other participants—which is often not the
case in second language research. If we had unbalanced data, or individual
participants whose behavior was too different from other participants, the esti-
mates from our hierarchical models in Fig. 9.2 would look different compared
to our non-hierarchical model.
While fixed effects might not change in a hierarchical model, the standard
errors estimated for these fixed effects almost always look different from
those in a non-hierarchical model. Intuitively, this makes sense: the more we
take into account the variation in our data, the more precise our estimates
will be. Precision can mean that our standard errors will be larger, which
will reduce Type I errors; it can also mean that our standard errors will be
smaller, which will reduce Type II errors (see, for example, discussion on
fit_lmer1 later). In general, we are more worried about the former.
Before we see hierarchical models in action, it’s important to understand the
structure of our dataset here. Notice that Fig. 9.2c has by-participant random
intercepts and slopes (for Task). That structure is possible because every partic-
ipant completed the same tasks (crossed design in Fig. 9.1). If we had a nested
design for tasks, we wouldn’t be able to run such a model. For example, our
Feedback variable cannot be specified as a random effect by participant,
because no participant actually tried both feedback styles! As you can see, to
think about random effects we have to understand the design of our study
and the structure of our data.
Finally, Table 9.1 lists three different model specifications in R. Model A is a
typical linear model predicting participants’ scores as a function of Task—this
should be quite familiar at this point. You can think of model A as our “base-
line”, since it’s a simple non-hierarchical model. Model B includes a varying
intercept for each participant and a varying intercept for each item (assign-
ment)—“1” there simply means “intercept”. This model takes into account
the fact that we expect to see variation across participants and also across assign-
ments in our study. Finally, model C includes a varying slope for Task—this is
essentially what you see in Fig. 9.2c, where both intercepts and slopes are
allowed to vary by participant. The most important difference between
models B and C is that model C assumes that the effect of task will be different
for different participants—which is a very reasonable assumption.
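In code, the three specifications described in Table 9.1 look roughly as follows—a sketch assuming lme4 and our longFeedback tibble (the object names mA–mC are mine):

R code

library(lme4)

# Model A: simple (non-hierarchical) baseline:
mA = lm(Score ~ Task, data = longFeedback)

# Model B: varying intercepts by participant (ID) and by item:
mB = lmer(Score ~ Task + (1 | ID) + (1 | Item), data = longFeedback)

# Model C: varying intercepts plus by-participant slopes for Task:
mC = lmer(Score ~ Task + (1 + Task | ID) + (1 | Item), data = longFeedback)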
The random slope that we see in model C in Table 9.1 reflects the crossed
structure of our data, where every participant took part in both tasks. Again: we
wouldn’t be able to add a by-participant random slope for Feedback, given that
participants were never in both feedback groups in our hypothetical study—we
could, however, add a by-item (assignment) random slope for Feedback (see
Fig. 9.1). We will discuss model structures in more detail in the next
section, where we will run different hierarchical models.
Finally, notice that all the models we have discussed so far can be easily
adapted to be hierarchical. The syntax in Table 9.1 works consistently for all
the linear, logistic, and ordinal regressions discussed in this book: you can
run hierarchical logistic regressions using the glmer() function in the lme4
package1 and hierarchical ordinal regressions using the clmm() function in
the ordinal package. Running hierarchical versions of our models in R is actu-
ally quite easy, i.e., the code is nearly identical to what we have already seen.
The tricky part is not the code, but the statistical concept behind it, and which
random effects to specify in our model given the data.
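For example, hierarchical versions of the models from chapters 7 and 8 could be specified along the following lines. This is only a sketch: the response and grouping columns assumed here (Low, Certainty, ID) may differ from your actual dataset, so adjust them accordingly:

R code

library(lme4)     # glmer() for hierarchical logistic regressions
library(ordinal)  # clmm() for hierarchical ordinal regressions

# Hierarchical logistic regression (binary response):
fit_glmer = glmer(Low ~ Condition + (1 | ID),
                  data = rc, family = binomial)

# Hierarchical ordinal regression (ordered response):
fit_clmm = clmm(Certainty ~ Condition + (1 | ID), data = rc)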
R code
1 library(tidyverse)
2
3 # Read file as tibble:
4 feedback = read_csv("feedbackData.csv")
5
6 # Wide-to-long transform for models:
7 longFeedback = feedback %>%
8 pivot_longer(names_to = "Task",
9 values_to = "Score",
10 cols = task_A1:task_B5) %>%
11 separate(col = Task, into = c("Task", "Item"), sep = 6) %>%
12 mutate(Task = ifelse(Task == "task_A", "Task A", "Task B")) %>%
13 mutate_if(is.character, as.factor)
In code block 47, line 12 renames the levels of Task, and line 13 simply
transforms all character variables into factor variables (this was already in
code block 11, line 9). Notice that the code also loads tidyverse (line 1), as
usual. Now we’re ready to analyze our data.
R code
1 source("dataPrepLinearModels.R")
2
3 ggplot(data = longFeedback, aes(x = Feedback, y = Score)) +
4 stat_summary(aes(group = Task), position = position_dodge(width = 0.5)) +
5 stat_summary(fun = mean, shape = 4) +
6 stat_summary(aes(group = Task, linetype = Task),
7 geom = "line", position = position_dodge(width = 0.5)) +
8 theme_classic() +
9 theme(legend.position = "top") +
10 labs(linetype = "Task:")
11
12 # ggsave(file = "figures/scores-feedback-task.jpg", width = 4, height = 2.5, dpi = 1000)
R code
1 source("dataPrepLinearModels.R")
2 library(lme4)
3 library(arm)
4 library(MuMIn) # To calculate R-Squared for hierarchical models
5
6 fit_lm0 = lm(Score ~ Feedback + Task, data = longFeedback)
7 fit_lmer1 = lmer(Score ~ Feedback + Task + (1 | ID) + (1 | Item), data = longFeedback)
8
9 display(fit_lm0)
10
11 # lm(formula = Score ~ Feedback + Task, data = longFeedback)
12 # coef.est coef.se
13 # (Intercept) 73.75 0.67
14 # FeedbackRecast 2.89 0.77
15 # TaskTask B -1.17 0.77
16 # ---
17 # n = 600, k = 3
18 # residual sd = 9.47, R-Squared = 0.03
19
20 display(fit_lmer1)
21
22 # lmer(formula = Score ~ Feedback + Task + (1 | ID) + (1 | Item),
23 # data = longFeedback)
24 # coef.est coef.se
25 # (Intercept) 73.75 3.58
26 # FeedbackRecast 2.89 1.01
27 # TaskTask B -1.17 0.44
28 #
29 # Error terms:
30 # Groups Name Std.Dev.
31 # ID (Intercept) 3.51
32 # Item (Intercept) 7.83
33 # Residual 5.33
34 # ---
35 # number of obs: 600, groups: ID, 60; Item, 5
36 # AIC = 3840.4, DIC = 3841
37 # deviance = 3834.7
CODE BLOCK 49 Running a Model with Random Intercepts: Feedback and Task
Code block 49 fits both models and prints their outputs in the same code block. Because both outputs are printed using display(), not
summary(), you won’t see p-values. Later I discuss why we will now stop
looking at p-values. You should add code block 49 to hierModels.R.
First, in lines 11–18 we see the output for fit_lm0, a non-hierarchical linear
regression. We have two columns: coef.est (the estimates of our predictors) and
coef.se (their standard errors). We could calculate confidence intervals with
these two columns, and if our interval includes zero, that would be equivalent
to a non-significant p-value (§1.3.4). Alternatively, we can simply run
confint(fit_lm0)—recall that these two methods are not identical (chapter 6).
FeedbackRecast also has a reliable positive estimate, but let’s focus on
TaskTask B. There, the effect is −1.17 and the standard error is 0.77.
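You can reproduce the approximate 95% interval by hand from these two columns (estimate ± 1.96 × SE):

R code

# Approximate 95% CI for TaskTask B in fit_lm0:
-1.17 + c(-1.96, 1.96) * 0.77
# [1] -2.6792  0.3392  -> includes zero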
fit_lmer1 adds a random intercept by participant, (1 | ID), and a random
intercept by item, (1 | Item)—so this model has two α terms as well as two
β terms (and the intercept, β0). Note that now our output for fit_lmer1
comes in two parts: the first part can be found in lines 24–27, our fixed
effects—this is what we care about. Then we have the error terms (or random
effects) in lines 29–33. Let’s focus on the fixed effects first.
The fixed effects for fit_lmer1 are the same as those we discussed earlier for
fit_lm0—indeed, we interpret the model estimates the same way we interpret
our estimates back in chapter 6. However, pay attention to the standard errors,
as they’re very different. First, the standard error for our intercept went from
0.67 to 3.58, and the standard error for FeedbackRecast went from 0.77 to
1.01. Second, the standard error for our task effect went from 0.77 in
fit_lm0 to 0.44 in fit_lmer1. Because the effect sizes of our intercept and
FeedbackRecast are large relative to their SEs, the increase in their SE is not
sufficient to invalidate their statistical effect (i.e., t > 1.96). On the other
hand, by accounting for the variation coming from participants and items,
we now have a statistically significant effect of Task. If you calculate the
95% confidence interval for TaskTask B here, you will notice that it
no longer includes zero, so here we would reject the null hypothesis. Impor-
tantly, rejecting the null here makes more intuitive sense given what we see
in Fig. 9.3—and is in fact the right conclusion, since the data in question
was simulated and the two tasks do have different mean scores. The bottom
line here is that fit_lm0 led us to a Type II error, and fit_lmer1 didn’t.
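Here’s the same quick check for fit_lmer1 (the profile intervals from confint() will differ slightly):

R code

# Approximate 95% CI for TaskTask B in fit_lmer1:
-1.17 + c(-1.96, 1.96) * 0.44
# [1] -2.0324 -0.3076  -> excludes zero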
Now let’s inspect our error terms, shown in lines 29–33. It will be helpful
now to look back at Fig. 9.2b. The thick black line represents the intercept
(β̂0) and slope (β̂) for Task in our model, shown in lines 25 and 27 (two of
our fixed effects; Feedback is not shown in the figure). Line 31, in turn,
shows us the estimated by-participant standard deviation (σ̂ = 3.51) from β̂0,
that is, how much by-participant intercepts deviate from the intercept (grand
mean; β̂0 = 73.75). Random intercepts are assumed to follow a Gaussian
(normal) distribution, so we can use β̂0 and σ̂ to calculate the predicted 95%
interval for the participants’ intercepts: 73.75 ± (1.96 × 3.51) = [66.87, 80.63]—
alternatively, simply run confint(), which will yield almost identical values
here. Likewise, line 32 tells us how much scores deviated from each item (i.e.,
assessment within each task) relative to the mean score for all items (we don’t
have this information in the figure).
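That interval calculation is easy to verify in R:

R code

# Predicted 95% interval for by-participant intercepts
# (grand mean 73.75; by-ID standard deviation 3.51):
73.75 + c(-1.96, 1.96) * 3.51
# [1] 66.8704 80.6296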
Another important difference between our outputs is the estimated standard
deviation of the residual, or σ̂e (e = error). For fit_lm0, it’s 9.47 (line 18); for
fit_lmer1, it’s 5.33 (line 33). The intuition is simple: because our model is now
including variation by participant and by item, some of the error (residual) is
accounted for, and we end up with a lower σ̂e—inspect Fig. 9.2 again, and
go back to the discussion on residuals in chapter 6 if you don’t remember
what residuals are.
At the end of code block 49, line 36 gives us AIC and DIC (deviance infor-
mation criterion) values. We already discussed AIC values in chapter 7, and we
will use them soon to compare different model fits.
WHERE’S R2? You may have noticed that while the output for fit_lm0 in
code block 49 prints an R2 (line 18; R-Squared = 0.03), we don’t have an
R2 value for our hierarchical model, fit_lmer1. We can generate R2 values
for hierarchical models by installing and loading the package MuMIn (Bartoń
2020) and then running r.squaredGLMM(fit_lmer1), which here will give
us two values (cf. fit_lm0): R2m = 0.02 and R2c = 0.73 (remember that
you can always run ?r.squaredGLMM to read about the function in R).2
Here, the first value refers to our fixed effects (marginal R2), while the
second refers to the model as a whole (conditional R2). Simply put, our
fixed effects describe roughly 2% of the variance in the data, and our model
describes over 70% of the variance observed.
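In code (values rounded here to match the discussion above):

R code

library(MuMIn)

r.squaredGLMM(fit_lmer1)
#       R2m   R2c
# [1,] 0.02  0.73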
WHERE ARE THE P-VALUES? We have already used display() to print model
outputs before (e.g., code block 26), so this is not the first time we don’t see
p-values in an output. What’s different this time, however, is that even if
you try running summary(fit_lmer1), you will not get p-values from
hierarchical models in lme4. There are ways to “force” p-values to be
printed, of course (e.g., lmerTest package),3 but there’s a good reason to
avoid them—that’s why the authors of lme4 don’t include them by default
in our outputs. The problem here is that it’s not obvious how to calculate
degrees of freedom once we have a hierarchical model, so we won’t bother
with exact p-values—I want you to stop focusing on them anyway and focus
instead on estimates and standard errors, which allow us to calculate confidence
intervals. An easy (and better) solution is to ignore p-values and to calculate and
report confidence intervals with the confint() function. You can read more
about this topic in Luke (2017) or Kuznetsova et al. (2017), or see https://
stat.ethz.ch/pipermail/r-help/2006-May/094765.html—a post by Douglas
Bates, one of the authors of lme4. Soon, in chapter 10, we won’t even
discuss p-values, so now’s a good time to get used to their absence. Let us
now turn to our second hierarchical model—we’ll see how to report our
results shortly.
We can now draw a visual comparison between our models and Fig. 9.2: whereas
fit_lmer1 mirrors Fig. 9.2b, fit_lmer2 mirrors Fig. 9.2c (and, of course, fit_lm0
mirrors Fig. 9.2a).
First, let’s focus on our fixed effects. Note that the estimates are very similar,
which shouldn’t be surprising given what we discussed earlier. But look closely
at the standard error of the estimates. For example, in fit_lmer1, the SE for
TaskTask B was 0.44. Now, in fit_lmer2, it’s 0.54—it went up. You can run
confint(fit_lmer2) to calculate the 95% confidence intervals of our estimates,
and you will see that Task here still doesn’t include zero (R may take some
seconds to calculate intervals).
Next, let’s look at our random effects (lines 18–23). Lines 20 and 22 are
already familiar from our discussion about fit_lmer1: they simply tell us how
much the by-participant and by-item intercepts vary relative to their respective
means. What’s new here is line 21, TaskTask B. This line has two values,
namely, Std.Dev. and Corr. The first number indicates how much by-
participant slopes vary relative to the main slope (β̂ = −1.17) in our fixed
effects, which represents the mean slope for Task for all participants.
Like before, we can calculate the predicted 95% interval for by-participant
slopes. Here, our mean is β̂ = −1.17, and our standard deviation for the
slope of Task is σ̂ = 2.68. As a result, 95% of our participants’ slopes are pre-
dicted to fall between −6.42 and 4.08. Line 21 also shows the predicted corre-
lation between the intercepts and the slopes in the model (−0.11). The negative
number indicates that higher intercepts tended to have lower slopes. Finally,
note that in fit_lmer1 the standard deviation of our by-participant random
intercept was 3.51; in fit_lmer2, it’s 3.43, so our model with a by-participant
random slope has reduced the unexplained variation between participants.
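As before, the interval is easy to verify:

R code

# Predicted 95% interval for by-participant slopes of Task
# (mean slope -1.17; slope standard deviation 2.68):
-1.17 + c(-1.96, 1.96) * 2.68
# [1] -6.4228  4.0828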
How many parameters does fit_lmer2 estimate? The answer is seven. We
have three fixed effects (β̂0, β̂1, β̂2), three random effects (σ̂ for two random
intercepts and one random slope), and an estimated correlation (ρ) between
the by-participant random intercepts and the by-participant random slopes for
Task.4
Here’s an easy way to actually see our random effects. Code block 51 extracts
by-participant and by-item effects from fit_lmer2. First, we create two variables
(lines 2 and 3), one for participant effects (byID), and one for item effects
(byItem). Let’s focus on the former, since that’s more relevant to us here.
Lines 6–12 show you the top six rows of participant-level random effects.
Notice how both columns are different for each participant—this makes
sense, since we have varying intercepts and slopes in our model.
Look back at our model’s estimates in code block 50. Our intercept is 73.74,
and the effects of Feedback and Task are, respectively, β̂ = 2.91 and
β̂ = −1.17. Now take Learner_1. This participant’s intercept is −2.42,
which indicates the difference between this particular participant’s intercept and
R code
1 fit_lmer2 = lmer(Score ~ Feedback + Task +
2 (1 + Task | ID) +
3 (1 | Item),
4 data = longFeedback)
5
6 display(fit_lmer2)
7
8 # lmer(formula = Score ~ Feedback + Task +
9 # (1 + Task | ID) +
10 # (1 | Item),
11 # data = longFeedback)
12
13 # coef.est coef.se
14 # (Intercept) 73.74 3.58
15 # FeedbackRecast 2.91 0.99
16 # TaskTask B -1.17 0.54
17 #
18 # Error terms:
19 # Groups Name Std.Dev. Corr
20 # ID (Intercept) 3.43
21 # TaskTask B 2.68 -0.11
22 # Item (Intercept) 7.83
23 # Residual 5.14
24 # ---
25 # number of obs: 600, groups: ID, 60; Item, 5
26 # AIC = 3834.9, DIC = 3832.2
27 # deviance = 3825.6
28
29 r.squaredGLMM(fit_lmer2)
30 # R2m R2c
31 # 0.02356541 0.7469306
CODE BLOCK 50 Running a Model with Random Intercepts and Slopes: Feedback
and Task
our main intercept. The offset here is negative, which means this participant’s
score for task A is below average. We have a negative slope too, so this partic-
ipant’s slope is also below average.
Let’s use Learner_1, who was in the Explicit correction group, to plug in the
numbers in our model—this is shown in 9.4. It may be useful to go back to the
output of fit_lmer2 in code block 50. Recall that our fixed effects are repre-
sented by β̂ and our random effects are represented by α̂ (intercept) and γ̂
(slope). In 9.4, β̂1 is our estimate for Feedback and β̂2 is our estimate for
Task. Crucially, because our model has a by-item random intercept, we also
have to be specific about which item (assignment) we want to predict scores
for. Let’s assume Item = 3, so we’ll be predicting the score for the participant’s
third assignment in task B. Because Feedback = Explicit correction, we set x1
to zero; and since Task = Task B, we set x2 to 1.
R code
1 # Check random effects for fit_lmer2:
2 byID = lme4::ranef(fit_lmer2)$ID
3 byItem = lme4::ranef(fit_lmer2)$Item
4
5 head(byID)
6 # (Intercept) TaskTask B
7 # Learner_1 -2.420820 -1.3072026
8 # Learner_10 1.256462 -1.9007619
9 # Learner_11 0.492263 -2.5075564
10 # Learner_12 1.209074 2.5266376
11 # Learner_13 -3.898236 -3.1992087
12 # Learner_14 -2.579415 0.8593917
13
14 head(byItem)
15 # (Intercept)
16 # 1 -9.5335424
17 # 2 -5.3627110
18 # 3 -0.1780263
19 # 4 5.2017898
20 # 5 9.8724899
It’s worth examining 9.4 carefully to fully understand how our model works.
Here’s a recap of what every term means: β̂0 is our intercept, as usual; α̂j[1] is our
random intercept for the participant in question (j = ID, and 1 = Learner_1);
α̂k[3] is our random intercept for the item (assignment) in question (k = Item,
and 3 = third item); β̂1 is Feedback = Recast; β̂2 is Task = Task B; and,
finally, γ̂j[1] is the random slope for Task for Learner_1. All the numbers we
need are shown in code blocks 50 and 51.
The predicted score for the conditions in question is 68.67, which is not too
far from the participant’s actual score, 68.2—you can double-check this entire
calculation by running code block 52. The mean score for the explicit correc-
tion group is 73.2—represented by “x” in Fig. 9.3. Therefore, the predicted
score for Learner_1 is a little closer to the group mean relative to the partici-
pant’s actual score.
\hat{y}_1 = \hat{\beta}_0 + \hat{\alpha}_{j[1]} + \hat{\alpha}_{k[3]} + \hat{\beta}_1 x_{11} + (\hat{\beta}_2 + \hat{\gamma}_{j[1]}) x_{21} + \hat{\epsilon}_1
\hat{y}_1 = 73.74 + (-2.42) + (-0.17) + 2.91 \cdot 0 + (-1.17 + (-1.31)) \cdot 1 \qquad (9.4)
\hat{y}_1 = 68.67 \qquad \text{Actual score: } 68.2
What we just saw for Learner_1 illustrates the concept of shrinkage, also
referred to as “pooling factor”. Hierarchical models will shrink individual
values towards the mean. The amount of shrinkage will depend on the
R code
1 predict(fit_lmer2,
2 newdata = tibble(Feedback = "Explicit correction",
3 Task = "Task B",
4 ID = "Learner_1",
5 Item = "3"))
group-level variance in the data (among other factors). The intuition here is
that by doing so, models are better able to generalize their predictions to
novel data—see, for example, Efron and Morris (1977). After all, our sample
of participants is random, and we shouldn’t consider their individual values
as accurate representations of the population as a whole—this is tied to our dis-
cussion on overfitting models in chapter 7.
To visualize what shrinkage does to our predicted values, let’s inspect
Fig. 9.4. First, Fig. 9.4a plots our actual data—the dashed black line represents
the scores for Learner_1, assignment 3 (Item = 3), for both tasks.5 Now look at
Fig. 9.4b, which plots the estimated scores from fit_lmer2. Notice how all the
gray lines, which represent individual participants, are closer to the grand mean
(thick solid line). The dashed line representing Learner_1 is significantly
affected here, as it is pulled towards the overall trend (thick line). Why is
there so much shrinkage here? The answer is simple: because we’re looking
at a single assignment score from a single participant, our sample size here is
minuscule (n = 1), and therefore its impact on the overall trend is very
small. Simply put, such a minuscule sample size is not strong enough to
avoid a lot of shrinkage.
Finally, let’s compare fit_lmer1 and fit_lmer2 to see whether adding a
random slope actually improves the model in question. We can do that by
using the anova() function, which we discussed back in §6.3. If you run
anova(fit_lmer1, fit_lmer2), you will see that fit_lmer2 is statistically better
(e.g., its AIC is lower than that of fit_lmer1). Therefore, between fit_lmer1
and fit_lmer2, we would report fit_lmer2.
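The comparison itself is a single line (note that anova() will refit the models for the comparison, which may take a moment):

R code

# Compare the two hierarchical fits (lower AIC = better):
anova(fit_lmer1, fit_lmer2)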
REPORTING RESULTS
To analyze the effect of feedback and task, we ran a hierarchical linear
model with by-participant random intercepts and slopes (for task), and
by-item random intercepts. Results show that both feedback
(β̂ = 2.91, SE = 0.99) and task (β̂ = −1.17, SE = 0.54) have an effect on
participants’ scores—see Table 9.2.
[Table 9.2: estimates for Intercept, Feedback (Recast), and Task (Task B)]
FIGURE 9.5 Plotting Estimates for Fixed and Random (by-Participant) Effects
As you can see, nothing substantial changes on the surface: we interpret and
report our estimates the same way. This should make sense, since a hierarchi-
cal linear regression is a linear regression after all. Likewise, you can run hier-
archical logistic regressions (glmer() function in the lme4 package) and
hierarchical ordinal regressions (clmm() function in the ordinal package), as
already mentioned. The take-home message is this: once you understand
logistic regressions and hierarchical models, you will also understand a hierar-
chical version of a logistic regression—so everything we’ve discussed thus far is
cumulative.
Given all the advantages of hierarchical models, it’s safe to say that you
should virtually always use them instead of non-hierarchical models—
hopefully the discussion around Fig. 9.2 made that clear (see, e.g., discussion
in McElreath 2020). Hierarchical models are more complex, but that’s not typ-
ically a problem for computers these days. Your role as the researcher is to
decide what you should include in your model (both as fixed effects and as
random effects, recall the model comparison earlier using anova()). Ulti-
mately, your decisions will be guided by your own research questions, by
your theoretical interests, by the structure of your data, and by the computa-
tional complexity involved in the model itself—see, for example, Barr et al.
(2013) and Bates et al. (2015a).
If your model is too simplistic (random intercepts only), it will run very
quickly and you will likely not have any convergence issues or warning mes-
sages. However, because your model is too simple, Type I (or Type II) error
is likely. On the other hand, if your model is too complex, it may take a
while to run, and Type II error is a potential problem—that is, being too con-
servative makes you lose your statistical power. On top of that, complex
hierarchical models can fail to converge—see, for example, Winter (2019,
pp. 265–266). For that reason, if you start with a complex model, you may
have to simplify your model until it converges—you will often see people
reporting their “maximal converging model”, which basically means they are
using the most complex model that converged. Fortunately, the R community
R code
1 # Extract random effects:
2 rand = tibble(ID = row.names(coef(fit_lmer2)$ID),
3 Intercept = coef(fit_lmer2)$ID[[1]],
4 TaskB = coef(fit_lmer2)$ID[[3]])
5
6 rand = rand %>%
7 pivot_longer(names_to = "Term",
8 values_to = "Estimate",
9 cols = Intercept:TaskB) %>%
10 mutate(Term = factor(Term,
11 labels = c("Intercept", "Task (Task B)")))
12
13 # Extract fixed effects:
14 FEs = fixef(fit_lmer2)
15
16 # Extract lower and upper bounds for 95% CI:
17 CIs = confint(fit_lmer2)
18
19 # Combining everything:
20 fixed = tibble(Term = row.names(CIs[c(6,7,8),]),
21 Estimate = FEs,
22 Lower = c(CIs[c(6,7,8), c(1)][[1]],
23 CIs[c(6,7,8), c(1)][[2]],
24 CIs[c(6,7,8), c(1)][[3]]),
25 Upper = c(CIs[c(6,7,8), c(2)][[1]],
26 CIs[c(6,7,8), c(2)][[2]],
27 CIs[c(6,7,8), c(2)][[3]]))
28
29 # Change labels of factor levels:
30 fixed = fixed %>%
31 mutate(Term = factor(Term, labels = c("Intercept", "Feedback (Recast)",
32 "Task (Task B)")))
online is so huge that you often find effective help right away simply by pasting
error messages into Google.
R code
1 library(cowplot) # to combine multiple plots (see below)
2 # Plot effects
3 # Plot intercept:
4 estimates_intercept = ggplot(data = fixed %>%
5 filter(Term == "Intercept"),
6 aes(x = Term, y = Estimate)) +
7 geom_pointrange(aes(ymin = Lower, ymax = Upper), size = 0.3) +
8 geom_jitter(data = rand %>% filter(Term == "Intercept"), shape = 21,
9 width = 0.1, alpha = 0.1,
10 size = 2) +
11 labs(x = NULL, y = expression(hat(beta))) +
12 coord_flip() +
13 scale_x_discrete(limits = c("Task (Task B)", "Feedback (Recast)", "Intercept")) +
14 geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.3) +
15 geom_label(aes(label = Term), position = position_nudge(x = -0.4)) +
16 theme_classic() +
17 theme(axis.text.y = element_blank(),
18 axis.ticks.y = element_blank())
19
20 # Plot slopes:
21 estimates_slopes = ggplot(data = fixed %>%
22 filter(Term != "Intercept"),
23 aes(x = Term, y = Estimate)) +
24 geom_pointrange(aes(ymin = Lower, ymax = Upper), size = 0.3) +
25 geom_jitter(data = rand %>% filter(Term != "Intercept"), shape = 21,
26 width = 0.1, alpha = 0.1,
27 size = 2) +
28 labs(x = NULL, y = expression(hat(beta))) +
29 coord_flip() +
30 scale_x_discrete(limits = c("Task (Task B)", "Feedback (Recast)", "Intercept")) +
31 geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.3) +
32 geom_label(aes(label = Term), position = position_nudge(x = 0.4)) +
33 theme_classic() +
34 theme(axis.text.y = element_blank(),
35 axis.ticks.y = element_blank())
36
37 # Combine both plots (this requires the cowplot package):
38 plot_grid(estimates_intercept, estimates_slopes)
39
40 # ggsave(file = "figures/estimates.jpg", width = 7, height = 2.5, dpi = 1000)
If you wish to learn more about regression models, or if you are interested in
more advanced topics related to them, you should definitely consult sources
such as Gelman and Hill (2006), who offer comprehensive coverage of regres-
sion models—you should also consult Barr et al. (2013) on random effects in
regression models. A more user-friendly option would be Sonderegger et al.
(2018) or Winter (2019, ch. 14–15)—who also offers some great reading rec-
ommendations. Additionally, Cunnings (2012; and references therein) provides
a brief overview of mixed-effects models in the context of second language
research. Finally, see Matuschek et al. (2017) and references therein for a dis-
cussion on Type I and Type II errors in hierarchical models.
In the next chapter, we will still work with hierarchical regression models.
As a result, our discussion will continue—this time within the framework of
Bayesian statistics.
9.4 Summary
In this chapter we explored hierarchical models and their advantages. Thus far,
we have covered three types of generalized linear models, namely, linear, logis-
tic, and ordinal. All three models can be made hierarchical, so our discussion in
this chapter about hierarchical linear models can be naturally applied to logistic
and ordinal models in chapters 7 and 8, respectively. You now have all the basic
tools to deal with continuous, binary, and ordered response variables, which
cover the vast majority of data you will examine in second language research.
9.5 Exercises
2. Using the clmm() function, rerun fit_clm1 from chapter 8 with by-item
random intercepts and with by-speaker random slopes for Condition.
Call the model fit_clmm1. Print the output for the two models using
summary() (display() won’t work with clm models).
a) How do estimates and standard errors compare between the two
models? Are the estimates for ^t different?
b) Not all models that are complex will be necessarily better—especially if
the complexity is not supported by the data (hence the importance of
visualizing the patterns in our data prior to running models). Using the
anova() function, compare fit_clm1 and fit_clmm1. Does fit_clmm1
offer a better fit (statistically speaking)? Which model has the lower
AIC? Why do you think that is? On this topic, you may want to take
a look at discussion in Matuschek et al. (2017) for future reference.
3. Create a figure for the rc dataset that mirrors Fig. 4.3 in chapter 4—except
for the bars, which will be replaced by point ranges here. The figure will
only include Spanish speakers, so make sure you filter the data in the first
layer of ggplot()—as a result, only two facets will be present (cf. Fig. 4.3).
On the y-axis, add as.numeric(Certainty), since we will treat certainty as a
continuous variable here. On the x-axis, add Condition. Then use
stat_summary() to plot mean certainty and standard errors. You should
add by-speaker lines representing each participant’s mean certainty level.
How does the figure help you understand the previous question?
Notes
1. Note that we spell lme4 as L-M-E-4 (the first character is not a number).
2. You can use the package in question to extract R2 values from logistic models as well—
hierarchical or not. However, recall that linear and logistic models operate differently
(chapter 7); see Nakagawa and Schielzeth (2013). These are therefore “pseudo-R2”
values for logistic models.
3. Simply install and load the package, and then rerun your model. After that, print the
output using summary() and you will see p-values.
4. You can run the same model without the correlation parameter by using double pipes
(“||”) in the model specification: (1 + Task || ID)—but first you will need to create a
numeric version of Task. See Sonderegger et al. (2018, §7.8.2) for a discussion on
adding a correlation to hierarchical models.
5. Notice that every gray line represents a single participant’s score, aggregated from all
five assignments in each task. The dashed line for Learner_1, however, only considers
the scores for Item = 3.
6. You may or may not keep it there for your own models when you report them. Your
text should already be explicit about the structure of the model you are employing,
but I find that having the structure in the table as well can be useful for the reader.
10
GOING BAYESIAN
Thus far, all the models we have run in this book are based on Frequentist data
analysis. You may not be familiar with the term Frequentist, but every time you
see p-values you are looking at Frequentist statistics—so you are actually famil-
iar with the concept. But while p-values seem to be everywhere in our field,
not all statistical analyses must be Frequentist. In this chapter, we will
explore Bayesian statistics, a different approach to data analysis. There are
numerous differences between Frequentist and Bayesian inference, but you’ll
be happy to know that all the models we have run so far can also be run
using Bayesian statistics.
The first important difference between Frequentist and Bayesian statistics can
be contextualized in terms of probabilities and how we conceptualize them.
For example, what’s the probability that a given learner will choose low attach-
ment in our hypothetical relative clause study? A Frequentist would define
the probability based on several samples. He or she would collect data and
more data, and the long-term frequency of low attachment would reveal the
probability of interest. As a result, if after ten studies the average probability
of low attachment is 35%, then that’s the probability of a learner choosing
low attachment. For an orthodox Frequentist, only repeatable events have
probabilities.
A Bayesian would have a different answer to the question. He or she would
first posit an initial assumption about the true probability of choosing low
attachment. For example, knowing that Spanish is biased towards high attach-
ment, the probability of choosing low attachment is likely below 50%. Then
data would be collected. The conclusion of our Bayesian statistician would
be a combination of his or her prior assumption(s) and what the data (i.e.,
The intuition behind Bayes’s theorem is simple: the stronger our priors are,
the more data we need to see to change our posterior. Consider this: the more
we believe in something, the more evidence we require to change our minds (if
we blindly believe in something, no evidence can convince us that we’re
wrong). Here, however, our priors should be informed not by what we wish
to see, of course, but rather by the literature—see Gelman (2008a) for
common objections to priors in Bayesian inference.
Let’s see 10.1 in action—we will use our hypothetical relative clause study to
guide us here. First, we establish our prior probability that learners will choose
low attachment, let’s call that probability P(π)—π is our parameter of interest,
which goes from 0 to 1, as shown in Fig. 10.1. Because these learners are speak-
ers of Spanish, and because we know the literature, we assume that they are
more likely to choose high than low attachment, so we expect π < 0.5, or,
more explicitly, P(Low = 1) < 0.5. Notice that everything in Fig. 10.1 is a dis-
tribution (more specifically a beta distribution)4: we don’t see single-point esti-
mates in the figure.

FIGURE 10.1 Bayes’s Rule in Action: Prior, Likelihood, and Posterior Distributions
Given what we know about the topic, let’s assume that our prior probability
will peak at ≈ 0.18. This is like saying “we believe a priori that the most prob-
able value for π is approximately 0.18, that is, Spanish speakers will choose low
attachment 18% of the time. However, π follows a distribution, so other
values are also probable—they’re simply less probable relative to the peak of
the distribution”. We then run our experiment and collect some data.
What we observe, however, is that participants chose low attachment over
50% of the time: this is our likelihood line in Fig. 10.1. To be more specific,
we collected 13 data points, 5 Low and 8 High. What should we conclude
about π?
As you can see, our posterior in Fig. 10.1 is a compromise between our prior
and the data observed. We started out assuming that the most probable value
for π was 0.18. Our data, however, pointed to π = 0.55 as the most probable
value for π. Our posterior distribution then peaks at π = 0.41. Again: we are
talking about probability distributions, so when we say “the most probable
value for parameter π”, we are essentially talking about the peak of the
distribution.
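To see Bayes’s rule at work numerically, here’s a minimal grid-approximation sketch. The beta parameters for the prior are my own assumption (chosen so the prior peaks near 0.18), so the posterior peak will only roughly match the figures:

R code

# Candidate values for pi:
pi_grid = seq(0, 1, by = 0.001)

# A wide beta prior peaking around 0.18 (assumed shape):
prior = dbeta(pi_grid, 2.5, 7.7)

# Likelihood of 5 Low responses out of 13:
likelihood = dbinom(5, size = 13, prob = pi_grid)

# Posterior is proportional to prior x likelihood:
posterior = prior * likelihood
posterior = posterior / sum(posterior)  # normalize

pi_grid[which.max(posterior)]  # most credible value of pi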
Notice that our prior distribution in Fig. 10.1 is relatively wide. The width
of our distribution has an intuitive meaning: it represents our level of certainty.
Sure, we believe that Spanish learners will choose low attachment only 18% of
the time. But how sure are we? Well, not too sure, since our prior distribution
clearly extends from zero to over 50%. Intuitively, this is equivalent to saying
“we have some prior knowledge based on previous studies, but we are not
absolutely sure that 0.18 is the true value of π”.
Let’s see what happens if we are much more certain about our prior beliefs
regarding π. In Fig. 10.2, our prior distribution is much narrower than that of
Fig. 10.1—the peak of our prior distribution is now ≈ 0.23. Because we are
now more certain about our prior, notice how our posterior distribution is
much closer to the prior than it is to the likelihood distribution. Here, we
are so sure about our prior that we would need a lot of data for our posterior
to move away from our prior.
In Figs. 10.1 and 10.2, we assumed our data came from only 13 responses
(5 Low and 8 High). What happens if we collect ten times more data? Let’s
keep the same ratio of Low to High responses: 50 Low and 80 High. In
Fig. 10.3, you can see that our posterior is now much closer to the data
than it is to our prior (which here is identical to the prior in Fig. 10.1). The
intuition here is simple: given enough evidence, we have to reallocate our
credibility. In other words, after collecting data, we discovered that we were
wrong.
FIGURE 10.4 95% Highest Density Interval of Posterior from Fig. 10.1
FIGURE 10.5 Bayesian Models in R Using Stan
The solution to the problem is to sample values from the posterior distribu-
tion without actually solving the math—this is essentially what we will do in R.
Sampling is at the core of Bayesian parameter estimation, which means we need
an efficient sampling technique. One powerful sampling method is called
Markov chain Monte Carlo (MCMC). In this book, we will rely on a specific
MCMC method, called Hamiltonian Monte Carlo, or HMC. The details of these
techniques are fascinating (and complex), but we don’t need to get into them to
be able to run and interpret Bayesian models—see Kruschke (2015).
The language (or platform) we will use to run our models is called Stan (Car-
penter et al. 2017), named after Polish scientist Stanislaw Ulam.8 We won’t
actually use Stan directly (but you will be able to see what its syntax looks
like if you’re curious). Instead, we will use an R package called brms
(Bürkner 2018), which will “translate” our instructions into Stan. Fig. 10.5
illustrates the process: we will soon install and load brms, which itself loads
some dependent packages, such as rstan (Stan Development Team 2020)—
wait until §10.3 before installing brms.
The main advantage of brms is that it lets us run our models using the familiar
syntax from previous chapters. The model is then sent to Stan, which in turn
compiles and runs it. Finally, the output is printed in R. On the surface, this
will look very similar to the Frequentist models we ran in previous chapters—
in fact, if I hadn’t told you about it you would probably not even notice at
first sight that there’s something different happening. But below the surface a
lot is different.
You may be wondering how we can know whether the sampling (or our
model as a whole) worked. In other words, how can we inspect our model
to check whether it has found the appropriate posterior distribution for our
parameters of interest? When we sample from the posterior, we use multiple
chains. Our chains are like detectives looking for the parameter values (π)
that are most credible given the data. The detectives begin a random walk in
the parameter space, and as their walk progresses, they cluster more and
more around parameter values that are more likely given the data that we
are modeling. By having multiple chains, we can easily compare them at the
end of the process to see whether they have all arrived at (roughly) the same
values. This will be easy to interpret when we inspect our models later in this
chapter, when we will discuss model diagnostics in more detail.
Finally, let’s briefly examine what the models we ran in previous chapters
look like in a Bayesian framework. This will help you understand what is
going on once we run our models later in the chapter. Let’s pick the simplest
model we have run so far. In 10.2, you can see a linear regression with an inter-
cept and a single predictor, β1. This is the fundamental model/template we
have been using in this book (generalized linear models).
y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i \qquad (10.2)

y_i \sim \mathcal{N}(\mu_i, \sigma)
\mu_i = \beta_0 + \beta_1 x_{i1}
\beta_0 \sim \mathcal{N}(0, 100)
\beta_1 \sim \mathcal{N}(0, 100)
\sigma \sim \mathcal{U}(0.001, 1000) \qquad (10.3)
In 10.3, both β parameters are assumed to be normally distributed. Thus, β need
not refer to a continuous variable per se. We are also
specifying the standard deviation of the distributions.
Let’s stop for a second to contextualize our discussion thus far. In fit_lm1,
back in chapter 6, we ran Score ~ Hours, so it was a model with the same
structure as the model exemplified in the specification given in 10.3. Back in
chapter 6, the estimates for fit_lm1 were Intercept = 65 and Hours = 0.92.
In the model specified in 10.3, we are assuming the intercept has a prior dis-
tribution centered around 0. If we were fitting fit_lm1, we would be off by
a lot, since β̂0 = 65 in that model. However, notice that the standard deviation
assumed here is 100, which is considerably wide. Therefore, the model would
be relatively free to adjust the posterior distribution given the evidence (data)
collected.
Notice that here we are being explicit about the standard deviations for β0
and β1 (i.e., 100). But we could let the model estimate them instead by spec-
ifying different priors for these standard deviations. This is an important point
because in Frequentist models we are obligated to assume that variance will be
constant (homoscedasticity). In Bayesian models, on the other hand, that
doesn’t have to be the case.
Finally, the last line in 10.3 tells the model that we expect σ to follow a non-
committal uniform distribution (U). The simple model in 10.3 is estimating
three parameters, namely, β0, β1, and σ. Of these, we’re mostly interested in
β0 and β1. Once we run our model on some data, we will be able to display
the posterior distribution of the parameters being estimated. Fig. 10.6 illustrates
what the joint posterior distribution of β0 and β1 looks like—both posteriors
follow a normal distribution here. The taller the density in the figure, the
more credible the parameter value. Thus, the peak of the two distributions
represents the most likely values for β0 and β1 given the data, i.e., P(β̂0 | data)
and P(β̂1 | data). We will see this in action soon.10
One of the main advantages of specifying our own model is that we can
choose from a wide range of distributions to set our priors. Thus, we no
longer depend on assumptions imposed by the method(s) we choose. At the
same time, as we will see later, we can also let our model decide which priors to
use. This is especially useful for people who are new to Bayesian models and
who are not used to manually specifying models. Ultimately, however, if
you wish to maximize the advantages of a Bayesian model, you should consider
customizing priors based on previous research.
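As a preview of what’s ahead, custom priors in brms can be set with set_prior(). The sketch below mirrors the normal(0, 100) priors in 10.3; the model formula reuses the chapter 6 variables and is only illustrative:

R code

library(brms)

# Priors along the lines of the specification in 10.3:
priors = c(set_prior("normal(0, 100)", class = "Intercept"),
           set_prior("normal(0, 100)", class = "b"))

# A sketch of how these priors would be passed to brm():
# fit = brm(Score ~ Hours, data = feedback, prior = priors)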
To illustrate my point, allow me to use photography as an analogy. Profes-
sional cameras give you a lot of power, but they require some knowledge on
the part of the user. Fortunately, such cameras have an “auto” mode, which
allows you to basically point and shoot. Hopefully, once you start shooting
with a professional camera, you will slowly understand its intricacies, which
in turn will lead you to better explore all the power that the camera has to
offer. However, even if you decide to use the auto mode forever, you will
still take excellent photos. What we are doing here is very similar: Bayesian
models offer a lot, but you don’t have to be an expert to use them—the
“auto mode” we will see later gives you more than you expect (and more
than enough to considerably advance your data analytical techniques). Natu-
rally, if you decide to learn more about Bayesian statistics, you will get
better at customizing your model to your needs.
Another advantage of RData files is their ability to also save metadata. For
example, we often change the class of columns in our datasets. You may
have a character column that you transformed into an ordered factor to run
an ordinal model in your study (we did exactly that in dataPrepCatModels.
R). However, csv files have no “memory”, and the next time you load your
csv file, you will have to perform the same transformation—of course, you
will simply rerun the lines of code that do that for you, or source a script
where you prepare your data, which is what we have been doing. If you
save your tibble in an RData file, however, it will “remember" that your
column is an ordered factor, and you won’t have to transform it again.
At this point you should be convinced that RData files are extremely useful.
So how do we use them? To save objects in RData file, we use the save()
function (e.g., save(object1, object2, …, file = “myFiles.RData”)). The
file will be saved in your current working directory unless you specify some
other path. To load the file later on, we use the load() function (e.g., load
(“myFiles.RData”). That’s it. We will see RData in use soon, when we
save our Bayesian model fits.
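To see that “memory” in action, here’s a minimal sketch (the object and file names are hypothetical):

R code
library(tibble)

d = tibble(Certainty = c("low", "mid", "high"))
d$Certainty = factor(d$Certainty,
                     levels = c("low", "mid", "high"),
                     ordered = TRUE)

save(d, file = "myData.RData")  # saved to your working directory
rm(d)                           # remove d from the environment
load("myData.RData")            # d is back...
class(d$Certainty)              # ...and still "ordered" "factor"

A csv round trip (write_csv() followed by read_csv()) would instead hand you back a plain character column.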
As usual, you should also have a folder for figures called figures. Your Bayesian folder should have one R Project file, two csv files, two new scripts, and two old scripts that read and prepare our data files (in addition to the figures folder)—refer to Appendix D. As you run the code blocks that follow, more files will be added to your directory. Recall that you don’t have to follow this file organization, but it will be easier to reproduce all the steps here if you do (plus, you will already get used to an organized work environment within R using R Projects).
R code
1 library(brms)
2
3 source("dataPrepLinearModels.R")
4 fit_brm1 = brm(Score ~ Feedback + Task +
5 (1 + Task | ID) + (1 | Item),
6 data = longFeedback,
7 family = gaussian(),
8 save_model = "fit_brm1.stan", # this will add a file to your directory
9 seed = 6)
10
11 fit_brm1
12
13 # Population-Level Effects:
14 # Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
15 # Intercept 73.78 3.96 65.78 81.69 1.00 1349 1309
16 # FeedbackRecast 2.91 1.03 0.92 4.96 1.00 1784 2553
17 # TaskTaskB -1.18 0.53 -2.25 -0.12 1.00 4226 3272
18 #
19 # Family Specific Parameters:
20 # Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
21 # sigma 5.18 0.17 4.87 5.51 1.00 2547 2929
22
23 # save(fit_brm1, file = "fit_brm1.RData") # this will add a file to your directory
24 # load("fit_brm1.RData")
Notice the save_model argument in line 8 of code block 55 (not to be confused with the model fit itself, which is saved in line 23). What does that mean? Since brms is actually trans-
lating our instructions into Stan, we have the option to actually save those
instructions (written for Stan) as a separate text file (.stan extension), which
will be added to your working directory (Bayesian). You won’t need this
file here, but it is actually important for two reasons: first, you can later
open it using RStudio and inspect the actual code in Stan (which will allow
you to see how brms is specifying all the priors in the model, i.e., what the
“auto model” looks like). Once you do that, you will be very grateful that
we can use brms to compile the code for us—learning how to specify
models in Stan can be a daunting task. A second advantage of saving the
model specification is that we can later have direct access to the entire
model (this can be a nice learning tool as well). We won’t get into that in
this book, but see the reading suggestions at the end of the chapter.
Line 9 helps us reproduce the results. Each time you run a model, a random
number is used to start the random walk of the chains, so results will be slightly
different every time (remember that we’re sampling from the posterior). By
specifying a seed number (any number you want), you can later rerun the
model and reproduce the results more faithfully.
There are several additional (optional) arguments that we have not used in our brm() call in code block 55. For example, we can specify how many chains
we want (the default is four). We can also specify whether we want to use mul-
tiple cores in our computer to speed up the process. Importantly, we can also
specify the priors of our model. We will do that later. For now, we’re letting
brms decide for us (think of this as the “auto mode” mentioned earlier). If you
want to check other arguments for the brm() function, refer to the documen-
tation of brms—or simply run ?brm in RStudio. You can also hit Tab inside
the brm() function to see a list of possible arguments in RStudio, as already
mentioned.
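For concreteness, here is a sketch of the same call with some of those optional arguments spelled out (the fit name is hypothetical; the values shown for chains, iter, and warmup are the brms defaults):

R code
fit_brm1b = brm(Score ~ Feedback + Task +
                  (1 + Task | ID) + (1 | Item),
                data = longFeedback,
                family = gaussian(),
                chains = 4,     # number of chains (default)
                cores = 4,      # run the chains in parallel
                iter = 2000,    # iterations per chain (default)
                warmup = 1000,  # iterations discarded as warmup (default)
                seed = 6)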
Let’s now interpret the actual output of the model, which will look slightly
different compared to your output once you run the model yourself. Code
block 55 only shows our main effects (population-level effects), but random
effects will also be printed once you run line 11. First, we have our estimates
and their respective standard errors. This is quite familiar at this point. Then
we have the lower and upper bounds of our 95% credible intervals (as discussed
in Fig. 10.1). For example, the most credible value for our intercept13 is β̂ = 73.78, but the 95% credible interval ranges from β̂ = 65.78 to β̂ = 81.69—you should be able to picture a posterior distribution with these values, but we will visualize these estimates soon. Notice that we’re given the probability of the parameter given the data, P(β̂0 | data), not the probability of the data given the parameter (i.e., p-value in Frequentist inference; §10.1).
The last three columns in the output of fit_brm1 help us identify whether the model is working properly (i.e., model diagnostics). First, Rhat (R̂), which is also known as the Gelman-Rubin convergence diagnostic. R̂ is a metric used to monitor the convergence of the chains of a given model. If R̂ ≈ 1, all our chains are in equilibrium—see Brooks et al. (2011). All our R̂ values in fit_brm1 are exactly 1, so we’re good here—an R̂ = 1.2 is a bad sign.
Finally, we have two ESS columns (effective sample size, typically represented as n̂_eff). This has nothing to do with the sample size in our study; it is instead the number of steps that are uncorrelated in our sampling. If our chains get stuck in a single location before moving on, the steps taken in that location are all correlated and therefore are less representative of the walk as a whole. Simply put, the higher the number, the better—higher numbers indicate that more uncorrelated steps were taken in the posterior estimation for a given parameter.
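If you want these diagnostics as numbers rather than columns in the printed output, brms provides helper functions for that—a quick sketch:

R code
rhat(fit_brm1)        # R-hat per parameter; values near 1 = converged chains
neff_ratio(fit_brm1)  # effective sample size as a proportion of total draws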
How can we visualize our Bayesian model? Throughout this book, I have
been emphasizing the importance of data visualization to better understand
our data. For Bayesian models, data visualization is even more crucial. Plotting
is the best way to check whether our model has converged and to examine the
posterior distributions of our estimates. Let’s start with a trace plot, which we can
use to inspect our chains. This is not the type of plot you would add to your
paper, but it’s useful to check it to make sure that all the chains arrived at the
same location in the parameter space.14 If all four chains in fit_brm1 converged
towards the same parameter values, that’s a good sign.
In Fig. 10.7, the x-axis represents the sampling process: by default, each
chain in our model has 2,000 iterations (iter argument in brm()), but 1,000
of those are used as warmup (warmup argument in brm()).15 As a result, we
actually have 1,000 “steps” per chain in our random walk to find out what
parameter values are more likely given the data being modeled. These 1,000
steps are represented by the x-axis in our figure. As you can see in the
figure, all four chains have converged towards the same range of β̂ values (on
the y-axis).
The fact that you probably can’t differentiate which line represents which chain in Fig. 10.7 is a good sign; it means that all four chains overlap, that is, they agree—it would help to use colors here (see line 23 in code block 56). If you go back to the estimates of fit_brm1, you will notice that β̂ = 2.91 for FeedbackRecast—the value representing the peak of the posterior. On the y-axis in Fig. 10.7, you will notice that all four chains are clustered around the β̂ value in question, that is, they all agree.
The code that generated Fig. 10.7 is provided in code block 56—this code
block prepares our data not only for the trace plot shown in Fig. 10.7 but for
our next plot (the code should be added to plotsBayesianModels.R). First, we
extract the values from our posterior distributions (line 7)—run line 8 to see
what that looks like. Line 11 prints the names of the parameters from our
model (recall that this is a hierarchical model). Line 14 renames some of the
parameters—those we’re interested in plotting. Line 19 creates a vector to
hold only the estimates of interest. Finally, we select a color scheme (line
22) and create our plot (lines 26–28)—you may want to run line 23 to
produce a trace plot with color, which will help you see the different
chains plotted.
Next, let’s plot what’s likely the most important figure when it comes to
reporting your results: the model estimates. Overall, it’s much easier to see
a model’s estimates in a figure than in a table. Fig. 10.8 should be familiar:
we have done this before (e.g., Fig. 7.7), but now we’re not plotting
R code
1 # You must run fit_brm1 before running this code block
2 library(bayesplot)
3 # library(extrafont)
4
5 # Prepare the data for plotting
6 # Extract posteriors from fit:
7 posterior1 = as.array(fit_brm1)
8 head(posterior1)
9
10 # Check all parameters in the model:
11 dimnames(posterior1)$parameters
12
13 # Rename the ones we'll use:
14 dimnames(posterior1)$parameters[c(1:3, 8)] = c("Intercept",
15 "Feedback (Recast)",
16 "Task (Task B)", "Sigma")
17
18 # Select which estimates we want to focus on:
19 estimates = c("Intercept", "Feedback (Recast)", "Task (Task B)", "Sigma")
20
21 # Select color scheme:
22 color_scheme_set("gray") # To produce figures without color
23 # color_scheme_set("viridis") # Use this instead to see colors
24
25 # Create plot:
26 mcmc_trace(posterior1, pars = "Feedback (Recast)") +
27 theme(text = element_text(family = "Arial"),
28 legend.position = "top")
29
30 # ggsave(file = "figures/trace-fit_brm1.jpg", width = 4, height = 2.5, dpi = 1000)
point estimates and confidence intervals. We’re instead plotting posterior dis-
tributions with credible intervals. This is the figure you want to have in your
paper.
Fig. 10.8 is wider on purpose: because our intercept is so far from the other
parameters, it would be difficult to see the intervals of our estimates if the figure
were too narrow. That’s why you can’t see the actual distributions for Task or Sigma (estimated variance): they’re too narrow relative to the range of the x-axis.16 As you can see, no posterior distribution includes zero.
That means that zero is not a credible value for any of the parameters of interest
here. Note that even if zero were included, we’d still need to inspect where in the
posterior zero would be. If zero were a value at the tail of the distribution, it’d
still not be very credible, since the most credible values are in the middle of our
posteriors here—which follow a normal distribution.
Finally, Fig. 10.8 shows two intervals for each posterior (although you can
only see both for the intercept in this case). The thicker line represents the
R code
1 # Plot estimates with HDIs:
2 mcmc_intervals(posterior1,
3 pars = estimates,
4 prob = 0.5,
5 prob_outer = 0.95) +
6 theme(text = element_text(family = "Arial"))
7
8 # ggsave(file = "figures/estimates-fit_brm1.jpg", width = 8, height = 2.5, dpi = 1000)
50% HDI, while the thinner line represents the 95% HDI—use lines 4 and 5 in
code block 57 if you wish to plot different intervals.
So far, we have checked some diagnostics (R̂, n̂_eff, trace plot), and we have
plotted our model’s estimates. Another quick step to check whether our model
is appropriate is to perform a posterior predictive check. The intuition is simple:
data generated from an appropriate model should look like the real data orig-
inally fed to our model.
Fig. 10.9 plots our actual data (thick black line), represented by y, against
replicated data from our model, represented by yrep—here, we are using 20
samples from our data. The x-axis represents predicted scores, our response var-
iable. As you can see, the patterns from the predicted data match our actual data
quite well. Although you don’t need to include such a figure in your paper, it’s
important to perform predictive checks to ensure that our models are actually
adequate given the data.
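A minimal sketch of such a check using brms’ pp_check() function (the same function note 18 suggests for fit_brm2):

R code
# 20 datasets simulated from the model (y_rep) against the real data (y):
pp_check(fit_brm1, nsamples = 20)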
REPORTING RESULTS
A hierarchical linear model confirms that Feedback has a statistically credible effect on participants’ scores in the experimental data under analysis (β̂Recast = 2.91, 95% HDI = [0.92, 4.96]). Task also has a statistically credible effect (β̂TaskB = −1.18, 95% HDI = [−2.25, −0.12])—see Fig. 10.8. Our model included by-participant random intercepts and slopes (for Task) as well as by-item random intercepts. Convergence was checked based on R̂ and n̂_eff. Posterior predictive checks confirm that the model generates simulated data that resemble actual data.
The paragraph above is likely more than you need: it’s a comprehensive
description of the results coming out of fit_brm1. Before reporting the
results, you will naturally describe the methods used in your study. That’s
where you should clearly explain the structure of your model (e.g., which pre-
dictors you are including). You should also motivate your Bayesian analysis
(§10.1) by highlighting the advantages of this particular framework relative
to Frequentist inference. For more details on how to report Bayesian
models, see Kruschke (2015, ch. 25).
Finally, you should carefully explain what a Bayesian estimate means, since
your reader will probably not be used to Bayesian models. Fortunately, it’s
much more intuitive to understand Bayesian parameter estimates than Frequen-
tist ones. Again: these estimates are simply telling us what the most credible
effects of Feedback and Task are given the data being modeled.
participants in our study. Here are three expectations we could consider (there
are many more).
• English speakers should favor low attachment more than 50% of the time in the NoBreak condition (the “control” condition).
• In the Low condition, English speakers should favor low attachment even more.
• Spanish learners should favor low attachment less often than native speakers in the NoBreak condition.
The goal here is to use brms to run a hierarchical (and Bayesian) version of fit_glm4, first examined in chapter 7. Our model specification will be Low ~ Condition * Proficiency + (1 | ID) + (1 | Item). Recall that condition has
three levels (High, Low, NoBreak) and that proficiency also has three levels
(Adv, Int, Nat). As usual, we will choose our reference levels to be
NoBreak for Condition and Nat for Proficiency—this is the more intuitive
option here. Consequently, our intercept will represent the expected response
(in log-odds) when Condition = NoBreak and Proficiency = Nat—this should
be familiar, since we’re basically repeating the rationale we used for fit_glm4
back in chapter 7.
The model specified here, which we’ll call fit_brm2, will be used to predict
the log-odds of choosing Low attachment, which we can map onto probabil-
ities by using the inverse logit function discussed in chapter 7. Our response
variable is binary, so we cannot assume that it follows a normal distribution
(cf. fit_brm1). We use the inverse logit function (logit−1) to map the model’s log-odds onto probabilities, and our binary responses (the 0s and 1s in Low) will follow a Bernoulli distribution—Bern in 10.4. Nothing here is conceptually
new (see chapter 7). What is new is that we can specify our priors for different
parameters.
y_i ∼ Bern(μ_i)
logit(μ_i) = β0 + βlo x_i1 + βint x_i2 + …
β0 ∼ N(1, 0.5)
βlo ∼ N(1, 0.5)
βint ∼ N(−1, 0.5)
… ∼ …        (10.4)
Look at the third and fourth lines in 10.4. We are essentially saying that
we expect both β0 and βlo to follow a normal distribution with a positive
mean (μ = 1). What does that mean? For the intercept, it means that we
R code
1 source("dataPrepCatModels.R")
2
3 # Adjust reference levels:
4 rc = rc %>%
5 mutate(Condition = relevel(Condition, ref = "NoBreak"),
6 Proficiency = relevel(Proficiency, ref = "Nat"))
7
8 # Get priors:
9 get_prior(Low ~ Condition * Proficiency + (1 | ID) + (1 | Item), data = rc)
10
11 # Specify priors:
12 priors = c(prior(normal(1, 0.5), class = "Intercept", coef = ""), # Nat, NoBreak
13 prior(normal(1, 0.5), class = "b", coef = "ConditionLow"), # Nat, Low
14 prior(normal(-1, 0.5), class = "b", coef = "ProficiencyInt")) # Int, NoBreak
15
16 # Run model:
17 fit_brm2 = brm(Low ~ Condition * Proficiency +
18 (1 | ID) + (1 | Item),
19 data = rc,
20 family = bernoulli(),
21 save_model = "fit_brm2.stan",
22 seed = 6,
23 prior = priors)
24
25 fit_brm2
26
27 # save(fit_brm2, file = "fit_brm2.RData")
28 # load("fit_brm2.RData")
expect native speakers in the NoBreak condition to favor low attachment (positive estimates → preference for low attachment above 50%). For βlo, it means
that we also expect a positive estimate for the low condition—remember that
our reference levels are the NoBreak condition and the Nat proficiency. In
other words, we expect native speakers to favor low attachment more in the
low attachment condition (relative to the NoBreak condition). Now look at
the fifth line, which says we expect intermediate participants to have an
effect that follows a normal distribution with a negative mean (μ = −1). This
means that we expect these participants to favor low attachment less relative
to native speakers (in the NoBreak condition). These are the only prior distri-
butions we will specify, but you could specify all the terms in the model (our
model here has random intercepts as well as interaction terms).
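To see what these prior means imply on the probability scale, here’s a quick sketch (the helper function name is hypothetical):

R code
inv_logit = function(x) 1 / (1 + exp(-x))
inv_logit(0)   # 0.50: no preference either way
inv_logit(1)   # ~0.73: the prior mean for the intercept and for Low
inv_logit(-1)  # ~0.27: the prior mean for intermediate learners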
Code block 58 shows how to specify priors in a Bayesian model run with
brms. You should run line 9 to understand the structure of priors first. Lines
12–14 then specify prior distributions for the three parameters discussed
earlier—the code should be self-explanatory.
The model in question, fit_brm2, is run in lines 17–23 (notice how we
specify our priors in line 23). As with fit_brm1, you can print the results of
the model by running line 25. Lines 27 and 28 simply show how to save the
model (so you don’t have to rerun it again in the future) and how to load it
232 Analyzing the Data 232
FIGURE 10.10 Posterior Distributions for all Parameters of Interest (Four Chains
Each)
into RStudio once you’ve saved it—exactly the same code we used for
fit_brm1. This time, however, code block 58 doesn’t include an excerpt of
the output of the model. That’s because we will plot the estimates later.
Fig. 10.10 is a useful way to visualize our model’s results.17 First, we have
nine panels (for the nine parameters of interest in our model). Second, we
have the entire posterior distributions for each parameter, so we don’t just
see the mean of the distributions (which is what we’d see in the estimates
column in our output). Third, for each parameter, we see the posterior distri-
bution for each of the four chains in our simulation, which means we can also
inspect if the chains converged (this is similar to inspecting a trace plot). We can
clearly see that the chains converged, since all four lines for each parameter are
showing us roughly the same posterior distribution of credible parameter
values.
You may also want to zoom in on specific parameters to see their combined
posterior distributions. Fig. 10.11 plots the posterior distribution of the inter-
cept from fit_brm2 against the posterior distribution of b^lo —this would be
equivalent to a 2D version of what is shown in Fig. 10.6. The code used to
generate both figures from fit_brm2 is shown in code block 59—Fig. 10.11
requires the hexbin package.
Finally, you should also inspect R̂ and n̂_eff for fit_brm2, just as we did for fit_brm1. Likewise, you should run a posterior predictive check to make sure that the data predicted by the model mirrors actual data.18
The effects of the model in question are not too different from the estimates
of fit_glm4. This is mostly because our priors are only mildly informative here
(the standard deviations of our priors are wide enough that the model can easily
override our priors given enough evidence in the data)—you will have the
opportunity to play around with priors in the exercises that follow. Of
course, you should bear in mind that fit_brm2 is not only different from
fit_glm4 because we’re running Bayesian models here: it’s also different
because fit_brm2 is hierarchical and includes random intercepts for participants
and items.
Last but not least, we can compare different Bayesian models much like we
compared different Frequentist models back in chapters 6, 7, and 9. There are
multiple ways to compare model fits (Bayesian or Frequentist). You may
remember our discussion on AIC values from chapter 7. A similar metric can be used for Bayesian models, but it’s called the Widely Applicable Information Criterion (WAIC). For example, say you have two models, modelA and modelB, and you want to decide which one to report. You can run waic(modelA, modelB) (or run waic() on each model individually) to see the WAIC values for both models. As with AIC values, the lower the WAIC value, the better the fit. Another technique commonly used to assess a model’s fit is leave-one-out cross-validation (LOO), which you can run using the loo() function (it works the same way as waic()). As with WAIC values, lower looic values in the output of loo() indicate a better fit.
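A sketch following the text (modelA and modelB stand for any two brm() fits):

R code
waic(modelA)          # WAIC for a single model; lower = better fit
waic(modelA, modelB)  # WAIC values for both models
loo(modelA, modelB)   # leave-one-out cross-validation; lower looic = better fit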
You should now be able to run your own Bayesian models using R given what we discussed here. The main goal was to get
you started with the fundamental concepts and the code to implement Bayesian
models through brms. Combined with what we discussed in previous chapters,
you can now apply the code from this chapter to run linear, logistic, and ordinal
hierarchical models using Bayesian estimation. You may also take a look at the
tidybayes package (Kay 2020), which brings tidy data to Bayesian models (and
also offers additional options for data visualization). Following are some reading
suggestions should you wish to learn more about Bayesian statistics.
If you’re interested in the fascinating history of Bayes’s theorem, see
McGrayne (2011). If you are interested in the intersection between philosophy
of science and statistics and are wondering whether Bayesian inference is best
characterized as inductive inference, see Gelman and Shalizi (2013).
R code
1 # You must run fit_brm2 before running this code block
2 # Prepare the data for plotting
3 # Extract posteriors from fit:
4 posterior2 = as.array(fit_brm2)
5 head(posterior2)
6
7 # Check all parameters in the model:
8 dimnames(posterior2)$parameters
9
10 # Rename the ones we'll use:
11 dimnames(posterior2)$parameters[1:9] = c("Intercept", "High", "Low",
12 "Adv", "Int", "High:Adv",
13 "Low:Adv", "High:Int", "Low:Int")
14
15 # Select which estimates we want to focus on:
16 estimates2 = c("Intercept", "High", "Low",
17 "Adv", "Int", "High:Adv",
18 "Low:Adv", "High:Int", "Low:Int")
19
20 # Select color scheme:
21 color_scheme_set("gray") # To produce figures without color
22 # color_scheme_set("viridis") # Use this instead to see colors
23
24 # Create plot: posterior distributions for all 9 parameters of interest
25 # Plot chains overlaid (not shown; see fit_brm2)
26 mcmc_dens_overlay(posterior2, pars = estimates2) +
27 theme(legend.position = "top",
28 text = element_text(family = "Arial"))
29
30 # ggsave(file = "figures/posteriors-brm2.jpg", width = 8, height = 5, dpi = 1000)
31
32 # Create plot: posterior distributions for intercept and Condition = Low
33 library(hexbin) # Install hexbin package first
34 mcmc_hex(posterior2, pars = c("Intercept", "Low")) +
35 theme(legend.position = "top",
36 text = element_text(family = "Arial"))
37
38 # ggsave(file = "figures/hex-brm2.jpg", width = 4, height = 4, dpi = 1000)
The top three books I would recommend on Bayesian data analysis are
Gelman et al. (2014a), Kruschke (2015), and McElreath (2020). You may
want to read Kruschke and Liddell (2018), an introductory and user-friendly
article for newcomers. Then, if you decide you want to really explore Bayesian
methods, start with Kruschke (2015) or McElreath (2020), which are the most
user-friendly options of the three books.
Much like the present chapter, all three books suggested here focus on data-
analytic applications, but you may also be interested in the applications of Baye-
sian models in cognition and brain function. In that case, see Chater et al.
(2006), Tenenbaum et al. (2006), as well as Lee and Wagenmakers (2014).
Finally, if you’d like to see Bayesian data analysis applied to second language
research, see Norouzian et al. (2018) and Garcia (2020). In Garcia (2020), I
compare different hypotheses about language transfer by running statistical
models using different sets of priors.
10.6 Summary
In this chapter, we ran two hierarchical models using Bayesian estimation.
Bayesian inference offers many advantages (and some disadvantages) over Fre-
quentist inference. Let’s briefly review some of its advantages. First, Bayesian
models are more intuitive to interpret, as they provide the probability of a
parameter value given the data, and not the probability of the data given a
parameter value. As a result, interpreting estimates and credible intervals is
much more intuitive and natural than interpreting Frequentist confidence
intervals. Second, Bayesian models allow us to incorporate our specialized
knowledge into our statistical analyses in the form of prior distributions.
This, in turn, provides the researcher with much more analytical power.
Third, outputs of Bayesian models are considerably more comprehensive
than those of Frequentist models, as we’re given entire posterior distributions
of credible parameter values. Fourth, Bayesian models allow for a higher
degree of customization and complexity.
How about disadvantages? First, Bayesian models are computationally
demanding. Running such models takes much longer than running their Fre-
quentist counterparts. Second, Bayesian data analysis is still in its infancy in the
field of second language research. Consequently, these methods will likely not
be familiar to reviewers or readers. For example, many people may find it
unsettling not to see p-values at all in your analysis, and you might have a
hard time explaining why that’s actually a good thing. Third, our field is
mostly used to categorical answers to complex questions, so providing posterior
distributions may backfire if not enough background is provided beforehand.
Here’s why: people tend to like p-values because they provide a clear and com-
forting categorical answer to our research questions (however complex such
questions may be). Naturally, this is a simplistic understanding of statistical inference. Here are some key points from this chapter:
• Any model can be run using Bayesian estimation. You only need to know which distribution to use for your model—unlike the models run in chapters 6–9, which required different functions, brms allows us to use a single function, namely brm(), to run all our models. When we ran our linear regression (fit_brm1), we specified family = gaussian(); for our logistic regression, we specified family = bernoulli(). If we wanted to run an ordinal regression, we’d specify family = cumulative(“logit”)—see the sketch after this list.
• Bayesian models provide posterior distributions of effect sizes, instead of the single-point estimates we get in Frequentist models.
• Estimates are credible parameter values given the data (cf. Frequentist inference).
• We can choose from a wide range of prior distributions to customize our models, even though we can also let brms choose its default distributions (“auto mode”). Our decisions regarding priors should be grounded on the literature, that is, on previous experiments that measure the effects of interest.
• If we use vague priors (i.e., if we let brms use default priors), then our results will be very similar to the results of a Frequentist model, since the posterior will be mostly determined by the data being modeled.
• You will often get warnings when running models using Stan. They are not necessarily bad; most of the time they’re simply trying to help you with some information (see Appendix A.4). It’s up to you to assess whether the warning is applicable or not. A comprehensive list of warnings is provided at https://mc-stan.org/misc/warnings.html.
• To check whether our model is appropriate, and that it has converged, we should run a posterior predictive check, and we should inspect R̂ as well as n̂_eff for each parameter being estimated.
• Because our estimates are given in distributions, plotting said estimates is probably the best way to present our results.
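Here’s a sketch of that single-function pattern (the fit names are hypothetical, and the data objects and formulas stand in for those used in earlier chapters):

R code
fit_lin = brm(Score ~ Hours, data = feedback,
              family = gaussian())              # linear regression
fit_log = brm(Low ~ Condition, data = rc,
              family = bernoulli())             # logistic regression
fit_ord = brm(Certainty ~ Condition, data = rc,
              family = cumulative("logit"))     # ordinal regression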
10.7 Exercises
Notes
1. Named after English mathematician Thomas Bayes (1702–1761; Bayes 1763).
Across the English Channel, French scholar Pierre-Simon Laplace independently
developed and further explored the probability principles in question—see Dale
(1982) and McGrayne (2011) for a comprehensive historical account.
2. A rephrasing of Laplace’s principle. Sagan’s sentence is indeed a characterization of
Bayesian inference.
3. Note that the idea that what is more probable is what happens more often is much
older than Frequentist statistics.
4. For mathematical details, see Kruschke (2015, ch. 6). Simply put, a beta distribution is a continuous probability distribution defined on the interval [0, 1]. Notice that the prior distribution in the figure is not symmetrical (cf.
Gaussian distribution). The shape of a beta distribution relies on two parameters,
typically denoted as α and β—much like the shape of a Gaussian distribution also
relies on two parameters, μ and σ. The shape of the prior distribution here is
defined as α = 3, β = 10.
5. We can also calculate the equal-tailed interval, or ETI. See discussion in Kruschke
(2015, pp. 242–243).
6. Note that 95% is an arbitrary number—as arbitrary as 89%. Here we will keep using
95% as our standard interval, but always remember that there’s nothing special about
this number (other than most people seem to like it).
7. Remember that a probability of 0.5 is equivalent to 0 log-odds (Table 7.1) in a logis-
tic regression.
8. A popular alternative is to use JAGS (Just Another Gibbs Sampler). For our purposes
here, JAGS and Stan are very similar, but see Kruschke (2015, ch. 14) for a discus-
sion on the differences between these two options.
9. Unlike robust linear regressions, which assume a t-distribution—see, for example,
Gelman et al. (2014a, ch. 17).
10. We will not use 3D figures, though, since they can only display the joint distribu-
tions of two parameters at once. If we have a model with n predictors (where n > 2),
we can’t actually picture what their joint distribution looks like—we can, of course,
look at their distributions individually.
11. You may get a warning message after running Bayesian models. See Appendix A.4 if
that happens.
12. You may remember from chapter 7 that we specified the family of our model for
logistic regressions, so this is not exactly new. To run a robust linear regression,
specify family = student()—see Kruschke (2015, §17.2) for the implementation
of such a model. If you’re not familiar with robust models, also see Gelman et al.
(2014a, ch. 17).
13. Recall that our intercept represents the predicted score of a participant when
Feedback = Explicit feedback and Task = TaskA. This is the same interpretation
as before.
14. You will often get warnings and/or see that something’s off by inspecting the output
of the model, but visualizing the chains is perhaps the most intuitive way to check
that they have converged.
15. You can naturally change these default values by adding these two arguments to brm()
when you run the model. The more iterations our chains have, the more accurate our
posterior estimates will be—and the longer it will take for the model to run.
16. You could remove the intercept from the figure by adjusting the code shown in
code block 56, line 19: simply remove the intercept from the vector and rerun
the code, and then run code block 57.
17. Your figure will look slightly different.
18. Note that this time the response variable is binary, so instead of the normal distribu-
tion shown in Fig. 10.9, you should go for a different type of figure. For logistic
regressions, bars would be more useful. Simply run pp_check(fit_brm2, type =
“bars”, nsamples = 20) and see what happens. You can run this multiple times
to see how the figure changes with different samples.
11
FINAL REMARKS
The two main goals of this book were to emphasize the importance of data
visualization in second language research and to promote full-fledged statistical
models to analyze data. I hope to have convinced you of the numerous advan-
tages of visualizing patterns before statistically analyzing them and of analyzing
such patterns using hierarchical models, which take into account individual-
and group-level variation when estimating group-level coefficients. More
comprehensive quantitative methods are vital in our field, given that second
language studies typically employ a dangerously narrow range of statistical tech-
niques to analyze data—as mentioned in chapter 1.
In addition to these two points, I hope you are now convinced that Bayesian
data analysis is not only more powerful and customizable but also more intui-
tive in many respects. Perhaps more importantly, Bayesian models allow us to
incorporate our previous knowledge into our statistical analysis. Second lan-
guage research can certainly benefit from using priors that mirror (i) previous
research findings or (ii) L1 grammars in the context of language transfer
(e.g., Garcia 2020).
Throughout this book, we examined several code blocks in R. The fact that
R is a powerful language designed specifically for statistical computing allows
you and me to easily reproduce the code blocks discussed earlier. And
because R is open-source and so widely used these days, we have literally thou-
sands of packages to choose from. Packages like tidyverse and brms allow more
and more people to accomplish complex tasks by running simple lines of code.
We also saw in different chapters how files can be organized using R Proj-
ects. We saw that by using the source() function we can easily keep different
components of our projects separate, which in turn means better organization
241 Final Remarks 241
overall. Finally, we briefly discussed RData files, which offer a great option to
compress and save multiple objects into a single file.
The range of topics introduced in this book makes it impossible for us to
discuss important details. For that reason, numerous reading suggestions were
made in different chapters. The idea, of course, was to give you enough
tools to carry out your own analyses but also to look for more information
in the literature when needed.
The end.
Appendix A
Troubleshooting
If you load the arm package after tidyverse and then run select(), R will assume you mean the arm version, not the dplyr version, which is now “masked”.
Therefore, if you have both packages loaded and plan to use select(), make
sure you are explicit about it by typing dplyr::select()—you can find this in
code block 34. Otherwise, you will get an error, since you’ll be using the
wrong function without realizing what the problem is.
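A sketch of the scenario (assuming both packages are installed; mtcars is just a built-in example dataset):

R code
library(dplyr)
library(arm)  # arm loads MASS, whose select() now masks dplyr's

# select(mtcars, mpg)       # may now fail: the wrong select() is found first
dplyr::select(mtcars, mpg)  # being explicit avoids the conflict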
A.3 Errors
When you start using R, error messages will be inevitable. The good news is
that many errors are very easy to fix. First, you should check the versions of
R and RStudio that you have installed (see §A.1 or §2.2.1). If your computer
is too old, your operating system will likely not be up-to-date, which in turn
may affect which versions of R and RStudio you can install. That being said,
unless your computer is a decade old, you shouldn’t have any problems. In
some cases, an error will go away once you restart RStudio or your com-
puter—updating packages may also solve the issue, in which case R will give
you instructions. As already mentioned, all the code in this book has been
tested on different operating systems, but each system is different, and you
might come across errors.
It’s important to notice that R is case-sensitive, so spelling matters a lot. For
example, you may be running Head() when the function is actually head()—
this will generate the error could not find function “Head”. Or you may be
running library(1me4) when it’s called library(lme4)—this will generate the
error unexpected symbol … . Another common issue is trying to load a
package that hasn’t been installed yet—which will generate the error there is no package called … . All these scenarios will throw an error on your
screen, but all are easy to fix.
One common error you may come across is: Error in …: object ’X’ not
found—replace X with any given variable. This happens when you are
calling an object that is not present or loaded in your environment. For
example, you try running the code to create a figure for the estimates of a
given model, but you haven’t run the actual model yet. Suppose you run
some models, which are now objects in your environment. You then restart
RStudio and try running some code that attempts to create a figure based on
the model. You will get an error, since the models are no longer in RStudio’s
“memory” (i.e., its environment). You need to run the models—or source()
the script that runs the models.
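A sketch of that scenario (assuming fit_brm1 was previously saved to fit_brm1.RData, as in code block 55):

R code
# fixef(fit_brm1)        # Error: object 'fit_brm1' not found (fresh session)
load("fit_brm1.RData")   # restore the saved model object
fixef(fit_brm1)          # now the object exists in the environment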
Another common error you may get is: Error in percent_format(): could
not find function “percent_format”. This is simply telling you that R
couldn’t find percent_format (used to add percentage points to axes in differ-
ent plots). If you get this message, it’s because you haven’t loaded the scales
package, which is where percent_format is located.
You may, of course, come across more specific errors that are harder to fix. First, read the error message carefully—bear in mind that your error may be the result
of some issue in your system, not in R per se. Second, check for typos in your
code. Third, google the error message you got. Online forums will almost
always help you figure it out—the R community is huge (and helpful!).
A.4 Warnings
As you run the code blocks in this book, you may come across warning
messages. Unlike error messages, warning messages will not interrupt any pro-
cesses—so they’re usually not cause for alarm. If you run a function that is a
little outdated, you may get a warning message telling you that a more
updated version of the function is available and should be used instead. For
example, if you run stat_summary(fun.y = mean, geom = “point”), you
will get a warning message: Warning message: ‘fun.y‘ is deprecated. Use
‘fun‘ instead. Here, the message is self-explanatory, so the solution is obvious.
In chapter 10, you may run into warning messages such as “There were 2
divergent transitions after warmup […]”. This is merely telling you that
you should visually inspect your model to check whether the chains look
healthy enough (e.g., via a trace plot).
A.5 Plots
At first, your plots may look slightly different from the plots shown in this
book, especially when figures have multiple facets—even though you are
using the exact same code. But the difference is not actually real: once you
save your plots using the ggsave() function, your figures will look exactly
the same as mine—provided that you use the exact same code (including the
specifications provided for ggsave()).
These apparent differences occur because when you generate a figure in
RStudio, the preview of said figure will be adjusted to your screen size. A
full HD laptop screen has much less room than a 4K external monitor.
RStudio will resize the preview of your plot to fit the space available. As a
result, if your screen is not 4K, you may see overlapping labels, points that
are too large, and so on. You can resize the plot window to improve your
preview, click on the Zoom button, or make changes to the code, but the
bottom line is that once you save the figure, it will look perfectly fine, so
this is not a problem.
Appendix B
RStudio Shortcuts
N Population size
μ Population mean
σ Population standard deviation
n Sample size
x̄ Sample mean
s Sample standard deviation
H0 Null hypothesis
SE Standard Error
CI Confidence Interval
β Estimate in statistical models
α Significance level in hypothesis testing
Random intercept in hierarchical models
γ Random slope in hierarchical models
ε Error term
ê Estimated error term
ŷ Predicted response
N Normal (Gaussian) distribution
U Uniform distribution
HDI Highest density interval
P(A|B) Probability of A given B
Appendix D
Files Used in This Book
The structure shown in Fig. D.1 illustrates how the files are organized in dif-
ferent directories. Naturally, you may wish to organize the files differently: the
structure used in this book is merely a suggestion to keep files in different direc-
tories and to use R Projects to manage the different components of the topics
examined throughout the book. In a realistic scenario, you could have one
directory and one R Project for each research project you have.
By adopting R Project files and using a single directory for all the files in
your project, you can refer to said files locally, that is, you won’t need to use
full paths to your files. Of course, you can always import a file that is not in
your current directory by using complete paths—or you can create copies of
the file and have it in different directories if you plan to use the same data
file across different projects. This latter option is in fact what you see in this
book: feedbackData.csv, for example, is located in multiple directories
(plots, Frequentist, Bayesian in the figure). These duplicated files (both
scripts and data files) ensure that we have all the files we need within each
working directory.
bookFiles/
    basics/ (ch. 2)
        basics.Rproj
        data: sampleData.csv
        scripts: rBasics.R, dataImport.R, dataPrep.R, eda.R, stats.R
        figures/...
    plots/ (ch. 3–5)
        plots.Rproj
        data: feedbackData.csv, rClauseData.csv
        scripts: continuousDataPlots.R, categoricalDataPlots.R, optimizingPlots.R
        figures/...
    Frequentist/ (ch. 6–9)
        Frequentist.RProj
        data: feedbackData.csv, rClauseData.csv
        scripts: dataPrepLinearModels.R, plotsLinearModels.R, linearModels.R,
                 dataPrepCatModels.R, plotsCatModels.R, logModels.R,
                 ordModels.R, hierModels.R
        models/ (ch. 6–10)
        figures/...
    Bayesian/ (ch. 10)
        Bayesian.RProj
        data: feedbackData.csv, rClauseData.csv
        scripts: dataPrepLinearModels.R, dataPrepCatModels.R,
                 plotsBayesianModels.R, bayesianModels.R
        figures/...
TABLE D.1 List of all Scripts and their Respective Code Blocks
Here you can see how dummy coding works for a factor (L1) with three levels
(German, Italian, Japanese). The factor in question, from feedback, is ordered
alphabetically by R, which means German will be our reference level (i.e., our
intercept) in a typical model—unless, of course, we manually change it using
relevel(). Because we have three levels, we only need two columns in our con-
trast coding in Table E.1 (compare with Table 6.1 in chapter 6).
TABLE E.1 Example of Contrast Coding for More Than Two Levels
L1          Italian   Japanese
German      0         0
German      0         0
Italian     1         0
Japanese    0         1
Italian     1         0
Japanese    0         1
German      0         0
…           …         …
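You can inspect this coding directly in R—a short sketch:

R code
L1 = factor(c("German", "German", "Italian", "Japanese",
              "Italian", "Japanese", "German"))
contrasts(L1)  # German (alphabetically first) is the reference level
#          Italian Japanese
# German         0        0
# Italian        1        0
# Japanese       0        1

contrasts(relevel(L1, ref = "Japanese"))  # change the reference level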
Appendix F
Models and Nested Data
R code
1 source("dataPrepCatModels.R")
2 library(tidyverse)
3 library(ordinal)
4
5 rc = rc %>%
6 mutate(Condition = relevel(factor(Condition), ref = "NoBreak"))
7
8 # Create table with estimates and more: two models
9 rc %>% nest_by(L1) %>%
10 mutate(model = list(clm(Certainty3 ~ Condition, data = data))) %>%
11 summarize(broom::tidy(model)) %>%
12 mutate(lowerCI = estimate - 1.96 * std.error,
13 upperCI = estimate + 1.96 * std.error) %>%
14 filter(coefficient_type == "beta") %>%
15 dplyr::select(-c(statistic, p.value, coefficient_type))
16
17 # Output:
18 # A tibble: 4 x 6
19 # Groups: L1 [2]
20 # L1 term estimate std.error lowerCI upperCI
21 # <fct> <chr> <dbl> <dbl> <dbl> <dbl>
22 # 1 English ConditionHigh -0.631 0.264 -1.15 -0.114
23 # 2 English ConditionLow 0.730 0.270 0.201 1.26
24 # 3 Spanish ConditionHigh 1.08 0.190 0.704 1.45
25 # 4 Spanish ConditionLow -0.877 0.192 -1.25 -0.501
Note
1. This function can also be used to display the results of a model, much like summary()
and display(), that is, broom::tidy(fit). It’s more minimalistic and organized than
summary() but less minimalistic than display().
GLOSSARY
homoscedasticity The assumption that the variance of the error term is constant. If our data doesn’t meet this assumption, we say that we have heteroscedastic data. p. 114
leave-one-out cross-validation (LOO) Information criterion commonly
used to assess a model’s fit (mainly in relation to other models). The idea
is to capture the out-of-sample prediction error by (i) training a given
model on a dataset that contains all but one item (or set) and (ii) evaluating
its accuracy at predicting said item (or set). p. 233
Markov Chain Monte Carlo (MCMC) Techniques used to fit Bayesian
models. In a nutshell, MCMC methods draw samples from the posterior dis-
tribution instead of attempting to compute the distribution directly. The
concept behind MCMC can be described as a random walk through the
space of possible parameter values. Different MCMC algorithms can be
used for this random walk, for example, Metropolis-Hastings, Gibbs, Ham-
iltonian. A Markov chain is a stochastic model where the probability of an
event only depends on the previous state. Metaphorically, the next step we
take in our random walk only depends on our previous step. Finally,
“Monte Carlo” refers to the famous Casino in Monaco. p. 218
null hypothesis (H0) The hypothesis that there is no significant difference
between two groups. More generally, the null hypothesis assumes that
there is no effect of a given variable. p. 7
p-hacking Conscious or unconscious data manipulation to achieve a statisti-
cally significant result. See Nuzzo (2014). p. 8
pipe (%>%) Operator that takes the output of a statement in R and makes it
the input of the next statement (see §2.5). p. 38
R̂ Also known as the Gelman-Rubin convergence diagnostic. A metric used to monitor the convergence of the chains of a given model. R̂ is a measure that compares the variance of the simulations from each chain with the variance of all the chains mixed together. If all chains are at equilibrium, R̂ should equal 1. See Brooks et al. (2011). p. 255
shrinkage The reduction of variance in the estimators relative to the data
being modeled. For instance, in hierarchical models where by-participant
effects are taken into account, the individual-level predictions are shifted
(i.e., shrunken) towards the overall mean. See Fig. 9.4. p. 201
slice notation The use of square brackets to access elements in a data struc-
ture, for example, vectors, lists, and data frames. For example, in a vector
called A, we can run A[2] to pick the second element in A. In a data
frame called B, we can run B[3,5] to pick the element found in the third
row, fifth column; B[,5] picks all the rows in the fifth column; and B[3,]
picks all the columns but only the third row. Because data frames have
two dimensions, we use a comma to separate rows from columns when
using slice notation. p. 25 and p. 29
string Sequence of characters. Each of the following is a string in R: “hello”,
“I like chocolate”, “384”, “a”, “!!”, “dogs and cats?”. Notice that any
sequence of characters will be a string as long as you have quotes around it.
p. xxi (preface) and p. 24
t-test Perhaps the most popular statistical test used in null hypothesis significance
testing (NHST). t-tests are used to determine whether there’s a statistical dif-
ference in means between two groups. Unlike ANOVAs, t-tests can’t
handle more than two groups. As in ANOVAs, t-tests require the
response/dependent variable to be continuous. p. 3
tibble Data structure that is more efficient than a data frame. Unlike data
frames, tibbles require a package to be used (tibble, which comes with
tidyverse). For more information (and examples), check the documentation
at https://tibble.tidyverse.org. p. 31 and p. 40
tidy data “Tidy data” (Wickham et al. 2014) is data with a specific format
that optimizes both exploratory data analysis and statistical analysis. A data
frame or tibble is tidy if every variable goes in a column and every
column is a variable (see more at https://www.tidyverse.org/packages/).
p. 37
Tukey HSD Tukey’s honestly significant difference generates multiple pair-
wise comparisons while controlling for Type I error, which would be a
problem if we ran multiple t-tests. p. 53
Type I error Also known as a false positive conclusion, Type I error is the
rejection of the null hypothesis when it is, in fact, true. p. 4
Type II error Also known as a false negative conclusion, Type II error is the
failure to reject the null hypothesis when it is, in fact, false. p. 8
variable Object used to store information to be referenced later. For
example, if we establish that x = 10, then every time we use x we will
be essentially using 10, that is, the value of the variable in question. p. xxi
(preface) and p. 22
vector Basic data structure in R. A vector can hold multiple elements as long
as all its elements belong to the same class. For example, we can have a
vector with numbers, a vector with characters, and so on. In a typical
dataset, each column is a vector. p. 24
widely applicable information criterion (WAIC) Also known as the Watanabe-Akaike information criterion (Watanabe 2010). A cross-validation method
to assess the fit of a (Bayesian) model. It is calculated by averaging the
log-likelihood over the posterior distribution taking into account individual
data points. See McElreath (2020, p. 191) for a discussion on the differences
among WAIC, AIC, BIC, and DIC. For advantages of WAIC over DIC,
see Gelman et al. (2014b). p. 233
working directory The directory (or folder) assumed by R to be the loca-
tion of your files. As a result, if you need to reference a file in your working
directory, you can just tell R the name of the file and R will know where it
is. You can find out what your working directory is by running the getwd()
function, and you can change it with the setwd() function. You can also
reference files that are not in your working directory, but doing so will
require that you fully specify the path to the file in question. For
example, if a file called myFile.csv is located in your (Mac OS) desktop,
and your working directory is not currently set to your desktop, you will
need to tell R that the file you want is “~/Desktop/myFile.csv”. If the
file were in your working directory, you would instead tell R that the
file is “myFile.csv”. p. 19
REFERENCES
Agresti, A. (2002). Categorical data analysis. John Wiley & Sons, Hoboken, New Jersey.
Agresti, A. (2010). Analysis of ordinal categorical data. John Wiley & Sons, Hoboken, New
Jersey.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions
on Automatic Control, 19(6):716–723.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R.
Cambridge University Press, New York.
Bache, S. M. and Wickham, H. (2014). magrittr: A forward-pipe operator for R. R package
version 1.5.
Barr, D., Levy, R., Scheepers, C., and Tily, H. (2013). Random effects structure for
confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language,
68(3):255–278.
Bartoń, K. (2020). MuMIn: Multi-model inference. R package version 1.43.17.
Bates, D., Kliegl, R., Vasishth, S., and Baayen, H. (2015a). Parsimonious mixed models.
arXiv:1506.04967.
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015b). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48.
Bayes, T. (1763). LII. An essay towards solving a problem in the doctrine of chances. By
the late Rev. Mr. Bayes, F.R.S. communicated by Mr. Price, in a letter to John
Canton, A.M.F.R.S. a letter from the late Reverend Mr. Thomas Bayes, F.R.S.,
to John Canton, M.A. and F.R.S. Author(s): Mr. Bayes and Mr. Price. Philosophical
Transactions (1683–1775), 53:370–418.
Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L. (2011). Handbook of Markov
Chain Monte Carlo. Chapman and Hall/CRC, Boca Raton.
Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package
brms. The R Journal, 10(1):395–411.
Campbell, J. P. (1982). Editorial: Some remarks from the outgoing editor. Journal of
Applied Psychology, 67(6):691–700.
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Bru-
baker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming
language. Journal of Statistical Software, Articles, 76(1):1–32.
Chater, N., Tenenbaum, J. B., and Yuille, A. (2006). Probabilistic models of cognition:
Conceptual foundations. Trends in Cognitive Sciences, 10(7):287–344.
Christensen, R. H. B. (2019). ordinal: Regression models for ordinal data. R package version
2019.12-10.
Cuetos, F. and Mitchell, D. C. (1988). Cross-linguistic differences in parsing: Restric-
tions on the use of the late closure strategy in Spanish. Cognition, 30(1):73–105.
Cunnings, I. (2012). An overview of mixed-effects statistical models for second language
researchers. Second Language Research, 28(3):369–382.
Dale, A. I. (1982). Bayes or Laplace? An examination of the origin and early applications
of Bayes’ theorem. Archive for History of Exact Sciences, 27:23–47.
Dowle, M. and Srinivasan, A. (2019). data.table: Extension of ‘data.frame’. R package
version 1.12.8.
Efron, B. and Morris, C. (1977). Stein’s paradox in statistics. Scientific American, 236
(5):119–127.
Fairbanks, M. (2020). tidytable: Tidy Interface to ’data.table’. R package version 0.5.5.
Fernández, E. M. (2002). Relative clause attachment in bilinguals and monolinguals. In
Advances in Psychology, volume 134, pages 187–215. Elsevier, Amsterdam.
Fodor, J. D. (2002). Prosodic disambiguation in silent reading. In Hirotani, M., editor,
Proceedings of the 32nd annual meeting of the North East linguistic society. GLSA Publica-
tions, Amherst, MA.
Fullerton, A. S. and Xu, J. (2016). Ordered regression models: Parallel, partial, and non-
parallel alternatives. Chapman & Hall/CRC, Boca Raton.
Garcia, G. D. (2014). Portuguese Stress Lexicon. Comprehensive list of non-verbs in Portuguese.
Available at http://guilhermegarcia.github.io/psl.html.
Garcia, G. D. (2017). Weight gradience and stress in Portuguese. Phonology, 34(1):41–79.
Project materials available at http://guilhermegarcia.github.io/garciaphon2017.html.
Garcia, G. D. (2020). Language transfer and positional bias in English stress. Second Lan-
guage Research, 36(4):445–474.
Gelman, A. (2008a). Objections to Bayesian statistics. Bayesian Analysis, 3(3):445–449.
Gelman, A. (2008b). Scaling regression inputs by dividing by two standard deviations.
Statistics in Medicine, 27(15):2865–2873.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014a).
Bayesian data analysis, volume 2. Chapman & Hall/CRC, Boca Raton, 3rd edition.
Gelman, A. and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical
models. Cambridge University Press, New York.
Gelman, A., Hwang, J., and Vehtari, A. (2014b). Understanding predictive information
criteria for Bayesian models. Statistics and Computing, 24(6):997–1016.
Gelman, A. and Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics.
British Journal of Mathematical and Statistical Psychology, 66(1):8–38.
Gelman, A. and Su, Y.-S. (2018). arm: Data analysis using regression and Multilevel/Hier-
archical Models. R package version 1.10-1.
Ghasemi, A. and Zahediasl, S. (2012). Normality tests for statistical analysis: A guide for
non-statisticians. International Journal of Endocrinology and Metabolism, 10(2):486.
Goad, H., Guzzo, N. B., and White, L. (2020). Parsing ambiguous relative clauses in L2
English: Learner sensitivity to prosodic cues. Studies in Second Language Acquisition.
To appear.
259 References 259
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N.,
and Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and
power: A guide to misinterpretations. European Journal of Epidemiology, 31(4):337–
350.
Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction. Walter de
Gruyter, Berlin, 2nd edition.
Herschensohn, J. (2013). Age-related effects. In Herschensohn, J. and Young-Scholten, M.,
editors, The Cambridge handbook of second language acquisition, pages 317–337. Cambridge
University Press, New York.
Hu, Y. and Plonsky, L. (2020). Statistical assumptions in L2 research: A systematic review.
Second Language Research.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or
not) and towards logit mixed models. Journal of Memory and Language, 59(4):434–446.
Kay, M. (2020). tidybayes: Tidy Data and Geoms for Bayesian Models. R package version
2.1.1.
Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan.
Academic Press, London, 2nd edition.
Kruschke, J. K. and Liddell, T. M. (2018). Bayesian data analysis for newcomers. Psy-
chonomic Bulletin & Review, 25(1):155–177.
Kuznetsova, A., Brockho, P., and Christensen, R. (2017). ImerTest package: Tests in
linear mixed effects models. Journal of Statistical Software, Articles, 82(13):1–26.
Lee, M. D. and Wagenmakers, E.-J. (2014). Bayesian cognitive modeling: A practical course.
Cambridge University Press, Cambridge.
Levshina, N. (2015). How to do linguistics with R: Data exploration and statistical analysis.
John Benjamins Publishing Company, Amsterdam.
Loewen, S. (2012). The role of feedback. In Gass, S. and Mackey, A., editors, The Rou-
tledge handbook of second language acquisition, pages 24–40. Routledge, New York.
Luke, S. G. (2017). Evaluating significance in linear mixed-effects models in R. Behavior
Research Methods, 49(4):1494–1502.
Lyster, R. and Saito, K. (2010). Oral feedback in classroom SLA: A meta-analysis.
Studies in Second Language Acquisition, 32(2):265–302.
Mackey, A. and Gass, S. M. (2016). Second language research: Methodology and design. Rou-
tledge, New York, 2nd edition.
Marsden, E., Morgan-Short, K., Thompson, S., and Abugaber, D. (2018). Replication
in second language research: Narrative and systematic reviews and recommendations
for the field. Language Learning, 68(2):321–391.
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., and Bates, D. (2017). Balancing
Type I error and power in linear mixed models. Journal of Memory and Language,
94:305–315.
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan.
Chapman & Hall/CRC, Boca Raton, 2nd edition.
McGrayne, S. B. (2011). The theory that would not die: How Bayes’ rule cracked the enigma
code, hunted down Russian submarines, & emerged triumphant from two centuries of contro-
versy. Yale University Press, New Haven.
Nakagawa, S. and Schielzeth, H. (2013). A general and simple method for obtaining R2
from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4
(2):133–142.
Norouzian, R., de Miranda, M., and Plonsky, L. (2018). The Bayesian revolution in
second language research: An applied approach. Language Learning, 68(4):1032–1075.
260 References 260
SUBJECT INDEX
FUNCTION INDEX
ifelse() 149
install.packages() 36