Baumer - 2015 - A Data Science Course For Undergraduates Thinking
Taylor & Francis, Ltd., American Statistical Association are collaborating with JSTOR to
digitize, preserve and extend access to The American Statistician
This content downloaded from 194.57.109.68 on Tue, 04 Jun 2024 12:09:14 +00:00
All use subject to https://about.jstor.org/terms
Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/TAS
Ben Baumer
© 2015 American Statistical Association DOI: 10.1080/00031305.2015.1081105 The American Statistician, November 2015, Vol. 69, No. 4
What we describe in this article is a course at a liberal arts college in data science that is atypical within the current statistics curriculum. Nevertheless, what we present here is wholly consistent with the vision for the future of the undergraduate statistics curriculum articulated by Horton (2015) and the ASA Undergraduate Guidelines Workgroup (2014). The purpose of this course is to prepare students to work with these modern data streams as described above. Some of the topics covered in this course have historically been the purview of computer science. But while the course we describe indisputably contains elements of statistics and computer science, it just as indisputably belongs exclusively to neither discipline. Furthermore, it is not simply a collection of topics from existing courses in statistics and computer science, but rather an integrated presentation of something more holistic. Nevertheless, the course consists of a series of largely independent modules, each of which could be expanded into stand-alone curricular elements (e.g., a full-semester, half-semester, or interterm course). Thus, while some readers might view this article as a specification for a single new offering, others might use it as a blueprint for a significant expansion of the existing statistics curriculum.

2. BACKGROUND AND RELATED WORK

While many believe that to understand statistical theory, a solid foundation in mathematics is necessary, it seems clear that computing skills are required for one to become a functional, practicing statistician. In making this analogy, Nolan and Temple Lang (2010) argued strongly for a larger presence for computing in the statistics curriculum. Citing their work, the American Statistical Association Undergraduate Guidelines Workgroup (2014) underscores the importance of computing skills (even using the words "data science") in undergraduate majors in statistical science. Here, by computing, we mean statistical programing in an environment such as R. It is important to recognize this as a distinct (and more valuable) skill than being able to perform statistical computations in a menu-and-click environment such as Minitab. Indeed, Nolan and Temple Lang (2010) went even further, advocating for the importance of teaching general command-line programs, such as grep (for regular expressions) and other common UNIX commands that really have nothing to do with statistics, per se, but are very useful for cleaning and manipulating documents of many types.

Although practicing statisticians seem to largely agree that the lion's share of the time spent on many projects is devoted to data cleaning and manipulation (or data wrangling, as it is often called; Kandel et al. 2011), the motivation for adding these skills to the statistics curriculum is not simply convenience, nor should a lack of skills or interest on the part of instructors stand in the way. Finzer (2013) described a "data habit of mind... that grows out of working with data." (This is not to be confused with "statistical thinking" as articulated by Chance (2002), which contains no mention of computing.) In this case, a data habit of mind comes from experience working with data, and is manifest in people who start thinking about data formatting before data are collected (Zhu et al. 2013), and have foresight about how data should be stored that is informed by how they will be analyzed. Furthermore, while some might view data management as a perfunctory skill on intellectual par with data entry, there are others thinking more broadly about data. (Examples of poor data management abound, but one of the most common is failing to separate the actual data from their analysis. Microsoft Excel is a particular villain in this arena, where merged cells, rounding induced by formatted columns, and recomputed formulas can result in the ultimate disaster: losing the original recorded data.) Just as Wilkinson (2006) brought structure to graphics through a "grammar," Wickham (2014) and Wickham and Francois (2015) brought structure to data manipulation through the five "verbs": select, filter, mutate, arrange, and summarise. These common single-table data manipulation operations are the practical descendants of theoretical work on data structures by computer scientists who developed notions of normal forms, relational algebras, and database management systems.

While the emphasis on computing within the statistics curriculum may be growing, it belongs to a larger, more gradual evolution in statistics education toward data analysis (with computers) and encourages us to reflect on the shifting boundaries between statistics and computer science. Moore (1998), viewing statistics as an ongoing quest to "reason about data, variation, and chance," saw statistical thinking as a powerful anchor that can prevent statistics from being "overwhelmed by technology." Cobb (2011) argued for an increased emphasis on conceptual topics in statistics, but also saw the development of statistical theory as an anachronistic consequence of a lack of computing power (Cobb 2007). Moreover, whereas much of statistical theory is designed to make the strongest inference possible from what are historically scarce data, we are now often challenged trying to draw meaningful conclusions from an abundance of data.

Breiman (2001) articulated the distinction between "statistical data models" and "algorithmic models" that in many ways characterizes the relationship between statistics and machine learning, viewing the former as being far more limited than the latter. And while machine learning and data mining have traditionally been subfields of computer science, Finzer (2013) noted that data science does not have a natural home within traditional departments, belonging exclusively to neither mathematics, statistics, nor computer science. Indeed, in Cleveland's (2001) seminal action plan for data science, he envisions data science as a "partnership" between statisticians (i.e., data analysts) and computer scientists.

3. THE COURSE

In this article, we describe an experimental course, called "Data Science" (a.k.a., SDS 292) and now offered through the Statistical & Data Sciences Program at Smith College, that was offered in the Fall of 2013 and again in the Fall of 2014. In the first year, 18 students completed the course, as did another 24 the following year. The prerequisites are an introductory statistics course and some programing experience. Existing courses at the University of California at Berkeley, as well as Macalester and St. Olaf Colleges, are the pedagogical cousins of SDS 292 (see Hardin et al. 2015 for a comprehensive comparison).
This represents a typical data science research proj
[...] is to produce students who have confidence and foundational skills (not necessarily expertise) to tackle each step in this modern data analysis cycle, both immediately and in their future careers.

3.1 Day One

The first class provides an important opportunity to hook students into data science. Since most students do not have a firm grasp of what data science is, and in particular, how it differs from statistics, the diagram in Figure 1 can help draw these distinctions. The goal is to illustrate the richness and vibrance of data science, and emphasize its breadth by highlighting the different skills necessary for each task. Students should be sure within the first 5 min of the semester that there is something interesting and useful for them to learn in the course.

Next, we engage students immediately by exposing them to a recent, relevant example of data science. Students are asked to read a provocative article by DiGrazia et al. (2013) and a rather ambitious editorial in The Washington Post written by Rojas (2013), a sociologist, in which he claims that Twitter will put political pollsters out of work. (See Finger and Dutta 2014 for a counterpoint.)

[...] could also use this discussion as a segue to topics in experimental design, or introduce the ASA's Ethical Guidelines for Statistical Practice (Committee on Professional Ethics 1999).

Finally, students are asked directly how they would go about replicating this study. That is, they are asked to identify all of the steps necessary to conduct this study, from collecting the data to writing a report, and to think about whether they could accomplish this with their current skills and knowledge. While students are able to generate many of the steps as a broad outline, most are unfamiliar with the practical considerations necessary. For example, students recognize that the data must be downloaded from Twitter, but few have any idea how to do that. This leads to the concept of an application programing interface (API), which is provided by Twitter (and can be used in several environments, notably R and Python). Moreover, most students do not recognize the potential difficulties of storing 500 million tweets. How big is a tweet? Where and how could you store them? Spatial concerns also arise: does it matter in which Congressional district the person who tweeted was? Most students in the class have experience with R, and thus are comfortable building a regression model and overlaying it on a scatterplot. But few have considered anything beyond the default plotting options. How do you add annotations to the plot to make it more understandable? What principles of data graphic design
would help to determine which annotations are necessary or appropriate?

Students are then advised that this course will give them the tools necessary to carry out a similar study. This will involve improving their skills with programing, data management, data visualization, and statistical computing. The goal is to leave students feeling energized, but open to exploring their newly acquired, more complex understanding of data science.

3.2 Data Visualization

From the first day of class, students are reminded that statistical work is of limited value unless it can be communicated to nonstatisticians (Swires-Hennessy 2014). More specifically, most data scientists working in government or industry (as opposed to those in academia) will work for a boss who generally possesses less technical knowledge than the employee. A perfect, but complicated, statistical model may not be persuasive to nonstatisticians if it cannot be communicated clearly. Data graphics provide a mechanism for illustrating relationships among data, but most students have never been exposed to structured ideas about how to create effective data graphics.

In SDS 292, the first 2 weeks of class are devoted to data visualization. This serves two purposes: (1) it is an engaging hook for a science course and (2) it gives students with weaker programing backgrounds a chance to get comfortable in R. Students read the classic text of Tufte (1983) in its entirety, as well as excerpts from Yau (2013). The former provides a wonderfully cantankerous account of what not to do when creating data graphics, as well as thoughtful analyses of how data graphics should be constructed. Students take delight in critiquing data graphics that they find online through the lens crafted by Tufte. The latter text, along with Yau (2011), provides many examples of interesting data visualizations that can be used in the beginning of class to inspire students to think broadly about what can be done with data (e.g., data art). Moreover, it provides a well-structured taxonomy for composing data graphics that gives students an orientation into data graphic design. For example, a data graphic that uses color as a visual cue in a Cartesian coordinate system is what we commonly call a "heat map." Students are also exposed to the hierarchy of visual perception that stems from work by Cleveland (2001).

Homework questions from this part of the course focus on demonstrating understanding by critiquing data graphics found "in the wild," an exercise that builds confidence (i.e., "Geez, I already know more about data visualization than this guy...") and encourages critical thinking. Computational assignments introduce students to some of the less trivial aspects of annotating data graphics in R (e.g., adding textual annotations and manipulating colors, scales, legends, etc.). We discuss additional topics in data visualization in Section 3.6.

3.3 Data Manipulation/Data Wrangling

As noted earlier, it is a common refrain among statisticians that "cleaning and manipulating the data" takes up an overwhelming majority of the time spent on a statistical project. In the introductory class, we do everything we can to shield students from this reality, exposing them only to carefully curated datasets. By contrast, in SDS 292 students are expected to master a variety of common data manipulation techniques. The term data management has a boring, IT connotation, but there is a growing acknowledgment that such data wrangling, or data manipulation, skills are not only valuable, but in fact belong to a broader intellectual discipline (Wickham 2014; Horton, Baumer, and Wickham 2015). One of the primary goals of SDS 292 is to develop students' capacity to "think with data" (Nolan and Temple Lang 2010), in both a practical and theoretical sense.

Over the next 3 weeks, students are given rapid instruction in data manipulation in R and SQL. In the spirit of the data manipulation "verbs" advocated by Wickham and Francois (2015), students learn how to perform the most fundamental data operations in both R and SQL, and are asked to think about their connection.

• select: subset variables (SELECT in SQL, select() in R (dplyr))
• filter: subset rows (WHERE, HAVING in SQL, filter() in R)
• mutate: add new columns (... AS ... in SQL, mutate() in R)
• summarise: reduce to a single row (GROUP BY in SQL, summarise(group_by()) in R)
• arrange: reorder the rows (ORDER BY in SQL, arrange() in R)

By the end, students are able to see that an SQL query containing

    SELECT ... FROM a JOIN b WHERE ... GROUP BY ... HAVING ... ORDER BY ...

is equivalent to a chain of R commands involving

    a %>%
      select(...) %>%
      filter(...) %>%
      inner_join(b, ...) %>%
      group_by(...) %>%
      summarise(...) %>%
      filter(...) %>%
      arrange(...)

A summary of analogous R and SQL syntax is shown in Table 1.

Moreover, students learn to determine for themselves, based on the attributes of the data (most notably size), which tool is more appropriate for the type of analysis they wish to perform. They learn that R stores data in memory, so that the size of the
Table 1. Conceptually analogous SQL and R commands. Suppose a and b are SQL tables or R data.frames.

Concept | SQL | R (dplyr)
Filter by rows and columns | SELECT col1, col2 FROM a WHERE col3 = 'x' | select(filter(a, col3 == 'x'), col1, col2)
Aggregate by rows | SELECT id, sum(col1) as total FROM a GROUP BY id | summarise(group_by(a, id), total = sum(col1))
Merge two tables | SELECT * FROM a JOIN b ON a.id = b.id | inner_join(x=a, y=b, by='id')
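The correspondences in Table 1 can be checked on a small example. The sketch below is only an illustration: it uses Python's built-in sqlite3 module to run the SQL column of the table, and plain Python comprehensions as a stand-in for the R column (the course itself uses R and dplyr, not Python). The toy rows, the `label` column in b, and the added ORDER BY clauses (used only to make results comparable) are assumptions, not taken from the article.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy tables a and b (illustrative only, not from the article).
rows_a = [
    {"id": 1, "col1": 10, "col2": "p", "col3": "x"},
    {"id": 1, "col1": 5, "col2": "q", "col3": "y"},
    {"id": 2, "col1": 7, "col2": "r", "col3": "x"},
]
rows_b = [{"id": 1, "label": "one"}, {"id": 3, "label": "three"}]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (id INTEGER, col1 INTEGER, col2 TEXT, col3 TEXT)")
con.execute("CREATE TABLE b (id INTEGER, label TEXT)")
con.executemany("INSERT INTO a VALUES (:id, :col1, :col2, :col3)", rows_a)
con.executemany("INSERT INTO b VALUES (:id, :label)", rows_b)

# Filter by rows and columns: WHERE keeps rows, the SELECT list keeps columns.
sql_filter = con.execute(
    "SELECT col1, col2 FROM a WHERE col3 = 'x' ORDER BY col1"
).fetchall()
py_filter = sorted((r["col1"], r["col2"]) for r in rows_a if r["col3"] == "x")

# Aggregate by rows: GROUP BY id, then sum col1 within each group.
sql_agg = con.execute(
    "SELECT id, sum(col1) AS total FROM a GROUP BY id ORDER BY id"
).fetchall()
totals = defaultdict(int)
for r in rows_a:
    totals[r["id"]] += r["col1"]
py_agg = sorted(totals.items())

# Merge two tables: an inner join keeps only ids present in both a and b.
sql_join = con.execute(
    "SELECT a.id, a.col1, b.label FROM a JOIN b ON a.id = b.id ORDER BY a.col1"
).fetchall()
py_join = sorted(
    ((ra["id"], ra["col1"], rb["label"])
     for ra in rows_a for rb in rows_b if ra["id"] == rb["id"]),
    key=lambda t: t[1],
)

print(sql_filter == py_filter, sql_agg == py_agg, sql_join == py_join)
```

Each pair of results agrees, which is the point of the table: the same operation on data is being expressed in two notations.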
data with which you wish to work is limited by the amount of memory available to the computer, whereas SQL stores data on disk, and is thus much better suited for storing large amounts of data. However, students learn to appreciate the virtually limitless array of operations that can be performed on data in R, whereas the number of useful computational functions in SQL is limited. Thus, students learn to make choices about software in the context of hardware, and data.

Care must be taken to make sure that what students are learning at this stage of the course is not purely programing syntax (although that is a desired side effect). Rather, they are learning more generally about operations that can be performed on data, in two languages. To reinforce this, students are asked to think about a physical representation of what these operations do. For example, Figure 2 illustrates conceptually what happens when row filtering is performed on a data.frame in R or a table in SQL. Less trivially, Figure 3 illustrates the useful gather operation in R.

Figure 2. The filter operation.

Figure 3. The gather operation.

3.4 Computational Statistics

Now that students have the intellectual and practical tools to work with data and visualize them, the third part of the course provides students with computational statistical methods for analyzing data in the interest of answering a statistical question. There are two major objectives for this section of the course:

1. Developing facility constructing interval estimates using resampling techniques (e.g., the bootstrap). Understanding the nature of variation in observational data and the benefit of presenting interval estimates over point estimates.
2. Developing the capacity to fit and assess regression models, beginning with simple linear regression (which all students should have already seen in their intro course), but continuing to include multiple and logistic regression, and a few techniques for automated feature selection.

The first objective underlines the statistical elements of the course, encouraging students to put observations in context by demonstrating an understanding of variation in the data. The second objective, while not a substitute for a course in regression analysis, helps reinforce a practical understanding of regression, and sets the stage for the subsequent machine learning portion of the course.

3.5 Machine Learning

Two weeks are devoted to introductory topics in machine learning. Some instructors may find that this portion of the course overlaps too heavily with existing offerings in computer science or applied statistics. Others might argue that the topic will not be of interest to students who are primarily interested in the communication and visualization side of data science. However, a brief introduction to machine learning gives students a functional framework for testing algorithmic models. Assignments force them to grapple with the limitations of large datasets, and pursue statistical techniques that are beyond introductory.

To appreciate machine learning, one must recognize the differences between the mindset of the data miner and the statistician. Breiman (2001) distinguished two types of models f for y, the response variable, in terms of x, a vector of explanatory variables. One might consider a data model f such that y ~ f(x). If it can be determined that f is a reasonable approximation of the real-world process by which y was generated from x, then we can proceed to make inferences about f. The goal is to learn about that unknown real process, and the conceit is that f is a meaningful reflection of it. Alternatively, one might construct an algorithmic model f, such that y ~ f(x), and use f to predict unobserved values of y. If it can be determined that f does in fact do a good job of predicting values of y, one might not care to learn much about f. In the former case, since we want to learn about f, a simpler model may be preferred. Conversely, in the latter case, since we want to predict new values of y, we may be indifferent to model complexity (other than concerns about overfitting and scalability).

These are very different perspectives to take toward learning from data, so after reinforcing the former perspective that students learned in their introductory course, SDS 292 students are exposed to the latter point of view. These ideas are further explored in a class discussion about Chris Anderson's famous article on The End of Theory (Anderson 2008), in which he argues that the abundance of data and computing power will eliminate the need for scientific modeling.

The notions of cross-validation and the "confusion" matrix frame the machine learning unit (receiver operating characteristic (ROC) curves are also presented as an evaluation technique). The goal is typically to predict the outcome of a binary response variable. Once students understand that these predictions can be evaluated using a confusion matrix, and that models can be tested via cross-validation schemes, the rest of the unit is spent learning classification techniques. The following techniques are presented, mainly at a conceptual and practical level: decision/classification trees, random forests, k-nearest neighbor, naïve Bayes, artificial neural networks, and ensemble methods.
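To make the two framing ideas concrete, here is a minimal sketch of cross-validation and a confusion matrix for a binary response, using a k-nearest neighbor classifier on a single explanatory variable. It is written in plain Python rather than the R used in the course, and the function names, the fold assignment by index, and the toy data are all assumptions made for illustration, not material from the course.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary 0/1 response."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # predicted 1, actually 1
        "fn": counts[(1, 0)],  # predicted 0, actually 1
        "fp": counts[(0, 1)],  # predicted 1, actually 0
        "tn": counts[(0, 0)],  # predicted 0, actually 0
    }


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def cross_validate(xs, ys, folds=5, k=3):
    """Predict every observation with a model fit on the other folds only."""
    actual, predicted = [], []
    for fold in range(folds):
        test_idx = [i for i in range(len(xs)) if i % folds == fold]
        train_idx = [i for i in range(len(xs)) if i % folds != fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            actual.append(ys[i])
            predicted.append(knn_predict(train_x, train_y, xs[i], k))
    return confusion_matrix(actual, predicted)


# Hypothetical toy data: one explanatory variable and a binary response.
xs = [0.1, 0.4, 0.5, 0.9, 1.1, 1.3, 1.8, 2.0, 2.2, 2.6]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

cm = cross_validate(xs, ys, folds=5, k=3)
accuracy = (cm["tp"] + cm["tn"]) / len(ys)
print(cm, accuracy)
```

Because every prediction comes from a model that never saw the observation being predicted, the resulting confusion matrix estimates out-of-sample performance; swapping `knn_predict` for another classifier leaves the evaluation scheme unchanged.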
One of the most satisfying aspects of this unit is that students can tackle a massive dataset. Past instances of the KDD Cup (http://www.sigkdd.org/kddcup/index.php) are an excellent source for such examples. We explore data from the 2008 KDD Cup on breast cancer. Each of the n observations contains digitized data from an X-ray image of a breast. Each observation corresponds to a small area of a particular breast, which may or may not depict a malignant tumor; this provides the binary response variable. In addition to a handful of well-defined variables ((x, y)-location, etc.), each observation has 117 nameless attributes, about which no information is provided. Knowing nothing about what these variables mean, students recognize the need to employ machine learning techniques to sift through them and find relationships. The size of the data and number of variables make manual exploration of the data impractical.

Students are asked to take part in a multi-stage machine learning "exam" (Cohen and Henle 1995) on this breast cancer data. In the first stage, students are given several days to work alone and try to find the best logistic regression model that fits the data. In the second stage, students form groups of three, discuss the strengths and weaknesses of their respective models, and then build a classifier, using any means available to them, that best fits the data. (These classifiers are ultimately evaluated on as yet unseen data.) The third stage of the exam is a traditional in-class exam.

4. COMPUTING

Practical, functional programing, and computational abilities are essential for a data scientist, and as such no attempt is made to shield students from the burden of writing their own code. Copious examples are given, and detailed lecture notes containing annotated computations in R are disseminated each class. Lectures jump between illustrating concepts on the blackboard and writing code on the computer projected overhead, and students are expected to bring their laptops to class each day and participate actively. While it is true that many of the students struggle with the programing aspect of the course, even those that do express enthusiasm and satisfaction as they become more comfortable. Newly focused on becoming data scientists, several students will go on to take subsequent courses on data structures or algorithms offered by the computer science department.

In this course, programing occurs exclusively in R and SQL. Others may assert that Python is also necessary, and future incarnations of this course may include more Python. In my view these are the three must-have languages for data science. (SQL is a mature technology that is widely used, but useful for a specific purpose. R is a flexible, extensible platform that is specifically designed for statistical computing, and represents the current state of the art. Python has become something of a lingua franca, capable of performing many of the data analysis operations otherwise done in R, but also being a full-fledged general purpose programing language with lots of supporting packages and documentation. At Smith, all introductory computer science students learn Python, and all introductory statistics students in the statistical and data sciences program learn R. However, it is not clear yet how large the intersection of these two groups is. It is probably easier for those who know Python to learn R than it is for those who know R to learn Python, and thus the decision was made in this instance to avoid Python and focus on R. Other instructors may make different choices without disruption.)

[...] via online tutorials and self-study. Here again, some experience and practice are important.

For students, prior programing experience is essential. Experience with R is not required, and in my experience, computer science majors with weaker statistical backgrounds usually fare better than students with stronger statistical backgrounds but less programing experience. This is a demanding course that requires most students to spend a substantial amount of time working through assignments. However, even students who struggle are so convinced that what they are learning is useful that there are few serious complaints. Nevertheless, one could certainly experiment with slowing down the pace of the course.
5. ASSIGNMENTS

Reading assignments in SDS 292 are culled from a variety of textbooks and articles available for free online. Nontrivial sections are assigned from a number of texts (Tan, Steinbach, and Kumar 2006; Murrell 2010; Rajaraman and Ullman 2011; James et al. 2013). (Please see our supplementary materials for more information.)

Concepts from the readings are developed further during the lecture periods in conjunction with implementations demonstrated in R. Homework, consisting of conceptual questions requiring written responses as well as computational questions requiring coding in R, is due approximately every 2 weeks. Two exams are given, both of which have in-class and take-home components. The first exam is given after the first two modules, and focuses on data visualization and data manipulation principles demonstrated in written form. The second exam unfolds over 2 weeks, and focuses on the challenging breast cancer classification problem discussed above. An open-ended project (described below) brings the semester to a close. More details on these assignments, including sample questions, are presented in our supplementary materials.

Project. The culmination of the course is an open-ended term project that students complete in groups of three. Only three conditions are given:

1. Your project must be centered around data
2. Your project must tell us something
3. To get an A, you must show something beyond what we have done in class

Just like in other statistics courses, the project is segmented so that each group submits a proposal that has to be approved before the group proceeds (Halvorsen and Moore 2001). The final deliverable is a 10 min in-class presentation as well as a written "blog post" crafted in R Markdown (Allaire et al. 2015). Examples of successful projects are presented in the supplementary materials.

6. EPILOG

The feedback that I have received on this course, through informal and formal evaluations, has been nearly universally positive. In particular, the 42 students (mostly from Smith but also including five students from three nearby colleges) seemed convinced that they learned "useful things." More specific feedback is available in the supplementary materials.

Several of these students were able to channel these useful skills into their careers almost immediately. Internships and job offers followed in the spring for a handful of students: two students spent their summers at NIST (one of whom later accepted a full-time job offer from MIT's Lincoln Laboratory; the other is headed to the Ph.D. program in statistics at Berkeley and is a trainee in the NSF-funded "Environment and Society: Data Science for the 21st Century" research program), one student landed a job as a research analyst at the nonprofit research organization MDRC, and three students have joined the new Data Science Development Program at MassMutual. External validation also came during the ASA Five College DataFest, the local version of the national data analysis competition (Gould et al. 2014). ASA DataFest is an open-ended data analysis competition that challenges students working in teams of up to five to develop insights from a difficult dataset. In each of the last 2 years, a team of five students from Smith (four of whom had taken this course) won the Best In Show prize (one student was a member of both teams). In the first year in particular, skills developed in the course helped these students perform data manipulation tasks with considerably less difficulty than other groups. For example, each observation in this particular dataset included a date field, but the values were encoded as strings of text. Most groups struggled to work sensibly with these data, as familiar workflows were infeasible (e.g., the data was too large to open in Excel, so "Format Cells..." was not a viable solution). The winning group was able to quickly tokenize these strings in R and, having cleared this hurdle, had more time to spend on their analysis and interpretation.

7. DISCUSSION

It is clear that the popularity of data science has brought both opportunities and challenges to the statistics profession. While statisticians are openly grappling with questions about the relationship of our field to data science (Bartlett [...]; Davidian 2013a, 2013b; Franck 2013; Horton 2015; Wasserman 2015), there appears to be less conflict among computer scientists, who (rightly or wrongly) distinguish data science from statistics on the basis of the heterogeneity and lack of structure of the data with which data scientists, as opposed to statisticians, work (Dhar 2013). As Big Data (which is clearly related to, but too often conflated with, data science) is often associated with computer science, computer scientists tend to have an inclusive attitude toward data science.

A popular joke is that "a data scientist is a statistician who lives in San Francisco," but Hadley Wickham (2012), a Ph.D. statistician, floated a more cynical take on Twitter: "a data scientist is a statistician who is useful." Statisticians are the guardians of statistical inference, and it is our responsibility to educate practitioners about using models appropriately, and the hazards of ignoring model assumptions when making inferences. But many model assumptions are only truly met under idealized conditions, and thus, as Box (1979) eloquently argued, one must think carefully about when statistical inferences are valid. When they are not, statisticians are caught in the awkward position, as Wickham suggests, of always saying "no." This position can be dissatisfying.

If data science represents the new reality for data analysis, then there is a real risk to the field of statistics if we fail to embrace it. The damage could come on two fronts: first, we lose data science and all of the students who are interested in it to computer science; and second, the world will become populated by data analysts who do not fully understand or appreciate the importance of statistics. While the former blow would be damaging, the latter could be catastrophic, and not just for our profession. Conversely, while the potential that data science is a fad certainly exists, it seems less likely each day. It is hard to imagine waking up to a future in which decision-makers are not
interested in what data (however they may have been collected and however they may be structured)
can offer them.

Data science courses like the one described in this article provide a mechanism to develop students'
abilities to work with modern data, and these skills are quickly transitioning from desirable to
necessary.

SUPPLEMENTARY MATERIALS

Please see our supplementary materials for more detail about the additional topics listed in
Section 3.6, two successful class projects alluded to in Section 5, course feedback summarized in
Section 6, as well as three sample exam questions.

REFERENCES

Allaire, J., Horner, J., Marti, V., and Porte, N. (2015), markdown: Markdown Rendering for R. R package version 0.7.7. Available at http://CRAN.R-project.org/package=markdown. [340]
American Statistical Association Undergraduate Guidelines Workgroup (2014), "2014 Curriculum Guidelines for Undergraduate Programs in Statistical Science." Available at http://www.amstat.org/education/curriculumguidelines.cfm [335]
Anderson, C. (2008), "The End of Theory," Wired. Available at http://www.wired.com/science/discoveries/magazine/16-07/pb_theory [338]
Bartlett, R. (2013), "We Are Data Science," AMSTAT News. Available at http://magazine.amstat.org/blog/2013/10/01/we-are-data-science/ [340]
Box, G. E. (1979), "Some Problems of Statistics and Everyday Life," Journal of the American Statistical Association, 74, 1-4. [340]
Breiman, L. (2001), "Statistical Modeling: The Two Cultures," Statistical Science, 16, 199-215. Available at http://www.jstor.org/stable/2676681. [335,338]
Chance, B. L. (2002), "Components of Statistical Thinking and Implications for Instruction and Assessment," Journal of Statistics Education, 10. [335]
Cleveland, W. S. (2001), "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," International Statistical Review, 69, 21-26. Available at http://www.jstor.org/stable/1403527. [335,337]
Cobb, G. W. (2007), "The Introductory Statistics Course: A Ptolemaic Curriculum?" Technology Innovations in Statistics Education, 1, 1-15. Available at http://escholarship.org/uc/item/6hb3k0nz. [335]
Cobb, G. W. (2011), "Teaching Statistics: Some Important Tensions," Chilean Journal of Statistics, 2, 31-62. Available at http://chjs.deuv.cl/Vol2N1/ChJS-02-01-03.pdf. [335]
Cohen, D., and Henle, J. (1995), "The Pyramid Exam," Undergraduate Mathematics Education Trends, 7, 2. [339]
Committee on Professional Ethics (1999), "Ethical Guidelines for Statistical Practice." Available at http://www.amstat.org/about/ethicalguidelines.cfm, last accessed: 2015-05-19. [336]
Davenport, T. H., and Patil, D. (2012), "Data Scientist: The Sexiest Job of the 21st Century," Harvard Business Review. Available at http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1 [334]
Davidian, M. (2013a), "Aren't We Data Science?" AMSTAT News. Available at http://magazine.amstat.org/blog/2013/07/01/datascience/ [340]
Davidian, M. (2013b), "The ASA and Big Data," AMSTAT News. Available at http://magazine.amstat.org/blog/2013/06/01/the-asa-and-big-data/ [340]
Dhar, V. (2013), "Data Science and Prediction," Communications of the ACM, 56, 64-73. Available at http://cacm.acm.org/magazines/201[...]science-and-prediction/fulltext. [340]
DiGrazia, J., McKelvey, K., Bollen, J., and Rojas, F. (2013), "M [...] Votes: Social Media as a Quantitative Indicator of Political [...] Science Research Network. Available at http://ssrn.com/ [...] [336]
Finger, L., and Dutta, S. (2014), Ask, Measure, Learn: Using Social Media Analytics to Understand and Influence Customer Behavior, Sebastopol, CA: O'Reilly Media, Inc. [336]
Finzer, W. (2013), "The Data Science Education Dilemma," Technology Innovations in Statistics Education, 7, 1-9. Available at http://escholarship.org/uc/item/7gv0q9dc.pdf. [335]
Gould, R., Baumer, B., Çetinkaya Rundel, M., and Bray, A. (2014), "Big Data Goes to College," AMSTAT News. Available at http://magazine.amstat.org/blog/2014/06/01/datafest/ [340]
Halvorsen, K. T., and Moore, T. L. (2001), "Motivating, Monitoring, and Evaluating Student Projects," MAA Notes, pp. 27-32. [340]
Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., Murrell, P., Peng, R., Roback, P., Temple Lang, D., and Ward, M. D. (2015), "Data Science in the Statistics Curricula: Preparing Students to 'Think with Data'," The American Statistician, this issue. [335]
Harris, J. G., Shetterley, N., Alter, A. E., and Schnell, K. (2014), "It Takes Teams to Solve the Data Scientist Shortage," The Wall Street Journal. Available at http://blogs.wsj.com/cio/2014/02/14/it-takes-teams-to-solve-the-data-scientist-shortage/ [334]
Horton, N. J. (2015), "Challenges and Opportunities for Statistics and Statistical Education: Looking Back, Looking Forward," The American Statistician, 69, 138-145. [335,340]
Horton, N. J., Baumer, B. S., and Wickham, H. (2015), "Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics," CHANCE, 28, 40-50. Available at http://chance.amstat.org/2015/04/setting-the-stage/. [337]
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning, New York: Springer. Available at http://www-bcf.usc.edu/~gareth/ISL/. [340]
Kandel, S., Heer, J., Plaisant, C., Kennedy, J., van Ham, F., Riche, N. H., Weaver, C., Lee, B., Brodbeck, D., and Buono, P. (2011), "Research Directions in Data Wrangling: Visualizations and Transformations for Usable and Credible Data," Information Visualization, 10, 271-288. Available at http://research.microsoft.com/EN-US/UM/REDMOND/GROUPS/cue/infovis/. [335]
Linkins, J. (2013), "Let's Calm Down About Twitter Being Able To Predict Elections, Guys," The Huffington Post. Available at http://www.huffingtonpost.com/2013/08/14/twitter-predict-elections_n_3755326.html [336]
Lohr, S. (2009), "For Today's Graduate, Just One Word: Statistics," The New York Times. Available at http://www.nytimes.com/2009/08/06/technology/06stats.html [334]
Moore, D. S. (1998), "Statistics Among the Liberal Arts," Journal of the American Statistical Association, 93, 1253-1259. Available at http://www.jstor.org/stable/2670040. [335]
Murrell, P. (2010), Introduction to Data Technologies, Boca Raton, FL: Chapman and Hall/CRC. Available at https://www.stat.auckland.ac.nz/~paul/ItDT/. [340]
Nolan, D., and Temple Lang, D. (2010), "Computing in the Statistics Curricula," The American Statistician, 64, 97-107. Available at http://www.stat.berkeley.edu/users/statcur/Preprints/ComputingCurric3.pdf. [335,337]
Patil, D. (2011), Building Data Science Teams, Sebastopol, CA: O'Reilly Media, Inc. [336]
Rajaraman, A., and Ullman, J. D. (2011), Mining of Massive Datasets, Cambridge, UK: Cambridge University Press. Available at http://www.mmds.org/. [340]
Rojas, F. (2013), "How Twitter Can Help Predict an Election," The Washington Post. Available at http://www.washingtonpost.com/opinions/how-twitter-can-predict-an-election/2013/08/11/35ef885a-0108-11e3-96a8-d3b921c0924a_story.html [336]
Swires-Hennessy, E. (2014), Presenting Data: How to Communicate Your Message Effectively (1st ed.), West Sussex, UK: Wiley. Available at http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118489594.html. [337]
Tan, P. N., Steinbach, M., and Kumar, V. (2006), Introduction to Data Mining (1st ed.), Boston, MA: Pearson Addison-Wesley. Available at http://www-users.cs.umn.edu/~kumar/dmbook/index.php. [340]
Tufte, E. R. (1983), The Visual Display of Quantitative Information (2nd ed.), Cheshire, CT: Graphics Press. [337]
Wasserstein, R. (2015), "Communicating the Power and Impact of Our Profession: A Heads Up for the Next Executive Directors of the ASA," The American Statistician, 69, 96-99. [340]
Wickham, H. (2012), "My Cynical Definition: A Data Scientist is a Statistician Who is Useful)." Available at https://twitter.com/hadleywickham/status/263750846246969344 [340]
Wickham, H. (2014), "Tidy Data," The Journal of Statistical Software, 59, 1-23. Available at http://vita.had.co.nz/papers/tidy-data.html. [335,337]
Wickham, H., and Francois, R. (2015), dplyr: A Grammar of Data Manipulation. R package version 0.4.2. Available at http://CRAN.R-project.org/package=dplyr. [335,337]
Wilkinson, L. (2006), The Grammar of Graphics, New York: Springer. [335]
Yau, N. (2011), Visualize This: The Flowing Data Guide to Design, Visualization, and Statistics, Indianapolis, IN: Wiley. [337]
Yau, N. (2013), Data Points: Visualization that Means Something, Indianapolis, IN: Wiley. [337]
Zhu, Y., Hernandez, L. M., Mueller, P., Dong, Y., and Forman, M. R. (2013), "Data Acquisition and Preprocessing in Studies on Humans: What is Not Taught in Statistics Classes?" The American Statistician, 67, 235-241. [335]