A Data Science Course for Undergraduates: Thinking With Data

Author(s): Ben Baumer

Source: The American Statistician, Vol. 69, No. 4, Special Issue on Statistics and the
Undergraduate Curriculum (NOVEMBER 2015), pp. 334-342
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association
Stable URL: https://www.jstor.org/stable/24592135
Accessed: 04-06-2024 12:09 +00:00


This content downloaded from 194.57.109.68 on Tue, 04 Jun 2024 12:09:14 +00:00
All use subject to https://about.jstor.org/terms
Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/TAS

A Data Science Course for Undergraduates: Thinking With Data

Ben Baumer

Data science is an emerging interdisciplinary field that combines elements of mathematics, statistics, computer science, and knowledge in a particular application domain for the purpose of extracting meaningful information from the increasingly sophisticated array of data available in many settings. These data tend to be nontraditional, in the sense that they are often live, large, complex, and/or messy. A first course in statistics at the undergraduate level typically introduces students to a variety of techniques to analyze small, neat, and clean datasets. However, whether they pursue more formal training in statistics or not, many of these students will end up working with data that are considerably more complex, and will need facility with statistical computing techniques. More importantly, these students require a framework for thinking structurally about data. We describe an undergraduate course in a liberal arts environment that provides students with the tools necessary to apply data science. The course emphasizes modern, practical, and useful skills that cover the full data analysis spectrum, from asking an interesting question to acquiring, managing, manipulating, processing, querying, analyzing, and visualizing data, as well as communicating findings in written, graphical, and oral forms. Supplementary materials for this article are available online.

KEY WORDS: Computational statistics; Data science; Data visualization; Data wrangling; Machine learning; Statistical computing; Undergraduate curriculum.

1. INTRODUCTION

The last decade has brought considerable attention to the field of statistics, as undergraduate enrollments have swollen across the country. Fueling the interest in statistics is the proliferation of data being generated by scientists, large Internet companies, and electronic devices of all shapes and sizes. There is widespread acknowledgment, coming naturally from scientists but also from CEOs and government officials, that these data could be useful for informing decisions. Accordingly, the job market for people who can translate these data into actionable information is very strong, and there is evidence that demand for this type of labor far exceeds supply (Harris et al. 2014). By all accounts,

students are eager to develop their ability to analyze d
are wisely investing in these skills. But while this data onslaught has strengthened inte
statistics, it has also brought challenges. Modern data are importantly different than the data with which ma
ticians, and in turn many statistics students, are accus
working. For example, the typical dataset a student encou
an introductory statistics course consists of a several do
and three or four columns of noncollinear variables, co
from a simple random sample or a randomized trial. T
data that are likely to meet the conditions necessary for s
inference in a multiple regression model. From a peda
point of view, this makes both the students and the in
happy, because the data fit the model, and thus we can pr
apply the techniques we have learned to draw meanin
clusions. However, the data that many of our current
will be asked to analyze—especially if they go into gov
or industry—will not be so neat and tidy. Indeed, these
not likely to come from an experiment—they are mu
likely to be observational. Second, they will not likely
in a two-dimensional row-and-column format—they m
stored in a database, or a structured text document (e.g
or come from more than one source with no obvious co
identifier, or worse, have no structure at all (e.g., data
from the web). These data might not exist at a fixed m
time, but rather be part of a live stream (e.g., Twitter
data might not even be numerical, but rather consist
images, or video. Finally, these data may consist of so
observations that many traditional inferential techniqu
not make sense to use, or even be computationally feas
In 2009, Hal Varian, chief economist at Google, de
statistician as the "sexy job in the next 10 years" (Loh
Yet by 2012, the Harvard Business Review used simila
to declare data scientist as the "sexiest job of the
tury" (Davenport and Patil 2012). Speaking at the 2013
Statistical Meetings, Nate Silver—as always—helpe
ravel what had happened. He noted that "data scientist
sexed up term for a statistician." If Silver is right, then t
tics curriculum needs to be updated to include topics
currently more closely associated with data science th
statistics (e.g., data visualization, database querying, da
gling, algorithmic concerns about computational tech
(the Wikipedia defines "data science" as "the extra
knowledge from data," whereas "statistics" is "the stud
collection, analysis, interpretation, presentation, and
tion of data." Does writing an SQL query belong to
Statisticians and data scientists share a common goal: to use data appropriately to inform decision-making.

Ben Baumer is Assistant Professor, Statistical and Data Sciences, Smith College, Northampton, MA 01063 (E-mail: [email protected]). Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/tas.

What we describe in this article is a course at a liberal arts college in data science that is atypical within the current statistics curriculum. Nevertheless, what we present here is wholly consistent with the vision for the future of the undergraduate statistics curriculum articulated by Horton (2015) and the ASA Undergraduate Guidelines Workgroup (2014). The purpose of this course is to prepare students to work with these modern data streams as described above. Some of the topics covered in this course have historically been the purview of computer science. But while the course we describe indisputably contains elements of statistics and computer science, it just as indisputably belongs exclusively to neither discipline. Furthermore, it is not simply a collection of topics from existing courses in statistics and computer science, but rather an integrated presentation of something more holistic. Nevertheless, the course consists of a series of largely independent modules, each of which could be expanded into stand-alone curricular elements (e.g., a full-semester, half-semester, or interterm course). Thus, while some readers might view this article as a specification for a single new offering, others might use it as a blueprint for a significant expansion of the existing statistics curriculum.

2. BACKGROUND AND RELATED WORK

While many believe that to understand statistical theory, a solid foundation in mathematics is necessary, it seems clear that computing skills are required for one to become a functional, practicing statistician. In making this analogy, Nolan and Temple Lang (2010) argued strongly for a larger presence for computing in the statistics curriculum. Citing their work, the American Statistical Association Undergraduate Guidelines Workgroup (2014) underscores the importance of computing skills (even using the words "data science") in undergraduate majors in statistical science. Here, by computing, we mean statistical programing in an environment such as R. It is important to recognize this as a distinct (and more valuable) skill than being able to perform statistical computations in a menu-and-click environment such as Minitab. Indeed, Nolan and Temple Lang (2010) went even further, advocating for the importance of teaching general command-line programs, such as grep (for regular expressions) and other common UNIX commands that really have nothing to do with statistics, per se, but are very useful for cleaning and manipulating documents of many types.

Although practicing statisticians seem to largely agree that the lion's share of the time spent on many projects is devoted to data cleaning and manipulation (or data wrangling, as it is often called; Kandel et al. 2011), the motivation for adding these skills to the statistics curriculum is not simply convenience, nor should a lack of skills or interest on the part of instructors stand in the way. Finzer (2013) described a "data habit of mind... that grows out of working with data." (This is not to be confused with "statistical thinking" as articulated by Chance (2002), which contains no mention of computing.) In this case, a data habit of mind comes from experience working with data, and is manifest in people who start thinking about data formatting before data are collected (Zhu et al. 2013), and have foresight about how data should be stored that is informed by how they will be analyzed.

Furthermore, while some might view data managem
as a perfunctory skill on intellectual par with data entry,
are others thinking more broadly about data. (Examples of
data management abound, but one of the most common is fa
to separate the actual data from their analysis. Microsoft
is a particular villain in this arena, where merged cells, roun
induced by formatted columns, and recomputed formula
result in the ultimate disaster: losing the original recorded d
Just as Wilkinson (2006) brought structure to graphics thr
"grammar," Wickham (2014) and Wickham and Francois
brought structure to data manipulation through the five "ve
select, filter, mutate, arrange, and summarise. These com
single-table data manipulation operations are the practic
scendents of theoretical work on data structures by comp
scientists who developed notions of normal forms, relat
algebras, and database management systems.
While the emphasis on computing within the stat
tics curriculum may be growing, it belongs to a lar
more gradual evolution in statistics education toward
analysis—with computers—and encourages us to refle
shifting boundaries between statistics and computer scie
Moore (1998)—viewing statistics as an ongoing quest to "
son about data, variation, and chance"—saw statistical thin
as a powerful anchor that can prevent statistics from being
whelmed by technology." Cobb (2011) argued for an incr
emphasis on conceptual topics in statistics, but also saw th
velopment of statistical theory as an anachronistic conseq
of a lack of computing power (Cobb 2007). Moreover, wh
much of statistical theory is designed to make the stronge
ference possible from what are historically scarce data, w
now often challenged trying to draw meaningful conclu
from an abundance of data.

Breiman (2001) articulated the distinction between "statistical data models" and "algorithmic models" that in many ways characterizes the relationship between statistics and machine learning, viewing the former as being far more limited than the latter. And while machine learning and data mining have traditionally been subfields of computer science, Finzer (2013) noted that data science does not have a natural home within traditional departments, belonging exclusively to neither mathematics, statistics, nor computer science. Indeed, in Cleveland's (2001) seminal action plan for data science, he envisions data science as a "partnership" between statisticians (i.e., data analysts) and computer scientists.

3. THE COURSE

In this article, we describe an experimental course—c
"Data Science" (a.k.a., SDS 292) and now offered throug
Statistical & Data Sciences Program at Smith College—th
offered in the Fall of 2013 and again in the Fall of 2014.
first year, 18 students completed the course, as did anothe
the following year. The prerequisites are an introductor
tics course and some programing experience. Existing
at the University of California at Berkeley, as well as Ma
and St. Olaf Colleges, are the pedagogical cousins of SD
(see Hardin et al. 2015 for a comprehensive comparison

The American Statistician, November 2015, Vol. 69, No. 4

SDS 292 is organized into a series of 2-3 week modules: data visualization, data manipulation/data wrangling, computational statistics, machine/statistical learning, and additional topics. In what follows we provide greater detail on each of these modules.

Learning Outcomes. In Figure 1, we present a schematic of a modern statistical analysis process, from formulating a question to obtaining an answer. In the introductory statistics course, we teach a streamlined version of this process, wherein challenges with the data, computational methods, and visualization and presentation are typically elided. The entire process informs the material presented in the data science course. The goal is to produce students who have confidence and foundational skills (not necessarily expertise) to tackle each step in this modern data analysis cycle, both immediately and in their future careers.

Figure 1. Schematic of the modern statistical analysis process. The introductory statistics course (and in many cases, the undergraduate statistics curriculum) emphasizes the central column. In this data science course, we provide instruction into the bubbles to the left and right.

3.1 Day One

The first class provides an important opportunity to hook students into data science. Since most students do not have a firm grasp of what data science is, and in particular, how it differs from statistics, the diagram in Figure 1 can help draw these distinctions. The goal is to illustrate the richness and vibrance of data science, and emphasize its breadth by highlighting the different skills necessary for each task. Students should be sure within the first 5 min of the semester that there is something interesting and useful for them to learn in the course.

Next, we engage students immediately by exposing them to a recent, relevant example of data science. Students are asked to read a provocative article by DiGrazia et al. (2013) and a rather ambitious editorial in The Washington Post written by Rojas (2013), a sociologist, in which he claims that Twitter will put political pollsters out of work. (See Finger and Dutta 2014 for a counterpoint.)

This represents a typical data science research proj
The data being analyzed were scraped from the
collected from a survey or clinical trial. Typica
assumptions about random sampling or random
are clearly not met.
The research question was addressed by combin
knowledge (i.e., knowledge of how Congressiona
with a data source (Twitter) that had no obvious
one another.

A large amount of data (500 million tweets!) was collected (although only 500,000 tweets were analyzed), so large that the data themselves were a challenge to manage. In this case, the data were big enough that the Center for Complex Networks and Systems Research at Indiana University was enlisted to assist.

The project was undertaken by a team of researchers from multiple fields (i.e., sociology, computing) working in different departments who brought complementary skills to bear on the problem, a paradigm that many consider to be optimal (Patil 2011).

Students are then asked to pair up and critically review the article. The major findings reported by the authors stem from the interpretation of two scatterplots and two multiple regression models, both of which are accessible to students who have had an introductory statistics course. There are several potential weaknesses in both the plots presented in the article (Linkins 2013; Gelman 2013), and the interpretation of the coefficients in the multiple regression model, which some students will identify. The exercise serves to refresh students' memories about statistical thinking, encourages them to think critically about the display of data, and illustrates the potential hazards of drawing conclusions from data in the absence of a statistician. Instructors could also use this discussion as a segue to topics in experimental design, or introduce the ASA's Ethical Guidelines for Statistical Practice (Committee on Professional Ethics 1999).

Finally, students are asked directly how they would go about replicating this study. That is, they are asked to identify all of the steps necessary to conduct this study, from collecting the data to writing a report, and to think about whether they could accomplish this with their current skills and knowledge. While students are able to generate many of the steps as a broad outline, most are unfamiliar with the practical considerations necessary. For example, students recognize that the data must be downloaded from Twitter, but few have any idea how to do that. This leads to the concept of an application programing interface (API), which is provided by Twitter (and can be used in several environments, notably R and Python). Moreover, most students do not recognize the potential difficulties of storing 500 million tweets. How big is a tweet? Where and how could you store them? Spatial concerns also arise: does it matter in which Congressional district the person who tweeted was? Most students in the class have experience with R, and thus are comfortable building a regression model and overlaying it on a scatterplot. But few have considered anything beyond the default plotting options. How do you add annotations to the plot to make it more understandable? What principles of data graphic design

336 Statistics and the Undergraduate Curriculum

would help to determine which annotations are necessary or appropriate?

Students are then advised that this course will give them the tools necessary to carry out a similar study. This will involve improving their skills with programing, data management, data visualization, and statistical computing. The goal is to leave students feeling energized, but open to exploring their newly acquired, more complex understanding of data science.

3.2 Data Visualization

From the first day of class, students are reminded that statistical work is of limited value unless it can be communicated to nonstatisticians (Swires-Hennessy 2014). More specifically, most data scientists working in government or industry (as opposed to those in academia) will work for a boss who generally possesses less technical knowledge than the employee. A perfect, but complicated, statistical model may not be persuasive to nonstatisticians if it cannot be communicated clearly. Data graphics provide a mechanism for illustrating relationships among data, but most students have never been exposed to structured ideas about how to create effective data graphics.

In SDS 292, the first 2 weeks of class are devoted to data visualization. This serves two purposes: (1) it is an engaging hook for a science course and (2) it gives students with weaker programing backgrounds a chance to get comfortable in R. Students read the classic text of Tufte (1983) in its entirety, as well as excerpts from Yau (2013). The former provides a wonderfully cantankerous account of what not to do when creating data graphics, as well as thoughtful analyses of how data graphics should be constructed. Students take delight in critiquing data graphics that they find online through the lens crafted by Tufte. The latter text, along with Yau (2011), provides many examples of interesting data visualizations that can be used in the beginning of class to inspire students to think broadly about what can be done with data (e.g., data art). Moreover, it provides a well-structured taxonomy for composing data graphics that gives students an orientation into data graphic design. For example, a data graphic that uses color as a visual cue in a Cartesian coordinate system is what we commonly call a "heat map." Students are also exposed to the hierarchy of visual perception that stems from work by Cleveland (2001).

Homework questions from this part of the course focus on demonstrating understanding by critiquing data graphics found "in the wild," an exercise that builds confidence (i.e., "Geez, I already know more about data visualization than this guy...") and encourages critical thinking. Computational assignments introduce students to some of the less trivial aspects of annotating data graphics in R (e.g., adding textual annotations and manipulating colors, scales, legends, etc.). We discuss additional topics in data visualization in Section 3.6.

3.3 Data Manipulation/Data Wrangling

As noted earlier, it is a common refrain among statisticians that "cleaning and manipulating the data" takes up an overwhelming majority of the time spent on a statistical project. In the introductory class, we do everything we can to shield students from this reality, exposing them only to carefully curated datasets. By contrast, in SDS 292 students are expected to master a variety of common data manipulation techniques. The term data management has a boring, IT connotation, but there is a growing acknowledgment that such data wrangling, or data manipulation, skills are not only valuable, but in fact belong to a broader intellectual discipline (Wickham 2014; Horton, Baumer, and Wickham 2015). One of the primary goals of SDS 292 is to develop students' capacity to "think with data" (Nolan and Temple Lang 2010), in both a practical and theoretical sense.

Over the next 3 weeks, students are given rapid instruction in data manipulation in R and SQL. In the spirit of the data manipulation "verbs" advocated by Wickham and Francois (2015), students learn how to perform the most fundamental data operations in both R and SQL, and are asked to think about their connection.

• select: subset variables (SELECT in SQL, select() in R (dplyr))
• filter: subset rows (WHERE, HAVING in SQL, filter() in R)
• mutate: add new columns (... AS ... in SQL, mutate() in R)
• summarise: reduce to a single row (GROUP BY in SQL, summarise(group_by()) in R)
• arrange: reorder the rows (ORDER BY in SQL, arrange() in R)

By the end, students are able to see that an SQL query containing

    SELECT ... FROM a JOIN b WHERE ... GROUP BY ... HAVING ... ORDER BY ...

is equivalent to a chain of R commands involving

    a %>%
      select(...) %>%
      filter(...) %>%
      inner_join(b, ...) %>%
      group_by(...) %>%
      summarise(...) %>%
      filter(...) %>%
      arrange(...)

A summary of analogous R and SQL syntax is shown in Table 1.

Moreover, students learn to determine for themselves, based on the attributes of the data (most notably size), which tool is more appropriate for the type of analysis they wish to perform. They learn that R stores data in memory, so that the size of the

Table 1. Conceptually analogous SQL and R commands. Suppose a and b are SQL tables or R data.frames

Concept | SQL | R (dplyr)
Filter by rows and columns | SELECT col1, col2 FROM a WHERE col3 = 'x' | select(filter(a, col3 == 'x'), col1, col2)
Aggregate by rows | SELECT id, sum(col1) as total FROM a GROUP BY id | summarise(group_by(a, id), total = sum(col1))
Merge two tables | SELECT * FROM a JOIN b ON a.id = b.id | inner_join(x = a, y = b, by = 'id')
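
The correspondence summarized in Table 1 can be sketched end to end. The block below is a minimal illustration in Python, not the course's R code: it runs one query of the SELECT ... JOIN ... WHERE ... GROUP BY ... HAVING ... ORDER BY form through the standard-library sqlite3 module, then repeats the same steps as a chain of plain list operations labeled with the corresponding verbs. The tables a and b, their columns (id, grp, value, weight), and the thresholds are hypothetical, chosen only for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy tables standing in for `a` and `b`.
A_ROWS = [  # (id, grp, value)
    (1, "x", 10.0),
    (2, "x", 5.0),
    (3, "y", 7.0),
    (4, "y", 1.0),
    (5, "z", 2.0),
]
B_ROWS = [(1, 2), (2, 1), (3, 3), (4, 1), (5, 1)]  # (id, weight)


def with_sql():
    """Run the whole analysis as a single SQL query on an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (id INTEGER, grp TEXT, value REAL)")
    con.execute("CREATE TABLE b (id INTEGER, weight INTEGER)")
    con.executemany("INSERT INTO a VALUES (?, ?, ?)", A_ROWS)
    con.executemany("INSERT INTO b VALUES (?, ?)", B_ROWS)
    query = """
        SELECT a.grp, SUM(a.value * b.weight) AS total
        FROM a JOIN b ON a.id = b.id
        WHERE a.value > 1
        GROUP BY a.grp
        HAVING SUM(a.value * b.weight) > 5
        ORDER BY total DESC
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


def with_verbs():
    """Repeat the same analysis as a chain of single-purpose steps."""
    weights = dict(B_ROWS)
    kept = [row for row in A_ROWS if row[2] > 1]                 # filter (WHERE)
    joined = [(g, v, weights[i]) for i, g, v in kept
              if i in weights]                                   # inner_join (JOIN ... ON)
    totals = defaultdict(float)
    for grp, value, weight in joined:                            # group_by + summarise
        totals[grp] += value * weight
    large = [(g, t) for g, t in totals.items() if t > 5]         # filter (HAVING)
    return sorted(large, key=lambda gt: gt[1], reverse=True)     # arrange (ORDER BY)


if __name__ == "__main__":
    print(with_sql())
    print(with_verbs())
```

Both functions return the same rows. The SQL version states the result declaratively and leaves the execution order to the database, while the verb chain spells out one operation per step, which is the connection between the two languages that students are asked to think about.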

data with which you wish to work is limited by the amount of memory available to the computer, whereas SQL stores data on disk, and is thus much better suited for storing large amounts of data. However, students learn to appreciate the virtually limitless array of operations that can be performed on data in R, whereas the number of useful computational functions in SQL is limited. Thus, students learn to make choices about software in the context of hardware, and of data.

Care must be taken to make sure that what students are learning at this stage of the course is not purely programing syntax (although that is a desired side effect). Rather, they are learning more generally about operations that can be performed on data, in two languages. To reinforce this, students are asked to think about a physical representation of what these operations do. For example, Figure 2 illustrates conceptually what happens when row filtering is performed on a data.frame in R or a table in SQL. Less trivially, Figure 3 illustrates the useful gather operation in R.

Figure 2. The filter operation.

Figure 3. The gather operation.

3.4 Computational Statistics

Now that students have the intellectual and practical tools to work with data and visualize them, the third part of the course provides students with computational statistical methods for analyzing data in the interest of answering a statistical question. There are two major objectives for this section of the course:

1. Developing facility constructing interval estimates using resampling techniques (e.g., the bootstrap). Understanding the nature of variation in observational data and the benefit of presenting interval estimates over point estimates.
2. Developing the capacity to fit and assess regression models, beginning with simple linear regression (which all students should have already seen in their intro course), but continuing to include multiple and logistic regression, and a few techniques for automated feature selection.

The first objective underlines the statistical element
course, encouraging students to put observations in
context by demonstrating an understanding of variation
data. The second objective, while not a substitute for a
course in regression analysis, helps reinforce a practic
standing of regression, and sets the stage for the su
machine learning portion of the course.

3.5 Machine Learning

Two weeks are devoted to introductory topics in m
learning. Some instructors may find that this portio
course overlaps too heavily with existing offerings in
science or applied statistics. Others might argue that the
will not be of interest to students who are primarily
in the communication and visualization side of data science.

However, a brief introduction to machine learning gives students a functional framework for testing algorithmic models. Assignments force them to grapple with the limitations of large datasets, and pursue statistical techniques that are beyond introductory.

To appreciate machine learning, one must recognize the differences between the mindset of the data miner and the statistician. Breiman (2001) distinguished two types of models f for y, the response variable, in terms of x, a vector of explanatory variables. One might consider a data model f such that y ~ f(x). If it can be determined that f is a reasonable approximation of the real-world process by which y was generated from x, then we can proceed to make inferences about f. The goal is to learn about that unknown real process, and the conceit is that f is a meaningful reflection of it. Alternatively, one might construct an algorithmic model f, such that y ~ f(x), and use f to predict unobserved values of y. If it can be determined that f does in fact do a good job of predicting values of y, one might not care to learn much about f. In the former case, since we want to learn about f, a simpler model may be preferred. Conversely, in the latter case, since we want to predict new values of y, we may be indifferent to model complexity (other than concerns about overfitting and scalability).

These are very different perspectives to take toward learning from data, so after reinforcing the former perspective that students learned in their introductory course, SDS 292 students are exposed to the latter point of view. These ideas are further explored in a class discussion about Chris Anderson's famous article on The End of Theory (Anderson 2008), in which he argues that the abundance of data and computing power will eliminate the need for scientific modeling.

The notions of cross-validation and the "confusion" matrix frame the machine learning unit (receiver operating characteristic (ROC) curves are also presented as an evaluation technique). The goal is typically to predict the outcome of a binary response variable. Once students understand that these predictions can be evaluated using a confusion matrix, and that models can be tested via cross-validation schemes, the rest of the unit is spent learning classification techniques. The following techniques are presented, mainly at a conceptual and practical level: decision/classification trees, random forests, k-nearest neighbor, naïve Bayes, artificial neural networks, and ensemble methods.
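
The two evaluation tools that frame this unit, cross-validation and the confusion matrix, can be sketched in a few lines. The block below is a minimal illustration in Python rather than the R used in the course. It pairs a small k-nearest neighbor classifier (one of the techniques listed above) with a k-fold cross-validation loop and tallies the held-out predictions in a confusion matrix. The toy data, the function names, and the choice to assign folds by index (real analyses would usually shuffle first) are all assumptions made for illustration.

```python
# Hypothetical one-dimensional toy data with a binary response.
TOY_X = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
TOY_Y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def confusion_matrix(actual, predicted):
    """Count outcomes keyed by (actual, predicted) for labels 0 and 1."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def accuracy(counts):
    """Share of predictions on the diagonal of the confusion matrix."""
    return (counts[(0, 0)] + counts[(1, 1)]) / sum(counts.values())


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0


def cross_validate(xs, ys, folds=5, k=3):
    """Predict each point using only the data outside its own fold."""
    predicted = [None] * len(xs)
    for fold in range(folds):
        # Simplification: fold membership is the index modulo `folds`.
        test_idx = [i for i in range(len(xs)) if i % folds == fold]
        train_idx = [i for i in range(len(xs)) if i % folds != fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            predicted[i] = knn_predict(train_x, train_y, xs[i], k)
    return confusion_matrix(ys, predicted)


if __name__ == "__main__":
    cm = cross_validate(TOY_X, TOY_Y)
    print(cm, accuracy(cm))
```

Because every prediction is made for a point the classifier did not see during fitting, the resulting confusion matrix estimates out-of-sample performance, which is the quantity an algorithmic model is judged on.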

One of the most satisfying aspects of this unit is that students can tackle a massive dataset. Past instances of the KDD Cup (http://www.sigkdd.org/kddcup/index.php) are an excellent source for such examples. We explore data from the 2008 KDD Cup on breast cancer. Each of the n observations contains digitized data from an X-ray image of a breast. Each observation corresponds to a small area of a particular breast, which may or may not depict a malignant tumor; this provides the binary response variable. In addition to a handful of well-defined variables ((x, y)-location, etc.), each observation has 117 nameless attributes, about which no information is provided. Knowing nothing about what these variables mean, students recognize the need to employ machine learning techniques to sift through them and find relationships. The size of the data and number of variables make manual exploration of the data impractical.

Students are asked to take part in a multi-stage machine learning "exam" (Cohen and Henle 1995) on this breast cancer data. In the first stage, students are given several days to work alone and try to find the best logistic regression model that fits the data. In the second stage, students form groups of three, discuss the strengths and weaknesses of their respective models, and then build a classifier, using any means available to them, that best fits the data. (These classifiers are ultimately evaluated on as yet unseen data.) The third stage of the exam is a traditional in-class exam.

3.6 Additional Topics

As outlined above, data visualization, data manipulation, computational statistics, and machine learning are the four pillars of this data science course. However, additional content can be layered in at the instructor's discretion. We list a few such topics below. Greater detail is provided in our supplementary materials.

• Spatial analysis: creating appropriate and meaningful graphical displays for data that contain geographic coordinates.
• Text mining and regular expressions: learning how to use regular expressions to produce data from large text documents.
• Data expo: exposing students to the questions and challenges that people outside the classroom face with their own data.
• Network science: developing methods for data that exist in a network setting (i.e., on a graph).
• Big data: illustrating the next frontier for working with data that are truly large scale.

4. COMPUTING

Practical, functional programming, and computational abilities are essential for a data scientist, and as such no attempt is made to shield students from the burden of writing their own code. Copious examples are given, and detailed lecture notes containing annotated computations in R are disseminated each class. Lectures jump between illustrating concepts on the blackboard and writing code on the computer projected overhead, and students are expected to bring their laptops to class each day and participate actively. While it is true that many of the students struggle with the programming aspect of the course, even those that do express enthusiasm and satisfaction as they become more comfortable. Newly focused on becoming data scientists, several students will go on to take subsequent courses on data structures or algorithms offered by the computer science department.

In this course, programming occurs exclusively in R and SQL. Others may assert that Python is also necessary, and future incarnations of this course may include more Python. In my view these are the three must-have languages for data science. (SQL is a mature technology that is widely used, but useful for a specific purpose. R is a flexible, extensible platform that is specifically designed for statistical computing, and represents the current state of the art. Python has become something of a lingua franca, capable of performing many of the data analysis operations otherwise done in R, but also being a full-fledged general purpose programming language with lots of supporting packages and documentation. At Smith, all introductory computer science students learn Python, and all introductory statistics students in the statistical and data sciences program learn R. However, it is not clear yet how large the intersection of these two groups is. It is probably easier for those who know Python to learn R than it is for those who know R to learn Python, and thus the decision was made in this instance to avoid Python and focus on R. Other instructors may make different choices without disruption.)

4.1 A Note to Prospective Instructors

Several people familiar with this course have asked about the skills required to teach it. From my point of view the most important thing is to have the same willingness to learn new things that you ask of your students. In terms of the content, a deep knowledge of all subjects is not required, although comfort and troubleshooting ability with R is necessary. Students are willing to accept a certain amount of frustration that goes hand-in-hand with learning a new programming language, but when they encounter roadblocks that seem immovable, that frustration can mutate into helplessness. The instructor must provide support mechanisms to avoid this; student teaching assistants and office hours can be especially helpful.

Even without prior knowledge, enough of the material on data visualization and machine learning can be absorbed in a relatively short period of time by reading a few of the books cited. SQL has many subtleties, but most are not likely to come up in this course, and the basics are not difficult to learn, even via online tutorials and self-study. Here again, some experience and practice are important.

For students, prior programming experience is essential. Experience with R is not required, and in my experience, computer science majors with weaker statistical backgrounds usually fare better than students with stronger statistical backgrounds but less programming experience. This is a demanding course that requires most students to spend a substantial amount of time working through assignments. However, even students who struggle are so convinced that what they are learning is useful that there are few serious complaints. Nevertheless one could certainly experiment with slowing down the pace of the course.

5. ASSIGNMENTS

Reading assignments in SDS 292 are culled from a variety of textbooks and articles available for free online. Nontrivial sections are assigned from a number of texts (Tan, Steinbach, and Kumar 2006; Murrell 2010; Rajaraman and Ullman 2011; James et al. 2013). (Please see our supplementary materials for more information.)

Concepts from the readings are developed further during the lecture periods in conjunction with implementations demonstrated in R. Homework, consisting of conceptual questions requiring written responses as well as computational questions requiring coding in R, is due approximately every 2 weeks. Two exams are given, both of which have in-class and take-home components. The first exam is given after the first two modules, and focuses on data visualization and data manipulation principles demonstrated in written form. The second exam unfolds over 2 weeks, and focuses on the challenging breast cancer classification problem discussed above. An open-ended project (described below) brings the semester to a close. More details on these assignments, including sample questions, are presented in our supplementary materials.

Project. The culmination of the course is an open-ended term project that students complete in groups of three. Only three conditions are given:

1. Your project must be centered around data
2. Your project must tell us something
3. To get an A, you must show something beyond what we have done in class

Just like in other statistics courses, the project is segmented so that each group submits a proposal that has to be approved before the group proceeds (Halvorsen and Moore 2001). The final deliverable is a 10 min in-class presentation as well as a written "blog post" crafted in R Markdown (Allaire et al. 2015). Examples of successful projects are presented in the supplementary materials.

6. EPILOG

The feedback that I have received on this course, through informal and formal evaluations, has been nearly universally positive. In particular, the 42 students (mostly from Smith but also including five students from three nearby colleges) seemed convinced that they learned "useful things." More specific feedback is available in the supplementary materials.

Several of these students were able to channel these useful skills into their careers almost immediately. Internships and job offers followed in the spring for a handful of students: two students spent their summers at NIST (one of whom later accepted a full-time job offer from MIT's Lincoln Laboratory; the other is headed to the Ph.D. program in statistics at Berkeley and is a trainee in the NSF-funded "Environment and Society: Data Science for the 21st Century" research program), one student landed a job as a research analyst at the nonprofit research organization MDRC, and three students have joined the new Data Science Development Program at MassMutual. External validation also came during the ASA Five College DataFest, the local version of the national data analysis competition (Gould et al. 2014). ASA DataFest is an open-ended data analysis competition that challenges students working in teams of up to five to develop insights from a difficult dataset. In each of the last 2 years, a team of five students from Smith, four of whom had taken this course, won the Best In Show prize (one student was a member of both teams). In the first year in particular, skills developed in the course helped these students perform data manipulation tasks with considerably less difficulty than other groups. For example, each observation in this particular dataset included a date field, but the values were encoded as strings of text. Most groups struggled to work sensibly with these data, as familiar workflows were infeasible (e.g., the data was too large to open in Excel, so "Format Cells..." was not a viable solution). The winning group was able to quickly tokenize these strings in R, and, having cleared this hurdle, had more time to spend on their analysis and interpretation.

7. DISCUSSION

It is clear that the popularity of data science has brought both opportunities and challenges to the statistics profession. While statisticians are openly grappling with questions about the relationship of our field to data science (Bartlett 2013; Davidian 2013a, 2013b; Franck 2013; Horton 2015; Wasserstein 2015), there appears to be less conflict among computer scientists, who (rightly or wrongly) distinguish data science from statistics on the basis of the heterogeneity and lack of [...] of the data with which data scientists, as opposed to statisticians, work (Dhar 2013). As Big Data (which is clearly related to, but too often conflated with, data science) is often associated with computer science, computer scientists tend to have an [...] attitude toward data science.

A popular joke is that "a data scientist is a statistician who lives in San Francisco," but Hadley Wickham (2012), a Ph.D. statistician, floated a more cynical take on Twitter: "a data scientist is a statistician who is useful." Statisticians are the guardians of statistical inference, and it is our responsibility to educate practitioners about using models appropriately, and the hazards of ignoring model assumptions when making inferences. But many model assumptions are only truly met under idealized conditions, and thus, as Box (1979) eloquently argued, one must think carefully about when statistical inferences are valid. When they are not, statisticians are caught in the awkward position, as Wickham suggests, of always saying "no." This position can be dissatisfying.

If data science represents the new reality for data analysis, then there is a real risk to the field of statistics if we fail to embrace it. The damage could come on two fronts: first, we lose data science and all of the students who are interested in it to computer science; and second, the world will become populated by data analysts who do not fully understand or appreciate the importance of statistics. While the former blow would be damaging, the latter could be catastrophic, and not just for our profession. Conversely, while the potential that data science is a fad certainly exists, it seems less likely each day. It is hard to imagine waking up to a future in which decision-makers are not

interested in what data (however they may have been collected and however they may be structured) can offer them.

Data science courses like the one described in this article provide a mechanism to develop students' abilities to work with modern data, and these skills are quickly transitioning from desirable to necessary.

SUPPLEMENTARY MATERIALS

Please see our supplementary materials for more detail about the additional topics listed in Section 3.6, two successful class projects alluded to in Section 5, course feedback summarized in Section 6, as well as three sample exam questions.

[Received June 2014. Revised July 2015.]

REFERENCES

Allaire, J., Horner, J., Marti, V., and Porte, N. (2015), markdown: Markdown Rendering for R. R package version 0.7.7. Available at http://CRAN.R-project.org/package=markdown. [340]

American Statistical Association Undergraduate Guidelines Workgroup (2014), "2014 Curriculum Guidelines for Undergraduate Programs in Statistical Science." Available at http://www.amstat.org/education/curriculumguidelines.cfm [335]

Anderson, C. (2008), "The End of Theory," Wired. Available at http://www.wired.com/science/discoveries/magazine/16-07/pb_theory [338]

Bartlett, R. (2013), "We Are Data Science," AMSTAT News. Available at http://magazine.amstat.org/blog/2013/10/01/we-are-data-science/ [340]

Box, G. E. (1979), "Some Problems of Statistics and Everyday Life," Journal of the American Statistical Association, 74, 1-4. [340]

Breiman, L. (2001), "Statistical Modeling: The Two Cultures," Statistical Science, 16, 199-215. Available at http://www.jstor.org/stable/2676681. [335,338]

Chance, B. L. (2002), "Components of Statistical Thinking and Implications for Instruction and Assessment," Journal of Statistics Education, 10. [335]

Cleveland, W. S. (2001), "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," International Statistical Review, 69, 21-26. Available at http://www.jstor.org/stable/1403527. [335,337]

Cobb, G. W. (2007), "The Introductory Statistics Course: A Ptolemaic Curriculum?" Technology Innovations in Statistics Education, 1, 1-15. Available at http://escholarship.org/uc/item/6hb3k0nz. [335]

Cobb, G. W. (2011), "Teaching Statistics: Some Important Tensions," Chilean Journal of Statistics, 2, 31-62. Available at http://chjs.deuv.cl/Vol2N1/ChJS-02-01-03.pdf. [335]

Cohen, D., and Henle, J. (1995), "The Pyramid Exam," Undergraduate Mathematics Education Trends, 7, 2. [339]

Committee on Professional Ethics (1999), "Ethical Guidelines for Statistical Practice." Available at http://www.amstat.org/about/ethicalguidelines.cfm, last accessed: 2015-05-19. [336]

Davenport, T. H., and Patil, D. (2012), "Data Scientist: The Sexiest Job of the 21st Century," Harvard Business Review. Available at http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1 [334]

Davidian, M. (2013a), "Aren't We Data Science?" AMSTAT News. Available at http://magazine.amstat.org/blog/2013/07/01/datascience/ [340]

Davidian, M. (2013b), "The ASA and Big Data," AMSTAT News. Available at http://magazine.amstat.org/blog/2013/06/01/the-asa-and-big-data/ [340]

Dhar, V. (2013), "Data Science and Prediction," Communications of the ACM, 56, 64-73. Available at http://cacm.acm.org/magazines/201[...]science-and-prediction/fulltext. [340]

DiGrazia, J., McKelvey, K., Bollen, J., and Rojas, F. (2013), "More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior," Social Science Research Network. Available at http://ssrn.com/[...] [336]

Finger, L., and Dutta, S. (2014), Ask, Measure, Learn: Using Social Media Analytics to Understand and Influence Customer Behavior, Sebastopol, CA: O'Reilly Media, Inc. [336]

Finzer, W. (2013), "The Data Science Education Dilemma," Technology Innovations in Statistics Education, 7, 1-9. Available at http://escholarship.org/uc/item/7gv0q9dc.pdf. [335]

Franck, C. (2013), "Is Nate Silver a Statistician?" AMSTAT News. Available at http://magazine.amstat.org/blog/2013/10/01/is-nate-silver/ [340]

Gelman, A. (2013), "The Tweets-Votes Curve," available at http://andrewgelman.com/2013/04/24/the-tweets-votes-curve/ [336]

Gould, R., Baumer, B., Çetinkaya-Rundel, M., and Bray, A. (2014), "Big Data Goes to College," AMSTAT News. Available at http://magazine.amstat.org/blog/2014/06/01/datafest/ [340]

Halvorsen, K. T., and Moore, T. L. (2001), "Motivating, Monitoring, and Evaluating Student Projects," MAA Notes, pp. 27-32. [340]

Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., Murrell, P., Peng, R., Roback, P., Temple Lang, D., and Ward, M. D. (2015), "Data Science in the Statistics Curricula: Preparing Students to 'Think with Data'," The American Statistician, this issue. [335]

Harris, J. G., Shetterley, N., Alter, A. E., and Schnell, K. (2014), "It Takes Teams to Solve the Data Scientist Shortage," The Wall Street Journal. Available at http://blogs.wsj.com/cio/2014/02/14/it-takes-teams-to-solve-the-data-scientist-shortage/ [334]

Horton, N. J. (2015), "Challenges and Opportunities for Statistics and Statistical Education: Looking Back, Looking Forward," The American Statistician, 69, 138-145. [335,340]

Horton, N. J., Baumer, B. S., and Wickham, H. (2015), "Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics," CHANCE, 28, 40-50. Available at http://chance.amstat.org/2015/04/setting-the-stage/. [337]

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning, New York: Springer. Available at http://www-bcf.usc.edu/~gareth/ISL/. [340]

Kandel, S., Heer, J., Plaisant, C., Kennedy, J., van Ham, F., Riche, N. H., Weaver, C., Lee, B., Brodbeck, D., and Buono, P. (2011), "Research Directions in Data Wrangling: Visualizations and Transformations for Usable and Credible Data," Information Visualization, 10, 271-288. Available at http://research.microsoft.com/EN-US/UM/REDMOND/GROUPS/cue/infovis/. [335]

Linkins, J. (2013), "Let's Calm Down About Twitter Being Able To Predict Elections, Guys," The Huffington Post. Available at http://www.huffingtonpost.com/2013/08/14/twitter-predict-elections_n_3755326.html [336]

Lohr, S. (2009), "For Today's Graduate, Just One Word: Statistics," The New York Times. Available at http://www.nytimes.com/2009/08/06/technology/06stats.html [334]

Moore, D. S. (1998), "Statistics Among the Liberal Arts," Journal of the American Statistical Association, 93, 1253-1259. Available at http://www.jstor.org/stable/2670040. [335]

Murrell, P. (2010), Introduction to Data Technologies, Boca Raton, FL: Chapman and Hall/CRC. Available at https://www.stat.auckland.ac.nz/~paul/ItDT/. [340]

Nolan, D., and Temple Lang, D. (2010), "Computing in the Statistics Curricula," The American Statistician, 64, 97-107. Available at http://www.stat.berkeley.edu/users/statcur/Preprints/ComputingCurric3.pdf. [335,337]

Patil, D. (2011), Building Data Science Teams, Sebastopol, CA: O'Reilly Media, Inc. [336]

Rajaraman, A., and Ullman, J. D. (2011), Mining of Massive Datasets, Cambridge, UK: Cambridge University Press. Available at http://www.mmds.org/. [340]

Rojas, F. (2013), "How Twitter Can Help Predict an Election," The Washington Post. Available at http://www.washingtonpost.com/opinions/how-twitter-can-predict-an-election/2013/08/11/35ef885a-0108-11e3-96a8-d3b921c0924a_story.html [336]

Swires-Hennessy, E. (2014), Presenting Data: How to Communicate Your Message Effectively (1st ed.), West Sussex, UK: Wiley. Available at http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118489594.html. [337]

Tan, P. N., Steinbach, M., and Kumar, V. (2006), Introduction to Data Mining (1st ed.), Boston, MA: Pearson Addison-Wesley. Available at http://www-users.cs.umn.edu/~kumar/dmbook/index.php. [340]

Tufte, E. R. (1983), The Visual Display of Quantitative Information (2nd ed.), Cheshire, CT: Graphics Press. [337]

Wasserstein, R. (2015), "Communicating the Power and Impact of Our Profession: A Heads Up for the Next Executive Directors of the ASA," The American Statistician, 69, 96-99. [340]

Wickham, H. (2012), "My Cynical Definition: A Data Scientist is a Statistician Who is Useful." Available at https://twitter.com/hadleywickham/status/263750846246969344 [340]

Wickham, H. (2014), "Tidy Data," The Journal of Statistical Software, 59, 1-23. Available at http://vita.had.co.nz/papers/tidy-data.html. [335,337]

Wickham, H., and Francois, R. (2015), dplyr: A Grammar of Data Manipulation. R package version 0.4.2. Available at http://CRAN.R-project.org/package=dplyr. [335,337]

Wilkinson, L. (2006), The Grammar of Graphics, New York: Springer. [335]

Yau, N. (2011), Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, Indianapolis, IN: Wiley. [337]

Yau, N. (2013), Data Points: Visualization that Means Something, Indianapolis, IN: Wiley. [337]

Zhu, Y., Hernandez, L. M., Mueller, P., Dong, Y., and Forman, M. R. (2013), "Data Acquisition and Preprocessing in Studies on Humans: What is Not Taught in Statistics Classes?" The American Statistician, 67, 235-241. [335]
