
Week3 Cheat Sheet Exploratory Data Analysis

This document is a cheat sheet for Exploratory Data Analysis (EDA) that provides a summary of various R functions and their syntax, including 'summarize', 'group_by', 'cor', 'cor.test', 'aov', 'count', 'ggplot', and others. Each function is accompanied by a brief description and an example of its usage. The document also includes a changelog detailing updates made by different authors.

Uploaded by

moonb4115

Cheat Sheet: Exploratory Data Analysis

Command Syntax Description Example


Command: summarize()

Syntax:
    summarize(.data, ...)

Description: The summarize function reduces a data frame to a summary: one row per group, or a single row when the data is ungrouped.

Arguments:
    .data   A data frame, data frame extension (e.g. a tibble), or a lazy data frame.
    ...     Name-value pairs of summary functions. The name will be the name of the variable in the result. The value should be an expression that returns a single value like min(x), n(), or sum(is.na(y)).

Example:
    avg_delays <- sub_airline %>%
      group_by(Reporting_Airline, DayOfWeek) %>%
      summarize(mean_delays = mean(ArrDelayMinutes), .groups = 'keep')

Command: group_by()

Syntax:
    group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))

Description: The group_by function takes an existing table and converts it into a grouped table where operations are performed "by group".

Arguments:
    .data   A data frame, data frame extension (e.g. a tibble), or a lazy data frame.
    .add    When FALSE, the default, group_by() will override existing groups.
    .drop   Drop groups formed by factor levels that don't appear in the data.

Example:
    sub_airline %>%
      group_by(Reporting_Airline) %>%
      summarize(mean_delays = mean(ArrDelayMinutes))

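The two entries above can be tried without the course data. The following is a minimal sketch, assuming the dplyr package is installed; toy_flights and its values are hypothetical stand-ins for sub_airline, not the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for sub_airline: six flights, two carriers.
toy_flights <- data.frame(
  Reporting_Airline = c("AA", "AA", "AA", "AS", "AS", "AS"),
  DayOfWeek         = c(1, 1, 2, 1, 2, 2),
  ArrDelayMinutes   = c(10, 20, 0, 5, 15, 25)
)

# One summary row per airline.
by_airline <- toy_flights %>%
  group_by(Reporting_Airline) %>%
  summarize(mean_delays = mean(ArrDelayMinutes), n_flights = n())

# One summary row per airline/day pair.
# .groups = "drop" returns an ungrouped result.
by_airline_day <- toy_flights %>%
  group_by(Reporting_Airline, DayOfWeek) %>%
  summarize(mean_delays = mean(ArrDelayMinutes), .groups = "drop")

print(by_airline)
print(by_airline_day)
```

Grouping by one column yields one row per airline; grouping by two yields one row per airline/day combination present in the data.
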
Command: cor()

Syntax:
    cor(x, use=, method= )

Description: The cor function computes the correlation coefficient.

Arguments:
    x        Matrix or data frame.
    use      Specifies the handling of missing data.
    method   Specifies the type of correlation. Options are pearson, spearman or kendall.

Example:
    sub_airline %>%
      select(DepDelayMinutes, ArrDelayMinutes) %>%
      cor(method = "pearson")

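To illustrate how the use and method arguments interact, here is a small base R sketch. The delays data frame and its values are made up for illustration; only the column names follow the example above.

```r
# Hypothetical delay columns (minutes); one arrival value is missing.
delays <- data.frame(
  DepDelayMinutes = c(0, 5, 10, 20, 40, 60),
  ArrDelayMinutes = c(2, 4, 12, NA, 38, 65)
)

# Default use = "everything": the NA makes the off-diagonal entry NA.
cor_default <- cor(delays)

# use = "complete.obs" drops rows containing NA before computing.
cor_pearson  <- cor(delays, use = "complete.obs", method = "pearson")
cor_spearman <- cor(delays, use = "complete.obs", method = "spearman")

print(cor_default)
print(cor_pearson)
print(cor_spearman)
```

Spearman works on ranks, so it reports a perfect correlation for any strictly increasing relationship, while Pearson measures linear association only.
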
Command: cor.test()

Syntax:
    cor.test(x, y, alternative = c("two.sided", "less", "greater"),
             method = c("pearson", "kendall", "spearman"), exact = NULL,
             conf.level = 0.95, continuity = FALSE, ...)

Description: The cor.test function is a test for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (or p-value) of the correlation.

Arguments:
    x, y   Numeric vectors of data values. x and y must have the same length.

Example:
    sub_airline %>%
      cor.test(~DepDelayMinutes + ArrDelayMinutes, data = .)

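The object returned by cor.test() can be inspected piece by piece. The sketch below uses base R only; the delays data frame is hypothetical and simply mirrors the column names used in the example above.

```r
# Hypothetical paired delay measurements (minutes).
dep_delay <- c(0, 5, 10, 20, 40, 60, 15, 30)
arr_delay <- c(2, 4, 12, 18, 38, 65, 10, 33)
delays <- data.frame(DepDelayMinutes = dep_delay, ArrDelayMinutes = arr_delay)

# Formula interface: a one-sided formula naming both numeric columns.
res <- cor.test(~ DepDelayMinutes + ArrDelayMinutes, data = delays)

print(res$estimate)  # sample correlation coefficient
print(res$p.value)   # p-value for the null hypothesis of zero correlation
print(res$conf.int)  # confidence interval at the default conf.level = 0.95
```

The estimate is the same value cor() would return for the two vectors; the test adds the p-value and confidence interval.
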
Command: aov()

Syntax:
    aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)

Description: The aov function performs Analysis of Variance (ANOVA), a statistical method used to test whether there are significant differences between the means of two or more groups.

Arguments:
    formula   A formula specifying the model.
    data      A data frame in which the variables specified in the formula will be found. If missing, the variables are searched for in the standard way.

Example:
    aa_as_subset <- sub_airline %>%
      select(ArrDelay, Reporting_Airline) %>%
      filter(Reporting_Airline == 'AA' | Reporting_Airline == 'AS')

    ad_aov <- aov(ArrDelay ~ Reporting_Airline, data = aa_as_subset)

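Fitting the model is usually followed by summary(), which prints the ANOVA table. A minimal base R sketch, using a hypothetical aa_as_toy data frame in place of the real subset:

```r
# Hypothetical arrival delays (minutes) for two carriers, five flights each.
aa_as_toy <- data.frame(
  Reporting_Airline = rep(c("AA", "AS"), each = 5),
  ArrDelay          = c(10, 12, 9, 11, 13, 20, 22, 19, 21, 23)
)

ad_aov_toy <- aov(ArrDelay ~ Reporting_Airline, data = aa_as_toy)

# summary() returns the ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F).
aov_table <- summary(ad_aov_toy)
print(aov_table)

# Pull the F statistic and p-value for the airline effect (first table row).
f_value <- aov_table[[1]][["F value"]][1]
p_value <- aov_table[[1]][["Pr(>F)"]][1]
```

A large F value with a small p-value suggests the group means differ; with only two groups this is equivalent to a two-sample t-test with equal variances.
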
Command: count()

Syntax:
    count(df, vars = NULL, wt_var = NULL)

Description: The count function lets you quickly count the unique values of one or more variables.

Arguments:
    df     Data frame to be processed.
    vars   Variables to count unique values of.

Example:
    sub_airline %>%
      count(Reporting_Airline)

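The signature listed for count() (df, vars, wt_var) appears to match the plyr package, while the piped example with a bare column name is the style of dplyr's count(). The sketch below assumes dplyr is the package in use; toy_flights is hypothetical data.

```r
library(dplyr)

# Hypothetical flights table.
toy_flights <- data.frame(
  Reporting_Airline = c("AA", "AA", "AA", "AS", "AS", "DL"),
  DayOfWeek         = c(1, 1, 2, 1, 2, 2)
)

# Rows per airline; the count lands in a column named n.
per_airline <- toy_flights %>% count(Reporting_Airline)

# Rows per airline/day combination, largest groups first.
per_airline_day <- toy_flights %>%
  count(Reporting_Airline, DayOfWeek, sort = TRUE)

print(per_airline)
print(per_airline_day)
```

With dplyr, count(x) is shorthand for group_by(x) followed by summarize(n = n()).
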
Command: ggplot()

Syntax:
    ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())

Description: The ggplot function initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

Example:
    ggplot(aes(x = Reporting_Airline, y = DayOfWeek, fill = mean_delays))

Command: corrplot()

Syntax:
    corrplot(method=, type=,....)

Description: The corrplot function provides a visual exploratory tool on a correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.

Arguments:
    method   There are seven visualization methods (parameter method) in the corrplot package, named 'circle', 'square', 'ellipse', 'number', 'shade', 'color', 'pie'.
    type     There are three layout types (parameter type): 'full', 'upper' and 'lower'.

Example:
    corrplot(airlines_cor, method = "color", col = col(200),
             type = "upper", order = "hclust",
             addCoef.col = "black",          # Add coefficient of correlation
             tl.col = "black", tl.srt = 45   # Text label color and rotation
    )

Command: geom_bar()

Syntax:
    geom_bar(mapping = NULL, data = NULL, stat = "bin", position = "stack", ...)

Description: The geom_bar function is used to produce 1d area plots: bar charts for categorical x, and histograms for continuous y.

Example:
    ggplot(aes(x = Reporting_Airline, y = Average_Delays)) +
      geom_bar(stat = "identity") +
      ggtitle("Average Arrival Delays by Airline")

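The bar chart example above omits its input data. A self-contained sketch, assuming ggplot2 is installed and using a hypothetical avg_by_airline data frame of already-summarized values:

```r
library(ggplot2)

# Hypothetical pre-computed averages, one row per airline.
avg_by_airline <- data.frame(
  Reporting_Airline = c("AA", "AS", "DL"),
  Average_Delays    = c(12.4, 8.1, 10.3)
)

# stat = "identity" uses the y values as bar heights instead of counting rows.
bar_plot <- ggplot(avg_by_airline,
                   aes(x = Reporting_Airline, y = Average_Delays)) +
  geom_bar(stat = "identity") +
  ggtitle("Average Arrival Delays by Airline")

# Call print(bar_plot) to render the chart on the active graphics device.
```

Because the data is already aggregated, stat = "identity" is needed; without it geom_bar() would try to count rows and reject the y aesthetic.
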
Command: geom_tile()

Syntax:
    geom_tile(mapping = NULL, data = NULL, stat = "identity", position = "identity", ...)

Description: The geom_tile function tiles a plane with rectangles.

Example:
    ggplot(avg_delays, aes(x = Reporting_Airline,
                           y = lubridate::wday(DayOfWeek, label = TRUE),
                           fill = bins)) +
      geom_tile(colour = "white", size = 0.2)

Command: geom_text()

Syntax:
    geom_text(mapping = NULL, data = NULL, stat = "identity", position = "identity", parse = FALSE, ...)

Description: The geom_text function is used for text annotation.

Example:
    ggplot(avg_delays, aes(x = Reporting_Airline,
                           y = lubridate::wday(DayOfWeek, label = TRUE),
                           fill = bins)) +
      geom_tile(colour = "white", size = 0.2) +
      geom_text(aes(label = round(mean_delays, 3)))

Command: labs()

Syntax:
    labs(...)

Description: The labs function changes axis labels and legend titles.

Arguments:
    ...   A list of new names in the form aesthetic = "new name".

Example:
    ggplot(avg_delays, aes(x = Reporting_Airline,
                           y = lubridate::wday(DayOfWeek, label = TRUE),
                           fill = bins)) +
      labs(x = "Reporting Airline", y = "Day of Week", title = "Average Arrival Delays")

Command: scale_fill_manual()

Syntax:
    scale_fill_manual(..., values)

Description: The scale_fill_manual function lets you specify your own mapping from levels in the data to fill colors.

Arguments:
    ...      Common discrete scale parameters: name, breaks, labels, na.value, limits and guide. See discrete_scale for more details.
    values   A set of aesthetic values to map data values to.

Example:
    scale_fill_manual(values = c("#d53e4f", "#f46d43", "#fdae61", "#fee08b",
                                 "#e6f598", "#abdda4"))

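The ggplot(), geom_tile(), geom_text(), labs() and scale_fill_manual() entries combine naturally into a single heat map. The following is a sketch under stated assumptions: ggplot2 is installed, avg_delays_toy is hypothetical data shaped like avg_delays, and the bin boundaries are chosen only for illustration.

```r
library(ggplot2)

# Hypothetical pre-aggregated data in the shape of avg_delays.
avg_delays_toy <- data.frame(
  Reporting_Airline = rep(c("AA", "AS"), each = 3),
  DayOfWeek         = factor(rep(c("Mon", "Tue", "Wed"), times = 2),
                             levels = c("Mon", "Tue", "Wed")),
  mean_delays       = c(4.2, 12.5, 27.1, 3.3, 9.8, 18.6)
)

# Bin the continuous delays so that a discrete (manual) fill scale applies.
avg_delays_toy$bins <- cut(avg_delays_toy$mean_delays,
                           breaks = c(0, 5, 15, 30),
                           labels = c("0-5", "5-15", "15-30"))

heatmap_plot <- ggplot(avg_delays_toy,
                       aes(x = Reporting_Airline, y = DayOfWeek, fill = bins)) +
  geom_tile(colour = "white") +                    # one rectangle per cell
  geom_text(aes(label = round(mean_delays, 1))) +  # value printed in each cell
  labs(x = "Reporting Airline", y = "Day of Week",
       title = "Average Arrival Delays", fill = "Delay (min)") +
  scale_fill_manual(values = c("#abdda4", "#fee08b", "#d53e4f"))

# Call print(heatmap_plot) to render the chart on the active graphics device.
```

The manual scale needs one color per level of bins, which is why the delays are cut into exactly three bins here.
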

Author(s)
Lakshmi Holla

Changelog
Date Version Changed by Change Description
2023-05-11 1.1 Eric Hao & Vladislav Boyko Updated Page Frames
2021-08-09 1.0 Lakshmi Holla Initial Version
