Data Exploration, Visualization and Feature Engineering Using R
Yuhui Zhang and Raja Iqbal
library(datasets)
## scatter plot
plot(x = airquality$Temp, y = airquality$Ozone)
[Figure: scatter plot of airquality$Ozone (y axis) against airquality$Temp (x axis)]
Base plotting system
## par() function is used to specify global graphics parameters that affect all plots in an R session.
## Type ?par to see all parameters
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, {
plot(Wind, Ozone, main="Ozone and Wind")
plot(Temp, Ozone, main="Ozone and Temperature")
mtext("Ozone and Weather in New York City", outer=TRUE)})
[Figure: two side-by-side scatter plots, Ozone against Wind and Ozone against Temp, under the shared title "Ozone and Weather in New York City"]
PHASE ONE: Mount a canvas panel on the easel, and draw the draft. (Initialize a plot.)
Use ?plot or str(plot) to check the available arguments when you want to customize a plot.
A more detailed tutorial on the base plotting system:
http://bcb.dfci.harvard.edu/~aedin/courses/BiocDec2011/2.Plotting.pdf
PHASE TWO: Add more details on your canvas, and make an artwork. (Add more to an existing plot.)
• lines: adds lines to a plot, given a vector of x values and a corresponding vector of y values
• points: adds points to a plot
• text: adds text labels to a plot at the specified x,y coordinates
• title: adds annotations: axis labels, title, subtitle, outer margin
• mtext: adds arbitrary text to the margins (inner or outer) of a plot
• axis: adds axis ticks and labels
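The two phases can be sketched on the airquality data used earlier. This is a minimal illustration, not the authors' code: the linear fit, the text position and the annotation wording are all choices made for this example.

```r
library(datasets)

## PHASE ONE: initialize an empty plot (no points, no axes, no labels)
with(airquality, plot(Wind, Ozone, type = "n", axes = FALSE, xlab = "", ylab = ""))

## PHASE TWO: add the pieces one at a time
with(airquality, points(Wind, Ozone, pch = 20))           # the observations
fit <- lm(Ozone ~ Wind, data = airquality)                # rows with missing Ozone are dropped
wind_range <- range(airquality$Wind)
lines(wind_range,
      predict(fit, newdata = data.frame(Wind = wind_range)),
      col = "blue")                                       # fitted straight line
axis(1)                                                   # x axis ticks
axis(2)                                                   # y axis ticks
box()                                                     # frame around the plot region
title(main = "Ozone and Wind", xlab = "Wind", ylab = "Ozone")
text(x = 15, y = 150, labels = "linear fit in blue")      # label at given coordinates
mtext("New York, May to September 1973", side = 3, line = 0.5)
```

Each call after plot() draws on top of the existing plot, which is why the order matters.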
R can generate graphics (of varying levels of quality) on almost any type of display or printing device. For example:
## the layout
par(mfrow = c(2, 1), mar = c(2, 0, 2, 0), oma = c(0, 0, 0, 0))
## histogram at the top
hist(airquality$Ozone, breaks=12, main = "Histogram of Ozone")
## box plot below for comparison
boxplot(airquality$Ozone, horizontal=TRUE, main = "Box plot of Ozone")
[Figure: "Histogram of Ozone" stacked above "Box plot of Ozone", both on a shared 0 to 150 scale of airquality$Ozone]
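library(lattice) ## xyplot() and the panel.* functions come from the lattice package
## x and y are assumed to be numeric vectors and f a factor (Group 1 to Group 4)
xyplot(y ~ x | f, panel = function(x, y, ...) {
# call the default panel function for xyplot
panel.xyplot(x, y, ...)
# adds a horizontal line at the median
panel.abline(h = median(y), lty = 2)
# overlays a simple linear regression line
panel.lmline(x, y, col = 2)
})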
[Figure: lattice scatter plots of y against x in four panels (Group 1 to Group 4), each with a dashed median line and a regression line]
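The xyplot() call above expects x, y and f to exist already, but their definitions are not shown. A minimal sketch of suitable inputs, assuming simulated data (the seed, group size, slopes and noise level are illustrative choices, not the authors' data):

```r
## Illustrative data only: every number below is an assumption
set.seed(10)
n_per_group <- 50

## f: a factor with four levels, used to split the plot into panels
f <- factor(rep(1:4, each = n_per_group), labels = paste("Group", 1:4))

## x: standard normal values; y: a group-specific line plus noise
x <- rnorm(4 * n_per_group)
y <- as.integer(f) + 0.5 * as.integer(f) * x + rnorm(4 * n_per_group, sd = 0.5)

str(data.frame(x, y, f))
```

With these three objects in the workspace, the panel function receives only the x and y values that belong to the panel being drawn, so median(y) is computed per group.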
Plotting functions
• xyplot(): main function for creating scatterplots
• bwplot(): box-and-whisker plots (box plots)
• histogram(): histograms
• stripplot(): like a box plot, but with the actual points
• dotplot(): plot dots on "violin strings"
• splom(): scatterplot matrix (like pairs() in the base plotting system)
• levelplot()/contourplot(): plotting image data
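A few of these functions can be tried on the built-in iris data. This is a small sketch for orientation; the choice of iris and of these particular formulas is an assumption made for the example.

```r
library(lattice)   # a recommended package, normally installed together with R
library(datasets)  # provides iris

p_box  <- bwplot(Species ~ Sepal.Length, data = iris)       # one box per species
p_hist <- histogram(~ Sepal.Length | Species, data = iris)  # one histogram panel per species
p_mat  <- splom(iris[, 1:4])                                # scatterplot matrix of the 4 measurements

## lattice functions return "trellis" objects; print() draws them
print(p_box)
print(p_hist)
print(p_mat)
```

Unlike base graphics, nothing is drawn until the trellis object is printed, which happens automatically at the console but must be done explicitly inside functions and loops.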
Very useful when we want a lot...
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]
ggplot2
qplot function
The qplot() function is the analog of plot(), but with many built-in features.
Its syntax is somewhere between base and lattice.
It is difficult to customize (don't bother; use the full power of ggplot2 in that case).
library(ggplot2) ## need to install and load this library
qplot(displ, hwy, data = mpg, facets = .~drv)
[Figure: scatter plots of hwy against displ, one panel per drv value (4, f, r)]
ggplot function
[Figure: Sepal.Width against Sepal.Length, one panel per species (setosa, versicolor, virginica)]
[Figure: Sepal.Width against Sepal.Length in three panels]
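The code behind the ggplot figures in this section is not shown. A minimal sketch that produces a comparable plot of Sepal.Width against Sepal.Length with one panel per species, assuming ggplot2 is installed and that the figures use the built-in iris data (both are assumptions; the layer and facet choices are a guess at the original):

```r
library(ggplot2)   # assumed to be installed
library(datasets)  # provides iris

## Initialize the plot: data plus aesthetic mappings, nothing drawn yet
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

## Add a layer of points, then split into one panel per species
p <- p + geom_point() + facet_grid(. ~ Species)

print(p)
```

A ggplot object is built up with `+`, so the same base object `p` can be reused with different layers or facets.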
Great documentation
Great documentation of ggplot2, covering all the functions in steps II and III, with demos:
http://docs.ggplot2.org/current/
Titanic tragedy data
titanic = read.csv('Titanic_train.csv')
options(width = 110)
head(titanic)
sapply(titanic,class)
Converting class label to a factor
str(titanic$Survived)
str(titanic$Sex)
survivedTable = table(titanic$Survived)
survivedTable
##
## died survived
## 549 342
[Figure: pie chart of passengers who died and who survived]
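The conversion step itself is not shown in this section. A minimal sketch of turning a 0/1 class label into a labelled factor, using a hypothetical toy vector in place of titanic$Survived (the vector and the object names are made up for illustration):

```r
## Hypothetical toy vector standing in for titanic$Survived (0 = died, 1 = survived)
survived_raw <- c(0, 1, 1, 0, 0, 1, 0)

## factor() maps each level to a readable label
survived <- factor(survived_raw, levels = c(0, 1), labels = c("died", "survived"))
str(survived)

## Counts per class, then the same counts as a pie chart
survived_table <- table(survived)
survived_table
pie(survived_table)
```

On the real data the same factor() call would be applied to titanic$Survived. Note that comparisons against "0" or "1" no longer match once the labels are changed.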
Is Sex a good predictor?
male = titanic[titanic$Sex=="male",]
female = titanic[titanic$Sex=="female",]
par(mfrow = c(1, 2), mar = c(0, 0, 2, 0), oma = c(0, 1, 0, 1))
pie(table(male$Survived),labels=c("Dead","Survived"), main="Survival Portion Among Men")
pie(table(female$Survived),labels=c("Dead","Survived"), main="Survival Portion Among Women")
[Figure: two pie charts, "Survival Portion Among Men" and "Survival Portion Among Women", each split into Dead and Survived]
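The pie charts compare proportions by eye; the same comparison can be made numerically with prop.table(). The sketch below uses R's built-in Titanic table so that it runs without the CSV file. That table is a different data set from Titanic_train.csv (it is an aggregated 4-way table), so the numbers will not match the slides; it only illustrates the technique.

```r
library(datasets)

## Titanic is a 4-way table: Class x Sex x Age x Survived
## Sum over Class and Age to keep only Sex (rows) and Survived (columns)
sex_by_survival <- apply(Titanic, c(2, 4), sum)

## Row-wise proportions: each row sums to 1
survival_rate <- prop.table(sex_by_survival, margin = 1)
round(survival_rate, 2)
```

With the CSV loaded, the analogous call would be prop.table(table(titanic$Sex, titanic$Survived), margin = 1).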
summary(titanic[titanic$Survived=="0",]$Age)
summary(titanic[titanic$Survived=="1",]$Age)
par(mfrow = c(1, 2), mar = c(4, 4, 2, 2), oma = c(1, 1, 1, 1))
boxplot(titanic$Age~titanic$Sex, main="Age Distribution By Gender",col=c("red","green"))
boxplot(titanic$Age~titanic$Survived, main="Age Distribution By Survival",col=c("red","green"),
xlab="0:Died 1:Survived",ylab="Age")
[Figure: two box plots, "Age Distribution By Gender" and "Age Distribution By Survival" (x axis: 0:Died 1:Survived, y axis: Age)]
Histogram of Age
[Figure: histogram "Distribution of Passenger Ages on Titanic" (x axis: Age, y axis: Frequency)]
[Figure: "Kernel density of Ages of Titanic Passengers" (y axis: Density)]
[Figure: "Kernel Density Plot of Ages By Sex", with one curve each for female and male (x axis: Age of Passenger, y axis: Density)]
[Figure: three kernel density panels of age (y axis: Density); legend entries recovered: female, male, died, survived]
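The code for the age histogram and the kernel density plots is not shown. A minimal sketch of how such plots can be drawn in base R, using a hypothetical stand-in for the titanic data frame so that it runs without the CSV file (the simulated ages, the seed and the object names are all assumptions):

```r
## Hypothetical stand-in for the titanic data frame: Age (with missing values) and Sex
set.seed(1)
toy <- data.frame(
  Age = c(round(runif(60, min = 1, max = 80)), rep(NA, 10)),
  Sex = rep(c("male", "female"), times = 35)
)

## Histogram of age (hist() ignores missing values)
hist(toy$Age, main = "Distribution of Passenger Ages", xlab = "Age")

## density() stops on missing values unless na.rm = TRUE
d_all    <- density(toy$Age, na.rm = TRUE)
d_female <- density(toy$Age[toy$Sex == "female"], na.rm = TRUE)
d_male   <- density(toy$Age[toy$Sex == "male"], na.rm = TRUE)

plot(d_all, main = "Kernel density of ages")

## Overlay two densities: draw one, then add the other with lines()
plot(d_female, main = "Kernel density of ages by sex", xlab = "Age of Passenger",
     ylim = range(d_female$y, d_male$y))
lines(d_male, lty = 2)
legend("topright", legend = c("female", "male"), lty = c(1, 2))
```

The smoothed curve extends past the smallest and largest observed values, which is why a density plot of ages can show a tail below zero.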
## Child is assumed to start as a copy of the Age column
Child <- titanic$Age
## Now we need to create categories: NA = Unknown, 1 = Child, 2 = Adult
## Every age below 13 (exclusive) is classified into age group 1
Child[Child<13] <- 1
## Every age 13 or above is classified into age group 2
Child[Child>=13] <- 2
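The recoding above overwrites ages with group codes in the same vector, so the order of the two assignments matters. An alternative sketch that builds a labelled factor in one pass, shown on a hypothetical toy vector in place of titanic$Age (the vector and the names age and age_group are made up for illustration):

```r
## Hypothetical ages standing in for titanic$Age (NA = age unknown)
age <- c(4, 12, 13, 35, NA, 60)

## Unknown if missing, Child if below 13, Adult otherwise
age_group <- ifelse(is.na(age), "Unknown",
                    ifelse(age < 13, "Child", "Adult"))
age_group <- factor(age_group, levels = c("Child", "Adult", "Unknown"))

table(age_group)
```

Keeping the original ages untouched and storing the groups in a separate factor avoids mixing ages and codes in one column.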
Fare matters?
[Figure: Fare against Survived (died, survived) in a grid of panels, with columns Adult and Child and rows female and male]
How about fare, ship class and port of embarkation?
[Figure: Fare against Pclass (1, 2, 3), one panel per port of embarkation (unknown, Cherbourg, Queenstown, Southampton)]
Diamond data
Histogram of carat
library(ggplot2)
ggplot(data=diamonds) + geom_histogram(aes(x=carat))
[Figure: histogram of carat (y axis: count)]
ggplot(data=diamonds) +
geom_density(aes(x=carat),fill="gray50")
[Figure: density plot of carat, filled in gray]
[Figure: scatter plot of price against carat]
[Figure: scatter plot of price against carat, points colored by diamond color (D to J)]
g + geom_point(aes(color=color)) + facet_wrap(~color)
[Figure: price against carat, points colored by color, one panel per color (D to J)]
[Figure: price against carat, points colored by color, in a grid of panels with one row per cut (Fair, Good, Very Good, Premium, Ideal)]
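The object g used in the faceting code is never defined in this section. A minimal sketch, assuming g is the base plot of price against carat and that ggplot2 is installed; the last plot is a guess at how the final figure was made, based on its row labels (cut) and its eight columns (clarity has eight levels):

```r
library(ggplot2)  # assumed to be installed; also provides the diamonds data

## Assumption: g holds the data and the x/y mapping, with no layers yet
g <- ggplot(diamonds, aes(x = carat, y = price))

p_plain   <- g + geom_point()                        # plain scatter plot
p_colored <- g + geom_point(aes(color = color))      # points colored by diamond color
p_wrap    <- p_colored + facet_wrap(~ color)         # one panel per color
p_grid    <- p_colored + facet_grid(cut ~ clarity)   # rows: cut, columns: clarity (a guess)

print(p_wrap)
```

facet_wrap() lays out the panels of a single variable in a wrapped ribbon, while facet_grid() crosses two variables into rows and columns.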
Your turn!
• R Markdown combines the core syntax of Markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document.
Markdown is a markup language with plain text formatting syntax, designed so that it can be converted to HTML and many other formats using a tool of the same name.
It takes about a minute to get the point; after that, keep the cheat sheet at hand:
https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#lists
Then, R Markdown sample code
R Markdown
• YAML header
• Edit Markdown, and R chunks
• Run!
RStudio: Knit button
Command line: rmarkdown::render("file.Rmd")
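The three steps (YAML header, Markdown plus R chunks, run) can be sketched from R itself. The example below writes a small .Rmd file to a temporary directory and renders it only if the rmarkdown package and pandoc are available. The file name, title and chunk content are made up for illustration.

```r
## Three backticks, built here so the chunk fence can be written as a string
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                # YAML header starts
  "title: \"Ozone summary\"",
  "output: html_document",
  "---",                                # YAML header ends
  "",
  "## Air quality",
  "",
  "Plain *Markdown* text goes here.",
  "",
  paste0(fence, "{r}"),                 # R chunk starts
  "summary(airquality$Ozone)",
  fence                                 # R chunk ends
)

rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

## Render only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In everyday use the .Rmd file is written by hand in an editor; building it from a character vector here just keeps the example in a single script.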
Titanic