Chapter 03 Visualization (R)

The document discusses different types of graphs that can be used for data exploration and visualization, including basic plots like line graphs, bar charts and scatterplots, as well as distribution plots like histograms and boxplots. It also covers more advanced visualization techniques like heatmaps, multidimensional plots, and network graphs.

Chapter 3 – Data Visualization

Data Mining for Business Analytics in R
Shmueli, Bruce, Yahav, Patel & Lichtendahl

Graphs for Data Exploration

Basic Plots: Line Graphs, Bar Charts, Scatterplots
Distribution Plots: Boxplots, Histograms

Basic Plots

⚫Easy to create.
⚫Most commonly used in business.
⚫Display the relationship between two variables.
⚫Good for a first look at your data (missing values, types of variables, etc.).

⚫Line graphs: useful for showing the relationship between two continuous variables (time is usually on the x-axis).

⚫Bar charts: useful for comparing a single statistic (average, count, percentage) across groups. The height of each bar represents the value of the statistic, and different bars correspond to different groups. The x-axis variable must be categorical.

⚫Scatterplots: show the type of relationship between two numerical variables. The line of best fit is the line that comes closest to all the points on a scatterplot.
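The scatterplot bullet above mentions a line of best fit. A minimal sketch of drawing one in base R follows. It assumes R's built-in `cars` data set as a stand-in (it is not one of the chapter's data sets) and takes "closest" to mean ordinary least squares.

```r
# Stand-in data: R's built-in 'cars' (speed vs. stopping distance);
# not one of the chapter's data sets, used only for illustration.
fit <- lm(dist ~ speed, data = cars)   # least-squares line of best fit

plot(dist ~ speed, data = cars, xlab = "Speed", ylab = "Stopping distance")
abline(fit, lwd = 2)                   # overlay the fitted line on the scatterplot

slope <- unname(coef(fit)["speed"])    # change in dist per unit of speed
```

The same two calls, `lm()` followed by `abline()`, should work for any pair of numerical columns, such as the LSTAT and MEDV columns used later in the chapter.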
Line Graph for Time Series
Amtrak Ridership

# create a monthly time series (assumes Amtrak.df is already loaded and has a Ridership column)

library(forecast)
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), frequency = 12)
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
Bar Chart for Categorical Variable
Average median neighborhood value for neighborhoods that do and do not border the Charles River

## bar chart of CHAS vs. mean MEDV

# compute mean MEDV per CHAS = (0, 1)
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanMEDV")
barplot(data.for.plot$MeanMEDV, names.arg = data.for.plot$CHAS,
        xlab = "CHAS", ylab = "Avg. MEDV")
Scatterplot
Displays relationship between two numerical variables

## scatter plot with axis names (LSTAT on the x-axis, MEDV on the y-axis)

plot(housing.df$MEDV ~ housing.df$LSTAT, xlab = "LSTAT", ylab = "MEDV")
Distribution Plots
⚫Display “how many” of each value occur in a data set

⚫Or, for continuous data or data with many possible values, “how many” values are in each of a series of ranges or “bins”
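To make the idea of bins concrete, here is a small sketch that counts values per range without drawing anything. The values and the bin width of 5 are hypothetical, chosen only so the counts are easy to check by hand.

```r
# Hypothetical toy values, chosen only to make the counts easy to verify.
x <- c(1, 2, 2, 3, 7, 8, 12, 15, 18, 19)
breaks <- seq(0, 20, by = 5)           # bin edges: (0,5], (5,10], (10,15], (15,20]

# Count how many values fall in each bin.
bin.counts <- table(cut(x, breaks = breaks))

# hist() performs the same binning; plot = FALSE returns the counts only.
h <- hist(x, breaks = breaks, plot = FALSE)
```

Here both `bin.counts` and `h$counts` hold 4, 2, 2, 2, which are the bar heights a histogram with these breaks would show.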
Histograms

Boston Housing example:

Histogram shows the distribution of the outcome variable (median house value).

About 40 neighborhoods had a median house value < $10,000 (these data are from mid-20th century).

## histogram of MEDV

hist(housing.df$MEDV, xlab = "MEDV")

Boxplots
Side-by-side boxplots are useful for comparing subgroups

Houses in neighborhoods on the Charles River (1) are more valuable than those not on it (0)

## boxplot of MEDV for different values of CHAS

boxplot(housing.df$MEDV ~ housing.df$CHAS, xlab = "CHAS", ylab = "MEDV")

Box Plot
⚫Top outliers defined as those above Q3 + 1.5(Q3 - Q1).
⚫“max” = maximum of non-outliers
⚫Analogous definitions for bottom outliers and for “min”
⚫Details may differ across software

[Figure labels, top to bottom: outliers, “max”, Quartile 3, Median, Quartile 1, “min”]
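The outlier rule above can be worked through by hand. The sketch below uses hypothetical toy values and R's default `quantile()` definition of the quartiles, which is an assumption; the last line shows where R's own boxplot code may differ.

```r
# Hypothetical toy values with one obviously large observation.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1 <- unname(quantile(x, 0.25))        # Quartile 1 (R's default quantile type)
q3 <- unname(quantile(x, 0.75))        # Quartile 3
iqr <- q3 - q1

upper.fence <- q3 + 1.5 * iqr          # top outliers lie above this value
lower.fence <- q1 - 1.5 * iqr          # bottom outliers lie below this value

outliers <- x[x > upper.fence | x < lower.fence]
box.max <- max(x[x <= upper.fence])    # "max" = largest non-outlier
box.min <- min(x[x >= lower.fence])    # "min" = smallest non-outlier

# boxplot() relies on boxplot.stats(), which uses hinges rather than quantile(),
# so its fences can differ slightly from the ones computed above.
bs <- boxplot.stats(x)
```

For these values the upper fence is 14.5, so 50 is the only outlier and “max” is 9. `boxplot.stats()` flags the same point here, though with other data the two definitions can disagree, which is one instance of details differing across software.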
Heat Maps

Color conveys information

In data mining, used to visualize:
⚫Correlations
⚫Missing Data

Heatmap to highlight correlations
Darker & redder = more negative correlation
Lighter & yellower = more positive correlation

## simple heatmap of correlations (without values)

heatmap(cor(housing.df), Rowv = NA, Colv = NA)

## heatmap with values

library(gplots)
heatmap.2(cor(housing.df), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
          cellnote = round(cor(housing.df), 2),
          notecol = "black", key = FALSE, trace = 'none', margins = c(10, 10))
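The other use listed above, visualizing missing data, can be sketched with the same base `heatmap()` function. The data frame here is a hypothetical toy example with a few values blanked out, not one of the chapter's data sets.

```r
# Hypothetical toy data frame with a few values set to missing.
set.seed(1)
toy.df <- data.frame(a = rnorm(8), b = rnorm(8), c = rnorm(8))
toy.df$a[c(2, 5)] <- NA
toy.df$c[7] <- NA

# 1 = missing, 0 = present; heatmap() needs a numeric matrix.
missing.mat <- 1 * is.na(toy.df)

# Rowv/Colv = NA keeps the original row and column order (no clustering);
# scale = "none" plots the 0/1 values as they are.
heatmap(missing.mat, Rowv = NA, Colv = NA, scale = "none")
```

Each cell of the plot corresponds to one value of one record, so gaps that cluster in particular rows or columns stand out as blocks of a single color.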
Multidimensional Visualization
Scatterplot with color/shade added

Boston Housing: NOX vs. LSTAT

light shade = low median value
dark shade = high median value

# alternative plot with ggplot [text has more complex Base R coding too]
library(ggplot2)
ggplot(housing.df, aes(y = NOX, x = LSTAT, colour = CAT..MEDV)) +
  geom_point(alpha = 0.6)
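Bar chart for MEDV vs. RAD, first for neighborhoods that do not border the Charles River (CHAS = 0), then for those that do (CHAS = 1)

# compute mean MEDV per RAD and CHAS
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$RAD, housing.df$CHAS),
                           FUN = mean, drop = FALSE)
# name the columns so that ggplot can refer to RAD, CHAS and meanMEDV
names(data.for.plot) <- c("RAD", "CHAS", "meanMEDV")

ggplot(data.for.plot) +
  geom_bar(aes(x = as.factor(RAD), y = `meanMEDV`), stat = "identity") +
  xlab("RAD") + facet_grid(CHAS ~ .)

see Fig. 3.6 for more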

Matrix Scatterplot
Diagonal plot is the frequency distribution for the variable

## simple plot
# use plot() to generate a matrix of 4 x 4 panels with variable name on the diagonal,
# and scatter plots in the remaining panels.
plot(housing.df[, c(1, 3, 12, 13)])

# alternative, nicer plot (displayed)

library(GGally)
ggpairs(housing.df[, c(1, 3, 12, 13)])
Manipulation:

• Rescaling
• Aggregation
• Zooming
• Filtering
Rescaling to log scale (on right) “uncrowds” the data

## scatter plot: regular and log scale

plot(housing.df$MEDV ~ housing.df$CRIM, xlab = "CRIM", ylab = "MEDV")

# to use logarithmic scale set argument log = to either 'x', 'y', or 'xy'.
plot(housing.df$MEDV ~ housing.df$CRIM,
     xlab = "CRIM", ylab = "MEDV", log = 'xy')
Amtrak Ridership – Monthly Data – Curve Added

## fit curve (quadratic trend; tslm() comes from the forecast package)
ridership.lm <- tslm(ridership.ts ~ trend + I(trend^2))
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
lines(ridership.lm$fitted, lwd = 2)
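For readers without the forecast package, the curve above can be approximated in base R. The sketch below assumes a quadratic trend is simply a least-squares fit on a time index and its square. The series is synthetic (hypothetical coefficients and noise), not the Amtrak data.

```r
# Synthetic monthly series (hypothetical numbers, not the Amtrak data).
set.seed(1)
n <- 159                                   # number of months, Jan 1991 to Mar 2004
time.index <- 1:n                          # plays the role of 'trend'
y <- 1800 - 5 * time.index + 0.04 * time.index^2 + rnorm(n, sd = 50)
y.ts <- ts(y, start = c(1991, 1), frequency = 12)

# Quadratic trend by ordinary least squares on the index and its square.
fit <- lm(y ~ time.index + I(time.index^2))
fitted.ts <- ts(fitted(fit), start = c(1991, 1), frequency = 12)

plot(y.ts, xlab = "Year", ylab = "Synthetic series")
lines(fitted.ts, lwd = 2)                  # overlay the fitted curve
```

Wrapping the fitted values in `ts()` with the same start and frequency keeps the curve aligned with the calendar axis of the original plot.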
Amtrak Ridership
Panels: Monthly Average, Zoom in

## zoom in, monthly, and annual plots

# zoom in on the first two years
ridership.2yrs <- window(ridership.ts, start = c(1991, 1), end = c(1992, 12))
plot(ridership.2yrs, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))

# average ridership by calendar month
monthly.ridership.ts <- tapply(ridership.ts, cycle(ridership.ts), mean)
plot(monthly.ridership.ts, xlab = "Month", ylab = "Average Ridership",
     ylim = c(1300, 2300), type = "l", xaxt = 'n')
## set x labels
axis(1, at = c(1:12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

# average ridership by year
annual.ridership.ts <- aggregate(ridership.ts, FUN = mean)
plot(annual.ridership.ts, xlab = "Year", ylab = "Average Ridership",
     ylim = c(1300, 2300))
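The three operations used above (zooming with `window()`, monthly averaging with `cycle()` and `tapply()`, and annual aggregation with `aggregate()`) are easier to follow on a tiny series. The sketch below uses a hypothetical toy series of the numbers 1 to 24 so that every result can be checked by hand.

```r
# Toy monthly series: 1, 2, ..., 24 covering Jan 1991 to Dec 1992 (hypothetical).
x <- ts(1:24, start = c(1991, 1), frequency = 12)

# Zoom: keep only the first year.
first.year <- window(x, start = c(1991, 1), end = c(1991, 12))

# Monthly average: cycle() gives each observation's month (1-12),
# so tapply() averages all Januaries, all Februaries, and so on.
monthly.avg <- tapply(x, cycle(x), mean)

# Annual average: aggregate() on a ts collapses each year to one value.
annual.avg <- aggregate(x, FUN = mean)
```

Here `first.year` holds 1 to 12, `monthly.avg` holds 7 to 18 (each month's two values averaged), and `annual.avg` holds 6.5 and 18.5.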
Scatter Plot with Labels (Utilities)

plot(utilities.df$Fuel_Cost ~ utilities.df$Sales,
     xlab = "Sales", ylab = "Fuel Cost", xlim = c(2000, 20000))
text(x = utilities.df$Sales, y = utilities.df$Fuel_Cost,
     labels = utilities.df$Company, pos = 4, cex = 0.8, srt = 20, offset = 0.2)
Scaling: Smaller markers, jittering, color contrast
(Universal Bank; red = accept loan)

Jitter = add noise to “unstack” markers that hide markers underneath

# use function alpha() in library scales to add transparent colors

library(scales)
plot(jitter(universal.df$CCAvg, 1) ~ jitter(universal.df$Income, 1),
     col = alpha(ifelse(universal.df$Securities.Account == 0, "gray", "red"), 0.4),
     pch = 20, log = 'xy', ylim = c(0.1, 10),
     xlab = "Income", ylab = "CCAvg")
Panels: Without jittering, With jittering
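A small base R sketch of what jittering does follows. It assumes hypothetical toy data on a 5 x 5 integer grid, where many points share exactly the same coordinates, and a fixed noise amount of 0.2.

```r
# Hypothetical toy data: many points stacked on a small integer grid.
set.seed(1)
x <- sample(1:5, 200, replace = TRUE)
y <- sample(1:5, 200, replace = TRUE)

# jitter() adds uniform noise; 'amount' bounds it at +/- 0.2 here.
xj <- jitter(x, amount = 0.2)
yj <- jitter(y, amount = 0.2)

par(mfrow = c(1, 2))
plot(y ~ x, pch = 20, main = "Without jittering")
plot(yj ~ xj, pch = 20, main = "With jittering")
par(mfrow = c(1, 1))

n.distinct.before <- nrow(unique(data.frame(x, y)))    # at most 25 grid cells
n.distinct.after  <- nrow(unique(data.frame(xj, yj)))  # stacked points separate
```

Without jittering the 200 records collapse onto at most 25 visible markers. After jittering each record gets its own position, while staying within 0.2 of its true value.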
Parallel Coordinate Plot (Boston Housing)

All variables are rescaled to 0-1 scale
Each line is a single record

library(MASS)
par(mfcol = c(2, 1))
parcoord(housing.df[housing.df$CAT..MEDV == 0, -14], main = "CAT.MEDV = 0")
parcoord(housing.df[housing.df$CAT..MEDV == 1, -14], main = "CAT.MEDV = 1")
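The 0-1 rescaling mentioned above is handled by `parcoord()` itself, but it can be made visible with a short base R sketch. This one assumes the built-in `mtcars` data as a stand-in (not one of the chapter's data sets) and uses a hypothetical helper named `rescale01`.

```r
# Stand-in data: R's built-in 'mtcars' (not one of the chapter's data sets).
# rescale01 is a hypothetical helper: min-max rescaling of one column to [0, 1].
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Apply the rescaling column by column; the result is a numeric matrix.
scaled <- apply(mtcars, 2, rescale01)

# Draw one line per record across the rescaled variables.
matplot(t(scaled), type = "l", lty = 1, col = "gray40",
        xaxt = "n", xlab = "", ylab = "Rescaled value (0-1)")
axis(1, at = seq_len(ncol(scaled)), labels = colnames(scaled), las = 2)
```

Because every column now runs from 0 to 1, variables measured in very different units can share one vertical axis, which is what lets each record be drawn as a single line.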
Linked plots
(same record is highlighted in each plot)

Produced in Spotfire
Network Graph
eBay Auctions

library(igraph)
ebay.df <- read.csv("eBayNetwork.csv")
# transform node ids to factors
ebay.df[,1] <- as.factor(ebay.df[,1])
ebay.df[,2] <- as.factor(ebay.df[,2])
graph.edges <- as.matrix(ebay.df[,1:2])
# graph_from_edgelist() is the current name for graph.edgelist() in igraph
g <- graph_from_edgelist(graph.edges, directed = FALSE)
# nodes that appear in the second column are treated as buyers
isBuyer <- V(g)$name %in% graph.edges[,2]
plot(g, vertex.label = NA, vertex.color = ifelse(isBuyer, "gray", "black"),
     vertex.size = ifelse(isBuyer, 7, 10))
Treemap – eBay Auctions
(Hierarchical eBay data: Category > Sub-category > Brand)
Detail on this corner – next slide

library(treemap)
tree.df <- read.csv("EbayTreemap.csv")
# add column for negative feedback
tree.df$negative.feedback <- 1 * (tree.df$Seller.Feedback < 0)
# draw treemap
treemap(tree.df, index = c("Category", "Sub.Category", "Brand"),
        vSize = "High.Bid", vColor = "negative.feedback", fun.aggregate = "mean",
        align.labels = list(c("left", "top"), c("right", "bottom"), c("center", "center")),
        palette = rev(gray.colors(3)), type = "manual", title = "")
[Figure labels: Category, Subcategory, Brand]
Rectangle size = avg. value of item
Darker = more negative feedback

Using Google Maps
Location of Statistics.com students and instructors

library(ggmap)
# note: recent versions of ggmap need a Google Maps API key, registered with
# register_google(), before get_map() can download Google map tiles
SCstudents <- read.csv("SC-US-students-GPS-data-2016.csv")
Map <- get_map("Denver, CO", zoom = 3)
ggmap(Map) + geom_point(aes(x = longitude, y = latitude), data = SCstudents,
                        alpha = 0.4, colour = "red", size = 0.5)
Map Chart
(Comparing countries’ well-being with GDP)

Darker = higher value

library(mosaic)
gdp.df <- read.csv("gdp.csv", skip = 4, stringsAsFactors = FALSE)
names(gdp.df)[5] <- "GDP2015"
happiness.df <- read.csv("Veerhoven.csv")
# gdp map
mWorldMap(gdp.df, key = "Country.Name", fill = "GDP2015") + coord_map()
# well-being map
mWorldMap(happiness.df, key = "Nation", fill = "Score") + coord_map() +
scale_fill_continuous(name = "Happiness")
