Chapter 03 Visualization (R)
© Galit Shmueli and Peter Bruce 2017 rev Sep 14, 2019
Graphs for Data Exploration
⚫Easy to create.
⚫Most commonly used in business.
⚫Display the relationship between two variables.
⚫Good for exploring your data (missing values, types of variables, etc.).
⚫Line graphs: useful for showing the relationship between two continuous variables (time is usually the one on the x-axis).
⚫Bar charts: useful for comparing a single statistic (average, count, percentage) across groups. The height of each bar represents the value of the statistic, and different bars correspond to different groups. The x-axis variable must be categorical.
⚫Scatterplots: show the type of relationship between two numerical variables. The line of best fit is the line that comes closest to all the points on a scatter plot.
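A scatter plot with a line of best fit can be sketched in a few lines of base R. This is only an illustration: it uses the Boston data frame that ships with the MASS package as a stand-in for the housing data used later in these slides, and the choice of LSTAT and MEDV is an assumption made for the example.

```r
library(MASS)  # provides the Boston data frame (a stand-in, not the slides' housing.df)

# scatter plot of two numerical variables
plot(medv ~ lstat, data = Boston, xlab = "LSTAT", ylab = "MEDV")

# least-squares line of best fit, drawn on top of the points
fit <- lm(medv ~ lstat, data = Boston)
abline(fit, lwd = 2)
```

The fitted slope is negative here: neighborhoods with a higher LSTAT tend to have a lower MEDV.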
Line Graph for Time Series
Amtrak Ridership
library(forecast)
# assumes Amtrak.df has already been loaded (e.g. with read.csv()) and has a Ridership column
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), frequency = 12)
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
Bar Chart for Categorical Variable
Average median neighborhood value (MEDV) for neighborhoods that do and do not border the Charles River
## histogram of MEDV
Houses in neighborhoods on the Charles River (1) are more valuable than those not on the river (0).
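The code for this bar chart is not shown on the slide. A minimal sketch, assuming the Boston data frame from the MASS package as a stand-in for housing.df (its columns chas and medv play the roles of CHAS and MEDV):

```r
library(MASS)  # provides Boston; the slides' housing.df is assumed to be similar

# mean of MEDV within each CHAS group (0 = not on the river, 1 = on the river)
mean.medv <- tapply(Boston$medv, Boston$chas, mean)

# one bar per group; bar height is the group mean
barplot(mean.medv, names.arg = names(mean.medv),
        xlab = "CHAS", ylab = "Avg. MEDV")
```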
Boston Housing
# alternative plot with ggplot [text has more complex Base R coding too]
library(ggplot2)
ggplot(housing.df, aes(y = NOX, x = LSTAT, colour = CAT..MEDV)) +
  geom_point(alpha = 0.6)
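Bar chart for MEDV vs. RAD, in separate panels for neighborhoods that do not (CHAS = 0) and do (CHAS = 1) border the Charles River
# data.for.plot is assumed to hold the mean MEDV (column meanMEDV) for each RAD x CHAS combination
ggplot(data.for.plot) +
  geom_bar(aes(x = as.factor(RAD), y = meanMEDV), stat = "identity") +
  xlab("RAD") + facet_grid(CHAS ~ .)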
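The slide does not show how data.for.plot is built. One way to construct it is sketched below, assuming the Boston data frame from MASS as a stand-in for housing.df. The column names RAD, CHAS and meanMEDV are chosen to match the plotting call above, and the ggplot2 step runs only if that package is installed.

```r
library(MASS)  # provides Boston (stand-in for housing.df)

# mean MEDV for every observed RAD x CHAS combination
data.for.plot <- aggregate(medv ~ rad + chas, data = Boston, FUN = mean)
names(data.for.plot) <- c("RAD", "CHAS", "meanMEDV")

# draw the panelled bar chart only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.for.plot) +
    geom_bar(aes(x = as.factor(RAD), y = meanMEDV), stat = "identity") +
    xlab("RAD") + facet_grid(CHAS ~ .)
  print(p)
}
```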
## simple plot
# use plot() to generate a matrix of 4X4 panels with variable name on the diagonal,
# and scatter plots in the remaining panels.
plot(housing.df[, c(1, 3, 12, 13)])
• Rescaling
• Aggregation
• Zooming
• Filtering
Rescaling to log scale (on right) “uncrowds” the data
# to use logarithmic scale set argument log = to either 'x', 'y', or 'xy'.
plot(housing.df$MEDV ~ housing.df$CRIM,
     xlab = "CRIM", ylab = "MEDV", log = 'xy')
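To see the effect of rescaling, the original and log-scale plots can be drawn side by side. The sketch below assumes the Boston data frame from MASS as a stand-in for housing.df. A log axis requires strictly positive values, which holds for both variables in this data.

```r
library(MASS)  # provides Boston (stand-in for housing.df)

# log axes need strictly positive values
stopifnot(all(Boston$crim > 0), all(Boston$medv > 0))

par(mfrow = c(1, 2))  # two panels: original scale on the left, log scale on the right
plot(medv ~ crim, data = Boston, xlab = "CRIM", ylab = "MEDV",
     main = "Original scale")
plot(medv ~ crim, data = Boston, xlab = "CRIM", ylab = "MEDV",
     log = "xy", main = "Log scale")
par(mfrow = c(1, 1))  # restore the single-panel layout
```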
Amtrak Ridership – Monthly Data – Curve Added
## fit curve
ridership.lm <- tslm(ridership.ts ~ trend + I(trend^2))
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
lines(ridership.lm$fitted, lwd = 2)
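The same quadratic trend can be fitted with base R's lm() when the forecast package is not at hand. This is a sketch under assumptions: the built-in AirPassengers series stands in for the Amtrak data, and the variable trend is built by hand to play the role of tslm's trend term.

```r
y <- AirPassengers        # built-in monthly series, used only as a stand-in
trend <- seq_along(y)     # 1, 2, ..., n: a linear time index

# quadratic trend: intercept + trend + trend^2
fit <- lm(as.numeric(y) ~ trend + I(trend^2))

# put the fitted values back on the original time axis
fitted.ts <- ts(fitted(fit), start = start(y), frequency = frequency(y))

plot(y, xlab = "Year", ylab = "Passengers (in 000s)")
lines(fitted.ts, lwd = 2)
```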
Amtrak Ridership: Monthly Average and Zoom In
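Aggregation and zooming of a monthly series can be sketched with base R alone. The built-in AirPassengers series is used here as a stand-in for the Amtrak data, and the two-year window is an arbitrary choice for illustration.

```r
y <- AirPassengers  # built-in monthly series, used only as a stand-in

par(mfrow = c(1, 2))

# aggregation: average value for each calendar month across all years
monthly.avg <- tapply(y, cycle(y), mean)
plot(1:12, monthly.avg, type = "l", xaxt = "n",
     xlab = "Month", ylab = "Average passengers")
axis(1, at = 1:12, labels = month.abb)

# zooming: restrict the series to a two-year window
y.zoom <- window(y, start = c(1955, 1), end = c(1956, 12))
plot(y.zoom, xlab = "Year", ylab = "Passengers")

par(mfrow = c(1, 1))
```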
plot(utilities.df$Fuel_Cost ~ utilities.df$Sales,
     xlab = "Sales", ylab = "Fuel Cost", xlim = c(2000, 20000))
text(x = utilities.df$Sales, y = utilities.df$Fuel_Cost,
     labels = utilities.df$Company, pos = 4, cex = 0.8, srt = 20, offset = 0.2)
Scaling: Smaller markers, jittering, color contrast
(Universal Bank; red = accept loan)
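The three ideas on this slide (smaller markers, jittering, color contrast) can be illustrated on simulated data. Everything in the sketch below is hypothetical: the variables x, y and accept are randomly generated and are not the Universal Bank data.

```r
set.seed(1)
n <- 500
x <- sample(1:10, n, replace = TRUE)  # hypothetical discrete predictor
y <- sample(1:5, n, replace = TRUE)   # hypothetical discrete predictor
accept <- rbinom(n, 1, 0.1)           # hypothetical 0/1 outcome, about 10% ones

par(mfrow = c(1, 2))

# raw plot: many records fall on the same point and hide each other
plot(x, y, main = "Raw")

# jittered plot: small random noise, small markers, red for accept = 1
x.jit <- jitter(x)
y.jit <- jitter(y)
plot(x.jit, y.jit, pch = 20, cex = 0.6,
     col = ifelse(accept == 1, "red", rgb(0, 0, 0, 0.3)),
     main = "Jittered")

par(mfrow = c(1, 1))
```

Jittering moves each point only slightly, so the overall pattern is preserved while overlapping records become visible.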
Parallel coordinates plot: all variables are rescaled to a 0-1 scale. Each line is a single record.
library(MASS)
par(mfcol = c(2,1))
parcoord(housing.df[housing.df$CAT..MEDV == 0, -14], main = "CAT.MEDV = 0")
parcoord(housing.df[housing.df$CAT..MEDV == 1, -14], main = "CAT.MEDV = 1")
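The 0-1 rescaling applied to each variable in a parallel coordinates plot is a min-max transformation. A short sketch of it, assuming the Boston data frame from MASS as a stand-in for housing.df (rescale01 is a helper name introduced only for this example):

```r
library(MASS)  # provides Boston (stand-in for housing.df)

# min-max rescaling: the smallest value maps to 0, the largest to 1
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

vars <- Boston[, c("crim", "indus", "lstat", "medv")]
scaled <- as.data.frame(lapply(vars, rescale01))

# every column now spans exactly [0, 1]
sapply(scaled, range)
```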
Linked plots
(same record is highlighted in each plot)
Produced in Spotfire
Network Graph
eBay Auctions
library(igraph)
ebay.df <- read.csv("eBayNetwork.csv")
# transform node ids to factors
ebay.df[,1] <- as.factor(ebay.df[,1])
ebay.df[,2] <- as.factor(ebay.df[,2])
graph.edges <- as.matrix(ebay.df[,1:2])
# graph_from_edgelist() is the current name for the older graph.edgelist()
g <- graph_from_edgelist(graph.edges, directed = FALSE)
isBuyer <- V(g)$name %in% graph.edges[,2]
plot(g, vertex.label = NA,
     vertex.color = ifelse(isBuyer, "gray", "black"),
     vertex.size = ifelse(isBuyer, 7, 10))
Treemap – eBay Auctions
(Hierarchical eBay data: Category > Sub-category > Brand)
(Detail of one corner is shown on the next slide.)
library(treemap)
tree.df <- read.csv("EbayTreemap.csv")
# add column for negative feedback
tree.df$negative.feedback <- 1* (tree.df$Seller.Feedback < 0)
# draw treemap
treemap(tree.df, index = c("Category", "Sub.Category", "Brand"),
        vSize = "High.Bid", vColor = "negative.feedback", fun.aggregate = "mean",
        align.labels = list(c("left", "top"), c("right", "bottom"), c("center", "center")),
        palette = rev(gray.colors(3)), type = "manual", title = "")
Treemap detail: rectangles are nested by Category, Subcategory, and Brand. Rectangle size = avg. value of item.
library(ggmap)
SCstudents <- read.csv("SC-US-students-GPS-data-2016.csv")
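# note: recent ggmap versions need a Google Maps API key (see register_google())
# before get_map() can look up a place name
Map <- get_map("Denver, CO", zoom = 3)
ggmap(Map) + geom_point(aes(x = longitude, y = latitude), data = SCstudents,
                        alpha = 0.4, colour = "red", size = 0.5)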
Map Chart
(Comparing countries’ well-being with GDP)
library(mosaic)
gdp.df <- read.csv("gdp.csv", skip = 4, stringsAsFactors = FALSE)
names(gdp.df)[5] <- "GDP2015"
happiness.df <- read.csv("Veerhoven.csv")
# gdp map
mWorldMap(gdp.df, key = "Country.Name", fill = "GDP2015") + coord_map()
# well-being map
mWorldMap(happiness.df, key = "Nation", fill = "Score") + coord_map() +
  scale_fill_continuous(name = "Happiness")