Chapter 03 Visualization (R)

The document discusses different types of graphs that can be used for data exploration and visualization, including basic plots like line graphs, bar charts and scatterplots, as well as distribution plots like histograms and boxplots. It also covers more advanced visualization techniques like heatmaps, multidimensional plots, and network graphs.

Chapter 3 – Data Visualization

Data Mining for Business Analytics in R
Shmueli, Bruce, Yahav, Patel & Lichtendahl

© Galit Shmueli and Peter Bruce 2017 rev Sep 14, 2019
Graphs for Data Exploration

Basic Plots: Line Graphs, Bar Charts, Scatterplots
Distribution Plots: Boxplots, Histograms
Graphs for Data Exploration
Basic Plots

⚫Easy to create.
⚫Most commonly used in business.
⚫Display the relationship between two variables.
⚫Good for exploring your data (missing values, types of variables, etc.).

⚫Line graphs: useful for showing the relationship between two continuous variables (time is usually on the x-axis).

⚫Bar charts: useful for comparing a single statistic (average, count, percentage) across groups. The height of each bar represents the value of the statistic, and different bars correspond to different groups. The x-axis variable must be categorical.

⚫Scatterplots: show the type of relationship between two numerical variables. The line of best fit is the line that comes closest to all the points on a scatterplot.
Line Graph for Time Series
Amtrak Ridership

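# build a monthly time series (forecast is also needed later for tslm())
# assumes Amtrak.df, with a Ridership column, has already been read in
library(forecast)
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), frequency = 12)
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))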
Bar Chart for Categorical Variable
Average median neighborhood value for neighborhoods that do and do
not border the Charles River

## barchart of CHAS vs. mean MEDV

# compute mean MEDV per CHAS = (0, 1)
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanMEDV")
barplot(data.for.plot$MeanMEDV, names.arg = data.for.plot$CHAS,
    xlab = "CHAS", ylab = "Avg. MEDV")
Scatterplot
Displays relationship between two numerical variables

## scatter plot with axes names
# the formula y ~ x puts LSTAT on the x-axis and MEDV on the y-axis
plot(housing.df$MEDV ~ housing.df$LSTAT, xlab = "LSTAT", ylab = "MEDV")
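
The line of best fit mentioned under Basic Plots can be added to a scatterplot with a fitted linear model. The sketch below is illustrative only: it uses simulated data (the names x, y and fit are hypothetical and not part of the Boston Housing example) and takes ordinary least squares, as fitted by lm(), as one common definition of "closest to all the points".

```r
# Simulated data, assumed for illustration only (not the Boston Housing data)
set.seed(1)
x <- runif(50, min = 0, max = 30)
y <- 35 - 0.9 * x + rnorm(50, sd = 3)

# lm() fits the least-squares line y = a + b * x
fit <- lm(y ~ x)

# scatter plot with the fitted line drawn on top
plot(y ~ x, xlab = "x", ylab = "y")
abline(fit, lwd = 2)

# intercept and slope of the fitted line
print(coef(fit))
```

With the housing data, the same pattern would presumably apply by fitting MEDV on LSTAT and passing the fitted model to abline().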
Distribution Plots
⚫Display “how many” of each value occur in a data set
⚫Or, for continuous data or data with many possible values, “how many” values are in each of a series of ranges or “bins”
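
To make the idea of bins concrete, here is a minimal sketch that counts values per bin by hand and then checks the result against hist(). The vector values and the bin edges are made up for illustration.

```r
# Hypothetical sample values (assumed for illustration)
values <- c(3, 7, 8, 12, 14, 15, 21, 22, 29, 35)

# bin edges of width 10
breaks <- seq(0, 40, by = 10)

# cut() assigns each value to a bin; table() counts the values in each bin
# right = FALSE makes each bin include its left edge: [0,10), [10,20), ...
bin.counts <- table(cut(values, breaks = breaks, right = FALSE))
print(bin.counts)

# hist() does the same counting when given the same breaks
h <- hist(values, breaks = breaks, right = FALSE, plot = FALSE)
print(h$counts)
```

The bar heights of a histogram are exactly these counts, so changing the bin edges changes the picture.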
Histograms

Boston Housing example:

Histogram shows the distribution of the outcome variable (median house value).

About 40 neighborhoods had a median house value < $10,000 (these data are from the mid-20th century).

## histogram of MEDV

hist(housing.df$MEDV, xlab = "MEDV")


Boxplots
Side-by-side boxplots are useful for comparing subgroups

Houses in neighborhoods on the Charles River (1) are more valuable than those not (0)

## boxplot of MEDV for different values of CHAS

boxplot(housing.df$MEDV ~ housing.df$CHAS, xlab = "CHAS", ylab = "MEDV")


Box Plot
⚫Top outliers defined as those above Q3 + 1.5(Q3 - Q1).
⚫“max” = maximum of non-outliers
⚫Analogous definitions for bottom outliers and for “min”
⚫Details may differ across software

[Figure labels, top to bottom: outliers, “max”, Quartile 3, Median, Quartile 1, “min”]
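
The outlier rule above can be worked through on a small example. This is a sketch on a made-up vector (values and the other variable names are hypothetical). It uses quantile() for Q1 and Q3, whereas boxplot() uses hinges, which is one instance of details differing across software.

```r
# Hypothetical values with one unusually large observation
values <- c(2, 4, 5, 6, 7, 8, 9, 10, 11, 30)

q1 <- quantile(values, 0.25, names = FALSE)
q3 <- quantile(values, 0.75, names = FALSE)
iqr <- q3 - q1

# cutoffs for top and bottom outliers
upper.fence <- q3 + 1.5 * iqr
lower.fence <- q1 - 1.5 * iqr

# outliers lie beyond the cutoffs
outliers <- values[values > upper.fence | values < lower.fence]

# "max" and "min" are the most extreme non-outliers
whisker.max <- max(values[values <= upper.fence])
whisker.min <- min(values[values >= lower.fence])

print(c(lower.fence = lower.fence, upper.fence = upper.fence))
print(outliers)
print(c(whisker.min = whisker.min, whisker.max = whisker.max))
```

For comparison, boxplot.stats(values)$out returns the points that boxplot() itself would draw as outliers.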
Heat Maps

Color conveys information

In data mining, used to visualize:
• Correlations
• Missing Data
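
For the missing-data use, one possible approach is to turn the data frame into a 0/1 matrix of missing-value indicators and draw that as a heatmap. The sketch below uses a small simulated data frame (demo.df and its columns are hypothetical) with a few values deliberately set to NA.

```r
# Simulated data with missing values planted by hand (illustration only)
set.seed(1)
demo.df <- data.frame(v1 = rnorm(20), v2 = rnorm(20), v3 = rnorm(20))
demo.df$v1[c(2, 5, 11)] <- NA
demo.df$v3[c(5, 6, 7, 8)] <- NA

# is.na() gives a logical matrix; multiplying by 1 converts it to 0/1
missing.mat <- 1 * is.na(demo.df)

# scale = "none" keeps the raw 0/1 values; no row or column reordering
heatmap(missing.mat, Rowv = NA, Colv = NA, scale = "none")

# number of missing values per column
print(colSums(missing.mat))
```

Rows are records and columns are variables, so runs of the "missing" color show which records and variables lose data together.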
Heatmap to highlight correlations
Darker & redder = more negative correlation
Lighter and yellower = more positive correlation

## simple heatmap of correlations (without values)


heatmap(cor(housing.df), Rowv = NA, Colv = NA)

## heatmap with values


library(gplots)
heatmap.2(cor(housing.df), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
cellnote = round(cor(housing.df),2),
notecol = "black", key = FALSE, trace = 'none', margins = c(10,10))
Multidimensional Visualization
Scatterplot with color/shade added

Boston Housing

NOX vs. LSTAT


light shade = low median value
dark shade = high median value

# alternative plot with ggplot [text has more complex Base R coding too]
library(ggplot2)
ggplot(housing.df, aes(y = NOX, x = LSTAT, colour= CAT..MEDV)) +
geom_point(alpha = 0.6)
Bar chart for MEDV vs. RAD, paneled by CHAS: first for neighborhoods that do not
border the Charles River (0), then for those that do (1)

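# compute mean MEDV per (RAD, CHAS) combination
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$RAD, housing.df$CHAS),
    FUN = mean, drop = FALSE)
# aggregate() names its output Group.1, Group.2, x; rename to match the plot
names(data.for.plot) <- c("RAD", "CHAS", "meanMEDV")

ggplot(data.for.plot) +
    geom_bar(aes(x = as.factor(RAD), y = meanMEDV), stat = "identity") +
    xlab("RAD") + facet_grid(CHAS ~ .)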

see Fig. 3.6 for more


Matrix Scatterplot

Diagonal plot is the frequency distribution for the variable

## simple plot
# use plot() to generate a matrix of 4X4 panels with variable name on the diagonal,
# and scatter plots in the remaining panels.
plot(housing.df[, c(1, 3, 12, 13)])

# alternative, nicer plot (displayed)


library(GGally)
ggpairs(housing.df[, c(1, 3, 12, 13)])
Manipulation:

• Rescaling
• Aggregation
• Zooming
• Filtering
Rescaling to log scale (on right)
“uncrowds” the data

## scatter plot: regular and log scale


plot(housing.df$MEDV ~ housing.df$CRIM, xlab = "CRIM", ylab = "MEDV")

# to use logarithmic scale set argument log = to either 'x', 'y', or 'xy'.
plot(housing.df$MEDV ~ housing.df$CRIM,
xlab = "CRIM", ylab = "MEDV", log = 'xy')
Amtrak Ridership – Monthly Data – Curve Added

## fit curve
ridership.lm <- tslm(ridership.ts ~ trend + I(trend^2))
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)",
    ylim = c(1300, 2300))
lines(ridership.lm$fitted, lwd = 2)
Amtrak Ridership: Zoom In and Monthly Average

## zoom in, monthly, and annual plots


ridership.2yrs <- window(ridership.ts, start = c(1991,1), end = c(1992,12))
plot(ridership.2yrs, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
monthly.ridership.ts <- tapply(ridership.ts, cycle(ridership.ts), mean)
plot(monthly.ridership.ts, xlab = "Month", ylab = "Average Ridership",
ylim = c(1300, 2300), type = "l", xaxt = 'n')
## set x labels
axis(1, at = c(1:12), labels = c("Jan","Feb","Mar", "Apr","May","Jun",
"Jul","Aug","Sep", "Oct","Nov","Dec"))
annual.ridership.ts <- aggregate(ridership.ts, FUN = mean)
plot(annual.ridership.ts, xlab = "Year", ylab = "Average Ridership",
ylim = c(1300, 2300))
Scatter Plot with Labels (Utilities)

plot(utilities.df$Fuel_Cost ~ utilities.df$Sales,
xlab = "Sales", ylab = "Fuel Cost", xlim = c(2000, 20000))
text(x = utilities.df$Sales, y = utilities.df$Fuel_Cost,
labels = utilities.df$Company, pos = 4, cex = 0.8, srt = 20, offset = 0.2)
Scaling: Smaller markers, jittering, color contrast
(Universal Bank; red = accept loan)

Jitter = add noise to “unstack” markers that hide markers underneath

# use function alpha() in library scales to add transparent colors


library(scales)
plot(jitter(universal.df$CCAvg, 1) ~ jitter(universal.df$Income, 1),
col = alpha(ifelse(universal.df$Securities.Account == 0, "gray", "red"), 0.4),
pch = 20, log = 'xy', ylim = c(0.1, 10),
xlab = "Income", ylab = "CCAvg")
Without jittering With jittering
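
A small sketch of what jitter() does to stacked values. The vector x is made up for illustration: 60 observations that take only three distinct values, so their markers would sit exactly on top of one another.

```r
# Hypothetical data: 60 observations but only 3 distinct values
set.seed(1)
x <- rep(c(1, 2, 3), each = 20)

# jitter() adds a small amount of uniform random noise to each value
x.jit <- jitter(x, factor = 1)

# before: 3 distinct plotting positions; after: (almost surely) one per point
print(length(unique(x)))
print(length(unique(x.jit)))

# the noise is small relative to the spacing between the original values
print(max(abs(x.jit - x)))
```

Because the noise is random, the jittered plot changes from run to run unless a seed is set.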
Parallel Coordinate Plot (Boston Housing)

All variables are rescaled to a 0-1 scale
Each line is a single record

library(MASS)
par(mfcol = c(2,1))
parcoord(housing.df[housing.df$CAT..MEDV == 0, -14], main = "CAT.MEDV = 0")
parcoord(housing.df[housing.df$CAT..MEDV == 1, -14], main = "CAT.MEDV = 1")
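
The 0-1 rescaling is presumably a min-max transformation applied to each variable separately. The sketch below shows that transformation on a small made-up data frame; rescale01 and demo.df are hypothetical names, and parcoord() does its own rescaling, so this step is shown only to illustrate what the axes mean.

```r
# min-max rescaling: the smallest value maps to 0, the largest to 1
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical data with variables on different scales
demo.df <- data.frame(income = c(20, 35, 50, 80), age = c(25, 40, 31, 58))

# apply the rescaling column by column
scaled.df <- as.data.frame(lapply(demo.df, rescale01))
print(scaled.df)
```

After rescaling, every variable spans the same vertical range, which is what allows the lines of a parallel coordinate plot to be compared across axes.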
Linked plots
(same record is highlighted in each plot)

Produced in Spotfire
Network Graph
eBay Auctions

library(igraph)
ebay.df <- read.csv("eBayNetwork.csv")
# transform node ids to factors
ebay.df[,1] <- as.factor(ebay.df[,1])
ebay.df[,2] <- as.factor(ebay.df[,2])
graph.edges <- as.matrix(ebay.df[,1:2])
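# graph_from_edgelist() is the current name for graph.edgelist(), which newer igraph versions deprecate
g <- graph_from_edgelist(graph.edges, directed = FALSE)
isBuyer <- V(g)$name %in% graph.edges[,2]
plot(g, vertex.label = NA, vertex.color = ifelse(isBuyer, "gray", "black"),
    vertex.size = ifelse(isBuyer, 7, 10))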
Treemap – eBay Auctions
(Hierarchical eBay data: Category > sub-category > Brand)

[Figure callout: detail on this corner is shown on the next slide]

library(treemap)
tree.df <- read.csv("EbayTreemap.csv")
# add column for negative feedback
tree.df$negative.feedback <- 1* (tree.df$Seller.Feedback < 0)
# draw treemap
treemap(tree.df, index = c("Category","Sub.Category", "Brand"),
vSize = "High.Bid", vColor = "negative.feedback", fun.aggregate = "mean",
align.labels = list(c("left", "top"), c("right", "bottom"), c("center", "center")),
palette = rev(gray.colors(3)), type = "manual", title = "")
[Treemap detail; labeled regions: Category, Subcategory, Brand]

Rectangle size = avg. value of item
Darker = more negative feedback


Using Google Maps
Location of Statistics.com students and instructors

library(ggmap)
SCstudents <- read.csv("SC-US-students-GPS-data-2016.csv")
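# note: recent versions of ggmap need a Google Maps API key, registered with register_google(), before get_map() can query Google
Map <- get_map("Denver, CO", zoom = 3)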
ggmap(Map) + geom_point(aes(x = longitude, y = latitude), data = SCstudents,
alpha = 0.4, colour = "red", size = 0.5)
Map Chart
(Comparing countries’ well-being with GDP)

Darker = higher value

library(mosaic)
gdp.df <- read.csv("gdp.csv", skip = 4, stringsAsFactors = FALSE)
names(gdp.df)[5] <- "GDP2015"
happiness.df <- read.csv("Veerhoven.csv")
# gdp map
mWorldMap(gdp.df, key = "Country.Name", fill = "GDP2015") + coord_map()
# well-being map
mWorldMap(happiness.df, key = "Nation", fill = "Score") + coord_map() +
scale_fill_continuous(name = "Happiness")
