5-RVisualizingData

Uploaded by

sharkyftw08
R Programming for Data

Science and Data Analysis


Samatrix Consulting Pvt Ltd
R - Visualizing Data
Visualizing Data
• In the previous chapter, we discussed a number of methods to import data,
which is the first step in most data analysis.
• Before we load the data to any model, we need to view the data. Each
machine learning model has its own strengths.
• There is no universally accepted machine learning model.
• Hence before fitting any data to a model, we need to visualize the data to
analyse the patterns.
• For this chapter, we will use the nycflights13 package. We can install the
package using the following command.

install.packages("nycflights13")
Scatter Plot
Creating Scatter Plot
The function plot() is the basic R function for visualizing data.
By providing a numeric or integer vector to plot(), we can produce a
scatter plot of value by index.
We can draw a scatter plot of 10 points in increasing order as
follows:

plot(1:10)
Scatter Plot – 2 Vectors
We can generate two linearly correlated random numeric vectors to
create a more realistic scatter plot.

x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y)

As a result, we get the following plot.
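Because rnorm() draws new values on every call, the plot looks slightly different on each run. The following is a minimal sketch, assuming we want a reproducible version of the same example: it fixes the seed with set.seed() (the value 123 is arbitrary) and checks the strength of the linear relationship with cor(). Neither call appears in the original example.

```r
# Fix the random seed so the same vectors are generated on every run.
set.seed(123)  # arbitrary seed, chosen only for illustration

x <- rnorm(200)
y <- 2 * x + rnorm(200)

# Sample correlation; it should be strongly positive because y depends on x.
xy_cor <- cor(x, y)
print(xy_cor)

plot(x, y)
```

A correlation close to 1 confirms what the scatter plot shows visually: the points cluster around a rising straight line.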


Customize Chart Elements
We can customize several chart elements such as title (main or title()),
the label of the x axis (xlab), the label of y axis (ylab), the range of the x
axis (xlim), and the range of the y axis (ylim).

plot(x, y,
main = "Correlated Random numbers",
xlab = "x", ylab = "2x + noise",
xlim = c(-3, 3), ylim = c(-6, 6))
Customize Chart Elements
We can specify the chart title by either the main argument or a separate
title() function call. The following code draws the same chart as the
previous example.

plot(x, y,
xlab = "x", ylab = "2x + noise",
xlim = c(-3, 3), ylim = c(-6, 6))
title("Correlated Random numbers")
Custom Point Style
For a scatter plot, the default point style is a circle. We can specify the
pch argument (plotting character) to change the point style. R provides 26
point styles, numbered 0 to 25.

plot(0:25, 0:25, pch = 0:25,
     xlim = c(-1, 26), ylim = c(-1, 26),
     main = "Point styles (pch)")
text(0:25 + 1, 0:25 - 1, 0:25)
Custom Point Style
In the preceding code, we have created a scatter plot that shows all the point
styles and prints the corresponding pch number beside each one.
First, we created a simple scatter plot using plot(), then printed the pch numbers using
text().
We can draw a scatter plot with a non-default point style by setting pch = 17.

x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y, pch = 17,
main = "Scatter plot with pch = 17")
Scatter Plot – Logical Condition
We can also distinguish two groups of points by a logical condition. The pch
argument is vectorized, so it accepts one value per point.
So, we can use ifelse() to choose the point style of each observation based on a
condition.
The following example applies pch = 17 to the points satisfying x * y > 1 and pch = 1
to all other points:

x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y,
pch = ifelse(x * y > 1, 17, 1),
main = "Scatter plot with conditional pch")
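To see what ifelse() hands to pch, it helps to evaluate it on a few hand-picked values. This is a small sketch with made-up vectors (not the random data used above), so the result can be checked by eye.

```r
# Small hand-made vectors so the result is easy to verify.
x <- c(-2, 0.5, 1, 3)
y <- c(-1, 0.5, 2, 0.2)

# x * y is 2, 0.25, 2, 0.6, so only the first and third points exceed 1.
pch_values <- ifelse(x * y > 1, 17, 1)
print(pch_values)  # 17 1 17 1

# plot() uses one pch value per point.
plot(x, y, pch = pch_values)
```

The result has the same length as x and y, which is why it can be passed straight to pch.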
Scatter Plot – 2 Data Sets
A plot containing two separate datasets sharing the same x-axis can be drawn using plot() and points().
In the previous example, a normally distributed vector x and a linearly correlated random vector y were
generated.
For this example, we will generate another random vector, z, that has a non-linear relationship with x. We
then plot both y and z against x, giving each series its own point style:

x <- rnorm(75)
y <- 1.5*x + rnorm(75)
z <- sqrt(1 + x ^ 2) + rnorm(75)
plot(x, y, pch = 1,
xlim = range(x), ylim = range(y, z),
xlab = "x", ylab = "value")
points(x, z, pch = 17)
title("Scatter plot with two datasets")
Scatter Plot – 2 Data Sets
• In the preceding example, first, we created datasets x, y, and z.
• Then we created a plot of x and y. Then we added another group of points z with
a different pch.
• We have specified ylim = range(y, z).
• This ensures that the axis limits cover the range of both y and z.
• The points() function does not extend the axes created by plot().
• As a result, any point beyond the axis range would not be drawn.
• By specifying ylim = range(y, z), we have ensured that all the points in y and z are
shown in the plot area.
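The effect of range(y, z) can be checked without drawing anything. The sketch below uses small made-up vectors (not the random data from the example) to compare the range of y alone with the combined range.

```r
# Made-up values: z extends well beyond the largest value of y.
y <- c(-2, 0, 3)
z <- c(1, 5, 9)

range(y)      # -2  3: the data range plot() would see from y alone
range(y, z)   # -2  9: wide enough for both series

y_limits <- range(y, z)

# TRUE: every value of z falls inside the combined limits.
all(z >= y_limits[1] & z <= y_limits[2])
```

With ylim based on y alone, the points 5 and 9 of z would fall outside the plotting region.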
Customizing Point Colors
We can specify different point colors by setting the col argument of plot():

x <- rnorm(75)
y <- 1.5*x + rnorm(75)
plot(x, y, pch = 15, col = "blue", main = "Blue Color Scatter Plot")
Customizing Point Colors
Different colors can be applied to separate points that belong to different
categories if they satisfy certain conditions.

plot(x, y, pch = 16,
     col = ifelse(y >= mean(y), "red", "green"),
     main = "Scatter plot - conditional colors")
Customizing Point Colors
We can use col to distinguish different groups of points while plotting two different datasets using
plot() and points().

# Regenerate z so that it is based on the current x
z <- sqrt(1 + x ^ 2) + rnorm(75)
plot(x, y, col = "blue", pch = 0,
     xlim = range(x), ylim = range(y, z),
     xlab = "x", ylab = "value")
points(x, z, col = "red", pch = 1)
title("Scatter plot with two datasets in different color")

• R provides 657 built-in color names.
• You can call the function colors() to get the list of all the
color names supported by R.
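As a quick illustration, the following sketch lists the built-in color names and filters them with grep(). The search term "blue" is only an example; any part of a color name works.

```r
# colors() returns a character vector of built-in color names.
all_colors <- colors()
length(all_colors)   # number of built-in color names
head(all_colors)     # first few names

# Find every color name that contains "blue".
blue_shades <- grep("blue", all_colors, value = TRUE)
head(blue_shades)
```

Any of these names can be passed to the col argument, for example col = "steelblue".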
Line Plot
Create Line Plot
In several data analysis problems, such as time series analysis, we use line plots to
show the trend and variation over time. To draw a line plot, we set type = "l" when
calling plot().

t <- 1:50
y <- 2.5 * sin(t * pi / 60) + rnorm(t)
plot(t, y, type = "l", main = "Line plot")
Line Type and Width
For a line plot, we can use lty to specify the line type. It is similar to
pch for a scatter plot. The six line types that R supports are previewed
below.

lty_val <- 1:6
plot(lty_val, axes = FALSE, ann = FALSE, type = "n")
abline(h = lty_val, lwd = 2, lty = lty_val)
mtext(lty_val, at = lty_val, side = 2)
title("Line types (lty)")
Line Type and Width
• In the preceding code, we have used the parameter type = "n" to create an empty
canvas. The value "n" signifies no plotting.
• The parameters axes = FALSE, ann = FALSE are used to turn off axes and annotation.
• We used the abline() function to add straight lines to the current plot. The
parameter h = lty_val draws six horizontal lines, one for each value of lty_val.
• The line width has been set by lwd = 2. The different line types are specified by lty =
lty_val.
• We have used the function mtext() to draw text in the margin. Please note that
abline() and mtext() are vectorized with respect to their arguments.
Line Type and Width
In the following example, we have drawn auxiliary lines
in a plot using the function abline().
First, we created a line plot of y against time, t.
Then we added horizontal lines for the mean value and the range (minimum and maximum values) of y,
and vertical lines at the times when the minimum and maximum occur.
Using different line types and colors keeps these auxiliary lines easy to tell apart.

plot(t, y, lwd = 2, type = "l")
abline(h = mean(y), col = "red", lty = 2)
abline(h = range(y), col = "blue", lty = 3)
abline(v = t[c(which.min(y), which.max(y))], col = "brown", lty = 3)
title("Line plot with auxiliary lines")
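The positions of the auxiliary lines are ordinary numbers that can be inspected before drawing. A minimal sketch, assuming the same kind of series as above; the seed and the variable names y_mean, y_range and t_extremes are introduced here only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
t <- 1:50
y <- 2.5 * sin(t * pi / 60) + rnorm(length(t))

y_mean     <- mean(y)     # height of the red dashed line
y_range    <- range(y)    # heights of the two blue dotted lines (min, max)
# which.min() and which.max() return the index of the smallest and largest
# value; indexing t with them gives the positions of the brown vertical lines.
t_extremes <- t[c(which.min(y), which.max(y))]

print(y_mean)
print(y_range)
print(t_extremes)
```

Passing a vector of length two to h or v is what makes a single abline() call draw two lines.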
Multi-period line plot
In a multi-period line plot, we mix different line types.
A typical case is a time series in which the first period is historic data and the second period is predictions.
In the example below, the first 40 observations of y represent the historic data and the remaining points represent the
predictions based on the historic data.
We have used a solid line to plot the historic data and a dashed line to plot the predictions.
To do this, we first plot the data in the first period and then add a dashed line for the data in the second period
with lines().
Just as we used points() to add to a scatter plot,
we can use lines() to add to a line plot.

p <- 40
plot(t[t <= p], y[t <= p], type = "l",
xlim = range(t), xlab = "t", ylab = "y")
lines(t[t >= p], y[t >= p], lty = 2)
title("Two period Line Plot")
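The two subsets use t <= p and t >= p, so the observation at t = p belongs to both periods. The short sketch below checks this on the index vector alone; the names hist_t and pred_t are introduced here only for illustration.

```r
t <- 1:50
p <- 40

hist_t <- t[t <= p]   # 1, 2, ..., 40
pred_t <- t[t >= p]   # 40, 41, ..., 50

# The two periods share exactly one point, t = 40,
# so the solid and dashed segments meet there without a gap.
intersect(hist_t, pred_t)   # 40
```

Using t > p for the second period instead would leave a visible gap between t = 40 and t = 41.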
Line Plot with Points
We can plot both the lines and points in the same chart. This can be done easily by first plotting a
line chart and then adding points() of the same data to the plot again.

plot(y, type = "l")
points(y, pch = 16)
title("Line plot with points")

Alternatively, we can first draw a scatter plot using the plot() function and then add lines using the lines() function.

plot(y, pch = 16)
lines(y)
title("Line plot with points")
Multi Series Chart with Legend
In the following code, we have generated two series, y1 and y2, with time t and created a chart with the two series with respect to
time t.

t <- 1:30
y1 <- 1.5 * t + 6 * rnorm(30)
y2 <- 2.5 * sqrt(t) + 8 * rnorm(30)
plot(t, y1, type = "l", col = "black",
ylim = range(y1, y2), ylab ="y1, y2")
points(y1, pch = 15)
lines(y2, col = "blue", lty = 2)
points(y2, col = "blue", pch = 16)
title ("Plot of two series")
legend("topleft",
legend = c("y1", "y2"),
col = c("black", "blue"),
lty = c(1, 2), pch = c(15, 16),
cex = 0.8, x.intersp = 0.5, y.intersp = 0.8)
Multi Series Chart with Legend
• In the above example, we have added a legend() on the top left.
• It shows the line and point styles of y1 and y2 respectively.
• We have also used cex to scale the font sizes of the legend and x.intersp and y.intersp
to make some minor adjustments to the legend.
Bar Chart
Bar Chart
Bar charts are among the most commonly used charts. We use bar charts to
visualize qualitative data by category frequency.
To plot a bar chart, we use the barplot() function instead of the plot() function.
The function draws either vertical or horizontal bars that are separated by white
space.
Bar charts usually display raw frequencies, but we can also use barplot() to visualize
other quantities, such as means or proportions, that are derived from these
frequencies.
Bar Chart
The basic syntax to create a barplot in R is:

barplot(H, xlab, ylab, main, names.arg, col)

H: is a vector or matrix containing numeric values


xlab: label for x-axis
ylab: label for y-axis
main: title of the bar chart
names.arg: vector of names appearing under each bar
col: color for the bars in the graph

Let’s plot a simple bar chart

barplot(1:10, names.arg = LETTERS[1:10])


Bar Chart
If the numeric vector is a named vector, the names will automatically be the names
on the x-axis.
Hence, we get the same results from the following code, as we received from the
previous code.

ints <- 1:10
names(ints) <- LETTERS[1:10]
barplot(ints)
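As an optional extension of the named-vector example, barplot() also returns the x coordinates of the bar midpoints, which can be used to annotate the bars. The following sketch is not part of the original example; the variable name mids and the widened ylim are assumptions made for illustration.

```r
ints <- 1:10
names(ints) <- LETTERS[1:10]

# barplot() returns the x coordinates of the bar midpoints.
# ylim is widened a little so the label above the tallest bar stays visible.
mids <- barplot(ints, ylim = c(0, 12))

# Write each value just above its bar (pos = 3 places text above the point).
text(mids, ints, labels = ints, pos = 3)
```

Note that the midpoints are not 1, 2, 3 and so on, because barplot() leaves space between the bars; this is why the returned values are needed.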
Project NYCflights – Part 1
• Now we will draw a bar plot using the flights dataset in nycflights13.
• This package contains information about the 336,776 flights that departed from NYC
in 2013.
• The data table flights holds one record for each of these flights.
• In this example, we will create a bar plot of the top eight carriers with the most
flights in the record.
• Before we can start using the dataset, we will use the command
install.packages("nycflights13") to install the package.
Project NYCflights – Part 1
data("flights", package = "nycflights13")
carriers <- table(flights$carrier)

carriers

   9E    AA    AS    B6    DL    EV    F9    FL    HA    MQ    OO    UA    US    VX    WN    YV
18460 32729   714 54635 48110 54173   685  3260   342 26397    32 58665 20536  5162 12275   601

In the previous code, we have used table() to count the number of flights in the record for each carrier. Now sort the carriers in decreasing order.

carriers_sort <- sort(carriers, decreasing = TRUE)
carriers_sort

   UA    B6    EV    DL    AA    MQ    US    9E    WN    VX    FL    AS    F9    YV    HA    OO
58665 54635 54173 48110 32729 26397 20536 18460 12275  5162  3260   714   685   601   342    32
Project NYCflights – Part 1
Now we can take the first 8 elements from the table and draw a bar plot:

barplot(head(carriers_sort, 8),
ylim = c(0, max(carriers_sort) * 1.1),
xlab = "Carrier", ylab = "Flights",
main ="Top 8 carriers ordered by number of flights")
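The table(), sort() and head() pipeline can be tried on a tiny vector without installing nycflights13. The carrier codes below are made-up sample data, not values taken from the flights table.

```r
# Made-up sample: six flights operated by three carriers.
carrier <- c("UA", "B6", "UA", "DL", "UA", "B6")

counts <- table(carrier)                  # B6: 2, DL: 1, UA: 3
sorted <- sort(counts, decreasing = TRUE) # UA: 3, B6: 2, DL: 1
top2   <- head(sorted, 2)                 # keep the two largest counts
print(top2)

barplot(top2, xlab = "Carrier", ylab = "Flights")
```

The same three steps scale to the full dataset; only the input vector and the number passed to head() change.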
Pie Chart
Pie Chart
Pie charts are also useful charts for data analysis. We can use the pie() function to create a pie
chart. The pie-chart is a representation of values as slices of a circle with different colors.

The basic syntax of plotting a pie-chart by using R programming is as follows.

pie(x, labels, radius, main, col, clockwise)

x: vector that contains the numeric values that are used in the pie chart
labels: to provide the description of the slices
radius: to provide the radius of the pie; the pie is drawn centered in a square box whose sides range from -1 to +1
main: to provide the title of the chart
col: indicates the color palette
clockwise: indicates whether the slices are drawn clockwise or anti-clockwise
Pie Chart
The following code is an example of the implementation of pie() function.

grades <- c(A = 2, B = 10, C = 12, D = 8)
pie(grades, main = "Grades", radius = 1)
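The labels argument described above can carry more than the category names. A minimal sketch, assuming we want to show each slice's share as a percentage; the variable names pct and slice_labels are introduced here for illustration.

```r
grades <- c(A = 2, B = 10, C = 12, D = 8)

# Share of each grade as a whole-number percentage of the total (32).
pct <- round(100 * grades / sum(grades))

# Combine the name and the percentage, for example "D (25%)".
slice_labels <- paste0(names(grades), " (", pct, "%)")
print(slice_labels)

pie(grades, labels = slice_labels, main = "Grades")
```

Because the percentages are rounded independently, they may not add up to exactly 100.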
Histogram
Histogram
• We use the histogram to represent the frequencies of values of a variable
bucketed into ranges.
• Histogram groups the values into continuous ranges.
• Each bar in the histogram shows the number of observations that are present in
that range.
• We can create the histogram using hist() function.
• The function accepts a vector as an input along with some more parameters to
plot histograms.
Histogram
The basic syntax for creating a histogram is as follows:

hist(v, main, xlab, xlim, ylim, breaks, col, border)

The parameters are described below:

v: a vector containing the numeric values that are used in the histogram
main: title of the chart
xlab: description of the x-axis
xlim: range of values on the x-axis
ylim: range of values on the y-axis
breaks: the break points between the bars, or a single number suggesting how many bars to use
col: color of the bars
border: border-color of each bar
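The breaks argument is easiest to understand by inspecting the object that hist() returns. The sketch below uses plot = FALSE so that nothing is drawn; the seed and the break positions are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
v <- rnorm(1000)

# Explicit break points every 0.5 units; they must span all values of v.
h <- hist(v, breaks = seq(-6, 6, by = 0.5), plot = FALSE)
length(h$breaks)  # 25 break points, hence 24 bars
sum(h$counts)     # 1000: every observation falls into exactly one bar

# A single number is treated as a suggestion for the number of bars;
# hist() may choose a nearby number that gives round break points.
h2 <- hist(v, breaks = 20, plot = FALSE)
length(h2$counts)
```

When exact bar edges matter, passing a vector of break points is the more predictable option.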
Histogram
In the following example, we have demonstrated how we can use hist() to plot a
histogram using a normally distributed random numeric vector and the density
function of the normal distribution.

random_norm <- rnorm(10000)


hist(random_norm)
Histogram
We can overlay the probability density function of the standard normal
distribution by using the dnorm() function.
For the two to be comparable, the y-axis of the histogram must show probability
density rather than counts, which we request with probability = TRUE. We can then
add the curve to the histogram:

hist(random_norm, probability = TRUE, col = "lightgray",
     main = "Histogram - Normally Distributed Data")
curve(dnorm, add = TRUE, lwd = 2, col = "blue")

In this case, we have used the curve() function with the parameter add = TRUE to add the curve to the existing plot.
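On the density scale, the total area of the bars equals 1, which is what makes the histogram comparable to dnorm(). A small sketch to verify this numerically; the seed is arbitrary, and plot = FALSE is used so that only the computed values are returned.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible
random_norm <- rnorm(10000)

# hist() always computes the density of each bar, even when nothing is drawn.
h <- hist(random_norm, plot = FALSE)

# Area of each bar = density * bar width; the areas add up to 1.
bar_area <- sum(h$density * diff(h$breaks))
print(bar_area)
```

With raw counts on the y-axis, the bars would be thousands of units tall and the density curve would be invisible next to them.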
Project NYCflights – Part 2

Now we can make a histogram of aircraft speed from the nycflights13
dataset.
We can calculate the speed of each flight by dividing the distance of the trip
(distance, in miles) by the air time (air_time, in minutes), which gives the speed in miles per minute.

ft_speed <- flights$distance / flights$air_time
hist(ft_speed, xlab = "Flight Speed (miles per minute)", main = "Flight speed - Histogram")
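Since distance is recorded in miles and air_time in minutes, the ratio is in miles per minute. The following sketch converts it to miles per hour using two made-up flights (not rows from the dataset), which may be easier to interpret.

```r
# Made-up sample values: distance in miles, air time in minutes.
distance <- c(600, 1200)
air_time <- c(90, 150)

speed_per_min <- distance / air_time   # miles per minute
speed_mph     <- speed_per_min * 60    # miles per hour

print(speed_per_min)
print(speed_mph)   # 400 480
```

The same multiplication by 60 can be applied to ft_speed if a miles-per-hour axis is preferred.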
Project NYCflights – Part 2

We observe that the distribution is different from a normal distribution. So, we can
use the density() function to estimate the empirical distribution of the speed and plot a
smooth probability density curve. We have also added a vertical line to indicate
the overall mean of all the observations.

plot(density(ft_speed, from = 2, to = 10, na.rm = TRUE),
     main = "Empirical distribution of flight speed", xlab = "Flight Speed")
abline(v = mean(ft_speed, na.rm = TRUE),
       col = "blue", lty = 2)
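The na.rm = TRUE arguments matter because some flights have no recorded air time, so their speed is NA. The sketch below shows the effect on a short made-up vector; the values are illustrative and not taken from the dataset.

```r
# Made-up speeds with one missing value.
speeds <- c(6.5, 7.2, NA, 8.1)

mean(speeds)                               # NA: a missing value propagates
speed_mean <- mean(speeds, na.rm = TRUE)   # mean of the three known values
print(speed_mean)

# density() stops with an error on missing values unless na.rm = TRUE is set.
d <- density(speeds, na.rm = TRUE)
plot(d, main = "Density of sample speeds")
abline(v = speed_mean, col = "blue", lty = 2)
```

Without na.rm = TRUE, the mean line would not be drawn at all, because abline() has no position to draw it at.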
Project NYCflights – Part 2
We can combine both plots to get a better understanding of the data.

hist(ft_speed,
probability = TRUE, ylim = c(0, 0.5),
main ="Histogram & distribution of flight speed",
xlab = "Flight Speed",
border ="gray", col = "lightgray")
lines(density(ft_speed, from = 2, na.rm = TRUE),
col ="darkgray", lwd = 2)
abline(v = mean(ft_speed, na.rm = TRUE),
col ="blue", lty =2)
Thanks
Samatrix Consulting Pvt Ltd
