Visualization Techniques

The document outlines various visualization techniques and tools used for data analysis, including histograms, density plots, box plots, bar graphs, pie charts, line charts, and scatterplots. It emphasizes the importance of these techniques in transforming complex data into intuitive graphical representations for better interpretation and decision-making. Additionally, it provides examples using the Titanic dataset and R programming for implementing these visualization methods.

Uploaded by

hoeofjimin
Visualization Techniques

Visualization Tools

1. HISTOGRAM
2. DENSITY PLOT
3. BOX PLOT
4. BAR GRAPH
5. PIE CHART
6. LINE CHART
7. SCATTERPLOT
8. JOINT BAR GRAPH
Visualization Techniques

• We will utilize various datasets to explore and understand different visualization techniques.
• Visualization techniques play a crucial role in transforming complex data into intuitive graphical representations, allowing for better interpretation and decision-making.
• By utilizing diverse visualization methods, we can enhance data analysis, uncover patterns, and communicate findings more effectively.
Datasets
Here, we have used a combination of in-built datasets in R and external datasets.

• titanic: This dataset has to be imported, either with R code or directly from the ‘Import Dataset’ tab in RStudio.
• titanic <- read.csv("path/titanic.csv", header = TRUE, sep = ",")
• header = TRUE tells R that the first row of the CSV file contains column names (headers) instead of data.
• In-built datasets used: mtcars, airmiles
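To see what header = TRUE changes without needing the actual file, here is a small sketch that reads a few rows from an inline string. The rows and the name toy_csv are made up for illustration and are not taken from the real Titanic file.

```r
# Toy CSV text standing in for titanic.csv (rows are made up for illustration)
toy_csv <- "PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,NA"

# header = TRUE: the first row supplies the column names
with_header <- read.csv(text = toy_csv, header = TRUE, sep = ",")

# header = FALSE: the first row is read as data and columns are named V1, V2, ...
no_header <- read.csv(text = toy_csv, header = FALSE, sep = ",")

names(with_header)   # column names taken from the first row
nrow(with_header)    # number of data rows
names(no_header)     # automatically generated names
nrow(no_header)      # one extra row, because the header was read as data
```

With header = FALSE the header line is counted as a data row, which is usually a sign that the argument was set incorrectly.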
Variables of ‘titanic’ dataset
Key variables in the Titanic dataset include:

PassengerId: Unique identifier for each passenger
Survived: 1 for survival and 0 for not surviving
Pclass: Passenger class (1st, 2nd, or 3rd)
Name: Passenger's name
Gender: Passenger's gender (male or female)
Age: Passenger's age
SibSp: Number of siblings/spouses on board
Parch: Number of parents/children on board
Ticket: Ticket number
Fare: Ticket fare
Cabin: Cabin number
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Histogram: using simple method
• A histogram is used to see what the distribution of a numerical variable looks like.
• From the visualization, it can be concluded that a significant portion of passengers were aged between 20 and 30.

hist(titanic$Age, main = "Histogram of Age", xlab = "Age")

Explanation
• hist: function to draw a histogram
• titanic$Age: extracts the values of the Age column/variable from the titanic dataset
• main: title of the histogram
ggplot2 Terminology

Some of the terminology used in ggplot2:

• data: what we want to visualize; it consists of variables
• geoms: the type of plot, i.e. the geometric objects drawn to represent the data, such as bars, lines, and points (geom_*())
• aesthetics: define how variables in the data are mapped to visual properties such as x and y position, line color, point shapes, etc. These are mappings from data values to aesthetics.
• scales: adjust axes, colors, and sizes in the plot
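As a rough illustration of how these four terms fit together, the sketch below labels each part of one plot. It assumes the ggplot2 package is installed and uses the in-built mtcars dataset; the color values and the legend name are arbitrary choices made for this example.

```r
library(ggplot2)

p <- ggplot(mtcars,                               # data: the data frame to visualize
            aes(x = wt, y = mpg,                  # aesthetics: wt -> x, mpg -> y,
                color = factor(cyl))) +           #   cylinder count -> point color
  geom_point(size = 2) +                          # geom: draw the data as points
  scale_color_manual(                             # scale: control how cyl maps to colors
    values = c("4" = "steelblue", "6" = "orange", "8" = "darkgreen"),
    name = "Cylinders"
  ) +
  labs(title = "Weight vs MPG", x = "Weight", y = "Miles per Gallon")

print(p)
```

Swapping geom_point() for another geom_*() changes the type of plot while the data and aesthetic mappings stay the same.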
Histogram: using ggplot

ggplot(titanic, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Ages of Titanic Passengers", x = "Age", y = "Count") +
  theme_minimal()

Explanation
• aes(x = Age): defines the aesthetics of the plot, mapping Age to the x-axis
• geom_histogram() creates a histogram
• binwidth: each bar represents a range of values (here, 5 years)
• labs(title = ...): sets the title of the histogram
• y = "Count": labels the y-axis as Count
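To make the idea of bins concrete, here is a minimal base R sketch that counts values per 5-year bin without drawing anything. The ages in toy_age are made up for illustration and are not taken from the Titanic data.

```r
# Made-up ages standing in for titanic$Age (NA marks a missing age)
toy_age <- c(4, 19, 22, 22, 24, 26, 28, 29, 31, 35, 38, 45, 54, 62, NA)

# 5-year bins covering 0 to 65, matching a bin width of 5
breaks <- seq(0, 65, by = 5)

# plot = FALSE returns the bin counts instead of drawing the histogram
h <- hist(toy_age, breaks = breaks, plot = FALSE)

# One row per bin: the interval (lower, upper] and how many ages fall in it
bins <- data.frame(lower = head(h$breaks, -1),
                   upper = tail(h$breaks, -1),
                   count = h$counts)
bins

# Missing values are dropped, so the counts add up to the number of known ages
sum(h$counts) == sum(!is.na(toy_age))
```

ggplot2 may place its bin edges differently by default, so the bars of geom_histogram(binwidth = 5) need not match these counts exactly.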

Density Plot: using simple method
• A density plot shows the estimated probability density function of a continuous variable, providing insights into the data's distribution, such as its center, the presence of multiple peaks, and the spread.
• The plot is smoothed, which helps in visualizing patterns without the noise of individual data points.

plot(density(na.omit(titanic$Age)), main = "Density Plot of Age")

Explanation
1. density(na.omit(titanic$Age)): This calculates the kernel density estimate of the Age variable in the titanic dataset, after na.omit() removes missing ages.
2. Kernel density estimation is a non-parametric way to estimate the probability density function of a continuous random variable.
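The object returned by density() can also be inspected directly. The sketch below uses the in-built mtcars$mpg column (chosen only because it needs no import) to check two properties of a density estimate: the area under the curve is close to 1, and the highest point of the curve lies within the data range.

```r
x <- mtcars$mpg

# Kernel density estimate with the default Gaussian kernel and bandwidth
d <- density(x)

# Bandwidth chosen by default: larger values give a smoother curve
d$bw

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area

# Location of the highest peak of the estimated density
peak <- d$x[which.max(d$y)]
peak
```

Because the curve is a density, the y-axis shows density values rather than counts, which is why a density plot and a histogram of counts have different vertical scales.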
Density Plot: using ggplot

ggplot(titanic, aes(x = Age)) +
  geom_density(fill = "blue", alpha = 0.5, na.rm = TRUE) +
  labs(title = "Density Plot of Age", x = "Age", y = "Density") +
  theme_minimal()

Explanation
1. ggplot(titanic, aes(x = Age)): This initializes a ggplot object with the dataset titanic and specifies that the variable Age will be mapped to the x-axis.
2. geom_density(fill = "blue", alpha = 0.5): This adds a density plot to the graph. The fill = "blue" argument sets the color of the area under the density curve to blue, and alpha = 0.5 makes the blue color semi-transparent (the value of alpha ranges from 0 for fully transparent to 1 for fully opaque).
3. na.rm = TRUE: This drops missing ages without a warning. It belongs inside geom_density(), not inside aes().
Boxplot: using simple method
• Useful for visualizing the distribution, median, and possible outliers of a numerical variable.
• It can be seen that there are outliers/extreme values in the dataset. These extreme values can distort summary statistics and lead to biased results.

boxplot(titanic$Age, main = "Boxplot of Age", ylab = "Age", col = "lightblue", notch = TRUE)

Explanation
notch = TRUE: Adds notches to the boxplot. Notches help compare medians: if the notches of two boxplots do not overlap, this is strong evidence that their medians differ.
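The points a boxplot draws as outliers can be listed with boxplot.stats(), which applies the usual rule of flagging values more than 1.5 times the box length beyond the hinges. The sketch below uses made-up ages (toy_age) rather than the Titanic data.

```r
# Made-up ages for illustration; 80 is deliberately far from the rest
toy_age <- c(22, 25, 27, 29, 30, 31, 33, 35, 38, 80)

s <- boxplot.stats(toy_age)

# Five numbers behind the box: lower whisker, lower hinge, median,
# upper hinge, upper whisker
s$stats

# Values lying beyond the whiskers, drawn as individual points in the plot
s$out
```

Listing the outliers this way makes it easier to decide whether they are data-entry errors or genuine extreme values before removing anything.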
Boxplot: using ggplot

ggplot(titanic, aes(y = Age)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Boxplot of Age in Titanic Dataset", y = "Age") +
  theme_minimal()

The box plot can provide answers to the following questions:

• Is a factor significant?
• Does the location differ between subgroups?
• Does the variation differ between subgroups?
• Are there any outliers?
Boxplot: using ggplot
ggplot(titanic, aes(x =
factor(Survived), y = Age, fill =
factor(Survived))) +
geom_boxplot() +
labs(title = "Titanic Survivals by
Age",
x = "Survival (0 = No, 1 = Yes)",
y = "Age") +
scale_fill_manual(values =
c("red", "pink"), labels = c("Did
not Survive", "Survived")) +
theme_minimal()
Plotting Age against Survival (0 = No, 1 = Yes).
Bar graph: using ggplot
A bar graph showing the count of passengers by gender in the Titanic dataset, with "Gender" on the x-axis and "Count" on the y-axis.

ggplot(titanic, aes(x = Gender, fill = Gender)) +
  geom_bar() +
  labs(title = "Bar Graph of Gender", x = "Gender", y = "Count")

Explanation
fill = Gender: Bars are colored based on gender.
geom_bar() counts the number of occurrences of each gender and plots the frequency.

Do it Yourself
Q. Count the number of occurrences of each unique value in the Embarked column and create bars for each one.
Percentage Bar Graph: using ggplot
• A percentage bar graph visually represents proportions of different categories as parts of a whole.
• Each bar represents 100% and is divided into segments corresponding to different categories, showing their relative percentages.
• This is a stacked percentage bar chart showing the survival rate across different passenger classes.

Step 1
titanic$Survived <- factor(titanic$Survived, labels = c("No", "Yes"))
titanic$Pclass <- factor(titanic$Pclass)

Explanation
Converted 'Survived' and 'Pclass' to factors.

Step 2
ggplot(titanic, aes(x = Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +   # convert y-axis to percentage
  labs(title = "Survival Percentage by Passenger Class",
       x = "Passenger Class", y = "Percentage", fill = "Survived") +
  theme_minimal()

Explanation
geom_bar(position = "fill") → Stacks the bars and scales each one to 100%, converting counts to proportions.
scale_y_continuous(labels = scales::percent_format()) → Formats the y-axis labels as percentages.
fill = Survived → Colors bars based on survival.

Joint Bar Graph: using ggplot
This allows you to easily compare how the distribution of gears varies across different cylinder categories.
Dataset: mtcars

ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "dodge") +
  labs(title = "Joint Bar Graph: Number of Cylinders vs Number of Gears",
       x = "Number of Cylinders", y = "Count of Cars",
       fill = "Number of Gears") +
  theme_minimal()

Explanation
geom_bar(position = "dodge") displays the bars side by side. Dodge means the bars for different groups are placed next to each other instead of being stacked.
cyl and gear are stored as numbers in mtcars, so they are wrapped in factor() to be treated as categories.
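The numbers behind grouped and percentage bar graphs can be checked with table() and prop.table(). As a sketch using the in-built mtcars dataset, the counts below are what side-by-side (dodge) bars would show, and the row shares are what a position = "fill" chart would show.

```r
# Cross-tabulate cylinders (rows) against gears (columns)
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)
counts

# Share of each gear count within each cylinder group (each row sums to 1)
row_share <- prop.table(counts, margin = 1)

# Same shares expressed as percentages, rounded to one decimal place
round(100 * row_share, 1)
```

Comparing the two tables shows why both views are useful: the counts reveal how many cars are in each group, while the shares make groups of different sizes comparable.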
Stacked bar chart: showing exact count

ggplot(titanic, aes(x = as.factor(Pclass), fill = as.factor(Survived))) +
  geom_bar(position = "stack") +
  labs(x = "Passenger Class", y = "Count", fill = "Survived") +
  theme_minimal() +
  scale_fill_manual(values = c("red", "green"))

Explanation
Pclass is on the x-axis, and the bars are filled based on Survived.
geom_bar(position = "stack"): This tells ggplot2 to stack the bars.
Pie Chart: using simple method
• A pie chart is used to see the proportion of each category of a particular categorical variable.
• It produces a pie chart showing the distribution of the different classes of passengers (e.g., 1st class, 2nd class, 3rd class) in the dataset, with each group represented in a different color.

type_counts <- table(titanic$Pclass)
colors <- c("skyblue", "orange", "green")
pie(type_counts, col = colors)

Explanation
1. type_counts <- table(titanic$Pclass): This line creates a frequency table of the Pclass column in the titanic dataset. The table() function counts how many times each unique class appears. This categorizes the passengers into classes (1st class, 2nd class, 3rd class).
Pie Chart which shows percentage

type_counts <- table(titanic$Pclass)
colors <- c("skyblue", "orange", "green")
percent_labels <- round(type_counts / sum(type_counts) * 100, 1)
labels <- paste(names(type_counts), "\n", percent_labels, "%")
pie(type_counts, col = colors, main = "Pie Chart of Pclass", labels = labels)

Explanation
type_counts <- table(titanic$Pclass): counts occurrences of each Pclass.
percent_labels <- round(type_counts / sum(type_counts) * 100, 1): converts counts to percentages.
labels <- paste(names(type_counts), "\n", percent_labels, "%"): adds percentages to class labels.
Pie Chart: using ggplot
• 1st step,
type_counts <- as.data.frame(table(titanic$Pclass))
colnames(type_counts) <- c("Pclass", "Count")
• 2nd step,
colors <- c("skyblue", "orange", "green")
• 3rd step,
ggplot(type_counts, aes(x = "", y = Count, fill = Pclass)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  theme_void() +
  labs(title = "Titanic Passenger Class") +
  scale_fill_manual(values = colors)

Explanation
1. x = "": This sets the x-axis to a constant value (empty string), which is necessary for creating a pie chart.
2. y = Count: This maps the Count variable from the data to the size of the slices in the pie chart.
3. geom_bar(stat = "identity"): This tells ggplot to create a bar chart where the heights (or values) of the bars are directly taken from the data (Count), rather than being counted.
4. width = 1: This sets the width of the bars to 1, meaning no space between the slices of the pie chart.
5. coord_polar(theta = "y"): This transforms the bar chart into a pie chart by applying a polar coordinate system. theta = "y" ensures that the y-axis values are used to create the angles of the pie slices.
6. scale_fill_manual(values = colors): This specifies the colors to be used for each group. colors is the vector defined in the 2nd step, with one color for each class in the Pclass column.
Pie Chart: using ggplot (with percentage labels)
1. type_counts <- as.data.frame(table(titanic$Pclass))
2. colnames(type_counts) <- c("Pclass", "Count")
   type_counts$Percentage <- round(type_counts$Count / sum(type_counts$Count) * 100, 1)
3. type_counts$Label <- paste0(type_counts$Pclass, " (", type_counts$Percentage, "%)")
4. colors <- c("skyblue", "orange", "green")
5. ggplot(type_counts, aes(x = "", y = Count, fill = as.factor(Pclass))) +
     geom_bar(stat = "identity", width = 1, color = "black") +
     coord_polar("y") + theme_void() + labs(title =

Explanation
1. Counts occurrences of each class.
2. Converts proportions into percentages, rounded to 1 decimal place.
3. paste0() joins text together without spaces.
4. Fills slices with different colors based on Pclass.
5. stat = "identity" uses the actual values instead of counting occurrences automatically.
Line chart: using simple method
It is useful for showing how a variable changes over time and for comparing categories across the same variable.
From visualizing Passenger Miles against Year, it can be concluded that the former increases with time.
Dataset used: airmiles (inbuilt in R)

Step 1:
plot(airmiles, type = "o", col = "blue", lwd = 2,
     xlab = "Year", ylab = "Passenger Miles (millions)",
     main = "Airline Passenger Miles (1937-1960)")

Explanation
1. type = "o": Plots both points and lines, with the points overplotted on the lines.
Line Chart: using ggplot
Step 1:
airmiles1 <- data.frame(Year = 1937:1960, Miles = as.numeric(airmiles))
Step 2:
ggplot(airmiles1, aes(x = Year, y = Miles)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(color = "blue", size = 2) +
  labs(x = "Year", y = "Passenger Miles (millions)",
       title = "Airline Passenger Miles (1937-1960)") +
  theme_minimal()

Explanation
Step 1 converts 'airmiles' to a dataframe.
geom_line() adds a line connecting the data points.
geom_point() adds a point for each year.
size = 2 increases the size of the points.
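Since airmiles is a time series object, the years can also be read from the series itself with time() rather than typed in by hand. This is a small sketch of that alternative; the name airmiles_df is chosen here for illustration.

```r
# time() returns the time index stored in the series
years <- as.numeric(time(airmiles))

# Build the data frame from the series' own years and values
airmiles_df <- data.frame(Year = years, Miles = as.numeric(airmiles))

range(airmiles_df$Year)   # first and last year covered by the series
nrow(airmiles_df)         # one row per year
head(airmiles_df)
```

Reading the years from the series avoids a mismatch if the hard-coded range and the length of the data ever disagree.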
Scatterplot: using simple method
A scatterplot is used to show the relationship between two variables.
You can easily see that there is a negative correlation between MPG and weight.
Dataset used: mtcars

plot(mtcars$mpg, mtcars$wt, main = "Scatter plot of MPG vs Weight",
     xlab = "Miles per Gallon (MPG)",
     ylab = "Weight (wt)",
     pch = 19, col = "blue")

Explanation
The plot function automatically creates a scatter plot by pairing each value of mpg with its corresponding value of wt.
pch stands for "plot character" and the number 19 specifies that the points should be filled circles.
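The negative relationship seen in the scatterplot can be backed up with a number. As a brief sketch, cor() computes the Pearson correlation between the same two mtcars columns.

```r
# Pearson correlation between miles per gallon and weight
r <- cor(mtcars$mpg, mtcars$wt)
r

# A negative value means heavier cars tend to have lower mpg
r < 0
```

A value close to -1 indicates a strong negative linear relationship, while a value near 0 would mean the points show no clear linear trend.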
Scatter plot: using ggplot

ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(title = "Scatterplot of mpg vs weight", x = "mpg", y = "weight") +
  theme_minimal()

Explanation
aes(x = mpg, y = wt) maps mpg to the x-axis and wt to the y-axis.
geom_point() draws one point for each car in mtcars.
labs() sets the title and the axis labels.
