CH 4 Data Visualization
Ms.Duha Qutishat
1st Semester 2022-2023 Introduction to Data Science Course
What is Data Visualization?
• Data (or information) visualization is used to interpret and gain insight
into large amounts of data. This is achieved through visual
representations, often interactive, of raw data.
• It is a methodology for discovering or confirming useful
information about the data by constructing and examining
graphical output.
What is Data Visualization?
• Data visualization is the graphical representation of information and
data using visual elements like charts, graphs, and maps.
• Data visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data. They also
give employees and business owners a way to
present data to non-technical audiences without confusion.
• In the world of Big Data, data visualization tools and technologies are
essential to analyze massive amounts of information and make data-
driven decisions.
Data Science Visualization
• Data science visualization is the graphical representation of data in
any format. It is an efficient way of communicating facts to
non-technical professionals and helps them draw inferences from the
data.
• Many companies today are data-driven. The data they acquire is
sitting in some Data Lake, usually in the cloud. The data collected is
pulled out of the Data Lakes, cleaned, and stored in a Data
warehouse.
• Data scientists work with this data to build and train machine
learning models, perform predictive analyses, and visualize the results.
Data visualization and Data Science
• There are many reasons for data visualization in data science.
Data Visualization Analysis
• Data visualization is one of the steps of the data science process: after data has been collected,
processed and modeled, it is visualized so that conclusions can be drawn.
• The main goal of data visualization is to make it easier to identify (explore) patterns, trends and outliers (anomalies) in
large data sets.
Visualization aims to ...
Data visualization can lead a user to:
• Detect patterns
• Detect trends
• Detect correlations in data
Data Visualization Example#1
Data Visualization Example#2
Advantages of Data Visualization
• The ability to absorb information quickly, improve insights and make
faster decisions
• An increased understanding of the next steps that must be taken.
• An improved ability to maintain the audience's interest with information
they can understand
• An easy distribution of information that increases the opportunity to
share insights with everyone involved.
• An increased ability to act on findings quickly and, therefore, achieve
success with greater speed and fewer mistakes
• Some visualizations allow the user to filter out undesirable properties in
the dataset.
Main goals of data visualization
What Makes Data Visualization Effective?
• To get the most out of data visualization, you should consider the following
things. These are the fundamentals of data visualization.
• Clarity: Data should be visualized in a way that everyone can understand.
• Problem domain: When presenting data, the visualizations should be related to
the business problem.
• Interactivity: Interactive plots are useful to compare and highlight certain things
within the plot.
• Comparability: Good plots make it easy to compare things.
• Aesthetics: Quality plots are visually aesthetic.
• Informative: A good plot summarizes all relevant information.
Importance of Data Visualization in Data Science
1. Data cleaning
Data visualization plays an important role in data cleaning. Good examples are
detecting outliers and multicollinearity. We can create scatterplots to
detect outliers and generate heatmaps to check multicollinearity.
2. Data exploration
Before building any model, we need to do some exploratory data analysis to
identify dataset characteristics. For example, we can create histograms for
continuous variables to check for normality in the data. We can create scatterplots
between two features to check whether they are correlated. Likewise, we can
create a bar chart for the label column with two or more classes to identify class
imbalance.
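The two exploration checks above (correlation between two features, and class imbalance in the label column) rest on simple numbers that the scatterplot and bar chart then display. The following is a minimal sketch of those numbers, using only the Python standard library. The feature names, values and labels are hypothetical and were invented for illustration.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical features and labels (assumptions, not course data).
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 73]
labels = ["pass", "pass", "fail", "pass", "pass", "pass"]

# A value close to +1 or -1 suggests a strong linear relationship,
# which a scatterplot of the two features would show as a tight line.
r = pearson(hours_studied, exam_score)

# These counts are the bar heights of a bar chart of the label column.
class_counts = Counter(labels)

print(f"correlation = {r:.3f}")
print(class_counts.most_common())
```

A large gap between the class counts is what the bar chart makes visible as class imbalance.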
Importance of Data Visualization in Data Science
3. Data distribution
Data visualization can be used to understand the distribution of the data, look for central tendencies
(mean, median, and mode), detect outliers using a boxplot, and check for skewness.
4. Evaluation of modeling outputs
We can create a confusion matrix and learning curve to measure the performance of a model during
training. Plots are also useful in validating model assumptions. For example, we can create a residuals
plot and histogram for the distribution of residuals to validate the assumptions of a linear regression
model.
5. Identifying trends
Time and seasonal plots are useful in time series analysis to identify certain trends over time.
6. Presenting results
As a data scientist, you need to present your findings to the company or other stakeholders who may
not have much knowledge of the subject domain. So, you need to explain everything in plain language.
You can use informative plots that summarize your findings.
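The central-tendency and skewness checks mentioned under data distribution can be sketched numerically before any plot is drawn. The example below uses Python's standard `statistics` module. The sample values are hypothetical, and comparing the mean with the median is only a rough heuristic for the direction of skew, not a formal test.

```python
import statistics

# Hypothetical sample with one large value in the right tail.
values = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Rough heuristic: a mean above the median hints at a right (positive) skew,
# a mean below the median hints at a left (negative) skew.
if mean > median:
    skew_hint = "right-skewed"
elif mean < median:
    skew_hint = "left-skewed"
else:
    skew_hint = "roughly symmetric"

print(mean, median, mode, skew_hint)
```

A histogram or boxplot of the same values would show the long right tail that pulls the mean above the median.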
Type of Datasets in Analytical Problems
Type of Datasets in Analytical Problems (cont.)
3. Spatial data:
Data which is naturally organized and understood in terms of its spatial location
or extent.
E.g. latitude and longitude of locations, geography information, suburbs,
streets, etc.
4. Textual data:
This kind of data set consists of sequences of words and punctuation.
E.g. a Twitter feed or customer complaints.
Data Visualization Types
There are many data visualization types. The following are the commonly used data visualization
charts.
1. Distribution plot
A distribution plot is used to visualize data distribution.
Example: Probability distribution plot or density curve.
Data Visualization Types(cont.)
2. Line plot
A line plot is created by connecting a series of data points
with straight lines. Time periods are usually placed on the x-axis.
Data Visualization Types(cont.)
3. Bar plot
A bar plot is used to plot the frequency or value of
categorical data. Each category is represented by a bar.
The bars can be drawn vertically or horizontally. Their
heights or lengths are proportional to the values they
represent.
Advantages: simplicity and clarity.
It can be used when you are comparing variables in the
same category or tracking the progression of one or two
variables over time.
For example, to compare the marks of a student in
multiple subjects, a bar plot is a good choice.
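The student-marks example can be sketched with Matplotlib, one of the Python libraries this chapter lists. This is an illustrative sketch only: it assumes Matplotlib is installed, and the subject names and marks are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt

# Hypothetical marks for one student (assumed values).
subjects = ["Math", "Physics", "English", "History"]
marks = [88, 76, 92, 69]

fig, ax = plt.subplots()
ax.bar(subjects, marks)  # one vertical bar per category
ax.set_title("Marks by subject")
ax.set_xlabel("Subject")
ax.set_ylabel("Mark")
ax.set_ylim(0, 100)  # fixed scale so bar heights are comparable
```

Calling `fig.savefig("marks.png")` writes the chart to a file (the file name is an arbitrary choice). Replacing `ax.bar` with `ax.barh` draws the same bars horizontally.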
Data Visualization Types(cont.)
4. Scatter plot
A scatter plot uses dots to illustrate the
values of numerical variables. It is
used to:
• analyze individual points
• observe and visualize relationships
between variables
• get a general overview of variables
Data Visualization Types(cont.)
5. Histogram
• A histogram graphically represents the frequency
of numerical data using bars. Unlike a bar plot, it
only represents quantitative data.
• The bars in a histogram touch each other, i.e.
there is no space between the bars.
It is generally used when
you are dealing with large datasets and want to detect any
unusual values or gaps in the data.
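What a histogram draws is a count of values per bin. The sketch below computes those counts with plain Python and prints a text version of the histogram. The ages, the bin width of 10 and the starting edge of 20 are all hypothetical choices made for illustration.

```python
def histogram_counts(values, bin_width, start):
    """Count values per bin [start + k*w, start + (k+1)*w), keyed by left edge."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)
        left = start + k * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical ages (assumed values).
ages = [21, 22, 22, 25, 27, 31, 33, 34, 38, 52]
counts = histogram_counts(ages, bin_width=10, start=20)

# Text rendering: one row per bin, including bins with no values.
for left in range(20, 60, 10):
    n = counts.get(left, 0)
    print(f"{left}-{left + 9}: {'#' * n}")
```

The empty 40-49 row is the kind of gap in the data that a histogram makes easy to spot.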
Data Visualization Types(cont.)
6. Pie chart
A pie chart of a categorical variable shows each
category as a slice whose size is
proportional to the quantity it represents. It is
a circular graph with as many slices as there are
categories.
• A pie chart is generally used to represent
categorical data.
• For example, comparing areas of growth
within a business such as Profit, Market
Expenses, etc.
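Because each slice is proportional to its category's share of the total, the slice sizes can be worked out directly. The sketch below converts category totals into percentages and slice angles. The category names and amounts are hypothetical.

```python
# Hypothetical category totals (assumed values).
spending = {"Rent": 500, "Food": 300, "Transport": 120, "Other": 80}

total = sum(spending.values())

# Each slice gets the category's share of the full 360-degree circle.
slices = {}
for name, value in spending.items():
    share = value / total
    slices[name] = {"percent": share * 100, "angle": share * 360}

for name, info in slices.items():
    print(f"{name}: {info['percent']:.1f}% -> {info['angle']:.1f} degrees")
```

The angles always sum to 360 degrees, and the number of slices equals the number of categories.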
Data Visualization Types(cont.)
7. Area plot
The area plot is based on the line chart. We
get the area plot when we fill the area
between the line and the x-axis.
It is very much like a line plot, with the key
difference that the filled areas highlight the distance between
different variables.
It is generally used to analyze progress in time
series, market trends and variations,
etc.
Data Visualization Types(cont.)
8. Heatmap
A heatmap is a two-dimensional representation of data
in which values are represented by colors.
Heatmaps are used to show relationships between two
variables, one plotted on each axis. By observing how cell
colors change across each axis, you can observe if there are
any patterns in value for one or both variables.
The heatmap is extremely useful for identifying
multicollinearity, which occurs when an input feature is
highly correlated with one or more of the other features in
the dataset.
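A correlation heatmap colors the cells of a correlation matrix, so the multicollinearity check comes down to finding feature pairs with a correlation close to +1 or -1. The sketch below builds that matrix with plain Python. The features and their values are hypothetical, and the 0.9 cut-off for "highly correlated" is an assumption chosen for illustration, not a rule from the course.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical features: the two height columns carry the same information.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 38],
}
names = list(features)

# Correlation matrix: one cell per pair of features (the heatmap's cells).
corr = {(a, b): pearson(features[a], features[b]) for a in names for b in names}

# Assumed threshold: flag pairs whose absolute correlation exceeds 0.9.
high = [(a, b) for a, b in combinations(names, 2) if abs(corr[(a, b)]) > 0.9]

for a in names:
    print(a.ljust(10), " ".join(f"{corr[(a, b)]:6.2f}" for b in names))
print("highly correlated pairs:", high)
```

In a heatmap of this matrix, the flagged pair would stand out as a strongly colored off-diagonal cell.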
Data Visualization Types(cont.)
9. Hexbin plot
Similar to the scatter plot, a hexbin plot
represents the relationship between two
numerical variables. It is useful when there are a
lot of data points in the two variables, because such points
overlap when represented in a scatter plot. A hexbin plot
instead groups the points into hexagonal bins and colors
each bin by the number of points it contains.
Data Visualization Types(cont.)
Drawing A Box And Whisker Plot
Example:
Construct a box plot for the following data:
12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Solution:
Step 1: Arrange the data in ascending order.
5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53
Step 2: Find the median, lower quartile and upper quartile.
Median = 22, lower quartile = 12, upper quartile = 36
Source: https://www.onlinemathlearning.com/box-plot.html
Drawing A Box And Whisker Plot(cont.)
Step 3: Draw a number line that will include the smallest and the largest data.
Step 4: Draw three vertical lines at the lower quartile (12), median (22) and the upper
quartile (36), just above the number line.
Drawing A Box And Whisker Plot(cont.)
Step 5: Join the lines for the lower quartile and the upper quartile to form a box.
Step 6: Draw a line from the smallest value (5) to the left side of the box and draw a line from the
right side of the box to the largest value (53).
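The five values used in the steps above (smallest value, lower quartile, median, upper quartile, largest value) can be checked in code. The sketch below uses the same data as the example and finds the quartiles as the medians of the lower and upper halves, leaving out the overall median when the number of values is odd. This is one common convention and it reproduces the quartiles in the example. Other conventions exist and can give slightly different quartiles. The function name `five_number_summary` is a hypothetical helper.

```python
import statistics


def five_number_summary(data):
    """Return (min, lower quartile, median, upper quartile, max).

    Quartiles are the medians of the lower and upper halves of the
    sorted data; the middle value is excluded when the count is odd.
    """
    values = sorted(data)
    n = len(values)
    lower_half = values[: n // 2]
    upper_half = values[(n + 1) // 2 :]
    return (
        values[0],
        statistics.median(lower_half),
        statistics.median(values),
        statistics.median(upper_half),
        values[-1],
    )


data = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25]
smallest, q1, median, q3, largest = five_number_summary(data)
print(smallest, q1, median, q3, largest)
```

The box spans `q1` to `q3`, the line inside the box marks `median`, and the whiskers extend to `smallest` and `largest`.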
Drawing A Box And Whisker Plot(cont.)
Effectiveness of Visualization across Data Types
This slide illustrates how different graphs can be used to visualize patterns in the data, taking into consideration the data type of the variable.
Note: KPI stands for key performance indicator, a quantifiable measure of performance over time for a specific objective.
Tools and Software for Data Visualization
There are multiple tools and software available for data visualization.
1. Python provides open-source libraries such as
Matplotlib
Seaborn
Plotly
Bokeh
Altair
2. R provides open-source libraries such as
ggplot2
lattice
3. Other data visualization tools popular among data scientists include
Tableau
Microsoft Power BI
Looker
Sisense
MATLAB
Data Visualization Process/Workflow
The data visualization process or workflow includes the following key steps.
1. Develop your research question
This may be a business problem or any other related problem that could be solved with a data-driven
approach. You should note all the objectives and outcomes plus required resources such as datasets, open-
source software libraries, etc.
2. Get or create your data
The next step is collecting data. You can use existing datasets if they’re relevant to your research question.
Alternatively, you can download open-source datasets from the internet or do web scraping to collect data.
3. Clean your data
Real-world data are messy. So, you need to clean them before using them for visualization. You can identify
missing values and outliers and treat them accordingly. You can perform feature selection and remove
unnecessary features from the data. You can create a new set of features based on the original features.
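The cleaning step can be sketched for a single numerical column: count the missing values, then flag outliers. The source does not name a specific outlier rule, so the sketch below assumes the common 1.5 × IQR rule and simply drops the flagged values, which is only one possible treatment. The readings are hypothetical, with `None` standing in for a missing value.

```python
import statistics

# Hypothetical raw column; None marks a missing value.
raw = [12, 14, None, 15, 13, 16, 14, 95, None, 15]

# Step 1: identify and set aside missing values.
present = [v for v in raw if v is not None]
missing_count = len(raw) - len(present)

# Step 2: flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = [v for v in present if v < low_fence or v > high_fence]
kept = [v for v in present if low_fence <= v <= high_fence]

print("missing:", missing_count)
print("outliers:", outliers)
print("kept:", kept)
```

Depending on the problem, missing values might instead be imputed and outliers capped or investigated rather than removed.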
Data Visualization Process/Workflow(cont.)
4. Choose a chart type
The chart type depends on many factors. For example, it depends on the feature type (numerical or categorical). It also
depends on the type of visualization you need. Let’s say you have two numerical features. If you want to find their
distributions, you can create a histogram for each feature. If you want to plot their variations, you can create a box and
whisker plot for each feature. You can create a scatterplot if you want to find a relationship (linear or non-linear, positive or
negative) between the two features.
5. Choose your tool
You can use open-source data visualization tools such as Matplotlib, Seaborn, Plotly and ggplot2. You can also use commercial
software such as MATLAB, Minitab, SPSS, etc.
6. Prepare data
You can extract relevant features. You can do feature standardization if the values of the features are not on the same scale.
You can apply data preprocessing steps such as PCA to reduce the dimensionality of the data. That will allow you to visualize
high-dimensional data in 2D and 3D plots!
7. Create a chart
This is the final step. Here, you define the title and names for the axes. You should also choose a proper chart background to
ensure the content is easily readable.
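The feature standardization mentioned in the "Prepare data" step can be sketched as a z-score transform: subtract the mean and divide by the standard deviation, so that features measured on different scales become comparable on one plot. The sketch below uses the population standard deviation, which is an assumption (the sample standard deviation is also common). The income and age values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / std for v in values]


# Hypothetical features on very different scales.
income = [30000, 45000, 60000, 75000, 90000]
age = [22, 30, 38, 46, 54]

income_z = standardize(income)
age_z = standardize(age)

print([round(z, 3) for z in income_z])
print([round(z, 3) for z in age_z])
```

After standardization both features can share one axis, for example in side-by-side box plots. Note that `standardize` fails with a division by zero if all values are identical.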
Data Visualization Techniques in Data Science
Some of the main data visualization techniques in data science are univariate analysis, bivariate analysis and
multivariate analysis.
1. Univariate Analysis
In univariate analysis, as the name suggests, we analyze only one variable at a time. In other words, we analyze
each variable separately. Bar charts, pie charts, box plots and histograms are common examples of univariate
data visualization. Bar charts and pie charts are created for categorical variables, while box plots and
histograms are created for numerical variables.
2. Bivariate Analysis
In bivariate analysis, we analyze two variables at a time. Often, we see whether there is a relationship between
the two variables. The scatter plot is a classic example of bivariate data visualization.
3. Multivariate Analysis
In multivariate analysis, we analyze more than two variables simultaneously. The heatmap is a classic example
of multivariate data visualization. Cluster analysis and principal component analysis (PCA) are other multivariate techniques whose results are commonly visualized.
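The pairing of analysis type and chart described above can be summarized as a small lookup. The sketch below encodes only the combinations named in this section. The function name `suggest_charts` and the type labels "numerical" and "categorical" are hypothetical, and the function returns an empty list for combinations the text does not cover.

```python
def suggest_charts(variable_types):
    """Suggest chart types for a list of variable types.

    variable_types: list of "numerical" or "categorical" strings,
    one entry per variable being analyzed together.
    """
    count = len(variable_types)
    if count == 1:
        # Univariate analysis
        if variable_types[0] == "categorical":
            return ["bar chart", "pie chart"]
        if variable_types[0] == "numerical":
            return ["box plot", "histogram"]
    elif count == 2:
        # Bivariate analysis
        if all(t == "numerical" for t in variable_types):
            return ["scatter plot"]
    elif count > 2:
        # Multivariate analysis
        return ["heatmap"]
    return []  # combination not covered in this section


print(suggest_charts(["categorical"]))
print(suggest_charts(["numerical", "numerical"]))
print(suggest_charts(["numerical", "numerical", "numerical"]))
```

A lookup like this is only a starting point. The final choice also depends on the question being asked and the audience.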
Disadvantages
Challenges of Data Visualization
Summary
• Data visualization forms the backbone of all analytical projects. It not only helps
in gaining insights into the data but can be used as a tool for data pre-processing.
Having the right set of visualizations for different data types and business
scenarios is the key to effective communication of results.
• Data Visualization:
• Enhances learning
• Enhances understanding
• Enhances reasoning
• Helps in decision making
• Data visualization acts as a link between the raw data and our
engagement with it
Data Visualization Examples
Example#1
Example#2: The Data Visualization Catalogue
• Provides an excellent introduction to different types of visualizations
• Explore the Search by Function feature to find the best visualizations
for a given purpose
• You can find examples of other visualizations here.
Example#3: Powerful Visualization
Example#4:
References