CH 4 Data Visualization

The document discusses data visualization and its importance in data science. It defines data visualization as using visual representations of data to help interpret patterns and insights. The key goals of data visualization are to communicate results, explore data to identify trends and correlations, and help with all stages of the data science process from data cleaning to model evaluation. Effective visualization ensures clarity, relates to the problem domain, allows interactivity, enables comparability, and is aesthetically pleasing and informative.

Ch4:Data Visualization

Ms.Duha Qutishat
1st Semester 2022-2023 Introduction to Data Science Course
What is Data Visualization?
• Data (or information) visualization is used to interpret and gain insight
into large amounts of data. This is achieved through visual
representations, often interactive, of raw data.
• It is a methodology for discovering or confirming useful
information about the data by constructing and examining
graphical output.

What is Data Visualization?
• Data visualization is the graphical representation of information and
data using visual elements like charts, graphs, and maps.
• Data visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data. They also
give employees and business owners a clear way to
present data to non-technical audiences.
• In the world of Big Data, data visualization tools and technologies are
essential for analyzing massive amounts of information and making
data-driven decisions.

Data Science Visualization 
• Data science visualization is the graphical representation of data in
any format. It is an efficient way of communicating facts to
non-technical professionals and helps them draw inferences from the
data.
• Many companies today are data-driven. The data they acquire
sits in a data lake, usually in the cloud. The collected data is
pulled out of the data lake, cleaned, and stored in a data
warehouse.
• Data scientists work with this data to build and train machine
learning models, make predictive analyses, and visualize the results.

Data visualization and Data Science
• There are many reasons for data visualization in data science.
Data → Visualization → Analysis
• Data visualization is one of the steps of the data science process: after data has been collected,
processed and modeled, it must be visualized before conclusions can be drawn.

• Data visualization benefits include:
  • communicating your results or findings
  • monitoring the model’s performance at the evaluation stage
  • hyperparameter tuning
  • identifying trends, patterns and correlations between dataset features
  • data cleaning, such as outlier detection, and validating model assumptions

• The main goal of data visualization is to make it easier to identify (explore) patterns, trends and outliers (anomalies) in
large data sets.

Visualization aims to ..
Data visualization can lead a user to:
• Detect patterns
• Detect trends
• Detect correlations in data

It can then prompt a user to:
• Draw inferences
• Anticipate potential trajectories and outcomes
• Ask new questions of the data that would not otherwise have been considered

Data Visualization Example#1

Data Visualization Example#2

Advantages of Data Visualization
• The ability to absorb information quickly, improve insights and make
faster decisions
• An increased understanding of the next steps that must be taken.
• An improved ability to maintain the audience's interest with information
they can understand
• An easy distribution of information that increases the opportunity to
share insights with everyone involved.
• An increased ability to act on findings quickly and, therefore, achieve
success with greater speed and fewer mistakes
• With some visualizations, the ability to filter out undesirable properties in
the dataset.
Main goals of data visualization

1. What are the three main goals of data visualization?
• Communicating your results or findings to your audience
• Exploring (getting to know) your data
• Identifying trends, patterns and correlations between variables
2. How is data visualization used in data science?
Data visualization is used in every stage of data science, including:
• Tuning hyperparameters
• Monitoring the model’s performance
• Cleaning data
• Validating the model’s assumptions

What Makes Data Visualization Effective?
• To get the most out of data visualization, you should consider the following
things. These are the fundamentals of data visualization. 
• Clarity: Data should be visualized in a way that everyone can understand. 
• Problem domain: When presenting data, the visualizations should be related to
the business problem. 
• Interactivity: Interactive plots are useful for comparing and highlighting particular elements
within the plot.
• Comparability: Good plots make it easy to compare things.
• Aesthetics: Quality plots are visually pleasing.
• Informative: A good plot summarizes all relevant information.

Importance of Data Visualization in Data Science

1. Data cleaning
Data visualization plays an important role in data cleaning. Good examples are
detecting outliers and detecting multicollinearity: we can create scatterplots to
detect outliers and generate heatmaps to check for multicollinearity.
2. Data exploration
Before building any model, we need to do some exploratory data analysis to
identify dataset characteristics. For example, we can create histograms for
continuous variables to check for normality in the data. We can create scatterplots
between two features to check whether they are correlated. Likewise, we can
create a bar chart for the label column with two or more classes to identify class
imbalance. 
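As a minimal sketch of the class-imbalance check described above, the snippet below counts how often each class appears in a label column and prints a rough text bar per class. The label values and the `class_shares` helper are hypothetical and only serve as an illustration; in practice the same counts would feed a bar chart in a plotting library.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the rows, largest class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.most_common()}


# Hypothetical label column, made up for this illustration.
labels = ["spam"] * 2 + ["ham"] * 8

shares = class_shares(labels)
for cls, share in shares.items():
    # One '#' per 5 percentage points gives a rough text bar chart.
    print(f"{cls:<5} {'#' * round(share * 20)} {share:.0%}")
```

A bar that is much longer than the others signals an imbalanced label column.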

Importance of Data Visualization in Data Science
3. Data distribution
Data visualization can be used to understand the distribution of the data: to look for central tendencies
(mean, median, and mode), to reveal outliers using a boxplot, and to check for skewness.
4. Evaluation of modeling outputs
We can create a confusion matrix and learning curve to measure the performance of a model during
training. Plots are also useful in validating model assumptions. For example, we can create a residuals
plot and histogram for the distribution of residuals to validate the assumptions of a linear regression
model. 
5. Identifying trends
Time and seasonal plots are useful in time series analysis to identify certain trends over time. 
6. Presenting results
As a data scientist, you need to present your findings to the company or to other stakeholders who may
not have much knowledge of the subject domain. So, you need to explain everything in plain English.
You can use informative plots that summarize your findings.
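The confusion matrix mentioned under "Evaluation of modeling outputs" is a table of counts before it is a picture. The sketch below builds that table with the standard library only. The labels, predictions and the `confusion_matrix` helper are hypothetical; libraries such as scikit-learn provide their own versions.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in classes] for a in classes]


# Hypothetical true labels and model predictions.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
classes = ["yes", "no"]

matrix = confusion_matrix(actual, predicted, classes)
# The diagonal holds the correct predictions.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(actual)

for cls, row in zip(classes, matrix):
    print(f"actual {cls:<3}", row)
print("accuracy:", round(accuracy, 2))
```

Drawing this table with one colored cell per count gives the usual confusion-matrix plot.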

Type of Datasets in Analytical Problems

• It is important to understand the type of dataset in order to determine the type of
visualization that can be applied.
• E.g. when working with tabular data, a combination of bar graphs and line charts might be useful,
whereas for spatial data a map with a density plot might communicate the result more
effectively.
• The key data types that are commonly used are:
1. Tabular data
Data organized in tables, with a row for each data item and a column for each of its attributes.
E.g. datasets that are available in Excel, CSV files, Pandas data frames, etc.
2. Network data
Nodes in the network are data items, and links between the nodes are relations between them.
For example, a social network.

Type of Datasets in Analytical Problems (cont.)
3. Spatial data:
Data which is naturally organized and understood in terms of its spatial location
or extent.
E.g. latitude and longitude of locations, geographic information, suburbs,
streets, etc.
4. Textual data:
This kind of dataset consists of sequences of words and punctuation.
E.g. a Twitter feed or customer complaints.

Data Visualization Types
There are many data visualization types. The following are the commonly used data visualization
charts. 
1. Distribution plot
A distribution plot is used to visualize data distribution.
Example: Probability distribution plot or density curve.

Data Visualization Types(cont.)

2. Line plot
A line plot is created by connecting a series of data points
with straight lines. The number of periods is on the x-axis.

The major advantages of a line plot are that:
1. it is very intuitive
2. you can easily understand the result, even if you have
no experience in the field.

It is commonly used to track and compare several
variables over time, analyze trends, and predict future
values.

Data Visualization Types(cont.)

3. Bar plot
A bar plot is used to plot the frequency of categorical data.
Each category is represented by a bar.
The bars can be drawn vertically or horizontally. Their
heights or lengths are proportional to the values they
represent.
Advantages: simplicity and clarity.
It can be used when you are comparing variables in the
same category or tracking the progression of one or two
variables over time.
For example, to compare the marks of a student in
multiple subjects, a bar plot is a good choice.
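The student-marks example could be drawn roughly as follows. This is only a sketch: the subjects and marks are made up, `plot_marks` is a hypothetical helper, and it assumes Matplotlib is installed (it is imported inside the function, so the data can still be inspected without it).

```python
# Hypothetical marks for one student, made up for this illustration.
marks = {"Math": 82, "Physics": 74, "English": 91, "History": 68}


def plot_marks(marks, path="marks.png"):
    """Draw one vertical bar per subject and save the chart to `path`."""
    # Imported lazily so the data above can be used without Matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display window needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # Bar heights are proportional to the marks they represent.
    ax.bar(list(marks), list(marks.values()))
    ax.set_xlabel("Subject")
    ax.set_ylabel("Mark")
    ax.set_title("Marks by subject")
    fig.savefig(path)
    plt.close(fig)


best_subject = max(marks, key=marks.get)
print("Highest mark:", best_subject)
```

Calling `plot_marks(marks)` writes the chart to `marks.png`; replacing `ax.bar` with `ax.barh` would draw the bars horizontally.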

Data Visualization Types(cont.)

4. Scatter plot
A scatter plot uses dots to illustrate the
values of numerical variables. It is
used to:
• analyze individual points
• observe and visualize relationships
between variables
• get a general overview of the variables

Data Visualization Types(cont.)

5. Histogram
• A histogram graphically represents the frequency
of numerical data using bars. Unlike a bar plot, it
represents only quantitative data.
• The bars in a histogram touch each other, i.e.
there is no space between the bars.
It is generally used when you are dealing with large
datasets and want to detect any unusual activity or
gaps in the data.
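Behind every histogram is a table of bin counts. The sketch below computes those counts for equal-width bins and keeps empty bins, so that gaps in the data stay visible. The values, the bin width and the `histogram_counts` helper are assumptions chosen for illustration.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count integer values per bin [left, left + bin_width), keeping empty bins."""
    lefts = [(v // bin_width) * bin_width for v in values]
    counts = Counter(lefts)
    # Walk every bin from the lowest to the highest so empty bins show up as 0.
    return {
        left: counts[left]
        for left in range(min(lefts), max(lefts) + bin_width, bin_width)
    }


# Hypothetical measurements with a gap between 20 and 39.
values = [3, 7, 12, 15, 18, 41, 44]

bins = histogram_counts(values, bin_width=10)
for left, count in bins.items():
    print(f"{left:>2}-{left + 9:<2} {'#' * count}")
```

The two empty rows in the printed output correspond to the gap that a drawn histogram would show as missing bars.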

Data Visualization Types(cont.)
6. Pie chart
A pie chart of a categorical variable shows each
category's value as a slice whose size is
proportional to the quantity it represents. It is
a circular graph with as many slices as there are
categories.
• A pie chart is generally used to represent
categorical data.
• For example, comparing areas of growth
within a business, such as profit, market
expenses, etc.

Data Visualization Types(cont.)

7. Area plot
The area plot is based on the line chart. We
get the area plot when we fill the area
between the line and the x-axis.
• It is very much like a line plot, but with the key
difference that it highlights the distance between
different variables.
• It is generally used to analyze progress in time
series, analyze market trends and variations,
etc.

Data Visualization Types(cont.)
8. Heatmap
A heatmap is a two-dimensional representation of data
in which values are represented by colors.
Heatmaps are used to show relationships between two
variables, one plotted on each axis. By observing how cell
colors change across each axis, you can see whether there are
any patterns in value for one or both variables.
The heatmap is extremely useful for identifying
multicollinearity, which occurs when an input feature is
highly correlated with one or more of the other features in
the dataset.
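The numbers a correlation heatmap colors are pairwise correlation coefficients. As a rough sketch, the code below computes the Pearson correlation for every pair of features and lists the pairs whose absolute correlation reaches a threshold. The feature values, the 0.9 threshold and all function names are assumptions made for this illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(features):
    """Nested dict: corr[a][b] is the correlation of features a and b."""
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names} for a in names}


def highly_correlated(features, threshold=0.9):
    """Feature pairs whose absolute correlation is at least `threshold`."""
    names = list(features)
    corr = correlation_matrix(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(corr[a][b]) >= threshold
    ]


# Hypothetical features: the two height columns carry the same information.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 38, 44, 39, 43],
}

print(highly_correlated(features))
```

Passing the values of `correlation_matrix(features)` to a plotting call such as Matplotlib's `imshow` or seaborn's `heatmap` would turn the same table into the colored grid described above.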

Data Visualization Types(cont.)

9. Hexbin plot
Similar to the scatter plot, a hexbin plot
represents the relationship between two
numerical variables. It is useful when there are a
lot of data points: in a scatter plot, many points
would overlap, whereas a hexbin plot groups them
into hexagonal bins colored by the number of
points in each bin.

Data Visualization Types(cont.)

10. Box and whisker plot

Drawing A Box And Whisker Plot
Example:
Construct a box plot for the following data:
12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Solution:
Step 1: Arrange the data in ascending order.
5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53
Step 2: Find the median, lower quartile and upper quartile.

Median (middle value) = 22
Lower quartile (middle value of the lower half) = 12
Upper quartile (middle value of the upper half) = 36
(If there is an even number of data items, then we need to take the average of the two middle numbers.)

https://www.onlinemathlearning.com/box-plot.html
Drawing A Box And Whisker Plot(cont.)

Step 3: Draw a number line that will include the smallest and the largest data.

                           
Step 4: Draw three vertical lines at the lower quartile (12), median (22) and the upper
quartile (36), just above the number line.

Drawing A Box And Whisker Plot(cont.)

Step 5: Join the lines for the lower quartile and the upper quartile to form a box.

Step 6: Draw a line from the smallest value (5) to the left side of the box, and draw a line from the
right side of the box to the largest value (53).
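The five values used in Steps 1 to 6 can be checked with a short script. This sketch follows the same convention as the worked example: the quartiles are the medians of the lower and upper halves, with the overall median left out when the number of items is odd. Other quartile conventions exist and can give slightly different values; the function names are hypothetical.

```python
def median(sorted_values):
    """Middle value, or the average of the two middle values."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)          # Step 1: ascending order
    n = len(data)
    lower = data[: n // 2]         # values below the median position
    upper = data[(n + 1) // 2:]    # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])


data = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25]
print(five_number_summary(data))  # (5, 12, 22, 36, 53)
```

The box spans the second and fourth values, the line inside it sits at the third, and the whiskers reach to the first and last.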

                           
Drawing A Box And Whisker Plot(cont.)

Example 2: comparative double box and whisker plot 


Suppose an IT company has two stores that sell computers. The company
recorded the number of sales each store made each month. In the past 12
months, we have the following numbers of sold computers:
Store 1:
350, 460, 20, 160, 580, 250, 210, 120, 200, 510, 290, 380.
Store 2:
520, 180, 260, 380, 80, 500, 630, 420, 210, 70, 440, 140.

Interpreting the results:
• Store 2’s highest and lowest monthly sales are both higher than
Store 1’s.
• In addition, Store 2’s median sales value is higher than Store 1’s.
• Also, Store 2’s interquartile range is larger.
• These results tell us that Store 2 generally sells more computers
than Store 1, although its monthly sales are also more spread out.
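The comparison between the two stores can be reproduced from the monthly figures. The sketch below assumes the same median-of-halves quartile convention as the first box plot example; `box_stats` is a hypothetical helper name.

```python
from statistics import median


def box_stats(values):
    """Box plot statistics using the median-of-halves quartile convention."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])    # median of the lower half
    q3 = median(data[-half:])   # median of the upper half
    return {
        "min": data[0],
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "max": data[-1],
        "iqr": q3 - q1,
    }


store_1 = [350, 460, 20, 160, 580, 250, 210, 120, 200, 510, 290, 380]
store_2 = [520, 180, 260, 380, 80, 500, 630, 420, 210, 70, 440, 140]

s1, s2 = box_stats(store_1), box_stats(store_2)
for name, stats in (("Store 1", s1), ("Store 2", s2)):
    print(name, stats)
```

Under this convention Store 1 has a median of 270 and an interquartile range of 240, while Store 2 has a median of 320 and an interquartile range of 310, which matches the ordering in the interpretation above.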

Effectiveness of Visualization across Data Types
[Figure: how different graphs can be used to visualize patterns in the data, taking into consideration the data type of the variable]

Note: KPI = key performance indicator (a quantifiable measure of performance over time for a specific objective).

Tools and Software for Data Visualization

There are multiple tools and software packages available for data visualization.
1. Python provides open-source libraries such as:
• Matplotlib
• Seaborn
• Plotly
• Bokeh
• Altair
2. R provides open-source libraries such as:
• ggplot2
• lattice
3. Other data visualization tools that are popular among data scientists:
• Tableau
• Microsoft Power BI
• Looker
• Sisense
• MATLAB

Data Visualization Process/Workflow
The data visualization process or workflow includes the following key steps.
1. Develop your research question
This may be a business problem or any other related problem that could be solved with a data-driven
approach. You should note all the objectives and outcomes plus required resources such as datasets, open-
source software libraries, etc. 
2. Get or create your data
The next step is collecting data. You can use existing datasets if they’re relevant to your research question.
Alternatively, you can download open-source datasets from the internet or do web scraping to collect data. 
3. Clean your data
Real-world data are messy. So, you need to clean them before using them for visualization. You can identify
missing values and outliers and treat them accordingly. You can perform feature selection and remove
unnecessary features from the data. You can create a new set of features based on the original features. 

Data Visualization Process/Workflow(cont.)
4. Choose a chart type
The chart type depends on many factors. For example, it depends on the feature type (numerical or categorical). It also
depends on the type of visualization you need. Let’s say you have two numerical features. If you want to see their
distributions, you can create a histogram for each feature. If you want to see their variation, you can create a box and
whisker plot for each feature. You can create a scatterplot if you want to find a relationship (linear or non-linear, positive or
negative) between the two features.
5. Choose your tool
You can use open-source data visualization tools such as Matplotlib, Seaborn, Plotly and ggplot2. You can also use API-based
software such as MATLAB, Minitab, SPSS, etc.
6. Prepare data
You can extract relevant features. You can do feature standardization if the values of the features are not on the same scale.
You can apply data preprocessing steps such as PCA to reduce the dimensionality of the data. That will allow you to visualize
high-dimensional data in 2D and 3D plots! 
7. Create a chart
This is the final step. Here, you define the title and names for the axes. You should also choose a proper chart background to
ensure the content is easily readable.
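The feature standardization mentioned in step 6 can be sketched as a z-score transform: subtract the mean and divide by the standard deviation, so that features measured in different units end up on the same scale. The two feature columns and the `standardize` helper are hypothetical, and the sketch assumes the population standard deviation and a feature that is not constant.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # would be 0 for a constant feature
    return [(v - mu) / sigma for v in values]


# Hypothetical features on very different scales.
heights_cm = [150, 160, 170, 180, 190]
incomes = [20000, 35000, 50000, 65000, 80000]

z_heights = standardize(heights_cm)
z_incomes = standardize(incomes)

print([round(z, 2) for z in z_heights])
print([round(z, 2) for z in z_incomes])
```

After the transform both features cover the same range, so they can share one axis or one color scale in a chart.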

Data Visualization Techniques in Data Science
Some of the main data visualization techniques in data science are univariate analysis, bivariate analysis and
multivariate analysis. 
1. Univariate Analysis
In univariate analysis, as the name suggests, we analyze only one variable at a time. In other words, we analyze
each variable separately. Bar charts, pie charts, box plots and histograms are common examples of univariate
data visualization. Bar charts and pie charts are created for categorical variables, while box plots and
histograms are created for numerical variables. 
2. Bivariate Analysis
In bivariate analysis, we analyze two variables at a time. Often, we see whether there is a relationship between
the two variables. The scatter plot is a classic example of bivariate data visualization. 
3. Multivariate Analysis
In multivariate analysis, we analyze more than two variables simultaneously. The heatmap is a classic example
of multivariate data visualization. Other examples are cluster analysis and principal component analysis (PCA). 
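The pairing of analysis type and chart named on this slide can be written down as a small lookup. This is only a sketch of the rules as stated here; `suggest_charts` is a hypothetical helper, and returning an empty list for cases the slide does not cover is an assumption.

```python
def suggest_charts(variable_types):
    """Suggest charts from a list of variable types ('categorical' or 'numerical')."""
    allowed = {"categorical", "numerical"}
    if not variable_types or any(t not in allowed for t in variable_types):
        raise ValueError("expected one or more of: categorical, numerical")

    if len(variable_types) == 1:
        # Univariate analysis: one variable at a time.
        if variable_types[0] == "categorical":
            return ["bar chart", "pie chart"]
        return ["box plot", "histogram"]

    if len(variable_types) == 2:
        # Bivariate analysis: the slide names the scatter plot, which needs numbers.
        if variable_types == ["numerical", "numerical"]:
            return ["scatter plot"]
        return []  # no chart is named on the slide for this combination

    # Multivariate analysis: more than two variables.
    return ["heatmap"]


print(suggest_charts(["categorical"]))
print(suggest_charts(["numerical", "numerical"]))
print(suggest_charts(["numerical", "numerical", "numerical"]))
```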

Disadvantages

There are also some disadvantages of data visualization. 


• We need to download, install and configure software and open-source libraries.
The process can be difficult and time-consuming for beginners.
• Some data visualization tools are not available for free; we need to pay for those.
• When we summarize the data, we lose the exact information.

Challenges of Data Visualization

• Choosing the right plot type
• Identifying the needs of your audience
• Developing the research question and converting it to a data science
question
• Collecting data

Summary
• Data visualization forms the backbone of all analytical projects. It not only helps
in gaining insights into the data but can be used as a tool for data pre-processing.
Having the right set of visualizations for different data types and business
scenarios is the key to effective communication of results.
• Data visualization:
  • Enhances learning
  • Enhances understanding
  • Enhances reasoning
  • Helps in decision making
• Data visualization acts as a link between the raw data and our
engagement with it.

Data Visualization Examples

Example#1

• The best example of how these considerations and concepts can be
put into practice is interacting with Google Maps.

Example#2: The Data Visualization Catalogue
• Provides an excellent introduction to different types of visualizations
• Explore the Search by Function feature to find the best visualizations
for a given purpose
• You can find examples of other visualizations here.

Example#3: Powerful Visualization

• Powerful visualization, but not immediately intuitive
• Digital Commons Network: Open Access. Powered by Scholars. Published
by Universities.

Example#4:

• Using visualization for storytelling
• Connected China – Reuters
