Block 1
1.0 INTRODUCTION
The Internet and communication technologies have grown tremendously in the past decade,
leading to the generation of large amounts of unstructured data. This unstructured data
includes unformatted textual, graphical, video and audio data, which is being generated as
a result of people's use of social media and mobile technologies. In addition, with the
tremendous growth in the digital ecosystem of organisations, a large amount of
semi-structured data, like XML data, is also being generated at a high rate. All such data
is in addition to the large amount of data that results from organisational databases and
data warehouses. This data may be processed in real time to support the decision-making
processes of various organisations. The discipline of data science focuses on the processes
of collection, integration and processing of large amounts of data to produce useful
information for informed decision making.
This unit introduces you to the basic concepts of data science. It provides an
introduction to the different types of data used in data science and points to the
different types of analysis that can be performed using data science. Further, the unit
also introduces some of the common mistakes made in data science.
1.1 OBJECTIVES
At the end of this unit, you should be able to:
• define the term data science in the context of an organisation;
• explain different types of data;
• list and explain different types of analysis that can be performed on data;
• explain the common mistakes about data size;
• define the concept of data dredging;
• list some of the applications of data science;
• define the life cycle of data science.
1.2 DATA SCIENCE-DEFINITION
Data Science is a multi-disciplinary field with the objective of performing data analysis
to generate knowledge that can be used for decision making. This knowledge can be
in the form of similarity patterns, predictive planning models, forecasting models, etc.
A data science application collects data and information from multiple heterogeneous
sources; cleans, integrates, processes and analyses this data using various tools; and
presents information and knowledge in various visual forms.
[Figure 1: Data Science as a multi-disciplinary field, drawing on computing (programming, visualization, machine learning, big data, database systems), mathematics (modelling and simulation, statistics) and domain knowledge.]
What are the advantages of data science in an organisation? The following are some
of the areas in which data science can be useful:
• It helps in making business decisions, such as assessing the health of
companies with whom an organisation plans to collaborate.
• It may help in making better predictions for the future, such as making
strategic plans of the company based on present trends.
• It may identify similarities among various data patterns, leading to
applications like fraud detection, targeted marketing, etc.
In general, data science is a way forward for business decision making, especially in
the present-day world, where data is being generated at the rate of zettabytes.
Data science can be used in many organisations; some of its possible uses are given below:
• It has great potential for finding the best dynamic route from a source to a
destination. Such an application may constantly monitor the traffic flow and
predict the best route based on collected data.
• It may bring down the logistics costs of an organisation by suggesting the best
time and route for transporting goods.
• It can minimise marketing expenses by identifying groups with similar buying
patterns and performing selective advertising based on the data obtained.
• It can help in making public health policies, especially in the case of
disasters.
• It can be useful in studying the environmental impact of various
developmental activities.
• It can be very useful in saving resources in smart cities.
1.3 TYPES OF DATA
The type of data is one of the important aspects that determines the kind of
analysis that has to be performed on the data. In data science, the following are the
different types of data that are required to be processed:
1. Structured Data
2. Semi-Structured Data
3. Unstructured data
4. Data Streams
Structured Data
Since the start of the era of computing, the computer has been used as a data
processing device. However, it was not until the 1960s that businesses started
using computers for processing their data. One of the most popular languages of
that era was the Common Business-Oriented Language (COBOL). COBOL had a
data division, which was used to represent the structure of the data being processed.
This was followed by a disruptive, seminal design of technology by E.F. Codd,
which led to the creation of relational database management systems (RDBMS).
An RDBMS allows structured storage, retrieval and processing of the integrated
data of an organisation, which can be securely shared among several applications.
The RDBMS technology also supported secure transactions and thus became a major
source of data generation. Figure 2 shows the sample structure of data that may be
stored in a relational database system. One of the key characteristics of structured
data is that it can be associated with a schema. In addition, each schema element
may be related to a specific data type.
Relational data is structured data, and a large amount of this structured data is
being collected by various organisations as the backend of most applications. In the
1990s, the concept of a data warehouse was introduced. A data warehouse is a
time-variant, subject-oriented aggregation of the data of an organisation that can
be used for decision making. The data in a data warehouse is represented using
dimension tables and fact tables; the dimension tables classify the data of the
fact tables. You have already studied various schemas in the context of data
warehouses in MCS221. The data of a data warehouse is also structured in nature
and can be used for analytical data processing and data mining. In addition,
many different types of database management systems have been developed,
which mostly store structured data.
However, with the growth of communication and mobile technologies, many
different applications became very popular, leading to the generation of very large
amounts of semi-structured and unstructured data. These are discussed next.
Semi-structured Data
As the name suggests, semi-structured data has some structure in it. The structure of
semi-structured data is due to the use of tags or key/value pairs. Common
forms of semi-structured data are produced through XML, JSON objects, server
logs, EDI data, etc. Examples of semi-structured data are shown in Figure 3.
<Book>
  <title>Data Science and Big Data</title>
  <author>R Raman</author>
  <author>C V Shekhar</author>
  <yearofpublication>2020</yearofpublication>
</Book>

"Book": {
  "Title": "Data Science",
  "Price": 5000,
  "Year": 2020
}
Figure 3: Sample semi-structured data
Unstructured Data
Unstructured data does not follow any schema definition. For example, written
text like the content of this unit is unstructured. You may add certain
headings or metadata to unstructured data. In fact, the growth of the Internet has
resulted in the generation of zettabytes of unstructured data. Some examples of
unstructured data are listed below:
• large written textual data, such as email data, social media data, etc.
• unprocessed audio and video data
• image data and mobile data
• unprocessed natural speech data
• unprocessed geographical data
In general, this data requires huge storage space, newer processing methods
and faster processing capabilities.
Data Streams
A data stream is characterised by a sequence of data over a period of time.
Such data may be structured, semi-structured or unstructured, but it gets
generated repeatedly. For example, an IoT device like a weather sensor will
generate a data stream of pressure, temperature, wind direction, wind speed,
humidity, etc. for the particular place where it is installed. Such data is huge,
and for many applications it is required to be processed in real time. In general,
not all the data of a stream is required to be stored, and such data is required
to be processed only for a specific duration of time.
1.3.1 Statistical Data Types
There are two distinct types of data that can be used in statistical analysis:
categorical data and quantitative data.
Categorical data is used to define the category of data; for example, the occupation
of a person may take values from the categories “Business”, “Salaried”, “Others”,
etc. Categorical data can be of two distinct measurement scales, called
Nominal and Ordinal, which are given in Figure 4. If the categories are not
related, then the categorical data is of the Nominal type; for example, the
Business and Salaried categories have no relationship, therefore occupation is
of the Nominal type. However, a categorical variable like age category, defining
age in the categories “0 or more but less than 26”, “26 or more but less than 46”,
“46 or more but less than 61” and “more than 61”, has a specific relationship: for
example, a person in the age category “more than 61” is older than a person in any
other age category. Such a variable is of the Ordinal type.
Quantitative Data:
Quantitative data is numeric data, which can be measured on different scales.
Quantitative data is also of two basic types: discrete, which
represents distinct numbers like 2, 3, 5, …; or continuous, which represents
continuous values of a given variable. For example, your height can be
measured using a continuous scale.
Data are raw facts; for example, student data may include the name, gender, age
and height of a student. The name typically is identifying data that tries to
distinguish data items, just like a primary key in a database.
However, the name or any other identifying data may not be useful for
performing data analysis in data science. Data such as gender, age and height
may be used to answer queries of the kind: Is there a difference in the height of
boys and girls in the age range 10-15 years? One of the important questions is:
how do you measure the data so that it is recorded consistently? Stanley
Stevens, a psychologist, defined the following four scales on which any data
can be measured:
absolute zero value, whereas the intelligence quotient cannot be defined
as zero.
1.3.2 Sampling
In general, the size of the data that is to be processed today is quite large. This
leads to the question of whether you should use the entire data or some
representative sample of this data. In several data science techniques, sample
data is also used to develop an exploratory model. Thus, even in data
science, sampling is one of the ways to enhance the speed of exploratory
data analysis. The population in this case is the entire set of data in which you
are interested. Figure 5 shows the relationship between population and
sample. One question which is asked in this context is: what should be
the size of a good sample? You may find the answer in the literature;
however, please note that a good sample is representative of its
population.
[Figure 5: A sample as a subset of the population.]
One of the key objectives of statistics, which uses sample data, is to determine
the statistic of the sample and find the probability that the statistic computed
for the sample determines the parameters of the population with a specific
level of accuracy. Please note that the terms stated above are very important
and are explained in the following table:
3. What would be the measurement scale for the following? Give reason
in support of your answer.
Age, AgeCategory, Colour of eye, Weight of students of a class, Grade
of students, 5-point Likert scale
Descriptive analysis is used to present basic summaries about data; however, it makes
no attempt to interpret the data. These summaries may include different statistical
values and certain graphs. Different types of data are described in different ways.
The following example illustrates this concept:
Example 1: Consider the data given in the following Figure 6. Show the summary of
categorical data in this Figure.
Enrolment Number   Gender   Height (cm)
S20200005          M        173
S20200006          M        160
S20200007          M        180
S20200008          F        178
S20200009          F        167
S20200010          M        173
Figure 6: Sample Height Data
Please note that the enrolment number variable need not be used in the analysis, so no
summary data for the enrolment number is to be found.
In addition, you can draw a bar chart or pie chart to describe the data of the Gender
variable. The pie chart for such data is shown in Figure 7. Details of different
charts are explained in Unit 4. In general, you draw a bar graph when the number
of categories is large.
The median of the data would be the mid value of the sorted data. First data is sorted
and the median is computed using the following formula:
If n is even, then
    median = [Value at the (n/2)th position + Value at the ((n/2) + 1)th position] / 2
If n is odd, then
    median = Value at the ((n+1)/2)th position
For this example, the sorted data is as follows:
You may please note that outliers, which are defined as values highly different from
most other values, can impact the mean but not the median. For example, if one
observation in the data shown in Example 2 is changed as shown, then the median will
still remain 14; however, the mean will change to 20.64, which is quite different from
the earlier mean. Thus, you should be careful about the presence of outliers during
data analysis.
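The effect of an outlier on the mean and the median can be checked with a few lines of Python. This is only a sketch: the observations of Example 2 are not reproduced in this extract, so the list below is hypothetical data chosen merely to illustrate the behaviour.

import statistics

values = [4, 9, 10, 14, 14, 18, 19, 25]        # hypothetical observations
print(statistics.mean(values), statistics.median(values))

# Replace the largest value with an extreme outlier: the mean shifts
# noticeably, while the median stays the same
values_with_outlier = values[:-1] + [120]
print(statistics.mean(values_with_outlier), statistics.median(values_with_outlier))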
Interestingly, the relationship between the mean and the median may be useful in
determining the nature of the data. The following table describes these conditions:
Mean << Median      The distribution may be left-skewed
Mode: The mode is defined as the most frequent value in a set of observations. For
example, in the data of Example 2, the value 14, which occurs twice, is the mode. The
mode need not be a mid-value; rather, it can be any of the observed values. It
just communicates the most frequently occurring value. In a frequency graph, the
mode is represented by the peak of the data. For example, in the graphs shown in
Figure 8, the value corresponding to each peak is the mode.
Standard Deviation: The standard deviation is one of the most used measures for finding
the spread or variability of data. (Try both of the formulas below on the data of
Example 2 and match the answer: 6.4.) It can be computed as:
For a sample:
    s = √( Σ (xᵢ − x̄)² / (n − 1) ),  with the sum taken over i = 1 to n
For the population:
    σ = √( Σ (xᵢ − μ)² / N ),  with the sum taken over i = 1 to N
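When the raw observations are available, these two formulas correspond to the ddof argument of NumPy's std function. The data below is hypothetical; the answer 6.4 quoted above refers to the data of Example 2, which is not reproduced in this extract.

import numpy as np

data = np.array([4, 9, 10, 14, 14, 18, 19, 25])   # hypothetical observations

s = np.std(data, ddof=1)       # sample standard deviation: divides by (n - 1)
sigma = np.std(data, ddof=0)   # population standard deviation: divides by N (NumPy's default)
print(round(s, 2), round(sigma, 2))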
5-Point Summary and Interquartile Range (IQR): For creating the 5-point summary, you
first need to sort the data. The five-point summary is defined as follows (using the
sorted data of Example 2):
    Minimum Value (Min)                        Min = 4
    1st Quartile <= 25% of values (Q1)         Q1 = (9+10)/2 = 9.5
    2nd Quartile is the median (M)             M = 14
    3rd Quartile <= 75% of values (Q3)         Q3 = (18+19)/2 = 18.5
    Maximum Value (Max)                        Max = 25
IQR is the difference between the 3rd and 1st quartile values:    IQR = 18.5 − 9.5 = 9
Figure 9: The Measure of Spread or Variability
The IQR can also be used to identify suspected outliers. In general, a suspected outlier
can exist in the following two ranges:
Observations/values less than Q1 − 1.5 × IQR
Observations/values more than Q3 + 1.5 × IQR
For Example 2, IQR is 9, therefore 1.5 × IQR = 13.5, and the suspected outliers would be:
Values < (9.5 − 13.5), i.e. Values < −4, or Values > (18.5 + 13.5), i.e. Values > 32.
Thus, there is no outlier in the initial data of Example 2.
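A minimal Python sketch of the 5-point summary and the IQR-based outlier check is given below. The data is hypothetical; note also that NumPy's default quartile interpolation may give values slightly different from the median-of-halves convention used in Figure 9.

import numpy as np

data = np.array([4, 9, 10, 14, 14, 18, 19, 25])   # hypothetical observations

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr        # suspected outliers lie below this value
upper = q3 + 1.5 * iqr        # or above this value
outliers = data[(data < lower) | (data > upper)]

print(data.min(), q1, median, q3, data.max(), iqr, outliers)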
For quantitative data, you may draw various plots, such as a histogram, box plot, etc.
These plots are explained in Unit 4 of this block.
1. As a first step, you may compute descriptive statistics of the various categorical and
quantitative variables of your data. Such information is very useful in
determining the suitability of the data for the purpose of analysis. This may also
help you in data cleaning, modification and transformation of data.
a. For the categorical data, you may create frequency tables and bar
charts to know the distribution of data among different categories. A
balanced distribution of data among categories is most desirable.
However, such a distribution may not be possible in actual situations.
Several methods have been suggested to deal with such situations;
some of those will be discussed in the later units.
b. For the quantitative data, you may compute the mean, median,
standard deviation, skewness and kurtosis. The kurtosis value relates
to the peakedness of the distribution. In addition, you may also
draw charts like histograms to look into the frequency distribution.
2. Next, after performing the univariate analysis, you may try to perform some
bivariate analysis. Some of the basic statistics that you can compute for
bivariate analysis include the following:
a. Make a two-way table between categorical variables and make related
stacked bar charts. You may also use chi-square tests to find any
significant relationships.
b. You may draw side-by-side box plots to check if the data of the various
categories differ.
c. You may draw a scatterplot and check the correlation coefficient, if one
exists, between two variables.
3. Finally, you may like to look into the possibilities of multivariate
relationships amongst the data. You may use dimensionality reduction through
techniques such as feature extraction or principal component analysis, you may
perform clustering to identify a possible set of classes in the solution space, or
you may use graphical tools, like bubble charts, to visualise the data. (A brief
illustration of these steps follows this list.)
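The following sketch, which assumes a small hypothetical data frame, shows how the univariate and bivariate steps listed above can be carried out with the pandas library.

import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "M", "F"],          # hypothetical student data
    "Height": [173, 160, 178, 167, 180, 162],
})

# Univariate summaries
print(df["Gender"].value_counts())      # frequency table of a categorical variable
print(df["Height"].describe())          # mean, standard deviation, quartiles

# Bivariate view: distribution of Height within each Gender category
print(df.groupby("Gender")["Height"].describe())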
It may be noted that exploratory data analysis helps in identifying possible
relationships amongst data, but it does not promise that a causal relationship exists
amongst the variables. A causal relationship has to be ascertained through qualitative
analysis. Let us explain exploratory data analysis with the help of an example.
Example 3: Consider the sample data of students given in Figure 6 about Gender and
Height. Let us explore this data for an analytical question: Does Height depend on
Gender?
You can perform exploratory analysis on this data by drawing side-by-side box
plots of the heights of male and female students. This box plot is shown in Figure 10.
Please note that the box plot of Figure 10 shows that, on average, the height of male
students is more than that of female students. Does this result apply, in general, to the
population? To answer this question, you need to find the probability of
occurrence of such sample data; therefore, inferential analysis may need to be performed.
You may have read about many of these tests in the Data Warehousing and Data Mining
and the Artificial Intelligence and Machine Learning courses. In addition, you may refer
to the further readings for these tools. The following example explains the importance of
inferential analysis.
Example 4: Figure 10 in Example 3 shows the box plot of the heights of male and female
students. Can you infer from the box plot and the sample data (Figure 6) whether there is
a difference in the height of male and female students?
In order to infer whether there is a difference between the heights of the two groups (male
and female students), a two-sample t-test was run on the data. The output of this t-test is
shown in Figure 12.
t-Test (two tail): Assuming Unequal Variances
Female Male
Mean 167 173
Variance 94.5 63.5
Observations 5 5
Computed t-value -1.07
p-value 0.32
Critical t-value 2.30
Figure 12 shows that the mean height of the female students is 167 cm, whereas for
the male students it is 173 cm. The variance for the female students is 94.5, whereas for
the male students it is 63.5. Each group is summarised on the basis of 5 observations. The
computed t-value is −1.07 and the p-value is 0.32. As the p-value is greater than 0.05,
you cannot conclude that the average male student height is different from the average
female student height.
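The t-test of Figure 12 can be reproduced from the summary statistics alone, for instance with SciPy's ttest_ind_from_stats function; the sketch below uses the means, variances and group sizes quoted in the figure.

import math
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=167, std1=math.sqrt(94.5), nobs1=5,    # female students (Figure 12)
    mean2=173, std2=math.sqrt(63.5), nobs2=5,    # male students (Figure 12)
    equal_var=False)                             # unequal variances (Welch's t-test)

print(round(t_stat, 2), round(p_value, 2))       # approximately -1.07 and 0.32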
The availability of large amounts of data and of advanced algorithms for mining and
analysing large data sets has paved the way for advanced predictive analysis. The predictive
analysis of today uses tools from artificial intelligence, machine learning, data
mining, data stream processing, data modelling, etc. to make predictions for the strategic
planning and policies of organisations. Predictive analysis uses large amounts of data
to identify potential risks and aid the decision-making process. It can be used in
several data-intensive industries like electronic marketing, financial analysis,
healthcare applications, etc. For example, in the healthcare industry, predictive
analysis may be used to determine the public health infrastructure
requirements of the future based on present health data.
Advancements in artificial intelligence, data modelling and machine learning have also led
to prescriptive analysis. Prescriptive analysis aims to take predictions one step
further and suggest solutions for present and future issues.
A detailed discussion on these topics is beyond the scope of this Unit. You may refer
to further readings for more information on these.
[Diagram: 'Study' causes both 'Attendance' and 'Marks'; the association observed between Attendance and Marks is not a causal one.]
Figure 13: Correlation does not mean causation
Data Dredging: Data dredging, as the name suggests, is extensive analysis of very
large data sets. Such analysis results in the generation of a large number of data
associations. Many of those associations may not be causal and thus require further
exploration through other techniques. Therefore, it is essential that every data
association found in a large data set be investigated further before reporting it as a
conclusion of the study.
1.6 APPLICATIONS OF DATA SCIENCE
Data science is useful in analysing large data sets to produce useful information that
can be used for business development and can help in the decision-making process. This
section highlights some of the applications of data science.
In general, data science can be used for the benefit of society. It should be used
creatively to improve effective resource utilisation, which may lead to sustainable
development. The ultimate goal of data science applications should be to help us
protect our environment and human welfare.
1.7 DATA SCIENCE LIFE CYCLE
So far, we have discussed various aspects of data science in the previous
sections. In this section, we discuss the life cycle of a data science based
application. In general, the development of a data science application may involve the
following stages:
Thus, in general, a data science project follows a spiral of development. This is shown
in Figure 16.
[Figure 16: The spiral of data science project development, cycling through phases such as requirements analysis, data collection and preparation, and model deployment and refinement.]
1.8 SUMMARY
This unit introduces basic statistical and analytical concepts of data science. It
first introduces you to the definition of data science. Data science as a discipline
uses concepts from computing, mathematics and domain knowledge. The types of
data for data science are defined in two different ways: first, on the basis of the
structure and generation rate of the data, and next as the measurement scales that can
be used to capture the data. In addition, the concept of sampling has been defined in
this unit.
This unit also explains some of the basic methods used for analysis, which include
descriptive, exploratory, inferential and predictive analysis. A few interesting
misconceptions related to data science have also been explained with the help of
examples. This unit also introduces you to some of the applications of data science
and the data science life cycle. With ever-advancing technology, it is suggested that
you keep reading about newer data science applications.
1.9 SOLUTIONS/ANSWERS
1. Box plots show the 5-point summary of data. A well-spread box plot is an
indicator of normally distributed data. Side-by-side box plots can be
used to compare the scale data values of two or more categories.
2. Inferential analysis also computes a p-value, which determines whether the
results obtained by exploratory analysis are significant enough that the
results may be applicable to the population.
3. Simpson's paradox signifies that statistics computed on grouped data may
sometimes produce results that are contrary to the same statistics applied to
ungrouped data.
UNIT 2 PROBABILITY AND STATISTICS FOR
DATA SCIENCE
2.0 Introduction
2.1 Objectives
2.2 Probability
2.2.1 Conditional Probability
2.2.2 Bayes Theorem
2.3 Random Variables and Basic Distributions
2.3.1 Binomial Distribution
2.3.2 Probability Distribution of Continuous Random Variable
2.3.3 The Normal Distribution
2.4 Sampling Distribution and the Central Limit Theorem
2.5 Statistical Hypothesis Testing
2.5.1 Estimation of Parameters of the Population
2.5.2 Significance Testing of Statistical Hypothesis
2.5.3 Example using Correlation and Regression
2.5.4 Types of Errors in Hypothesis Testing
2.6 Summary
2.7 Solution/Answers
2.0 INTRODUCTION
In the previous unit of this block, you were introduced to the basic concepts of
data science, which include the basic types of data, basic methods of data
analysis, and the applications and life cycle of data science. This unit introduces
you to the basic concepts of probability and statistics related to data
science.
It introduces the concept of conditional probability and Bayes theorem. This is
followed by a discussion on basic probability distributions, highlighting their
significance and use. These distributions include the Binomial and Normal
distributions, the two most used distributions for discrete and continuous
variables respectively. The unit also introduces you to the concept of the sampling
distribution and the central limit theorem. Finally, this unit covers the concepts of
statistical hypothesis testing with the help of an example of correlation. You
may refer to the further readings for more details on these topics, if needed.
2.1 OBJECTIVES
2.2 PROBABILITY
In the equation above, the set of all possible outcomes is also called the sample
space. In addition, it is expected that all the outcomes are equally likely to occur.
Consider that you decide to roll two fair dice together at the same time. Will
the outcome of the first die affect the outcome of the second die? It will not, as
both the outcomes are independent of each other. In other words, two trials
are independent if the outcome of the first trial does not affect the outcome of
the second trial and vice-versa; otherwise the trials are dependent trials.
How do you compute the probability of more than one event in a sample space? Let
us explain this with the help of an example.
Example 1: A fair die having six equally likely outcomes is to be thrown, then:
(i) What is the sample space? {1, 2, 3, 4, 5, 6}
(ii) An event A is: the die shows 2. The outcome of event A is {2}; and its
probability P(A) = 1/6.
(iii) An event B is: the die shows an odd face. Event B is {1, 3, 5}; and the
probability of event B is P(B) = 3/6 = 1/2.
(iv) An event C is: the die shows an even face. Event C is {2, 4, 6}; and the
probability of event C is P(C) = 3/6 = 1/2.
(v) Events A and B are disjoint events, as no outcome is common between them.
So are events B and C. But events A and C are not disjoint.
(vi) The intersection of events A and B is the null set {}, as they are disjoint
events; therefore, the probability that events A and B both occur,
viz. P(A ∩ B), is 0. However, the intersection of A and C is {2}, therefore
P(A ∩ C) = 1/6.
(vii) The union of events A and B is {1, 2, 3, 5}; therefore, the
probability that event A or event B occurs, viz. P(A ∪ B), is 4/6 = 2/3.
Whereas the union of events B and C is {1, 2, 3, 4, 5, 6}, therefore
P(B ∪ C) = 6/6 = 1.
Please note that the following formula can be derived from the above example.
Probability of occurrence of any of the two events A or B (also called union of
events) is:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵) (2)
For example 1, you may compute the probability of occurrence of event A or C
as:
𝑃(𝐴 ∪ 𝐶) = 𝑃(𝐴) + 𝑃(𝐶) − 𝑃(𝐴 ∩ 𝐶)
= 1/6 + 1/2 – 1/6 = 1/2.
In the case of disjoint events, since 𝑃(𝐴 ∩ 𝐵) is zero, therefore, the equation
(2) will reduce to:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) (3)
Consider two events X and Y with probabilities of occurrence P(X) and
P(Y) respectively. What would be the probability of occurrence of X if
the other event Y has actually occurred?
Let us analyse the problem further. Since the event Y has already occurred,
the sample space reduces to the sample space of event Y. In addition, the
possible outcomes for the occurrence of X could be the outcomes at the
intersection of X and Y, as that is the only part of X which lies within the sample
space of Y. Figure 1 shows this with the help of a Venn diagram.
[Venn diagram: once event Y has occurred, the sample space reduces to Y, and the possible outcomes of X are those in the intersection of X and Y.]
Figure 1: The conditional probability of event X given that event Y has occurred
You can compute the conditional probability using the following equation.
P(X/Y) = P(X ∩ Y) / P(Y)          (5)
Where P(X/Y) is the conditional probability of occurrence of event X, if event
Y has occurred.
For example, in example 1, what is the probability of occurrence of event A, if
event C has occurred?
You may please note that P(A ∩ C) = 1/6 and P(C) = 1/2; therefore, the conditional
probability P(A/C) would be:
P(A/C) = P(A ∩ C) / P(C) = (1/6) / (1/2) = 1/3
What would be the conditional probability of disjoint events? You may find the
answer by computing P(A/B) for Example 1.
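For small, finite sample spaces such as the die of Example 1, the conditional probability of equation (5) can be computed directly, as in the short Python sketch below.

sample_space = {1, 2, 3, 4, 5, 6}
A = {2}              # the die shows 2
C = {2, 4, 6}        # the die shows an even face

def p(event):
    return len(event) / len(sample_space)

# P(A/C) = P(A ∩ C) / P(C)
print(p(A & C) / p(C))      # 0.333..., i.e. 1/3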
Independent events are a special case for conditional probability. As two independent
events do not affect each other, the occurrence of any one of the events does not change
the probability of occurrence of the second event. Therefore, for independent events X
and Y:
P(X/Y) = P(X) and P(Y/X) = P(Y)          (7)
In fact, equation (7) can be used to determine whether two events are independent.
Bayes theorem is one of the important theorems dealing with conditional
probability. Mathematically, Bayes theorem can be written using equation (6) as:
P(X/Y) × P(Y) = P(Y/X) × P(X)
or P(X/Y) = P(Y/X) × P(X) / P(Y)          (8)
Example 3: Assume that you have two bags, namely Bag A and Bag B. Bag A
contains 5 green and 5 red balls, whereas Bag B contains 3 green and 7 red
balls. Assume that you have drawn a red ball; what is the probability that this
red ball was drawn from Bag B?
In this example,
let the event X be “Drawing a Red Ball”. The probability of drawing a red ball
can be computed as follows:
You may select a bag and then draw a ball. Therefore, the probability
will be computed as:
(Probability of selection of Bag A) × (Probability of selection of a red ball
from Bag A) + (Probability of selection of Bag B) × (Probability of
selection of a red ball from Bag B)
P(Red) = (1/2 × 5/10 + 1/2 × 7/10) = 3/5
Let the event Y be “Selection of Bag B from the two bags”, assuming equally
likely selection of the bags. Therefore, P(BagB) = 1/2.
In addition, if Bag B is selected, then the probability of drawing a red ball is
P(Red/BagB) = 7/10, as Bag B has already been selected and it has 3 green and
7 red balls.
As per the Bayes Theorem:
P(BagB/Red) = P(Red/BagB) × P(BagB) / P(Red)
P(BagB/Red) = (7/10 × 1/2) / (3/5) = 7/12
Bayes theorem is a powerful tool to revise your estimate provided a given
event has occurred. Thus, you may be able to change your predictions.
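A small Python sketch of Example 3 follows; it simply encodes the total probability of drawing a red ball and then applies equation (8).

p_bag_a, p_bag_b = 0.5, 0.5         # the two bags are equally likely to be chosen
p_red_given_a = 5 / 10              # Bag A: 5 green and 5 red balls
p_red_given_b = 7 / 10              # Bag B: 3 green and 7 red balls

# Total probability of drawing a red ball
p_red = p_bag_a * p_red_given_a + p_bag_b * p_red_given_b

# Bayes theorem: P(BagB/Red)
p_bag_b_given_red = p_red_given_b * p_bag_b / p_red
print(p_red, p_bag_b_given_red)     # 0.6 (= 3/5) and 0.5833... (= 7/12)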
1. Is P(X/Y) = P(Y/X)?
2. How can you use probabilities to find if two events are independent?
3. The MCA batches of University A and University B consist of 20 and
30 students respectively. University A has 10 students who have
obtained more than 75% marks, and University B has 20 such students.
A recruitment agency selects one of the students who has more than
75% marks from the two universities. What is the probability that the
selected student is from University A?
Using the data of Figure 2, you can create the following frequency table, which
can also be converted to probability.
X Frequency Probability P(X)
0 1 1/8
1 3 3/8
2 3 3/8
3 1 1/8
Total 8 Sum of all P(X) = 1
Figure 3: The Frequency and Probability of Random Variable X
[Bar chart: probability P(X) against the number of heads (X) in three coin tosses.]
Another important value defined for a probability distribution is the mean or
expected value, which is computed for a random variable X using the following
equation:
μ = Σ (xᵢ × pᵢ),  with the sum taken over i = 0 to n          (9)
Thus, the mean or expected number of heads in three trials would be:
μ = x₀ × p₀ + x₁ × p₁ + x₂ × p₂ + x₃ × p₃
μ = 0 × 1/8 + 1 × 3/8 + 2 × 3/8 + 3 × 1/8 = 12/8 = 1.5
Therefore, in a trial of 3 tosses of a coin, the mean number of heads is 1.5.
μ = n × s          (12a)
σ = √(n × s × (1 − s))          (12b)
Therefore, for the variable X which represents number of heads in three tosses
of coin, the mean and standard deviation are:
μ = n × s = 3 × 1/2 = 1.5
σ = √(n × s × (1 − s)) = √(3 × 1/2 × (1 − 1/2)) = √3 / 2
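The same probabilities, mean and standard deviation can be obtained from SciPy's binomial distribution object, as the short sketch below shows for n = 3 tosses with s = 1/2.

from scipy.stats import binom

n, s = 3, 0.5

print(binom.pmf([0, 1, 2, 3], n, s))    # 1/8, 3/8, 3/8, 1/8
print(binom.mean(n, s))                 # 1.5, the same as n * s
print(binom.std(n, s))                  # sqrt(3)/2 ≈ 0.866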
[Histogram: frequency of 'Height' of 100 students, grouped into 5 cm intervals from [140, 145] to (190, 195]; the tallest bar, with frequency 27, corresponds to the (165, 170] interval.]
Figure 5: Histogram of Height of 100 students of a Class
The mean of the heights was 166 and the standard deviation was about 10. The
probability that a student's height is in the 165 to 170 interval is 0.27.
[Normal curve: the area under the curve to the left of μ + 1.3σ is 0.9032.]
Figure 7: Computing Probability using Normal Distribution
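The area shown in Figure 7 is the cumulative probability of the standard normal distribution at z = 1.3; SciPy can compute it directly, and can also give a normal approximation for the height data (mean 166, standard deviation about 10).

from scipy.stats import norm

print(norm.cdf(1.3))     # ≈ 0.9032, the area to the left of μ + 1.3σ in Figure 7

# Normal approximation of P(165 < height < 170); the observed relative
# frequency in the histogram of Figure 5 was 0.27
print(norm.cdf(170, loc=166, scale=10) - norm.cdf(165, loc=166, scale=10))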
With the basic introduction given above, we next discuss one of the important
aspects of samples and populations, called the sampling distribution. A typical
statistical experiment may be based on a specific sample of data collected by the
researcher; such data is termed primary data. The question is: can the statistical
results obtained using the primary data be applied to the population? If yes, what
may be the accuracy of such an inference? To answer this question, you must study
the sampling distribution.
Sampling distribution is also a probability distribution, however, this
distribution shows the probability of choosing a specific sample from the
population. In other words, a sampling distribution is the probability distribution
of means of the random samples of the population. The probability in this
distribution defines the likelihood of the occurrence of the specific mean of the
sample collected by the researcher. Sampling distribution determines whether
the statistics of the sample falls closer to population parameters or not. The
following example explains the concept of sampling distribution in the context
of a categorical variable.
Example 5: Consider a small population of just 5 persons, who vote on the question
“Should Data Science be made a core course in Computer Science? (Yes/No)”. The
following table shows the population:
Suppose, you take a sample size (n) = 3, and collects random sample. The
following are the possible set of random samples:
Sample Sample Proportion (𝑝̂ )
P1, P2, P3 0.67
P1, P2, P4 0.67
P1, P2, P5 0.67
P1, P3, P4 0.33
P1, P3, P5 0.33
P1, P4, P5 0.33
P2, P3, P4 0.33
P2, P3, P5 0.33
P2, P4, P5 0.33
P3, P4, P5 0.00
Frequency of all the sample proportions is:
𝑝̂ Frequency
0 1
0.33 6
0.67 3
[Bar chart: frequency of each sample proportion – 1 sample with proportion 0, 6 samples with proportion 0.33, and 3 samples with proportion 0.67.]
Please notice the shape of the distribution of the sample proportions: it looks close
to a Normal distribution curve. In fact, you can verify this by creating an
example with 100 data points and a sample size of 30.
mean proportion = p          (14a)
Standard Deviation = √(p × (1 − p) / n)          (14b)
Suppose, you take a sample size (n) = 3, and collects random sample. The
following are the possible set of random samples:
Sample Sample Mean (𝑥̅ )
P1, P2, P3 25
P1, P2, P4 26.67
P1, P2, P5 28.33
P1, P3, P4 28.33
P1, P3, P5 30
P1, P4, P5 31.67
P2, P3, P4 30
P2, P3, P5 31.67
P2, P4, P5 33.33
P3, P4, P5 35
The mean of all these sample means = 30, which is same as population mean μ.
The histogram of the data is shown in Figure 12.
[Histogram of the sample means, grouped into intervals from [25, 26.5] to (34, 35.5].]
Given a sample size n and population mean μ, then the sampling distribution for
the given sample size would fulfil the following:
𝑀𝑒𝑎𝑛 𝑜𝑓 𝑠𝑎𝑚𝑝𝑙𝑒 𝑚𝑒𝑎𝑛𝑠 = 𝜇 (15a)
Standard Deviation of Sample Means = σ / √n          (15b)
Therefore, the z-score computation for sampling distribution will be as per the
following equation:
Note: You can obtain this equation from equation (13), as this is a
distribution of means, therefore, x of equation (13) is 𝑥̅ , and standard
deviation of sampling distribution is given by equation (15b).
z = (x̄ − μ) / (σ / √n)          (15c)
Please note that the histogram of the mean of samples is close to normal
distribution.
Such experimentation led to the Central Limit Theorem, which proposes the following:
Central Limit Theorem: Assume that a sample of size n is drawn from a population that
has mean μ and standard deviation σ. The central limit theorem states that with the
increase in n, the sampling distribution, i.e. the distribution of the means of the samples,
approaches closer to a normal distribution.
However, it may be noted that the central limit theorem is applicable only if you have
collected independent random samples, where the size of the sample is sufficiently large,
yet less than 10% of the population. Therefore, Example 5 and Example 6 are
not true representations of the theorem; rather they are given to illustrate the concept.
Further, it may be noted that the central limit theorem does not put any constraint on
the distribution of the population. Equation (15) is a result of the central limit theorem.
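The behaviour described by the central limit theorem can be observed through simulation. The sketch below assumes a deliberately non-normal (exponential) population; the mean of the sample means stays close to the population mean, and their spread stays close to σ/√n.

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)   # a non-normal population

n = 50                                                 # sample size
sample_means = [rng.choice(population, size=n, replace=False).mean()
                for _ in range(2_000)]

print(population.mean(), np.mean(sample_means))                  # both close to 10
print(population.std() / np.sqrt(n), np.std(sample_means))       # σ/√n vs. observed spread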
3. What would be the mean and standard deviation for the random variable of
Question 2.
4. What is the mean and standard deviation for standard normal distribution?
5. A country has a population of 1 billion, out of which 1% are students of
class 10. A representative sample of 10000 students of class 10 were asked the
question “Is Mathematics difficult or easy?”. Assuming that the population
proportion for this question was reported to be 0.36, what would be the
standard deviation of the sampling distribution?
Figure 13: Confidence Level 95% for a confidence interval (non-shaded area).
Since you have selected a confidence level of 95%, you are expecting that the
proportion of the sample (p̂) can be in the interval (population proportion (p) −
2 × (Standard Deviation)) to (population proportion (p) + 2 × (Standard
Deviation)), as shown in Figure 13. The probability of occurrence of p̂ in this
interval is 95% (please refer to Figure 6). Therefore, the confidence level is 95%.
In addition, note that you do not know the value of p; that is what you are
estimating, therefore you would be computing p̂. You may observe in Figure
13 that the value of p will be in the interval (p̂ − 2 × (Standard Deviation)) to (p̂
+ 2 × (Standard Deviation)). The standard deviation of the sampling distribution
can be computed using equation (14b). However, as you are estimating the value
of p, you cannot compute the exact value of the standard deviation.
Rather, you can compute the standard error, which is
computed by estimating the standard deviation using the sample proportion (p̂),
with the following formula:
Standard Error (StErr) = √( p̂ × (1 − p̂) / n )
Therefore, the confidence interval is estimated as (p̂ − 2 × StErr) to (p̂ + 2 × StErr).
In general, for a specific confidence level, you can use a specific z-score
instead of 2. Therefore, the confidence interval, for large n, is: (p̂ − z × StErr) to
(p̂ + z × StErr).
In practice, you may use confidence levels of 90%, 95% or 99%. The z-scores
used for these confidence levels are 1.65, 1.96 (not 2) and 2.58 respectively.
Example 7: Consider the statement S1 of this section and estimate the
confidence interval for the given data.
For the sample the probability that class 12th students play some sport is:
𝑝̂ = 405/1000=0.405
The sample size (n) = 1000
StErr = √( p̂ × (1 − p̂) / n ) = √( 0.405 × (1 − 0.405) / 1000 ) = 0.016
Therefore, the confidence interval for the confidence level 95% would be:
(0.405 − 1.96 × 0.016) to (0.405 + 1.96 × 0.016)
0.374 to 0.436
Therefore, with a confidence of 95%, you can state that the proportion of students of
class 12 who play some sport is in the range 37.4% to 43.6%.
How can you reduce the size of this interval? You may observe that
StErr is inversely dependent on the square root of the sample size. Therefore,
you would have to increase the sample size to approximately 4 times to reduce the
standard error to approximately half.
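A small sketch of the confidence-interval computation of Example 7 is given below; minor differences from the figures 0.374 and 0.436 quoted above arise only from rounding the standard error.

import math

p_hat, n = 405 / 1000, 1000       # sample proportion and sample size from statement S1
z = 1.96                          # z-score for a 95% confidence level

st_err = math.sqrt(p_hat * (1 - p_hat) / n)
print(round(st_err, 3))                                   # ≈ 0.016
print(p_hat - z * st_err, p_hat + z * st_err)             # ≈ 0.374 to 0.436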
Confidence Interval to estimate mean
You can find the confidence interval for estimating the mean in a similar manner
as you did for the case of proportions. However, in this case you need to
estimate the standard error of the estimated mean using the variation of equation
(15b), as follows:
Standard Error in Sample Mean = s / √n, where s is the standard deviation of the sample
Example 8: The following table lists the height of a sample of 100 students of
class 12 in centimetres. Estimate the average height of students of class 12.
170 164 168 149 157 148 156 164 168 160
149 171 172 159 152 143 171 163 180 158
167 168 156 170 167 148 169 179 149 171
164 159 169 175 172 173 158 160 176 173
159 160 162 169 168 164 165 146 156 170
163 166 150 165 152 166 151 157 163 189
176 185 153 181 163 167 155 151 182 165
189 168 169 180 158 149 164 171 189 192
171 156 163 170 186 187 165 177 175 165
167 185 164 156 143 172 162 161 185 174
The sample mean and sample standard deviation are computed and shown
below:
Sample Mean (x̄) = 166; Standard Deviation of sample (s) = 11
Therefore, the confidence interval of the mean height of the
students of class 12 can be computed as:
Mean height (x̄) = 166
The sample size (n) = 100
Standard Error in Sample Mean = 11 / √100 = 1.1
The confidence interval for the confidence level 95% would be:
(166 − 1.96 × 1.1) to (166 + 1.96 × 1.1)
163.8 to 168.2
Thus, with a confidence of 95%, you can state that the average height of class 12
students is between 163.8 and 168.2 centimetres.
You may please note that in Example 8 we have used the t-distribution for means,
as we have used the sample's standard deviation rather than the population standard
deviation. The t-distribution is slightly wider (heavier-tailed) than the z-
distribution. The t-value is computed in the context of the sampling distribution by
the following equation:
t = (x̄ − μ) / (s / √n)          (16)
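Using the summary values of Example 8, the t-based confidence interval can be computed as sketched below; with n = 100 the critical t-value is close to 1.96, so the result matches the interval obtained above.

import numpy as np
from scipy import stats

x_bar, s, n = 166, 11, 100              # sample mean, sample SD and size from Example 8
st_err = s / np.sqrt(n)                 # 1.1

t_crit = stats.t.ppf(0.975, df=n - 1)   # 95% confidence, n-1 degrees of freedom
print(x_bar - t_crit * st_err, x_bar + t_crit * st_err)   # ≈ 163.8 to 168.2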
In this section, we will discuss how to test statement S2, given in
Section 2.5. A number of experimental studies are conducted in statistics with
the objective of inferring whether the data supports a hypothesis or not. Significance
testing may involve the following phases:
1. Testing pre-conditions on the data:
Prior to performing the test of significance, you should check the pre-conditions
of the test. Most statistical tests require random sampling, a large data size for
each possible category being tested, and a normal distribution of the
population.
2. Making the statistical hypotheses: You make statistical hypotheses about the
parameters of the population. There are two basic hypotheses in statistical
testing – the null hypothesis and the alternative hypothesis.
Null Hypothesis: The null hypothesis either defines a particular value for the
parameter or specifies that there is no difference or no change in the specified
parameters. It is represented as H0.
Alternative Hypothesis: The alternative hypothesis specifies the values or the difference
in parameter values. It is represented as either H1 or Ha. We use the convention
Ha.
For example, for statement S2 of Section 2.5, the two hypotheses would be:
H0: There is no effect of hours of study on the marks percentage of 12th class.
Ha: The marks of class 12 students improve with the hours of study of the student.
Please note that the hypothesis above is one-sided, as your assumption is that the
marks would increase with hours of study. The second one-sided hypothesis may
relate to a decrease in marks with hours of study. However, in most cases the
hypothesis will be two-sided, which just claims that one variable will cause a
difference in the second. For example, the two-sided hypothesis for statement S2
would be that the hours of study of students make a difference (they may either
increase or decrease the marks) to the marks of students of class 12. In general,
one-sided tests are called one-tailed tests and two-sided tests are called two-tailed tests.
In order to find such a relationship, you may like to perform basic exploratory
analysis. In this case, let us make a scatter plot between the two variables, taking
wsh as the independent variable and mp as the dependent variable. This scatter plot
is shown in Figure 16.
Figure 16: Scatter plot of Weekly Study Hours vs. Marks Percentage.
The scatter plot of Figure 16 suggests that the two variables may be associated.
But how to determine the strength of this association? In statistics, you use
Correlation, which may be used to determine the strength of linear association
between two quantitative variables. This is explained next.
On performing regression analysis on the observed data of Example 9, the
statistics as shown in Figure 18 is generated.
Regression Statistics
Multiple R 0.9577
R Square 0.9172
Adjusted R Square 0.9069
Standard Error 4.1872
Observations 10.0000
ANOVA
df SS F Significance F
Regression 1.0000 1554.1361 88.6407 0.0000
Residual 8.0000 140.2639
Total 9.0000 1694.4000
• The term “Multiple R” in the Regression Statistics defines the correlation
between the dependent variable (say y) and the set of independent or
explanatory variables in the regression model. Thus, Multiple R is similar
to the correlation coefficient (r), except that it is used when multiple
regression is used. Most software expresses the results in terms of
Multiple R, instead of r, to represent the regression output. Similarly, R
Square is used in multiple regression, instead of r². The proposed model
has a large R Square and can therefore be considered for deployment.
You can go through further readings for more details on all the terms discussed
above.
Figure 19 shows the regression line for the data of Example 9. You may please
observe that a residual is the vertical difference between the observed Marks Percentage
and the predicted marks percentage. These residuals are shown in Figure 20.
[Figure 19: Scatter plot of Weekly Study Hours (wsh) versus Marks Percentage (mp) with the fitted regression line.]
[Figure 20: Residuals plotted against Weekly Study Hours (wsh).]
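A simple linear regression such as the one summarised in Figure 18 can be fitted with SciPy. The ten (wsh, mp) observations of Example 9 are not reproduced in this extract, so the lists below are hypothetical placeholders used only to show the call.

from scipy import stats

wsh = [4, 6, 8, 9, 11, 13, 15, 17, 19, 21]      # hypothetical weekly study hours
mp  = [45, 52, 55, 60, 63, 68, 72, 78, 82, 88]  # hypothetical marks percentages

result = stats.linregress(wsh, mp)
print(result.slope, result.intercept)    # fitted line: mp = slope * wsh + intercept
print(result.rvalue ** 2)                # R Square of the fitted model
print(result.pvalue)                     # significance of the slope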
In Section 2.5.1 and Section 2.5.2, we discussed testing the null
hypothesis. You either reject the null hypothesis and accept the alternative
hypothesis, based on the computed probability or p-value, or you fail to reject
the null hypothesis. The decisions in such hypothesis testing would be:
• You reject the null hypothesis for a confidence level of 95% based on the p-
value, which lies in the shaded portion, that is, p-value < 0.05 for a two-tailed
hypothesis (that is, both the shaded portions in Figure 15, each with an area of
probability 0.025). Please note that in the case of a one-tailed test, you would
consider only one shaded area of Figure 15; therefore, you would be
considering p-value < 0.05 in only one of the two shaded areas.
• You fail to reject the null hypothesis for a confidence level of 95% when the p-
value > 0.05.
The two decisions stated above could be incorrect, as you are considering a
confidence level of 95%. The following figure shows this situation.
                         Final Decision
The Actual Scenario      H0 is rejected, that is, you have     You fail to reject H0, as you do not
                         accepted the alternative hypothesis   have enough evidence to accept the
                                                               alternative hypothesis
H0 is True               This is called a TYPE-I error         You have arrived at a correct decision
H0 is False              You have arrived at a correct         This is called a TYPE-II error
                         decision
For example, assume that a medicine is tested for a disease and this medicine is
NOT a cure for the disease. You would make the following hypotheses:
H0: The medicine has no effect on the disease.
Ha: The medicine improves the condition of the patient.
However, if the data is such that for a confidence level of 95% the p-value is
computed to be less than 0.05, then you will reject the null hypothesis, which is a
Type-I error. The chance of a Type-I error at this confidence level is 5%.
This error would mean that the medicine will get approval, even though it has
no effect on curing the disease.
However, now assume that a medicine is tested for a disease and this medicine
is a cure for the disease. The hypotheses remain the same as above. However,
if the data is such that for a confidence level of 95% the p-value is computed
to be more than 0.05, then you will not be able to reject the null hypothesis,
which is a Type-II error. This error would mean that a medicine which can cure
the disease will not be accepted.
2. The weights of 20 students, in kilograms, are given in the following table:
65 75 55 60 50 59 62 70 61 57
62 71 63 69 55 51 56 67 68 60
Estimate the average weight of the student population.
3. A class of 10 students were given a validated test prior and after completing
a training course. The marks of the students in those tests are given as under:
Marks before Training (mbt) 56 78 87 76 56 60 59 70 61 71
Marks after training (mat) 55 79 88 90 87 75 66 75 66 78
With a confidence level of 95%, can you say that the training course was useful?
2.6 SUMMARY
This unit introduces you to the basic probability and statistics related to data
science. The unit first introduces the concept of conditional probability, which
defines the probability of an event given that a specific event has occurred. This is
followed by a discussion on Bayes theorem, which is very useful in finding
conditional probabilities. Thereafter, the unit explains the concept of discrete
and continuous random variables. In addition, the Binomial distribution and the
Normal distribution are also explained. Further, the unit explains the
concept of the sampling distribution and the central limit theorem, which form the
basis of statistical analysis. The unit also explains the use of confidence
levels and intervals for estimating the parameters of the population. Further, the
unit explains the process of significance testing by taking an example related to
correlation and regression. Finally, the unit explains the concept of errors in
hypothesis testing. You may refer to the further readings for more details on these
concepts.
2.7 SOLUTION/ANSWERS
P(UniA/StDis) = P(StDis/UniA) × P(UniA) / P(StDis) = (1/2 × 1/2) / (7/12) = 3/7
Check Your Progress 2
1. As the probability of getting the even number (E) or odd number (O) is equal
in each two of dice, the following eight outcomes may be possible:
Outcomes EEE EEO EOE EOO OEE OEO OOE OOO
Number of 3 2 2 1 2 1 1 0
times Even
number appears
(X)
Therefore, the probability distribution would be:
X Frequency Probability P(X)
0 1 1/8
1 3 3/8
2 3 3/8
3 1 1/8
Total 8 Sum of all P(X) = 1
2. This can be determined by using the Binomial distribution with X = 0, 1, 2, 3 and
4, as follows (s and f are both 1/2):
P(X = 0) or p0 = ⁴C₀ × s⁰ × f⁴ = [4! / (0!(4−0)!)] × (1/2)⁰ × (1/2)⁴ = 1/16
P(X = 1) or p1 = ⁴C₁ × s¹ × f³ = [4! / (1!(4−1)!)] × (1/2)¹ × (1/2)³ = 4/16
P(X = 2) or p2 = ⁴C₂ × s² × f² = [4! / (2!(4−2)!)] × (1/2)² × (1/2)² = 6/16
P(X = 3) or p3 = ⁴C₃ × s³ × f¹ = [4! / (3!(4−3)!)] × (1/2)³ × (1/2)¹ = 4/16
P(X = 4) or p4 = ⁴C₄ × s⁴ × f⁰ = [4! / (4!(4−4)!)] × (1/2)⁴ × (1/2)⁰ = 1/16
4. Analysis of results: The one tail p-value suggests that you reject the
null hypothesis. The difference in the means of the two results is
significant enough to determine that the scores of the student have
improved after the training.
UNIT 3 DATA PREPARATION FOR ANALYSIS
3.0 Introduction
3.1 Objectives
3.2 Need for Data Preparation
3.3 Data preprocessing
3.3.1 Data Cleaning
3.3.2 Data Integration
3.3.3 Data Reduction
3.3.4 Data Transformation
3.4 Selection and Data Extraction
3.5 Data Curation
3.5.1 Steps of Data Curation
3.5.2 Importance of Data Curation
3.6 Data Integration
3.6.1 Data Integration Techniques
3.6.2 Data Integration Approaches
3.7 Knowledge Discovery
3.8 Summary
3.9 Solutions/Answers
3.10 Further Readings
3.0 INTRODUCTION
In the previous unit of this block, you were introduced to the basic concepts of
conditional probability, Bayes theorem and probability distributions, including
the Binomial and Normal distributions. The unit also introduced you to the concepts
of the sampling distribution, the central limit theorem and statistical hypothesis
testing. This unit introduces you to the process of data preparation for data
analysis. Data preparation is one of the most important processes, as it leads to
good quality data, which in turn results in accurate data analysis. This
unit covers data selection, cleaning, curation, integration, and knowledge
discovery from the stated data. In addition, this unit gives you an overview of
data quality and of how data is prepared for analysis. You may refer to the further
readings for more details on these topics.
3.1 OBJECTIVES
3.2 NEED FOR DATA PREPARATION
In the present time, data is one of the key resources for a business. Data is
processed to create information; information is integrated to create knowledge.
Since knowledge is power, data has evolved into a modern currency, which is
valued and traded between parties. Everyone wants to discuss the knowledge
and benefits they can gain from data. There is a reason why data is one of the most
significant resources available to marketers, agencies, publishers, media firms, and
others today. But only high-quality data is useful. We can determine a data
set's reliability and suitability for decision-making by looking at its quality, which
is frequently gauged in degrees. The usefulness of the data for
the intended purpose and its completeness, accuracy, timeliness, consistency,
validity, and uniqueness are used to determine the data's quality. In simpler
terms, data quality refers to how accurate and helpful the data are for the task at
hand. Further, data quality also refers to the actions that apply the necessary
quality management procedures and methodologies to make sure the data is
useful and actionable for the data consumers. A wide range of elements,
including accuracy, completeness, consistency, timeliness, uniqueness, and
validity, influence data quality. Figure 1 shows the basic factors of data quality,
and a short illustration of how some of these factors can be checked follows the
list of factors below.
[Figure 1: Factors of data quality – accuracy, completeness, consistency, timeliness, uniqueness, and validity.]
• Accuracy - The data must be true and reflect events that actually take
place in the real world. Accuracy measures determine how closely the
figures agree with the verified right information sources.
• Completeness - The degree to which the data is complete determines
how well it can provide the necessary values.
• Consistency - Data consistency is the homogeneity of the data across
applications, networks, and when it comes from several sources. For
example, identical datasets should not conflict if they are stored in
different locations.
2
Basics of Data Science
• Timeliness - Data that is timely is readily available whenever it is
needed. The timeliness factor also entails keeping the data accurate; to
make sure it is always available and accessible and updated in real-time.
• Uniqueness - Uniqueness is defined as the lack of duplicate or redundant
data across all datasets. The collection should contain zero duplicate
records.
• Validity - Data must be obtained in compliance with the firm's defined
business policies and guidelines. The data should adhere to the
appropriate, recognized formats, and all dataset values should be within
the defined range.
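Several of these factors can be checked programmatically. The sketch below assumes a small hypothetical customer table and uses pandas to measure completeness, uniqueness and validity.

import pandas as pd

df = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],                  # hypothetical customer ids
    "income":  [50000, None, 42000, 61000],           # one value is missing
})

print(df["income"].isna().mean())                     # completeness: share of missing values
print(df["cust_id"].duplicated().sum())               # uniqueness: number of duplicate ids
print(df["income"].dropna().between(0, 10_000_000).all())   # validity: values within an allowed range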
Consider yourself a manager at a company, say XYZ Pvt Ltd, who has been tasked
with researching the sales statistics for a specific organization, say ABC. You
immediately get to work on this project by carefully going through the ABC
company's database and data warehouse for the parameters or dimensions (such
as the product, price, and units sold), which may be used in your study. However,
your enthusiasm suffers a major problem when you see that several of the
attributes for different tuples do not have any recorded values. You want to
incorporate the information in your study on whether each item purchased was
marked down, but you find that this data has not been recorded. Furthermore, users
of this database system report that the data recorded for some transactions
contained mistakes, such as out-of-range values and other anomalies.
3.3 DATA PREPROCESSING
Preprocessing is the process of taking raw data and turning it into information
that may be used. Data cleaning, data integration, data reduction and data
transformation, and data discretization are the main phases of data preprocessing
(see Figure 2).
(Figure 2: Main phases of data preprocessing — data cleaning, data integration, data transformation, and data reduction.)
Data cleaning routines attempt to fill in missing values and to smooth out
noise in the data. The basic methods are described below.
a. Missing Values
Consider you need to study customer and sales data for ABC
Company. As you pointed out, numerous tuples lack recorded
values for a number of characteristics, including customer
income. The following techniques can be used to add the values
that are lacking for this attribute.
i. Ignore the tuple: Typically, this is done when the class
label is missing (assuming the task involves
classification). This method is particularly poor when the
percentage of missing values varies considerably from
attribute to attribute, and by discarding the tuple we also
lose the values of its remaining attributes.
ii. Manually enter the omitted value: In general, this
strategy is time-consuming and might not be practical for
huge data sets with a substantial number of missing values.
iii. Fill up the blank with a global constant: A single
constant, such as "Unknown" or “−∞”, should be used to
replace all missing attribute values. If missing data are
replaced with, say, "Unknown," the analysis algorithm can
mistakenly think that they collectively comprise valid data.
So, despite being simple, this strategy is not perfect.
iv. To fill in the missing value, use a measure of the
attribute's central tendency (such as the mean or
median): The median should be used for skewed data
distributions, while the mean can be used for normal
(symmetric) data distributions. Assume, for instance, that
the ABC company’s customer income data distribution is
symmetric and that the mean income is INR 50,000/-. Use
this value to fill in the income value that is missing.
v. For all samples that belong to the same class as the
specified tuple, use the mean or median: For instance, if
we were to categorize customers based on their credit risk,
the mean income value of customers who belonged to the
same credit risk category as the given tuple might be used
to fill in the missing value. If the data distribution is skewed
for the relevant class, it is best to utilize the median value.
vi. Fill in the blank with the value that is most likely to be
there: This result can be reached using regression,
inference-based techniques using a Bayesian
formalization, or decision tree induction. As an example,
using the other characteristics of your data's customers, you
may create a decision tree to forecast the income's missing
numbers.
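Several of these strategies can be tried in a few lines of Python. The sketch below is purely illustrative; it assumes a small, made-up customer table with columns named income and credit_risk:

import pandas as pd

# Hypothetical customer data with some missing income values
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "credit_risk": ["low", "low", "high", "high", "low", "high"],
    "income": [52000, None, 31000, None, 48000, 29000],
})

# (iii) Fill with a global constant
filled_constant = df["income"].fillna(-1)

# (iv) Fill with the overall mean (use the median instead for skewed distributions)
filled_mean = df["income"].fillna(df["income"].mean())

# (v) Fill with the mean income of customers in the same credit-risk class
filled_class_mean = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(filled_constant.tolist())
print(filled_mean.tolist())
print(filled_class_mean.tolist())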
b. Noisy Data
Noise is the variance or random error in a measured variable. It
is possible to recognize outliers, which might be noise,
employing tools for data visualization and basic statistical
description techniques (such as scatter plots and boxplots). How
can the data be "smoothed" out to reduce noise given a numeric
property, like price, for example? The following are some of the
data-smoothing strategies.
i. Binning: Binning techniques smooth sorted data values
by looking at their "neighbourhood" or nearby values.
The sorted values are divided into a number of
"buckets" or bins. Binning techniques carry out local
smoothing since they look at the values' surroundings.
When smoothing by bin means, each value in the bin is
changed to the bin's mean value. As an illustration,
suppose a bin contains three numbers 4, 8 and 15. The
average of these three numbers in the bin is 9.
Consequently, the value nine replaces each of the bin's original values.
Similarly, smoothing by bin medians, which substitutes
the bin median for each bin value, can be used. Bin
boundaries often referred to as minimum and maximum
values in a specific bin can also be used in place of bin
values. This type of smoothing is called smoothing by
bin boundaries. In this method, the nearest boundary
value is used to replace each bin value. In general, the
larger the bin width, the greater the smoothing effect.
Alternatively, bins may be of equal width, each covering
a constant range of values.
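The following is a minimal sketch of smoothing by bin means and by bin boundaries, assuming a small sorted list of prices and equal-frequency bins of size three (the 4, 8, 15 bin from the illustration above is the first bin):

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
bins = prices.reshape(-1, 3)                             # three equal-frequency bins

# Smoothing by bin means: every value in a bin becomes the bin mean
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value moves to the nearer of the bin min/max
low = bins.min(axis=1, keepdims=True)
high = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(bins - low <= high - bins, low, high).ravel()

print(by_means)        # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_boundaries)   # [ 4  4 15 21 21 24 25 25 34]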
ii. Regression: Regression fits the data values to a function and may also be used to
smooth out the data. Finding the "best"
line to fit two traits (or variables) is the goal of linear regression, which
enables one attribute to predict the other. As an extension of linear
regression, multiple linear regression involves more than two features
and fits the data to a multidimensional surface.
iii. Outlier analysis: Clustering, for instance, the grouping of comparable
values into "clusters," can be used to identify outliers. It makes sense to
classify values that are outliers as being outside the set of clusters.
iv. Data discretization, a data transformation and data reduction technique,
is an extensively used data smoothing technique. The number of distinct
values for each property is decreased, for instance, using the binning
approaches previously discussed. This functions as a form of data
reduction for logic-based data analysis methods like decision trees,
which repeatedly carry out value comparisons on sorted data. Concept
hierarchies are a data discretization technique that can also be applied to
smooth out the data. The quantity of data values that the analysis process
must process is decreased by a concept hierarchy. For example, the price
variable, which represents the price value of commodities, may be
discretized into “lowly priced”, “moderately priced”, and “expensive”
categories.
3. Filter unwanted outliers - An outlier should be removed only when there is a
good cause, such as suspicious measurements that are unlikely to be present in
the real data.
4. Handling missing data - Missing data is a deceptively difficult issue in
machine learning. We cannot simply ignore or remove missing observations;
they must be treated carefully, since they can indicate a serious problem. A data
gap is like a missing puzzle piece: dropping the observation is equivalent to
denying that the puzzle slot is there, while naively imputing it is like forcing in
a piece from another puzzle. Furthermore, we need to be aware of how we
record missing data. Instead of just filling a gap with the mean, you can flag the
observation as missing and then fill in a constant; this "flag and fill" approach
effectively lets the analysis algorithm account for the missingness itself.
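A minimal sketch of this flag-and-fill idea in pandas (the column name age and the fill constant 0 are purely illustrative):

import pandas as pd

df = pd.DataFrame({"age": [34, None, 27, 45, None]})

# Flag: record which observations were missing before anything is filled in
df["age_was_missing"] = df["age"].isna().astype(int)

# Fill: replace the missing values with a constant (0 here, purely illustrative)
df["age"] = df["age"].fillna(0)

print(df)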
5. Validate and QA-You should be able to respond to these inquiries as part of
fundamental validation following the data cleansing process, for example:
o Does the data make sense?
o Does the data abide by the regulations that apply to its particular field?
o Does it support or refute your hypothesis? Does it offer any new
information?
o Can you see patterns in the data that will support your analysis?
o Is there a problem with the data quality?
Data integration: Data from many sources, such as files, data cubes, databases (both relational and
non-relational), etc., must be combined during this procedure. Both
homogeneous and heterogeneous data sources are possible. Structured,
unstructured, or semi-structured data can be found in the sources. Redundancies
and inconsistencies can be reduced and avoided with careful integration.
d. Data Value Conflict Detection and Resolution: Data value conflicts must
be found and resolved as part of data integration. As an illustration,
attribute values from many sources may vary for the same real-world thing.
Variations in representation, scale, or encoding may be the cause of this. In
one system, a weight attribute might be maintained in British imperial
units, while in another, metric units. For a hotel chain, the cost of rooms in
several cities could include various currencies, services (such as a
complimentary breakfast) and taxes. Similarly, every university may have
its own curriculum and grading system. When sharing information among
them, one university might use the quarter system, provide three database
systems courses, and grade students from A+ to F, while another would use
the semester system, provide two database systems courses, and grade
students from 1 to 10. Information interchange between two such
universities is difficult because accurate course-to-grade transformation
rules are hard to establish between them.
An attribute in one system might be recorded at a lower abstraction level
than the "identical" attribute in another since the abstraction level of
attributes might also differ. As an illustration, an attribute with the same
name in one database may relate to the total sales of one branch of a
company, however, the same result in another database can refer to the
company's overall regional shop sales.
Data transformation is used to change the data into formats that are suited for the
analytical process. It involves transforming or consolidating
the data into analysis-ready formats. The following are some data transformation
strategies:
a. Smoothing, which attempts to reduce data noise. Binning, regression, and
grouping are some of the methods.
b. Attribute construction (or feature construction), wherein, in order to aid
the analysis process, additional attributes are constructed and added from the
set of attributes provided.
c. Aggregation, where data is subjected to aggregation or summary procedures
to calculate monthly and yearly totals; for instance, the daily sales data may
be combined to produce monthly or yearly sales. This process is often used
to build a data cube for data analysis at different levels of abstraction.
d. Normalization, where the attribute data is rescaled to fit a narrower range,
such as −1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
e. Discretization, where interval labels replace the raw values of a numeric
attribute (e.g., age) (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth,
adult, senior). A concept hierarchy for the number attribute can then be
created by recursively organizing the labels into higher-level concepts. To
meet the demands of different users, more than one concept hierarchy might
be built for the same characteristic.
f. Concept hierarchy creation using nominal data allows for the
extrapolation of higher-level concepts like a street to concepts like a city or
country. At the schema definition level, numerous hierarchies for nominal
qualities can be automatically created and are implicit in the database
structure.
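As an illustration of strategies (d) and (e), the sketch below applies min-max normalization and discretization to a small, made-up set of ages:

import pandas as pd

age = pd.Series([6, 17, 25, 38, 49, 66, 78])   # hypothetical attribute values

# (d) Min-max normalization to the range 0.0 - 1.0
age_norm = (age - age.min()) / (age.max() - age.min())

# (e) Discretization: replace raw ages with interval labels and conceptual labels
age_intervals = pd.cut(age, bins=[0, 20, 60, 100])
age_concepts = pd.cut(age, bins=[0, 20, 60, 100], labels=["youth", "adult", "senior"])

print(age_norm.round(2).tolist())
print(age_intervals.tolist())
print(age_concepts.tolist())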
2. Why is preprocessing important?
The process of choosing the best data source, data type, and collection tools is
known as data selection. Prior to starting the actual data collection procedure,
data selection is conducted. This concept makes a distinction between selective
data reporting (excluding data that is not supportive of a study premise) and
active/interactive data selection (using obtained data for monitoring
activities/events or conducting secondary data analysis). Data integrity may be
impacted by how acceptable data are selected for a research project.
The main goal of data selection is to choose the proper data type, source, and
tool that enables researchers to effectively solve research issues. This decision
typically depends on the discipline and is influenced by the research that has
already been done, the body of that research, and the availability of the data
sources.
Integrity issues may arise, when decisions about which "appropriate" data to
collect, are primarily centred on cost and convenience considerations rather than
the data's ability to successfully address research concerns. Cost and
convenience are unquestionably important variables to consider while making a
decision. However, researchers should consider how much these factors can
skew the results of their study.
Types and Sources of Data: Different data sources and types can be displayed in
a variety of ways. There are two main categories of data:
• Quantitative data are expressed as numerical measurements at the interval
and ratio levels.
• Qualitative data can take the form of text, images, music, and video.
Data curation is the process of creating, organizing, and managing data sets so that
people looking for information can access and use them. It comprises compiling,
arranging, indexing, and categorizing data for users inside of a company, a
group, or the general public. To support business decisions, academic needs,
scientific research, and other initiatives, data can be curated. Data curation is a
step in the larger data management process that helps prepare data sets for usage
in business intelligence (BI) and analytics applications. In other cases, the
curation process might be fed with ready-made data for ongoing management
and maintenance. In organizations without a dedicated data curator position,
data stewards, data engineers, database administrators, data scientists, or
business users may fill that role.
There are numerous tasks involved in curating data sets, which can be divided
into the following main steps.
• Determine the data that will be required for the proposed analytics
applications.
• Map the data sets and note the metadata that goes with them.
• Collect the data sets.
• Ingest the data into a system such as a data lake or a data warehouse.
• Cleanse the data to remove abnormalities, inconsistencies, and
mistakes, including missing values, duplicate records, and spelling
mistakes.
• Model, organize, and transform the data to prepare it for specific
analytics applications.
• To make the data sets accessible to users, create searchable indexes
of them.
• Maintain and manage the data in compliance with the requirements
of continuous analytics and the laws governing data privacy and
security.
3.5.2 Importance of Data Curation
The following are the reasons for performing data curation.
1. Helps to organize pre-existing data for a corporation: Businesses produce
a large amount of data on a regular basis, however, this data can
occasionally be lacking. When a customer clicks on a website, adds
something to their cart, or completes a transaction, an online clothes
retailer might record that information. Data curators assist businesses in
better understanding vast amounts of information by assembling prior
data into data sets.
2. Connects professionals in different departments: When a company
engages in data curation, it often brings together people from several
departments who might not typically collaborate. Data curators might
collaborate with stakeholders, system designers, data scientists, and data
analysts to collect and transfer information.
3. Maintains high data quality: High-quality data is well organized, simple
to grasp, and contains fewer errors. Because the data curation process
entails cleansing the data, curators can make sure that a company's
research and information continue to be of the highest caliber. Removing
unnecessary information makes research more concise, which may
facilitate better data set structure.
4. Makes data easy to understand: Data curators make sure there are no
errors and utilize proper formatting. This makes it simpler for specialists
who are not knowledgeable about a research issue to comprehend a data
set.
5. Allows for higher cost and time efficiency: A business may spend more
time and money organizing and distributing data if it does not regularly
employ data curation. Because prior data is already organized and
distributed, businesses that routinely do data curation may be able to save
time, effort, and money. Businesses can reduce the time it takes to obtain
and process data by using data curators, who handle the data.
Data integration creates coherent data storage by combining data from several
sources. Smooth data integration is facilitated by the resolution of semantic
heterogeneity, metadata, correlation analysis, tuple duplicate identification, and
data conflict detection. It is a strategy that combines data from several sources so
that consumers can access it in a single, consistent, up-to-date view. Systems can
communicate using flat files, data cubes, or numerous
databases. Data integration is crucial because it maintains data accuracy while
providing a consistent view of dispersed data. It helps the analysis tools extract
valuable information, which in turn helps the executive and management make
tactical choices that will benefit the company.
Uniform Access Integration - This method integrates information from a wider
range of sources. In this instance, however, the data is left in its initial place and
is not moved. To put it simply, this technique produces a unified view of the
combined data. The integrated data does not need to be saved separately because
the end user only sees the integrated view.
2. Selecting and producing the data set that will be used for discovery -Once
the objectives have been specified, the data that will be used for the knowledge
discovery process should be identified. This involves determining what data is
accessible, obtaining any additional essential data, and then combining all the data
for knowledge discovery into one set.
Knowledge discovery is important since it extracts knowledge and insight from
the given data. This provides the framework for building the models.
3. Preprocessing and cleansing – This step helps in increasing the data
reliability. It comprises data cleaning, like handling the missing quantities and
removing noise or outliers. In this situation, it might make use of sophisticated
statistical methods or an analysis algorithm. For instance, the goal of the Data
Mining supervised approach may change if it is determined that a certain
attribute is unreliable or has a sizable amount of missing data. After developing
a prediction model for these features, missing data can be forecasted. A variety
of factors affect how much attention is paid to this level. However, breaking
down the components is important and frequently useful for enterprise data
frameworks.
4. Data Transformation-This phase entails creating and getting ready the
necessary data for knowledge discovery. Here, techniques of attribute
transformation (such as discretization of numerical attributes and functional
transformation) and dimension reduction (such as feature selection, feature
extraction, record sampling etc.) are employed. This step, which is frequently
very project-specific, can be important for the success of the KDD project.
Proper transformation results in proper analysis and proper conclusions.
5. Prediction and description- The decisions to use classification, regression,
clustering, or any other method can now be made. Mostly, this uses the KDD
objectives and the decisions made in the earlier phases. A forecast and a
description are two of the main objectives of knowledge discovery. The
visualization aspects are included in descriptive knowledge discovery. Inductive
learning, which generalizes a sufficient number of prepared models to produce
a model either explicitly or implicitly, is used by the majority of knowledge
discovery techniques. The fundamental premise of the inductive technique is
that the prepared model holds true for the examples that follow.
6. Deciding on the knowledge discovery algorithm - Having chosen the overall
strategy, we now decide on the tactics: a specific method must be selected for
searching for patterns, possibly using several inducers. If precision
and understandability are compared, the former is improved by neural networks,
while decision trees improve the latter. There are numerous ways that each meta-
learning system could be successful. The goal of meta-learning is to explain why
a data analysis algorithm is successful or unsuccessful in solving a particular
problem. As a result, this methodology seeks to comprehend the circumstances
in which a data analysis algorithm is most effective. Every algorithm has
parameters and learning techniques, including tenfold cross-validation or a
different division for training and testing.
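As a rough, illustrative sketch of this step (not part of the original text), the snippet below uses scikit-learn with synthetic data to compare a decision tree and a neural network by tenfold cross-validation; the scores are illustrative only:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the prepared KDD data set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
}

# Tenfold cross-validation to compare the candidate analysis algorithms
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(name, "mean accuracy:", round(scores.mean(), 3))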
7. Utilizing the Data Analysis Algorithm-Finally, the data analysis algorithm
is put into practice. The approach might need to be applied several times before
producing a suitable outcome at this point. For instance, by rotating the
algorithms, you can alter variables like the bare minimum of instances in a single
decision tree leaf.
8. Evaluation - In this stage, the patterns, principles, and dependability of the
results of the knowledge discovery process are assessed and interpreted in light
of the objective outlined in the preceding step. Here, we take into account the
preprocessing steps and how they impact the final results. For example, one might
add a feature in step 4 and then repeat the process from that point. The primary
considerations in this step are
the understanding and utility of the induced model. In this stage, the identified
knowledge is also documented for later use.
Check Your Progress 5:
1. What is Knowledge Discovery?
2. What are the Steps involved in Knowledge Discovery?
3. What are knowledge discovery tools?
4. Explain the process of KDD.
3.8 SUMMARY
Despite the development of several methods for preparing data, the intricacy of
the issue and the vast amount of inconsistent or unclean data mean that this field
of study is still very active. This unit gives a general overview of data pre-
processing and describes how to turn raw data into usable information. The
preprocessing of the raw data included data integration, data reduction,
transformation, and discretization. In this unit, we have discussed five different
data-cleaning techniques that can make data more reliable and produce high-
quality results. Building, organizing, and maintaining data sets is known as data
curation. A data curator usually determines the necessary data sets and makes
sure they are gathered, cleaned up, and changed as necessary. The curator is also
in charge of providing users with access to the data sets and information related
to them, such as their metadata and lineage documentation. The primary goal of
the data curator is to make sure users have access to the appropriate data for
analysis and decision-making. Data integration is the procedure of fusing
information from diverse sources into a single, coherent data store. The unit also
introduced knowledge discovery techniques and procedures.
3.9 SOLUTIONS/ANSWERS
2. It raises the reliability and accuracy of the data. Preprocessing data can
increase the correctness and quality of a dataset, making it more
dependable by removing missing or inconsistent data values brought by
human or computer mistakes. It ensures consistency in data.
3. Data quality is characterized by five characteristics: correctness,
completeness, reliability, relevance, and timeliness.
UNIT 4: DATA VISUALIZATION AND INTERPRETATION
Structure
4.0 Introduction
4.1 Objectives
4.2 Different types of plots
4.3 Histograms
4.4 Box plots
4.5 Scatter plots
4.6 Heat map
4.7 Bubble chart
4.8 Bar chart
4.9 Distribution plot
4.10 Pair plot
4.11 Line graph
4.12 Pie chart
4.13 Doughnut chart
4.14 Area chart
4.15 Summary
4.16 Answers
4.17 References
4.0 INTRODUCTION
The previous units of this course cover different aspects of data analysis, including
the basics of data science, basic statistical concepts related to data science, and data
pre-processing. This unit explains the different types of plots used for data visualization
and interpretation. It discusses how each plot is constructed and the use cases
associated with it. The unit will help you to appreciate the real-world need for a
workforce trained in visualization techniques and will help you to design, develop,
and interpret visual representations of data. It also describes the best practices
associated with the construction of different types of plots.
4.1 OBJECTIVES
After going through this unit, you will be able to:
• Explain the key characteristics of various types of plots for data visualization;
• Explain how to design and create data visualizations;
• Summarize and present the data in meaningful ways;
• Define appropriate methods for collecting, analysing, and interpreting the
numerical information.
Moreover, data visualisation can bring heterogeneous teams together around new
objectives and foster trust among team members. Let us now discuss the various
graphs and charts that can be used to express different aspects of a business.
4.3 HISTOGRAMS
A histogram visualises the distribution of data across distinct groups with continuous
classes. It is drawn as a set of rectangular bars whose widths equal the class intervals
and whose areas are proportional to the frequencies in the respective classes. A histogram
may hence be defined as a graphic of a frequency distribution that is grouped and has
continuous classes. It provides an estimate of the distribution of values, their extremes,
and the presence of any gaps or out-of-the-ordinary numbers. They are useful in
providing a basic understanding of the probability distribution.
• Analyse various data groups: The best data groupings can be found by
creating a variety of histograms.
• Break down compartments using colour: The same chart can display a
second set of categories by colouring the bars that represent each category.
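A histogram can be drawn with a few lines of Python using matplotlib; the sketch below uses randomly generated measurements and an assumed bin count of ten:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical, normally distributed measurements
values = np.random.default_rng(0).normal(loc=45.5, scale=3.9, size=100)

plt.hist(values, bins=10, edgecolor="black")   # bar heights show class frequencies
plt.xlabel("Measured value")
plt.ylabel("Frequency")
plt.title("Histogram of 100 sample measurements")
plt.show()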
Types of Histogram
Normal distribution: In a normal distribution, points are equally likely to occur on
either side of the mean.
Example: Consider bins showing the frequency of housefly wing lengths, measured
in tenths of a millimetre.
Bimodal Distribution: This distribution has two peaks. In the case of a bimodal
distribution, the data must be segmented before being analysed as normal distributions
in their own right.
Example:
Variable Frequency
0 2
1 6
2 4
3 2
4 4
5 6
6 4
(Figure: bar chart of the bimodal distribution above, with frequency on the y-axis and the variable on the x-axis, showing peaks at 1 and 5.)
Edge Peak Distribution: When there is an additional peak at the edge of the
distribution that does not belong there, this type of distribution is called an edge peak
distribution. Unless you are certain that your data set is expected to contain such
outliers (e.g. a few extreme views on a survey), this almost always indicates that you
have plotted or collected your data incorrectly.
Comb Distribution: Because the distribution seems to resemble a comb, with
alternating high and low peaks, this type of distribution is given the name "comb
distribution." Rounding off an object might result in it having a comb-like form. For
instance, if you are measuring the height of the water to the nearest 10 centimetres but
your class width for the histogram is only 5 centimetres, you may end up with a comb-
like appearance.
Example
Histogram for the population data of a group of 86 people:
Age Group (Bins): 20-25, 26-30, 31-35, 36-40, 41-45, 46-50
Population Size: 23, 18, 15, 6, 11, 13
(Figure: histogram of population size by age group.)
……………………………………………………………………………………
……………………………………………………………………………………
4. What do histograms show?
………………………………………………………………………………………
………………………………………………………………………………………
……………………………………………………………………………………
2. What are the most important parts of a box plot?
……………………………………………………………………………………
……………………………………………………………………………………
………………………………………………………………………………
4.5 SCATTER PLOTS
A scatter plot is the most commonly used chart when observing the relationship between
two quantitative variables. It works particularly well for quickly identifying possible
correlations between different data points. The relationship between multiple variables
can be efficiently studied using scatter plots, which show whether one variable is a good
predictor of another or whether they normally fluctuate independently. Multiple distinct
data points are shown on a single graph in a scatter plot. Following that, the chart can
be enhanced with analytics like trend lines or cluster analysis. It is especially useful for
quickly identifying potential correlations between data points.
Constructing a Scatter Plot: Scatter plots are mathematical diagrams or plots that rely
on Cartesian coordinates. In this type of graph, the categories being compared are
represented by the circles on the graph (shown by the colour of the circles) and the
numerical volume of the data (indicated by the circle size). One colour on the graph
allows you to represent two values for two variables related to a data set, but two colours
can also be used to include a third variable.
Use Cases: Scatter charts are great in scenarios where you want to display both
distribution and the relationship between two variables.
• Display the relationship between time-on-platform (How Much Time Do
People Spend on Social Media) and churn (the number of people who stopped
being customers during a set period of time).
• Display the relationship between salary and years spent at company
Best Practices
• Analyze clusters to find segments: Based on your chosen variables, cluster
analysis divides up the data points into discrete parts.
• Employ highlight actions: You can rapidly identify which points in your
scatter plots share characteristics by adding a highlight action, all the while
keeping an eye on the rest of the dataset.
• Customize markers: Individual markers add a simple visual hint to your
graph that makes it easy to distinguish between different groups of points.
Example
(Figure: scatter plot of ice-cream sales in ₹ against temperature in °C, with a fitted linear trendline.)
Please note that a linear trendline has been fitted to the scatter plot, indicating that
ice-cream sales increase with temperature.
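A sketch of how such a plot could be produced with matplotlib, using made-up temperature and sales figures and a least-squares trendline:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical temperature (°C) and ice-cream sales (₹) observations
temperature = np.array([12, 16, 20, 24, 28, 32, 36, 40])
sales = np.array([600, 900, 1300, 1700, 2100, 2600, 3100, 3600])

plt.scatter(temperature, sales, label="daily observations")

# Fit and draw a degree-1 (linear) trendline by least squares
slope, intercept = np.polyfit(temperature, sales, 1)
plt.plot(temperature, slope * temperature + intercept, label="linear trendline")

plt.xlabel("Temperature (°C)")
plt.ylabel("Ice-cream sales (₹)")
plt.legend()
plt.show()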
Check Your Progress 3
……………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
4. What are the 3 types of corelations that can be inferred from scatter plots?
……………………………………………………………………………………
……………………………………………………………………………………
4.6 HEAT MAP
Heatmaps are two-dimensional graphics that show data trends through colour
shading. They are an example of part to whole chart in which values are represented
using colours. A basic heat map offers a quick visual representation of the data. A
user can comprehend complex data sets with the help of more intricate heat maps.
Heat maps can be presented in a variety of ways, but they all have one thing in
common: they all make use of colour to convey correlations between data
values. Heat maps are more frequently utilised to present a more comprehensive
view of massive amounts of data. It is especially helpful because colours are simpler
to understand and identify than plain numbers.
Heat maps are highly flexible and effective at highlighting trends. Heatmaps are
naturally self-explanatory, in contrast to other data visualisations that require
interpretation. The greater the quantity/volume, the deeper the colour (the higher
the value, the tighter the dispersion, etc.). Heat Maps dramatically improve the
ability of existing data visualisations to quickly convey important data insights.
Use Cases: Heat Maps are primarily used to better show the enormous amounts of
data contained inside a dataset and help guide users to the parts of data
visualisations that matter most.
• Average monthly temperatures across the years
• Departments with the highest amount of attrition over time.
• Traffic across a website or a product page.
• Population density/spread in a geographical location.
Best Practices
• Select the proper colour scheme: This style of chart relies heavily on
colour, therefore it's important to pick a colour scheme that complements
the data.
• Specify a legend: As a related point, a heatmap must typically contain a
legend describing how the colours correspond to numerical values.
Example
Region-wise monthly sale of a SKU (stock-keeping unit)
MONTH
ZONE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
NORTH 75 84 61 95 77 82 74 92 58 90 54 83
SOUTH 50 67 89 61 91 77 80 72 82 78 58 63
EAST 62 50 83 95 83 89 72 96 96 81 86 82
WEST 69 73 59 73 57 61 58 60 97 55 81 92
The distribution of sales is shown in the sample heatmap above, broken down by
zone and spanning a 12-month period. Like in a typical data table, each cell displays
a numeric count, but the count is also accompanied by a colour, with higher counts
denoting deeper hues.
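A sketch of how such a heat map could be drawn with matplotlib, using the first six months of the table above (values copied for illustration):

import numpy as np
import matplotlib.pyplot as plt

zones = ["NORTH", "SOUTH", "EAST", "WEST"]
months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN"]
sales = np.array([
    [75, 84, 61, 95, 77, 82],
    [50, 67, 89, 61, 91, 77],
    [62, 50, 83, 95, 83, 89],
    [69, 73, 59, 73, 57, 61],
])

fig, ax = plt.subplots()
im = ax.imshow(sales, cmap="Blues")            # deeper hue = higher count
ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)
ax.set_yticks(range(len(zones)))
ax.set_yticklabels(zones)

# Annotate each cell with its numeric count, as in the table
for i in range(len(zones)):
    for j in range(len(months)):
        ax.text(j, i, sales[i, j], ha="center", va="center")

fig.colorbar(im, ax=ax, label="Units sold")    # legend for the colour scale
plt.show()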
……………………………………………………………………………………
2. What kind of information does a heat map display?
……………………………………………………………………………………
……………………………………………………………………………………
3. What can be seen in heatmap?
……………………………………………………………………………………
……………………………………………………………………………………
4.7 BUBBLE CHART
Bubble diagrams are used to show the relationships between different variables. They
are frequently used to represent data points in three dimensions, specifically when the
bubble size, y-axis, and x-axis are all present. Using location and size, bubble charts
demonstrate relationships between data points. However, bubble charts have a restricted
data size capability since too many bubbles can make the chart difficult to read.
Although technically not a separate type of visualisation, bubbles can be used to show
the relationship between three or more measurements in scatter plots or maps by adding
complexity. By altering the size and colour of circles, large amounts of data are
presented concurrently in visually pleasing charts.
Use Cases: Usually, the positioning and ratios of the size of the bubbles/circles on this
chart are used to compare and show correlations between variables. Additionally, it is
utilised to spot trends and patterns in data.
• AdWords analysis: CPC vs conversions vs share of total conversions
• Relationship between life expectancy, GDP per capita, and population size
Best Practices:
• Add colour: A bubble chart can gain extra depth by using colour.
• Set bubble size in appropriate proportion.
• Overlay bubbles on maps: From bubbles, a viewer can immediately determine
the relative concentration of data. These are used as an overlay to provide the
viewer with context for geographically-related data.
Example
The three variables in this example are sales, profits, and the number of units sold.
Therefore, all three variables and their relationship can be displayed simultaneously
using a bubble chart.
(Figure: bubble chart — Sales and Profit versus the Quantity Sold; sales in INR on the y-axis, number of units sold on the x-axis, and bubble size representing profit.)
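A sketch of such a bubble chart in matplotlib, with made-up figures for units sold, sales, and profit; the bubble area is scaled from the profit values:

import matplotlib.pyplot as plt

# Hypothetical figures for a few products
units_sold = [200, 450, 700, 900, 1200]
sales = [5000, 9000, 14000, 20000, 27000]      # INR
profit = [600, 1500, 2500, 4000, 6000]         # INR, shown as bubble size

plt.scatter(units_sold, sales, s=[p / 10 for p in profit], alpha=0.5)
plt.xlabel("Number of units sold")
plt.ylabel("Sales (INR)")
plt.title("Sales and profit versus quantity sold (bubble size = profit)")
plt.show()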
……………………………………………………………………………………
2. What is a bubble chart used for?
……………………………………………………………………………………
……………………………………………………………………………………
3. What is the difference between scatter plot and bubble chart?
……………………………………………………………………………………
……………………………………………………………………………………
4. What is bubble size in bubble chart?
……………………………………………………………………………….
……………………………………………………………………………….
4.8 BAR CHART
A bar chart is a graphical depiction of numerical data that uses rectangles (or
bars) with equal widths and varied heights. In the field of statistics, bar charts
are one of the methods for handling data.
Constructing a Bar Chart: The x-axis corresponds to the horizontal line, and
the y-axis corresponds to the vertical line. The y-axis represents frequency in
this graph. Write the names of the data items whose values are to be noted along
the x-axis that is horizontal.
Along the horizontal axis, choose the uniform width of bars and the uniform
gap between the bars. Pick an appropriate scale to go along the y-axis that runs
vertically so that you can figure out how high the bars should be based on the
values that are presented. Determine the heights of the bars using the scale you
selected, then draw the bars using that information.
Types of Bar chart: Bar Charts are mainly classified into two types:
Horizontal Bar Charts: Horizontal bar charts are the type of graph that are
used when the data being analysed is to be depicted on paper in the form of
horizontal bars with their respective measures. When using a chart of this type,
the categories of the data are indicated on the y-axis.
Example:
Vertical Bar Charts: A vertical bar chart displays vertical bars on graph (chart)
paper. These rectangular bars in a vertical orientation represent the
measurement of the data. The quantities of the variables that are written along
the x-axis are represented by these rectangular bars.
Example:
We can further divide bar charts into two basic categories:
Grouped Bar Charts: The grouped bar graph is also referred to as the clustered
bar graph (graph). It is valuable for at least two separate types of data. The
horizontal (or vertical) bars in this are categorised according to their position.
If, for instance, the bar chart is used to show three groups, each of which has
numerous variables (such as one group having four data values), then different
colours will be used to indicate each value. When there is a close relationship
between two sets of data, each group's colour coding will be the same.
Example:
Stacked Bar Charts: The composite bar chart is also referred to as the stacked
bar chart. It illustrates how the overall bar chart has been broken down into its
component pieces. We utilise bars of varying colours and clear labelling to
determine which category each item belongs to. As a result, in a chart with
stacked bars, each parameter is represented by a single rectangular bar. Multiple
segments, each of a different colour, are displayed within the same bar. The
various components of each separate label are represented by the various
segments of the bar. It is possible to draw it in either the vertical or horizontal
plane.
Example:
Use cases: Bar charts are typically employed to display quantitative data. The
following is a list of some of the applications of the bar chart-
• In order to clearly illustrate the relationships between various variables,
bar charts are typically utilised. When presented in a pictorial format,
the parameters can be more quickly and easily envisioned by the user.
• Bar charts are the quickest and easiest way to display extensive
amounts of data while saving time. It is utilised for studying trends over
extended amounts of time.
Best Practices:
• Use a common zero valued baseline
• Maintain rectangular forms for your bars
• Consider the ordering of category level and use colour wisely.
Example:
Region Sales
East 6,123
West 2,053
South 4,181
North 3,316
(Figure: horizontal bar chart of sales by region.)
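A sketch of the horizontal bar chart above, drawn with matplotlib from the sales-by-region figures:

import matplotlib.pyplot as plt

regions = ["East", "West", "South", "North"]
sales = [6123, 2053, 4181, 3316]

plt.barh(regions, sales)          # horizontal bars; plt.bar() would give vertical ones
plt.xlabel("Sales")
plt.title("Sales By Region")
plt.show()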
4.9 DISTRIBUTION PLOT
Constructing a Distribution Plot: You must utilise one or two dimensions, together
with one measure, in a distribution plot. You will get a single line visualisation if you
only use one dimension. If you use two dimensions, each value of the outer, second
dimension will produce a separate line.
Use Cases: Distribution of a data set shows the frequency of occurrence of each
possible outcome of a repeatable event observed many times. For instance:
• Height of a population.
• Income distribution in an economy
• Test scores listed by percentile.
Best Practices:
• It is advisable to have equal class widths.
• The class intervals should be mutually exclusive and non-overlapping.
• Open-ended classes at the lower and upper limits (e.g., <10, >100) should be
avoided.
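Although the construction above is described in terms of dimensions and measures, in Python a distribution plot is commonly drawn with seaborn; the sketch below uses randomly generated heights as the measure:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical heights (cm) of a population sample
heights = np.random.default_rng(1).normal(loc=165, scale=8, size=500)

# Histogram of the values with a kernel-density estimate overlaid
sns.histplot(heights, bins=20, kde=True)
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.show()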
Example
(Figure: distribution plot of values grouped into intervals of width 100, from 1–100 up to 901–1000.)
4.10 PAIR PLOT
The pairs plot is an extension of two fundamental figures, the histogram and the
scatter plot. The histograms along the diagonal show the distribution of each single
variable, while the scatter plots in the upper and lower triangles show the relationship
(or lack thereof) between each pair of variables.
Use Cases: A pairs plot allows us to see both distribution of single variables and
relationships between two variables. It helps to identify the most distinct clusters or the
optimum combination of attributes to describe the relationship between two variables.
• By creating some straightforward linear separations or basic lines in our data
set, it also helps to create some straightforward classification models.
• Analysing socio-economic data of a population.
Best Practices:
• Use a different colour palette.
• For each colour level, use a different marker.
Example:
calories protein fat sodium fiber rating
70 4 1 130 10 68.40297
120 3 5 15 2 33.98368
70 4 1 260 9 59.42551
50 4 0 140 14 93.70491
110 2 2 180 1.5 29.50954
110 2 0 125 1 33.17409
130 3 2 210 2 37.03856
90 2 1 200 4 49.12025
90 3 0 210 5 53.31381
120 1 2 220 0 18.04285
110 6 2 290 2 50.765
120 1 3 210 0 19.82357
110 3 2 140 2 40.40021
110 1 1 180 0 22.73645
110 2 0 280 0 41.44502
100 2 0 290 1 45.86332
110 1 0 90 1 35.78279
110 1 1 180 0 22.39651
110 3 3 140 4 40.44877
110 2 0 220 1 46.89564
100 2 1 140 2 36.1762
100 2 0 190 1 44.33086
110 2 1 125 1 32.20758
110 1 0 200 1 31.43597
100 3 0 0 3 58.34514
120 3 2 160 5 40.91705
120 3 0 240 5 41.01549
110 1 1 135 0 28.02577
100 2 0 45 0 35.25244
110 1 1 280 0 23.80404
100 3 1 140 3 52.0769
110 3 0 170 3 53.37101
120 3 3 75 3 45.81172
120 1 2 220 1 21.87129
110 3 1 250 1.5 31.07222
110 1 0 180 0 28.74241
110 2 1 170 1 36.52368
140 3 1 170 2 36.47151
110 2 1 260 0 39.24111
100 4 2 150 2 45.32807
110 2 1 180 0 26.73452
100 4 1 0 0 54.85092
150 4 3 95 3 37.13686
150 4 3 150 3 34.13977
160 3 2 150 3 30.31335
100 2 1 220 2 40.10597
120 2 1 190 0 29.92429
140 3 2 220 3 40.69232
90 3 0 170 3 59.64284
130 3 2 170 1.5 30.45084
120 3 1 200 6 37.84059
100 3 0 320 1 41.50354
50 1 0 0 0 60.75611
50 2 0 0 1 63.00565
100 4 1 135 2 49.51187
100 5 2 0 2.7 50.82839
120 3 1 210 5 39.2592
100 3 2 140 2.5 39.7034
90 2 0 0 2 55.33314
110 1 0 240 0 41.99893
110 2 0 290 0 40.56016
80 2 0 0 3 68.23589
90 3 0 0 4 74.47295
90 3 0 0 3 72.80179
110 2 1 70 1 31.23005
110 6 0 230 1 53.13132
90 2 0 15 3 59.36399
110 2 1 200 0 38.83975
140 3 1 190 4 28.59279
100 3 1 200 3 46.65884
110 2 1 250 0 39.10617
110 1 1 140 0 27.7533
100 3 1 230 3 49.78745
100 3 1 200 3 51.59219
110 2 1 200 1 36.18756
The pair plot can be interpreted as follows:
Along the boxes of the diagonal, the variable names (and the distribution of each single
variable) are displayed. Each of the remaining boxes shows a scatterplot of one pairwise
combination of variables. For instance, the box in the top right corner of the matrix
shows a scatterplot of the values for rating and sodium; the other boxes can be read in
the same way for their respective row and column variables. From this single
visualisation we can see the association between each pair of variables in our dataset.
For instance, calories and rating appear to have a negative relationship, while protein
and fat appear to be unrelated.
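A pair plot like the one discussed above can be produced with seaborn; the sketch below rebuilds the first few rows of the data table shown earlier (ratings rounded) rather than loading a file:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# First few rows of the data table above, copied for illustration
df = pd.DataFrame({
    "calories": [70, 120, 70, 50, 110, 110, 130, 90],
    "protein":  [4, 3, 4, 4, 2, 2, 3, 2],
    "fat":      [1, 5, 1, 0, 2, 0, 2, 1],
    "sodium":   [130, 15, 260, 140, 180, 125, 210, 200],
    "fiber":    [10, 2, 9, 14, 1.5, 1, 2, 4],
    "rating":   [68.4, 34.0, 59.4, 93.7, 29.5, 33.2, 37.0, 49.1],
})

# Scatter plots for every pair of variables, with histograms along the diagonal
sns.pairplot(df)
plt.show()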
4.11 LINE GRAPH
A graph that depicts change over time by means of points and lines is known as a line
graph, line chart, or line plot. It is a graph that shows a line connecting a lot of points
or a line that shows how the points relate to one another. The graph is represented by
the line or curve that connects successive data points to show quantitative data between
two variables that are changing. The values of these two variables are compared along
a vertical axis and a horizontal axis in linear graphs.
One of the most significant uses of line graphs is tracking changes over both short and
extended time periods. It is also used to compare the changes that have taken place for
diverse groups over the course of the same time period. It is strongly advised to use a
line graph rather than a bar graph when working with data that only has slight
fluctuations.
Example:
2. Multiple Line Graph: The same set of axes is used to plot several lines. An
excellent way to compare similar objects over the same time period is via a
multiple line graph.
Example:
Example:
Constructing a line graph: When we have finished creating the data tables, we will
then use those tables to build the linear graphs. These graphs are constructed by plotting
a succession of points, which are then connected together with straight lines to offer a
straightforward method for analysing data gathered over a period of time. It provides a
very good visual format of the outcome data that was gathered over the course of time.
Use cases: Tracking changes over both short and long time periods is an important
application of line graphs. Additionally, it is utilised to compare changes over the same
time period for various groups. Anytime there are little changes, using a line graph
rather than a bar graph is always preferable.
Best Practices:
• Lines should only be used to connect adjacent values along an interval scale.
• In order to provide correct insights, intervals should be of comparable size.
• Select a baseline that makes sense for your set of data; a zero baseline might
not adequately capture changes in the data.
• Line graphs are only helpful for comparing data sets if the axes have the same
scales.
Example:
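A multiple line graph of this kind can be drawn with matplotlib; the sketch below uses made-up monthly sales for two product lines:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
product_a = [120, 135, 150, 145, 160, 178]
product_b = [100, 102, 98, 110, 115, 120]

# Both series share the same axes, making comparison over time easy
plt.plot(months, product_a, marker="o", label="Product A")
plt.plot(months, product_b, marker="o", label="Product B")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.legend()
plt.show()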
4.12 PIE CHART
A pie chart, often referred to as a circle chart, is a style of graph that can be used to
summarise a collection of nominal data or to show the many values of a single variable
(e.g. percentage distribution). Such a chart resembles a circle that has been divided into
a number of segments. Each segment corresponds to a specific category, and the overall
area of the circle is divided among the segments in the same proportion as each
category's share of the whole data set.
A pie chart often depicts the individual components that make up the whole. In order to
bring attention to a particular piece of information that is significant, the illustration
may, on occasion, show a portion of the pie chart that is cut away from the rest of the
diagram. This type of chart is known as an exploded pie chart.
Types of a Pie chart: There are mainly two types of pie charts: the 2D pie chart and
the 3D pie chart. These can be further classified into the following categories:
1. Simple Pie Chart: The most fundamental kind of pie chart is referred to simply as
a pie chart and is known as a simple pie chart. It is an illustration that depicts a pie
chart in its most basic form.
Example:
Owners(%)
2. Exploded Pie Chart: In an exploded pie chart, one or more slices are pulled away
from the centre of the pie rather than being kept joined to it. This is commonly done
to draw attention to a particular section or slice of the chart.
Example:
3.Pie of Pie: The pie of pie method is a straightforward approach that enables more
categories to be represented on a pie chart without producing an overcrowded and
difficult-to-read graph. A pie chart that is generated from an already existing pie chart
is referred to as a "pie of pie".
Example:
The angle of each slice of a pie chart is computed as (given data value / total value of the data) × 360°.
Use cases: If you want your audience to get a general idea of the part-to-whole
relationship in your data, and comparing the exact sizes of the slices is not as critical to
you, then you should use pie charts. And indicate that a certain portion of the whole is
disproportionately small or large.
• Voting preference by age group
• Market share of cloud providers
Best Practices
• Fewer pie wedges are preferred: The observer may struggle to interpret the chart's
significance if there are too many proportions to compare. Similar to this, keep the
overall number of pie charts on dashboards to a minimum.
• Overlay pies on maps: Pie charts can be used to further deconstruct geographic
tendencies in your data and produce an engaging display.
Example
(Figure: pie chart of market share across five companies, Company A to Company E, with slices of 33%, 24%, 22%, 13%, and 8%.)
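A sketch of such a market-share pie chart in matplotlib; the assignment of shares to companies is illustrative, and the largest slice is exploded to draw attention to it:

import matplotlib.pyplot as plt

companies = ["Company A", "Company B", "Company C", "Company D", "Company E"]
share = [24, 33, 22, 13, 8]                      # percentages, summing to 100

# Pull the largest slice slightly away from the centre (an exploded slice)
explode = [0.1 if s == max(share) else 0 for s in share]

plt.pie(share, labels=companies, autopct="%1.0f%%", explode=explode)
plt.title("Market share")
plt.show()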
4.13 DOUGHNUT CHART
The doughnut chart is a more user-friendly alternative to the pie chart and is often
easier to read. These charts
express the relationship of 'part-to-whole,' which is when all of the parts represent one
hundred percent when collected together. It presents survey questions or data with a
limited number of categories for making comparisons.
In comparison to pie charts, they provide for more condensed and straightforward
representations. In addition, the centre hole can be used to display additional relevant
information. The chart is divided into segments, where each arc indicates a
proportional value associated with a different piece of data.
Constructing a Doughnut chart: A doughnut chart, like a pie chart, illustrates the
relationship of individual components to the whole, but unlike a pie chart, it can display
more than one data series at the same time. A ring is added to a doughnut chart for each
data series that is plotted within the chart itself. The beginning of the first data series
can be seen near the middle of the chart. A doughnut chart is a specific kind of pie
chart used to show the percentages of categorical data: the amount of data that falls
into each category is indicated by the size of the corresponding segment of the
doughnut. Creating a doughnut chart involves a string (category) field and a number,
a count of features, or a rate/ratio field.
There are two types of doughnut chart: the normal doughnut chart and the
exploded doughnut chart. Exploded doughnut charts, much like exploded pie charts,
highlight the contribution of each value to a total while emphasising individual values.
However, unlike exploded pie charts, exploded doughnut charts can include more than
one data series.
Use cases: Doughnut charts are good to use when comparing sets of data. By using the
size of each component to reflect the percentage of each category, they are used to
display the proportions of categorical data. A string field and a count of features,
number, rate/ratio, or field are used to make a doughnut chart.
• Android OS market share
• Monthly sales by channel
Best Practices
• Stick to five slices or less because thinner and long-tail slices become unreadable
and uncomparable.
• Use this chart to display one point in time with the help of the filter legend.
• Well-formatted and informative labels are essential because the information
conveyed by circular shapes alone is not enough and is imprecise.
• It is a good practice to sort the slices to make it more clear for comparison.
Example:
Project Status
Completed 30%
Work in progress 25%
Incomplete 45%
4.14 AREA CHART
An area chart, a hybrid of a line chart and a bar chart, shows how the numerical values
of one or more groups change with a second variable, most often the passage of time.
What distinguishes an area chart from a line chart, and is its defining feature, is the
shading between each line and a baseline, similar to a bar chart's baseline.
Overlapping area chart: An overlapping area chart results if we wish to look at how
the values of the various groups compare to one another. The conventional line chart
serves as the foundation for an overlapping area chart. One point is plotted for each
group at each of the horizontal values, and the height of the point indicates the group's
value on the vertical axis variable.
All of the points for a group are connected from left to right by a line. A zero baseline
is supplemented by shading that is added by the area chart between each line. Because
the shading for different groups will typically overlap to some degree, the shading itself
incorporates a degree of transparency to ensure that the lines delineating each group
may be seen clearly at all times.
The shading emphasises the group with the highest value, whose pure hue remains
visible. Take care when one series is always higher than the other, as the plot can then
be confused with the stacked area chart, the other form of area chart. In such
circumstances, the most prudent course of action is to stick to the traditional line chart.
Stacked area chart: The stacked area chart is what is often meant to be conveyed when
the phrase "area chart" is used in general conversation. When creating the chart of
overlapping areas, each line was tinted based on its vertical value all the way down to
a shared baseline. Plotting lines one at a time creates the stacked area chart, which uses
the height of the most recent group of lines as a moving baseline. Therefore, the total
that is obtained by adding up all of the groups' values will correspond to the height of
the line that is entirely piled on top.
When you need to keep track of both the total value and the breakdown of that total by
groups, you should make use of a stacked area chart. This type of chart will allow you
to do both at the same time. By contrasting the heights of the individual curve segments,
we are able to obtain a sense of how the contributions made by the various subgroups
stack up against one another and the overall sum.
Example:
Year Printers Projectors White Boards
2017 32 45 28
2018 47 43 40
2019 40 39 43
2020 37 40 41
2021 39 49 39
(Figure: stacked area chart of the yearly sales of printers, projectors, and white boards.)
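A sketch of the stacked area chart above, drawn with matplotlib from the yearly figures in the table:

import matplotlib.pyplot as plt

years = [2017, 2018, 2019, 2020, 2021]
printers = [32, 47, 40, 37, 39]
projectors = [45, 43, 39, 40, 49]
white_boards = [28, 40, 43, 41, 39]

# Each band is stacked on top of the previous groups' running total
plt.stackplot(years, printers, projectors, white_boards,
              labels=["Printers", "Projectors", "White Boards"], alpha=0.7)
plt.xlabel("Year")
plt.ylabel("Units sold")
plt.legend(loc="upper left")
plt.show()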
Use Cases: In most cases, many lines are drawn on an area chart in order to create a
comparison between different groups (also known as series) or to illustrate how a whole
is broken down into its component pieces. This results in two distinct forms of area
charts, one for each possible application of the chart.
• Magnitude of a single quantitative variable's trend - An increase in a public
company's revenue reserves, programme enrollment from a qualified subgroup by
year, and trends in mortality rates over time by primary causes of death are just a
few examples.
• Comparison of the contributions made by different category members (or groups) - for example, the variation in staff sizes among departments, or support tickets opened for various problems.
• Comparison of two related quantities over time - for example, birth and death rates for a region, cost vs. revenue for a business, or exports vs. imports for a country.
Best Practices:
• To appropriately portray the proportionate difference in the data, start the y-axis at
0.
• To boost readability, choose translucent, contrasting colours.
• When stacking, keep highly variable data at the top of the chart and data with low variability at the bottom.
• If you need to show how each value contributes to a total over time, use a stacked area chart.
• However, if you need to demonstrate a part-to-whole relationship in a situation where the cumulative total itself is unimportant, it is recommended to use a 100% stacked area chart (a short plotting sketch is given after the example below).
Example:
A stacked area chart is also well suited to the tele-services offered by various television-based applications, where different types of subscribers use the services provided by these applications in different months; each subscriber type then forms one stacked series.
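For the 100% stacked variant recommended in the best practices above, a minimal sketch, assuming Python with the Matplotlib library and reusing the yearly sales table as illustrative data, could normalise each year to percentages before stacking:

import matplotlib.pyplot as plt

years        = [2017, 2018, 2019, 2020, 2021]
printers     = [32, 47, 40, 37, 39]
projectors   = [45, 43, 39, 40, 49]
white_boards = [28, 40, 43, 41, 39]

# Convert each year's values into percentages of that year's total,
# so every vertical slice of the chart adds up to 100
totals = [p + q + w for p, q, w in zip(printers, projectors, white_boards)]
as_pct = lambda series: [100 * v / t for v, t in zip(series, totals)]

plt.stackplot(years, as_pct(printers), as_pct(projectors), as_pct(white_boards),
              labels=["Printers", "Projectors", "White Boards"])
plt.ylabel("Share of total (%)")
plt.legend(loc="lower right")
plt.title("100% stacked area chart")
plt.show()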
4.15 SUMMARY
This Unit introduces you to some of the basic charts that are used in data science. The Unit defines the characteristics of histograms, which are very popular for univariate frequency analysis of quantitative variables. It then discusses the importance of box plots and the various terms used in them; box plots are very useful when comparing a quantitative variable over some qualitative characteristic. Scatter plots are used to visualise the relationship between two quantitative variables. The Unit also discusses heat maps, which are excellent visual tools for comparing values. If three variables are to be compared, you may use bubble charts. The Unit further highlights the importance of bar charts, distribution plots, pair plots and line graphs, as well as pie charts, doughnut charts and area charts, for visualising different kinds of data. There are many other kinds of charts used in different analytical tools; you may read about them in the references.
4.16 ANSWERS
4. The box plot distribution will reveal the degree to which the data are clustered, how
skewed they are, and also how symmetrical they are.
• Positively skewed: The box plot is positively skewed if the distance from the median to the maximum is greater than the distance from the median to the minimum.
• Negatively skewed: The box plot is negatively skewed if the distance from the median to the minimum is greater than the distance from the median to the maximum.
• Symmetric: When the median of a box plot is equally spaced from both the maximum and minimum values, the box plot is said to be symmetric.
• The most practical method for displaying bivariate (2-variable) data is a scatter plot.
• A scatter plot can show the direction of a relationship between two variables when
there is an association or interaction between them (positive or negative).
• The linearity or nonlinearity of an association or relationship can be ascertained
using a scatter plot.
• A scatter plot reveals anomalies, questionably measured data, or incorrectly plotted
data visually.
2.
• The Title- A brief description of what is in your graph is provided in the title.
• The Legend- The meaning of each point is explained in the legend.
• The Source- The source explains how you obtained the data for your graph.
• Y-Axis.
• The Data.
• X-Axis.
3. A scatter plot is composed of a horizontal axis containing the measured values of one
variable (independent variable) and a vertical axis representing the measurements of the
other variable (dependent variable). The purpose of the scatter plot is to display what
happens to one variable when another variable is changed.
4.
• Positive Correlation.
• Negative Correlation.
• No Correlation (None)
3. Using one variable on each axis, heatmaps are used to display relationships
between two variables. You can determine if there are any trends in the
values for one or both variables by monitoring how cell colours vary across
each axis.
4. Any bubbles between 0 and 5 pts on this scale will appear at 5 pt, and
all the bubbles on your chart will be between 5 and 20 pts. To construct
a chart that displays many dimensions, combine bubble size with
colour by value.
Answer 2:
Charts are primarily divided into two categories:
Answer 4:
2. The first row shows a scatter plot of a and b, one of a and c, and finally one of a and d. The second row shows b and a (symmetric to the first row), followed by b and c, b and d, and so on. No sums, mean squares or other calculations are performed in a pairs plot: whatever you see in the pairs plot is present in your data frame.
3. Pair plots are used to identify the most distinct clusters or the best combination of features to describe the relationship between two variables. They also help in building simple classification models by drawing simple linear separations between groups in the data set.
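A minimal pair-plot sketch, assuming Python with the seaborn library and its bundled iris dataset as sample data, could be:

import seaborn as sns
import matplotlib.pyplot as plt

# The bundled "iris" dataset is used purely as sample data
df = sns.load_dataset("iris")

# One scatter plot is drawn for every pair of numeric columns; the diagonal
# shows each variable's own distribution, and hue colours the points by class
sns.pairplot(df, hue="species")
plt.show()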
2. Tracking changes over a short as well as a long period of time is one of the
most important applications of line graphs. Additionally, it is utilised to
compare the modifications that have occurred for various groups throughout
the course of the same period of time. When dealing with data that has only
minor variations, using a line graph rather than a bar graph is strongly
recommended. For instance, the finance team at a corporation may wish to
chart the evolution of the cash balance that the company now possesses
throughout the course of time.
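A minimal line-graph sketch of such a cash balance, assuming Python with the Matplotlib library and purely illustrative figures, could be:

import matplotlib.pyplot as plt

# Hypothetical month-by-month cash balance (illustrative values only)
months  = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
balance = [120, 118, 121, 119, 122, 125]

plt.plot(months, balance, marker="o")
plt.xlabel("Month")
plt.ylabel("Cash balance")
plt.title("Cash balance over time")
plt.show()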
3.
1. A pie chart, often referred to as a circle chart, is a style of graph that can be used to summarise a collection of nominal data or to show the different values of a single variable (e.g., a percentage distribution).
2. There are mainly two types of pie charts: the 2D pie chart and the 3D pie chart. These can be further classified into the following categories:
• Pie of Pie
• Bar of Pie
3.
1. The doughnut chart is a more user-friendly alternative to the pie chart and is much simpler to read. These charts express a 'part-to-whole' relationship, in which all of the parts together represent one hundred percent. In comparison with pie charts, they provide more condensed and straightforward representations.
2. A donut chart is similar to a pie chart, with the exception that the centre is cut out. When you want to display particular dimensions, you use arc segments rather than slices. Just like a pie chart, this form of chart can assist you in comparing certain categories or dimensions to the greater whole; nevertheless, it has a few advantages over its pie chart counterpart.
3. Product Sales: x = 60, y = 30, z = 40.
[Figure: Doughnut chart of the above product sales.]
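A minimal sketch that draws such a doughnut chart, assuming Python with the Matplotlib library, could be:

import matplotlib.pyplot as plt

products = ["x", "y", "z"]
sales    = [60, 30, 40]

# Giving each wedge a width smaller than the pie's radius leaves the
# centre empty, which turns the pie into a doughnut
plt.pie(sales, labels=products, autopct="%1.0f%%",
        wedgeprops={"width": 0.4})
plt.title("Product Sales")
plt.show()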
1. An area chart shows how the numerical values of one or more groups change in
proportion to the development of a second variable, most frequently the passage of time.
It combines the features of a line chart and a bar chart. A line chart can be differentiated
from an area chart by the addition of shading between the lines and a baseline, just like
in a bar chart. This is the defining characteristic of an area chart.
3.
[Figure: Area chart for the years 2017-2020; the y-axis runs from 0 to 3500.]
4.17 REFERENCES