Statistics for Data Science
Data Science
What is Data?
According to the Oxford dictionary, “Data is distinct pieces of information, usually
formatted in a special way”.
Data is measured, collected, reported, and analysed, and it is often
visualized using graphs, images or other tools. Raw data may be a collection of
numbers or characters before it has been cleaned and corrected by researchers.
The following are the broadly divided steps which are usually performed in any
Predictive analytics / Machine Learning problem.
We also need to remove unnecessary columns and/or rows. One should explore the
data before doing any data cleaning blindly.
5. Data transformation:
For numerical data we may require centering, scaling, or normalization (e.g., log-
normalization) in order to avoid issues like overfitting. We may also require
dimensionality reduction techniques like Principal Component Analysis to address
high-dimensionality issues. We may require one-hot encoding if we have categorical data.
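These transformations can be sketched in plain Python. This is a minimal illustration with made-up values; in practice, libraries such as scikit-learn provide ready-made versions.

```python
def min_max_scale(values):
    """Rescale numeric values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode categorical labels as one-hot vectors (one column per category)."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

ages = [20, 30, 40, 60]
print(min_max_scale(ages))   # [0.0, 0.25, 0.5, 1.0]

colours = ["red", "blue", "red"]
print(one_hot(colours))      # [[0, 1], [1, 0], [0, 1]]
```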
6. Data partition:
We need to split the dataset into a training (known) set and a testing (unknown)
set. The test set is needed to validate the model, i.e., to check the model's
performance on unseen data.
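A minimal train/test split can be written with the standard library alone; the 80/20 ratio and the seed below are illustrative (libraries such as scikit-learn offer a ready-made `train_test_split`).

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the rows, then split them into training and test sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]   # (train, test)

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))   # 8 2
```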
Types of Data:
Generally, data can be classified into two parts:
1. Categorical Data:
Categorical data refers to a data type that can be stored and identified based on
the names or labels given to the values. A process called matching is done to draw out
the similarities or relations between the data, and the values are then grouped accordingly.
Data collected in categorical form is also known as qualitative data. Each data
point can be grouped and labelled according to its qualities, under only one
category.
Example: Marital Status, Political Party, Eye colour
2. Numerical Data:
Numerical data can further be classified into two categories:
Discrete Data:
Discrete data can take only certain, separate values. Discrete information contains only a
finite number of possible values, and those values cannot be subdivided meaningfully.
Here, things can be counted in whole numbers.
We speak of discrete data if the data can only take on certain values. This type of
data cannot be measured, but it can be counted; it represents counts that fall into
distinct categories.
Example: number of students in a class, number of mobiles, number of brothers, etc.
Continuous Data:
Continuous data is data that can be measured. It has an infinite number of
possible values that can occur within a given range.
Continuous data represents measurements, and therefore its values cannot be
counted, but they can be measured.
Example: temperature, time, age, water, currency, etc.
At a more advanced level, we can further classify data into four types:
1. Nominal Data:
Nominal data label variables that have no quantitative value and no inherent
order, so if you change the order of the values, the meaning remains the
same.
Thus, nominal data are observed but not measured, are unordered and non-
equidistant, and have no meaningful zero.
The only operations you can perform on nominal data are to state that one
observation is (or is not) equal to another (equality or inequality), and to use this
to group observations. You cannot order nominal data, so you cannot sort them.
Nor can you perform arithmetic on them, as arithmetic is reserved for
numerical data. With nominal data, you can calculate frequencies, proportions,
percentages, and a central point (the mode).
2. Ordinal Data:
Ordinal data is similar to nominal data, except that its categories can be
ordered (1st, 2nd, etc.). However, the relative distances between adjacent categories
are not necessarily equal.
Ordinal data is observed but not measured, is ordered but non-equidistant, and
has no meaningful zero. Ordinal scales are commonly used for measuring happiness,
satisfaction, etc.
Because ordinal data are ordered, they can be arranged by making basic comparisons
between the categories, for example, greater or less than, higher or lower, and so on.
You cannot do arithmetic with ordinal data, however, because the categories are not
truly numerical. Examples: excellent to poor, grades, opinions, etc.
3. Interval Data:
Interval values represent ordered units that have the same difference.
Therefore, we speak of interval data when we have a variable that contains numeric
values that are ordered and where we know the exact differences between the values.
The problem with interval data is that it has no true zero. For example, with
temperature in Celsius, there is no such thing as “no temperature”. With
interval data, we can add and subtract, but we cannot multiply, divide, or calculate ratios.
Because there is no true zero, many descriptive and inferential statistics cannot be
applied. Example: body temperature.
4. Ratio Data:
Ratio values are also ordered units that have the same difference. Ratio values
are the same as interval values, with the difference that they do have an absolute
zero. Examples: height, weight, length etc.
Statistics:
Statistics is a branch of mathematics that deals with the study of collecting,
analysing, interpreting, presenting, and organizing data in a particular manner. Statistics
is defined as the process of collecting data, classifying data, representing the data for
easy interpretation, and further analysis of data. Statistics also refers to drawing
conclusions from sample data collected using surveys or experiments.
Fields such as psychology, sociology, and geology also rely on statistics to
function.
Mathematical Statistics:
Statistics is used mainly to gain an understanding of the data and focus on
various applications. Statistics is the process of collecting data, evaluating data, and
summarizing it into a mathematical form. Initially, statistics were related to the science
of the state where it was used in the collection and analysis of facts and data about a
country such as its economy, population, etc. Mathematical statistics applies
mathematical techniques like linear algebra, differential equations, mathematical
analysis, and theories of probability.
There are two methods of analysing data in mathematical statistics that are used
on a large scale:
Descriptive Statistics:
The descriptive method of statistics is used to describe the data collected and
summarize the data and its properties using the measures of central tendencies and the
measures of dispersion.
Inferential Statistics:
This method of statistics is used to draw conclusions from the data. Inferential
statistics requires statistical tests performed on samples, and it draws conclusions, for
example, by identifying the differences between two groups. Tests calculate a p-value that is
compared with the significance level α (commonly 0.05). If the p-value is less than α, then
the result is concluded to be statistically significant.
Descriptive Statistics:
The study of numerical and graphical ways to describe and display your data is
called descriptive statistics. It describes the data and helps us understand the features of
the data by summarizing the given sample set or population of data. In descriptive
statistics, we usually take the sample into account.
Statisticians use graphical representations of data to get a clear picture of the data.
Business trends can be analysed easily with these representations; a visual representation
is more effective than presenting huge tables of numbers.
We can describe data along several dimensions:
1. Measures of Central Tendency:
1.1 Mean
The “Mean” is the average of the data.
The average is found by summing up all the values and then dividing
by the number of observations.
Mean = (X1 + X2 + X3 + … + Xn) / n
Example: data = 10, 20, 30, 40, 50 and number of observations = 5
Mean = (10 + 20 + 30 + 40 + 50) / 5 = 150 / 5
Mean = 30
Outliers influence the mean: a single extreme value can pull this measure of central
tendency away from the centre of the data.
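Both the formula and the effect of an outlier can be checked in a few lines of Python; the data reuses the example above, and the outlier value 500 is made up.

```python
data = [10, 20, 30, 40, 50]
mean = sum(data) / len(data)      # (10 + 20 + 30 + 40 + 50) / 5
print(mean)                       # 30.0

# A single extreme value drags the mean away from the centre.
with_outlier = data + [500]
print(sum(with_outlier) / len(with_outlier))   # ≈ 108.33
```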
1.2 Median
The median is the 50th percentile of the data: it is exactly the centre point of the
data.
The median is found by ordering the data, splitting it into two equal parts, and
taking the middle value. It is a robust way to find the centre of the data,
because the median is not affected by outliers.
Example: Odd number of Data – 10,20,30,40,50
Median is 30.
Even number of data points: 10, 20, 30, 40, 50, 60
Find the middle two values and take their mean.
Here 30 and 40 are the middle values:
(30 + 40) / 2 = 35, so the median is 35.
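The odd/even rule translates directly into code; this sketch reuses the two example datasets above.

```python
def median(values):
    """Return the middle value of a sorted copy of the data."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:          # odd count: single middle value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2   # even count: mean of the two middle values

print(median([10, 20, 30, 40, 50]))        # 30
print(median([10, 20, 30, 40, 50, 60]))    # 35.0
```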
1.3 Mode
The mode is the most frequently occurring value in the data.
If a value occurs the highest number of times, it is the mode of that data. If no
value in the data is repeated, then there is no mode. There can be more
than one mode in a dataset if two values share the same highest
frequency.
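A small sketch using `collections.Counter`; note that when every value occurs exactly once, every value ties for “most frequent”, so a fuller implementation would report “no mode” in that case.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3, 4]))   # [2, 3]  (two modes)
print(modes([5, 5, 7, 9]))         # [5]
```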
2. Dispersion of Data:
Dispersion is the “spread of the data”: it measures how far the data values are spread.
In many datasets, the values are located close to the mean; in other
datasets, the values are widely spread out around the mean. The dispersion of data can be
measured by
2.2 Range
The range is the difference between the largest and the smallest value in the data.
Max – Min = Range
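In code, the range is a one-liner; the data values below are made up.

```python
data = [12, 7, 30, 22, 18]
data_range = max(data) - min(data)   # largest value minus smallest value
print(data_range)                    # 23
```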
2.3 Standard Deviation
The standard deviation is always positive or zero. It will be large when the data
values are spread out from the mean.
2.4 Variance
The variance is a measure of variability. It is the average squared deviation from
the mean.
The symbol σ² represents the population variance and the symbol s²
represents the sample variance.
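The distinction between s² (divide by n − 1) and σ² (divide by n) can be sketched as follows; the sample data reuses the earlier example.

```python
import math

def variance(values, sample=True):
    """Average squared deviation from the mean.

    Divides by n - 1 for a sample (s^2) and by n for a population (sigma^2).
    """
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    return ss / (n - 1) if sample else ss / n

data = [10, 20, 30, 40, 50]
print(variance(data))                 # 250.0  (sample variance, s^2)
print(variance(data, sample=False))   # 200.0  (population variance, sigma^2)
print(math.sqrt(variance(data)))      # sample standard deviation, s
```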
3. Shape of Data:
3.1 Symmetric:
In the symmetric shape of the graph, the data is distributed the same on both
sides.
In symmetric data, the mean and median are located close together.
3.2 Skewness:
Skewness is the measure of the asymmetry of the distribution of data.
The data is not symmetrical (i.e.) it is skewed towards one side.
Skewness is classified into two types.
1. Positively skewed
In a Positively skewed distribution, the data values are clustered around the left
side of the distribution and the right side is longer.
The mean and median will be greater than the mode in the positive skew.
2. Negatively skewed
In a Negatively skewed distribution, the data values are clustered around the
right side of the distribution and the left side is longer.
The mean and median will be less than the mode in the negative skew.
3.3 Kurtosis
Kurtosis measures the tailedness of the distribution of data, i.e., how heavy its
tails are. By kurtosis, distributions are classified as follows:
1. Platykurtic
A platykurtic distribution has flat, thin tails. Here the data is distributed
flatly, and the flat tails indicate few outliers in the distribution.
2. Mesokurtic
In a mesokurtic distribution, the data is moderately spread out and matches the
normal distribution.
3. Leptokurtic
In a leptokurtic distribution, the data is very closely concentrated: the peak is
taller than it is wide.
Scatter Plot:
A scatter plot uses dots to represent values for two different numeric variables.
The position of each dot on the horizontal and vertical axis indicates values for an
individual data point. Scatter plots are used to observe relationships between variables.
Scatter plots’ primary uses are to observe and show relationships between two
numeric variables. The dots in a scatter plot not only report the values of individual
data points, but also patterns when the data are taken as a whole.
Types of correlation:
The scatter plot explains the correlation between two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –
Positive Correlation:
When the points in the graph are rising, moving from left to right, then the scatter
plot shows a positive correlation. It means the values of one variable are increasing
with respect to another.
Negative Correlation:
When the points in the scatter graph fall while moving left to right, then it is
called a negative correlation. It means the values of one variable are decreasing with
respect to another.
No Correlation:
When the points are scattered all over the graph and it is difficult to conclude
whether the values are increasing or decreasing, then there is no correlation between the
variables.
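The direction of the relationship seen in a scatter plot can be quantified with the Pearson correlation coefficient; below is a minimal sketch with made-up data (r near +1 means positive correlation, near −1 negative, near 0 no correlation).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two numeric variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ 1.0  (positive correlation)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1.0 (negative correlation)
```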
Inferential Statistics:
Inferential statistics is a branch of statistics that makes the use of various
analytical tools to draw inferences about the population data from sample data. Apart
from inferential statistics, descriptive statistics forms another branch of statistics.
Inferential statistics help to draw conclusions about the population while descriptive
statistics summarizes the features of the data set.
There are two main types of inferential statistics - hypothesis testing and
regression analysis. The samples chosen in inferential statistics need to be
representative of the entire population. In this article, we will learn more about
inferential statistics, its types, examples, and see the important formulas.
For example, instead of measuring the entire school’s exam marks, we measure a smaller
sample of students (for example, a sample of 50 students). This sample of 50 students is
then used to describe the complete population of all students of that school.
Simply put, Inferential Statistics make predictions about a population based on a
sample of data taken from that population.
2. Hypothesis testing:
Hypothesis testing is very beneficial when we want to gather data on something that
can only be given to a very confined population, such as a new drug. If we want to know
whether this drug will work for all patients (the complete population), we can use the
data collected to draw that inference (often by calculating a z-score).
3. Confidence Interval:
The confidence interval is the range of values within which you expect your estimate
to fall a certain percentage of the time, if you were to run your experiment again or
resample the population in the same way.
The confidence interval for a mean is given by:
CI = X̄ ± Z × (σ / √n)
Where:
• CI = the confidence interval
• X̄ = the sample mean
• Z = the critical value of the z-distribution
• σ = the population standard deviation
• √n = the square root of the sample size
Your desired confidence level is usually one minus the alpha (α) value you used
in your statistical test:
Confidence level = 1 – α
So, if you use an alpha value of p < 0.05 for statistical significance, then your
confidence level would be 1 − 0.05 = 0.95, or 95%.
Confidence intervals can be calculated for a variety of estimates, including:
• Proportions
• Population means
• Differences between population means or proportions
• Estimates of variation among groups
These are all point estimates, and don’t give any information about the variation
around the number. Confidence intervals are useful for communicating the variation
around a point estimate.
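As a sketch, a 95% confidence interval for a mean with known σ follows CI = X̄ ± Z·σ/√n; the sample mean, σ, and n below are hypothetical, and Z = 1.96 is the 95% critical value.

```python
import math

def confidence_interval(sample_mean, sigma, n, z=1.96):
    """CI for a mean when the population standard deviation (sigma) is known."""
    margin = z * sigma / math.sqrt(n)   # margin of error
    return sample_mean - margin, sample_mean + margin

low, high = confidence_interval(sample_mean=35.0, sigma=5.0, n=100)
print(round(low, 2), round(high, 2))   # 34.02 35.98
```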
However, the British people surveyed had a wide variation in the number of
hours watched, while the Americans all watched similar amounts.
Even though both groups have the same point estimate (average number of hours
watched), the British estimate will have a wider confidence interval than the American
estimate because there is more variation in the data.
To find the total sample variance (s²), subtract your sample mean from each value in
the dataset, square each result, add up all of the squared deviations, and divide the
sum by n − 1. For larger sample sets, it’s easiest to do this in Excel.
Example:
In the television-watching survey, the variance in the GB estimate is 100, while the
variance in the USA estimate is 25. Taking the square root of the variance gives us a
sample standard deviation (s) of:
• 10 for the GB estimate.
• 5 for the USA estimate.
The central limit theorem is useful when analysing large data sets because it allows
one to assume that the sampling distribution of the mean will be normally distributed in
most cases. This allows for easier statistical analysis and inference.
Hypothesis Testing:
Hypothesis testing is a part of statistics in which we make assumptions about a
population parameter. Hypothesis testing thus specifies a formal procedure, based on
analysing a random sample of the population, for accepting or rejecting the assumption.
It is a way of making sense of assumptions by looking at the sample data.
Types of Hypotheses:
The best way to determine whether a statistical hypothesis is true would be to
examine the entire population. Since that is often impractical, researchers typically
examine a random sample from the population. If sample data are not consistent with
the statistical hypothesis, the hypothesis is rejected.
There are two types of statistical hypotheses.
• Null Hypothesis. The null hypothesis, denoted by H0, is usually the
hypothesis that sample observations result purely from chance.
• Alternative Hypothesis. The alternative hypothesis, denoted by H1 or Ha, is
the hypothesis that sample observations are influenced by some non-random
cause.
If the p-value of the test statistic is less than the significance level, we reject the
null hypothesis; otherwise, we fail to reject the null hypothesis.
Technically, we never accept the null hypothesis: we say either that we fail to reject
or that we reject the null hypothesis.
P-value:
The p-value is defined as the probability of seeing a test statistic as extreme as the
calculated value if the null hypothesis is true. A low enough p-value is grounds
for rejecting the null hypothesis: we reject the null hypothesis if the p-value is less
than the significance level.
Z-test:
A z-test is used on data that follows a normal distribution and has a sample size
greater than or equal to 30. It is used to test whether the means of the sample and
population are equal when the population variance is known. The right-tailed hypothesis
can be set up as follows:
H0: μ = μ0 versus H1: μ > μ0, with test statistic z = (X̄ − μ0) / (σ / √n)
The z-test is mainly used when the population mean and standard deviation are
given.
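A one-sample z-test can be computed with the standard library alone, using `math.erf` for the normal CDF; the sample numbers below are hypothetical, and the p-value here is two-tailed.

```python
import math

def z_test(sample_mean, mu, sigma, n):
    """One-sample z statistic and its two-tailed p-value (normal CDF via erf)."""
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) under the standard normal distribution.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

z, p = z_test(sample_mean=52.0, mu=50.0, sigma=10.0, n=100)
print(round(z, 2), round(p, 4))   # 2.0 0.0455  -> reject H0 at alpha = 0.05
```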
T-test:
A t-test is used when the data follows a Student’s t-distribution and the sample size
is less than 30. It is used to compare the sample and population means when the
population variance is unknown. The test statistic is given as follows:
t = (X̄ − μ) / (s / √N)
where X̄ is the sample mean, μ the population mean, s the sample standard
deviation, and N the sample size.
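The t statistic in the formula above can be computed directly; the sample values are made up, and only the statistic is shown, since turning it into a p-value needs the t-distribution’s CDF (e.g., from scipy).

```python
import math

def t_statistic(sample, mu):
    """One-sample t statistic: (x_bar - mu) / (s / sqrt(N))."""
    n = len(sample)
    x_bar = sum(sample) / n
    s = math.sqrt(sum((v - x_bar) ** 2 for v in sample) / (n - 1))
    return (x_bar - mu) / (s / math.sqrt(n))

sample = [12, 14, 15, 15, 16, 18]
print(round(t_statistic(sample, mu=13), 3))   # 2.449
```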
• Paired T-test:
If our samples are connected in some way, we have to use the paired t-test.
Here, “connected” means that we collect data from the same group two
times, e.g., blood tests of hospital patients before and after medication.
Chi-Square test:
The Chi-square test is used in the case when we have to compare categorical
data.
The Chi-square test is of two types. Both use chi-square statistics and distribution
for different purposes.
• Goodness of fit: it determines whether sample data of a categorical variable
matches a hypothesized population distribution.
• Test of independence: it compares two categorical variables to find out whether
they are related to each other or not.
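The chi-square statistic for a test of independence can be sketched as follows; the 2×2 table of counts is made up, and comparing the statistic against the chi-square distribution (to get a p-value) would need a table or a library such as scipy.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table (test of independence)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: preference (yes/no) by group (A/B).
table = [[30, 10],
         [20, 40]]
print(round(chi_square(table), 3))   # ≈ 16.667
```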
Z-test vs. t-test:
• Basic definition: a z-test is a kind of hypothesis test which ascertains whether the
averages of two datasets differ from each other when the standard deviation or
variance is given. A t-test is a kind of parametric test applied to identify how the
averages of two datasets differ from each other when the standard deviation or
variance is not given.
• Sample size: for a z-test the sample size is large; for a t-test the sample size is small.
• Key assumptions: z-test — all data points are independent, and Z follows a normal
distribution with mean zero and variance one. t-test — data points are not
dependent, and sample values are recorded and taken accurately.
• Based upon (type of distribution): the z-test is based on the normal distribution;
the t-test is based on the Student’s t-distribution.
One-tailed vs. two-tailed test:
• Result: a one-tailed test checks whether a value is greater (or less) than a certain
value; a two-tailed test checks whether a value falls outside a certain range of
values.
Sample Vs Population:
Sample:
A sample is the specific group that you will collect data from. The size of the
sample is always less than the total size of the population.
You should calculate the sample standard deviation when the dataset you’re
working with represents a sample taken from a larger population of interest. The
sample standard deviation is denoted by s.
The statistical measures we calculate on the sample itself are called
descriptive statistics.
Population:
A population is the entire group that you want to draw conclusions about.
You should calculate the population standard deviation when the dataset you’re
working with represents an entire population, i.e., every value that you’re interested in.
The population standard deviation is denoted by σ.
When we apply additional statistical theory to sample data in order to estimate
properties of the population, we are doing inferential statistics.
• Practicality:
It’s easier and more efficient to collect data from a sample.
• Cost-effectiveness:
There are fewer participant, laboratory, equipment, and researcher costs
involved.
• Manageability:
Storing and running statistical analyses on smaller datasets is easier and
more reliable.
2. Systematic sampling:
Systematic sampling is similar to simple random sampling, but it is usually
slightly easier to conduct. Every member of the population is listed with a number, but
instead of randomly generating numbers, individuals are chosen at regular intervals.
Example: All employees of the company are listed in alphabetical order. From the
first 10 numbers, you randomly select a starting point: number 6. From number 6
onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on), and you
end up with a sample of 100 people.
If you use this technique, it is important to make sure that there is no hidden
pattern in the list that might skew the sample. For example, if the HR database groups
employees by team, and team members are listed in order of seniority, there is a risk
that your interval might skip over people in junior roles, resulting in a sample that is
skewed towards senior employees.
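The interval-based selection described above can be sketched as follows; the 1000-person population, the sample size of 100, and the seed are illustrative.

```python
import random

def systematic_sample(population, n, seed=42):
    """Pick every k-th member after a random start within the first k members."""
    k = len(population) // n                  # sampling interval
    start = random.Random(seed).randrange(k)  # random starting point in the first k
    return population[start::k][:n]

employees = [f"employee_{i}" for i in range(1000)]
sample = systematic_sample(employees, n=100)
print(len(sample))   # 100
```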
3. Stratified sampling:
Stratified sampling involves dividing the population into subpopulations that may
differ in important ways. It allows you to draw more precise conclusions by ensuring that
every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called
strata) based on the relevant characteristic (e.g., gender, age range, income bracket, job
role).
Based on the overall proportions of the population, you calculate how many
people should be sampled from each subgroup. Then you use random or systematic
sampling to select a sample from each subgroup.
Example: The company has 800 female employees and 200 male employees. You
want to ensure that the sample reflects the gender balance of the company, so you sort
the population into two strata based on gender. Then you use random sampling on each
group, selecting 80 women and 20 men, which gives you a representative sample of
100 people.
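The example above (800 women, 200 men, sample of 100) can be sketched as proportional stratified sampling; the employee records and seed are made up.

```python
import random

def stratified_sample(population, key, n, seed=42):
    """Sample n items, drawn proportionally from each stratum defined by key."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        # Proportional allocation: stratum share of the population times n.
        k = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical workforce: 800 women ("F") and 200 men ("M").
employees = [("F", i) for i in range(800)] + [("M", i) for i in range(200)]
sample = stratified_sample(employees, key=lambda e: e[0], n=100)
print(len(sample))                             # 100
print(sum(1 for e in sample if e[0] == "F"))   # 80
```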
4. Cluster sampling:
Cluster sampling also involves dividing the population into subgroups, but each
subgroup should have similar characteristics to the whole sample. Instead of sampling
individuals from each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled
cluster. If the clusters themselves are large, you can also sample individuals from
within each cluster using one of the techniques above. This is called multistage
sampling.
This method is good for dealing with large and dispersed populations, but there is
more risk of error in the sample, as there could be substantial differences between
clusters. It’s difficult to guarantee that the sampled clusters are really representative of
the whole population.
Example: The company has offices in 10 cities across the country (all with
roughly the same number of employees in similar roles). You don’t have the capacity to
travel to every office to collect your data, so you use random sampling to select 3
offices – these are your clusters.
1. Convenience sampling:
A convenience sample simply includes the individuals who happen to be most
accessible to the researcher.
This is an easy and inexpensive way to gather initial data, but there is no way to
tell if the sample is representative of the population, so it can’t produce generalizable
results.
Example: You are researching opinions about student support services in your
university, so after each of your classes, you ask your fellow students to complete
a survey on the topic. This is a convenient way to gather data, but as you only surveyed
students taking the same classes as you at the same level, the sample is not
representative of all the students at your university.
3. Purposive sampling:
4. Snowball sampling
When two six-sided dice are rolled, the possible outcomes and their probabilities
are as follows:
Bernoulli Distribution:
A discrete probability distribution for a random experiment that has only two
possible outcomes (a Bernoulli trial) is known as the Bernoulli distribution.
Example: whether India will win the cricket World Cup or not.
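The Bernoulli probability mass function is simple enough to write directly; the 0.6 win probability below is purely hypothetical.

```python
def bernoulli_pmf(k, p):
    """P(X = k) for a single Bernoulli trial with success probability p."""
    return p if k == 1 else 1 - p

# Hypothetical: suppose the probability that India wins is 0.6.
print(bernoulli_pmf(1, 0.6))   # 0.6  (win)
print(bernoulli_pmf(0, 0.6))   # 0.4  (no win)
```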
Binomial Distribution:
Bernoulli vs. Binomial:
Poisson Distribution:
A discrete probability distribution that measures the probability of a given number of
events occurring over a specific period of time is known as the Poisson distribution.
Example: the probability of asteroid collisions over a selected period of a year.
• Used to predict the probability of the number of successful events.
• A random variable X is Poisson distributed if its distribution function is given by:
P(X = k) = (λ^k × e^(−λ)) / k!, for k = 0, 1, 2, …
Note: for the Poisson distribution, mean = variance (= λ).
Let’s understand the Poisson distribution with an example: the number of patients
visiting a hospital.
Problem statement:
Patients arrive at a hospital at an expected rate of 6 per day. What is the
probability that exactly five patients will visit the hospital on a given day?
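The answer follows from the Poisson probability mass function with λ = 6 and k = 5:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

# Expected 6 patients per day; probability of exactly 5 visits in a day:
print(round(poisson_pmf(5, 6), 4))   # 0.1606
```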
Poisson vs. Binomial:
Normal Distribution:
Empirical Rule:
The Empirical Rule is often called the 68–95–99.7 rule or the three-sigma rule. It
states that for a normal distribution:
• 68% of the data will lie within one standard deviation of the mean
• 95% of the data will lie within two standard deviations of the mean
• 99.7% of the data will lie within three standard deviations of the mean
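The three percentages follow from the normal CDF and can be verified with `math.erf` (the fraction of a standard normal distribution within k standard deviations of the mean is erf(k/√2)):

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_coverage(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```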
Poisson vs. Normal:
The End