2 - Introduction To Statistics
Agenda
❑ What is Statistics?
❑ Data Types & Measurement Level
❑ Types of Statistics
❑ Population VS. Sampling
What is Statistics?
The science of collecting, organizing, presenting, analyzing, and interpreting
data to assist in making more effective decisions.
Data Types & Measurement Level
Data Types
Levels of Measurement
Quantitative Variables
Quantitative variables are characteristics that can be expressed
in numbers. For example, weight, height, and length can all be written
numerically.
• Continuous Variables:
➢ Contain measurements with decimal precision.
➢ Examples: Height of individuals, weight of a bag of apples, time
taken to run a marathon.
➢ Characteristics: Measurable, infinitely many possible values, can
include fractions/decimals.
• Discrete Variables:
➢ Contain counts that must be whole integer values.
➢ Examples: The number of members in a person’s family, or the
number of goals a basketball team scored in a game.
➢ Characteristics: Countable, often integers, clear gaps between
values.
Qualitative Variables
Qualitative variables are characteristics of an individual
or object which can only be expressed in words. Some
examples include ethnicity, profession, or gender.
Ordinal Variables:
➢ Variables whose categories have an inherent
ranking.
➢ Examples: Education level (high school, bachelor's,
master's), customer satisfaction (dissatisfied, neutral,
satisfied).
Nominal Variables:
➢ Variables made up of categories without an inherent
order.
➢ Examples: Gender (male, female), eye color (blue,
green, brown).
Understanding the Variables Using Dataset
Types of Statistics
Descriptive Vs. Inferential
• Descriptive Statistics
The methods of organizing, summarizing, and presenting data in an informative way.
➢ Organizing and summarizing data with frequency tables and frequency distributions.
➢ Presenting frequency tables and distributions with charts and graphs.
➢ Measures to summarize the characteristics of data.
• Inferential Statistics
The methods used to estimate population parameters on the basis of a sample
statistic. To make inferences (statements) about a population based on a sample,
the following concepts are used:
➢ Probabilities and Probability Distribution
➢ Sampling
➢ Estimation
➢ Hypothesis Testing
➢ Correlation and Regression
Descriptive Statistics
Types of Descriptive Statistics
• Measures of Frequency:
➢ Describe how often certain values or ranges of values occur within a dataset.
➢ They are used to understand the distribution and occurrence patterns of the data.
Frequency Table
➢ A frequency table lists a set of values and how often each one appears.
➢ Frequency is the number of times a specific data value occurs in your dataset.
➢ These tables help you understand which data values are common and which are rare.
➢ They organize your data and are an effective way to present the results to others.
➢ Frequency tables are also known as frequency distributions because they allow you to
understand the distribution of values in your dataset.
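As a minimal sketch of a frequency table, the snippet below counts how often each value occurs. The survey responses are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical survey responses, used only for illustration.
responses = ["satisfied", "neutral", "satisfied", "dissatisfied", "satisfied", "neutral"]

# Counter maps each distinct value to the number of times it occurs.
frequency_table = Counter(responses)

# Print the table from the most common value to the least common one.
for value, count in frequency_table.most_common():
    print(f"{value:<14}{count}")
```

Sorting by count makes the common and rare values easy to spot at a glance.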
Descriptive Statistics
Measures of Central Tendency
Central Tendency of Data
❑ A measure of central tendency is a
single value that attempts to describe
a set of data by identifying the central
position within that set of data.
Mean
● Mean is the sum of all the values in the dataset divided by the number of
values in the dataset. It is also called the Arithmetic Average. Mean is
denoted as x̅ and is read as x bar.
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$
The mean is the most widely used measure of central tendency. It is the simple average
of the dataset. Note: it is easily affected by outliers.
•Importance: The mean provides a central value for a dataset, useful for comparing
different data sets and understanding the general trend.
•Application: Used in almost all fields such as finance, economics, healthcare, and social
sciences to summarize data.
Median
● Median is the middle value of sorted data. The data can be sorted in either
ascending or descending order. The median divides the data into two equal halves.
• In an ordered dataset of n values, the median is the number at position (n + 1)/2. If this position is not a
whole number, the median is the simple average of the two numbers at the positions closest to the
calculated value.
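The position rule can be sketched as follows. This is an illustrative helper, not a library function; it assumes a non-empty list and uses made-up numbers.

```python
def median_by_position(values):
    """Median using the (n + 1) / 2 position rule (positions count from 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if position.is_integer():
        # Odd n: the position is a whole number, so take that value directly.
        return ordered[int(position) - 1]
    # Even n: average the two values on either side of the fractional position.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

odd_result = median_by_position([7, 3, 9, 1, 5])   # sorted: 1 3 5 7 9, position 3
even_result = median_by_position([7, 3, 9, 1])     # sorted: 1 3 7 9, position 2.5
print(odd_result, even_result)
```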
The median is the midpoint of the ordered dataset. While it is not as popular as the mean, it is often
used in academia and data science because it is robust to outliers.
•Importance: The median gives the middle value of a dataset, making it useful for understanding
the distribution of data, especially when the data is skewed.
•Application: Often used in income data, housing prices, and other scenarios where outliers may
skew the mean.
Mode
● Mode is the most frequent value or item in the dataset.
● A dataset can generally have one or more than one mode value.
The mode is the value that occurs most often. A dataset can have 0 modes, 1 mode or multiple modes.
The mode is calculated simply by finding the value with the highest frequency.
•Importance: The mode indicates the most frequently occurring value in a dataset, useful for
understanding the most common occurrences.
Example Data 1 [ 10, 13, 15, 16, 13, 15, 11, 13]
Example Data 2 [ 8, 9, 13, 11, 12, 16, 8, 10 ]
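A minimal sketch of finding the mode for the two example datasets. The helper `modes` is hypothetical; it returns every value tied for the highest frequency and, by the convention assumed here, an empty list when no value repeats.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency; empty if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: all values unique means no mode
    return sorted(v for v, c in counts.items() if c == highest)

data_1 = [10, 13, 15, 16, 13, 15, 11, 13]
data_2 = [8, 9, 13, 11, 12, 16, 8, 10]
print(modes(data_1))  # 13 occurs three times
print(modes(data_2))  # 8 occurs twice
```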
Effect of Outliers in Descriptive Statistics
• An outlier is any unusually large or small observation.
• Outliers can have a disproportionate effect on some statistics, such as the mean, which can give a
misleading impression of the data. The median and mode are far less affected.
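To illustrate, the sketch below adds one extreme value to a small hypothetical dataset and compares the mean and median before and after. In this example the mean roughly doubles while the median moves only slightly.

```python
from statistics import mean, median

values = [10, 12, 13, 15, 16]      # hypothetical observations
with_outlier = values + [100]      # the same data plus one unusually large value

print("without outlier:", mean(values), median(values))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```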
Descriptive Statistics
Measure of Dispersion
Difference between Central Tendency and Dispersion
• Central tendency tells you where most of your data points lie, while dispersion summarizes
how far apart your points are from each other.
• Datasets can have the same central tendency but different levels of dispersion, or vice
versa. Together, they give you a complete picture of your data.
• Variance
• Standard deviation
• Percentile
• Range
• Interquartile range
Variance
• Variance is a measure of how far the values in a dataset are dispersed from
their mean or average value. It is denoted as ‘σ²’.
Properties of Variance:
➢ Variance always has squared units. For example, the variance of a set
of weights estimated in kilograms will be given in kg².
➢ Importance: Variance measures the spread of data points around the mean, useful for
understanding data variability.
In Figure 1, the points have a high variance because they are spread out.
In Figure 2, the points have a low variance because they are close together.
How to Calculate Variance Step by Step
• Calculate the Mean (x̄): Find the average of all data points.
• Subtract the Mean from Each Observation (X - x̄): For each data point, subtract
the mean from the data point.
• Square Each of the Resulting Observations ((X - x̄)²): Square the result of
each subtraction.
• Add These Squared Results Together: Sum all the squared values.
• Divide This Total by the Number of Observations (n) (in the case of a
population) to Get Variance (σ²): Divide the sum of squared values by the
number of observations to obtain the variance.
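The five steps above can be sketched directly in code. The data values are hypothetical, and the function computes the population variance (dividing by the number of observations).

```python
def population_variance(data):
    n = len(data)
    mean = sum(data) / n                      # step 1: calculate the mean
    deviations = [x - mean for x in data]     # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]    # step 3: square each deviation
    total = sum(squared)                      # step 4: add the squares together
    return total / n                          # step 5: divide by n

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5
variance = population_variance(values)
std = variance ** 0.5               # standard deviation is the square root
print(variance, std)
```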
Variance and Standard Deviation
• σ² = Variance = 0.57
Standard Deviation
● Standard deviation measures how far data points typically deviate from their mean or average
position. It is denoted by the symbol ‘σ’.
● It is the square root of the mean of the squared deviations from the mean and is also called
the root-mean-square deviation.
● When the data values of a group are similar, the standard deviation will be very low or close to zero.
However, when the data values vary significantly from each other, the standard deviation will be
high or farther from zero.
Standard deviation std (σ)
The std is the most common way to measure the spread of the data and how close data points are to each other.
• Importance: Standard deviation is the square root of variance, providing a measure of data
dispersion that is in the same unit as the data.
$$\sigma = \sqrt{\sigma^2}$$
• Application: Commonly used in finance for assessing investment risk, and in process control
for monitoring production quality.
Population VS. Sampling
• The primary task of inferential statistics is
making an inference about something by using
only an incomplete sample of data.
Sampling
Definition: The process of selecting a subset of individuals from a population to estimate
characteristics of the whole population.
Types of Sampling:
➢ Random Sampling: Every member has an equal chance of being selected.
➢ Stratified Sampling: Population divided into subgroups, and samples are drawn from each.
➢ Systematic Sampling: Selecting every nth member from a list.
➢ Convenience Sampling: Selecting individuals who are easiest to reach, which may introduce bias.
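The first three sampling types can be sketched with Python's `random` module. The population of 100 member IDs and the odd/even strata are hypothetical, chosen only to keep the example short.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 member IDs

# Random sampling: every member has an equal chance of being selected.
random_sample = random.sample(population, 10)

# Systematic sampling: every k-th member after a random starting point.
k = 10
start = random.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: divide into subgroups, then draw from each subgroup.
strata = {
    "odd": [m for m in population if m % 2 == 1],
    "even": [m for m in population if m % 2 == 0],
}
stratified_sample = [m for group in strata.values() for m in random.sample(group, 5)]

print(random_sample, systematic_sample, stratified_sample, sep="\n")
```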
When Should Samples be used?
• When studying a large population where it is impractical or
impossible to collect data from every individual.
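Sample variance (S²) and population variance (σ²):
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \qquad \sigma^2 = \frac{\sum (x - \mu)^2}{N}$$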
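As a brief illustration of the two divisors, Python's standard `statistics` module provides `variance` (divides by n - 1) and `pvariance` (divides by N). The five grades below are illustrative values.

```python
from statistics import pvariance, variance

grades = [85, 90, 78, 92, 88]   # illustrative values

population_var = pvariance(grades)  # treats the data as the whole population
sample_var = variance(grades)       # treats the data as a sample

print(population_var, sample_var)
```

The sample variance is always the larger of the two, because the same sum of squared deviations is divided by a smaller number.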
Range
• The range is the difference between the highest value and the
lowest value of the data. It helps in knowing the spread of the data.
Question
Calculate the range of the given set of data:
7, 47, 8, 42, 47, 95, 42, 96, 2.
A) 90
B) 94
C) 96
D) 100
Range and Interquartile Range
• The interquartile range (IQR) is a better measure of dispersion than the range because it leaves out the
extreme values.
• Quartiles are the three cut points that divide the ordered distribution into four equal parts:
•The 1st quartile (Q1) is the value below which the first 25% of the data lie.
•The 2nd quartile (Q2) is the middle value, with 50% of the data below it.
•The 3rd quartile (Q3) is the value below which 75% of the data lie.
• The 2nd quartile (Q2) divides the distribution into two equal parts of 50%,
so it is the same as the Median.
• The interquartile range is the distance between the third and the first
quartile, or, in other words, IQR = Q3 - Q1.
• The interquartile range is a measure of where the “middle fifty” is in a data set.
• Where a range is a measure of where the beginning and end are in a set, the interquartile range is a
measure of where the bulk of the values lie.
• That’s why it’s preferred over many other measures of spread when reporting things like school
performance or SAT scores.
• The interquartile range formula is the first quartile subtracted from the third quartile:
• IQR = Q3 – Q1.
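A minimal sketch of computing the range and the IQR. It assumes the median-of-halves convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves); other conventions, such as linear interpolation, can give slightly different quartiles. The data is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + n % 2:]    # values above it; skips the middle value when n is odd
    return median(lower), median(ordered), median(upper)

data = [1, 3, 5, 7, 9, 11, 13, 15]   # hypothetical data
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
data_range = max(data) - min(data)
print(q1, q2, q3, iqr, data_range)
```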
Quartile Calculations
Question
Calculate the interquartile range (IQR) of the following data:
17, 18, 18, 19, 20, 21, 21, 23, 25.
A) 4
B) 5
C) 6
D) 7
Question
Find the interquartile range (IQR) using the following values:
• Minimum: 1
• Q1: 3
• Median: 5
• Q3: 7
• Maximum: 9
A) 2
B) 3
C) 4
D) 5
Advantage of IQR
• The main advantage of the IQR is that it is not affected by outliers
because it doesn’t consider observations below Q1 or above Q3.
Box Plot
• Mainly used to describe the center and variability of your data.
• It is also useful for detecting outliers in the data.
• Used to visualize the IQR.
IQR is plotted in a boxplot and probability density
Percentile
• A percentile is a measure used in statistics to indicate the value below which a given
percentage of observations in a group of observations fall.
$$P = \frac{n}{N} \times 100$$
•P is the percentile
•n is the number of values below ‘x’
•N is the total count of the population
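• Arrange the data in ascending order.
• Multiply the number of values by the percentile (as a fraction) to get the rank, e.g. for the
80th percentile of 15 values: 15 * 0.8 = 12.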
Percentile calculation
• A percentile is the value below which a given percent of the values fall.
• Example:
• We have an array of the ages of all the people working in the same office: ages =
[25,31,43,48,50,41,39,60,52,32,27,46,47,55]
• What is the 75th percentile? Using the steps below, the answer is 50, meaning that at least 75% of
the people are 50 or younger.
• Steps:
• Arrange in ascending order: [25,27,31,32,39,41,43,46,47,48,50,52,55,60]
• Find the rank = percentile/100 * (number of values)
• Rank = 0.75*14 = 10.5
• Round up to 11, then subtract 1 for zero-based indexing, so index = 10, which holds the value 50
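The rank-and-round-up steps can be sketched as a small helper. This is the nearest-rank convention, assumed here for illustration; with 14 ages it selects the 11th sorted value. Interpolating methods, which many libraries use by default, would return a value between 48 and 50 instead.

```python
import math

def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile: round the rank up, then index from zero."""
    ordered = sorted(values)
    rank = math.ceil(percent / 100 * len(ordered))   # rank counts from 1
    rank = max(rank, 1)                              # guard for very small percents
    return ordered[rank - 1]

ages = [25, 31, 43, 48, 50, 41, 39, 60, 52, 32, 27, 46, 47, 55]
p75 = percentile_nearest_rank(ages, 75)   # rank = ceil(10.5) = 11
print(p75)
```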
What is Skewness of Data?
• Skewness is a measure of the asymmetry of the probability distribution of a
random variable about its mean. A symmetric distribution, such as the normal distribution, has zero skewness.
• Skewness is positive if the tail of the distribution extends more to the
right, and negative if the tail extends more to the left.
Mean, Median and Mode
• Left-skewed distribution: the value of skewness is less than zero; typically mean < median < mode.
• Normal distribution: the value of skewness is zero; mode = median = mean.
• Right-skewed distribution: the value of skewness is greater than zero; typically mode < median < mean.
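As a rough sketch, skewness can be computed as the average cubed z-score (the population form, one of several conventions). The datasets are hypothetical and chosen so the direction of the tail is obvious.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness: the average cubed z-score."""
    m = mean(values)
    s = pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

right_skewed = [1, 2, 2, 3, 10]             # long tail to the right
left_skewed = [-x for x in right_skewed]    # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3), "mean:", mean(data), "median:", median(data))
```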
Correlation
❑ Used to measure the relationship between two variables, which is important in real life
because it lets us predict the value of one variable from the other.
Direction of correlation :
➢ A correlation of –1 indicates a perfect negative correlation, meaning that as one
variable goes up, the other goes down.
➢ A correlation of +1 indicates a perfect positive correlation , meaning that as one
variable goes up, the other goes up together.
➢ A correlation of zero indicates that there is no linear relationship between the variables.
Scatter plot: visualizes correlation. The horizontal axis represents one variable, and the vertical axis represents the other.
Regression line: the line that best fits the points. The more tightly the points cluster around the regression line, the stronger the correlation.
The closer the correlation is to 0, the weaker it is, while the closer it is to +/-1, the stronger it is.
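A minimal sketch of the Pearson correlation coefficient, written out by hand so each term is visible. The variable names and data are hypothetical; the two score lists are constructed to be perfectly positively and perfectly negatively related to `hours`.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(spread_x * spread_y)

hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]      # rises with hours
score_down = [10, 8, 6, 4, 2]    # falls as hours rise

print(pearson_r(hours, score_up), pearson_r(hours, score_down))
```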
Correlation Coefficient
Example: Housing Data
Task
Calculate the mean, median, mode, variance, and std of these students' grades in the Math and Science subjects.
Student  Math  Science
A        85    78
B        90    88
C        78    84
D        92    91
E        88    76
Math
Mean = (85+90+78+92+88)/5 = 86.6
Median = 88 (the middle value of the sorted grades [78, 85, 88, 90, 92])
Mode = No mode (all grades are unique)
Variance = 23.84 Std = 4.88
Because
85 - 86.6 = -1.6 (-1.6)^2 = 2.56
90 - 86.6 = 3.4 (3.4)^2 = 11.56
78 - 86.6 = -8.6 (-8.6)^2 = 73.96
92 - 86.6 = 5.4 (5.4)^2 = 29.16
88 - 86.6 = 1.4 (1.4)^2 = 1.96
Variance = (2.56 + 11.56 + 73.96 + 29.16 + 1.96)/5 = 23.84
Std = √23.84 = 4.88
Task Solution
Science
Mean = (78+88+84+91+76)/5 = 83.4
Median = 84 (the middle value of the sorted grades [76, 78, 84, 88, 91])
Mode = No mode (all grades are unique)
Variance = 32.64 Std = 5.71
Because
78 - 83.4 = -5.4 (-5.4)^2 = 29.16
88 - 83.4 = 4.6 (4.6)^2 = 21.16
84 - 83.4 = 0.6 (0.6)^2 = 0.36
91 - 83.4 = 7.6 (7.6)^2 = 57.76
76 - 83.4 = -7.4 (-7.4)^2 = 54.76
Variance = (29.16 + 21.16 + 0.36 + 57.76 + 54.76)/5 = 32.64
Std = √32.64 = 5.71
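The hand calculations for both subjects can be cross-checked with Python's standard `statistics` module. This sketch uses `pvariance` and `pstdev`, which divide by N, matching the division by 5 in the task.

```python
from statistics import mean, median, pstdev, pvariance

grades = {
    "Math": [85, 90, 78, 92, 88],
    "Science": [78, 88, 84, 91, 76],
}

summary = {}
for subject, values in grades.items():
    summary[subject] = {
        "mean": mean(values),
        "median": median(values),
        "variance": pvariance(values),  # population variance (divide by N)
        "std": pstdev(values),          # square root of the population variance
    }
    print(subject, summary[subject])
```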
Inferential Statistics
Data Distribution
• A distribution is a function that shows the possible values for a variable and how
often they occur
• In probability theory and statistics, a probability distribution is a mathematical
function that provides the probabilities of the occurrence of different possible
outcomes in an experiment
• It is a common mistake to believe that the distribution is the graph. In fact, the
distribution is the ‘rule’ that determines how values are positioned in relation to
each other. Very often, we use a graph to visualize the data. Since different
distributions have a particular graphical representation, statisticians like to plot
them.
Probability Distribution
Common distributions include:
• Normal Distribution: Symmetrical, bell-shaped distribution characterized by its mean
and standard deviation.
• Binomial Distribution: Represents the number of successes in a fixed number of trials
with a constant probability of success.
• Poisson Distribution: Models the number of events occurring in a fixed interval of
time or space.
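For the two discrete distributions, the probability of a specific count can be sketched from the standard formulas. The coin-flip and event-rate scenarios are hypothetical examples.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """P(exactly k events in an interval with the given average rate)."""
    return rate**k * exp(-rate) / factorial(k)

p_two_heads = binomial_pmf(2, 3, 0.5)   # 3 fair coin flips, exactly 2 heads
p_no_events = poisson_pmf(0, 2.0)       # average of 2 events per interval, none observed
print(p_two_heads, p_no_events)
```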
Normal Distribution
❑ Normal distribution, also known as the Gaussian distribution, is a probability distribution
that is symmetric about the mean, showing that data near the mean are more frequent in
occurrence than data far from the mean (Bell shape).
❑ Applications:
• Biology: Many biological measures are approximately normally distributed, such as height, arm and leg
length, nail length, blood pressure, and thickness of tree bark.
• IQ tests, stock market information
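To make "data near the mean are more frequent" concrete, this sketch uses `statistics.NormalDist` from the Python standard library to compute the share of a normal distribution lying within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} std: {share_within(k):.4f}")
```

The results are the familiar approximate 68%, 95%, and 99.7% shares.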
How is Data Distributed in Normal Distribution?
Hypothesis Testing
➢ A statistical method to make inferences or draw conclusions
about a population based on sample data.
True State    Do not reject H0    Reject H0
H0 True       Correct decision    Type I error
H0 False      Type II error       Correct decision
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, \qquad Z = \frac{\bar{x} - \mu_0}{S / \sqrt{n}}, \qquad T = \frac{\bar{x} - \mu_0}{S / \sqrt{n}}$$
Find the Critical t-value
Example 1 - Efficacy Test for New drug
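$H_A: \mu_{\text{New}} - \mu_{\text{Std}} > 0$
• Experimental (Sample) data:
Sample means: $\bar{y}_{\text{New}}$, $\bar{y}_{\text{Std}}$
Sample standard deviations: $s_{\text{New}}$, $s_{\text{Std}}$
Sample sizes: $n_{\text{New}}$, $n_{\text{Std}}$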
● Type I error - Concluding that the new drug is better than the standard (HA)
when in fact it is no better (H0). Ineffective drug is deemed better.
○ Traditionally α = P(Type I error) = 0.05
● Type II error - Failing to conclude that the new drug is better (HA) when in fact
it is. Effective drug is deemed to be no better.
○ Traditionally a clinically important difference (D) is assigned and sample
sizes chosen so that:
β = P(Type II error | μ1 - μ2 = D) ≤ 0.20
Example 2 - Mean Age of a Certain Population
● Researchers are interested in the mean age of a certain population.
● A random sample of 10 individuals drawn from the population of
interest has a mean of 27.
● Assuming that the population is approximately normally distributed
with variance 20, can we conclude that the mean is different from
30 years? (α = 0.05)
● If the p-value is 0.0340, how can we use it in making a decision?
Solution
1. Data: variable is age, n = 10, x̄ = 27, σ² = 20, α = 0.05
2. Assumptions:
The population is approximately normally distributed with variance 20
3. Hypotheses:
➢ H0 : μ = 30
➢ HA : μ ≠ 30
4. Test Statistic: Z = (x̄ - μ0)/(σ/√n) = (27 - 30)/√(20/10) = -2.12
5. Decision Rule
➢ The alternative hypothesis is HA: μ ≠ 30
➢ Hence we reject H0 if Z > Z_{1-0.025} = Z_{0.975} or Z < -Z_{1-0.025} = -Z_{0.975}
➢ Z_{0.975} = 1.96 (from table D)
6. Decision:
➢ We reject H0, since -2.12 is in the rejection region.
➢ We can conclude that μ is not equal to 30.
➢ Using the p-value, we note that p-value = 0.0340 < 0.05; therefore we reject H0.
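The solution can be reproduced with a short sketch of a two-sided z-test, using `statistics.NormalDist` in place of the printed table. The variable names are illustrative; the numbers come from the example.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, mu_0, variance, alpha = 10, 27, 30, 20, 0.05

# Test statistic: distance of the sample mean from mu_0 in standard-error units.
z = (sample_mean - mu_0) / sqrt(variance / n)

# Two-sided critical value and p-value from the standard normal distribution.
critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * NormalDist().cdf(-abs(z))

reject_h0 = abs(z) > critical
print(round(z, 2), round(critical, 2), round(p_value, 4), reject_h0)
```

Comparing |z| with the critical value and comparing the p-value with α are equivalent ways of reaching the same decision.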
T.DIST Function
We can use Excel’s T.DIST function to calculate the p-value by supplying the test
statistic value and the degrees of freedom as arguments.
Regression
Regression is a statistical method used to understand the relationship between variables.
The primary purpose of regression analysis is to predict a dependent variable (also known
as the outcome, response, or target) based on one or more independent variables (also
known as predictors, features, or explanatory variables).
Types of Regression:
➢ Simple Linear Regression: One independent variable, straight-line relationship.
➢ Multiple Linear Regression: Multiple independent variables predicting one dependent variable.
➢ Logistic Regression: Used for predicting binary outcomes.
Linear Regression
• Linear Regression: The simplest form of regression, where the relationship between the dependent
and independent variable(s) is modeled as a straight line. It can be represented by the equation
𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜖
where:
• 𝑌 is the dependent variable.
• 𝛽0 is the intercept.
• 𝛽1 is the slope (coefficient of the independent variable 𝑥).
• 𝜖 is the error term.
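A minimal sketch of estimating the intercept and slope by ordinary least squares, one common way to fit this model. The data is hypothetical and lies exactly on the line 1 + 2x, so the error term is zero and the fit recovers those values.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept (beta_0) and slope (beta_1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]          # hypothetical: exactly 1 + 2 * x
beta_0, beta_1 = fit_line(x, y)

prediction = beta_0 + beta_1 * 6   # predicted Y for a new x = 6
print(beta_0, beta_1, prediction)
```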
Multiple Linear Regression
• Multiple Linear Regression: An extension of linear regression that uses more than one independent
variable. The model can be written as
Y = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + ⋯ + 𝛽n 𝑥n + 𝜖
where:
• 𝑌 is the dependent variable.
• 𝛽0 is the intercept.
• 𝛽1, …, 𝛽n are the coefficients of the independent variables 𝑥1, …, 𝑥n.
• 𝜖 is the error term.
Predicting Salary:
Scenario: Estimating an employee’s salary based on multiple factors, such as years of
experience, education level, and job role.
Example: Using a dataset that includes years of experience, education level (Bachelor’s,
Master’s, PhD), and job roles (Developer, Manager, Analyst) to predict salaries.
Logistic Regression
Used for binary classification problems, where the dependent variable is categorical (e.g.,
success/failure, yes/no). It uses the logistic function to model the probability of the default class.
Spam Detection:
Scenario: Classifying emails as spam or not spam.
Example: Using logistic regression to build a spam filter that identifies emails as
spam based on features like the presence of certain keywords, sender information, and
email structure.
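The logistic function at the core of the model can be sketched as below. The two features, the weights, and the bias are hypothetical values picked for illustration; in practice they would be learned from labeled emails.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real number to a probability between 0 and 1."""
    return 1 / (1 + exp(-z))

def spam_probability(features, weights, bias):
    """Probability of spam from a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical features: [contains a suspicious keyword, sender is unknown]
weights = [1.5, 2.0]   # assumed values, not learned from real data
bias = -2.0
email = [1, 1]

p = spam_probability(email, weights, bias)
is_spam = p >= 0.5     # classify as spam when the probability reaches 0.5
print(round(p, 3), is_spam)
```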