2 - Introduction To Statistics

Uploaded by

Hafsa Zahran
Copyright
© © All Rights Reserved

Introduction to Statistics

1
Agenda

❑ What is Statistics?
❑ Data Types & Measurement Level
❑ Types of Statistics
➢ Descriptive Vs. Inferential
❑ Descriptive Statistics
➢ Types of Descriptive Statistics
➢ Measures of Central Tendency
➢ Measure of Dispersion
❑ Population VS. Sampling
❑ Inferential Statistics
➢ Data Distribution
➢ Hypothesis Testing
❑ Regression
➢ Types of Regression

2
What is Statistics?

3
What is Statistics?
The science of collecting, organizing, presenting, analyzing, and interpreting
data to assist in making more effective decisions.

4
Data Types & Measurement Level

5
Data Types

6
Levels of Measurement

7
Quantitative Variables
Quantitative variables are characteristics that can be expressed
in numbers. For example, weight, height, and length can all be written
numerically.

8
Quantitative Variables
• Continuous Variables:
➢ Contain measurements with decimal precision.
➢ Examples: Height of individuals, weight of a bag of apples, time
taken to run a marathon.
➢ Characteristics: Measurable, infinitely many possible values, can
include fractions/decimals.
• Discrete Variables:
➢ Contain counts that must be whole integer values.
➢ Examples: The number of members in a person’s family, or the
number of goals a basketball team scored in a game.
➢ Characteristics: Countable, often integers, clear gaps between
values.
9
Qualitative Variables
Qualitative variables are characteristics of an individual
or object which can only be expressed in words. Some
examples include ethnicity, profession, or gender.

Ordinal Variables:
➢ Variables that are groups containing an inherent
ranking.
➢ Examples: Education level (high school, bachelor's,
master's), customer satisfaction (dissatisfied, neutral,
satisfied).
Nominal Variables:
➢ Variables made up of categories without an inherent
order.
➢ Examples: Gender (male, female), eye color (blue,
green, brown).

10
Understanding the Variables Using Dataset

11
Understanding the Variables Using Dataset

12
Types of Statistics
Descriptive Vs. Inferential

13
Types of Statistics
• Descriptive Statistics
The methods of organizing, summarizing, and presenting data in an informative way.
Organizing and summarizing data with frequency tables and frequency distributions.
Presenting frequency tables and distributions with charts and graphs.
Measures to summarize the characteristics of data.

• Inferential Statistics
The methods used to estimate population parameters on the basis of a sample
statistic. To make inferences (statements) about a population based on a sample,
the following concepts are used:
Probabilities and Probability Distribution
Sampling
Estimation
Hypothesis Testing
Correlation and Regression

14
Descriptive Statistics
Types of Descriptive Statistics

15
16
Types of Descriptive Statistics
• Measures of Frequency:
• Describe how often certain values or ranges of values occur within a dataset.
• They are used to understand the distribution and occurrence patterns of the data.

• Measures of Central Tendency:


• These are ways of describing the central position of a frequency distribution for a
group of data, such as mean, median, and mode.

• Measures of Dispersion or Variation:


• These are ways of summarizing a group of data by describing how spread out the
scores are.
• Examples include range, interquartile range, variance, percentile, quartile,
and standard deviation.

17
Frequency Table
➢ A frequency table lists a set of values and how often each one appears.
➢ Frequency is the number of times a specific data value occurs in your dataset.
➢ These tables help you understand which data values are common and which are rare.
➢ They organize your data and are an effective way to present the results to others.
➢ Frequency tables are also known as frequency distributions because they allow you to
understand the distribution of values in your dataset.
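
A frequency table like the one described above can be sketched in a few lines of Python. The survey responses are hypothetical and not taken from the slides; `collections.Counter` is one convenient way to count values, not the only one.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data, not from the slides).
responses = ["satisfied", "neutral", "satisfied",
             "dissatisfied", "satisfied", "neutral"]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(responses)

# Print the table from the most frequent value to the least frequent.
for value, count in frequency.most_common():
    print(f"{value:<14}{count}")
```

Reading the table top to bottom shows which values are common and which are rare.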

18
Descriptive Statistics
Measures of Central Tendency

19
Central Tendency of Data
❑ A measure of central tendency is a
single value that attempts to describe
a set of data by identifying the central
position within that set of data.

❑ Measures of Central Tendency:

20
Mean
● Mean is the sum of all the values in the dataset divided by the number of
values in the dataset. It is also called the Arithmetic Average. Mean is
denoted as x̅ and is read as x bar.

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$

21
Mean
The mean is the most widely used measure of central tendency. It is the simple average
of the dataset. Note: it is easily affected by outliers.

•Importance: The mean provides a central value for a dataset, useful for comparing
different data sets and understanding the general trend.

•Application: Used in almost all fields such as finance, economics, healthcare, and social
sciences to summarize data.

Example Data 1 [ -10, 0, 10, 20, 30 ] For data 1: (-10 + 0 + 10 + 20 + 30 ) / 5 = 10


Example Data 2 [ 8, 9, 10, 11, 12 ] For data 2: (8 + 9 + 10 + 11 + 12 ) / 5 = 10
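
The two examples can be checked with a short Python sketch. The helper name `mean` and the extra dataset with an outlier are illustrative assumptions; only `data_1` and `data_2` come from the slide.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

data_1 = [-10, 0, 10, 20, 30]
data_2 = [8, 9, 10, 11, 12]

print(mean(data_1))  # 10.0
print(mean(data_2))  # 10.0

# Hypothetical extension: one outlier added to data_2 pulls the mean upward.
data_2_with_outlier = data_2 + [100]
print(mean(data_2_with_outlier))  # 25.0
```

Both datasets share the same mean even though their values are spread very differently, and a single extreme value is enough to move the mean a long way.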

22
Median
● Median is the middle value for sorted data. The sorting of the data can be done either in
ascending order or descending order. A median divides the data into two equal halves.

• In an ordered dataset of n values, the median is the number at position $\frac{n+1}{2}$. If this position is not a
whole number, the median is the simple average of the two numbers at the positions closest to the
calculated value.

23
Median
The median is the midpoint of the ordered dataset. While it is not as popular as the mean, it is often
used in academia and data science because it is largely unaffected by outliers.
•Importance: The median gives the middle value of a dataset, making it useful for understanding
the distribution of data, especially when the data is skewed.

•Application: Often used in income data, housing prices, and other scenarios where outliers may
skew the mean.

Example :
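
A minimal Python sketch of the median rule, using two small hypothetical datasets (one with an odd and one with an even number of values). The function name `median` is an illustrative choice.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole number, so take that value.
        return ordered[mid]
    # Even n: average the two values closest to position (n + 1) / 2.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [12, 3, 7, 9, 5]   # sorted: 3, 5, 7, 9, 12
even_data = [12, 3, 7, 9]     # sorted: 3, 7, 9, 12

print(median(odd_data))   # 7
print(median(even_data))  # 8.0
```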

24
Mode
● Mode is the most frequent value or item in the dataset.

● A dataset can generally have one or more than one mode value.

25
Mode
The mode is the value that occurs most often. A dataset can have 0 modes, 1 mode or multiple modes.
The mode is calculated simply by finding the value with the highest frequency.

•Importance: The mode indicates the most frequently occurring value in a dataset, useful for
understanding the most common occurrences.

•Application: Common in market research, quality control, and inventory management.

Example Data 1 [ 10, 13, 15, 16, 13, 15, 11, 13]
Example Data 2 [ 8, 9, 13, 11, 12, 16, 8, 10 ]

For data 1: 13 is the most frequent


For data 2: 8 is the most frequent
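
The same result can be reproduced with a short Python sketch. The helper `modes` is a hypothetical name; it returns every value tied for the highest frequency, and an empty list when all values are unique (the "0 modes" case).

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency, or [] if none repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

data_1 = [10, 13, 15, 16, 13, 15, 11, 13]
data_2 = [8, 9, 13, 11, 12, 16, 8, 10]

print(modes(data_1))  # [13]
print(modes(data_2))  # [8]
```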
26
Differences between Mean, Median and Mode

27
Effect of Outliers in Descriptive Statistics
• An outlier is any unusually large or small observation.
• Outliers can have a disproportionate effect on some statistics, such as the mean, which can
give a misleading impression. The median and mode are far less sensitive to them.

Mean: 6,915 $ → 7,676 $

Median: 7,200 $ → 7,200 $

28
Effect of Outliers in Descriptive Statistics

29
Descriptive Statistics
Measure of Dispersion

30
Difference between Central Tendency and Dispersion
• Central tendency tells you where most of your data points lie, while dispersion summarizes
how far apart your points are from each other.
• Datasets can have the same central tendency but different levels of dispersion, or vice
versa. Together, they give you a complete picture of your data.

Central Tendency Spread in data


31
• This example shows that central tendency alone is not enough to describe data.
The three datasets have the same mean and median but different values, so we
also need to measure dispersion.

Dataset 1: Mean = 20, Median = 20
Dataset 2: Mean = 20, Median = 20
Dataset 3: Mean = 20, Median = 20


32
Measure of Dispersion

• Variance

• Standard deviation

• Percentile

• Range

• Interquartile range

33
Variance
• Variance is a measure of how far a set of data are dispersed from
their mean or average value. It is denoted as ‘σ²’.

Properties of Variance:

➢ It is always non-negative since each term in the variance sum is squared,
and therefore, the result is either positive or zero.

➢ Variance always has squared units. For example, the variance of a set
of weights estimated in kilograms will be given in kg².

➢ Since the population variance is squared, we cannot compare it directly
with the mean or the data themselves.
34
Variance (σ²)
Variance measures the dispersion of a set of data points around its mean value.

➢ Importance: Variance measures the spread of data points around the mean, useful for
understanding data variability.

➢ Application: Used in risk management, investment analysis, and quality control.

In Figure 1, the points have a high variance because they are spread out.
In Figure 2, the points have a low variance because they are close together.

35
How to Calculate Variance Step by Step
• Calculate the Mean (x̄): Find the average of all data points.

• Subtract the Mean from Each Observation (X - x̄): For each data point, subtract
the mean from the data point.

• Square Each of the Resulting Observations ((X - x̄)²): Square the result of
each subtraction.

• Add These Squared Results Together: Sum all the squared values.

• Divide This Total by the Number of Observations (n) (in the case of a
population) to Get Variance (σ²): Divide the sum of squared values by the
number of observations to obtain the variance.
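
The five steps can be followed one by one in Python. This is a sketch for the population case; the dataset is hypothetical and chosen so that the numbers stay simple.

```python
def population_variance(values):
    """Population variance computed step by step."""
    n = len(values)
    mean = sum(values) / n                      # step 1: calculate the mean
    deviations = [x - mean for x in values]     # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]      # step 3: square each deviation
    total = sum(squared)                        # step 4: add the squares together
    return total / n                            # step 5: divide by n

# Hypothetical data: mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
```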

36
Variance and Standard Deviation

• Average = 4/7 = 0.57

• σ² = Variance = 0.57

37
Standard Deviation
● Standard deviation measures the deviation of data from its mean or average position. The degree of
dispersion is computed by estimating the deviation of data points. It is denoted by the symbol ‘σ’.

● Properties of Standard Deviation:

● It is the square root of the mean of the squared deviations of the values from their mean, and is also called
the root-mean-square deviation.

● The smallest value of the standard deviation is 0 since it cannot be negative.

● When the data values of a group are similar, the standard deviation will be very low or close to zero.
However, when the data values vary significantly from each other, the standard deviation will be
high or farther from zero.

38
Standard deviation std (σ)
The std is the most common way to measure the spread of the data and how close data points are to each other.

• Importance: Standard deviation is the square root of variance, providing a measure of data
dispersion that is in the same unit as the data: $\sigma = \sqrt{\sigma^2}$
• Application: Commonly used in finance for assessing investment risk, and in process control
for monitoring production quality.
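
As a small illustration of the square-root relationship, the sketch below uses Python's standard `statistics` module on a hypothetical dataset. `pvariance` and `pstdev` are the population versions, which divide by N.

```python
import math
import statistics

# Hypothetical data (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)   # population variance
std = math.sqrt(variance)               # standard deviation = square root of variance

print(std)                      # 2.0
print(statistics.pstdev(data))  # the library's own population standard deviation
```

Because the standard deviation is in the same unit as the data, it can be compared directly with the mean, which the variance cannot.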

39
Variance and standard deviation

40
Population VS. Sampling

41
Population Vs. Sampling
• The primary task of inferential statistics is
making an inference about something by using
only an incomplete sample of data.

• Population: a collection of all possible
individuals, objects, or measurements of interest
(To understand the whole collection of data).

• Sample: a portion or part of the population of
interest (To make inferences about the whole
based on the subset).

42
Sampling
Definition: The process of selecting a subset of individuals from a population to estimate
characteristics of the whole population.
Types of Sampling:
➢ Random Sampling: Every member has an equal chance of being selected.
➢ Stratified Sampling: Population divided into subgroups, and samples are drawn from each.
➢ Systematic Sampling: Selecting every nth member from a list.
➢ Convenience Sampling: Selecting individuals who are easiest to reach, which may introduce bias.
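
The four sampling types can be sketched with Python's standard `random` module. The population of 100 member IDs, the two strata, and the sample sizes are all hypothetical choices made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population: member IDs 1..100

# Random sampling: every member has an equal chance of being selected.
simple = random.sample(population, 10)

# Systematic sampling: every nth member after a random starting point.
step = 10
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: split into subgroups, then draw from each subgroup.
strata = {"low_ids": population[:50], "high_ids": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

# Convenience sampling: take whoever is easiest to reach (here, the first ten).
convenience = population[:10]

print(simple, systematic, stratified, convenience, sep="\n")
```

The convenience sample only ever contains the lowest IDs, which is the kind of bias the slide warns about.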

43
When Should Samples be used?
• When studying a large population where it is impractical or
impossible to collect data from every individual.

• When resources such as time, cost, and manpower are limited,
making it more feasible to collect data from a subset of the
population.

• When conducting research or experiments where it is important
to minimize potential biases in data collection.

44
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Population variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
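
The only difference between the two formulas is the divisor, which the following sketch makes explicit. The data are hypothetical; the comparison with `statistics.variance` and `statistics.pvariance` is included as a cross-check against the standard library.

```python
import statistics

def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def sample_variance(values):
    return sum_of_squares(values) / (len(values) - 1)   # divide by n - 1

def population_variance(values):
    return sum_of_squares(values) / len(values)         # divide by N

# Hypothetical data (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance(data), statistics.variance(data))
print(population_variance(data), statistics.pvariance(data))
```

The sample variance is always slightly larger than the population variance computed from the same numbers, and the gap shrinks as n grows.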
45
Range
• The range is the difference between the highest value and the
lowest value of the data. It helps in knowing the spread of the data.

• Application: Widely used in everyday scenarios such as evaluating
temperature variations over a week, comparing prices of products, and
understanding the performance spread of students in a test.
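
As a quick illustration of the temperature scenario, the sketch below computes the range of a hypothetical week of daily temperatures.

```python
# Hypothetical daily high temperatures for one week, in °C.
temperatures = [18, 21, 25, 19, 23, 27, 20]

# Range = highest value minus lowest value.
temperature_range = max(temperatures) - min(temperatures)
print(temperature_range)  # 9
```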

46
Question
Calculate the range of the given set of data:
7, 47, 8, 42, 47, 95, 42, 96, 2.

A) 90
B) 94
C) 96
D) 100
47
Range and Interquartile Range
• The interquartile range (IQR) is a better measure of dispersion than the range because it leaves out the
extreme values.

• Quartiles divide the ordered distribution into four equal parts:
•The 1st quartile (Q1) marks the first 25% of the data.
•The 2nd quartile (Q2) marks the middle (50%).
•The 3rd quartile (Q3) marks 75% of the data, leaving the last 25% above it.

• The 2nd quartile (Q2) divides the distribution into two equal parts of 50%,
so it is the same as the Median.

• The interquartile range is the distance between the third and the first
quartile, or, in other words, IQR = Q3 - Q1.

48
Range and Interquartile Range
• The interquartile range is a measure of where the “middle fifty percent” is in a data set.

• Where a range is a measure of where the beginning and end are in a set, the interquartile range is a
measure of where the bulk of the values lie.
• That’s why it’s preferred over many other measures of spread when reporting things like school
performance or SAT scores.

• The interquartile range formula is the first quartile subtracted from the third quartile:
• IQR = Q3 – Q1.
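
One way to compute the quartiles is the "median of each half" convention, sketched below on hypothetical data. This is an assumption about method: statistical packages use several quartile conventions, and they can give slightly different values for the same data.

```python
import statistics

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves convention (needs at least 2 values)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# Hypothetical data (illustrative only).
data = [1, 3, 5, 7, 9, 11, 13, 15]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3)  # 4.0 8.0 12.0
print(iqr)         # 8.0
```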

49
Quartile Calculations

50
Box Plot

51
Question
Calculate the interquartile range (IQR) of the following data:
17, 18, 18, 19, 20, 21, 21, 23, 25.

A) 4
B) 5
C) 6
D) 7

52
Question
Find the interquartile range (IQR) using the following values:
• Minimum: 1
• Q1: 3
• Median: 5
• Q3: 7
• Maximum: 9

A) 2
B) 3
C) 4
D) 5
53
Advantage of IQR
• The main advantage of the IQR is that it is not affected by outliers
because it doesn’t consider observations below Q1 or above Q3.

• Observations can be considered outliers when they lie more than
1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile.

• Outliers: x < Q1 – 1.5 × IQR (or) x > Q3 + 1.5 × IQR
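
The 1.5 × IQR rule can be sketched as a small filter. The quartile values and the data are hypothetical and assumed to have been computed beforehand; the function name `iqr_outliers` is an illustrative choice.

```python
def iqr_outliers(values, q1, q3):
    """Return the observations beyond 1.5 * IQR from the quartiles."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# Hypothetical quartiles: IQR = 8, so the fences are -8 and 24.
q1, q3 = 4, 12
data = [-10, 1, 5, 9, 13, 24, 30]

print(iqr_outliers(data, q1, q3))  # [-10, 30]
```

A value exactly on a fence (24 here) is not flagged, because the rule asks for observations lying more than 1.5 × IQR away.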

54
Box Plot
• Mainly used to describe the center and variability of your data.
• It is also useful for detecting outliers in the data.
• Used to visualize IQR

55
IQR is plotted in a boxplot and probability density
56
Percentile
• A percentile is a measure used in statistics to indicate the value below which a given
percentage of observations in a group of observations fall.

Percentile(x) = (Number of values that fall below ‘x’ / Total number of values) × 100


P = (n/N) × 100
Where ,

•P is percentile
•n – Number of values below ‘x’
•N – Total count of population
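
The formula P = (n/N) × 100 translates directly into code. The test scores below are hypothetical, and `percentile_rank` is an illustrative name.

```python
def percentile_rank(values, x):
    """Percentage of values that fall strictly below x: P = (n / N) * 100."""
    n_below = sum(1 for v in values if v < x)   # n: number of values below x
    return n_below * 100 / len(values)          # N: total number of values

# Hypothetical test scores (illustrative only).
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

print(percentile_rank(scores, 85))  # 60.0
```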

57
Percentile
• Arrange the data in an order

• Calculate the percentage of observations or data points
below a particular value.

What is the 80th Percentile observation?

Total Observations * 0.8

15*0.8= 12

58
Percentile calculation
• A percentile is the value below which a given percent of the values fall.

• Example:

• We have an array of the ages of all the people working in the same office: ages =
[25,31,43,48,50,41,39,60,52,32,27,46,47,55]

• What is the 75th percentile? Following the steps below, the answer is 50, meaning that at least 75% of
the people are 50 or younger. Other conventions (for example, linear interpolation between neighboring
values) can give a slightly different result.
• Steps:
• Arrange in ascending order: [25,27,31,32,39,41,43,46,47,48,50,52,55,60]
• Find the rank = Percentile/100 × (number of values)
• Rank = 0.75 × 14 = 10.5
• Round up to 11, then subtract 1 for zero-based indexing, so the index is 10, whose value is 50
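
The steps can be sketched as a nearest-rank percentile function. Rounding the rank up is an assumption about the intended rule: with it, 0.75 × 14 = 10.5 becomes rank 11, and the 11th value of the sorted ages (zero-based index 10) is 50. Interpolating methods, which many libraries use by default, return a slightly different number.

```python
import math

def nearest_rank_percentile(values, percent):
    """Nearest-rank percentile: sort, compute the rank, round up, pick that value."""
    ordered = sorted(values)
    rank = math.ceil(percent / 100 * len(ordered))  # 1-based rank, rounded up
    rank = max(rank, 1)                             # guard for very small percents
    return ordered[rank - 1]                        # subtract 1 for zero-based index

ages = [25, 31, 43, 48, 50, 41, 39, 60, 52, 32, 27, 46, 47, 55]

print(nearest_rank_percentile(ages, 75))  # 50
```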

59
What is Skewness of Data?
• Skewness is a measure of the asymmetry of the probability distribution of a
random variable, that is, how much it deviates from a symmetric shape such as the normal distribution.
• Skewness is positive if the tail of the distribution extends more to the
right, and negative if the tail extends more to the left.

60
Mean, Median and Mode

Left-skewed distribution: the value of skewness is less than zero. mean < median < mode
Normal distribution: the value of skewness is zero. mode = median = mean
Right-skewed distribution: the value of skewness is greater than zero. mode < median < mean
61
Correlation
❑ Used to find the relationship between two variables, which is important in real life
because it lets us predict the value of one variable from the other.

Examples from our daily life:

❑ The more time you spend running on a treadmill, the more calories you will burn.
❑ The more money you save, the more financially secure you feel.
❑ The more cigarettes you smoke, the higher your stress level.

Use a scatter plot to visualize correlation.


62
Correlation Coefficient
❑ The correlation coefficient (r), also called Pearson's r, measures the strength of
the linear relationship between two variables.
Possible correlations range from +1 to –1.

Direction of correlation:
➢ A correlation of –1 indicates a perfect negative correlation, meaning that as one
variable goes up, the other goes down.
➢ A correlation of +1 indicates a perfect positive correlation, meaning that as one
variable goes up, the other goes up as well.
➢ A correlation of zero indicates that there is no linear relationship between the variables.

❑ Describes linear relations only


63
Correlation Coefficient

➢ x: Values of the x-variable in a sample
➢ x̄: Mean of the values of the x-variable
➢ y: Values of the y-variable in a sample
➢ ȳ: Mean of the values of the y-variable
➢ N: Number of records
➢ σ: Standard deviation
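
The slide's formula image is not reproduced here, so the sketch below uses the standard definition of Pearson's r, written with the symbols listed above: the sum of the products of the deviations from x̄ and ȳ, divided by the square root of the product of the two sums of squared deviations. The treadmill data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

# Hypothetical data: minutes on a treadmill and calories burned.
minutes = [10, 20, 30, 40, 50]
calories = [100, 200, 300, 400, 500]

print(pearson_r(minutes, calories))  # close to 1.0 (perfect positive correlation)
```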
64
Correlation Coefficient

Scatter plot for visualizing correlation: the horizontal axis represents one variable, and the vertical axis represents the other.
Regression line: the straight line that best fits the points. The more tightly the points cluster around the regression line, the stronger the correlation.
The closer the correlation is to 0, the weaker it is, while the closer it is to +/-1, the stronger it is.

65
Correlation Coefficient

66
Example: Housing Data

67
Task
Calculate the mean, median, mode, variance, and standard deviation of these students' grades in Math and Science.

Student  Math  Science
A        85    78
B        90    88
C        78    84
D        92    91
E        88    76

68


Task Solution
Calculate the mean, median, mode, variance, and standard deviation of these students' grades in Math and Science.

Math
Mean = (85+90+78+92+88)/5 = 86.6
Median = 88 (middle value of the sorted grades [78, 85, 88, 90, 92])
Mode = No mode (all grades are unique)
Variance = 23.84 Std = 4.88
Because
85 - 86.6 = -1.6 (-1.6)^2 = 2.56
90 - 86.6 = 3.4 (3.4)^2 = 11.56
78 - 86.6 = -8.6 (-8.6)^2 = 73.96
92 - 86.6 = 5.4 (5.4)^2 = 29.16
88 - 86.6 = 1.4 (1.4)^2 = 1.96
Variance = (2.56 + 11.56 + 73.96 + 29.16 + 1.96)/5 = 23.84
Std = √23.84 = 4.88
69
Task Solution
Calculate the mean, median, mode, variance, and standard deviation of these students' grades in Math and Science.

Science
Mean = (78+88+84+91+76)/5 = 83.4
Median = 84 (middle value of the sorted grades [76, 78, 84, 88, 91])
Mode = No mode (all grades are unique)
Variance = 32.64 Std = 5.71
Because
78 - 83.4 = -5.4 (-5.4)^2 = 29.16
88 - 83.4 = 4.6 (4.6)^2 = 21.16
84 - 83.4 = 0.6 (0.6)^2 = 0.36
91 - 83.4 = 7.6 (7.6)^2 = 57.76
76 - 83.4 = -7.4 (-7.4)^2 = 54.76
Variance = (29.16 + 21.16 + 0.36 + 57.76 + 54.76)/5 = 32.64
Std = √32.64 = 5.71
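
The hand calculations for both subjects can be cross-checked with Python's standard `statistics` module. This sketch uses the population versions (`pvariance`, `pstdev`), since the solution divides by 5 rather than 4.

```python
import statistics

math_grades = [85, 90, 78, 92, 88]
science_grades = [78, 88, 84, 91, 76]

def summarize(grades):
    """Mean, median, population variance and population std, rounded to 2 places."""
    return {
        "mean": round(statistics.mean(grades), 2),
        "median": statistics.median(grades),
        "variance": round(statistics.pvariance(grades), 2),
        "std": round(statistics.pstdev(grades), 2),
    }

print("Math:", summarize(math_grades))
print("Science:", summarize(science_grades))
```

Every grade occurs exactly once in each subject, so neither list has a mode.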
70
Inferential Statistics
Data Distribution

71
Data Distribution
• A distribution is a function that shows the possible values for a variable and how
often they occur
• In probability theory and statistics, a probability distribution is a mathematical
function that provides the probabilities of the occurrence of different possible
outcomes in an experiment
• It is a common mistake to believe that the distribution is the graph. In fact, the
distribution is the ‘rule’ that determines how values are positioned in relation to
each other. Very often, we use a graph to visualize the data. Since different
distributions have a particular graphical representation, statisticians like to plot
them.
72
Probability Distribution
Common distributions include:
• Normal Distribution: Symmetrical, bell-shaped distribution characterized by its mean
and standard deviation.
• Binomial Distribution: Represents the number of successes in a fixed number of trials
with a constant probability of success.
• Poisson Distribution: Models the number of events occurring in a fixed interval of
time or space.

73
Normal Distribution
❑ Normal distribution, also known as the Gaussian distribution, is a probability distribution
that is symmetric about the mean, showing that data near the mean are more frequent in
occurrence than data far from the mean (Bell shape).

❑ The Importance of Normal Distribution


1. It approximates a wide variety of random variables
2. Distributions of sample means with large enough sample sizes could be approximated as normal
3. All computable statistics are well-structured
4. It is widely used in regression analysis
5. Good track record

❑ Applications:
• Biology: Most biological measures are normally distributed, such as: height, arm and leg length,
nails, blood pressure, thickness of tree barks, etc.
• IQ tests, Stock Market Information
74
Normal Distribution

75
How is Data Distributed in Normal Distribution?

76
Hypothesis Testing
➢ A statistical method to make inferences or draw conclusions
about a population based on sample data.

➢ Elements of a hypothesis test:

❑ Population: The entire group being studied.
❑ Sample: A subset of the population used to make inferences.
❑ Null Hypothesis (H0): The default assumption (no effect or no difference).
❑ Alternative Hypothesis (H1): The assumption that there is an effect or difference.
❑ Significance Level (α): The threshold for rejecting H0 (commonly 0.05).
77
Hypothesis Testing
True State    Test Result: H0 True    Test Result: H0 False
H0 True       Correct Decision        Type I Error
H0 False      Type II Error           Correct Decision

α = P(Type I Error)    β = P(Type II Error)


78
Steps in Hypothesis Testing
1. State the Hypotheses:
➢ Null Hypothesis (H0): This is the hypothesis that there is no effect or no difference. It represents the
status quo or a statement of no change.
➢ Alternative Hypothesis (H1): This hypothesis represents what you aim to prove, indicating the
presence of an effect or a difference.
2. Choose the Significance Level (α):
➢ The significance level is the probability of rejecting the null hypothesis when it is true. Common
choices are 0.05, 0.01, or 0.10.
3. Select the Appropriate Test:
➢ Choose a statistical test based on the data type and the hypotheses. Common tests include:
• t-test (for comparing means)
• Chi-square test (for categorical data)
• ANOVA (for comparing means across multiple groups)
• Z-test (for large sample sizes)
4. Collect Data:
➢ Gather the sample data that will be used for the hypothesis test. Ensure the data is collected in a way
that is unbiased and representative of the population.
79
Steps in Hypothesis Testing
5. Calculate the Test Statistic:
➢ Using the chosen statistical test, compute the test statistic (e.g., t-value, z-value) based on the sample
data.
6. Determine the p-value:
➢ The p-value is the probability of observing results at least as extreme as those in the sample, assuming
the null hypothesis is true. It helps to determine the strength of the evidence against the null hypothesis.
7. Make a Decision:
➢ Compare the p-value to the significance level (α):
• If p ≤ α: Reject the null hypothesis (H0).
• If p > α: Fail to reject the null hypothesis (H0).
8. Draw Conclusions:
➢ Interpret the results in the context of the research question. Discuss whether the evidence supports the
alternative hypothesis and what it implies for the population.
9. Report the Results:
➢ Clearly present the findings, including the hypotheses, test statistic, p-value, and any relevant
confidence intervals. Provide context and implications of the results.
80
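⚫ Testing a hypothesis for the mean μ:
⚫ The test statistic depends on the sample size (n):

n ≥ 30 (population is normal or not normal):
σ is known: $$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
σ is not known: $$Z = \frac{\bar{x} - \mu_0}{S / \sqrt{n}}$$

n < 30 (population is normal):
σ is known: $$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
σ is not known: $$T = \frac{\bar{x} - \mu_0}{S / \sqrt{n}}$$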

81
82
83
Find the Critical t-value

84
Example 1 - Efficacy Test for New drug

● Drug company has new drug, wishes to compare it with
current standard treatment
● Federal regulators tell company that they must demonstrate
that new drug is better than current treatment to receive
approval
● Firm runs clinical trial where some patients receive new drug,
and others receive standard treatment
● Numeric response of therapeutic effect is obtained (higher
scores are better).
● Parameter of interest: μ_New − μ_Std

85
Example 1 - Efficacy Test for New drug

● Null hypothesis - New drug is no better than standard treatment

H0: μ_New − μ_Std ≤ 0 (μ_New − μ_Std = 0)

● Alternative hypothesis - New drug is better than standard treatment

HA: μ_New − μ_Std > 0

● Experimental (Sample) data:

Sample means: ȳ_New, ȳ_Std
Sample standard deviations: s_New, s_Std
Sample sizes: n_New, n_Std
86
Example 1 - Efficacy Test for New drug

● Type I error - Concluding that the new drug is better than the standard (HA)
when in fact it is no better (H0). Ineffective drug is deemed better.
○ Traditionally α = P(Type I error) = 0.05

● Type II error - Failing to conclude that the new drug is better (HA) when in fact
it is. Effective drug is deemed to be no better.
○ Traditionally a clinically important difference (Δ) is assigned and sample
sizes chosen so that:
β = P(Type II error | μ1 − μ2 = Δ) ≤ 0.20

87
Example 2 - Mean Age of a Certain Population
● Researchers are interested in the mean age of a certain population.
● A random sample of 10 individuals drawn from the population of
interest has a mean of 27.
● Assuming that the population is approximately normally distributed
with variance 20,can we conclude that the mean is different from
30 years? (α=0.05) .
● If the p - value is 0.0340 how can we use it in making a decision?

88
Solution
1. Data: variable is age, n = 10, x̄ = 27, σ² = 20, α = 0.05
2. Assumptions:
The population is approximately normally distributed with variance 20
3. Hypotheses:
➢ H0: μ = 30
➢ HA: μ ≠ 30
4. Test Statistic: Z = (x̄ − μ0) / (σ/√n) = (27 − 30) / √(20/10) = −2.12
5. Decision Rule
➢ The alternative hypothesis is HA: μ ≠ 30
➢ Hence we reject H0 if Z > Z(1−0.025) = Z(0.975) or Z < −Z(1−0.025) = −Z(0.975)
➢ Z(0.975) = 1.96 (from table D)
6. Decision:
➢ We reject H0, since −2.12 is in the rejection region.
➢ We can conclude that μ is not equal to 30.
➢ Using the p-value, we note that p-value = 0.0340 < 0.05, therefore we reject H0.
89
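
The numbers in this solution can be reproduced with a short Python sketch. It assumes a two-sided z-test with known variance, as stated in the problem, and uses `statistics.NormalDist` from the standard library in place of the printed table.

```python
import math
from statistics import NormalDist

n = 10             # sample size
sample_mean = 27   # x̄
variance = 20      # known population variance σ²
mu_0 = 30          # mean under H0
alpha = 0.05

# Test statistic: Z = (x̄ - μ0) / (σ / √n)
z = (sample_mean - mu_0) / math.sqrt(variance / n)

# Two-sided p-value: probability of a result at least this extreme under H0.
p_value = 2 * NormalDist().cdf(-abs(z))

# Critical value Z(1 - α/2) of the standard normal distribution.
critical = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z) > critical
print(round(z, 2), round(p_value, 3), round(critical, 2), reject_h0)
```

Both routes agree: |Z| exceeds the critical value and the p-value is below α, so H0 is rejected.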
T.DIST Function
We can use Excel’s T.DIST function to calculate the p-value by supplying the test
statistic value and the degrees of freedom.

90
Regression

91
Regression
Regression is a statistical method used to understand the relationship between variables.
The primary purpose of regression analysis is to predict a dependent variable (also known
as the outcome, response, or target) based on one or more independent variables (also
known as predictors, features, or explanatory variables).

Types of Regression:
➢ Simple Linear Regression: One independent variable, straight-line relationship.
➢ Multiple Linear Regression: Multiple independent variables predicting one dependent variable.
➢ Logistic Regression: Used for predicting binary outcomes.

92
Types of Regression

93
Linear Regression
• Linear Regression: The simplest form of regression, where the relationship between the dependent
and independent variable(s) is modeled as a straight line. It can be represented by the equation

𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜖
where:
• 𝑌 is the dependent variable.
• 𝛽0 is the intercept.
• 𝛽1 is the slope (coefficient of the independent variable 𝑥).
• 𝜖 is the error term.

Predicting House Prices:


Scenario: Predicting the price of a house based on a single feature, such as square footage.
Example: Using historical data on house prices and their square footage to build a model that
estimates the price of a new house based on its size.
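
A minimal least-squares sketch of the house price scenario follows. The square footage and price figures are hypothetical, and the error term ε is ignored because the made-up points lie exactly on a line. The slope is the sum of the products of deviations divided by the sum of squared deviations of x, and the intercept follows from the two means.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical historical data: size in square feet, price in thousands.
square_feet = [1000, 1500, 2000, 2500]
price = [200, 275, 350, 425]

b0, b1 = fit_line(square_feet, price)

# Estimate the price of a new 1,800 square foot house.
predicted = b0 + b1 * 1800
print(b0, b1, predicted)
```

With real data the points scatter around the line, and the fitted coefficients minimize the sum of squared vertical distances to it.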

94
Multiple Linear Regression
• Multiple Linear Regression: An extension of linear regression that uses more than one independent
variable. The model can be written as

Y = β0 + β1 x1 + β2 x2 + ⋯ + βn xn + ε

where:
• Y is the dependent variable.
• β0 is the intercept.
• β1, …, βn are the coefficients of the independent variables x1, …, xn.
• ε is the error term.

Predicting Salary:
Scenario: Estimating an employee’s salary based on multiple factors, such as years of
experience, education level, and job role.
Example: Using a dataset that includes years of experience, education level (Bachelor’s,
Master’s, PhD), and job roles (Developer, Manager, Analyst) to predict salaries.
95
Logistic Regression
Used for binary classification problems, where the dependent variable is categorical (e.g.,
success/failure, yes/no). It uses the logistic function to model the probability of the default class.

Spam Detection:
Scenario: Classifying emails as spam or not spam.
Example: Using logistic regression to build a spam filter that identifies emails as
spam based on features like the presence of certain keywords, sender information, and
email structure.
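
To make the logistic function concrete, here is a toy sketch of the spam scenario with a single feature. The coefficients `beta_0` and `beta_1` are hypothetical values picked for illustration; in practice they would be estimated from labeled emails.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number to a value in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def spam_probability(keyword_count, beta_0=-2.0, beta_1=1.0):
    """Probability that an email is spam, given a count of suspicious keywords.

    beta_0 and beta_1 are made-up coefficients, not fitted values.
    """
    return logistic(beta_0 + beta_1 * keyword_count)

for count in (0, 2, 5):
    p = spam_probability(count)
    label = "spam" if p >= 0.5 else "not spam"
    print(count, round(p, 3), label)
```

The model outputs a probability, and a threshold (0.5 here) turns it into the binary spam or not-spam decision.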
96
Questions?

97
