2 - Introduction To Statistics

Uploaded by

Hafsa Zahran


Introduction to Statistics

1
Agenda

❑ What is Statistics?
❑ Data Types & Measurement Level
❑ Types of Statistics
➢ Descriptive Vs. Inferential
❑ Descriptive Statistics
➢ Types of Descriptive Statistics
➢ Measures of Central Tendency
➢ Measure of Dispersion
❑ Population VS. Sampling
❑ Inferential Statistics
➢ Data Distribution
➢ Hypothesis Testing
❑ Regression
➢ Types of Regression

2
What is Statistics?

3
What is Statistics?
The science of collecting, organizing, presenting, analyzing, and interpreting
data to assist in making more effective decisions.

4
Data Types & Measurement Level

5
Data Types

6
Levels of Measurement

7
Quantitative Variables
Quantitative variables are characteristics that can be expressed
in numbers. For example, weight, height, and length can all be written
numerically.

8
Quantitative Variables
• Continuous Variables:
➢ Contain measurements with decimal precision.
➢ Examples: Height of individuals, weight of a bag of apples, time
taken to run a marathon.
➢ Characteristics: Measurable, infinitely many possible values, can
include fractions/decimals.
• Discrete Variables:
➢ Contain counts, which must be whole numbers.
➢ Examples: The number of members in a person’s family, or the
number of points a basketball team scored in a game.
➢ Characteristics: Countable, often integers, clear gaps between
values.
9
Qualitative Variables
Qualitative variables are characteristics of an individual
or object which can only be expressed in words. Some
examples include ethnicity, profession, or gender.

Ordinal Variables:
➢ Variables that are groups containing an inherent
ranking.
➢ Examples: Education level (high school, bachelor's,
master's), customer satisfaction (dissatisfied, neutral,
satisfied).
Nominal Variables:
➢ Variables made up of categories without an inherent
order.
➢ Examples: Gender (male, female), eye color (blue,
green, brown).

10
Understanding the Variables Using Dataset

11
Understanding the Variables Using Dataset

12
Types of Statistics
Descriptive Vs. Inferential

13
Types of Statistics
• Descriptive Statistics
The methods of organizing, summarizing, and presenting data in an informative way.
Organizing and summarizing data with frequency tables and frequency distributions.
Presenting frequency tables and distributions with charts and graphs.
Measures to summarize the characteristics of data.

• Inferential Statistics
The methods used to estimate population parameters on the basis of a sample
statistic. To make inferences (statements) about a population based on a sample,
the following concepts are used:
Probabilities and Probability Distribution
Sampling
Estimation
Hypothesis Testing
Correlation and Regression

14
Descriptive Statistics
Types of Descriptive Statistics

15
16
Types of Descriptive Statistics
• Measures of Frequency:
• Describe how often certain values or ranges of values occur within a dataset.
• They are used to understand the distribution and occurrence patterns of the data.

• Measures of Central Tendency:

• These are ways of describing the central position of a frequency distribution for a
group of data, such as mean, median, and mode.

• Measures of Dispersion or Variation:

• These are ways of summarizing a group of data by describing how spread out the
scores are.
• Examples include range, interquartile range, variance, percentile, quartile,
and standard deviation.

17
Frequency Table
➢ A frequency table lists a set of values and how often each one appears.
➢ Frequency is the number of times a specific data value occurs in your dataset.
➢ These tables help you understand which data values are common and which are rare.
➢ They organize your data and are an effective way to present the results to others.
➢ Frequency tables are also known as frequency distributions because they allow you to
understand the distribution of values in your dataset.
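
A frequency table like the one described above can be sketched in a few lines of Python. The survey responses below are hypothetical values chosen only for illustration; they do not come from the slides.

```python
from collections import Counter

# Hypothetical survey responses (illustrative only, not from the slides).
responses = ["satisfied", "neutral", "satisfied", "dissatisfied",
             "satisfied", "neutral"]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(responses)
total = len(responses)

# One row per value: absolute frequency and relative frequency.
for value, count in frequency.most_common():
    print(f"{value:<14}{count:>3}{count / total:>8.1%}")
```

Sorting by count (`most_common`) puts the common values first, which makes the rare ones easy to spot at the bottom.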

18
Descriptive Statistics
Measures of Central Tendency

19
Central Tendency of Data
❑ A measure of central tendency is a
single value that attempts to describe
a set of data by identifying the central
position within that set of data.

❑ Measures of Central Tendency:

20
Mean
● Mean is the sum of all the values in the dataset divided by the number of
values in the dataset. It is also called the Arithmetic Average. Mean is
denoted as x̅ and is read as x bar.

\[ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} \]

21
Mean
The mean is the most widely used measure of central tendency. It is the simple average
of the dataset. Note: it is easily affected by outliers.

•Importance: The mean provides a central value for a dataset, useful for comparing
different data sets and understanding the general trend.

•Application: Used in almost all fields such as finance, economics, healthcare, and social
sciences to summarize data.

Example Data 1 [ -10, 0, 10, 20, 30 ] For data 1: (-10 + 0 + 10 + 20 + 30 ) / 5 = 10

Example Data 2 [ 8, 9, 10, 11, 12 ] For data 2: (8 + 9 + 10 + 11 + 12 ) / 5 = 10
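
The two examples above can be reproduced with a short Python sketch. The helper name `mean` is just an illustrative choice.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

data_1 = [-10, 0, 10, 20, 30]
data_2 = [8, 9, 10, 11, 12]

print(mean(data_1))  # 10.0
print(mean(data_2))  # 10.0
```

Both datasets have the same mean even though data 1 is far more spread out, which is why the mean is usually reported together with a measure of dispersion.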

22
Median
● Median is the middle value for sorted data. The sorting of the data can be done either in
ascending order or descending order. A median divides the data into two equal halves.

• In an ordered dataset of n values, the median is the number at position (n + 1)/2. If this
position is not a whole number, the median is the simple average of the two numbers at the
positions closest to the calculated value.
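
The position rule above can be sketched in Python as follows. The function name and the sample values are illustrative assumptions, not part of the slides.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ordered data
    if position.is_integer():
        return ordered[int(position) - 1]
    # Position falls between two values: average its two neighbours.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median([7, 3, 9]))     # 7   (position 2 of [3, 7, 9])
print(median([7, 3, 9, 5]))  # 6.0 (position 2.5: average of 5 and 7)
```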

23
Median
The median is the midpoint of the ordered dataset. While it is not as popular as the mean, it is often
used in academia and data science because it is far less affected by outliers.
•Importance: The median gives the middle value of a dataset, making it useful for understanding
the distribution of data, especially when the data is skewed.

•Application: Often used in income data, housing prices, and other scenarios where outliers may
skew the mean.

Example :

24
Mode
● Mode is the most frequent value or item in the dataset.

● A dataset can generally have one or more than one mode value.

25
Mode
The mode is the value that occurs most often. A dataset can have 0 modes, 1 mode or multiple modes.
The mode is calculated simply by finding the value with the highest frequency.

•Importance: The mode indicates the most frequently occurring value in a dataset, useful for
understanding the most common occurrences.

•Application: Common in market research, quality control, and inventory management.

Example Data 1 [ 10, 13, 15, 16, 13, 15, 11, 13]
Example Data 2 [ 8, 9, 13, 11, 12, 16, 8, 10 ]

For data 1: 13 is the most frequent

For data 2: 8 is the most frequent
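
The two examples can be checked with Python's standard library. `statistics.multimode` returns every value that shares the highest frequency, so it also covers datasets with more than one mode; the last dataset is a hypothetical one added to show that case.

```python
from statistics import multimode

data_1 = [10, 13, 15, 16, 13, 15, 11, 13]
data_2 = [8, 9, 13, 11, 12, 16, 8, 10]

print(multimode(data_1))        # [13]
print(multimode(data_2))        # [8]
# Hypothetical dataset with two modes (illustrative only).
print(multimode([1, 1, 2, 2]))  # [1, 2]
```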
26
Differences between Mean, Median and Mode

27
Effect of Outliers in Descriptive Statistics
• An outlier is any unusually large or small observation.
• Outliers can have a disproportionate effect on some statistics, such as the mean, while the
median and mode are largely unaffected. Relying on the mean alone can therefore give a misleading impression.

Mean: 6,915 $ → 7,676 $

Median: 7,200 $ → 7,200 $

28
Effect of Outliers in Descriptive Statistics

29
Descriptive Statistics
Measure of Dispersion

30
Difference between Central Tendency and Dispersion
• Central tendency tells you where most of your data points lie, while dispersion summarizes
how far apart your points are from each other.
• Datasets can have the same central tendency but different levels of dispersion, or vice
versa. Together, they give you a complete picture of your data.

Central Tendency Spread in data

31
• This example shows that central tendency alone is not enough to describe data:
the three datasets have the same mean and median but different values, so we
also need to measure dispersion.

Mean: 20 and Median: 20 for each of the three datasets.

32
Measure of Dispersion

• Variance

• Standard deviation

• Percentile

• Range

• Interquartile range

33
Variance
• Variance is a measure of how far a set of data are dispersed from
their mean or average value. It is denoted as ‘σ²’.

Properties of Variance:

➢ It is always non-negative since each term in the variance sum is squared,

and therefore, the result is either positive or zero.

➢ Variance always has squared units. For example, the variance of a set
of weights estimated in kilograms will be given in kg².

➢ Since the population variance is squared, we cannot compare it directly

with the mean or the data themselves.
34
Variance (σ²)
Variance measures the dispersion of a set of data points around its mean value.

➢ Importance: Variance measures the spread of data points around the mean, useful for
understanding data variability.

➢ Application: Used in risk management, investment analysis, and quality control.

In Figure 1, the points have a high variance because they are spread out.
In Figure 2, the points have a low variance because they are close together.

35
How to Calculate Variance Step by Step
• Calculate the Mean (x̄): Find the average of all data points.

• Subtract the Mean from Each Observation (X - x̄): For each data point, subtract
the mean from the data point.

• Square Each of the Resulting Observations ((X - x̄)²): Square the result of
each subtraction.

• Add These Squared Results Together: Sum all the squared values.

• Divide This Total by the Number of Observations (n) (in the case of a
population) to Get Variance (σ²): Divide the sum of squared values by the
number of observations to obtain the variance.
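
The five steps above map directly onto code. This is a minimal sketch for the population case; the data values are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
def population_variance(values):
    n = len(values)
    mean = sum(values) / n                   # step 1: calculate the mean
    deviations = [x - mean for x in values]  # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]   # step 3: square each result
    total = sum(squared)                     # step 4: add the squares
    return total / n                         # step 5: divide by n

# Hypothetical data (illustrative only): mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
```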

36
Variance and Standard Deviation

• Average = 4/7 = 0.57

• σ2 = Variance = 0.57

37
Standard Deviation
● Standard deviation measures the deviation of data from its mean or average position. The degree of
dispersion is computed by estimating the deviation of data points. It is denoted by the symbol ‘σ’.

● Properties of Standard Deviation:

● It is the square root of the mean of the squared deviations of the values from their mean, and is
also called the root-mean-square deviation.

● The smallest value of the standard deviation is 0 since it cannot be negative.

● When the data values of a group are similar, the standard deviation will be very low or close to zero.
However, when the data values vary significantly from each other, the standard deviation will be
high or farther from zero.

38
Standard deviation std (σ)
The std is the most common way to measure the spread of the data and how close data points are to each other.

• Importance: Standard deviation is the square root of variance, providing a measure of data
dispersion that is in the same unit as the data.
\[ \sigma = \sqrt{\sigma^2} \]
• Application: Commonly used in finance for assessing investment risk, and in process control
for monitoring production quality.

39
Variance and standard deviation

40
Population VS. Sampling

41
Population Vs. Sampling
• The primary task of inferential statistics is
making an inference about something by using
only an incomplete sample of data.

• Population: is a collection of all possible

individuals, objects, or measurements of interest
(To understand the whole collection of data).

• Sample: is a portion or part of the population of

interest (To make inferences about the whole
based on the subset).

42
Sampling
Definition: The process of selecting a subset of individuals from a population to estimate
characteristics of the whole population.
Types of Sampling:
➢ Random Sampling: Every member has an equal chance of being selected.
➢ Stratified Sampling: Population divided into subgroups, and samples are drawn from each.
➢ Systematic Sampling: Selecting every nth member from a list.
➢ Convenience Sampling: Selecting individuals who are easiest to reach, which may introduce bias.
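
The four sampling types can be sketched with Python's `random` module. The population of 100 records, the two groups "A" and "B", and the sample sizes are all hypothetical choices made for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 members, 60 in group "A" and 40 in group "B".
population = [{"id": i, "group": "A" if i < 60 else "B"} for i in range(100)]

# Random sampling: every member has an equal chance of being selected.
simple = random.sample(population, k=10)

# Stratified sampling: draw separately from each subgroup (10% of each).
stratified = []
for group in ("A", "B"):
    members = [p for p in population if p["group"] == group]
    stratified.extend(random.sample(members, k=len(members) // 10))

# Systematic sampling: every 10th member, starting at a random offset.
step = 10
start = random.randrange(step)
systematic = population[start::step]

# Convenience sampling: whoever is easiest to reach (here, the first 10 listed).
convenience = population[:10]
```

In this sketch the convenience sample contains only group "A" members, which illustrates how that method can introduce bias.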

43
When Should Samples be used?
• When studying a large population where it is impractical or
impossible to collect data from every individual.

• When resources such as time, cost, and manpower are limited,

making it more feasible to collect data from a subset of the
population.

• When conducting research or experiments where it is important

to minimize potential biases in data collection.

44
Sample variance:
\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Population variance:
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]
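
The only difference between the two formulas is the divisor: n − 1 for a sample, N for a population. A minimal sketch, using hypothetical data and illustrative function names:

```python
import math

def sum_of_squared_deviations(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def population_variance(values):
    # Divide by N when the data are the whole population.
    return sum_of_squared_deviations(values) / len(values)

def sample_variance(values):
    # Divide by n - 1 when the data are a sample from a larger population.
    return sum_of_squared_deviations(values) / (len(values) - 1)

# Hypothetical data (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0
print(round(sample_variance(data), 3))       # 4.571
```

The sample variance is always at least as large as the population formula applied to the same numbers, and the gap shrinks as n grows.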
45
Range
• The range is the difference between the highest value and the
lowest value of the data. It helps in knowing the spread of the data.

• Application: Widely used in everyday scenarios such as evaluating

temperature variations over a week, comparing prices of products, and
understanding the performance spread of students in a test.

46
Question
Calculate the range of the given set of data:
7, 47, 8, 42, 47, 95, 42, 96, 2.

A) 90
B) 94
C) 96
D) 100
47
Range and Interquartile Range
• The interquartile range (IQR) is a better measure of dispersion than the range because it
leaves out the extreme values.

• Quartiles divide the ordered distribution into four equal parts:
•The 1st quartile (Q1) is the value below which the first 25% of the data lie.
•The 2nd quartile (Q2) is the middle value, with 50% of the data below it.
•The 3rd quartile (Q3) is the value below which 75% of the data lie.

• The 2nd quartile (Q2) divides the distribution into two equal parts of 50%,
so it is the same as the Median.

• The interquartile range is the distance between the third and the first
quartile, or, in other words, IQR = Q3 - Q1.

48
Range and Interquartile Range
• The interquartile range is a measure of where the “middle fifty” is in a data set.

• Where a range is a measure of where the beginning and end are in a set, the interquartile range is a
measure of where the bulk of the values lie.
• That’s why it’s preferred over many other measures of spread when reporting things like school
performance or SAT scores.

• The interquartile range formula is the first quartile subtracted from the third quartile:
• IQR = Q3 – Q1.

49
Quartile Calculations

50
Box Plot

51
Question
Calculate the interquartile range (IQR) of the following data:
17, 18, 18, 19, 20, 21, 21, 23, 25.

A) 4
B) 5
C) 6
D) 7

52
Question
Find the interquartile range (IQR) using the following values:
• Minimum: 1
• Q1: 3
• Median: 5
• Q3: 7
• Maximum: 9

A) 2
B) 3
C) 4
D) 5
53
Advantage of IQR
• The main advantage of the IQR is that it is not affected by outliers
because it doesn’t consider observations below Q1 or above Q3.

• Observations can be considered outliers when they lie more than
1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile.

• Outliers: x < Q1 – 1.5 × IQR (or) x > Q3 + 1.5 × IQR
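
The 1.5 × IQR rule can be sketched with `statistics.quantiles`. This sketch assumes the "exclusive" quartile convention (positions at (n + 1) × p in the sorted data); other conventions give slightly different quartiles. The data are hypothetical, with one deliberately extreme value.

```python
from statistics import quantiles

def iqr_fences(values):
    """Return Q1, Q3, IQR and the lower/upper outlier fences."""
    q1, _q2, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data (illustrative only); 40 is far from the rest.
data = [17, 18, 18, 19, 20, 21, 21, 23, 25, 40]

q1, q3, iqr, lower, upper = iqr_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print(q1, q3, iqr)   # 18.0 23.5 5.5
print(lower, upper)  # 9.75 31.75
print(outliers)      # [40]
```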

54
Box Plot
• Mainly used to describe the center and variability of your data.
• It is also useful for detecting outliers in the data.
• Used to visualize the IQR

55
IQR is plotted in a boxplot and probability density
56
Percentile
• A percentile is a measure used in statistics to indicate the value below which a given
percentage of observations in a group of observations fall.

Percentile(x) = (Number of values fall under ‘x’/total number of values) × 100

P = (n/N) × 100
Where ,

•P is percentile
•n – Number of values below ‘x’
•N – Total count of population

57
Percentile
• Arrange the data in an order

• Calculate the percentage of observations or data points

below a particular value.

What is the 80th percentile observation?

Rank = total observations × 0.8

15 × 0.8 = 12, so it is the 12th observation in the ordered data.

58
Percentile calculation
• A percentile is the value below which a given percentage of the values fall.

• Example:

• We have an array of the ages of all the people working in the same office: ages =
[25,31,43,48,50,41,39,60,52,32,27,46,47,55]

• What is the 75th percentile? The answer is 50, meaning that at least 75% of the people are 50 or
younger.
• Steps:
• Arrange in ascending order: [25,27,31,32,39,41,43,46,47,48,50,52,55,60]
• Find the rank = percentile/100 × (number of values)
• Rank = 0.75 × 14 = 10.5
• Round up to 11, then subtract 1 for zero-based indexing, so index = 10, which holds the value 50
(48 sits at index 9, one position short of the rank)
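
The rank steps can be sketched as a nearest-rank percentile function. This assumes the convention of rounding the rank up: for the office ages, the rank 10.5 rounds up to 11 and the 11th sorted value is 50. Interpolating conventions, which many libraries use by default, give slightly different results. The function names are illustrative.

```python
import math

def nearest_rank_percentile(values, percent):
    """Value at the rounded-up rank percent/100 * n in the sorted data."""
    ordered = sorted(values)
    rank = max(math.ceil(percent * len(ordered) / 100), 1)  # 1-based rank
    return ordered[rank - 1]                                # zero-based index

def percentile_rank(values, x):
    """P = (n / N) * 100, where n is the number of values below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

ages = [25, 31, 43, 48, 50, 41, 39, 60, 52, 32, 27, 46, 47, 55]

print(nearest_rank_percentile(ages, 75))  # 50
print(round(percentile_rank(ages, 48), 1))  # 64.3
```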

59
What is Skewness of Data?
• Skewness is a measure of the asymmetry of the probability distribution of a
random variable, i.e. how far it departs from a symmetric shape such as the normal distribution.
• Skewness is positive if the tail of the distribution extends more to the
right, and negative if the tail extends more to the left.

60
Mean, Median and Mode

Left-skewed distribution: the value of skewness is less than zero; mean < median < mode.
Normal distribution: the value of skewness is zero; mode = median = mean.
Right-skewed distribution: the value of skewness is greater than zero; mode < median < mean.
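
One common way to put a number on skewness is the moment coefficient, the third central moment divided by the 1.5 power of the second. The sketch below assumes that definition (the slides do not say which formula they use) and uses hypothetical data.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical datasets (illustrative only).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image: tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
print(skewness(symmetric))         # 0.0
```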
61
Correlation
❑ Used to find the relationship between two variables, which is important in real life
because it lets us predict the value of one variable from the other.

Example from our daily life:

❑ The more time you spend running on a treadmill, the more calories you will burn.
❑ The more money you save, the more financially secure you feel.
❑ The more cigarettes you smoke, the higher your stress level.

Use a scatter plot to visualize correlation.

62
Correlation Coefficient
❑ The correlation coefficient (r), also called Pearson's r, measures the strength and direction of
the linear relationship between two variables.
Possible correlations range from +1 to –1.

Direction of correlation:
➢ A correlation of –1 indicates a perfect negative correlation, meaning that as one
variable goes up, the other goes down.
➢ A correlation of +1 indicates a perfect positive correlation, meaning that as one
variable goes up, the other goes up as well.
➢ A correlation of zero indicates that there is no linear relationship between the variables.

❑ It describes linear relationships only.

63
Correlation Coefficient

➢ x: Values of the x-variable in a sample

➢ x̄: Mean of the values of the x-variable
➢ y: Values of the y-variable in a sample
➢ ȳ: Mean of the values of the y-variable
➢ N: Number of records
➢ σ: Standard deviation
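
Using the symbols listed above, one standard form of Pearson's r is the sum of (x − x̄)(y − ȳ) divided by N·σx·σy. The sketch below assumes that form, with σ computed as the population standard deviation; the treadmill figures are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: average product of deviations over sd_x * sd_y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (sd_x * sd_y)

# Hypothetical data (illustrative only): minutes on a treadmill vs. calories burned.
minutes = [10, 20, 30, 40, 50]
calories = [95, 210, 290, 405, 500]

r = pearson_r(minutes, calories)
print(round(r, 3))  # close to +1: strong positive linear relationship
```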
64
Correlation Coefficient

A scatter plot is used to visualize correlation: the horizontal axis represents one variable, and the vertical axis represents the other.
Regression line: the straight line that best fits the points. The more tightly the points cluster around the regression line, the stronger the correlation.
The closer the correlation is to 0, the weaker it is, while the closer it is to +/-1, the stronger it is.

65
Correlation Coefficient

66
Example: Housing Data

67
Task
Calculate the mean, median, mode, variance, and standard deviation (Std) of these students' grades in Math and Science.

Student   Math   Science
A         85     78
B         90     88
C         78     84
D         92     91
E         88     76


Task Solution
Calculate the mean, median, mode, variance, and standard deviation (Std) of these students' grades in Math and Science.

Math
Mean = (85+90+78+92+88)/5 = 86.6
Median = 88 (sorted grades: [78, 85, 88, 90, 92])
Mode = No mode (all grades are unique)
Variance = 23.84 Std = 4.88
Because
85 – 86.6 = -1.6 (-1.6)^2 = 2.56
90 - 86.6 = 3.4 (3.4)^2 = 11.56
78 - 86.6 = -8.6 (-8.6)^2 = 73.96
92 - 86.6 = 5.4 (5.4)^2 = 29.16
88 - 86.6 = 1.4 (1.4)^2 = 1.96
Variance = (2.56 + 11.56 + 73.96 + 29.16 + 1.96)/5 = 23.84
Std = √23.84 = 4.88
69
Task Solution
Calculate the mean, median, mode, variance, and standard deviation (Std) of these students' grades in Math and Science.

Science
Mean = (78+88+84+91+76)/5 = 83.4
Median = 84 (sorted grades: [76, 78, 84, 88, 91])
Mode = No mode (all grades are unique)
Variance = 32.64 Std = 5.71
Because
78 – 83.4 = -5.4 (-5.4)^2 = 29.16
88 - 83.4 = 4.6 (4.6)^2 = 21.16
84 - 83.4 = 0.6 (0.6)^2 = 0.36
91 - 83.4 = 7.6 (7.6)^2 = 57.76
76 - 83.4 = -7.4 (-7.4)^2 = 54.76
Variance = (29.16 + 21.16 + 0.36 + 57.76 + 54.76)/5 = 32.64
Std = √32.64 = 5.71
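
The task results can be double-checked with Python's `statistics` module. This sketch uses the population formulas (`pvariance`, `pstdev`), matching the division by 5 in the solution, and treats a dataset in which every value occurs once as having no mode.

```python
from statistics import mean, median, multimode, pvariance, pstdev

grades = {
    "Math": [85, 90, 78, 92, 88],
    "Science": [78, 88, 84, 91, 76],
}

for subject, scores in grades.items():
    modes = multimode(scores)
    # If every score is listed as a "mode", nothing repeats: no mode.
    mode_text = modes if len(modes) < len(scores) else "no mode"
    print(subject,
          "mean:", round(mean(scores), 2),
          "median:", median(scores),
          "mode:", mode_text,
          "variance:", round(pvariance(scores), 2),
          "std:", round(pstdev(scores), 2))
```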
70
Inferential Statistics
Data Distribution

71
Data Distribution
• A distribution is a function that shows the possible values for a variable and how
often they occur
• In probability theory and statistics, a probability distribution is a mathematical
function that provides the probabilities of the occurrence of different possible
outcomes in an experiment
• It is a common mistake to believe that the distribution is the graph. In fact, the
distribution is the ‘rule’ that determines how values are positioned in relation to
each other. Very often, we use a graph to visualize the data. Since different
distributions have a particular graphical representation, statisticians like to plot
them.
72
Probability Distribution
Common distributions include:
• Normal Distribution: Symmetrical, bell-shaped distribution characterized by its mean
and standard deviation.
• Binomial Distribution: Represents the number of successes in a fixed number of trials
with a constant probability of success.
• Poisson Distribution: Models the number of events occurring in a fixed interval of
time or space.

73
Normal Distribution
❑ Normal distribution, also known as the Gaussian distribution, is a probability distribution
that is symmetric about the mean, showing that data near the mean are more frequent in
occurrence than data far from the mean (Bell shape).

❑ The Importance of Normal Distribution

1. It approximates a wide variety of random variables
2. Distributions of sample means with large enough sample sizes could be approximated as normal
3. All computable statistics are well-structured
4. It is widely used in regression analysis
5. Good track record

❑ Applications:
• Biology: Many biological measures are approximately normally distributed, such as height, arm and
leg length, nails, blood pressure, thickness of tree bark, etc.
• IQ tests, Stock Market Information
Normal Distribution

75
How is Data Distributed in Normal Distribution?

76
Hypothesis Testing
➢ A statistical method to make inferences or draw conclusions
about a population based on sample data.

➢ Elements of a hypothesis test:

❑ Population: The entire group being studied.

❑ Sample: A subset of the population used to make inferences.
❑ Null Hypothesis (𝐻0 ): The default assumption (no effect or no difference).
❑ Alternative Hypothesis (𝐻1 ): The assumption that there is an effect or difference.
❑ Significance Level (𝛼): The threshold for rejecting 𝐻0 (commonly 0.05).
77
Hypothesis Testing
                      Test Result
True State     H0 True             H0 False
H0 True        Correct Decision    Type I Error
H0 False       Type II Error       Correct Decision

α = P(Type I Error)    β = P(Type II Error)

78
Steps in Hypothesis Testing
1.State the Hypotheses:
1. Null Hypothesis (𝐻0 ): This is the hypothesis that there is no effect or no difference. It represents the
status quo or a statement of no change.
2. Alternative Hypothesis (𝐻1): This hypothesis represents what you aim to prove, indicating the
presence of an effect or a difference.
2.Choose the Significance Level (α):
➢ The significance level is the probability of rejecting the null hypothesis when it is true. Common
choices are 0.05, 0.01, or 0.10.
3.Select the Appropriate Test:
1. Choose a statistical test based on the data type and the hypotheses. Common tests include:
1. t-test (for comparing means)
2. Chi-square test (for categorical data)
3. ANOVA (for comparing means across multiple groups)
4. Z-test (for large sample sizes)
4.Collect Data:
➢ Gather the sample data that will be used for the hypothesis test. Ensure the data is collected in a way
that is unbiased and representative of the population.
79
Steps in Hypothesis Testing
5. Calculate the Test Statistic:
➢ Using the chosen statistical test, compute the test statistic (e.g., t-value, z-value) based on the sample
data.
6. Determine the p-value:
➢ The p-value is the probability of observing results at least as extreme as those in the sample,
assuming the null hypothesis is true. It helps to determine the strength of the evidence against the null hypothesis.
7. Make a Decision:
1. Compare the p-value to the significance level (α):
1. If p ≤ α: Reject the null hypothesis (H0).
2. If p > α: Fail to reject the null hypothesis (H0).
8. Draw Conclusions:
➢ Interpret the results in the context of the research question. Discuss whether the evidence supports the
alternative hypothesis and what it implies for the population.
9. Report the Results:
➢ Clearly present the findings, including the hypotheses, test statistic, p-value, and any relevant
confidence intervals. Provide context and implications of the results.
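⚫ Testing hypothesis for the mean μ:
⚫ The test statistic depends on the sample size (n) and on whether σ is known:

n ≥ 30 (population is normal or not normal):
σ is known: \[ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]
σ is not known: \[ Z = \frac{\bar{x} - \mu_0}{S / \sqrt{n}} \]

n < 30 (population is normal):
σ is known: \[ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]
σ is not known: \[ T = \frac{\bar{x} - \mu_0}{S / \sqrt{n}} \]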

81
82
83
Find the Critical t-value

84
Example 1 - Efficacy Test for New drug

● Drug company has new drug, wishes to compare it with

current standard treatment
● Federal regulators tell company that they must demonstrate
that new drug is better than current treatment to receive
approval
● Firm runs clinical trial where some patients receive new drug,
and others receive standard treatment
● Numeric response of therapeutic effect is obtained (higher
scores are better).
● Parameter of interest: μ_New − μ_Std

85
Example 1 - Efficacy Test for New drug

● Null hypothesis - New drug is no better than standard treatment

H0: μ_New − μ_Std ≤ 0 (μ_New − μ_Std = 0)

• Alternative hypothesis - New drug is better than standard treatment

HA: μ_New − μ_Std > 0
• Experimental (Sample) data:

Sample means: ȳ_New, ȳ_Std
Sample standard deviations: s_New, s_Std
Sample sizes: n_New, n_Std
86
Example 1 - Efficacy Test for New drug

● Type I error - Concluding that the new drug is better than the standard (HA)
when in fact it is no better (H0). Ineffective drug is deemed better.
○ Traditionally α = P(Type I error) = 0.05

● Type II error - Failing to conclude that the new drug is better (HA) when in fact
it is. Effective drug is deemed to be no better.
○ Traditionally a clinically important difference (D) is assigned and sample
sizes chosen so that:
β = P(Type II error | μ1 − μ2 = D) ≤ 0.20

87
Example 2 - Mean Age of a Certain Population
● Researchers are interested in the mean age of a certain population.
● A random sample of 10 individuals drawn from the population of
interest has a mean of 27.
● Assuming that the population is approximately normally distributed
with variance 20,can we conclude that the mean is different from
30 years? (α=0.05) .
● If the p - value is 0.0340 how can we use it in making a decision?

88
Solution
1. Data: variable is age, n = 10, x̄ = 27, σ² = 20, α = 0.05
2. Assumptions:
The population is approximately normally distributed with variance 20
3. Hypotheses:
➢ H0: μ = 30
➢ HA: μ ≠ 30
4. Test Statistic: Z = (x̄ − μ0)/(σ/√n) = (27 − 30)/√(20/10) = −2.12
5. Decision Rule
➢ The alternative hypothesis is HA: μ ≠ 30
➢ Hence we reject H0 if Z > Z(1−0.025) = Z(0.975) or Z < −Z(1−0.025) = −Z(0.975)
➢ Z(0.975) = 1.96 (from table D)
6. Decision:
➢ We reject H0, since −2.12 is in the rejection region.
➢ We can conclude that μ is not equal to 30
➢ Using the p-value, we note that p-value = 0.0340 < 0.05, therefore we reject H0
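
The solution can be reproduced with a short two-sided z-test sketch using `statistics.NormalDist` from the standard library. The variable names are illustrative; the numbers are those given in the example.

```python
from math import sqrt
from statistics import NormalDist

n = 10                    # sample size
sample_mean = 27          # observed sample mean
population_variance = 20  # known population variance
mu_0 = 30                 # mean under the null hypothesis
alpha = 0.05              # significance level

standard_error = sqrt(population_variance / n)
z = (sample_mean - mu_0) / standard_error

# Two-sided test: critical value Z(0.975) and p-value from both tails.
critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * NormalDist().cdf(-abs(z))
reject_h0 = abs(z) > critical

print(f"z = {z:.2f}, critical = {critical:.2f}, "
      f"p = {p_value:.4f}, reject H0: {reject_h0}")
```

The computed p-value is about 0.034; it differs from the tabulated 0.0340 only in the fourth decimal place because the table lookup uses z rounded to −2.12.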
T.DIST Function
We can use Excel’s T.DIST function to calculate the p-value by supplying the test
statistic value and the degrees of freedom

90
Regression

91
Regression
Regression is a statistical method used to understand the relationship between variables.
The primary purpose of regression analysis is to predict a dependent variable (also known
as the outcome, response, or target) based on one or more independent variables (also
known as predictors, features, or explanatory variables).

Types of Regression:
➢ Simple Linear Regression: One independent variable, straight-line relationship.
➢ Multiple Linear Regression: Multiple independent variables predicting one dependent variable.
➢ Logistic Regression: Used for predicting binary outcomes.

92
Types of Regression

93
Linear Regression
• Linear Regression: The simplest form of regression, where the relationship between the dependent
and independent variable(s) is modeled as a straight line. It can be represented by the equation

𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜖
where:
• 𝑌 is the dependent variable.
• 𝛽0 is the intercept.
• 𝛽1 is the slope (coefficient of the independent variable 𝑥).
• 𝜖 is the error term.

Predicting House Prices:

Scenario: Predicting the price of a house based on a single feature, such as square footage.
Example: Using historical data on house prices and their square footage to build a model that
estimates the price of a new house based on its size.
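
A simple linear regression of this kind can be fitted with ordinary least squares, which is one standard way to estimate β0 and β1 (the slides do not name a fitting method). The square footage and price figures below are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept (b0) and slope (b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical historical data (illustrative only).
square_feet = [800, 1000, 1200, 1500, 2000]
prices = [160_000, 195_000, 238_000, 290_000, 385_000]

b0, b1 = fit_line(square_feet, prices)

# Estimate the price of a new 1,300 sq ft house from the fitted line.
predicted = b0 + b1 * 1300
print(f"price ≈ {b0:.0f} + {b1:.1f} × size; 1300 sq ft → {predicted:.0f}")
```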

94
Multiple Linear Regression
• Multiple Linear Regression: An extension of linear regression that uses more than one independent
variable. The model can be written as

Y = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + ⋯ + 𝛽n 𝑥n + 𝜖

where:
• 𝑌 is the dependent variable.
• 𝛽0 is the intercept.
• 𝛽1, …, 𝛽n are the coefficients of the independent variables 𝑥1, …, 𝑥n.
• 𝜖 is the error term.

Predicting Salary:
Scenario: Estimating an employee’s salary based on multiple factors, such as years of
experience, education level, and job role.
Example: Using a dataset that includes years of experience, education level (Bachelor’s,
Master’s, PhD), and job roles (Developer, Manager, Analyst) to predict salaries.
95
Logistic Regression
Used for binary classification problems, where the dependent variable is categorical (e.g.,
success/failure, yes/no). It uses the logistic function to model the probability of the default class.

Spam Detection:
Scenario: Classifying emails as spam or not spam.
Example: Using logistic regression to build a spam filter that identifies emails as
spam based on features like the presence of certain keywords, sender information, and
email structure.
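
The logistic function turns a weighted sum of features into a probability between 0 and 1. The sketch below only shows that scoring step; the feature names, weights, and bias are hypothetical and are not fitted to any real data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def spam_probability(features, weights, bias):
    """Probability of spam for one email under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical binary features: [contains "free", unknown sender, has signature].
weights = [2.0, 1.5, -1.0]  # assumed values, not learned from data
bias = -1.0

email = [1, 1, 0]           # contains "free", unknown sender, no signature
p = spam_probability(email, weights, bias)
label = "spam" if p >= 0.5 else "not spam"
print(f"P(spam) = {p:.3f} -> {label}")
```

In practice the weights and bias would be estimated from labelled emails rather than set by hand, and the 0.5 threshold can be moved depending on how costly each kind of misclassification is.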
96
Questions?

Lecture 9descriptivestatistics 171204035552
26 pages
Midterm 2 - Pec 8 Assessment of Learning 1
No ratings yet
Midterm 2 - Pec 8 Assessment of Learning 1
26 pages
Descriptive Analytics Notes
No ratings yet
Descriptive Analytics Notes
6 pages
Economics GR 11 Ist Term Project
No ratings yet
Economics GR 11 Ist Term Project
9 pages
Mathematics in The Modern World
No ratings yet
Mathematics in The Modern World
13 pages
Analysis of Quantitative Data
No ratings yet
Analysis of Quantitative Data
18 pages
Stats For Data Science
No ratings yet
Stats For Data Science
21 pages
Statistical Foundations for Psychology
From Everand
Statistical Foundations for Psychology
James C. Ware
No ratings yet
2 Phase Vertical
No ratings yet
2 Phase Vertical
4 pages
SSC-JE Non-Tech Book 2021 Sample Pages
No ratings yet
SSC-JE Non-Tech Book 2021 Sample Pages
20 pages
Unit 1 Starting Points For The Understanding of Culture, Society, and Politics
No ratings yet
Unit 1 Starting Points For The Understanding of Culture, Society, and Politics
26 pages
Genetics Key11111
100% (1)
Genetics Key11111
7 pages
Chap.2 - Inside Reading 2
No ratings yet
Chap.2 - Inside Reading 2
9 pages
New Arrival
No ratings yet
New Arrival
7 pages
LPL - Paschim Vhr-Iv Dr. Umesh Mittal, House No - 233, Block A-5 Delhi
100% (1)
LPL - Paschim Vhr-Iv Dr. Umesh Mittal, House No - 233, Block A-5 Delhi
1 page
Module 3: Wear: Fig. 3.1 (A) : Zero Wear of Helical Gear
No ratings yet
Module 3: Wear: Fig. 3.1 (A) : Zero Wear of Helical Gear
30 pages
Rheologyof Molten Polymers
No ratings yet
Rheologyof Molten Polymers
16 pages
Epigenetic Mechanisms of Cell Programming and Reprogramming
No ratings yet
Epigenetic Mechanisms of Cell Programming and Reprogramming
68 pages
2018 SAHC Nicoletta
No ratings yet
2018 SAHC Nicoletta
155 pages
Title: Flight Controls - Spoiler and Elevator Computer - Install Sec 125 Hardware B'
No ratings yet
Title: Flight Controls - Spoiler and Elevator Computer - Install Sec 125 Hardware B'
83 pages
ZIMMERMAN - Integrated Design Process Guide
No ratings yet
ZIMMERMAN - Integrated Design Process Guide
18 pages
(Ebook) Foundations of Factor Analysis, Second Edition by Stanley A Mulaik ISBN 9781420099614, 1420099612 - The ebook in PDF/DOCX format is available for instant download
100% (2)
(Ebook) Foundations of Factor Analysis, Second Edition by Stanley A Mulaik ISBN 9781420099614, 1420099612 - The ebook in PDF/DOCX format is available for instant download
53 pages
MP28167GQ A
No ratings yet
MP28167GQ A
32 pages
Curro DigiEd Year and Term Planners 2022 Final
No ratings yet
Curro DigiEd Year and Term Planners 2022 Final
20 pages
Clean: Soil Air Water
No ratings yet
Clean: Soil Air Water
7 pages
The Weather - Vocabulary 1st Y
100% (1)
The Weather - Vocabulary 1st Y
2 pages
Senkinesh Draft Reference
No ratings yet
Senkinesh Draft Reference
6 pages
Astronautics Homework
No ratings yet
Astronautics Homework
1 page
The Project Gutenberg eBook of Dynamic Thought or the Law of Vib
No ratings yet
The Project Gutenberg eBook of Dynamic Thought or the Law of Vib
101 pages
Copper Alloy C26800
No ratings yet
Copper Alloy C26800
13 pages
Student-Led School Watching and Hazard Mapping (Lifted From)
No ratings yet
Student-Led School Watching and Hazard Mapping (Lifted From)
3 pages
Laboratory Exercise-Ph Meter
No ratings yet
Laboratory Exercise-Ph Meter
3 pages
Leadership Styles1
No ratings yet
Leadership Styles1
22 pages
Iso 3425 1975
No ratings yet
Iso 3425 1975
4 pages
Assignment # 3 Solution: Your Work To Receive Full Credit
No ratings yet
Assignment # 3 Solution: Your Work To Receive Full Credit
4 pages
CC7003-Industrial Safety Management PDF
No ratings yet
CC7003-Industrial Safety Management PDF
5 pages
Provincial - For - Water - Supply - and - Sanitation - Project - PWSSP - at - BTB Intake Facility 14 Juky 2021.
No ratings yet
Provincial - For - Water - Supply - and - Sanitation - Project - PWSSP - at - BTB Intake Facility 14 Juky 2021.
11 pages

2 - Introduction To Statistics

Uploaded by

2 - Introduction To Statistics

Uploaded by

• Measures of Central Tendency:

• Measures of Dispersion or Variation:

❑ Measures of Central Tendency:

Example Data 1: [-10, 0, 10, 20, 30]

For data 1: (-10 + 0 + 10 + 20 + 30) / 5 = 10
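
The arithmetic above can be checked with a short Python sketch. Only the data values come from the example; the variable names are illustrative assumptions.

```python
from statistics import mean

# Example Data 1 from the slide
data_1 = [-10, 0, 10, 20, 30]

# Mean by hand: sum of the values divided by how many there are
manual_mean = sum(data_1) / len(data_1)

# Same result from the standard library
library_mean = mean(data_1)

print(manual_mean, library_mean)
```

Both approaches agree, which is a quick sanity check when a dataset is too long to sum by eye.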

• Application: Common in market research, quality control, and inventory management.

For data 1: 13 is the most frequent

Mean: 6,915 $ → 7,676 $

Median: 7,200 $ → 7,200 $
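
The figures above appear to contrast how the mean and the median react to an extreme value. The sketch below illustrates that idea with made-up salaries; the numbers are assumptions chosen for illustration and are not the data behind the slide.

```python
from statistics import mean, median

# Hypothetical salaries (not the slide's data)
salaries = [6800, 7000, 7200, 7400, 7600]

# Same list, but the highest salary is replaced by an extreme value
salaries_with_outlier = [6800, 7000, 7200, 7400, 15000]

mean_before = mean(salaries)
mean_after = mean(salaries_with_outlier)
median_before = median(salaries)
median_after = median(salaries_with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

In this sketch the mean is pulled upward by the single large salary while the median stays where it was, which is why the median is often preferred for skewed data such as incomes.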

[Figure: three distributions with the same central tendency (Mean: 20, Median: 20) but different spread in data]

➢ It is always non-negative since each term in the variance sum is squared,

➢ Since the population variance is squared, we cannot compare it directly

➢ Application: Used in risk management, investment analysis, and quality control.

• Average = 4/7 = 0.57

● Properties of Standard Deviation:

● The smallest value of the standard deviation is 0 since it cannot be negative.
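
A minimal sketch of these properties, assuming three hypothetical datasets that share a mean of 20 but differ in spread. It uses the population standard deviation (`statistics.pstdev`); the dataset names and values are illustrative assumptions.

```python
from math import isclose, sqrt
from statistics import mean, pstdev

# Hypothetical datasets, all with mean 20
no_spread = [20, 20, 20, 20, 20]
some_spread = [10, 15, 20, 25, 30]
wide_spread = [0, 10, 20, 30, 40]

for name, values in [("no_spread", no_spread),
                     ("some_spread", some_spread),
                     ("wide_spread", wide_spread)]:
    # pstdev is the square root of the population variance
    print(name, "mean =", mean(values), "std dev =", round(pstdev(values), 3))

sd_none = pstdev(no_spread)    # identical values give 0
sd_some = pstdev(some_spread)  # sqrt(50)
sd_wide = pstdev(wide_spread)  # sqrt(200)
```

The standard deviation is 0 only when every value is identical, and it grows as the values move further from the mean, even though the mean itself does not change.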

• Population: is a collection of all possible

• Sample: is a portion or part of the population of

• When resources such as time, cost, and manpower are limited,

• When conducting research or experiments where it is important
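
A minimal sketch of drawing a sample from a population. The population (the integers 1 to 1000) and the sample size of 50 are made-up assumptions for illustration only.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: every value we could possibly observe
population = list(range(1, 1001))

# Simple random sample without replacement
sample = random.sample(population, k=50)

population_mean = mean(population)
sample_mean = mean(sample)

print("population mean:", population_mean)
print("sample mean:    ", sample_mean)
```

The sample mean will usually be close to, but not exactly equal to, the population mean; inferential statistics deals with quantifying that gap.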

• Application: Widely used in everyday scenarios such as evaluating

• Observations can be considered outliers when they lie more than 1.5 × IQR below Q1 or above Q3.

• Outliers: x < Q1 – 1.5 × IQR (or) x > Q3 + 1.5 × IQR
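
The IQR rule can be sketched in Python as follows. The data values are hypothetical, and `statistics.quantiles` is used with its default settings; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

# Hypothetical data with one suspicious value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# n=4 returns the three cut points Q1, Q2 (median), Q3
q1, _q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

A flagged value is only a candidate outlier; whether to keep or drop it depends on the context of the data.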

Percentile(x) = (Number of values that fall below ‘x’ / Total number of values) × 100

• Calculate the percentage of observations or data points

What is the 80th Percentile observation?

Total Observations * 0.8
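
Both percentile calculations can be sketched as below. The scores are hypothetical, and the function name `percentile_rank` is an assumption. The position rule shown (total observations × 0.8, rounded up, counted in sorted order) is one common convention; other conventions interpolate between neighbouring values.

```python
from math import ceil

def percentile_rank(values, x):
    """Percentage of values that fall strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

# Hypothetical exam scores
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

# Percentile of the score 85: 6 of 10 values are below it
rank_of_85 = percentile_rank(scores, 85)

# 80th percentile observation: position = total observations * 0.8
ordered = sorted(scores)
position = ceil(len(ordered) * 80 / 100)  # 1-based position in sorted order
value_at_80th = ordered[position - 1]

print("percentile of 85:", rank_of_85)
print("80th percentile position:", position, "value:", value_at_80th)
```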

Example from our daily life:

Use Scatter plot in visualization of correlation

❑ Describe linear relations only

➢ X: Values of the x-variable in a sample

Student Math Science
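
A minimal sketch of the Pearson correlation coefficient, written by hand so each step is visible. The Math and Science scores are hypothetical, since the table values are not available here, and the function name `pearson_r` is an assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(sq_dev_x * sq_dev_y)

# Hypothetical scores for four students
math_scores = [60, 70, 80, 90]
science_scores = [62, 75, 79, 94]
r_scores = pearson_r(math_scores, science_scores)

# A perfect but non-linear relation (y = x squared) gives r = 0
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_curved = pearson_r(xs, ys)

print("math vs science:", round(r_scores, 3))
print("curved relation:", r_curved)
```

The second case shows why correlation describes linear relations only: y is fully determined by x, yet r is zero. A scatter plot would reveal the pattern that the coefficient misses.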

❑ The Importance of Normal Distribution

➢ Elements of a hypothesis test:

❑ Population: The entire group being studied.

α = P(Type I Error)    β = P(Type II Error)

[Figure: decision tree for choosing a test, branching on whether the population is normal or not normal and on whether σ is known or not known]

● Drug company has new drug, wishes to compare it with

● Null hypothesis - New drug is no better than standard treatment

H₀: μ_New − μ_Std ≤ 0 (μ_New − μ_Std = 0)
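
One way such a hypothesis could be tested is a one-sided two-sample z-test, sketched below. This is an illustration under stated assumptions and not necessarily the method the slides use: σ is assumed known and equal for both groups, and every number (means, σ, sample sizes, α) is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary data (not from the slides)
mean_new, mean_std = 52.0, 50.0   # sample means for new and standard treatment
sigma = 5.0                       # assumed known population standard deviation
n_new = n_std = 50                # sample sizes
alpha = 0.05                      # accepted probability of a Type I error

# Standard error of the difference between two independent sample means
standard_error = sqrt(sigma ** 2 / n_new + sigma ** 2 / n_std)

# Test statistic under the boundary of H0 (mu_new - mu_std = 0)
z = (mean_new - mean_std) / standard_error

# One-sided p-value: probability of a z at least this large if H0 holds
p_value = 1 - NormalDist().cdf(z)

reject_h0 = p_value < alpha
print("z =", z, "p-value =", round(p_value, 4), "reject H0:", reject_h0)
```

With these made-up numbers the p-value falls below α, so H₀ would be rejected in favour of the new drug being better. When σ is not known, a t-test is the usual alternative.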

Predicting House Prices:
