Statistics
STATISTICS – The science of collecting, organising and analysing data.
DATA – A collection of facts / pieces of information.
Types of Statistics –
1.Descriptive Statistics
2.Inferential Statistics
Descriptive Statistics – Consists of organising and summarising the data using summary measures and visualisation plots. It is used extensively in EDA + FE (exploratory data analysis and feature engineering) to understand the data.
Example: Histogram, bar chart, pie chart, distribution plots. Typical questions: What is the average age of students in the classroom? What is the relation between age and weight?
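A minimal sketch of descriptive statistics in Python, assuming a small hypothetical pandas DataFrame with age and weight columns (the data and column names are illustrative, not from the notes):

import pandas as pd

# Hypothetical classroom data (illustrative values only)
df = pd.DataFrame({
    "age": [21, 22, 23, 22, 24, 25, 21, 23],
    "weight": [58, 61, 64, 60, 70, 72, 57, 65],
})

# Summary statistics: count, mean, std, min, quartiles, max
print(df.describe())

# Average age of students in the classroom
print("Mean age:", df["age"].mean())

# Relation between age and weight (correlation matrix)
print(df.corr())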
Inferential Statistics – Consists of collecting sample (n) data and drawing conclusions about the population (N) using experiments. Conclusions are made using hypothesis testing, with techniques such as confidence intervals, p-values, Z-test, t-test, chi-square test and ANOVA (F-test).
Example: University – 500 students (population); Classroom A – 60 students (sample). We draw a conclusion about the average age of the entire university from the sample. Is the average age of the students in the classroom less than or greater than the average age of students in the university?
Sampling Techniques –
1.Simple Random Sampling: Every member of the population (N) has an equal chance
of being selected for the sample(n). Ex: Exit polls, lottery
2.Stratified Sampling: Strata -> Layers -> Clusters ->Groups. We focus on picking
samples from a group Ex: Male/Female, educational degrees, blood groups
3.Systematic Sampling: Method which targets every nth individual out of population
(N). Ex: Credit cards at airport – agent 1 will approach every 5th person and agent 2
will approach every 9th person to sell credit cards.
4.Convenience Sampling: Only those who are interested in the survey will be
considered for participation. Ex: Students interested in DS program will be sent
brochures and information regarding the course; Job application by candidates who
are interested in the particular job
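A minimal sketch of the first three sampling methods above using pandas (the population DataFrame, its size and the gender column are hypothetical, for illustration only):

import pandas as pd

# Hypothetical population of 500 students with a gender column (strata)
population = pd.DataFrame({
    "student_id": range(500),
    "gender": ["M", "F"] * 250,
})

# 1. Simple random sampling: every member has an equal chance of selection
simple_random = population.sample(n=60, random_state=42)

# 2. Stratified sampling: sample within each group (stratum)
stratified = population.groupby("gender", group_keys=False).apply(
    lambda g: g.sample(n=30, random_state=42)
)

# 3. Systematic sampling: pick every nth individual (here every 5th)
systematic = population.iloc[::5]

print(len(simple_random), len(stratified), len(systematic))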
Variable – A property that can take any value. There are 2 different types of variables, namely quantitative (numerical) and qualitative (categorical) variables.
Measure of Central Tendency –
X = {1,2,3,4,5}
Mean: (1+2+3+4+5) / 5 = 3
Population Mean (µ) = (1/N) Σ_{i=1}^{N} X_i
Sample Mean (x̄) = (1/n) Σ_{i=1}^{n} X_i
Median: If we have outliers, we should use the median instead of the mean, since the median is not affected by extreme values.
Mode: The most frequent (most repeated) element in the list. A practical use is when a categorical variable contains NaN values: the most repeated value (mode) can be used to replace the NaN values.
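A minimal sketch of mode imputation for missing categorical values with pandas (the column name and values are hypothetical):

import pandas as pd
import numpy as np

# Hypothetical categorical column with missing values
df = pd.DataFrame({"blood_group": ["O+", "A+", np.nan, "O+", "B+", np.nan, "O+"]})

# Mode = most frequent category; use it to fill the NaN values
mode_value = df["blood_group"].mode()[0]
df["blood_group"] = df["blood_group"].fillna(mode_value)

print(mode_value)          # "O+"
print(df["blood_group"])   # NaNs replaced by the mode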
Measure of Dispersion –
1.Variance (σ2) – Describes the spread of the data; the higher the variance, the more spread out the data is.
2.Standard Deviation (σ) – The square root of the variance; it is used to describe how far a value falls from the mean (e.g. a value can lie 1 or 2 standard deviations away from the mean).
Population Variance (σ2) = (1/N) Σ_{i=1}^{N} (X_i − µ)²
Sample Variance (s2) = (1/(n − 1)) Σ_{i=1}^{n} (X_i − x̄)²
It is natural to wonder why the sum of the squared deviations is divided by n−1 rather than
n. The purpose in computing the sample standard deviation is to estimate the amount of
spread in the population from which the sample was drawn.
Ideally, therefore, we would compute deviations from the mean of all the items in the
population, rather than the deviations from the sample mean.
However, the population mean is in general unknown, so the sample mean is used in its
place.
It is a mathematical fact that the deviations around the sample mean tend to be a bit
smaller than the deviations around the population mean and that dividing by n−1 rather
than n provides exactly the right correction.
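A minimal sketch of the n versus n−1 correction in NumPy (the data values are illustrative):

import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Population variance: divide by N (ddof=0, NumPy's default)
pop_var = np.var(x, ddof=0)      # 2.0

# Sample variance: divide by n - 1 (ddof=1, Bessel's correction)
samp_var = np.var(x, ddof=1)     # 2.5

# Standard deviations are the square roots of the variances
print(pop_var, samp_var, np.std(x, ddof=0), np.std(x, ddof=1))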
Percentile – A value below which a given percentage of observations fall.
Ex: 99th percentile – It means this person has scored better than 99% of all the students.
5-number summary: This can be used to detect and remove outliers.
1.Minimum
2.First Quartile (25th percentile) Q1
3.Median
4.Third Quartile (75th percentile) Q3
5.Maximum
Q1 = 25/100 * (20+1) = 5.25 index (take the average of the 5th and 6th values) = 3
Q3 = 75/100 * (20+1) = 15.75 index (take the average of the 15th and 16th values) = 7.5
IQR = Q3 - Q1 = 7.5 - 3 = 4.5
Lower fence = Q1 - 1.5 * IQR = 3 - 6.75 = -3.75
Higher fence = Q3 + 1.5 * IQR = 7.5 + 6.75 = 14.25
Conclusion: Since the lowest value in the dataset is 1, there is no outlier below the lower fence. However, 27 is greater than the higher fence value of 14.25, hence it can be treated as an outlier and eliminated from the list.
1.Minimum = 1
2.First Quartile (25th percentile) Q1 = 3
3.Median = 5
4.Third Quartile (75th percentile) Q3 = 7.5
5.Maximum = 9
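A minimal sketch of IQR-based outlier removal in Python (the list of values is hypothetical, not the dataset from the notes):

import numpy as np

# Hypothetical data containing one extreme value (27)
data = np.array([1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 27])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
higher_fence = q3 + 1.5 * iqr

# Keep only values inside the fences; 27 would be dropped here
cleaned = data[(data >= lower_fence) & (data <= higher_fence)]
print(lower_fence, higher_fence, cleaned)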
INTRODUCTION TO DISTRIBUTION
1. Normal Distribution
2. Standard Normal Distribution
3. Z-Score
4. Standardization and Normalization
Gaussian / Normal Distribution –
[Figure: bell curve centred at the mean]
1.Both sides around the mean are symmetrical / equal.
2.The area under the bell curve is 1 -> 100%
Empirical rule (68-95-99.7):
1.Within 1 standard deviation on either side of the mean lies about 68% of the data.
2.Within 2 standard deviations on either side of the mean lies about 95% of the data.
3.Within 3 standard deviations on either side of the mean lies about 99.7% of the data.
Assume a variable X belonging to a Gaussian distribution with mean (µ) and standard deviation (σ). This variable can be converted into another variable Z belonging to the standard normal distribution, with mean µ = 0 and standard deviation σ = 1, using the Z-score formula (this is what standard scaling / standardisation does). The main reason for doing this is to bring features with different units onto one comparable scale, which also speeds up calculations.
Z-score: X_new = (X - mean) / Std
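A minimal sketch of standardisation (Z-scores) in NumPy, assuming an illustrative array of ages:

import numpy as np

ages = np.array([18, 21, 24, 27, 30, 33, 36])

# Z-score: subtract the mean and divide by the standard deviation
z = (ages - ages.mean()) / ages.std()

print(round(z.mean(), 10))  # ~0: the standardised variable has mean 0
print(z.std())              # 1: and standard deviation 1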
Normalization – The process of transforming a dataset into a specific range that we define. Ex: MinMax Scaler (typically the range [0, 1]). Normalization is useful when there are no outliers, as it cannot cope with them. Usually, we would scale age rather than income this way, because only a few people have very high incomes (outliers), while ages are spread fairly uniformly.
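A minimal sketch of min-max normalization with scikit-learn's MinMaxScaler (the data and the [0, 1] range are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[18], [21], [24], [30], [45], [60]])  # one column, as sklearn expects

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(ages)

print(scaled.min(), scaled.max())  # 0.0 and 1.0: data mapped into the chosen range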
The area under the curve can be calculated by computing the Z-score and looking up that Z value in the Z-table (link given below).
The Central Limit Theorem (CLT) says that for a population (N) with any underlying distribution (Gaussian, log-normal, or otherwise), if we repeatedly draw samples of size n >= 30, the distribution of the sample means follows an approximately normal / Gaussian distribution.
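A minimal simulation sketch of the CLT, drawing samples from a skewed (exponential) population and summarising the distribution of sample means (all numbers are illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Skewed (non-Gaussian) population: exponential with mean 2
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size n >= 30 and record each sample mean
sample_means = np.array([
    rng.choice(population, size=30, replace=False).mean()
    for _ in range(2_000)
])

# The sample means cluster around the population mean (~2),
# and their spread shrinks like sigma / sqrt(n)
print(sample_means.mean(), sample_means.std(), population.std() / np.sqrt(30))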
Probability – A measure of the likelihood of an event.
Addition Rule of Probabilities -
1.Mutually exclusive events – Two events are mutually exclusive if they cannot occur at the same time. Ex – Rolling a die (a single roll cannot be both a 2 and a 5); tossing a coin; winter or summer.
P(A or B) = P(A) + P(B)
2.Non-mutually exclusive events – Two events that can occur at the same time. Ex – Picking a random card from a deck of cards: the events "heart" and "king" can occur at the same time (the king of hearts).
P(A or B) = P(A) + P(B) - P(A and B)
Multiplication Rule of Probabilities -
1.Dependent events – Two events are dependent if the outcome of one affects the other.
Ex – From a bag of 4 white (W) and 3 yellow (Y) marbles, pick a marble. The probability of it being white is 4/7. Without replacing it, pick another marble; the probability of it being yellow is now 3/6. Notice that initially we had 7 marbles and then only 6, so the first event has affected the outcome of the second event, hence the name dependent events.
2.Independent events – Two events are independent if they do not affect each other.
Ex – Tossing a coin twice.
Permutation – An arrangement of objects in which order matters.
nPr = n! / (n-r)!
Combination – A selection of objects in which order does not matter.
nCr = n! / (r!(n-r)!)
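A minimal sketch using Python's standard library to evaluate permutations and combinations (the n and r values are arbitrary examples):

import math

n, r = 5, 2

# Permutations: nPr = n! / (n - r)!  (order matters)
print(math.perm(n, r))  # 20

# Combinations: nCr = n! / (r! * (n - r)!)  (order does not matter)
print(math.comb(n, r))  # 10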
Covariance – Measures how two variables vary together:
Cov(x, y) = Σ_{i=1}^{N} (x_i - x̄)(y_i - ȳ) / N
where x̄ = mean of x, ȳ = mean of y, N = number of data values
Pearson Correlation Coefficient (PCC) – The Pearson correlation coefficient (r) is the most common way of measuring a linear correlation. It is a number between -1 and +1 that measures the strength and direction of the relationship between two variables. Covariance has no restriction on the range or scale of values it can take; it may be +3000 or -6543, for example. To overcome this, the Pearson correlation coefficient is restricted to the range -1 to +1. The closer the value is to +1, the stronger the positive correlation; the closer the value is to -1, the stronger the negative correlation; values near 0 indicate little or no linear relationship. PCC is used only for linear relationships.
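A minimal sketch computing covariance and the Pearson correlation coefficient in Python (the x and y arrays are made-up examples):

import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Covariance: unbounded, depends on the units of x and y
cov_xy = np.cov(x, y, ddof=0)[0, 1]

# Pearson r: covariance rescaled by the standard deviations, always in [-1, 1]
r_manual = cov_xy / (x.std(ddof=0) * y.std(ddof=0))
r_scipy, p_value = pearsonr(x, y)

print(cov_xy, r_manual, r_scipy)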
Point Estimate – The value of any statistic (e.g. the sample mean x̄) that estimates the value of a parameter (e.g. the population mean µ) is called a point estimate.
Point estimate +/- Margin of error ≈ µ
Confidence interval ->
Lower CI: Point estimate – Margin of error
Higher CI: Point estimate + Margin of error
Problem Statement: On the quant test of CAT exam, a sample of 25 students has a mean of
520 with a population standard deviation of 100. Construct a 95% CI about the mean.
Solution:
n = 25; x̄ = 520; σ = 100; CI = 95%;
Significance Value (SV) = 1 – CI = 1 - 0.95 = 0.05
WKT, Lower CI = Point estimate – Margin of error
& Higher CI = Point estimate + Margin of error
Lower CI = 520 – Z(0.05/2) * (σ/√n) = 520 – Z(0.025) * (100/√25)
Z(0.025) is the critical value with a tail area of 0.025, i.e. a cumulative probability of 1 - 0.025 = 0.975; from the Z-table this is 1.96.
Lower CI = 520 – 1.96*(20) = 520 – 39.2 = 480.8
Similarly, Higher CI = 520 + 1.96*(20) = 520 + 39.2 = 559.2
Conclusion: We are 95% confident that the population mean µ lies within the confidence interval ranging between 480.8 and 559.2.
Solution: Initial analysis of the problem statement shows that this is a two-tailed test, since the employee believes that the average amount of medicine is not 80 ml; it could be either greater than or less than 80 ml.
WKT, µ = 80; n = … ; x̄ = … ; s = … ; CI = 0.95
SV = 1- CI = 1 – 0.95 = 0.05
In this case, although we only have the sample standard deviation, the sample size is greater than 30, so we use the Z-test.
The significance value is split across both tails of the distribution around the mean, 0.025 on each side. The area under the curve can be found by considering 1 - 0.025 = 0.975; from the Z-table, the value corresponding to 0.975 is 1.96.
The critical values obtained from the Z-table for an SV of 0.025 in each tail are +/- 1.96. Since the computed test statistic of -5.05 does not fall within the range -1.96 to +1.96, we reject the null hypothesis.
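A minimal sketch of a two-tailed Z-test in Python; the sample size, sample mean and sample standard deviation below are hypothetical placeholders, since those values are not fully legible in the notes:

import numpy as np
from scipy.stats import norm

mu_0 = 80          # hypothesised population mean (ml)
n = 40             # hypothetical sample size (> 30, so a Z-test is used)
x_bar = 78.0       # hypothetical sample mean
s = 2.5            # hypothetical sample standard deviation
alpha = 0.05       # significance value for a 95% confidence level

# Test statistic
z = (x_bar - mu_0) / (s / np.sqrt(n))

# Two-tailed critical value (±1.96 for alpha = 0.05)
z_crit = norm.ppf(1 - alpha / 2)

print(z, z_crit)
if abs(z) > z_crit:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")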
Chi-Square Test – This test is used for population proportions. It is a non-parametric test performed on categorical variables / datasets (both nominal and ordinal). Ex: ranks.
Problem Statement: In the 2000 USA census, the ages of individuals in a small town were found to be the following –
Solution:
f0 – Observed value
fe – Expected value
X² = Σ (f0 - fe)² / fe
X² = [(121 - 100)² / 100] + [(285 - 150)² / 150] + [(91 - 250)² / 250]
X² = 232.494
Conclusion: Since X² = 232.494 is greater than the critical value of 5.991 (chi-square distribution, df = 2, α = 0.05), we reject the null hypothesis.
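A minimal sketch of a chi-square goodness-of-fit test in Python; the observed and expected counts below are hypothetical placeholders, since the original table is not reproduced in the notes:

import numpy as np
from scipy.stats import chi2

# Hypothetical observed and expected counts for three age categories
f_obs = np.array([110, 160, 230], dtype=float)
f_exp = np.array([100, 150, 250], dtype=float)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_stat = np.sum((f_obs - f_exp) ** 2 / f_exp)

# Critical value at alpha = 0.05 with df = number of categories - 1 = 2
critical = chi2.ppf(0.95, df=len(f_obs) - 1)   # ~5.991

print(chi2_stat, critical)
print("Reject H0" if chi2_stat > critical else "Fail to reject H0")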
The t-test is a method that determines whether two populations are statistically different
from each other, whereas ANOVA determines whether three or more populations are
statistically different from each other.
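A minimal sketch contrasting a two-sample t-test and a one-way ANOVA with SciPy (the three groups are made-up samples):

import numpy as np
from scipy.stats import ttest_ind, f_oneway

rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=55, scale=5, size=30)

# t-test: are two populations statistically different?
t_stat, t_p = ttest_ind(group_a, group_b)

# ANOVA (F-test): are three or more populations statistically different?
f_stat, f_p = f_oneway(group_a, group_b, group_c)

print(t_stat, t_p)
print(f_stat, f_p)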