Data Analysis - INCOMPLETE - 2
DATA ANALYSIS
• H₀ (null hypothesis): there is no difference between the two variables (populations); both have the same distribution.
• Hₐ (alternative hypothesis): the two populations (variables) are not equal.
• p-value: if the p-value is less than a specified significance level α (alpha; usually 0.05), the difference is
significant and the null hypothesis H₀ is rejected.
o The p-value (probability value) tells how likely it is to observe data at least as extreme as the observed set if the null hypothesis were true.
o The smaller the p-value, the stronger the evidence against the null hypothesis.
o In practice the p-value does not reach exactly zero, because there is always some possibility that the observations arose by chance.
• If H₀ is rejected: the two variables are concluded not to come from the same distribution.
Example: significance level 0.05; degrees of freedom = 2; test result (chi-square statistic) = 0.7533
• If the null hypothesis is true, 95 times out of 100 a sample will give a chi-square
value of 5.99 or less.
• 0.7533 is less than 5.99 → fail to reject the null hypothesis at the 0.05 significance level
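The comparison in this example can be checked numerically. The sketch below assumes the test result is a chi-square statistic (the critical value 5.99 at 2 degrees of freedom suggests so) and relies on the fact that, for exactly 2 degrees of freedom, the chi-square CDF is 1 − exp(−x/2). The function names are illustrative only and do not come from any library.

```python
import math

def chi2_df2_critical(alpha):
    # For 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
    # so the upper-tail critical value has a closed form.
    return -2.0 * math.log(alpha)

def chi2_df2_p_value(statistic):
    # Upper-tail probability P(X >= statistic) for 2 degrees of freedom.
    return math.exp(-statistic / 2.0)

alpha = 0.05
statistic = 0.7533

critical = chi2_df2_critical(alpha)     # about 5.99
p_value = chi2_df2_p_value(statistic)   # well above alpha
reject_h0 = statistic > critical        # equivalent to p_value < alpha

print(round(critical, 2), round(p_value, 3), reject_h0)
```

Both decision rules agree here: the statistic is below the critical value and the p-value is above α, so the null hypothesis is not rejected. For other degrees of freedom this closed form does not apply and a table or a statistics library would be needed.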
• Normal/Gaussian/bell-shaped distribution: a continuous probability distribution that is symmetrical around its mean.
• Most observations cluster around the central peak.
• Probabilities for values further away from the mean taper off equally in both directions.
• Extreme values in both tails of the distribution are similarly unlikely.
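The symmetry properties listed above can be illustrated with Python's standard library. This is a small sketch for a standard normal distribution (mean 0, standard deviation 1); the offsets 1.5 and 3.0 are arbitrary values chosen for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)

# Density is the same at equal distances on either side of the mean.
offset = 1.5
left_density = dist.pdf(dist.mean - offset)
right_density = dist.pdf(dist.mean + offset)

# The two tails beyond 3 standard deviations are equally unlikely.
lower_tail = dist.cdf(-3.0)
upper_tail = 1.0 - dist.cdf(3.0)

# Most observations cluster near the centre: share within 1 standard deviation.
within_one_sd = dist.cdf(1.0) - dist.cdf(-1.0)

print(left_density, right_density)
print(lower_tail, upper_tail)
print(within_one_sd)
```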
• Box–Cox transformation
• In statistics, a confidence interval is a range of values, computed from a sample, that is expected to contain a
population parameter for a certain proportion (percentage) of repeated samples.
• Confidence intervals measure the degree of uncertainty or certainty in an estimate.
• Most common are the 95% and 99% confidence levels.
Confidence level = 100 × (1 − 𝛼) %
• For 90% confidence level 𝛼 is 0.1; for 95% confidence level 𝛼 is 0.05; for 99% confidence level 𝛼 is 0.01.
• A 95% confidence level means that if the experiment is repeated over and over again, about 95% of the intervals constructed will contain the true value.
• Example: a survey is conducted on a group of pet owners to see how many cans of dog food they purchase a year.
Estimating at the 99% confidence level gives a confidence interval of (200, 300) → the true average is estimated to
lie between 200 and 300 cans a year (with 99% confidence).
• Confidence Interval (CI) is a range of values we are fairly sure our true value lies in.
o Normal or z-distribution
x̄ ± z * σ / (√n)   (x̄: sample mean; σ: population standard deviation; n: sample size)
Confidence Interval    Z-score
80%                    1.282
85%                    1.440
90%                    1.645
95%                    1.960
99%                    2.576
99.5%                  2.807
99.9%                  3.291
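The z-scores in this table can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_score` is only illustrative. It assumes a two-sided interval, so the tail area α is split equally between the two tails.

```python
from statistics import NormalDist

def z_score(confidence_level):
    # Two-sided interval: alpha is split between both tails,
    # so look up the quantile at 1 - alpha / 2.
    alpha = 1.0 - confidence_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.80, 0.85, 0.90, 0.95, 0.99, 0.995, 0.999):
    print(f"{level:.1%}  {z_score(level):.3f}")
```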
Example: Construct a 98% Confidence Interval based on the following data: 45, 55, 67, 45, 68, 79, 98, 87, 84, 82.
Example: Construct a 95% confidence interval for an experiment that found the sample mean temperature for a certain
city in August was 101.82, with a population standard deviation of 1.2. There were 6 samples in this experiment.
• Step 1: Subtract the confidence level (given as 95 percent in the question) from 1 and then divide the result by two.
alpha level (represents the area in one tail) = (1 – 0.95) / 2 = 0.025
• Step 2: Find the z-score for this tail area from the z-table: z-score = 1.96.
• Step 3: Plug the numbers into the second part of the formula (the margin of error) and solve: z * σ / (√n)
= 1.96 * 1.2/√(6) = 1.96 * 0.49 = 0.96
• Step 4: Find the CI:
Lower end of CI range, subtract step 3 from the mean = 101.82 – 0.96 = 100.86
Upper end of CI range, add step 3 to the mean = 101.82 + 0.96 = 102.78.
CI is (100.86, 102.78)
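The four steps of this worked example can be sketched in a few lines of Python. The function name `z_confidence_interval` is hypothetical; it assumes a known population standard deviation, as in the example, and obtains the z-score from `statistics.NormalDist` instead of a printed z-table.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence_level):
    # Steps 1 and 2: one-tail area, then the matching z-score.
    tail_area = (1.0 - confidence_level) / 2.0
    z = NormalDist().inv_cdf(1.0 - tail_area)
    # Step 3: margin of error z * sigma / sqrt(n).
    margin = z * sigma / math.sqrt(n)
    # Step 4: lower and upper ends of the interval.
    return sample_mean - margin, sample_mean + margin

lower, upper = z_confidence_interval(101.82, 1.2, 6, 0.95)
print(round(lower, 2), round(upper, 2))
```

With the numbers from the example this gives the same interval as the hand calculation when rounded to two decimals.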
Non-parametric Tests
• No fixed set of parameters is assumed, and no knowledge of the underlying distribution (e.g. normal) is required.
• No assumption is made about the parameters of the given population.
• Also referred to as distribution-free tests.
• Popular because they are easy to apply and understand, and less complex.
• Examples: Chi-square test; Mann-Whitney U-test; Kruskal-Wallis H-test
• Bessel's correction: use of ‘n − 1’ instead of ‘n’ in the formula for sample variance and sample standard deviation.
• Corrects the bias in the estimation of population variance and population standard deviation.
• Except for rare cases (sample mean = population mean), the data will be closer to the sample mean than to
the true population mean.
• So the sum of squared deviations (the numerator) will probably be a bit smaller than it would be if the true
population mean were used. To make up for this, divide by ‘n − 1’ (a smaller value) rather than ‘n’.
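The effect of dividing by n − 1 rather than n can be seen on a small data set. The sketch below reuses the ten values from the 98% confidence interval exercise in these notes, purely as convenient sample data, and compares a manual calculation with the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n − 1).

```python
import statistics

# Sample values taken from the 98% confidence interval exercise in these notes.
data = [45, 55, 67, 45, 68, 79, 98, 87, 84, 82]

n = len(data)
sample_mean = sum(data) / n
sum_sq_dev = sum((x - sample_mean) ** 2 for x in data)

biased_variance = sum_sq_dev / n           # divides by n
unbiased_variance = sum_sq_dev / (n - 1)   # divides by n - 1

print(biased_variance, statistics.pvariance(data))
print(unbiased_variance, statistics.variance(data))
```

Dividing by the smaller denominator always gives the larger estimate, which compensates for the deviations having been measured from the sample mean.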
Pearson r correlation:
• Categorical data: variables contain label values rather than numeric values.
• Number of possible values is often limited to a fixed set.
• Each value represents a different category.
• Categorical variables are also called nominal (ordinal, if ordered).
o Variable “pet” with values: “dog” and “cat”.
o Variable “color” with values: “red”, “green” and “blue”.
o Variable “rank” with values: “first”, “second” and “third” (ordinal).
• Many machine learning algorithms (data analytics) cannot operate on label data directly (all input/output variables
must be numeric).
Integer/Label Encoding
• Each unique category value is assigned an integer value.
• “red” is 1, “green” is 2, and “blue” is 3.
• Easily reversible.
• Such integer values have a natural ordered relationship with each other → machine learning algorithms
may pick up and exploit this relationship.
o For some variables/analyses (ordinal), this may be good enough.
• For nominal data, label encoding is not good enough.
o Using such an encoding and allowing the model to assume a natural ordering between categories may result in
poor performance or unexpected results.
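Integer encoding and its reversibility can be sketched in plain Python. The helper `fit_label_encoder` is a hypothetical name, not a library function; it assigns integers starting at 1 in order of first appearance, which reproduces the “red” = 1, “green” = 2, “blue” = 3 mapping above for this particular input order.

```python
def fit_label_encoder(values):
    # Assign 1, 2, 3, ... to categories in order of first appearance.
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping

colors = ["red", "green", "blue", "green", "red"]

mapping = fit_label_encoder(colors)
encoded = [mapping[c] for c in colors]

# The encoding is easily reversible by inverting the mapping.
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(mapping, encoded, decoded)
```

Note that the integers imply “blue” > “green” > “red”, which is exactly the artificial ordering that makes this encoding unsuitable for nominal data.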
One-Hot Encoding
• New binary variable is added for each unique categorical data value.
• Original variable is discarded.
o In “color” variable example, there are 3 categories.
o 3 binary variables are added.
o “1” value is placed in the binary variable for respective color and “0” values for all other color variables.
Color Red Green Blue
Red 1 0 0
Green 0 1 0
Blue 0 0 1
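The color table can be generated with a short function. This is a minimal sketch in plain Python; `one_hot_encode` is an illustrative name, and the category order (red, green, blue) is fixed by hand to match the column order of the table.

```python
def one_hot_encode(values, categories):
    # One binary column per category: 1 for the matching category, 0 elsewhere.
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]

categories = ["red", "green", "blue"]
rows = one_hot_encode(["red", "green", "blue"], categories)

for label, row in zip(["Red", "Green", "Blue"], rows):
    print(label, *row)
```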
Gender    Result
Male      Pass
Female    Pass
Male      Fail
Male      Fail
Male      Pass
Female    Pass
Female    Fail

          Pass    Fail    Total
Male      2       2       4
Female    2       1       3
Total     4       3       7
• Chi-square statistic can't be negative (whether the variables are related or not; it doesn’t indicate directionality).
• Degrees of freedom = (r − 1) * (c − 1), where r and c are the numbers of rows and columns (response categories).
• Question: Are gender and education level dependent at the 5% significance level (95% confidence level)?
• In other words, given the data collected above, is there a relationship between the gender of an individual
and the level of education that they have obtained?
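The education data for this question is not reproduced in these notes, so the sketch below instead uses the small gender/result counts given earlier (Male: 2 pass, 2 fail; Female: 2 pass, 1 fail) to show how the chi-square statistic and degrees of freedom are computed. It assumes the usual Pearson form, the sum of (observed − expected)² / expected, with expected counts taken from the row and column totals; the function name is illustrative.

```python
def chi_square_statistic(observed):
    # observed: contingency table as a list of rows of counts.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Rows: Male, Female; columns: Pass, Fail.
observed = [[2, 2], [2, 1]]
statistic, dof = chi_square_statistic(observed)
print(statistic, dof)
```

Every term is a square divided by a positive count, which is why the statistic cannot be negative. With counts this small the chi-square approximation is unreliable, so the numbers serve only to illustrate the arithmetic.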
o x̄ : sample mean
o μ : population mean
o S : sample standard deviation
o N : Number of observations
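The formula these symbols belong to is not shown in the notes. Assuming it is the standard one-sample t statistic, t = (x̄ − μ) / (S / √N), it could be computed as sketched below. The data values are hypothetical and chosen only for illustration; μ = 40 is taken from the hypotheses stated in these notes.

```python
import math
import statistics

def one_sample_t(sample, mu):
    # t = (sample mean - hypothesised mean) / (S / sqrt(N)),
    # where S is the sample standard deviation (n - 1 denominator).
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)
    return (x_bar - mu) / (s / math.sqrt(n))

# Hypothetical observations, not taken from the notes.
sample = [42, 45, 38, 41, 44]
t_value = one_sample_t(sample, mu=40)
df = len(sample) - 1
print(t_value, df)
```

The calculated t-value would then be compared with the critical t-value for the chosen significance level and N − 1 degrees of freedom.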
calculated t-value > critical t-value → reject null hypothesis (it’s highly likely that the mean of sales is
greater than the hypothesized value → sales training was probably a success)
H0: μ = 40
H1: μ > 40
• degrees of freedom: df = n1 + n2 − 2 (n1 and n2 are the two sample sizes)
H0: μ1 = μ2
H1: μ1 ≠ μ2
absolute value of calculated t-value > critical t-value → reject null hypothesis → mean body fat for men and women is NOT equal.
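A two-sample t statistic with df = n1 + n2 − 2 is commonly computed with a pooled variance, which assumes the two populations have equal variances. The notes do not show the formula or the body fat data, so the sketch below is an assumption about the intended test, and the two samples are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
import math
import statistics

def pooled_two_sample_t(sample1, sample2):
    # Pooled-variance (equal variances assumed) two-sample t statistic.
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)

    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_error, df

# Hypothetical samples, not taken from the notes.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_value, df = pooled_two_sample_t(group_a, group_b)
print(t_value, df)
```

For the two-sided alternative μ1 ≠ μ2, the absolute value of the statistic is compared with the critical t-value at the chosen significance level.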
H0: μ1 = μ2
Ha: μ1 ≠ μ2
absolute value of test statistic (12.62059) > critical value (1.9673) → reject null hypothesis;
conclude that the two population means are different at the 0.05 significance level.