Data Analysis - INCOMPLETE - 2

The document provides an introduction to data analysis techniques for hypothesis testing and bivariate analysis. It discusses hypothesis testing concepts like the null and alternative hypotheses, p-values, and confidence intervals. It also covers bivariate analysis methods like correlation tests, chi-square tests, t-tests, ANOVA, and examining the relationship between two variables. Grouping techniques like association rule mining and market basket analysis are also introduced.


INTRODUCTION TO DATA ANALYSIS

2nd Sem, MCA

SSS Shameem Mar, 2022 Data Analytics, DSCA, MIT


CONTENT
❑ Introduction to Data Analysis
• Hypothesis Testing
✓ Bivariate Analysis: Correlation Test
o Correlation coefficient
o Chi square test
o T - test
o ANOVA
o Summary tables, contingency tables, visualization
✓ Multivariate Analysis
• Grouping
o Association rule mining
o Market Basket Analysis
o Recommendation system
o Apriori algorithm
o FP Growth Algorithm



DATA ANALYSIS – HYPOTHESIS TESTING

• A main purpose of statistics is to test a hypothesis.
• Hypothesis: an educated guess about something (it should be testable).
o A proposed explanation made on the basis of limited evidence as a starting point for further investigation.
• Hypothesis testing in statistics is a way to check whether observed results are meaningful or could plausibly be due to chance.
• The null hypothesis is assumed to be true initially.
• The alternative hypothesis is usually (not always) the opposite of the null hypothesis.
o H0: There is no relationship between X and Y variable.
o H1: There is a relationship between X and Y variable.
• Steps in hypothesis testing:
o State the null and alternative hypotheses,
o Choose a significance level and the kind of test to perform,
o Compute the test statistic (or p-value),
o Either reject or fail to reject the null hypothesis.


• H₀ (null hypothesis): there is no difference between the two variables (populations); the two variables have the same distribution.
• Hₐ (alternative hypothesis): the two populations (variables) are not equal.
• p-value: if the p-value is less than a specified significance level α (alpha value; usually 0.05), the difference is significant and the null hypothesis H₀ is rejected.
o The p-value (probability value) tells how likely it is to obtain observations at least as extreme as those observed if the null hypothesis were true.
o The smaller the p-value, the stronger the evidence against the null hypothesis.
o In practice the p-value never reaches exactly zero, because there is always some possibility that the observations arose by chance.
• If H₀ is rejected: the two variables are concluded not to come from the same distribution.

Example: significance level 0.05; degrees of freedom = 2; chi-square test statistic = 0.7533
• If the null hypothesis is true, 95 times out of 100 the test statistic will have a value of 5.99 or less (the critical value).
• 0.7533 is less than 5.99 → fail to reject (accept) the null hypothesis at the 0.05 significance level.
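
The example above can be checked numerically. The sketch below assumes the test statistic follows a chi-square distribution with 2 degrees of freedom (the critical value 5.99 suggests this); for that special case the upper-tail probability has the closed form exp(−x/2), so no statistics library is needed. The function names are illustrative only.

```python
import math

def chi2_upper_tail_df2(x):
    """P(X > x) for a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def chi2_critical_df2(alpha):
    """Value exceeded with probability alpha (2 degrees of freedom only)."""
    return -2.0 * math.log(alpha)

alpha = 0.05
statistic = 0.7533                      # test result from the example

critical = chi2_critical_df2(alpha)     # about 5.99
p_value = chi2_upper_tail_df2(statistic)
reject_h0 = statistic > critical        # equivalent to p_value < alpha

print(f"critical value = {critical:.3f}")
print(f"p-value        = {p_value:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with α always lead to the same decision.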


• Normal/Gaussian/bell-shaped distribution: a continuous probability distribution that is symmetrical around its mean.
• Most observations cluster around the central peak.
• Probabilities for values further away from the mean taper off equally in both directions.
• Extreme values in both tails of the distribution are similarly unlikely.
• Box–Cox transformation: a power transformation that can bring non-normal data closer to a normal distribution.

Mean ± standard deviations    Percentage of data contained
1                             68%
2                             95%
3                             99.7%
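
The percentages in the table can be reproduced from the standard normal cumulative distribution function. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is only illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal population lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: share_within(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"mean ± {k} standard deviations: {share:.1%}")
```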


• In statistics, a confidence interval is a range of values that is expected to contain a population parameter a certain proportion (percentage) of the time.
• Confidence intervals measure the degree of uncertainty or certainty in an estimate.
• The most common confidence levels are 95% and 99%.
Confidence level = 100 × (1 − 𝛼) %
• For a 90% confidence level 𝛼 is 0.1; for a 95% confidence level 𝛼 is 0.05; for a 99% confidence level 𝛼 is 0.01.
• A 95% confidence level means that if the experiment is repeated over and over again, about 95% of the intervals constructed will contain the true population parameter.

• Example: a survey is conducted on a group of pet owners to see how many cans of dog food they purchase a year. Testing the statistic at a 99% confidence level gives a confidence interval of (200, 300) → we can be 99% confident that the mean number of cans bought per year lies between 200 and 300.


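• A confidence interval (CI) is a range of values that we are fairly sure contains the true value.

• A CI can be constructed with

o the t-distribution (population standard deviation unknown or small sample; σ is then the sample standard deviation)
μ ± t * σ / (√n)

o the normal or z-distribution (population standard deviation known)
μ ± z * σ / (√n)

In both formulas μ denotes the sample mean, which is the centre of the interval.

• Standard error of the sampling distribution = σ / (√n)

• Since the sample size 'n' is in the denominator and the standard deviation 'σ' is in the numerator
→ small samples with large variation increase the standard error,
which reduces confidence that the sample statistic is a close approximation of the population parameter.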


Confidence Interval    Z-score

80% 1.282
85% 1.440
90% 1.645
95% 1.960
99% 2.576
99.5% 2.807
99.9% 3.291
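
The z-scores in this table follow from the confidence level: for a two-sided interval with α = 1 − confidence level, z is the point of the standard normal distribution with cumulative probability 1 − α/2. A minimal sketch with the Python standard library (the function name is illustrative):

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided critical z-value for a confidence level given as a fraction (e.g. 0.95)."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

levels = (0.80, 0.85, 0.90, 0.95, 0.99, 0.995, 0.999)
z_table = {level: round(z_for_confidence(level), 3) for level in levels}

for level, z in z_table.items():
    print(f"{level:.1%}  {z:.3f}")
```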


Example: Construct a 98% confidence interval based on the following data: 45, 55, 67, 45, 68, 79, 98, 87, 84, 82.

• Step 1: Find the mean μ and the (sample) standard deviation σ of the data.
μ: 71; σ: 18.172
• Step 2: Subtract 1 from the sample size to find the degrees of freedom (df).
df = 10 – 1 = 9
• Step 3: Find the alpha level for one tail: subtract the confidence level from 1, then divide by two. (1 – .98) / 2 = .01
• Step 4: Look up df and α in the t-distribution table. For df = 9 and α = .01, the table gives t = 2.821
• Step 5: Apply the CI formula for the t-distribution, μ ± t * σ / (√n)
t * σ / (√n) = 2.821 * 18.172 / √10 = 16.21
Lower end of CI range: 71 – 16.21 = 54.79
Upper end of CI range: 71 + 16.21 = 87.21
98% CI is (54.79, 87.21)
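
The five steps can be sketched in a few lines of Python. The critical value 2.821 is taken from the t-table lookup in the example and hard-coded, because the standard library has no t-distribution; small differences from hand-rounded figures can appear in the last digits.

```python
import math
import statistics

data = [45, 55, 67, 45, 68, 79, 98, 87, 84, 82]

n = len(data)
mean = statistics.mean(data)      # Step 1: sample mean
sd = statistics.stdev(data)       # Step 1: sample standard deviation (divides by n - 1)
df = n - 1                        # Step 2: degrees of freedom
tail_area = (1 - 0.98) / 2        # Step 3: area in one tail
t_critical = 2.821                # Step 4: table value for df = 9, tail area 0.01

margin = t_critical * sd / math.sqrt(n)        # Step 5
lower, upper = mean - margin, mean + margin

print(f"mean = {mean}, sd = {sd:.3f}, df = {df}")
print(f"98% CI = ({lower:.2f}, {upper:.2f})")
```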


Example: Construct a 95% confidence interval for an experiment that found the sample mean temperature for a certain city in August was 101.82, with a population standard deviation of 1.2. There were 6 samples in this experiment.

• Step 1: Subtract the confidence level (given as 95 percent in the question) from 1 and then divide the result by two.
alpha level (represents the area in one tail) = (1 – .95) / 2 = .025
• Step 2: Find the z-score from the z-table: z-score = 1.96.
• Step 3: Plug the numbers into the second part of the formula and solve: z * σ / (√n)
= 1.96 * 1.2 / √(6) = 1.96 * 0.49 = 0.96
• Step 4: Find the CI:
Lower end of CI range, subtract step 3 from the mean = 101.82 – 0.96 = 100.86
Upper end of CI range, add step 3 to the mean = 101.82 + 0.96 = 102.78
CI is (100.86, 102.78)
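
The same calculation as a short Python sketch. The z-score is computed with `statistics.NormalDist` instead of being read from a z-table; variable names are illustrative.

```python
import math
from statistics import NormalDist

sample_mean = 101.82
population_sd = 1.2
n = 6
confidence = 0.95

tail_area = (1 - confidence) / 2                 # Step 1: 0.025
z = NormalDist().inv_cdf(1 - tail_area)          # Step 2: about 1.96
margin = z * population_sd / math.sqrt(n)        # Step 3
ci = (sample_mean - margin, sample_mean + margin)  # Step 4

print(f"z = {z:.2f}, margin = {margin:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```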



DATA ANALYSIS – CORRELATION
• Bivariate Analysis: Analysis of any concurrent relation between two variables or attributes.
o Consists of a group of statistical techniques that examine relationship between two variables.
o Bivariate analysis forms foundation of multivariate analysis.
• Correlation: Relation between two variables.
• Bivariate correlation Test: Statistical technique to determine existence of relationships/association
between two different variables (X, Y)
o whether/how much X will change when there is a change in Y.
Types of tests:
o Correlation: check the association between variables.
o Comparison of means: check the differences between means of variables.
o Regression: check if one variable predicts changes in another variable.
o Non-Parametric: tests that are used when data does not meet the assumptions of parametric tests.

Parametric Tests
• Prior knowledge of population distribution (normal) is available.
• Fixed set of parameters used to determine probabilistic model.
• Parameters used in normal distribution: Mean, Standard Deviation
• T-test; Z-test; F-test; ANOVA (post-hoc test)

Non-parametric Tests
• No fixed set of parameters is available, and no knowledge of the population distribution (e.g. normal) is available for use.
• No assumption made about parameters for given population.
• Referred to as distribution-free tests.
• More popular; Easy to apply and understand; less complex.
• Chi-square test; Mann-Whitney U-test; Kruskal-Wallis H-test


Correlation Test Selection

• Positive correlation: both variables move in the same direction → an increase in one variable is accompanied by an increase in the other variable, and vice versa.
o spending more time on a treadmill burns more calories.
• Negative correlation: the two variables move in opposite directions → an increase in one variable is accompanied by a decrease in the other variable, and vice versa.
o increasing the speed of a vehicle decreases the time to reach the destination.
• Weak/Zero correlation: a change in one variable is not associated with a change in the other.
o no correlation between the number of years of school a person has attended and the number of letters in his/her name.

• Correlation coefficient (r) measures the strength of association/co-occurrence (ranges from -1 to +1).
• Pearson (‘r’ or product-moment) correlation coefficient: Between two continuous-level variables.
o Positive correlation shows direct relationship between two variables (the larger A, the larger B).
o Negative correlation shows inverse relationship (the larger A, the smaller B).
o Zero correlation coefficient indicates no relationship between the variables at all.
o .1 < | r | < .3 … small / weak correlation
o .3 < | r | < .5 … medium / moderate correlation
o .5 < | r | ……… large / strong correlation

Advantages of correlation analysis

• Observe relationships: correlation helps to identify the absence/presence of a relationship between two variables.
• Good starting point for research/analysis.
• Uses for further studies: helps to identify the direction and strength of a relationship between two variables, so that findings can be narrowed down in later studies.
• Simple metrics: findings are simple to classify (range from -1.00 to 1.00). Only three potential broad outcomes of the analysis.

Bessel's correction

• Use of 'n − 1' instead of 'n' in the formulas for sample variance and sample standard deviation.
• Corrects the bias in the estimation of the population variance and reduces the bias in the estimated population standard deviation.
• Except in rare cases (sample mean = population mean), the data will be closer to the sample mean than to the true population mean.
• So the sum of squared deviations (the numerator) will probably be a bit smaller than it would be if the true population mean were used. To make up for this, divide by 'n − 1' (a smaller value) rather than 'n'.
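
The effect of the correction can be seen by computing both versions of the variance. In the Python standard library, `statistics.pvariance` divides by n and `statistics.variance` divides by n − 1; the sample reuses the ten values from the confidence-interval example.

```python
import statistics

sample = [45, 55, 67, 45, 68, 79, 98, 87, 84, 82]
n = len(sample)

biased = statistics.pvariance(sample)    # sum of squared deviations / n
unbiased = statistics.variance(sample)   # sum of squared deviations / (n - 1)

print(f"divide by n     : {biased:.3f}")
print(f"divide by n - 1 : {unbiased:.3f}")
print(f"ratio           : {unbiased / biased:.4f} (equals n / (n - 1) = {n / (n - 1):.4f})")
```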


Pearson r correlation:

r = Pearson r correlation coefficient between x and y
n = number of observations
xi = value of x (for ith observation)
yi = value of y (for ith observation)
Sx, Sy = S.D. for x and y


Pearson r correlation:

rxy = Pearson r correlation coefficient between x and y
n = number of observations
xi = value of x (for ith observation)
yi = value of y (for ith observation)
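
The formulas themselves did not survive in the text, so the sketch below uses two standard, algebraically equivalent forms of Pearson's r as an assumption about what was shown: one built from standardized deviations (using Sx and Sy with the n − 1 divisor) and one built from raw sums. The treadmill data are hypothetical.

```python
import math

def pearson_r_standardized(x, y):
    """r = 1/(n-1) * sum(((xi - mean_x)/Sx) * ((yi - mean_y)/Sy))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / (n - 1)

def pearson_r_raw_sums(x, y):
    """r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2)), using raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: minutes on a treadmill and calories burned
minutes = [10, 20, 30, 40, 50]
calories = [95, 210, 290, 405, 500]

r_standardized = pearson_r_standardized(minutes, calories)
r_raw = pearson_r_raw_sums(minutes, calories)
print(f"r (standardized form) = {r_standardized:.4f}")
print(f"r (raw-sum form)      = {r_raw:.4f}")
```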

CATEGORICAL DATA ENCODING

• Categorical data: variables contain label values rather than numeric values.
• Number of possible values is often limited to a fixed set.
• Each value represents a different category.
• Categorical variables also called nominal (ordinal, if ordered).
o variable “pet” with values: “dog” and “cat”.
o Variable “color” with values: “red“, “green” and “blue”.
o variable “rank” with values: “first”, “second” and “third” (ordinal)
• Many machine learning algorithms (data analytics) cannot operate on label data directly (they require all input/output variables to be numeric).


Integer/Label Encoding
• Each unique category value is assigned an integer value.
• “red” is 1, “green” is 2, and “blue” is 3.
• Easily reversible.
• Such integer values have a natural ordered relationship with each other → machine learning algorithms may pick up and exploit this ordering.
o For some variables/analyses (ordinal), this may be enough.
• For nominal data, label encoding is not enough.
o Using such an encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results.
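
A minimal sketch of integer/label encoding for the "color" variable, using plain dictionaries rather than any particular library. The sample values in `colors` are made up for illustration.

```python
# Fixed category order: "red" is 1, "green" is 2, "blue" is 3
categories = ["red", "green", "blue"]
to_int = {category: code for code, category in enumerate(categories, start=1)}
to_label = {code: category for category, code in to_int.items()}

colors = ["red", "green", "blue", "green", "red"]   # hypothetical column values

encoded = [to_int[c] for c in colors]       # labels -> integers
decoded = [to_label[i] for i in encoded]    # integers -> labels (easily reversible)

print(encoded)
print(decoded)
```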


One-Hot Encoding
• New binary variable is added for each unique categorical data value.
• Original variable is discarded.
o In “color” variable example, there are 3 categories.
o 3 binary variables are added.
o “1” value is placed in the binary variable for respective color and “0” values for all other color variables.
Color Red Green Blue
Red 1 0 0
Green 0 1 0
Blue 0 0 1
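
The table above can be produced with a short function. This is a plain-Python sketch; the function name `one_hot` is illustrative and not tied to any library.

```python
def one_hot(values, categories):
    """Return one row per value, with a 1 in the column of its category and 0 elsewhere."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]

categories = ["Red", "Green", "Blue"]
values = ["Red", "Green", "Blue"]
rows = one_hot(values, categories)

print("Color", *categories)
for value, row in zip(values, rows):
    print(value, *row)
```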

Summary Table
• Visualization that summarizes statistical information about data in table form.

Contingency table:
• crosstabs or two-way tables
• Tabular representation of categorical data.
• Used in statistics to summarize relationship between several categorical variables.
• Special type of frequency distribution table, where two variables are shown simultaneously.
• Usually shows frequencies for particular combinations of values of two discrete random variables X and Y.
• Each cell in the table represents a mutually exclusive combination of X-Y values.

Raw data:
Gender    Result
Male      Pass
Female    Pass
Male      Fail
Male      Fail
Male      Pass
Female    Pass
Female    Fail

Contingency table:
          Pass    Fail    Total
Male      2       2       4
Female    2       1       3
Total     4       3       7
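
A contingency table is just a count of each (row value, column value) combination. The sketch below builds the Gender × Result table from the seven records listed above using `collections.Counter`.

```python
from collections import Counter

records = [
    ("Male", "Pass"), ("Female", "Pass"), ("Male", "Fail"), ("Male", "Fail"),
    ("Male", "Pass"), ("Female", "Pass"), ("Female", "Fail"),
]

counts = Counter(records)            # frequency of each Gender/Result combination
genders = ["Male", "Female"]
results = ["Pass", "Fail"]

table = [[counts[(g, r)] for r in results] for g in genders]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

print("       ", *results, "Total")
for gender, row, total in zip(genders, table, row_totals):
    print(f"{gender:7}", *row, total)
print("Total  ", *col_totals, sum(row_totals))
```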



DATA ANALYSIS – CHI SQUARE
• Pearson’s chi-square test.
• Primary use of chi-square test is to examine whether two variables are independent (not related) or not.
o If two variables are correlated, their values tend to move together, either in same or opposite direction.
o One variable is "not correlated with" or "independent of" other if increase in one variable is not
associated with increase in another.
• The chi-square statistic is based on the difference between the actually observed data and what would be expected if there were truly no relationship between the variables.
• Null and alternative Hypothesis:
o H0: There is no relationship between X and Y variable.
o H1: There is a relationship between X and Y variable.

• Calculation of the chi-square statistic: χ² = ∑ (Oi – Ei)² / Ei
o Oi = observed frequency (observed counts in the cells)
o Ei = expected frequency (if NO relationship existed between the variables); Ei = (row total * column total) / sample size
• The chi-square statistic can't be negative (it shows WHETHER the variables are related or not; it doesn't indicate directionality).
• degrees of freedom = (r − 1) * (c − 1)
o r, c: number of rows and columns (response categories) in the contingency table
• Look up the critical value for the degrees of freedom (d) and the significance level alpha in the chi-square distribution table, and compare it with the calculated chi-square value to decide whether the variables are related or not.
o Reject or fail to reject the null hypothesis
o Chi-square calculated value > chi-square critical value → reject the null hypothesis.
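
The calculation can be sketched as a small function. It is applied here to the 2 × 2 Gender/Result contingency table shown earlier, purely to illustrate the arithmetic: with only seven observations the expected counts are too small for the test to be reliable. The critical value 3.841 is the standard table value for 1 degree of freedom at alpha = 0.05 and is hard-coded as an assumption.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    sample_size = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / sample_size   # expected count Ei
            statistic += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

observed = [[2, 2],   # Male:   Pass, Fail
            [2, 1]]   # Female: Pass, Fail
statistic, df = chi_square(observed)
critical = 3.841      # table value for df = 1, alpha = 0.05 (assumed, not computed)

print(f"chi-square = {statistic:.4f}, df = {df}")
print("reject H0" if statistic > critical else "fail to reject H0")
```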

• Is gender independent of education level? A random sample of 395 people was surveyed and each person was asked to report the highest education level they had obtained. The data that resulted from the survey are summarized in the following table:

• Question: Are gender and education level dependent at the 5% level of significance (95% confidence)?
• In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education that they have obtained?

Actual Data Expected Data

o H0: There is no relationship between X and Y variable.
o H1: There is a relationship between X and Y variable.
• Critical value of χ² with 3 degrees of freedom (alpha = 0.05) is 7.815.
• Calculated χ² = 8.006 > 7.815 → reject the null hypothesis.
• Education level depends on gender at the 5% level of significance.
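
The observed-count table for this example is not preserved in the text. The counts in the sketch below are therefore an assumption: they are chosen because they total 395 respondents, give 3 degrees of freedom, and reproduce the reported statistic of about 8.006. The education-level labels are likewise assumed.

```python
# ASSUMED counts (not from the text): rows are gender, columns are education level
levels = ["High school", "Bachelors", "Masters", "PhD"]   # assumed labels
observed = {
    "Female": [60, 54, 46, 41],
    "Male":   [40, 44, 53, 57],
}

row_totals = {gender: sum(counts) for gender, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
sample_size = sum(row_totals.values())

statistic = 0.0
for gender, counts in observed.items():
    for j, o in enumerate(counts):
        e = row_totals[gender] * col_totals[j] / sample_size   # expected count
        statistic += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(levels) - 1)
critical = 7.815   # chi-square table value for df = 3, alpha = 0.05 (from the example)

print(f"n = {sample_size}, chi-square = {statistic:.3f}, df = {df}")
print("reject H0" if statistic > critical else "fail to reject H0")
```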



DATA ANALYSIS – T-TEST
One-Sample T – test
• Compares the mean of sample data to a known value.
o For example, one might want to know how a sample mean compares to a known population mean.
• The one-sample t-test is used when the population standard deviation is not known or the sample size is small.
• H0: μ = x̄ (there is no real difference between the sample mean and the population mean)
• H1: μ ≠ x̄ (there is a difference); for a one-sided test, x̄ > μ or x̄ < μ
• Test statistic: t = (x̄ − μ) / (S / √N), with N − 1 degrees of freedom

o x̄ : sample mean
o μ : population mean
o S : sample standard deviation
o N : number of observations

Example: Your company wants to improve sales. Past sales data indicate that the average sale was $100 per transaction. After training your sales force, recent sales data (taken from a sample of 25 salesmen) indicate an average sale of $130, with a standard deviation of $15. Did the training work? Test your hypothesis at a 95% confidence level.
sample mean (x̄) = $130
population mean (μ) = $100 (from past data)
sample standard deviation (s) = $15
number of observations (n) = 25
H0: μ = x̄
H1: x̄ > μ
calculated t-value = (130 – 100) / (15 / √25) = 30 / 3 = 10
degrees of freedom: 25 – 1 = 24
Alpha = 0.05
Critical t-value = 1.711

calculated t-value > critical t-value → reject the null hypothesis (it is highly likely that the mean sale is now greater → the sales training was probably a success)

Example: A company wants to test the claim that its batteries last more than 40 hours. A simple random sample of 15 batteries yielded a mean of 44.9 hours, with a standard deviation of 8.9 hours. Test this claim using a significance level of 0.05.

H0: μ = 40
H1: μ > 40

calculated t-value = (44.9 – 40) / (8.9 / √15) ≈ 2.13
degrees of freedom: 15 – 1 = 14; critical t-value (one-tailed, alpha = 0.05) = 1.761

calculated t-value > critical t-value → reject the null hypothesis
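
Both one-sample examples (sales training and battery life) use the same statistic, t = (sample mean − hypothesized mean) / (s / √n). A minimal sketch follows; the critical values are hard-coded one-tailed t-table values at alpha = 0.05 (1.711 for 24 degrees of freedom as given in the sales example, 1.761 for 14 degrees of freedom taken from a standard table as an assumption), since the standard library has no t-distribution.

```python
import math

def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t statistic for comparing a sample mean with a known or claimed value."""
    return (sample_mean - hypothesized_mean) / (sample_sd / math.sqrt(n))

# Sales example: n = 25, df = 24
t_sales = one_sample_t(130, 100, 15, 25)
critical_sales = 1.711

# Battery example: n = 15, df = 14
t_battery = one_sample_t(44.9, 40, 8.9, 15)
critical_battery = 1.761

for name, t, critical in (("sales", t_sales, critical_sales),
                          ("battery", t_battery, critical_battery)):
    decision = "reject H0" if t > critical else "fail to reject H0"
    print(f"{name}: t = {t:.3f}, critical = {critical} -> {decision}")
```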

Two-Sample T – test

• Compares the means of two samples.
o Tests the difference (d0) between two sample means.
o Used to determine whether the population means are equal.
o Example: compare the mean scores of two sections (samples) of a class (population).

• H0: μ1 = μ2 (there is no difference between the means)
• H1: μ1 ≠ μ2 (there is a difference between the means)

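Assuming unequal variances in the two samples:
t = (x̄1 − x̄2) / √(s1² / n1 + s2² / n2)

o x̄1, x̄2 : sample means
o s1², s2² : sample variances
o n1, n2 : number of observations in the two samples

Assuming equal variances in the two samples:
t = (x̄1 − x̄2) / (Sp * √(1 / n1 + 1 / n2))
Sp = √(((n1 − 1) * s1² + (n2 − 1) * s2²) / (n1 + n2 − 2))

o x̄1, x̄2 : sample means
o Sp : pooled sample standard deviation
o n1, n2 : number of observations in the two samples
o s1², s2² : sample variances

• degrees of freedom (equal-variance case): df = n1 + n2 − 2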
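
A sketch of both variants computed from raw data: the pooled (equal-variance) statistic and the unequal-variance (Welch) statistic. The two score lists are hypothetical, and the function names are illustrative. When both samples have the same size the two statistics coincide, which makes a convenient sanity check.

```python
import math
import statistics

def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)
    sp = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    t = (statistics.mean(a) - statistics.mean(b)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def welch_t(a, b):
    """Unequal-variance two-sample t statistic."""
    n1, n2 = len(a), len(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var1 / n1 + var2 / n2)

# Hypothetical scores for two sections of a class
section_a = [72, 85, 78, 90, 66, 81]
section_b = [68, 74, 70, 79, 65, 73]

t_pooled, df = pooled_t(section_a, section_b)
t_welch = welch_t(section_a, section_b)
print(f"pooled t = {t_pooled:.3f} (df = {df}), Welch t = {t_welch:.3f}")
```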

Example: Average body fat percentage is a measure of a person's fitness, and it varies by age. Some studies say the normal range for men is 15-20% body fat and the normal range for women is 20-25%. Sample data were collected from a group of men and women.
There is some overlap in the data and also some differences. Just by looking at the data, it is hard to draw any solid conclusions about whether the underlying populations of men and women have the same mean body fat.
Check it statistically.


• The two histograms are on the same scale.
• There are no very unusual points (outliers).
• The data look roughly bell-shaped (a normal distribution seems reasonable).
• Examining the summary statistics, the standard deviations look similar → this supports the idea of equal variances.
o This can also be checked using a test for variances.

Assuming equal variances in the two samples

H0: μ1 = μ2    H1: μ1 ≠ μ2

degrees of freedom: df = n1 + n2 − 2 = 10 + 13 − 2 = 21
Alpha = 0.05
Critical t-value = 2.080

calculated t-value > critical t-value → reject the null hypothesis → the mean body fat for men and women is NOT equal.

Example: One data set contains miles per gallon for U.S. cars (sample 1) and for Japanese cars (sample 2); the summary statistics for each sample are shown below.
Apply a t-test to decide whether the mean fuel economy of the two groups of cars is the same (at alpha = 0.05).

Hypothesis to test that the means are equal for two samples.
We assume that the variances for the two samples are equal.

H0: μ1 = μ2
Ha: μ1 ≠ μ2

Test statistic: T = -12.62059


Pooled standard deviation: sp = 6.34260
Degrees of freedom: ν = 326
Significance level: α = 0.05
Critical value = 1.9673
Critical region: Reject H0 if |T| > 1.9673

The absolute value of the test statistic (12.62059) > critical value (1.9673) → reject the null hypothesis and conclude that the two means are different at the 0.05 significance level.
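
The summary statistics for the two samples are not preserved in the text. The values in the sketch below are assumptions: they are consistent with the reported 326 degrees of freedom and reproduce the reported pooled standard deviation (6.34260) and test statistic (−12.62059). The calculation is the equal-variance t-test from summary statistics; the critical value is the one given in the example.

```python
import math

# ASSUMED summary statistics (not from the text)
n1, mean1, sd1 = 249, 20.14458, 6.41470   # sample 1: U.S. cars
n2, mean2, sd2 = 79, 30.48101, 6.10771    # sample 2: Japanese cars

df = n1 + n2 - 2
pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
t_statistic = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))

critical = 1.9673                          # two-sided, alpha = 0.05, from the example
reject_h0 = abs(t_statistic) > critical

print(f"df = {df}, pooled sd = {pooled_sd:.5f}, T = {t_statistic:.5f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```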
