
Understanding Statistical Analysis: Techniques and Applications

Prepared by
Jahbuikem Anderson

Statistical analysis is the process of collecting and analyzing data in order to discern patterns and trends. It is a method for removing bias from the evaluation of data by employing numerical analysis. The technique is useful for interpreting research results, developing statistical models, and planning surveys and studies.

In AI and ML, statistical analysis is a scientific tool that helps collect and analyze large amounts of data to identify common patterns and trends and convert them into meaningful information. In simple words, statistical analysis is a data analysis tool that helps draw meaningful conclusions from raw and unstructured data.

The conclusions drawn from statistical analysis facilitate decision-making and help businesses make future predictions on the basis of past trends. Statistical analysis can be defined as the science of collecting and analyzing data to identify trends and patterns and of presenting them clearly. It involves working with numbers and is used by businesses and other institutions to derive meaningful information from data.
Types of Statistical Analysis

Given below are the six types of statistical analysis:

● Descriptive Analysis

Descriptive statistical analysis involves collecting, interpreting, analyzing, and summarizing data to present it in the form of charts, graphs, and tables. Rather than drawing conclusions, it simply makes complex data easy to read and understand (a brief code sketch follows this list).

● Inferential Analysis

Inferential statistical analysis focuses on drawing meaningful conclusions on the basis of the data analyzed. It studies the relationship between different variables or makes predictions for the whole population.

● Predictive Analysis

Predictive statistical analysis analyzes data to identify past trends and predict future events on the basis of them. It uses machine learning algorithms, data mining, data modelling, and artificial intelligence to conduct the statistical analysis of data.

● Prescriptive Analysis

Prescriptive analysis examines the data and prescribes the best course of action based on the results. It is a type of statistical analysis that helps you make an informed decision.

● Exploratory Data Analysis

Exploratory analysis is similar to inferential analysis, but the difference is that it involves exploring unknown data associations. It analyzes the potential relationships within the data.

● Causal Analysis

Causal statistical analysis focuses on determining the cause-and-effect relationship between different variables within the raw data. In simple words, it determines why something happens and its effect on other variables. Businesses can use this methodology, for example, to determine the reason for a failure.
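As a brief illustration of descriptive analysis, the sketch below summarizes a small, made-up dataset with pandas; the column names and values are hypothetical and serve only to show the idea.

```python
import pandas as pd

# Hypothetical sales data used purely for illustration
data = pd.DataFrame({
    "units_sold": [12, 15, 9, 22, 18, 14],
    "revenue":    [240, 300, 180, 440, 360, 280],
})

# describe() produces the usual descriptive summary:
# count, mean, standard deviation, min, quartiles and max
print(data.describe())
```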

Importance of Statistical Analysis

Statistical analysis eliminates unnecessary information and catalogs important data in an uncomplicated manner, making the monumental work of organizing inputs far more manageable. Once the data has been collected, statistical analysis may be utilized for a variety of purposes. Some of them are listed below:

● Statistical analysis aids in summarizing enormous amounts of data into clearly digestible chunks.
● Statistical analysis aids in the effective design of laboratory, field, and survey investigations.
● Statistical analysis may help with solid and efficient planning in any subject of study.
● Statistical analysis aids in establishing broad generalizations and forecasting how much of something will occur under particular conditions.
● Statistical methods, which are effective tools for interpreting numerical data, are applied in practically every field of study. Statistical approaches have been created and are increasingly applied in the physical and biological sciences, such as genetics.
● Statistical approaches are used in the work of a businessman, a manufacturer, and a researcher. Statistics departments can be found in banks, insurance businesses, and government agencies.
● A modern administrator, whether in the public or commercial sector, relies on statistical data to make correct decisions.
● Politicians can utilize statistics to support and validate their claims while also explaining the issues they address.

Statistical Analysis Process

There are five major steps involved in the statistical analysis process:

1. Data collection

The first step in statistical analysis is data collection. You can collect data through primary or secondary sources such as surveys, customer relationship management software, online quizzes, financial reports and marketing automation tools. To ensure the data is viable, you can choose data from a sample that's representative of a population. For example, a company might collect data from previous customers to understand buyer behaviors.

2. Data organization

The next step after data collection is data organization. Also known as data cleaning, this stage involves identifying and removing duplicate data and inconsistencies that may prevent you from getting an accurate analysis. This step is important because it can help companies ensure their data and the conclusions they draw from the analysis are correct.

3. Data presentation

Data presentation is an extension of data cleaning, as it involves arranging the data for easy analysis. Here, you can use descriptive statistics tools to summarize the data. Data presentation can also help you determine the best way to present the data based on its arrangement.

4. Data analysis

Data analysis involves manipulating data sets to identify patterns, trends and relationships using statistical techniques, such as inferential and associational statistical analysis. You can use computer software like spreadsheets to automate this process and reduce the likelihood of human error in the statistical analysis process. This can allow you to analyze data efficiently.

5. Data interpretation

The last step is data interpretation, which provides conclusive results regarding the purpose of the analysis. After analysis, you can present the result as charts, reports, scorecards and dashboards to make it accessible to nonprofessionals. For example, the interpretation of an analysis of the impact of a 6,000-worker factory on the crime rate in a small town of 13,000 residents might show a declining rate of criminal activity. You may use a line graph to display this decline.

4 Common statistical analysis methods

Here are four common methods for performing statistical analysis:

Mean

You can calculate the mean, or average, by finding the sum of a list of numbers and then dividing the answer by the number of items in the list. It is the simplest form of statistical analysis, allowing the user to determine the central point of a data set. The formula for calculating the mean is:

Mean = Sum of the numbers / Number of items in the set

Example: You can find the mean of the numbers 1, 2, 3, 4, 5 and 6 by first adding the numbers together, then dividing the answer from the first step by the number of figures in the list, which is six. The mean of the numbers is 3.5.
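A minimal Python sketch of this calculation, using only the standard library:

```python
from statistics import mean

numbers = [1, 2, 3, 4, 5, 6]

# Mean = sum of the numbers / number of items in the set
print(sum(numbers) / len(numbers))   # 3.5
print(mean(numbers))                 # same result via the statistics module
```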
Standard deviation

Standard deviation (SD) is used to determine the dispersion of data points. It is a statistical analysis method that helps determine how the data spread around the mean. A high standard deviation means the data disperse widely from the mean. A low standard deviation shows that most of the data are close to the mean.

One application of SD is to test whether participants in a survey gave similar answers. If a large percentage of respondents' answers are similar, you have a low standard deviation and you can apply their responses to a larger population. To calculate the standard deviation, use this formula:

σ = √( Σ(x − μ)² / n )

● σ represents the standard deviation
● Σ represents the sum over all data points
● x represents a value in the dataset
● μ represents the mean of the data
● n represents the number of data points in the population

Example: You can calculate the standard deviation of the data set used in the mean calculation. The first step is to find the variance of the data set. To find the variance, subtract the mean from each value in the data set, square the answer, add everything together and divide by the number of data points.

Variance = ((1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²) / 6

Variance = (6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25) / 6

Variance = 17.5 / 6 ≈ 2.917

Next, you can calculate the square root of the variance to find the standard deviation of the data.

Standard deviation = √2.917 ≈ 1.708
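A quick check of this arithmetic in Python; note that NumPy's var and std divide by n by default, which matches the population formula used above:

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6]

variance = np.var(data)   # population variance, Σ(x − μ)² / n
std_dev = np.std(data)    # population standard deviation

print(variance)  # ≈ 2.9167
print(std_dev)   # ≈ 1.7078
```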

Regression

Regression is a statistical technique used to find the relationship between a dependent variable and an independent variable. It helps track how changes in one variable affect changes in another, or the effect of one on the other. Regression can show whether the relationship between two variables is weak, strong or varies over a time interval. The regression formula is:

Y = a + b(x)

● Y represents the dependent variable, which is the variable you want to measure or predict
● x represents the independent variable, the data used to predict the dependent variable
● a represents the y-intercept, or the value of Y when x equals zero
● b represents the slope of the regression line

Example: Find the dollar cost of maintaining a car driven for 40,000 miles if the cost of maintenance when there is no mileage on the car is $100. Take b as 0.02, so the cost of maintenance increases by $0.02 for every additional mile driven.

● Y = cost of maintaining the car
● x = 40,000 miles
● a = $100
● b = $0.02

Y = $100 + 0.02(40,000)
Y = $900
This shows that mileage affects the maintenance costs of a car.
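A short Python sketch of this prediction, together with one way the intercept a and slope b might be estimated from mileage/cost observations with NumPy (the fitted data below are made up for illustration):

```python
import numpy as np

# Prediction with the given coefficients: Y = a + b * x
a, b = 100.0, 0.02
x = 40_000
print(a + b * x)  # 900.0

# Estimating a and b from hypothetical (miles, cost) observations
miles = np.array([5_000, 10_000, 20_000, 30_000, 40_000])
cost = np.array([210, 290, 520, 680, 910])  # made-up data

b_hat, a_hat = np.polyfit(miles, cost, deg=1)  # slope first, then intercept
print(a_hat, b_hat)
```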
Hypothesis testing

Hypothesis testing is used to test whether a conclusion is valid for a specific data set by comparing the data against a certain assumption. The assumption being tested is called the null hypothesis, or hypothesis 0 (H0). Any statement that contradicts the null hypothesis is called the alternative (first) hypothesis, or hypothesis 1 (H1).

Example: From the regression calculation above, you want to test the claim that mileage affects the maintenance costs of a car. The null hypothesis is that mileage has no effect on maintenance costs. Because the regression above shows that mileage influences car maintenance costs, we reject the null hypothesis.

UNDERSTANDING HYPOTHESIS TESTING

Hypothesis testing involves formulating assumptions about population parameters based on sample statistics and rigorously evaluating these assumptions against empirical evidence. This section sheds light on the significance of hypothesis testing and the critical steps involved in the process.

Hypothesis testing is a statistical method that is used to make a statistical decision using experimental data. Hypothesis testing is basically an assumption that we make about a population parameter. It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.


Defining Hypothesis

● Null hypothesis (H0): In statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured cases or no relationship among groups. In other words, it is a basic assumption made based on knowledge of the problem. Example: A company's mean production is 50 units per day, i.e. H0: μ = 50.

● Alternative hypothesis (H1): The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis. Example: The company's mean production is not equal to 50 units per day, i.e. H1: μ ≠ 50.

Key terms of Hypothesis testing

● Level of significance: It refers to the degree of significance at which we accept or reject the null hypothesis. Since 100% accuracy is not possible when accepting a hypothesis, we select a level of significance, usually 5%. It is normally denoted by α and is generally 0.05 or 5%, meaning the result should give 95% confidence that a similar result would be obtained in each sample.

● P-value: The P-value, or calculated probability, is the probability of obtaining the observed (or more extreme) results when the null hypothesis (H0) of a given problem is true. If your P-value is less than the chosen significance level, you reject the null hypothesis, i.e. you accept that your sample supports the alternative hypothesis.

● Test statistic: The test statistic is a numerical value calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis. It is compared to a critical value or p-value to make decisions about the statistical significance of the observed results.

● Critical value: The critical value in statistics is a threshold or cutoff point used to determine whether to reject or accept the null hypothesis in a hypothesis test.

● Degrees of freedom: Degrees of freedom are associated with the variability, or freedom, one has in estimating a parameter. The degrees of freedom are related to the sample size and determine the shape of the test statistic's distribution.

Why use Hypothesis Testing

Hypothesis testing is an important procedure in statistics. Hypothesis testing evaluates two mutually exclusive population statements to determine which statement is most supported by the sample data. Hypothesis testing helps us determine whether a finding is statistically significant.

One-Tailed and Two-Tailed Tests

One Tailed Test

A one-tailed test focuses on one direction, either greater than (>) or less than (<) a specified value. We use a one-tailed test when there is a clear directional expectation based on prior knowledge or theory. The critical region is located on only one side of the distribution curve. If the sample falls into this critical region, the null hypothesis is rejected in favor of the alternative hypothesis.

There are two types of one-tailed test:

Left-Tailed (Left-Sided) Test: The alternative hypothesis asserts that the true parameter value is less than the null hypothesis value. Example: H0: μ ≥ 50 and H1: μ < 50.

Right-Tailed (Right-Sided) Test: The alternative hypothesis asserts that the true parameter value is greater than the null hypothesis value. Example: H0: μ ≤ 50 and H1: μ > 50.

Two Tailed Test

A two-tailed test considers both directions, greater than and less than a specified value. We use a two-tailed test when there is no specific directional expectation and we want to detect any significant difference. Example: H0: μ = 50 and H1: μ ≠ 50.

Type I and Type II errors in Hypothesis Testing

These errors are associated with the decisions made regarding the null hypothesis and the alternative hypothesis.

Type I error: When we reject the null hypothesis although that hypothesis was true. A Type I error is denoted by alpha (α).

Type II error: When we accept the null hypothesis although it is false. A Type II error is denoted by beta (β).

Decision \ Reality           | Null hypothesis is true       | Null hypothesis is false
Accept the null hypothesis   | Correct decision              | Type II error (false negative)
Reject the null hypothesis   | Type I error (false positive) | Correct decision

How Hypothesis Testing Works

Step 1: Define the Null and Alternative Hypotheses

State the null hypothesis (H0), representing no effect, and the alternative hypothesis (H1), suggesting an effect or difference.

Step 2: Choose the significance level

Select a significance level (α), typically 0.05, to determine the threshold for rejecting the null hypothesis. It provides validity to our hypothesis test, ensuring that we have sufficient data to back up our claims.

Step 3: Collect and analyze the data

Step 4: Calculate the test statistic

There are various hypothesis tests, each appropriate for a different goal and data situation. The test could be a Z-test, chi-square test, T-test, and so on.

1. Z-test: If the population mean and standard deviation are known, the Z-statistic is commonly used.

2. T-test: If the population standard deviation is unknown and the sample size is small, the t-test statistic is more appropriate.

3. Chi-square test: The chi-square test is used for categorical data or for testing independence in contingency tables.

4. F-test: The F-test is often used in analysis of variance (ANOVA) to compare variances or test the equality of means across multiple groups.

Step 5: Compare the test statistic

Method 1: Using critical values

Comparing the test statistic with the tabulated critical value, we have:

● If |Test Statistic| > Critical Value: Reject the null hypothesis.

● If |Test Statistic| ≤ Critical Value: Do not reject the null hypothesis.

Method 2: Using P-values

● If the p-value is less than or equal to the significance level (p ≤ α), you reject the null hypothesis. This indicates that the observed results are unlikely to have occurred by chance alone, providing evidence in favor of the alternative hypothesis.

● If the p-value is greater than the significance level (p > α), you do not reject the null hypothesis. This suggests that the observed results are consistent with what would be expected under the null hypothesis.

Step 6: Interpret the Results

We can conclude and interpret our result using either of the methods above.
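As a minimal sketch, the p-value decision rule from Step 5 (Method 2) could be written in Python as follows; the p-value shown is a hypothetical output of whichever test was chosen in Step 4:

```python
alpha = 0.05      # significance level chosen in Step 2
p_value = 0.012   # hypothetical p-value produced by the chosen test

if p_value <= alpha:
    print("Reject the null hypothesis (result is statistically significant).")
else:
    print("Fail to reject the null hypothesis.")
```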

Real-life Examples of Hypothesis Testing

Let's examine hypothesis testing using two real-life situations.

Case A: Does a New Drug Affect Blood Pressure?

Imagine a pharmaceutical company has developed a new drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to market, they need to conduct a study to assess its impact on blood pressure.

Data:

Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119

After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

Solution:

Step 1: Define the Hypotheses

Null Hypothesis (H0): The new drug has no effect on blood pressure.

Alternative Hypothesis (H1): The new drug has an effect on blood pressure.

Step 2: Define the Significance Level

Let's set the significance level at 0.05, meaning we will reject the null hypothesis if the evidence suggests less than a 5% chance of observing the results due to random variation alone.

Step 3: Compute the test statistic

Using a paired T-test, analyze the data to obtain a test statistic and a p-value. The test statistic (here, the T-statistic) is calculated from the differences between blood pressure measurements before and after treatment.

t = m / (s / √n)

where:

m = mean of the differences di = Xafter,i − Xbefore,i

s = standard deviation of the differences

n = sample size

Here, m = −3.9, s ≈ 1.37 and n = 10,

so the T-statistic ≈ −9 based on the formula for the paired t-test.

Step 4: Find the p-value

With the calculated t-statistic of −9 and degrees of freedom df = 9, you can find the p-value using statistical software or a t-distribution table.

Thus, p-value ≈ 8.54 × 10⁻⁶.

Step 5: Result

If the p-value is less than or equal to 0.05, the researchers reject the null hypothesis.

If the p-value is greater than 0.05, they fail to reject the null hypothesis.

Conclusion: Since the p-value (≈ 8.54 × 10⁻⁶) is less than the significance level (0.05), the researchers reject the null hypothesis. There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.
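Assuming SciPy is available, the same paired t-test can be reproduced with scipy.stats.ttest_rel, which performs this kind of dependent-samples test:

```python
from scipy import stats

before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

# Paired (dependent-samples) t-test on the before/after measurements
t_stat, p_value = stats.ttest_rel(after, before)

print(t_stat)   # ≈ -9.0
print(p_value)  # ≈ 8.5e-06, well below α = 0.05, so reject H0
```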


Case B: Cholesterol Level in a Population

Data: A sample of 25 individuals is taken, and their cholesterol levels are measured.

Cholesterol Levels (mg/dL): 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205.

Hypothesized population mean (μ): 200 mg/dL

Population standard deviation (σ): 5 mg/dL (given for this problem)

Step 1: Define the Hypotheses

Null Hypothesis (H0): The average cholesterol level in the population is 200 mg/dL.

Alternative Hypothesis (H1): The average cholesterol level in the population is different from 200 mg/dL.

Step 2: Define the Significance Level

As the direction of deviation is not given, we assume a two-tailed test. Based on the standard normal (z) table, the critical values for a significance level of 0.05 (two-tailed) are approximately −1.96 and 1.96.

Step 3: Compute the test statistic

The sample mean of the 25 measurements is x̄ = 202.04. The test statistic is calculated using the z formula:

Z = (x̄ − μ) / (σ / √n) = (202.04 − 200) / (5 / √25) = 2.04

Step 4: Result

Since the absolute value of the test statistic (2.04) is greater than the critical value (1.96), we reject the null hypothesis and conclude that there is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.
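A small NumPy sketch of this one-sample z-test, with the decision made against the two-tailed critical value of 1.96:

```python
import numpy as np

cholesterol = np.array([205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198,
                        202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205,
                        210, 192, 205])

mu0, sigma = 200, 5                      # hypothesized mean and known population SD
n = len(cholesterol)

z = (cholesterol.mean() - mu0) / (sigma / np.sqrt(n))
print(z)                                 # ≈ 2.04

if abs(z) > 1.96:                        # two-tailed critical value at α = 0.05
    print("Reject H0: mean cholesterol differs from 200 mg/dL.")
else:
    print("Fail to reject H0.")
```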

Limitations of Hypothesis Testing

Although a useful technique, hypothesis testing does not offer a comprehensive grasp of the topic being studied. It concentrates on specific hypotheses and statistical significance without fully reflecting the intricacy or whole context of the phenomenon.

The accuracy of hypothesis testing results is contingent on the quality of the available data and the appropriateness of the statistical methods used. Inaccurate data or poorly formulated hypotheses can lead to incorrect conclusions.

Relying solely on hypothesis testing may cause analysts to overlook significant patterns or relationships in the data that are not captured by the specific hypotheses being tested. This limitation underscores the importance of complementing hypothesis testing with other analytical approaches.


In Conclusion…

Hypothesis testing stands as a cornerstone of statistical analysis, enabling data scientists to navigate uncertainties and draw credible inferences from sample data. By systematically defining null and alternative hypotheses, choosing significance levels, and leveraging statistical tests, researchers can assess the validity of their assumptions. The discussion also elucidates the critical distinction between Type I and Type II errors, providing a comprehensive understanding of the nuanced decision-making process inherent in hypothesis testing. The real-life example of testing a new drug's effect on blood pressure using a paired T-test showcases the practical application of these principles, underscoring the importance of statistical rigor in data-driven decision-making.

CHI-SQUARE TEST

The chi-square test is a statistical test used to determine if there is a significant association between two categorical variables. It is a non-parametric test, meaning it makes no assumptions about the distribution of the data. The test is based on the comparison of observed and expected frequencies within a contingency table. The chi-square test also helps with feature selection problems by looking at the relationship between the features. It determines whether the association between two categorical variables in the sample reflects their real association in the population.

The chi-square distribution, on which the test is based, is a continuous probability distribution that arises in statistics and is associated with the sum of the squares of independent standard normal random variables. It is often denoted χ² and is parameterized by the degrees of freedom k. It is widely used in statistical analysis, particularly in hypothesis testing and in calculating confidence intervals, and it is often used with non-normally distributed data.

Key terms used in the Chi-Square test

● Degrees of freedom: For a contingency table with r rows and c columns, the degrees of freedom are (r − 1)(c − 1).
● Observed values: The actual data collected.
● Expected values: The frequencies predicted under the null hypothesis. For the test of independence, the expected count for the cell in row i and column j is Eij = (Ri × Cj) / N, where:
● Ri: total of row i
● Cj: total of column j
● N: total number of observations

Contingency table: A contingency table, also known as a cross-tabulation or two-way table, is a statistical table that displays the joint distribution of two categorical variables.

Types of Chi-Square test

There are several types of chi-square tests, each designed to address specific research questions or scenarios. The two main types are the chi-square test for independence and the chi-square goodness-of-fit test.

● Chi-Square Test for Independence: This test assesses whether there is a significant association or relationship between two categorical variables. It is used to determine whether changes in one variable are independent of changes in another. This test is applied when we have counts of values for two nominal or categorical variables. To conduct this test, two requirements must be met: independence of observations and a relatively large sample size.
For example, suppose we are interested in exploring whether there is a relationship between online shopping preferences and the payment methods people choose. The first variable is the type of online shopping preference (e.g., Electronics, Clothing, Books), and the second variable is the chosen payment method (e.g., Credit Card, Debit Card, PayPal). The null hypothesis in this case would be that the choice of online shopping preference and the selected payment method are independent.

● Chi-Square Goodness-of-Fit Test: The Chi-Square Goodness-of-Fit test is used in statistical hypothesis testing to ascertain whether a variable is likely to come from a given distribution or not. This test can be applied in situations where we have value counts for categorical variables. With the help of this test, we can determine whether the data values are a representative sample of the entire population or whether they fit our hypothesis well.
For example, imagine you are testing the fairness of a six-sided die. The null hypothesis is that each face of the die should have an equal probability of landing face up. In other words, the die is unbiased, and the proportions of each number (1 through 6) occurring are expected to be equal.

Why we use the Chi-Square Test

● The chi-square test is widely used across diverse fields to analyze categorical data, offering valuable insights into associations or differences between categories.
● Its primary application lies in testing the independence of two categorical variables, determining whether changes in one variable relate to changes in another.
● It is particularly useful for understanding relationships between factors, such as gender and preferences or product categories and purchasing behaviors.
● Researchers appreciate its simplicity and ease of application to categorical data, making it a preferred choice for statistical analysis.
● The test provides insights into patterns and associations within categorical data, aiding in the interpretation of relationships.
● Its utility extends to various fields, including genetics, market research, quality control, and social sciences, showcasing its broad applicability.
● The chi-square test helps assess the conformity of observed data to expected values, enhancing its role in statistical analysis.

Steps to perform the Chi-square test

1. Define the hypotheses:
● Null Hypothesis (H0): There is no significant association between the two categorical variables.
● Alternative Hypothesis (H1): There is a significant association between the two categorical variables.
2. Create a contingency table that displays the frequency distribution of the two categorical variables.
3. Find the expected values for each cell, Eij = (Ri × Cj) / N.
4. Calculate the chi-square statistic: χ² = Σ (Observed − Expected)² / Expected.
5. Determine the degrees of freedom: df = (rows − 1) × (columns − 1).
6. Accept or reject the null hypothesis: compare the calculated chi-square statistic to the critical value from the chi-square distribution table for the chosen significance level (e.g., 0.05); if the statistic exceeds the critical value, reject the null hypothesis.
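As a sketch of the test for independence, the example below runs scipy.stats.chi2_contingency on a hypothetical shopping-preference versus payment-method table; the counts are invented for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: shopping preference; columns: payment method (hypothetical counts)
#                Credit Card  Debit Card  PayPal
observed = np.array([[80,         60,       40],   # Electronics
                     [50,         70,       30],   # Clothing
                     [20,         30,       50]])  # Books

chi2, p_value, dof, expected = chi2_contingency(observed)

print(chi2, p_value, dof)  # test statistic, p-value, degrees of freedom
if p_value <= 0.05:
    print("Reject H0: preference and payment method appear associated.")
else:
    print("Fail to reject H0: no evidence of association.")
```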
To Conclude…

The Chi-Square test stands as a versatile tool for exploring associations in categorical data, offering valuable insights into dependencies between variables. Whether applied for independence or goodness-of-fit, its significance resonates across genetics, market research, and social sciences. Feature selection using the Chi-Square statistic can also enhance model efficiency, for example when applied to dataset features such as those of the Iris dataset.

T-Test

The t-test is named after the Student's t-distribution, which William Sealy Gosset created while writing under the pen name "Student."

A t-test is a type of inferential statistical test used to determine if there is a significant difference between the means of two groups. It is often used when data are normally distributed and the population variance is unknown.

The t-test is used in hypothesis testing to assess whether the observed difference between the means of the two groups is statistically significant or just due to random variation.

Assumptions in the T-test

● Independence: The observations within each group must be independent of each other. This means that the value of one observation should not influence the value of another observation. Violations of independence can occur with repeated measures, paired data, or clustered data.

● Normality: The data within each group should be approximately normally distributed, i.e. the distribution of the data within each group being compared should resemble a normal (bell-shaped) distribution. This assumption is crucial for small sample sizes (n < 30).

● Homogeneity of variances (for the independent samples t-test): The variances of the two groups being compared should be equal. This assumption ensures that the groups have a similar spread of values. Unequal variances can affect the standard error of the difference between means and, consequently, the t-statistic.

● Absence of outliers: There should be no extreme outliers in the data, as outliers can disproportionately influence the results, especially when sample sizes are small.
Types of T-tests

There are three types of t-tests, and they are categorized as dependent and independent t-tests.

1. One-sample t-test: tests the mean of a single group against a known mean.
2. Two-sample t-test: it is further divided into two types:
- Independent samples t-test: compares the means of two separate groups.
- Paired sample t-test: compares means from the same group at different times (say, one year apart).

One-sample T-test

The one-sample t-test is one of the most widely used t-tests; it compares the sample mean of the data to a particular given value, i.e. the sample mean against the true/population mean.

We can use this when the sample size is small (under 30), the data are collected randomly, and they are approximately normally distributed. It can be calculated as:

t = (x̄ − μ) / (s / √n)

where
t = t-value
x̄ = sample mean
μ = true/population mean
s = standard deviation of the sample
n = sample size
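A sketch of a one-sample t-test with scipy.stats.ttest_1samp, using a small made-up sample and a hypothesized mean of 50 (echoing the earlier production example):

```python
from scipy import stats

# Hypothetical daily production figures for a small sample
sample = [48, 52, 47, 50, 49, 53, 46, 51, 48, 50]

# H0: μ = 50 vs. H1: μ ≠ 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

print(t_stat, p_value)
if p_value <= 0.05:
    print("Reject H0: the mean production differs from 50 units/day.")
else:
    print("Fail to reject H0.")
```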
Independent sample T-test

An independent sample t-test, commonly known as an unpaired sample t-test, is used to find out whether the differences found between two groups are actually significant or just a random occurrence.

We can use this when:

➔ the population mean or standard deviation is unknown (information about the population is unknown), and
➔ the two samples are separate/independent, e.g. boys and girls (the two are independent of each other).
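A sketch of an independent (unpaired) two-sample t-test with scipy.stats.ttest_ind on two hypothetical groups:

```python
from scipy import stats

# Hypothetical test scores for two independent groups
group_boys = [72, 75, 68, 80, 77, 74, 69, 73]
group_girls = [78, 82, 75, 85, 79, 81, 77, 80]

# equal_var=True assumes homogeneity of variances (see the assumptions above)
t_stat, p_value = stats.ttest_ind(group_boys, group_girls, equal_var=True)

print(t_stat, p_value)
```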

Paired Two-sample T-test

The paired sample t-test, commonly known as the dependent sample t-test, is used to find out whether the difference in the means of two samples is 0. The test is done on dependent samples, usually focusing on a particular group of people or things. In this test, each entity is measured twice, resulting in a pair of observations.

We can use this when:

➔ two similar (twin-like) samples are given, e.g. scores obtained in English and Math (both subjects) by the same students;
➔ the dependent variable (data) is continuous;
➔ the pairs of observations are independent of one another;
➔ the dependent variable is approximately normally distributed.

To conclude…

T-tests play a crucial role in hypothesis testing, comparing means, and drawing conclusions about populations. The test can be one-sample, independent two-sample, or paired two-sample, each with specific use cases and assumptions. Interpretation of results involves considering t-values, p-values, and critical values.

These tests aid researchers in making informed decisions based on statistical evidence.

Prepared by
Jahbuikem Anderson

As researched from GeeksforGeeks, Simplilearn, Statology and Indeed
