Understanding Statistical Analysis: Techniques and Applications
Prepared by
Jahbuikem Anderson
Statistical analysis is a scientific tool used in AI and ML to collect and analyze
large amounts of data, identify common patterns and trends, and convert them
into meaningful information. In simple words, statistical analysis is a data
analysis tool that helps draw meaningful conclusions from raw and unstructured
data.
Conclusions drawn through statistical analysis support decision-making and help
businesses make future predictions on the basis of past trends. It can be defined
as the science of collecting and analyzing data to identify trends and patterns
and presenting them. Statistical analysis involves working with numbers and is
used by businesses and other institutions to derive meaningful information from
data.
● Descriptive Analysis
Descriptive analysis involves summarizing data to present them in the form of
charts, graphs, and tables. Rather than drawing conclusions, it simply makes
complex data easy to read and understand.
● Inferential Analysis
Inferential statistical analysis focuses on drawing meaningful conclusions
on the basis of the data analyzed. It studies the relationship between different
variables or makes predictions for the whole population.
● Predictive Analysis
● Prescriptive Analysis
The prescriptive analysis conducts the analysis of data and prescribes the best
course of action based on the results. It is a type of statistical analysis that helps
you make an informed decision.
● Causal Analysis
Causal statistical analysis focuses on determining the cause-and-effect
relationship between different variables within the raw data. In simple words, it
determines why something happens and its effect on other variables. Businesses
can use this methodology to determine the reason for a failure.
Importance of Statistical Analysis
and biological sciences, such as genetics.
● Statistical approaches are used in the work of businesspeople,
manufacturers, and researchers. Statistics departments can be found in
banks, insurance businesses, and government agencies.
● A modern administrator, whether in the public or commercial sector,
relies on statistical data to make correct decisions.
● Politicians can use statistics to support and validate their claims
while also explaining the issues they address.
There are five major steps involved in the statistical analysis process:
1. Data collection
The first step in statistical analysis is data collection. You can collect data through
primary or secondary sources such as surveys, customer relationship
management software, online quizzes, financial reports and marketing
automation tools. To ensure the data is viable, you can choose data from a
sample that's representative of a population. For example, a company might
collect data from previous customers to understand buyer behaviors.
2. Data organization
The next step after data collection is data organization. Also known as data
cleaning, this stage involves identifying and removing duplicate data and
inconsistencies that may prevent you from getting an accurate analysis. This step
is important because it can help companies ensure their data and the
3. Data presentation
summarize the data. Data presentation can also help you determine the best way
to present the data based on its arrangement.
4. Data analysis
Data analysis involves manipulating data sets to identify patterns, trends and
relationships using statistical techniques, such as inferential and associational
5. Data interpretation
The last step is data interpretation, which provides conclusive results regarding
the purpose of the analysis. After analysis, you can present the result as charts,
reports, scorecards and dashboards to make it accessible to nonprofessionals.
For example, interpreting an analysis of the impact of a 6,000-worker
factory on the crime rate in a small town with a population of 13,000 residents
might show a declining rate of criminal activity. You may use a line graph to
display this decline.
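The five steps above can be sketched in a few lines of Python. This is only a minimal illustration: the records, the customer ids and the purchase amounts are hypothetical, and a real project would normally use a dedicated data library rather than plain lists.

```python
from statistics import mean

# Step 1: data collection (hypothetical records: customer id, purchase amount)
raw_records = [
    ("c1", 20.0),
    ("c2", 35.5),
    ("c2", 35.5),  # duplicate row
    ("c3", None),  # missing value
    ("c4", 50.0),
]

# Step 2: data organization (drop duplicates and incomplete rows, keep order)
seen = set()
clean_records = []
for record in raw_records:
    if record in seen or record[1] is None:
        continue
    seen.add(record)
    clean_records.append(record)

# Step 3: data presentation (a simple text table)
for customer, amount in clean_records:
    print(f"{customer}\t{amount:.2f}")

# Step 4: data analysis (one summary statistic)
average_purchase = mean(amount for _, amount in clean_records)

# Step 5: data interpretation (state the result in plain words)
print(f"Average purchase across {len(clean_records)} customers: {average_purchase:.2f}")
```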
4 Common statistical analysis methods
Here are four common methods for performing statistical analysis:
Mean
You can calculate the mean, or average, by finding the sum of a list of numbers
and then dividing the answer by the number of items in the list. It is the simplest
form of statistical analysis, allowing the user to determine the central point of a
data set. The formula for calculating mean is:
Mean = (sum of the values) / (number of values)
Example: You can find the mean of the numbers 1, 2, 3, 4, 5 and 6 by first
adding the numbers together to get 21, then dividing that sum by the
number of figures in the list, which is six. The mean of the numbers is 21/6 = 3.5.
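The mean example can be checked with a short Python sketch (the variable names are only illustrative):

```python
numbers = [1, 2, 3, 4, 5, 6]

# Sum of the values divided by the number of values
total = sum(numbers)
count = len(numbers)
mean_value = total / count

print(total, count, mean_value)
```

Python's standard library also provides statistics.mean, which gives the same result for this list.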
Standard deviation
An application of SD is to test whether participants in a survey gave similar
answers. If a large percentage of respondents' answers are similar, it means
you have a low standard deviation and you can apply their responses to a larger
population. To calculate standard deviation, use this formula:
σ² = Σ(x − μ)² / n
● σ represents standard deviation (σ² is the variance, so σ is its square root)
● Σ represents the sum over all values in the data set
● x represents each value in the data set
● μ represents the mean of the data
● n represents the number of values in the data set
Example: You can calculate the standard deviation of the data set used in the
mean calculation. The first step is to find the variance of the data set. To find
variance, subtract the mean from each value in the data set, square the answer,
then add the squared values and divide by n, as in the formula above. The
standard deviation is the square root of the variance.
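As a rough sketch, the formula σ² = Σ(x − μ)² / n can be applied to the same numbers used in the mean example (1 through 6). Note that this is the population form, dividing by n; the sample form would divide by n − 1 instead.

```python
import math

numbers = [1, 2, 3, 4, 5, 6]
n = len(numbers)
mu = sum(numbers) / n  # mean of the data

# Variance: average of the squared distances from the mean
variance = sum((x - mu) ** 2 for x in numbers) / n

# Standard deviation: square root of the variance
sigma = math.sqrt(variance)

print(f"variance = {variance:.4f}, standard deviation = {sigma:.4f}")
```

For this data set the variance is 17.5 / 6, roughly 2.92, and the standard deviation is roughly 1.71.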
Regression
Y = a + b(x)
● Y represents the dependent variable, which is the variable you want to
predict or measure
● x represents the independent variable, or the data used to predict the
dependent variable
● a represents the y-intercept or the value of Y when x equals zero
● b represents the slope of the regression graph
Example: Find the dollar cost of maintaining a car driven for 40,000 miles if the
cost of maintenance when there is no mileage on the car is $100. Take b as 0.02,
so the cost of maintenance increases by $0.02 for every additional mile
driven.
● a = $100
● b = $0.02
Y = $100 + $0.02(40,000)
Y = $100 + $800
Y = $900
This shows that mileage affects the maintenance costs of a car.
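The car maintenance example can be written as a small function. This is just a sketch of Y = a + b(x) with the intercept and slope taken from the example; the function name predict_cost is made up for illustration, and in practice a and b would be estimated from data rather than given.

```python
def predict_cost(miles, a=100.0, b=0.02):
    """Predicted maintenance cost in dollars: Y = a + b * x."""
    return a + b * miles


cost_at_zero = predict_cost(0)         # the intercept a
cost_at_40k = predict_cost(40_000)     # the worked example

print(f"${cost_at_zero:.2f} at 0 miles, ${cost_at_40k:.2f} at 40,000 miles")
```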
Hypothesis testing
Hypothesis testing is used to test if a conclusion is valid for a specific data set by
comparing the data against a certain assumption. The default assumption, usually
that there is no effect, is called the null hypothesis or hypothesis 0. The competing
claim that contradicts the null hypothesis is called the alternative hypothesis or
hypothesis 1.
Example: From the regression calculation above, you want to test the hypothesis
that mileage affects the maintenance costs of a car. The null hypothesis is that
mileage has no effect on maintenance costs, and the alternative hypothesis is that
it does. Here, we reject the null hypothesis since the regression above shows that
mileage influences car maintenance costs.
process.
Hypothesis testing is a statistical method that is used to make a statistical
decision using experimental data. Hypothesis testing is basically an assumption
μ ≠ 50
kind of result in each sample.
● P-value: The P-value, or calculated probability, is the probability of obtaining
results at least as extreme as those observed when the null hypothesis (H0) of a
given study problem is true. If your P-value is less than the chosen
significance level, you reject the null hypothesis, i.e. you accept that your
sample supports the alternative hypothesis.
● Test Statistic: The test statistic is a numerical value calculated from
One-Tailed and Two-Tailed Test
A one-tailed test focuses on one direction, either greater than (>) or less than (<) a
specified value. We use a one-tailed test when there is a clear directional
expectation based on prior knowledge or theory. The critical region is located on
only one side of the distribution curve. If the test statistic falls into this critical
region, the null hypothesis is rejected in favor of the alternative hypothesis.
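To make the idea of a critical region concrete, here is a minimal sketch that assumes the test statistic is a z value following the standard normal distribution and that the significance level is 0.05. Both are assumptions chosen for illustration; other tests use other distributions. The two-tailed case is included only for contrast.

```python
from statistics import NormalDist

ALPHA = 0.05
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Right-tailed test: the whole 5% rejection area sits in the upper tail.
right_critical = standard_normal.inv_cdf(1 - ALPHA)
# Left-tailed test: the whole 5% rejection area sits in the lower tail.
left_critical = standard_normal.inv_cdf(ALPHA)
# Two-tailed test, for contrast: 2.5% in each tail.
two_tailed_critical = standard_normal.inv_cdf(1 - ALPHA / 2)


def in_critical_region(z, tail):
    """Return True if the z statistic falls in the rejection region."""
    if tail == "right":
        return z > right_critical
    if tail == "left":
        return z < left_critical
    if tail == "two":
        return abs(z) > two_tailed_critical
    raise ValueError("tail must be 'right', 'left' or 'two'")


print(round(right_critical, 3), round(left_critical, 3), round(two_tailed_critical, 3))
```

A statistic of z = 1.7, for instance, lies inside the right-tailed critical region (cutoff about 1.645) but outside the two-tailed one (cutoff about 1.96), which is why the direction of the test has to be chosen before looking at the data.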
One Tailed test
true parameter value is less than the null hypothesis. Example: H0: μ ≥ 50
true parameter value is greater than the null hypothesis. Example: H0: μ ≤ 50
A two-tailed test considers both directions, greater than and less than a
Type II errors: When we accept the null hypothesis, but it is false. Type II
errors are denoted by beta(β).
                                        | Null Hypothesis is True       | Null Hypothesis is False
Alternative Hypothesis is True (Reject) | Type I Error (False Positive) | Correct Decision
State the null hypothesis (H0), representing no effect, and the alternative
hypothesis (H1), suggesting an effect or difference.
Select a significance level (α), typically 0.05, to determine the threshold for
rejecting the null hypothesis. It provides validity to our hypothesis test,
ensuring that we have sufficient data to back up our claims.
Step 4: Calculate Test Statistic
There are various hypothesis tests, each appropriate for a different goal. Choose
the one that fits the data to calculate the test statistic. This could be a Z-test,
Chi-square test, T-test, and so on.
● If the p-value is less than or equal to the significance level, i.e. (p ≤ α), you
reject the null hypothesis. This indicates that the observed results are
unlikely under the null hypothesis and support
the alternative hypothesis.
● If the p-value is greater than the significance level, i.e. (p > α), you do not
reject the null hypothesis. This suggests that the observed results are
consistent with the null hypothesis.
Step 6: Interpret the Results
We can conclude/interpret our result using either of the methods above.
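The p-value decision rule can be expressed as a tiny helper. This is only a sketch; the function name decide and the default significance level of 0.05 are choices made here for illustration.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value rule: reject H0 when p <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))  # below the 0.05 threshold
print(decide(0.20))  # above the 0.05 threshold
```

Note the wording of the second outcome: a large p-value does not prove the null hypothesis, it only means the data do not give enough evidence against it.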
Imagine a pharmaceutical company has developed a new drug that they believe can
effectively lower blood pressure in patients with hypertension. Before bringing the
drug to market, they need to conduct a study to assess its impact on blood
pressure.
Data:
Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114
Solution:
Let’s set the significance level at 0.05, meaning we reject the null
hypothesis if the evidence suggests less than a 5% chance of observing the results
due to random variation.
Using a paired T-test, analyze the data to obtain a test statistic and a p-value.
The test statistic (e.g., T-statistic) is calculated based on the differences between
the before and after measurements:
t = m/(s/√n)
where:
m = mean of the differences,
s = standard deviation of the differences,
n = sample size.
Using the formula for the paired t-test, we calculate the T-statistic = -9.
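The T-statistic of -9 can be reproduced from the before and after readings with t = m/(s/√n). The sketch below assumes s is the sample standard deviation of the differences (dividing by n − 1) and, instead of computing an exact p-value, compares the result with the two-tailed critical value of 2.262 taken from a t-table for 9 degrees of freedom at the 0.05 level.

```python
import math
from statistics import mean, stdev

before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

# Paired test: work with the per-patient differences (after minus before)
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
m = mean(differences)    # mean of the differences
s = stdev(differences)   # sample standard deviation of the differences
t_statistic = m / (s / math.sqrt(n))

T_CRITICAL = 2.262  # t-table value: two-tailed, alpha = 0.05, df = n - 1 = 9
reject_null = abs(t_statistic) > T_CRITICAL

print(f"mean difference = {m:.1f}, t = {t_statistic:.2f}, reject H0: {reject_null}")
```

The mean difference is -3.9 and the statistic comes out at about -9, far beyond the critical value, so the null hypothesis of no change is rejected.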
Step 5: Result
If the p-value is less than or equal to 0.05, the researchers reject the null
hypothesis.
If the p-value is greater than 0.05, they fail to reject the null hypothesis.
significance level (0.05), the researchers reject the null hypothesis. There is
statistically significant evidence that the average blood pressure before and after
Data: A sample of 25 individuals is taken, and their cholesterol levels are measured.
Cholesterol Levels (mg/dL): 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198,
202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205.
Population Mean (μ): 200 mg/dL
Population Standard Deviation (σ): 5 mg/dL (given for this problem)
As the direction of deviation is not given, we assume a two-tailed test. Based
on a normal distribution table (z-table), the critical values for a significance
level of 0.05 (two-tailed) are approximately -1.96 and 1.96.
Step 4: Result
Since the absolute value of the test statistic (2.04) is greater than the critical value
(1.96), we reject the null hypothesis and conclude that there is statistically
significant evidence that the average cholesterol level in the population is different
from 200 mg/dL.
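The test statistic of 2.04 can be reproduced from the cholesterol data. The sketch below assumes the usual one-sample Z-test formula, z = (sample mean − μ) / (σ/√n), with μ = 200 and σ = 5 as given, and also computes a two-tailed p-value from the standard normal distribution.

```python
import math
from statistics import NormalDist, mean

levels = [205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198,
          202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205,
          210, 192, 205]

MU_0 = 200    # population mean under the null hypothesis (mg/dL)
SIGMA = 5     # known population standard deviation (mg/dL)
ALPHA = 0.05

n = len(levels)
sample_mean = mean(levels)
z = (sample_mean - MU_0) / (SIGMA / math.sqrt(n))

# Two-tailed p-value: probability of a statistic at least this extreme
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
z_critical = NormalDist().inv_cdf(1 - ALPHA / 2)
reject_null = abs(z) > z_critical

print(f"mean = {sample_mean}, z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

The sample mean is 202.04 mg/dL, which gives z = 2.04 and a p-value of roughly 0.041, so both the critical value method and the p-value method lead to rejecting the null hypothesis at the 0.05 level.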
Limitations of Hypothesis Testing
Although a useful technique, hypothesis testing does not offer a comprehensive
grasp of the topic being studied. Without fully reflecting the intricacy or whole
significance.
The accuracy of hypothesis testing results is contingent on the quality of available
data and the appropriateness of statistical methods used. Inaccurate data or poorly
patterns or relationships in the data that are not captured by the specific
In Conclusion…
scientists to navigate uncertainties and draw credible inferences from sample data.
levels, and leveraging statistical tests, researchers can assess the validity of their
assumptions. The article also elucidates the critical distinction between Type I and
testing a new drug’s effect on blood pressure using a paired T-test showcases the
CHI-SQUARE TEST
The chi-square test is a statistical test used to determine if there is a significant
association between two categorical variables. It is a non-parametric test,
meaning it does not assume the data follow a particular distribution such as the
normal distribution. The test is based on the comparison of observed and expected
frequencies within a contingency table. The chi-square test helps with feature
selection problems by looking at the relationship between the features. It
determines if the association between two categorical variables of the sample
would reflect their real association in the population. The test statistic follows
the chi-square distribution, which belongs to the family of continuous probability
distributions.
The chi-square distribution is a continuous probability distribution that arises in
statistics and is associated with the sum of the squares of independent standard
normal random variables.
● Cj : Totals of column j
● N: Total number of Observations
Types of Chi-Square test
There are several types of chi-square tests, each designed to address specific
research questions or scenarios. The two main types are the chi-square test for
independence and the chi-square goodness-of-fit test.
● Chi-Square Test for Independence: This test assesses whether there is a
significant association or relationship between two categorical variables. It
is used to determine whether changes in one variable are independent of
changes in another. This test is applied when we have counts of values for
two nominal or categorical variables. To conduct this test, two requirements
likely from a given distribution or not. This test can be applied in situations
when we have value counts for categorical variables. With the help of this
test, we can determine whether the data values are a representative
sample of the entire population or if they fit our hypothesis well.
For example, imagine you are testing the fairness of a six-sided die. The
null hypothesis is that each face of the die has an equal probability
of landing face up. In other words, the die is unbiased, and the proportions
of each number (1 through 6) occurring are expected to be equal.
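The die example can be sketched as a goodness-of-fit calculation. The observed counts below are hypothetical (60 imagined rolls), the statistic is the usual sum of (observed − expected)² / expected, and the critical value 11.070 is the chi-square table value for 5 degrees of freedom at the 0.05 level.

```python
# Hypothetical counts for faces 1 through 6 after 60 rolls
observed = [8, 12, 9, 11, 10, 10]

total_rolls = sum(observed)
# Under the null hypothesis every face is equally likely
expected = [total_rolls / len(observed)] * len(observed)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

CHI2_CRITICAL = 11.070  # chi-square table: df = 5, alpha = 0.05
reject_null = chi_square > CHI2_CRITICAL

print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}, reject H0: {reject_null}")
```

With these counts the statistic is 1.0, well below the critical value, so there is no evidence that the die is unfair.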
Why we use the Chi-Square Test
● The chi-square test is widely used across diverse fields to analyze
categorical data, offering valuable insights into associations or differences
between categories.
● Its primary application lies in testing the independence of two categorical
variables.
● The test provides insights into patterns and associations within categorical
data, aiding in the interpretation of relationships.
● Its utility extends to various fields, including genetics, market research,
quality control, and social sciences, showcasing its broad applicability.
● The chi-square test helps assess the conformity of observed data to
expected values, enhancing its role in statistical analysis.
two categorical variables.
2. Create a contingency table that displays the frequency distribution of the
two categorical variables.
3. Find the expected values.
4. Calculate the chi-square statistic.
5. Determine the degrees of freedom.
6. Accept or reject the null hypothesis: Compare the calculated chi-square
statistic to the critical value from the chi-square distribution table for the
chosen significance level (e.g., 0.05).
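The steps above can be sketched for a small contingency table. The 2×2 counts are hypothetical, each expected value is computed as (row total × column total) / N, no continuity correction is applied, and the critical value 3.841 is the chi-square table value for 1 degree of freedom at the 0.05 level.

```python
# Step 2: hypothetical contingency table (rows: two groups, columns: two answers)
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Step 3: expected value for each cell = row total * column total / N
expected = [[r * c / n_total for c in col_totals] for r in row_totals]

# Step 4: chi-square statistic = sum of (observed - expected)^2 / expected
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 5: degrees of freedom = (rows - 1) * (columns - 1)
degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

# Step 6: compare with the table value for df = 1, alpha = 0.05
CHI2_CRITICAL = 3.841
reject_null = chi_square > CHI2_CRITICAL

print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}, reject H0: {reject_null}")
```

Here the statistic is about 16.67, which exceeds 3.841, so for this made-up table the null hypothesis of independence would be rejected.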
To Conclude…
The Chi-Square test stands as a versatile tool for exploring categorical data
associations, offering valuable insights into dependencies between variables.
Whether applied for independence or goodness-of-fit, its significance resonates
across genetics, market research, and social sciences. Feature selection using
Chi-Square enhances model efficiency, exemplified by the Python
T-Test
The t-test is named after William Sealy Gosset’s Student’s t-distribution, created
while he was writing under the pen name “Student.”
data is normally distributed and population variance is unknown.
The t-test is used in hypothesis testing to assess whether the observed
difference between the means of the two groups is statistically significant or just
due to random variation.
Assumptions in T-test
● Independence: The observations within each group must be independent
of each other. This means that the value of one observation should not
influence the value of another observation. Violations of independence can
occur with repeated measures, paired data, or clustered data.
1. One-sample t-test: tests the mean of a single group against a known mean.
2. Two-sample t-test: It is further divided into two types:
- Independent samples t-test: compares the means for two groups.
- Paired sample t-test: compares means from the same group at
different times (say, one year apart).
t = (x_bar − μ) / (σ / √n)
where:
t = t-value
x_bar = sample mean
μ = true/population mean
σ = standard deviation
n = sample size
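Using the symbols defined above, a one-sample t-value can be computed as the difference between the sample mean and μ, divided by the standard deviation over √n. The sketch below uses hypothetical measurements and a hypothesized mean of 50, takes the standard deviation to be the sample standard deviation (since the population value is usually unknown in a t-test), and compares the result with the t-table value 2.365 for 7 degrees of freedom (two-tailed, 0.05 level).

```python
import math
from statistics import mean, stdev

sample = [52, 48, 51, 53, 49, 50, 54, 47]  # hypothetical measurements
MU = 50                                    # hypothesized population mean

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)  # sample standard deviation, used in place of sigma

t_value = (x_bar - MU) / (s / math.sqrt(n))

T_CRITICAL = 2.365  # t-table: two-tailed, alpha = 0.05, df = n - 1 = 7
reject_null = abs(t_value) > T_CRITICAL

print(f"x_bar = {x_bar}, t = {t_value:.3f}, reject H0: {reject_null}")
```

For this sample the mean is 50.5 and the t-value is about 0.577, so the data give no reason to reject the hypothesized mean of 50.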
➔ the population mean or standard deviation is unknown (information about
the population is unknown).
➔ the two samples are separate/independent, e.g. boys and girls (the two
are independent of each other).
➔ two similar (twin-like) samples are given, e.g. scores obtained in English
and Math (both subjects).
To conclude…
T-tests play a crucial role in hypothesis testing, comparing means, and drawing
conclusions about populations. The test can be one-sample, independent
two-sample, or paired two-sample, each with specific use cases and
assumptions. Interpretation of results involves considering T-values, P-values,
and critical values.
These tests aid researchers in making informed decisions based on statistical
evidence.