Statistical Methods Unit III & IV
Unit-III
Dispersion is the state of being dispersed or spread out. Statistical dispersion means the extent
to which numerical data is likely to vary about an average value. In other words, dispersion
helps to understand the distribution of the data.
Measures of Dispersion
In statistics, the measures of dispersion help to interpret the variability of data, i.e. to
know how homogeneous or heterogeneous the data is. In simple terms, they show how
tightly clustered or widely scattered the values of a variable are.
Variance (σ² or s²)
Explanation:
Variance measures the average squared deviation of data points from the mean.
A higher variance indicates greater dispersion, meaning data points are spread out
from the mean.
A lower variance suggests the data points are closer to the mean.
Characteristics:
Always non-negative since deviations are squared.
Sensitive to outliers due to squaring of deviations.
Measured in squared units of the data (e.g., if data is in meters, variance is in square
meters).
Applications:
Used in risk analysis in finance to assess market volatility.
In quality control for measuring product consistency.
Basis for standard deviation, which is its square root and easier to interpret.
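For illustration, here is a minimal Python sketch of the variance calculation using the
standard-library statistics module; the data values are invented for demonstration:
```python
import statistics

# Illustrative data only: lengths of five items in metres
data = [1.2, 1.5, 1.1, 1.8, 1.4]

# Population variance: mean of squared deviations from the mean (divides by n)
pop_var = statistics.pvariance(data)

# Sample variance: divides by n - 1 instead of n (Bessel's correction)
samp_var = statistics.variance(data)

# Note the squared units: metres in, square metres out
print(f"Population variance: {pop_var:.4f} m^2")
print(f"Sample variance:     {samp_var:.4f} m^2")
```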
Mean Deviation:
Mean Deviation (MD), also called the Average Deviation, is a measure of dispersion
that represents the average of the absolute deviations of data points from a central value,
typically the mean or median.
Explanation:
It calculates the average absolute deviation from the mean or median.
Unlike variance, it does not square the deviations, making it more straightforward but
less commonly used in advanced statistical analysis.
Characteristics:
Non-negative and expressed in the same unit as the data.
Provides a simpler measure of spread compared to variance and standard deviation.
Less sensitive to outliers than variance, but for the same reason it gives less weight to large deviations.
Applications:
Used in descriptive statistics for understanding data spread.
Helpful in finance and economics for measuring variability in returns.
Useful when a non-squared measure of deviation is preferred.
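As a small sketch (invented data, plain Python), the mean deviation can be computed about
either the mean or the median:
```python
import statistics

# Illustrative data only
data = [4, 7, 9, 10, 15]

mean = statistics.mean(data)
median = statistics.median(data)

# Mean deviation: average of the absolute deviations from a central value
md_about_mean = sum(abs(x - mean) for x in data) / len(data)
md_about_median = sum(abs(x - median) for x in data) / len(data)

print(f"Mean deviation about the mean:   {md_about_mean:.2f}")
print(f"Mean deviation about the median: {md_about_median:.2f}")
```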
Standard Deviation (σ or s)
Standard Deviation (SD) is a statistical measure of dispersion that indicates how much
individual data points deviate from the mean of a dataset. It is the square root of
variance and provides a clearer understanding of data spread in the same units as the
data.
Explanation:
Standard deviation measures the average spread of data points around the mean.
A higher SD indicates greater variability, while a lower SD suggests data points are
closer to the mean.
Characteristics:
Always non-negative.
Expressed in the same units as the original data.
Sensitive to outliers due to squaring in variance calculation.
Applications:
Used in finance to measure market volatility.
Applied in quality control to assess product consistency.
Central to normal distribution and statistical inference.
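A brief sketch (again with invented data) showing that the standard deviation is just the
square root of the variance, in the original units:
```python
import math
import statistics

# Illustrative data only
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population SD (divides by n) and sample SD (divides by n - 1)
pop_sd = statistics.pstdev(data)
samp_sd = statistics.stdev(data)

# SD is the square root of the corresponding variance
assert math.isclose(pop_sd, math.sqrt(statistics.pvariance(data)))

print(f"Population SD: {pop_sd:.4f}")
print(f"Sample SD:     {samp_sd:.4f}")
```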
Coefficient of Variation (CV)
The Coefficient of Variation (CV) is a statistical measure of relative dispersion that
expresses the standard deviation as a percentage of the mean. It is useful for comparing the
variability of datasets with different units or scales.
Explanation:
CV measures the extent of variability in relation to the mean of the dataset.
A higher CV indicates greater relative variability, while a lower CV suggests less
variability compared to the mean.
Characteristics:
Dimensionless: Expressed as a percentage, making it suitable for comparing datasets
with different units.
Sensitive to the Mean: If the mean is close to zero, CV can become misleadingly high.
Commonly used when the scale of measurement differs across datasets.
Applications:
Finance: Comparing the risk (volatility) of different investments.
Quality Control: Assessing consistency in production processes.
Biology and Medicine: Comparing variability in biological measurements.
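The sketch below illustrates why the CV is useful across different units; the two datasets are
invented, and the sample standard deviation is used here (the population SD is an equally
common choice):
```python
import statistics

# Illustrative data only: two quantities measured in different units
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [55, 60, 70, 80, 95]

def coefficient_of_variation(data):
    """CV = (standard deviation / mean) * 100; assumes the mean is not near zero."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# Because CV is a percentage, the two spreads are directly comparable
print(f"CV of heights: {coefficient_of_variation(heights_cm):.1f}%")
print(f"CV of weights: {coefficient_of_variation(weights_kg):.1f}%")
```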
Pearson's Product Moment Correlation (r)
The Product Moment Correlation (r) measures the linear relationship between two
continuous variables. It quantifies the degree to which changes in one variable correspond with
changes in another.
Limitations:
Only measures linear relationships.
Sensitive to outliers.
Does not imply causation (correlation ≠ causation).
Pearson's Correlation Coefficient is a powerful tool for measuring the strength and
direction of linear relationships between two continuous variables but must be used carefully
with respect to its assumptions and limitations.
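As a hedged sketch, r can be computed with SciPy's scipy.stats.pearsonr; the paired data
below are invented for illustration:
```python
from scipy.stats import pearsonr

# Illustrative data only: hours studied vs. exam score
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]

# pearsonr returns the correlation coefficient r and a two-sided p-value
r, p_value = pearsonr(hours, scores)

print(f"r = {r:.3f}, p = {p_value:.4f}")
```
A value of r near +1 or -1 indicates a strong linear relationship; values near 0 indicate
little or no linear relationship.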
Test of Significance:
Test of Significance is a statistical technique used to determine whether the results of a
sample data set are statistically significant or if they could have occurred by random chance.
These tests help in making inferences about population parameters based on sample data.
The most commonly used tests of significance are the T-test, F-test, Z-test, and ANOVA.
1. T-test (Student’s t-test)
The T-test is a statistical test used to determine whether there is a significant difference
between the means of two groups or between a sample mean and a population mean. It
is especially useful when the sample size is small (less than 30) and the population
variance is unknown. The T-test helps to make inferences about the population from the
sample data, providing a way to test hypotheses and draw conclusions based on sample
data.
Types of T-tests
There are three main types of T-tests, each suited to different research questions and
data structures:
One-Sample T-test:
o Purpose: Compares the mean of a single sample to a known population mean.
o Example: Testing whether the average weight of a sample of apples is different
from a known population average.
Independent Two-Sample T-test:
o Purpose: Compares the means of two independent groups to determine if they
are significantly different from each other.
o Example: Comparing the average test scores of two different classes of students.
Paired Sample T-test (Dependent T-test):
o Purpose: Compares the means of two related groups, such as measurements
taken before and after a treatment on the same group of subjects.
o Example: Comparing the blood pressure of patients before and after taking a
particular medication.
o Another example: Evaluating the effect of a diet on weight loss before and
after an intervention.
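A minimal sketch of all three variants, assuming SciPy is available; the scipy.stats calls
are real functions, but the data values are invented for illustration:
```python
from scipy import stats

# Illustrative data only
sample = [12.1, 11.8, 12.4, 12.0, 11.7, 12.3]
group_a = [85, 90, 78, 92, 88]
group_b = [80, 83, 79, 85, 81]
before = [140, 150, 145, 155, 148]
after = [135, 145, 140, 150, 142]

# One-sample t-test: sample mean vs. a hypothesised population mean of 12.0
t1, p1 = stats.ttest_1samp(sample, popmean=12.0)

# Independent two-sample t-test: means of two unrelated groups
t2, p2 = stats.ttest_ind(group_a, group_b)

# Paired t-test: before/after measurements on the same subjects
t3, p3 = stats.ttest_rel(before, after)

print(f"one-sample:  t = {t1:.3f}, p = {p1:.4f}")
print(f"independent: t = {t2:.3f}, p = {p2:.4f}")
print(f"paired:      t = {t3:.3f}, p = {p3:.4f}")
```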
Advantages:
Simple and easy to compute.
Can be used to make inferences about population means based on small samples.
Limitations:
Assumes the data are approximately normally distributed, which may not hold for small
samples.
Sensitive to outliers, which can distort the sample mean and variance.
Compares only two means at a time; running many pairwise T-tests inflates the risk of
Type I errors.
The T-test is an essential statistical tool used to compare means and assess the
significance of differences in various scenarios, especially when sample sizes are small. By
helping researchers and analysts determine whether observed differences are statistically
meaningful, the T-test serves as a fundamental test in hypothesis testing, supporting data-driven
decisions in fields ranging from healthcare and social sciences to business and engineering.
2. F-Test
The F-test is a statistical test used to compare the variances of two or more groups to
determine if they are significantly different from each other. It is commonly used in the
context of Analysis of Variance (ANOVA), but it can also be applied to test hypotheses
about the equality of variances in two independent samples.
The F-test is a crucial statistical test used for comparing variances and testing the
significance of group differences, particularly in the context of ANOVA. It helps researchers
determine whether observed data suggest that the means of different groups are significantly
different or if such differences could have occurred by random chance. Despite its power, the
F-test requires careful attention to the underlying assumptions, especially normality and
homogeneity of variances.
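As a hedged illustration of the two-sample case, the F statistic can be computed directly as a
ratio of sample variances and referred to the F distribution via scipy.stats.f; the data are
invented:
```python
import numpy as np
from scipy.stats import f

# Illustrative data only: two independent samples
sample1 = np.array([23.0, 25.1, 24.3, 26.7, 25.5, 24.9])
sample2 = np.array([22.8, 23.1, 23.0, 23.4, 22.9, 23.2])

var1 = sample1.var(ddof=1)  # sample variance (ddof=1 divides by n - 1)
var2 = sample2.var(ddof=1)

# Put the larger variance in the numerator so that F >= 1
if var1 >= var2:
    F, dfn, dfd = var1 / var2, len(sample1) - 1, len(sample2) - 1
else:
    F, dfn, dfd = var2 / var1, len(sample2) - 1, len(sample1) - 1

# Two-sided p-value: twice the upper-tail probability, capped at 1
p_value = min(1.0, 2 * f.sf(F, dfn, dfd))

print(f"F = {F:.3f}, p = {p_value:.4f}")
```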
3. Z-Test
The Z-test is a statistical test used to determine if there is a significant difference
between sample data and population data, or between two sample means, when the population
variance is known or the sample size is large (typically n > 30). It is one of the most widely
used tests in hypothesis testing and is often employed in scenarios involving large sample sizes
or when the population standard deviation is known.
Types of Z-tests:
1. One-Sample Z-test:
o Purpose: Used to compare the mean of a sample to the population mean when
the population standard deviation is known.
o Example: Testing if the average height of a group of individuals differs from a
known population average.
2. Two-Sample Z-test:
o Purpose: Used to compare the means of two independent samples when the
population standard deviation is known and the sample size is large.
o Example: Comparing the average income between two different regions.
3. Z-test for Proportions:
o Purpose: Used to compare sample proportions to population proportions, or to
compare proportions from two different samples.
o Example: Testing if the proportion of voters supporting a candidate differs from
a hypothesized proportion.
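A minimal sketch of the one-sample case, computed directly from the definition of the Z
statistic using scipy.stats.norm; all numbers are invented for illustration:
```python
import math
from scipy.stats import norm

# Illustrative values only
sample_mean = 52.3   # observed sample mean
pop_mean = 50.0      # hypothesised population mean
pop_sd = 8.0         # known population standard deviation
n = 64               # large sample size

# z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

# Two-sided p-value from the standard normal distribution
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.3f}, p = {p_value:.4f}")
```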
The Z-test is a powerful tool for hypothesis testing, particularly when dealing with
large sample sizes or when the population standard deviation is known. It is widely used in
various fields, including business, healthcare, and social sciences, to compare sample data to
population data or compare means and proportions between two independent groups. However,
it is essential to ensure that its assumptions are met, particularly regarding the sample size and
the known population standard deviation. When these conditions are not met, alternative tests
like the t-test may be more appropriate.
4. ANOVA (Analysis of Variance)
ANOVA is a statistical test used to compare the means of three or more groups by
analyzing the variation within groups and between groups.
Purpose of ANOVA:
ANOVA is primarily used to:
1. Compare means of multiple groups: It helps to test whether at least one group mean
differs from the others.
2. Identify factors influencing variation: It helps identify factors that explain variability
in data (e.g., different treatments or conditions).
3. Reduce the risk of Type I errors: It prevents multiple pairwise t-tests from inflating
the risk of Type I errors (false positives).
Types of ANOVA:
1. One-Way ANOVA:
o Purpose: Compares the means of three or more independent groups based on
one factor (independent variable).
o Example: Testing the effect of different diets on weight loss, where the factor
is the type of diet.
2. Two-Way ANOVA:
o Purpose: Compares means across two independent variables (factors) and their
interaction.
o Example: Testing the effect of diet type and exercise on weight loss, where both
factors (diet and exercise) are considered.
3. Repeated Measures ANOVA:
o Purpose: Used when the same subjects are tested multiple times under different
conditions (i.e., within-subjects design).
o Example: Testing the effect of a drug on blood pressure at different time points.
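A minimal sketch of a one-way ANOVA using scipy.stats.f_oneway, mirroring the diet
example above; the weight-loss figures are invented:
```python
from scipy.stats import f_oneway

# Illustrative data only: weight loss (kg) under three diets
diet_a = [3.1, 2.8, 3.5, 3.0, 2.9]
diet_b = [2.0, 2.4, 1.9, 2.2, 2.1]
diet_c = [3.8, 4.0, 3.6, 4.2, 3.9]

# One-way ANOVA: does at least one group mean differ from the others?
F, p_value = f_oneway(diet_a, diet_b, diet_c)

print(f"F = {F:.3f}, p = {p_value:.4f}")
```
A small p-value says only that at least one mean differs; identifying which groups differ
requires a follow-up (post hoc) comparison.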
Applications of ANOVA:
ANOVA is a powerful statistical tool for testing differences in means across multiple
groups. It is commonly used in experimental and observational studies to determine whether
treatments or factors have a significant effect. By comparing the variances within and between
groups, ANOVA allows researchers to draw conclusions about the population from the sample
data. However, its assumptions must be checked, and when necessary, modifications like
Welch’s ANOVA can be applied.
Coefficient of Determination (R²)
The coefficient of determination (R²) is a statistical measure that indicates the
proportion of the variance in the dependent variable that is explained by the independent
variables in a regression model. It provides insight into the goodness of fit of a model and helps
to understand how well the regression model represents the data.
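A minimal NumPy sketch that computes R² directly from its definition, R² = 1 - SS_res /
SS_tot, on invented data:
```python
import numpy as np

# Illustrative data only
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Fit a straight line and compute its predictions
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

ss_res = np.sum((y - y_pred) ** 2)    # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation about the mean

r_squared = 1 - ss_res / ss_tot
print(f"R² = {r_squared:.4f}")
```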
Linear Regression
Linear regression is a statistical method for modeling the relationship between a
dependent variable (often called the response or target variable) and one or more independent
variables (predictors or features). It assumes that there is a linear relationship between the
dependent and independent variables.
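As a hedged sketch of simple (one-predictor) linear regression, scipy.stats.linregress fits
the line and reports the fit statistics; the data below are invented:
```python
from scipy.stats import linregress

# Illustrative data only: advertising spend vs. sales
spend = [10, 20, 30, 40, 50]
sales = [25, 44, 58, 81, 99]

result = linregress(spend, sales)

# Fitted line: sales ≈ slope * spend + intercept
print(f"slope = {result.slope:.3f}, intercept = {result.intercept:.3f}")
print(f"R² = {result.rvalue ** 2:.4f}, p = {result.pvalue:.4f}")
```
The rvalue returned by linregress is Pearson's r for the fitted line, so squaring it gives
the coefficient of determination discussed above.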