STATS 201
2. Frequency Distribution
● Definition: A table or graph that shows the number of times each value or group
of values occurs in a dataset.
● Types:
○ Ungrouped: Lists each distinct value and its frequency.
○ Grouped: Organizes data into class intervals and shows the frequency within
each interval. Includes class limits, class boundaries, class width, and class
midpoint.
○ Relative Frequency: Proportion of observations in each category/interval
(frequency / total number of observations).
○ Cumulative Frequency: Total number of observations up to and including a
specific category/interval.
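The definitions above can be sketched in a few lines of Python. This is a minimal illustration (the function name `frequency_table` and the sample data are hypothetical, not from the course material): it builds an ungrouped table with frequency, relative frequency (frequency / total), and cumulative frequency for each distinct value.

```python
from collections import Counter

def frequency_table(data):
    """Ungrouped frequency distribution: for each distinct value, return
    (value, frequency, relative frequency, cumulative frequency)."""
    counts = Counter(data)
    total = len(data)
    table = []
    cumulative = 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq  # running total up to and including this value
        table.append((value, freq, freq / total, cumulative))
    return table

# Example: 8 observations
scores = [3, 1, 2, 3, 3, 2, 1, 3]
for value, f, rel, cum in frequency_table(scores):
    print(value, f, rel, cum)
```

Note that the relative frequencies sum to 1 and the final cumulative frequency equals the sample size, which is a quick consistency check for any frequency table.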
3. Graphical Representation
● Purpose: To visually summarize and present data, making patterns and trends
easier to understand.
● Types:
○ Categorical Data:
■ Bar Chart: Compares frequencies of different categories (bars don't
touch).
■ Pie Chart: Shows proportions of different categories as slices of a circle.
○ Numerical Data:
■ Histogram: Represents the frequency distribution of continuous data (bars
touch).
■ Frequency Polygon: Line graph connecting the midpoints of the tops of
histogram bars.
■ Ogive (Cumulative Frequency Curve): Line graph showing cumulative
frequencies.
■ Scatter Plot: Shows the relationship between two quantitative variables.
■ Box Plot (Box and Whisker Plot): Displays the distribution of data based on
quartiles, median, and outliers.
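As a rough text-only sketch of how a histogram summarizes grouped numerical data, the snippet below counts observations per class interval and draws each bar with `#` characters (the function names and data are illustrative assumptions; a plotting library such as matplotlib would normally be used instead).

```python
def grouped_frequencies(data, low, width, n_classes):
    """Count observations falling in each class interval
    [low + i*width, low + (i+1)*width), for i = 0 .. n_classes-1."""
    freqs = [0] * n_classes
    for x in data:
        i = int((x - low) // width)
        if 0 <= i < n_classes:
            freqs[i] += 1
    return freqs

def ascii_histogram(freqs, low, width):
    """Render one text 'bar' per class interval (bars touch, as in a histogram)."""
    lines = []
    for i, f in enumerate(freqs):
        lo, hi = low + i * width, low + (i + 1) * width
        lines.append(f"{lo:5.1f}-{hi:5.1f} | {'#' * f}")
    return "\n".join(lines)

data = [12, 15, 17, 21, 22, 22, 25, 28, 31, 34]
print(ascii_histogram(grouped_frequencies(data, low=10, width=5, n_classes=5), 10, 5))
```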
5. Binomial Distribution
● Definition: A discrete probability distribution that describes the probability of
obtaining a certain number of successes in a fixed number of independent
Bernoulli trials (experiments with only two possible outcomes: success or failure).
● Conditions (BINS):
○ Binary outcome (success or failure).
○ Independent trials.
○ Number of trials is fixed (n).
○ Same probability of success (p) for each trial.
● Probability Mass Function: P(X=k) = C(n,k) · p^k · (1−p)^(n−k), where:
○ n = number of trials
○ k = number of successes
○ p = probability of success in a single trial
○ C(n,k) = n! / (k!(n−k)!) (binomial coefficient)
● Mean: μ=np
● Variance: σ2=np(1−p)
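The probability mass function, mean, and variance can be computed directly with the standard library (`math.comb` gives the binomial coefficient C(n,k)); the function name `binomial_pmf` and the n = 10, p = 0.5 example are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
mean = n * p                 # mu = np
variance = n * p * (1 - p)   # sigma^2 = np(1 - p)

# Probability of exactly 5 successes in 10 fair coin flips
print(binomial_pmf(5, n, p))  # → 0.24609375
```

Summing the PMF over k = 0, …, n gives 1, a useful sanity check that the distribution is valid.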
6. Hypothesis Testing
● Definition: A formal procedure used to determine whether there is enough
statistical evidence to reject a null hypothesis in favor of an alternative
hypothesis.
● Hypotheses:
○ Null Hypothesis (H0): A statement of no effect or no difference (the status
quo).
○ Alternative Hypothesis (H1 or Ha): A statement that contradicts the null
hypothesis (what the researcher wants to find evidence for). Can be
one-tailed (directional) or two-tailed (non-directional).
● Types of Errors:
○ Type I Error (False Positive, α): Rejecting a true null hypothesis. The
probability of making a Type I error is the significance level (α).
○ Type II Error (False Negative, β): Failing to reject a false null hypothesis. The
power of a test is 1−β, the probability of correctly rejecting a false null
hypothesis.
● Significance Level (α): The probability of rejecting the null hypothesis when it is
true (commonly 0.05).
● P-value: The probability of obtaining test results at least as extreme as the
observed results, assuming the null hypothesis is true. If the p-value is less than
α, we reject H0.
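As a worked example of the decision rule, the sketch below runs a two-tailed z-test (chosen here instead of a t-test so that only the standard library is needed; it assumes the population standard deviation is known). The function name `z_test_p_value` and the sample numbers are illustrative assumptions, not from the course material.

```python
from math import erf, sqrt

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-tailed p-value for H0: mu = mu0, assuming known population sigma.
    Uses the standard normal CDF, Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # P(|Z| >= |z|) under the null distribution
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# H0: mu = 50 vs H1: mu != 50, with sample mean 52, sigma 5, n = 30
p = z_test_p_value(52.0, 50.0, 5.0, 30)
alpha = 0.05
print(p, "reject H0" if p < alpha else "fail to reject H0")
```

Here p ≈ 0.028 < α = 0.05, so H0 is rejected; with a sample mean equal to μ0 the p-value would be 1 and H0 would not be rejected.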
● Parametric Tests: Statistical tests that assume the data follows a specific
distribution (usually normal) and make assumptions about population parameters.
Examples: t-tests, ANOVA, Pearson correlation.
● Non-Parametric Tests: Statistical tests that do not rely on specific distributional
assumptions. Used when data is not normally distributed or is ordinal/nominal.
Examples: Chi-square tests, Mann-Whitney U test, Kruskal-Wallis test, Spearman
correlation.
7. Correlation
● Definition: A statistical measure that describes the extent to which two or more
variables fluctuate together. It indicates the strength and direction of a linear
relationship.
● Types:
○ Positive Correlation: Both variables increase or decrease together (e.g.,
height and weight).
○ Negative Correlation: As one variable increases, the other decreases (e.g.,
study time and exam anxiety).
○ No Correlation: No linear relationship between the variables.
● Measures:
○ Pearson Correlation Coefficient (r): Measures the strength and direction of
a linear relationship between two continuous variables. Ranges from -1 to +1.
■ r=+1: Perfect positive correlation
■ r=−1: Perfect negative correlation
■ r=0: No linear correlation
○ Spearman's Rank Correlation Coefficient (ρ): Measures the strength and
direction of a monotonic relationship (not necessarily linear) between two
ordinal or continuous variables after ranking them.
● Advantages: Helps identify relationships between variables, useful for prediction
(in conjunction with regression).
● Importance: Provides insights into how variables are associated, guides further
research.
● Limitations: Correlation does not imply causation! Can be affected by outliers.
Only measures linear (Pearson) or monotonic (Spearman) relationships.
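The Pearson coefficient can be computed from its definition, r = Σ(x−x̄)(y−ȳ) / (√Σ(x−x̄)² · √Σ(y−ȳ)²). A minimal pure-Python sketch (the function name `pearson_r` and the sample data are illustrative assumptions):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # numerator
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # perfectly linear increasing relationship
print(pearson_r(x, y))  # → 1.0 (perfect positive correlation)
```

Reversing y gives r = −1 (perfect negative correlation); Spearman's ρ is the same formula applied to the ranks of x and y rather than the raw values.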
8. Regression
● Definition: A statistical method used to model the relationship between a
dependent variable (outcome) and one or more independent variables
(predictors). It aims to predict the value of the dependent variable based on the
values of the independent variables.
● Types:
○ Simple Linear Regression: One independent variable predicts a dependent
variable. The model is a straight line: Y=a+bX, where:
■ Y = dependent variable
■ X = independent variable
■ a = y-intercept (value of Y when X=0)
■ b = slope (change in Y for a one-unit change in X)
○ Multiple Linear Regression: Two or more independent variables predict a
dependent variable.
● Purpose: Prediction, explanation of relationships between variables.
● Assumptions of Linear Regression: Linearity, independence of errors,
homoscedasticity (constant variance of errors), normality of errors.
● R-squared (R2): Coefficient of determination, represents the proportion of the
variance in the dependent variable that is predictable from the independent
variable(s). Ranges from 0 to 1.
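Simple linear regression can be fitted with the least-squares formulas b = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)² and a = ȳ − b·x̄, with R² = 1 − SS_res/SS_tot. A minimal sketch (the function name and data are illustrative assumptions):

```python
def simple_linear_regression(x, y):
    """Least-squares fit of Y = a + bX; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx                      # slope: change in Y per unit change in X
    a = my - b * mx                    # intercept: Y when X = 0
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot           # coefficient of determination
    return a, b, r2

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]   # exactly y = 1 + 2x, so a = 1, b = 2, R^2 = 1
print(simple_linear_regression(x, y))
```

For a perfectly linear dataset R² = 1; with noisy data R² falls below 1, reflecting the share of variance left unexplained by the line.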