Chapter-Summary of Bacal
• A random variable is a function that associates a real number with each element in the sample space.
It is a variable whose values are determined by chance.
• A random variable is a discrete random variable if its set of possible outcomes is countable. Mostly,
discrete random variables represent count data, such as the number of defective chairs produced in
a factory.
• A random variable is a continuous random variable if it takes on values on a continuous scale. Often,
continuous random variables represent measured data, such as height, weight, and temperature.
• A discrete probability distribution or a probability mass function consists of the values a random
variable can assume and the corresponding probabilities of the values.
• Formula for the Variance and Standard Deviation of a Discrete Probability Distribution
The variance of a discrete probability distribution is given by the formula:
$\sigma^2 = \sum (X - \mu)^2 \cdot P(X)$
The standard deviation of a discrete probability distribution is given by the formula:
$\sigma = \sqrt{\sum X^2 \cdot P(X) - \mu^2}$
where:
X = value of the random variable
P(X) = probability of the random variable X
𝜇 = mean of the probability distribution
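As a small illustration of the variance and standard deviation formulas, the following Python sketch computes both for a discrete distribution. The distribution (number of heads in two tosses of a fair coin) is an assumed example, not data from the text, and the mean is taken as the sum of X · P(X).

```python
import math

# Assumed example: X = number of heads in two tosses of a fair coin.
distribution = {0: 0.25, 1: 0.50, 2: 0.25}  # value of X -> P(X)

# Mean: mu = sum of X * P(X)
mu = sum(x * p for x, p in distribution.items())

# Variance, definitional form: sum of (X - mu)^2 * P(X)
variance = sum((x - mu) ** 2 * p for x, p in distribution.items())

# Variance, computational form: sum of X^2 * P(X) - mu^2
variance_alt = sum(x ** 2 * p for x, p in distribution.items()) - mu ** 2

# Standard deviation is the square root of the variance
sigma = math.sqrt(variance)

print(mu, variance, variance_alt, round(sigma, 4))
```

Both forms of the variance give the same value, which is why the standard deviation may be written with either one under the square root.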
Chapter II
NORMAL DISTRIBUTION
• A standard normal curve is a normal probability distribution that has a mean $\mu = 0$ and a standard
deviation $\sigma = 1$.
• Descriptive measures computed from a population are called parameters while descriptive measures
computed from a sample are called statistics.
• The number of samples of size n that can be drawn from a population of size N is given by the combination ${}_{N}C_{n}$.
• A sampling distribution of sample means is a frequency distribution using the means computed from
all possible random samples of a specific size taken from a population.
• The probability distribution of the sample means is also called the sampling distribution of the sample
means.
• The standard deviation of the sampling distribution of the sample means is also known as the standard
error of the mean. It measures the degree of accuracy of the sample mean ($\bar{X}$) as an estimate of the
population mean ($\mu$).
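• Steps in constructing a sampling distribution of the sample means:
1. Determine the number of possible samples that can be drawn from the population using ${}_{N}C_{n}$,
where N = size of the population
n = size of the sample
2. List all the possible samples and compute the mean of each sample.
3. Construct a frequency distribution of the sample means obtained in Step 2.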
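The procedure for building a sampling distribution of sample means (count the samples with NCn, list them and compute each mean, then tabulate the means) can be sketched in Python as follows. The population values and the sample size are assumptions chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb
from statistics import mean

population = [1, 2, 3, 4, 5]  # assumed population, N = 5
n = 2                         # assumed sample size

# Count the possible samples of size n: NCn
num_samples = comb(len(population), n)

# List all possible samples (without replacement) and compute each mean
samples = list(combinations(population, n))
sample_means = [mean(s) for s in samples]

# Frequency distribution of the sample means
frequency = Counter(sample_means)

# Print each sample mean, its frequency, and its probability
for value in sorted(frequency):
    print(value, frequency[value], frequency[value] / num_samples)
```

Dividing each frequency by the number of samples turns the frequency distribution into the probability distribution of the sample means.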
• An estimate is a value that approximates a parameter. It is based on sample statistics computed from
sample data.
• In Statistics, estimation is the process of determining the values of parameters.
• The confidence level of an interval estimate of a parameter is the probability that the interval
estimate contains the parameter. It describes what percentage of intervals from many different
samples contains the unknown population parameter.
• Variance ($s^2$): $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$
• Standard deviation ($s$): $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$
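A short Python sketch of the sample variance and sample standard deviation, using the n − 1 divisor, is given here. The data values are assumed for illustration; the standard library functions `statistics.variance` and `statistics.stdev` use the same divisor and serve as a cross-check.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]  # assumed sample data
n = len(data)
x_bar = sum(data) / n   # sample mean

# Sample variance: sum of squared deviations divided by n - 1
s_squared = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Sample standard deviation: square root of the sample variance
s = math.sqrt(s_squared)

print(s_squared, round(s, 4))
print(statistics.variance(data), round(statistics.stdev(data), 4))
```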
• An interval estimate, called a confidence interval, is a range of values that is used to estimate a
parameter. This estimate may or may not contain the true parameter value.
• General formula for confidence intervals for large samples:
$\bar{X} - z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right) < \mu < \bar{X} + z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$
• Computing formula for the margin of error E:
$E = z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right) \approx z_{\alpha/2}\left(\dfrac{s}{\sqrt{n}}\right)$
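The large-sample confidence interval for the mean and its margin of error E can be sketched in Python as shown here. The function name, the default 95% confidence level, and the sample figures are assumptions for illustration; the critical value is obtained from `statistics.NormalDist` rather than from a z-table.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Return (lower, upper, E) for a large-sample z interval for the mean."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_(alpha/2)
    E = z * sigma / sqrt(n)                  # margin of error
    return x_bar - E, x_bar + E, E

# Assumed figures: sample mean 50, sigma 10, n = 100
lower, upper, E = mean_confidence_interval(x_bar=50, sigma=10, n=100)
print(round(lower, 2), round(upper, 2), round(E, 2))
```

When the population standard deviation is unknown and the sample is large, the sample standard deviation s may be passed in place of sigma, as the approximation in the formula for E indicates.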
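• Confidence interval for a population proportion:
$\hat{p} - z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$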
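For the interval estimate of a population proportion, a minimal Python sketch is given here. It assumes the usual definitions of the sample proportion (successes divided by n) and its complement (one minus the sample proportion); the function name and the survey figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (lower, upper) for a large-sample interval for a proportion."""
    p_hat = successes / n          # sample proportion
    q_hat = 1 - p_hat              # complement of the sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    E = z * sqrt(p_hat * q_hat / n)
    return p_hat - E, p_hat + E

# Assumed survey: 120 favorable responses out of 400
lower, upper = proportion_confidence_interval(successes=120, n=400)
print(round(lower, 4), round(upper, 4))
```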
• Formula for determining the minimum sample size needed when estimating the population mean:
$n = \left(\dfrac{z_{\alpha/2} \cdot \sigma}{E}\right)^2$
• Formula for determining the minimum sample size needed when estimating the population
proportion:
$n = \hat{p}\hat{q}\left(\dfrac{z_{\alpha/2}}{E}\right)^2$
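The two minimum sample size formulas can be sketched in Python as follows. Results are rounded up, since a sample size must be a whole number. The function names are hypothetical, and the default sample proportion of 0.5 is an assumption commonly made when no prior estimate of the proportion is available.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, E, confidence=0.95):
    """Minimum n for estimating a mean within margin of error E."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / E) ** 2)

def sample_size_for_proportion(E, p_hat=0.5, confidence=0.95):
    """Minimum n for estimating a proportion within margin of error E."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    q_hat = 1 - p_hat
    return ceil(p_hat * q_hat * (z / E) ** 2)

# Assumed inputs for illustration
n_mean = sample_size_for_mean(sigma=15, E=3)
n_prop = sample_size_for_proportion(E=0.05)
print(n_mean, n_prop)
```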
Chapter V
CONDUCTING HYPOTHESIS TESTING
• Hypothesis testing is a decision-making process for evaluating claims about a population based on
the characteristics of a sample purportedly coming from that population. The decision is whether the
characteristic is acceptable or not.
• The null hypothesis, denoted by $H_0$, is a statement that there is no difference between a parameter
and a specific value, or that there is no difference between two parameters.
• The alternative hypothesis, denoted by $H_1$, is a statement that there is a difference between a
parameter and a specific value, or that there is a difference between two parameters.
• When the alternative hypothesis utilizes the ≠ symbol, the test is said to be non-directional.
• When the alternative hypothesis utilizes the > or the < symbol, the test is said to be directional.
• A non-directional test is also called a two-tailed test.
• A directional test may either be left-tailed or right-tailed.
• The rejection region refers to the region where the value of the test statistic lies for which we will
reject the null hypothesis. This region is also called the critical region.
• The probability of committing a Type I error is called the significance level ($\alpha$) of a test.
• For any hypothesis test, the p-value is the probability, computed assuming the null hypothesis is true, of
obtaining a test statistic at least as extreme as the one observed. It is compared with the significance level.
• Test statistic: $z = \dfrac{\bar{X} - \mu}{\sigma_{\bar{X}}}$, where $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
• Under the normal curve, the rejection region refers to the region where the value of the test statistic
lies for which we will reject the null hypothesis. This region is also called the critical region.
• If the test is one-tailed, the p-value is equal to the area beyond z in the same direction as the
alternative hypothesis.
• Decision rule for the p-value approach:
o Reject H0 if $p \le \alpha$.
o Do not reject H0 if $p > \alpha$.
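The z test for a mean and the p-value decision rule can be sketched in Python as follows. The function name, the `alternative` labels, and the sample figures are assumptions for illustration. For a two-tailed test the sketch doubles the one-tail area beyond |z|, which is the standard convention.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(x_bar, mu_0, sigma, n, alternative="two-sided", alpha=0.05):
    """Return (z, p_value, reject_H0) for a one-sample z test of the mean."""
    standard_error = sigma / sqrt(n)
    z = (x_bar - mu_0) / standard_error
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":      # right-tailed: area to the right of z
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":       # left-tailed: area to the left of z
        p_value = std_normal.cdf(z)
    else:                             # two-tailed: both tails beyond |z|
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value, p_value <= alpha

# Assumed figures: sample mean 52, hypothesized mean 50, sigma 10, n = 100
z, p_value, reject = z_test_mean(x_bar=52, mu_0=50, sigma=10, n=100)
print(round(z, 2), round(p_value, 4), reject)
```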
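• Test statistic: $z = \dfrac{\text{Sample proportion} - \text{Null hypothesized proportion}}{\text{Standard deviation of sample proportion}}$
$z = \dfrac{\hat{p} - p_0}{\sigma_{\hat{p}}}$, modified into $z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}}$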
o For a one-tailed test:
H0: $p = p_0$
H1: $p > p_0$ and the rejection region is $z > z_{\alpha}$,
or H1: $p < p_0$ and the rejection region is $z < -z_{\alpha}$.
o For a two-tailed test:
H0: $p = p_0$
H1: $p \neq p_0$
The rejection region is $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$.
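The test of a population proportion with the rejection-region rules can be sketched in Python as follows. The function name, the `alternative` labels, and the data are hypothetical; the complement of the hypothesized proportion is taken as one minus that proportion, and critical values come from `statistics.NormalDist` instead of a z-table.

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(successes, n, p0, alternative="two-sided", alpha=0.05):
    """Return (z, reject_H0) using the critical-value (rejection region) rule."""
    p_hat = successes / n
    q0 = 1 - p0
    z = (p_hat - p0) / sqrt(p0 * q0 / n)
    std_normal = NormalDist()
    if alternative == "greater":      # H1: p > p0, reject if z > z_alpha
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":       # H1: p < p0, reject if z < -z_alpha
        reject = z < -std_normal.inv_cdf(1 - alpha)
    else:                             # H1: p != p0, reject if |z| > z_(alpha/2)
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    return z, reject

# Assumed data: 60 successes in 100 trials, testing H0: p = 0.5
z, reject = z_test_proportion(successes=60, n=100, p0=0.5)
print(round(z, 2), reject)
```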
Chapter VI
CORRELATION AND REGRESSION ANALYSIS
• Scatterplot
1. A scatterplot is the point-graph of all the scores taken from bivariate data.
2. The trend line is the line “closest to the points”.
3. In interpreting a scatterplot, the strength and direction of association are considered.
✓ If the points are arranged in a “thinner” line, the strength of association is greater.
✓ If the trend of the points rises toward the right, the direction of association
is positive; if it falls toward the right, the direction is negative.
✓ When the direction of association between two variables is positive, it means that when
the value of one variable is high, the other variable is also high, or when the value of one is
low, the other is also low.
✓ When the direction is negative, it means that if the value of one variable is low, the other is high,
or vice versa.
• Pearson r
1. The Pearson r is the statistic most commonly used to measure the strength of correlation or
association between two variables. To compute r, we use the formula:
$r = \dfrac{n \sum XY - \sum X \cdot \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$
2. The value of the computed r ranges from -1 to +1. If the value is negative, the direction of
correlation is negative, and we conclude that the two variables are negatively correlated. If the value
is positive, then the two variables are positively correlated. If the value of r is zero, then there
is no linear correlation between the two variables.
The closer the value of r is to ±1, the stronger the correlation is; the closer the value is to
0, the weaker the correlation is.
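The computation of the Pearson r from the sums n, ∑X, ∑Y, ∑XY, ∑X², and ∑Y² can be sketched in Python as follows. The function name and the paired data are assumptions for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r computed from the raw-score sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Assumed bivariate data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

A positive result such as this one indicates a positive correlation; points lying exactly on a rising line would give r = 1.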
• Regression and Prediction
1. If two variables are significantly correlated, we can predict the value of the dependent variable if
we know the value of the independent variable. The process is called regression analysis.
2. In regression analysis, the goal is to determine the regression line that is the same as the trend
line. The regression line is used in prediction; hence it is sometimes called the predictor.
3. The regression line has the same form as the slope-intercept equation of a line in algebra. The
regression line is:
$Y = bX + a$, where $b$ is the slope of the line and $a$ is the y-intercept.
$a = \dfrac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n(\sum X^2) - (\sum X)^2}$
$b = \dfrac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$
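The regression line Y = bX + a can be sketched in Python as follows, with the intercept a = [(∑Y)(∑X²) − (∑X)(∑XY)] / [n∑X² − (∑X)²] and the slope b = [n∑XY − (∑X)(∑Y)] / [n∑X² − (∑X)²]. The function name and the data are assumptions for illustration.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line Y = bX + a."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # y-intercept
    b = (n * sum_xy - sum_x * sum_y) / denominator       # slope
    return a, b

# Assumed bivariate data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = regression_line(xs, ys)

# Use the line as a predictor for an assumed new value X = 6
predicted = b * 6 + a
print(round(a, 2), round(b, 2), round(predicted, 2))
```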
GLOSSARY
Arithmetic average. In a set of numerical data, this is the value obtained by adding the values of all cases and then dividing the
sum by the total number of cases.
Central Limit Theorem. Suppose a random sample of n observations is selected from a population with mean $\mu$ and standard deviation $\sigma$.
Then, when n is sufficiently large, the sampling distribution of the sample means will be approximately a normal distribution with mean
$\mu$ and standard deviation $\sigma/\sqrt{n}$. The larger the sample size, the closer the sampling distribution of the means will be to the normal
distribution.
Confidence coefficient. This is a number that is used in determining an estimate of a population parameter. It is also known as
the critical value. It is usually readily available from a table.
Confidence interval. It is a range of values that purportedly contains a population parameter.
Confidence level. This is the confidence statement regarding the interval estimate of the population parameter. It is the
probability that the interval estimate contains the parameter.
Confidence limits. These are the lower and the upper values in a confidence interval. Also known as confidence boundaries.
Continuous random variable. It takes on values on a continuous scale. Often, continuous random variables represent measured
data.
Critical values. These are the confidence coefficients.
Degrees of freedom (df). This is the number of values that are free to vary after a sample statistic has been computed.
Discrete random variable. The set of possible outcomes is countable.
Error. This refers to the difference between an observed value and a parameter.
Estimate. In inferential statistics, it is a value that approximates a population parameter.
Estimation. This is an area of inferential statistics where population values are determined by utilizing standard statistical
procedures.
Hypothesis testing. This is an area of inferential statistics consisting of standard procedures in decision-making for evaluating
claims about a population based on the characteristics of a sample, or samples, purportedly coming from that population.
Interval estimate. This is a range of values that may contain the parameter of a population.
Margin of Error (E). This is the maximum likely difference between the observed sample mean and the true value of the
population mean.
Mean. In a set of data, this is a measure of central tendency or location. This value is used to represent the entire set of data.
Parameter. This is a population value usually denoted by Greek letters.
Population. This is the set of all people, objects, events, or ideas one wishes to investigate.
Point estimate. This is a specific numerical value used to estimate a population parameter.
Proportion. In frequencies obtained from surveys, it is a fraction in which the number of favorable responses is the numerator
and the total number of respondents is the denominator. In general, it is a number obtained when we divide an observed
frequency of a subset by the total number of cases in the set.
P approach. This is a decision-making process where the null hypothesis is evaluated by assuming it to be true and then testing the
reasonableness of this assumption by calculating the probability of getting the results if chance alone is operating.
P-value. This is the obtained probability value when utilizing the p approach.
Random sample. This is a sample obtained from a target population utilizing random sampling techniques.
Random variable. A function that associates a real number to each element in the sample space.
Sample. This is a subgroup of a target population.
Significant difference. This is a difference between two or more values that is too large to be ignored in decision-making.
Marginal differences are usually ignored, and their occurrence is attributed to chance factors.
Standard deviation. This is a measure of dispersion, or spread, in a given set of numerical data.
Standard normal curve. This is a normal probability distribution that has a mean of 0 and a standard deviation of 1.
Statistic. This is a value obtained from sample data.
Statistical hypothesis. This is an assertion or a conjecture about one or more populations.
Test statistic. This refers to the statistical value that is appropriate for a particular analysis. This value is the result of applying a
specific formula.
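T-distribution. This is a probability distribution used in place of the normal distribution when the sample size is small (commonly, less than 30) and the population standard deviation is unknown.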
T-table. This table provides the confidence coefficients or critical values when the t-test statistic is applied.
T-value. This is the statistic resulting from the application of a t-test statistic.
Z-table. The table provides the confidence coefficients or critical values when the z-test statistic is applied.
Z-value. This is the statistic resulting from the application of a z-test statistic.