Educ3063 Notes
INTRODUCTION TO STATISTICS
1.1 Meaning and basic Concepts of Statistics
Learning objectives
What does the word statistics mean? To most people, it suggests numerical facts or data, such
as unemployment figures, farm prices, or the number of marriages and divorces. The most
common definition of the word statistics is as follows:
Statistics is the science of planning, organizing, summarizing, presenting, analyzing,
interpreting, and drawing conclusions based on the data (Triola, 2012).
Importance of statistics: Using statistics has different benefits. Some of them are:
to select an appropriate statistical test
to collect the right kinds of information for analysis
to perform statistical calculations in a straightforward, step-by-step manner
to accurately interpret and present statistical results
to be an intelligent consumer of statistical information
to write up analyses and results in American Psychological Association (APA) style
1.2. Types of Statistics: Descriptive vs. Inferential
Statistics divides into two branches: descriptive statistics and inferential statistics.
Data
are collections of observations (such as measurements, genders, survey responses)
Consist of information coming from observations, counts, measurements, or responses.
Table 1-1 Data Used for Analysis
Data Sets
There are two types of data sets you will use when studying statistics. These data sets are called
populations and samples
Population
The complete collection of all individuals (scores, people, measurements, and so on) to be
studied. The collection is complete in the sense that it includes all of the individuals to be
studied
The collection of all individuals or items under consideration in a statistical study
Complete set of events in which you are interested.
is the collection of all outcomes, responses, measurements, or counts that are of interest
For instance
if we were interested in the stress levels of all adolescent Americans, then the
collection of all adolescent Americans’ stress scores would form a population,
the scores of all morphine-injected mice
the milk production of all cows in the country
The ages at which every girl first began to walk
the stress scores of the sophomore class in Woldia University
The population can range from a relatively small set of numbers, which is easily collected, to an
infinitely large set of numbers, which can never be collected completely. Because the populations
we are interested in are usually quite large, collecting data from every member can be difficult,
so researchers instead collect data from a representative sample taken from the population.
Census
The collection of data from every member of the population
If a researcher studies the whole population, there is no need for sampling to select the research
participants
A census consists of data from an entire population. But, unless a population is small, it is
usually impractical to obtain all the population data. In most studies, information must be
obtained from a sample.
Sample
A sub collection of members selected from a population
A part of the population from which information is obtained
Set of actual observations; subset of a population
N.B - A sample should be representative of a population so that sample data can be used to form
conclusions about that population. Sample data must be collected using an appropriate
method, such as random sampling
Practical example 2- Identify the population and the sample
In a recent survey, 1500 adults in the United States were asked if they thought there was
solid evidence of global warming. Eight hundred fifty-five of the adults said yes.
Solution
The population consists of the responses of all adults in the United States
The sample consists of the responses of the 1500 adults in the United States in
the survey.
Undergraduate  //// /  6
Postgraduate   ///     3
Total                  40
[Bar chart showing the frequency of each educational level: Illiterate, Primary, Secondary/Technique, College, Undergraduate, Postgraduate]
Pie Chart
A pie chart is a disk divided into wedge-shaped pieces proportional to the relative
frequencies of the qualitative data
Another method for organizing and summarizing data is to draw a picture of some kind. The old
saying “a picture is worth a thousand words” has particular relevance in statistics—a graph or
chart of a data set often provides the simplest and most efficient display. Two common methods
for graphically displaying qualitative data are pie charts and bar charts. We begin with pie charts.
To Construct a Pie Chart
Step 1 Obtain a relative-frequency distribution of the data
Step 2 Divide a disk into wedge-shaped pieces proportional to the relative frequencies.
We see that, in this case, we need to divide a disk into three wedge-shaped pieces that
comprise 32.5%, 45.0%, and 22.5% of the disk.
Step 3 Label the slices with the distinct values and their relative frequencies.
Notice that we expressed the relative frequencies as decimals or percentages.
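As an illustration, here is a short Python sketch of these three steps using matplotlib; the category names are hypothetical placeholders, not from the example:

```python
import matplotlib.pyplot as plt

# Relative frequencies from the worked example: 32.5%, 45.0%, 22.5%
labels = ["Category A", "Category B", "Category C"]  # hypothetical names
rel_freqs = [32.5, 45.0, 22.5]

# Steps 2-3: divide the disk into proportional wedges and label each slice
plt.pie(rel_freqs, labels=labels, autopct="%1.1f%%")
plt.title("Pie chart of a relative-frequency distribution")
plt.show()
```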
[Pie chart with slices labeled 8%, 10%, 15%, 18%, 20%, and 30%]
Class width (i) = Range / Number of intervals = 36 / 5 = 7.2 (round up to 8)
To set the lower and upper class boundaries, 0.5 is subtracted from the lower limit and added to
the upper limit of each class interval. Therefore, the class boundaries of the distribution are
organized accordingly.
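A minimal sketch of these two rules in Python, assuming the example's range of 36 and 5 intervals, and a hypothetical class with limits 10 and 17:

```python
import math

# Class width: range divided by the number of intervals, rounded up
data_range = 36
num_intervals = 5
class_width = math.ceil(data_range / num_intervals)   # 7.2 rounds up to 8

# Class boundaries: subtract 0.5 from the lower limit, add 0.5 to the upper
lower_limit, upper_limit = 10, 17                     # hypothetical limits
boundaries = (lower_limit - 0.5, upper_limit + 0.5)   # (9.5, 17.5)
print(class_width, boundaries)
```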
UNIT THREE
3. MEASURES OF CENTRAL TENDENCY
Central tendency is a statistical measure that determines a single value that accurately
describes the center of the distribution and represents the entire distribution of scores.
The goal of central tendency is to identify the single value that is the best representative
for the entire set of data.
Central tendency serves as a descriptive statistic because it allows researchers to describe
or present a set of data in a very simplified, concise form.
Characteristics of a good measure of central tendency
A measure of central tendency is a single value representing a group of values and hence
is supposed to have the following properties.
1. Easy to understand and simple to calculate.
A good measure of central tendency must be easy to comprehend and the procedure involved in
its calculation should be simple.
2. Based on all items
A good average should consider all items in the series.
3. Rigidly defined
A measure of central tendency must be clearly and properly defined. It is better if it is
algebraically defined so that personal bias can be avoided in its calculation.
4. Capable of further algebraic treatment
A good average should be usable for further calculations.
5. Not be unduly affected by extreme values
A good average should not be unduly affected by extreme or extraordinary values in a series.
The most common measures of central tendency are the mean, the median, and the mode.
3.1. The mean
The sum of all the data entries divided by the number of entries
The mean, also known as the arithmetic average, is found by adding the values of the data
and dividing by the total number of values
The mean is the sum of the values, divided by the total number of values.
3.1.1. Properties of Mean
It is simple to understand and easy to calculate
It takes into account all the items of the series
It is rigidly defined and is mathematical in nature
It is relatively stable
It is capable of further algebraic treatment
Mean is the center in balancing the values on either side of it and hence is more typical
The mean is sensitive to the exact value of all the scores in the distribution
The sum of the deviations about the mean equals zero
3.1.2 Computing Means of Ungrouped Data
Mean (x̄) = (sum of all x) / (number of x), i.e. x̄ = Σx / n
Example: The following data represent the ages of 20 students in a statistics class. Calculate the
mean age of the students.
20 20 20 20 20 20 21
21 21 21 22 22 22 23
23 23 23 24 24 65
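A quick check of this example in Python; the mean works out to 23.75, and the single extreme value (65) illustrates how sensitive the mean is to outliers:

```python
ages = [20, 20, 20, 20, 20, 20, 21,
        21, 21, 21, 22, 22, 22, 23,
        23, 23, 23, 24, 24, 65]

# Mean = sum of all x divided by number of x
mean_age = sum(ages) / len(ages)
print(mean_age)  # 23.75
```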
Step 3 -Find the sum of the products of the midpoints and the frequencies.
Step 4 - Find the sum of the frequencies.
Step 5 - Find the mean of the frequency distribution
No.  Class boundaries   f   cf   Midpoint (x)   fx
1    9.5 – 14.5         1   1    12             12
2    14.5 – 19.5        1   2    17             17
3    19.5 – 24.5        2   4    22             44
4    24.5 – 29.5        7   11   27             189
5    29.5 – 34.5        3   14   32             96
6    34.5 – 39.5        2   16   37             74
Mean = Σfx / N = 600 / 20 = 30
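As a sketch, the same Σfx/N computation in Python, with hypothetical midpoints and frequencies chosen so that Σfx = 600 and N = 20, matching the result above:

```python
midpoints   = [20, 30, 40]   # hypothetical class midpoints
frequencies = [5, 10, 5]     # hypothetical frequencies, N = 20

# Grouped mean = Σ(f · x) / N
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # 600
mean = sum_fx / sum(frequencies)
print(mean)  # 30.0
```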
Questions
Calculate the median of the frequency distribution.
There are steps for calculating the median of a frequency distribution:
Step 1: Construct the cumulative frequency column of the distribution.
Step 2: Find (n/2) to identify the median class.
Step 3: In the cumulative frequency column, find the first value greater than (n/2); the
corresponding class interval is called the median class.
Step 4: Calculate the median of the distribution.
Median class: n/2 = 20/2 = 10
Median = L + ((n/2 − m) / f) × c
Where: n = the total number of scores
L = the lower boundary of the median class
m = the cumulative frequency before the median class
f = the frequency of the median class
c = the class width
The value n/2 = 10 lies between the cumulative frequencies 4 and 11. Corresponding to 4, the
'less than' boundary is 24.5, and corresponding to 11 it is 29.5. Therefore the median class is
24.5 – 29.5, and its lower boundary is 24.5. Here L = 24.5, n = 20, f = 7, c = 5, m = 4.
Median = 24.5 + ((10 − 4) / 7) × 5 = 24.5 + (6/7) × 5 = 24.5 + 4.29 = 28.79
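The same calculation expressed as a small Python function, a sketch of the formula above using the example's values:

```python
def grouped_median(L, n, m, f, c):
    """Median of a grouped distribution: L + ((n/2 - m) / f) * c.
    L: lower boundary of the median class; n: total number of scores;
    m: cumulative frequency before the median class;
    f: frequency of the median class; c: class width."""
    return L + ((n / 2 - m) / f) * c

print(grouped_median(L=24.5, n=20, m=4, f=7, c=5))  # ≈ 28.79
```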
3.3 The Mode
Mode is the most frequently occurring value in a data set
The mode is the most frequently occurring category of score.
It is merely the most common score or most frequent category of scores
3.3.1. Properties of mode
can apply the mode to any category of data
The mode is the only measure that applies to nominal (category) data aswell as
numerical score data.
You can have a single number for the mode, no mode, or more than one number.
best average for nominal data
easy to determine
When two data values occur with the same greatest frequency, each one is a mode
and the data set is bimodal.
When more than two data values occur with the same greatest frequency, each is a
mode and the data set is said to be multimodal.
When no data value is repeated, we say that there is no mode.
3.3.2 Computing Mode of Ungrouped Data
Identify the number that occurs most often.
Organize frequency distribution to identify the most frequent score in distribution
For example: 10 11 12 13 15 16 18 19 21 21 26 27 28 31 32. From the above, 21 is the mode
of the data set.
3.3.3 Computing Mode for Grouped Data
Based on the following frequency distribution, answer the questions given below the data.
Questions
Calculate the mode of the frequency distribution.
Step 1: Identify the modal class. The modal class is easier to identify than the median class:
it is simply the class with the highest frequency in the distribution. Here, the modal class is
24.5 to 29.5.
Step 2: Calculate the mode of the distribution
Mode = L + (fs / (fs + fp)) × c
Where: L = the lower boundary of the modal class
fs = the frequency of the class succeeding the modal class
fp = the frequency of the class preceding the modal class
c = the class width
Mode = 24.5 + (3 / (3 + 2)) × 5 = 24.5 + (3/5) × 5 = 24.5 + 3 = 27.5
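A matching Python sketch of the text's mode formula, using the same values:

```python
def grouped_mode(L, fs, fp, c):
    """Mode of a grouped distribution: L + (fs / (fs + fp)) * c.
    L: lower boundary of the modal class; fs: frequency of the class
    succeeding the modal class; fp: frequency of the preceding class;
    c: class width."""
    return L + (fs / (fs + fp)) * c

print(grouped_mode(L=24.5, fs=3, fp=2, c=5))  # 27.5
```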
Class work
UNIT FOUR
4. MEASURES OF DISPERSION/ VARIATION
Measures of variability provide information about the amount of spread or dispersion among the
variables. Range, variance, and standard deviation are the common measures of variability.
4.1 Range, standard deviation and variance
Range
Simply the difference between the largest and smallest values in a set of data
Is considered primitive as it considers only the extreme values which may not be useful
indicators of the bulk of the population.
The formula is - Range = largest observation - smallest observation
is the difference between the largest and the smallest values.
used for ordinal data
Range = the highest – the lowest scores
Standard deviation
Measures the variation of observations from the mean
Is the positive square root of the variance
The most common measure of dispersion
Takes into account every observation
Measures the ‘average deviation’ of observations from the mean
used on ratio or interval data
The standard deviation measures the variation among data values.
Values close together have a small standard deviation, but values with much more
variation have a larger standard deviation.
For many data sets, a value is unusual if it differs from the mean by more than two
standard deviations
Steps in Calculating Standard deviation
For example – The following are assessment scores of students in Abnormal psychology
Then, calculate the variance and standard deviation of the data set
Variance
is the sum of the squared deviations of each value from the mean divided by the number
of observations
mean of squared differences between scores and the mean
used on ratio or interval data
used for advanced statistical analysis
is equal to the average of the squared deviations from the mean of a distribution.
Symbolically, the sample variance is s² and the population variance is σ².
Classwork - Test scores - 6, 3, 8, 5, 3. Find the variance
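One way to check this classwork in Python; the sketch shows both the population variance (divide by n) and the sample variance (divide by n − 1):

```python
scores = [6, 3, 8, 5, 3]
mean = sum(scores) / len(scores)                          # 5.0
squared_devs = [(x - mean) ** 2 for x in scores]          # [1, 4, 9, 0, 4]

pop_variance = sum(squared_devs) / len(scores)            # 3.6
sample_variance = sum(squared_devs) / (len(scores) - 1)   # 4.5
pop_sd = pop_variance ** 0.5                              # ≈ 1.897
print(pop_variance, sample_variance, pop_sd)
```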
4.3.3 Percentiles
are measures of location, denoted P1, P2, ..., P99, which divide a set of data into 100 groups
with about 1% of the values in each group
Percentiles are merely a form of cumulative frequency distribution, but instead of being
expressed in terms of accumulating scores from lowest to highest, the categorisation is in
terms of whole numbers of percentages of people.
The percentile is the score which a given percentage of scores equals or falls below.
Percentiles are examined to find the cut-off points in a given data set.
For example - 80% of scores are equal to 61 or less.
For example, the 50th percentile, denoted P50, has about 50% of the data values below it and
about 50% of the data values above it, so the 50th percentile is the same as the median. There
is no universal agreement on a single procedure for calculating percentiles, but we will describe
two relatively simple procedures for
(1) Finding the percentile of a data value,
percentile of x = (number of values less than x / total number of values) × 100
Sorted data = 3, 4, 5, 6, 7, 9, 12, 15, 20, 22, 23, 24, 25
Find the percentile of 22
P = (number of values < 22) / N × 100 = 9/13 × 100 = 69.23 ≈ 70. Thus about 70% of students
scored 22 or below, and only 30% of students scored above 22.
(2) Converting a percentile to its corresponding data value.
Examples
P30 is the value that divides the lowest 30% of the data from the highest 70% of the data
P70 divides the lowest 70% of the data from the highest 30% of the data
When:
N - Total number of values in the data set
K - Percentile being used (Example: For the 25th percentile, K = 25)
L - Locator that gives the position of a value (Example: For the 12th value in
the sorted list, L = 12), where L = K/100 × N
Find the value of the 25th percentile
L = K/100 × N = 25/100 × 13 = 3.25, round up to position 4, whose value is 6. This shows that
25% of students scored 6 or below.
Find the value of the 50th percentile
L = K/100 × N = 50/100 × 13 = 6.5, round up to position 7, whose value is 12
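Both procedures can be sketched in Python for the example's sorted data; the round-up rule follows the locator convention used above (conventions vary when L lands exactly on a whole number):

```python
import math

data = [3, 4, 5, 6, 7, 9, 12, 15, 20, 22, 23, 24, 25]  # already sorted

# (1) Percentile of a data value: values below x, divided by N, times 100
def percentile_of(x, values):
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

print(percentile_of(22, data))  # 69.23... (about the 70th percentile)

# (2) Value at the kth percentile: locator L = (k/100) * N, rounded up
def value_at_percentile(k, values):
    loc = k / 100 * len(values)
    return values[math.ceil(loc) - 1]

print(value_at_percentile(25, data))  # 6
print(value_at_percentile(50, data))  # 12
```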
4.3.5. Z-score
Z-scores are merely scores expressed in terms of the number of standard statistical
units of measurement (standard deviations) they are from the mean of the set of scores.
A z score (or standardized value) is found by converting a value to a standardized scale, as
given in the following definition. This definition shows that a z score is the number of standard
deviations that a data value lies from the mean.
A z score (or standardized value) is the number of standard deviations that a given value x is
above or below the mean: z = (x − x̄) / s for samples, or z = (x − μ) / σ for populations.
We used the range rule of thumb to conclude that a value is “unusual” if it is more than 2
standard deviations away from the mean. It follows that unusual values have z scores less
than −2 or greater than +2.
For Example:
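A hypothetical Python sketch: a test with mean 100 and standard deviation 15, and a score of 130 (the numbers are placeholders, not from the notes):

```python
# z = (x - mean) / sd: the number of standard deviations from the mean
mean, sd = 100, 15      # hypothetical test mean and standard deviation
x = 130
z = (x - mean) / sd
print(z)  # 2.0 — right at the "unusual" boundary (|z| > 2)
```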
Class work
Students  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17
Test 1    6  6  5  4  7  4  4  3  6  10  6   6   4   8   12  12  11
Test 2    8  4  8  2  4  8  2  5  10 10  10  8   7   12  11  10  9
Calculate the relationship between Test 1 and Test 2, check its significance and interpret it
3.2 Spearman’s rho Correlation Coefficient
When to use
- There are ranked data for variable A and variable B
- The data are skewed away from the normal distribution
- N is less than 30
The Pearson correlation coefficient is the dominant correlation index in psychological statistics.
There is another called Spearman’s rho which is not very different. Instead of taking the scores
directly from your data, the scores on a variable are ranked from smallest to largest. That is, the
smallest score on variable X is given rank 1, the second smallest score on variable X is given
rank 2, and so forth. The smallest score on variable Y is given rank 1, the second smallest score
on variable Y is given rank 2, etc. Then Spearman’s rho is calculated like the Pearson correlation
coefficient between the two sets of ranks as if the ranks were scores.
A special procedure is used to deal with tied ranks. Sometimes certain scores on a variable are
identical. There might be two or three people who scored 7 on variable X, for example.This
situation is described as tied scores or tied ranks. The question is what to do about them. The
conventional answer in psychological statistics is to pretend first of all that the tied scores can be
separated by fractional amounts. Then we allocate the appropriate ranks to these ‘separated’
scores but give each of the tied scores the average rank that they would have received if they
could have been separated
The two scores of 5 are each given the rank 2.5 because if they were slightly different they
would have been given ranks 2 and 3, respectively. But they cannot be separated and so we
average the ranks as follows:
In the table above there are two scores of 5; these tied scores are each given the average rank
(2 + 3)/2 = 2.5. Likewise, (7 + 8 + 9)/3 = 8.
There are three scores of 9 which would have been allocated the ranks 7, 8 and 9 if the scores
had been slightly different from each other. These three ranks are averaged to give an average
rank of 8 which is entered as the rank for each of the three tied scores
Participants     1     2      3     4     5      6     7     8      9     10
Test 1 for MA    8     3      9     7     2      3     9     8      6     7
Rank 1           7.5   2.5    9.5   5.5   1      2.5   9.5   7.5    4     5.5
Test 2 for MUA   2     6      4     5     7      7     2     3      5     4
Rank 2           1.5   8      4.5   6.5   9.5    9.5   1.5   3      6.5   4.5
Difference (D)   6     5.5    5     1     8.5    7     8     4.5    2.5   1
D²               36    30.25  25    1     72.25  49    64    20.25  6.25  1
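With the D² column in hand, rho follows from the usual computational formula, rho = 1 − 6ΣD² / (n(n² − 1)); a Python sketch (with tied ranks this shortcut is approximate, which is why statistics packages correlate the ranks directly):

```python
d_squared = [36, 30.25, 25, 1, 72.25, 49, 64, 20.25, 6.25, 1]
n = 10

# rho = 1 - 6*sum(D^2) / (n(n^2 - 1)); approximate when ranks are tied
rho = 1 - (6 * sum(d_squared)) / (n * (n ** 2 - 1))
print(round(rho, 3))  # -0.848: a strong negative relationship
```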
2. Is there a relationship between students study time per hour and their levels of perceived
stress?
3. Do people with higher study time have lower or higher levels of perceived stress?
CHAPTER SIX
HYPOTHESIS TESTING
6.1 Concepts of Hypothesis Testing
Hypothesis is usually considered as the principal instrument in research.
Many experiments are carried out with the deliberate object of testing hypotheses.
In social science, where direct knowledge of population parameters is rare, hypothesis
testing is the strategy used for deciding whether generalizations can be made from
sample data.
Hypothesis testing enables us to make probability statements about population
parameter(s).
6.2 What is a Hypothesis?
A hypothesis simply means a mere assumption or some supposition to be proved or
disproved. But for a researcher, a hypothesis is a formal question that the researcher intends
to resolve. A research hypothesis is a predictive statement, capable of being tested by
scientific methods, that relates an independent variable to some dependent variable.
For example, consider the following statement
A. “Students who receive counseling will show a greater increase in creativity than
students not receiving counseling.”
Typically, in hypothesis testing, we have two options to choose from. These are termed as null
hypothesis and alternative hypothesis.
NULL HYPOTHESIS VS ALTERNATIVE HYPOTHESIS
Null hypothesis: A statistical hypothesis that assumes that the observation is due
to a chance factor. In hypothesis testing, the null hypothesis is denoted by H0: μ1 = μ2,
which states that there is no difference between the two population means.
The null hypothesis always states that there is no effect in the underlying population. By
effect we might mean a relationship between two or more variables, a difference between
two or more different populations or a difference in the responses of one population
under two or more different conditions.
Alternative Hypothesis (H1) - a hypothesis to be considered as an alternative to the null
hypothesis.
Examples of null hypotheses (H0): in the above example
There is no relationship between study time and exam grade
There is no difference between female and male participants in exam result
There is no difference in exam results after the participants take training on study skills
Students  1  2  3  4  5  6  7  8  9
Test      8  7  5  6  8  7  8  6  6
Population mean (μ) = 5
Step 1 State the null and alternative hypotheses.
H0: x̄ = μ (the sample mean is equal to the population mean; there is no difference between the
sample mean and the population mean)
H1: x̄ ≠ μ (the sample mean is different from the population mean)
Step 2 Specify the level of significance = 0.05
Step 3 Determine the degrees of freedom = N– 1 = 9-1 = 8
Step 4 Determine the critical value = from the table = 2.30
Step 5 Determine the rejection region – all values with |t| > 2.30
Step 6 Find the test statistic
t = (x̄ − μ) / (s / √n) = (6.77 − 5) / (1.092 / √9) = 1.77 / 0.364 = 4.86
Step 7 Make a decision to reject or fail to reject H0
The calculated t-value is 4.86 > the critical value 2.30 at 0.05 significance level. Then,
H0 is rejected.
Step 8 Interpret the result
Table 2: One-sample t-test result for statistics test I
*P < 0.05
This shows that there is a significant difference between the sample mean and the population
mean scores t (8) = 4.86, p < 0.05. This also implies that the sample mean score of stat test (M =
6.77) is significantly higher than the population mean score (M = 5) for students.
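For comparison, scipy reproduces this one-sample test directly from the raw scores; a sketch (small differences from the hand calculation are rounding):

```python
from scipy import stats

scores = [8, 7, 5, 6, 8, 7, 8, 6, 6]
t_stat, p_value = stats.ttest_1samp(scores, popmean=5)
print(t_stat, p_value)  # t ≈ 4.88 (≈ 4.86 by hand), p < 0.05: reject H0
```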
6.7.2 Independent Sample T- test
Basic concepts
An independent t-test measures differences between two distinct groups. Those differences might
be directly manipulated (e.g. drug treatment group vs. placebo group), they may be naturally
occurring (e.g. male vs. female), or they might be beyond the control of the experimenter (e.g.
depressed people vs. healthy people). In an independent t-test, mean dependent variable scores are
compared between the two groups (the independent variable). For example, we could measure
differences in the amount of money spent on clothes between men and women
The t test (unrelated) is based on comparing the means for the two groups doing each condition.
This is because there is no basis for comparing differences between related pairs of scores for
each participant. Because the t test (unrelated) is based on unrelated scores for two conditions,
which are independent of each other, another name for the t test (unrelated) is the independent t
test.
In many real life situations, we cannot determine the exact value of the population mean. We are
only interested in comparing two populations using a random sample from each. Such
experiments, where we are interested in detecting differences between the means of two
independent groups are called independent samples test. Some situations where independent
samples t-test can be used are given below:
An economist wants to compare the per capita income of two different regions.
A labor union wants to compare the productivity levels of workers for two different
groups.
An aspiring MBA student wants to compare the salaries offered to the graduates of two
business schools.
In all the above examples, the purpose is to compare between two independent groups in contrast
to determining if the mean of the group exceeds a specific value as in the case of one sample t-
tests.
Assumptions
The independent variable must be categorical
- It must consist of two distinct groups
- Group membership must be independent and exclusive
- No person (or case) can appear in more than one group
There must be one parametric dependent variable
- The dependent variable data must be interval or ratio
- And should be reasonably normally distributed (across both groups)
We should check for homogeneity of variances
If these assumptions are not met, the non-parametric Mann–Whitney U test could be
considered
Computing Independent Sample T-test: For example: Gender differences in statistics test results
M. Students  1   2   3   4   5   6   7   8   9   10   Sum
Scores (X1)  4   6   5   7   8   4   3   2   4   5    48
X1²          16  36  25  49  64  16  9   4   16  25   260
F. Students  11  12  13  14  15  16  17  18  19  20   Sum
Scores (X2)  8   9   6   7   8   10  8   9   7   10   82
X2²          64  81  36  49  64  100 64  81  49  100  688
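The full hand computation for this design is not shown above, but as a sketch, scipy's independent-samples test gives the result directly from the two sets of scores:

```python
from scipy import stats

male_scores   = [4, 6, 5, 7, 8, 4, 3, 2, 4, 5]    # mean 4.8
female_scores = [8, 9, 6, 7, 8, 10, 8, 9, 7, 10]  # mean 8.2

t_stat, p_value = stats.ttest_ind(male_scores, female_scores)
print(t_stat, p_value)  # negative t: the male mean is below the female mean
```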
Student  1  2  3  4  5  6  7  8   9  10  Sum
Maths    4  3  3  3  4  5  4  3   5  4   38
Civic    1  2  2  3  3  2  2  4   1  1   21
d        3  1  1  0  1  3  2  −1  4  3   17
d²       9  1  1  0  1  9  4  1   16 9   51
t = Σd / √[(nΣd² − (Σd)²) / (n − 1)]
t = 17 / √[(10 × 51 − 17²) / (10 − 1)]
t = 17 / √(221 / 9)
t = 17 / 4.96
t = 3.43
Step 7 Make a decision to reject or fail to reject H0
The calculated t-value is 3.43 > the critical value 2.26 at 0.05 significance level. Then,
H0 is rejected.
Step 8 Interpret the result
Table 1: T-test results for students test scores
*P < 0.05
This shows that there is a significant difference between the mathematics and civic mean scores,
t(9) = 3.43, p < 0.05. This also implies that the mean score in mathematics (M = 3.8) is
significantly higher than the mean score in civic education (M = 2.1) for these students
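Because the Maths-vs-Civic example works from difference scores (d), it is a related (paired) design; as a sketch, scipy's paired test reproduces the hand result:

```python
from scipy import stats

maths = [4, 3, 3, 3, 4, 5, 4, 3, 5, 4]
civic = [1, 2, 2, 3, 3, 2, 2, 4, 1, 1]

t_stat, p_value = stats.ttest_rel(maths, civic)
print(t_stat, p_value)  # t ≈ 3.43, p < 0.05: reject H0
```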
6.8 Analysis of Variance (ANOVA)
The analysis of variance (ANOVA) currently enjoys the status of being probably the most used
statistical technique in psychological research integrating with other tests of analysis such as
regression, multiple analysis of variance and covariance. Analysis of variance is highly related
with t- test in comparing means in the process of conducting psychological researches. The
popularity and usefulness of this technique can be attributed to two facts. First, the analysis of
variance, like t, deals with differences between sample means, but unlike t, it has no restriction
on the number of means. Instead of asking merely whether two means differ, we can ask whether
two, three, four, five, or k means differ. Second, the analysis of variance allows us to deal with
two or more independent variables simultaneously, asking not only about the individual effects
of each variable separately but also about the interacting effects of two or more variables
(Pagano, 2009).
Based on the number of independent variables included in the research, there are different
forms of the analysis of variance, such as one-way, two-way, three-way and so on. On the
other hand, considering the design, the nature of the dependent variable and the hypothesis to
be tested, scholars categorize analysis of variance into between-participants designs,
repeated measures designs and mixed designs. In other words, one-way analysis of variance uses
one independent variable having three or more levels with one dependent variable (Howitt &
Cramer, 2011).
As a parametric test, analysis of variance is interested in testing the null hypothesis having one
continuously measured dependent variable with one or more categorical independent variables.
The independent variables are expected to have different levels that have organized scores
obtained from data gathering tools. Stating the null and alternative hypotheses in symbols and in
words and thereby calculating the F-ratio in accordance with the steps are important activities in
analysis of variance. If the F-ratio showed significant differences across the means, post hoc test
analysis can be done in order to know which mean is significantly different from the others. At
the same time, calculating the effect size of the independent variable on dependent variable using
different statistical techniques such as omega squared and eta squared is still important
(Dancey & Reidy, 2011).
E. The different samples are from populations that are categorized in only one way.
The samples are expected to come from one independent variable organized as levels. In other
words, the samples do not show the number of independent variables.
3.2 Sources of variance
Analysis of variance (ANOVA), as the name suggests, analyses the different sources from which
variation in the scores arises.
Between-groups variance
ANOVA looks for differences between the means of the groups. When the means are very
different, we say that there is a greater degree of variation between the conditions. If there were
no differences between the means of the groups, then there would be no variation. This sort of
variation is called between-groups variation (Dancey & Reidy, 2011).
Between-groups variation arises from:
Treatment effects: When we perform an experiment, or study, we are looking to see that the
differences between means are big enough to be important to us, and that the differences
reflect our experimental manipulation. The differences that reflect the experimental
manipulation are called the treatment effects
Individual differences: Each participant is different, therefore participants will respond
differently, even when faced with the same task. Although we might allot participants
randomly to different conditions, sometimes we might find, say, that there are more
motivated participants in one condition, or they are more practiced at that particular task.
Experimental error: Most experiments are not perfect. Sometimes experimenters fail to
give all participants the same instructions; sometimes the conditions under which the tasks
are performed are different, for each condition. At other times, equipment used in the
experiment might fail, etc. Differences due to errors such as these contribute to the
variability.
Within-groups variance
Another source of variance is the differences or variation within a group. This can be thought of
as variation within the columns.
Within-groups variation arises from:
Individual differences: In each condition, even though participants have been given the
same task, they will still differ in scores. This is because participants differ among
themselves in abilities, knowledge, IQ, personality and so on. Each group, or condition, is
bound to show variability.
Experimental error: This has been explained above
Steps for test statistic in One-Way ANOVA
Step 1 State the null and alternative hypotheses.
H0: μ1 = μ2 = μ3 (All population means are equal.)
Ha: At least one mean is different from the others
Step 2 Specify the level of significance = 0.05, 0.01, 0.1
Step 3 Determine the degrees of freedom = N - K, K - 1
Step 4 Determine the critical value = from the table
Step 5 Determine the rejection region
Step 6 Find the test statistic
Step 7 Make a decision to reject or fail to reject H0
Step 8 Interpret the result
Example 1: A researcher wanted to test the effect of study skills support on the academic
achievement scores of students in Debre Markos University. He took 15 students who needed
study skills support and assigned them randomly into three groups: placebo, low support and
high support. The level of significance for this hypothesis testing is 0.05. The data collected
from the students are presented in the following table.
Placebo Low support High support
2 10 10
3 8 13
7 7 14
2 5 13
6 10 15
n1 = 5, n2 = 5, n3 = 5
N = 15
Solution:
Step1: State the null and alternative hypotheses
H0: μ1 = μ2 = μ3 (All population means are equal)
Ha: μ1 ≠ μ2 ≠ μ3 (At least one mean is different from the others)
Step 2: Specify the level of significance
The significance level is α = 0.05
Step 3 Determine the degrees of freedom
Degree of freedom for between groups = (K – 1) = 3 – 1 = 2 (K is number of groups)
Degree of freedom for within groups = (N – K) = 15 – 3 = 12 (K is number of groups)
Degree of freedom for total = (N – 1) = 15 – 1 = 14 (N is all participants in the research)
Step 4 Determine the critical value from F distribution.
To find F critical value we use F (2, 12) = 3.89
Step 5 Determine the rejection region
In the F distribution, the rejection region is all the values greater than 3.89. In other words, if F
calculated greater than 3.89 reject the null hypothesis because it is in the rejection region or, if F
calculated is less than 3.89 accept the null hypothesis.
Step 6 Find the test statistic
Calculate between sum of square (SSB)
SSB = (Σx1)²/n1 + (Σx2)²/n2 + (Σx3)²/n3 − (Σx)²/N
SSB = (20²/5) + (40²/5) + (65²/5) − (125²/15)
SSB = (80 + 320 + 845) − 1041.667
SSB = 1245 − 1041.667 = 203.333
Calculate within sum of square (SSW)
SSW = Σx² − [(Σx1)²/n1 + (Σx2)²/n2 + (Σx3)²/n3]
SSW = 1299 − [(20²/5) + (40²/5) + (65²/5)]
SSW = 1299 − (80 + 320 + 845)
SSW = 1299 − 1245 = 54
Calculate total sum of square (SST)
SST = SSB + SSW = 203.333 + 54 = 257.333
Calculate Between groups mean of square (MSB)
MSB = SSB / dfB = 203.333 / 2 = 101.667
Calculate within groups mean of square (MSW)
MSW = SSW / dfW = 54 / 12 = 4.5
Calculate F-ratio
F = MSB / MSW = 101.667 / 4.5 = 22.59
Step 7 Make a decision to reject or fail to reject H0
F critical (2, 12) = 3.89
F calculated = 22.59
ANOVA summary table for the study skills support given to students
Sources of variation   df   Sum of squares   Mean squares   F
Between groups         2    203.333          101.667        22.59
Within groups          12   54               4.5
Total                  14   257.333
Then, since F calculated = 22.59 > F critical (2, 12) = 3.89, reject the null hypothesis. The
test statistic falls in the rejection region, so you should reject the null hypothesis.
Interpretation: There is enough evidence at the 5% level of significance to conclude that study
skills support has a significant effect on the mean academic achievement scores of students.
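As a check on the hand calculation, scipy's one-way ANOVA reproduces the F-ratio from the raw scores:

```python
from scipy import stats

placebo      = [2, 3, 7, 2, 6]
low_support  = [10, 8, 7, 5, 10]
high_support = [10, 13, 14, 13, 15]

f_stat, p_value = stats.f_oneway(placebo, low_support, high_support)
print(f_stat, p_value)  # F ≈ 22.59, p < 0.05: reject H0
```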
3.3 Post hoc Analysis
Post hoc analysis is a multiple comparison technique for making comparisons between two or
more group means subsequent to an analysis of variance. Since there is enough evidence at the
5% level of significance to conclude that the means of academic achievement scores of students
are different, post hoc analysis can identify which mean differs from the others. Post hoc
analysis methods differ in their power and in how well they minimize Type I error. Some of
them are listed below.
Let’s use the post hoc analysis technique of the Tukey test for the example given above.
When the Tukey test is used for post hoc analysis, we use the Q-distribution to find the critical
value. The multiple comparisons through the Tukey test take four steps, as follows.
Step 1: Find Q-calculated by comparing two means at a time
1. Placebo with low study skills support (mean1 with mean2)
Q-cal = (mean2 − mean1) / √(MSW/n) = (8 − 4) / √(4.5/5) = 4 / 0.949 = 4.22*
2. Placebo with high study skills support (mean1 with mean3)
Q-cal = (mean3 − mean1) / √(MSW/n) = (13 − 4) / √(4.5/5) = 9 / 0.949 = 9.48*
3. Low study skills support with high study skills support (mean2 with mean3)
Q-cal = (mean3 − mean2) / √(MSW/n) = (13 − 8) / √(4.5/5) = 5 / 0.949 = 5.27*
Step 2: Find Q-critical from the Q-distribution using (r, df): Q (3, 12) = 3.77
Step 3: Make decisions based on the three mean comparisons.
For mean1 and mean2: Q-cal > Q-crit, or 4.22 > 3.77, so reject the null hypothesis
For mean1 and mean3: Q-cal > Q-crit, or 9.48 > 3.77, so reject the null hypothesis
For mean2 and mean3: Q-cal > Q-crit, or 5.27 > 3.77, so reject the null hypothesis
Step 4: Interpretation
There is enough evidence at the 5% level of significance to conclude that, across the study
skills support groups, all means of academic achievement scores are significantly different
from each other.
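For comparison, statsmodels offers Tukey's HSD directly; a sketch with the same data, reporting each pairwise mean difference and the reject/fail-to-reject decision:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array([2, 3, 7, 2, 6,          # placebo
                   10, 8, 7, 5, 10,        # low support
                   10, 13, 14, 13, 15])    # high support
groups = np.repeat(["placebo", "low", "high"], 5)

print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```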
Introduction
Regression analysis is a statistical technique that is widely used in research. Regression analysis
is used to predict the behavior of the dependent variable based on a set of independent variables.
In regression analysis, the dependent variable can be metric or non-metric, and the independent
variables can be metric, categorical, or a combination of both. Researchers use regression
analysis in two manners: linear regression analysis and non-linear regression analysis. Linear
regression analysis is further divided into two types: simple linear regression analysis and
multiple linear regression analysis. In simple linear regression analysis, there is one dependent
variable and one independent variable. In multiple linear regression analysis, there is one
dependent variable and many independent variables. Non-linear regression analysis is also of two
types: simple non-linear regression analysis and multiple non-linear regression analysis. When
there is a non-linear relationship between the dependent and independent variables, and there is
one dependent and one independent variable, it is said to be simple non-linear regression
analysis. When there is one dependent variable and two or more independent variables, it is said
to be multiple non-linear regression.
Learning outcomes
Upon completing this topic, the students will be able to:
Describe basic concepts of regression
Appropriately use regression principles in different research fields
Apply regression models in research design
Perform regression analysis and interpret the results
Key Terms: Regression, Intercept, Slope, Curve fit, Polynomial, Best fit line
Linear regression is the most basic and commonly used predictive analysis. Regression estimates are used
to describe data and to explain the relationship between one dependent variable and one or more
independent variables.
At the center of the regression analysis is the task of fitting a single line through a scatter plot. The
simplest form with one dependent and one independent variable is defined by the formula y = a + b*x.
Sometimes the dependent variable is also called endogenous variable, prognostic variable or regressand.
The independent variables are also called exogenous variables, predictor variables or regressors.
However, linear regression analysis consists of more than just fitting a line through a cloud of
data points. It consists of 3 stages: (1) analyzing the correlation and directionality of the data,
(2) estimating the model, i.e., fitting the line, and (3) evaluating the validity and usefulness of
the model.
1) Regression might be used to identify the strength of the effect that the independent variable(s)
have on a dependent variable. Typical questions are: what is the strength of the relationship
between dose and effect, sales and marketing spend, or age and income?
2) It can be used to forecast effects or impacts of changes. That is, regression analysis helps us
to understand how much the dependent variable will change when we change one or more
independent variables. A typical question is: how much additional Y do I get for one additional
unit of X?
3) Regression analysis predicts trends and future values. Regression analysis can be used to get
point estimates. Typical questions are: what will the price of gold be 6 months from now? What is
the total effort for task X?
Assumptions:
With the exception of the mean and standard deviation, linear regression is possibly the most
widely used of statistical techniques. This is because many of the problems that we encounter in
research settings require that we quantitatively evaluate the relationship between two variables
for predictive purposes.
By predictive, I mean that the values of one variable depend on the values of a second. We might be
interested in calibrating an instrument such as a sprayer pump. We can easily measure the current or
voltage that the pump draws, but specifically want to know how much fluid it pumps at a given
operating level. Or we may want to empirically determine the production rate of a chemical product
given specified levels of reactants.
Linear regression, which is the natural extension of correlation analysis, provides a great starting
point toward these objectives.
Curve fit - This is perhaps the most general term for describing a predictive relationship between two
variables, because the "curve" that describes the two variables is of unspecified form.
Polynomial fit - A polynomial fit describes the relationship between two variables as a mathematical
series. Thus a first order polynomial fit (a linear regression) is defined as y = a + bx. A second
order (parabolic) fit is y = a + bx + cx^2, a third order (cubic) fit is y = a + bx + cx^2 + dx^3,
and so on...
Best fit line - The equation that best describes the y or dependent variable as a function of the
x or independent variable.
Linear regression and least squares linear regression - This is the method of interest. The
objective of linear regression analysis is to find the line that minimizes the sum of squared deviations
of the dependent variable about the "best fit" line. Because the method is based on least squares, it is
said to be a BLUE method, a Best Linear Unbiased Estimator.
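A minimal least-squares sketch in Python with hypothetical data, computing the slope and intercept from the deviation formulas and checking against numpy's first-order polynomial fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # hypothetical response

# b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²;  a = ȳ - b·x̄
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)                 # intercept 0.15, slope 1.95

print(np.polyfit(x, y, 1))  # same line: [slope, intercept]
```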
Specifically, the slope is defined as the summed cross product of the deviations of x and y from
their respective means, divided by the sum of squared deviations of x from its mean:
b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², with intercept a = ȳ − b·x̄. The second relationship above is
useful if these quantities have to be calculated by hand. The standard error values of the slope
and intercept are mainly used to compute the 95% confidence intervals. If you accept the
assumptions of linear regression, there is a 95% chance that the 95% confidence interval of the
slope contains the true value of the slope, and that the 95% confidence interval for the intercept
contains the true value of the intercept.
It's interesting to note that the slope in the generalized case is equal to the linear correlation
coefficient scaled by the ratio of the standard deviations of y and x: b = r·(sy/sx).
There are several assumptions that must be met for the linear regression to be valid:
The scatter of the y values about the y estimates (denoted ŷ) based on the best fit line is often
referred to as the "standard error of the regression": s = √[Σ(y − ŷ)² / (n − 2)].
Notice that two degrees of freedom are lost in the denominator: one for the slope and one for the
intercept. A more descriptive definition - and strictly correct name - for this statistic is the
root mean square error (denoted RMS or RMSE).
Just as in linear correlation analysis, we can explicitly calculate the variance explained by the
regression model: r² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)².
As with the other statistics that we have studied, the slope and intercept are sample statistics
based on data that include some random error e: y = a + bx + e. We are of course actually
interested in the true population parameters, which are defined without error: y = α + βx. How do
we assess the significance level of the model? In essence we want to test the null hypothesis that
β = 0 against one of three possible alternative hypotheses: β > 0, β < 0, or β ≠ 0.
There are at least two ways to determine the significance level of the linear model. Perhaps the easiest
method is to calculate r, and then determine significance based on the value of r and the degrees of
freedom using a table for significance of the linear or product moment correlation coefficient. This
method is particularly useful in the standardized regression case when b=r.
The significance level of b can also be determined by calculating a confidence interval for the
slope. Just as we did in earlier hypothesis testing examples, we determine a critical t-value
based on the correct number of degrees of freedom and the desired level of significance. It is
for this reason that the random variables x and y must be bivariate normal.
For the linear regression model the appropriate degrees of freedom is always df = n − 2. The
level of significance of the regression model is determined by the user; the 95% or 99% levels
are generally used. The standard error values of the slope and intercept can be hard to
interpret, but their main purpose is to compute the 95% confidence intervals. If you accept the
assumptions of linear regression, there is a 95% chance that the 95% confidence interval of the
slope contains the true value of the slope, and that the 95% confidence interval for the
intercept contains the true value of the intercept. The confidence interval is then defined as
the product of the critical t-value and Sb, the standard error of the slope: CI = b ± t-crit × Sb,
where Sb is defined as Sb = s / √[Σ(x − x̄)²], with s the standard error of the regression (RMSE).
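Continuing the hypothetical data from the earlier sketch, the slope's standard error and 95% confidence interval can be computed as described:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

b, a = np.polyfit(x, y, 1)                     # slope, intercept
residuals = y - (a + b * x)
n = len(x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # RMSE, df = n - 2
Sb = s / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of the slope
t_crit = stats.t.ppf(0.975, df=n - 2)          # 95%, two-tailed

print(b - t_crit * Sb, b + t_crit * Sb)        # CI excluding 0 → significant slope
```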
Interpretation.
If there is a significant slope, then b will be statistically different from zero. So if b is
greater than (t-crit) × Sb, the confidence interval does not include zero. We would thus reject
the null hypothesis that β = 0 at the pre-determined significance level. As (t-crit) × Sb becomes
smaller, the greater our certainty in beta, and the more accurate the prediction of the model. If
we plot the confidence interval on the slope, the positive and negative limits of the confidence
interval plot as lines that intersect at the point defined by the mean x, y pair for the data set.
In effect, this tends to underestimate the error associated with the regression equation because
it neglects the role of the intercept in controlling the position of the line in the Cartesian
plane defined by the data. Fortunately, we can take this into account by calculating a confidence
interval on the line.
Just as we did in the case of the confidence interval on the slope, we can write this out
explicitly as a confidence interval for the regression line. The degrees of freedom are still
df = n − 2, but now the standard error of the regression line at a given x is defined as:
s_ŷ = s √[1/n + (x − x̄)² / Σ(x − x̄)²]. Because values that are further from the mean of x and y
have less probability and thus greater uncertainty, this confidence interval is narrowest near
the location of the joint x and y mean (the centroid or center of the data distribution), and
flares out at points further from the centroid. While the confidence interval is curvilinear, the
model is in fact linear.