Educ3063 Notes
Uploaded by awokegoshi

CHAPTER ONE

INTRODUCTION TO STATISTICS
1.1 Meaning and Basic Concepts of Statistics
Learning objectives

After studying this chapter, the participants should be able to:


 Define statistics
 Distinguish between qualitative data and quantitative data.
 Describe nominal, ordinal, interval, and ratio scales of measurements.
 Describe the difference between population and sample.

What does the word statistics mean? To most people, it suggests numerical facts or data, such
as unemployment figures, farm prices, or the number of marriages and divorces. The most
common definition of the word statistics is as follows:
 Statistics is the science of planning, organizing, summarizing, presenting, analyzing,
interpreting, and drawing conclusions based on data (Triola, 2012).
Importance of statistics: Using statistics has different benefits. Some of them are:
 to select an appropriate statistical test
 to collect the right kinds of information for analysis
 to perform statistical calculations in a straightforward, step-by-step manner
 to accurately interpret and present statistical results
 to be an intelligent consumer of statistical information
 to write up analyses and results in American Psychological Association (APA) style
1.2. Types of Statistics: Descriptive vs. Inferential

[Figure 1: Branches of statistics – there are two types/branches of statistics: descriptive statistics and inferential statistics]


Inferential statistics
 Consists of methods for drawing and measuring the reliability of conclusions about a
population based on information obtained from a sample of that population.
 Permits generalizations to be made about populations based on sample data drawn from
them.
 Uses statistics, which are measures of a sample, to infer values of parameters, which are
measures of a population.
 Is the branch of statistics that involves using a sample to draw conclusions about a
population.
 Includes procedures such as the t-test, correlation, ANOVA, MANOVA, regression, and
factor analysis, which use sample data and generalize the findings to the population.
Descriptive statistics
 Consists of methods for organizing and summarizing information.
 Comprises statistical procedures that describe, organize, and summarize the main
characteristics of sample data.
 Simply describes the set of data at hand.
 Is the branch of statistics that involves the organization, summarization, and display of
data.
 Uses ratios, percentages, means, tables, graphs, figures, charts, standard deviations,
diagrams, and ranges.
Practical Example 1 – Decide which part of each study represents descriptive statistics.
What conclusions might be drawn from each study using inferential statistics?
1. A large sample of men, aged 48, was studied for 18 years. For unmarried men,
approximately 70% were alive at age 65. For married men, 90% were alive at age 65.
2. A survey conducted among 1017 men and women by Opinion Research Corporation
International found that 76% of women and 60% of men had a physical examination
within the previous year
 Solution for question1
Descriptive statistics involves statements such as
 “For unmarried men, approximately 70% were alive at age 65”
 “For married men, 90% were alive at 65.”
The inference drawn from the study is that
 Being married is associated with a longer life for men
 Solution for question 2
 Descriptive statistics involve the statement - “76% of women and 60% of men had
a physical examination within the previous year.”
 An inference drawn from the study is that - a higher percentage of women had a
physical examination within the previous year

Data
 are collections of observations (such as measurements, genders, survey responses)
 Consist of information coming from observations, counts, measurements, or responses.
Table 1-1: Data used for analysis (data taken from students' grade report for further analysis)
Participants Creativity Pedagogy
1 82 84
2 67 77
3 80 86
4 69 66
5 56 68
6 90 94
7 85 95
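As a quick illustration of descriptive statistics, the two score columns in Table 1-1 can be summarized with Python's standard statistics module (the values below are copied from the table):

```python
import statistics

# Scores copied from Table 1-1 (students' grade report)
creativity = [82, 67, 80, 69, 56, 90, 85]
pedagogy = [84, 77, 86, 66, 68, 94, 95]

# Descriptive summaries: mean and range of each column
mean_creativity = statistics.mean(creativity)          # 529 / 7, about 75.57
mean_pedagogy = statistics.mean(pedagogy)              # 570 / 7, about 81.43
range_creativity = max(creativity) - min(creativity)   # 90 - 56 = 34
```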

Data Sets
There are two types of data sets you will use when studying statistics. These data sets are called
populations and samples

Population
The complete collection of all individuals (scores, people, measurements, and so on) to be
studied. The collection is complete in the sense that it includes all of the individuals to be
studied
The collection of all individuals or items under consideration in a statistical study
Complete set of events in which you are interested.
is the collection of all outcomes, responses, measurements, or counts that are of interest
For instance
 if we were interested in the stress levels of all adolescent Americans, then the
collection of all adolescent Americans’ stress scores would form a population,
 the scores of all morphine-injected mice
 the milk production of all cows in the country
 The ages at which every girl first began to walk
 the stress scores of the sophomore class in Woldia University
The population can range from a relatively small set of numbers, which is easily collected, to an
infinitely large set of numbers, which can never be collected completely. Because the
populations of interest are usually quite large, collecting data from every member can be
difficult, so researchers often collect data from a representative sample taken from the
population.
Census
The collection of data from every member of the population.
If a researcher studies the whole population, then there is no need of sampling to select the
research participants.
A census consists of data from an entire population. But unless a population is small, it is
usually impractical to obtain all the population data. In most studies, information must be
obtained from a sample.
Sample
A sub collection of members selected from a population
A part of the population from which information is obtained
Set of actual observations; subset of a population
N.B – A sample should be representative of a population so that sample data can be used to form
conclusions about that population. Sample data must be collected using an appropriate
method, such as random sampling.
Practical example 2- Identify the population and the sample
 In a recent survey, 1500 adults in the United States were asked if they thought there was
solid evidence of global warming. Eight hundred fifty-five of the adults said yes.
Solution
 The population consists of the responses of all adults in the United States
 The sample consists of the responses of the 1500 adults in the United States in
the survey.

[Figure 2: Population and sample – responses of all adults in the United States (population) versus responses of adults in the survey (sample)]


Parameter
 is a numerical description of a population characteristic
 a measure of a population
 a numerical measurement describing some characteristic of a population
Statistic
 is a numerical description of a sample characteristic
 a measure of a sample
 a numerical measurement describing some characteristic of a sample
N.B – It is important to note that a sample statistic can differ from sample to sample,
whereas a population parameter is constant for a population.
Practical example 3: Decide whether the numerical value describes a population parameter
or a sample statistic.
1. A recent survey of 200 college career centers reported that the average starting salary
for petroleum engineering majors is $83,121.
2. The 2182 students who accepted admission offers to Northwestern University in 2009
have an average SAT score of 1442
3. In a random check of a sample of retail stores, the Food and Drug Administration found
that 34% of the stores were not storing fish at the proper temperature.
Solution
1. Because the average of $83,121 is based on a subset of the population, it is a sample statistic.
2. Because the SAT score of 1442 is based on all the students who accepted admission offers in
2009, it is a population parameter.
3. Because the percentage of 34% is based on a subset of the population, it is a sample statistic.
1.3 Types of Variables
Types of Data
When doing a study, it is important to know the kind of data involved. The nature of the data you
are working with will determine which statistical procedures can be used. Data sets can consist
of two types of data: qualitative data and quantitative data.
Qualitative data
 Consist of attributes, labels, names, or nonnumerical entries.
 Consist of names or labels that are not numbers representing counts or measurements.
 Are also called categorical data.
For example
- Eye colors of green and brown
- The numbers 24, 28, 17, 54, and 31 sewn on the shirts of the LA Lakers starting
basketball team. These numbers are substitutes for names; they don't count or
measure anything, so they are categorical data.
- The political party affiliations (Democrat, Republican, Independent, other) of
survey respondents
Quantitative data
 Consist of numerical measurements or counts.
 Consist of numbers representing counts or measurements.
For example
- The ages (in years) of survey respondents
- Measurements of height and weight
- Academic achievement of students
- Time used by students in the classroom
- Number of students in Woldia University

THE CONCEPT OF A VARIABLE


A variable is any factor that can be measured or can take different values. Such factors can vary
from person to person, place to place, or experimental situation to experimental situation.
A variable is anything that can take on different values, for example:
 the speed with which we drive down the street
 the intensity of our caring for another person
 the color of our skin
 the height and weight of students
 the academic achievement of students
There are different types of variables which can be important in conducting research:
1. Discrete variable
 Data are expressed in numbers whose possible values form a finite or "countable" set:
0, 1, 2, 3, 4, 5, and so on.
 A quantitative variable whose possible values are counting numbers but not fractional
numbers.
 A "discrete" variable is used to characterize data in terms of whole numbers (1, 2, 3, and
so on) with no fractional counts occurring between them.
- Number of children in a family - Number of students in a classroom
- Number of cows in a village - Births and deaths per year
- The numbers of eggs that hens lay are discrete data because they represent
counts.
2. Continuous variables
 Quantitative, score, metric, ungrouped variables
 can take on any numerical value on a scale, and there exists an infinite number of
values between any two numbers on a scale
 Data from infinitely many possible values that correspond to some continuous scale
that covers a range of values without gaps, interruptions, or jumps.
 Are those variables which differ in degree rather than kind. These could be measured
on interval or ratio scales.
For example
 The amount of milk a cow gives in a day
 Academic achievement of students
 The intensity of stress among students
 Aggression of children
 Time, age, intelligence, creativity and behavior
3. Categorical variables
 Also called nonmetric, dichotomous, grouped, or classification variables
 A variable with different levels, groups, categories, or classifications
 Qualitative data consist of names or labels that are not numbers representing counts
or measurements.
 Qualitative variables are those variables which differ in kind rather than degree.
These could be measured on nominal or ordinal scales.
For example
 Gender - females and males
 Political parties – liberals, democratic, republican and so on
 Grade levels – grade 1, grade 2 or 1st year, 2nd year, 3rd year
 Economic status - destitute, poor, rich, wealthy
 Academic status – warning, probation, promoted
 Colleges – Educations, FBE, Technology, Agriculture ……
1.4 Scales / levels/ of measurement
Measurement represents a set of rules informing us of how values are assigned to objects or
events. Stevens (1946) identified four scales in his theory: nominal, ordinal, interval, and ratio,
in that order. Each scale includes an extra feature or rule over those in the one before it.
We will add a fifth scale to Stevens's treatment, summative response scaling, placing it between
the ordinal and the interval scales.
1. Nominal Scales
 an observation is simply given a name, a label, or otherwise classified
 Nominal scales use numbers, but these numbers are not in any mathematical relationship
with one another.
 A nominal scale uses numbers to identify qualitative differences among measurements.
 The measurements made by a nominal scale are names, labels, or categories, and no
quantitative distinctions can be drawn among them.
 More qualitative and provide less information
 Nominal scales are the lowest level of measurement
 categorical variables that represent different categories
 shows membership or member of a category
 the data are organized in the form of frequency counts for a given category
 Frequency counts simply tell us how many people we have in each category.
For example - Gender (1 = male, 2 = female), Ethnicity or religion of person, Smoker vs.
nonsmoker, literate versus illiterate,
2. Ordinal scales
 The measurement of an observation involves ranking or ordering based on an underlying
dimension.
 An ordinal scale ranks or orders observations based on whether they are greater than or
less than one another
 Ordinal scales do not provide information about how close or distant observations are
from one another.
 An ordinal scale of measurement uses numbers to convey "less than" and "more than"
information. This most commonly translates as rank ordering. Objects may be ranked in
the order that they align themselves on some quantitative dimension, but it is not possible
from the ranking information to determine how far apart they are on the underlying
dimension.
3. INTERVAL SCALES
 Interval scales of measurement have all of the properties of nominal and ordinal scales.
 The most common illustrations of an equal-interval scale are the Fahrenheit and Celsius
temperature scales.
 According to Stevens, "Equal intervals of temperature are scaled off by noting equal
volumes of expansion." Essentially, the difference in temperature between 30◦ F and 40◦
F is equal to the difference between 70◦ F and 80◦ F.
 A less obvious but important characteristic of interval scales is that they have
arbitrary zero points.
For example, zero degrees does not mean the absence of temperature; on the
Celsius scale, zero degrees is the temperature at which water freezes.
 As was true for summative response scales, it is meaningful to average data collected on
an interval scale of measurement; for example, the average high temperature in our home
town last week was 51.4◦ F.
4. RATIO SCALES
 A ratio scale of measurement has the properties of nominal, ordinal, and interval
scales.
 It has an absolute (true) zero point, where zero means absence of the property.
 Examples of ratio scales are time and measures of distance.
 We can interpret ratios of the numbers on these scales in a meaningful way:
four hours is twice as long as two hours, and three miles is half the distance of six
miles.
CHAPTER TWO
ORGANIZING AND PRESENTING DATA
2.1 Raw Data
Raw data are primary data or secondary data (e.g., numbers, instrument readings, figures, etc.)
collected from a source.
Lesson objectives
 Know ways of organizing data
 Create different types of charts that describe data sets.

Raw data
are data that have not been summarized,
have not been given meaning through interpretation,
and have not been subjected to processing or "cleaning" by researchers to remove outliers.
2.2 Organizing & Graphing Qualitative Data
2.2.1 Frequency distribution for qualitative Data
Qualitative data are values of a qualitative (nonnumerically valued) variable. One way of
organizing qualitative data is to construct a table that gives the number of times each distinct
value occurs. The number of times a particular distinct value occurs is called its frequency (or
count).
A frequency distribution of qualitative data is a listing of the distinct values and their
frequencies. A frequency distribution provides a table of the values of the observations and how
often they occur.
To Construct a Frequency Distribution of Qualitative Data, there are three steps
Step 1 - List the distinct values of the observations in the data set in the first column of a table.
Step 2 - For each observation, place a tally mark in the second column of the table in the row of
the appropriate distinct value.
Step 3 - Count the tallies for each distinct value and record the totals in the third column of the
table.
Practical example 4: Frequency Distribution of Qualitative Data
What is the highest level of education you have completed (please tick)? The responses of the 40
participants in the study are given in the table below. Determine a frequency distribution of these
data.
❐ 1. Illiterate ❐ 4. Technique/College
❐ 2. Primary school ❐ 5. Undergraduate university
❐ 3. Secondary school ❐ 6. Postgraduate
Table 3: Categorical data for frequency distribution

Illiterate  Primary       Primary       Illiterate  Primary    Primary        Primary        Undergraduate
Illiterate  Postgraduate  Illiterate    Tech/College Primary   Primary        Tech/College   Undergraduate
Primary     Postgraduate  Tech/College  Secondary   Secondary  Secondary      Undergraduate  Tech/College
Primary     Tech/College  Postgraduate  Secondary   Primary    Secondary      Secondary      Undergraduate
Primary     Secondary     Tech/College  Secondary   Secondary  Undergraduate  Tech/College   Undergraduate
Solution
Step 1 – List the distinct values of the observations in the data set in the first column of a table.
The distinct values of the observations are illiterate, primary, secondary, technique/college,
undergraduate, and postgraduate, which we list in the first column of Table 4.
Step 2 – For each observation, place a tally mark in the second column of the table in the row of
the appropriate distinct value.
Step 3 – Count the tallies for each distinct value and record the totals in the third column of the
table. Counting the tallies in the second column of Table 4 gives the frequencies in the third
column. The first and third columns of Table 4 provide a frequency distribution for the data in
Table 3.
Table 4: Frequency distribution table for categorical data

What is the highest level of education you have completed?

Category            Tally           Frequency count
Illiterate          ////            4
Primary             //// //// //    12
Secondary           //// ///        8
Technique/College   //// //         7
Undergraduate       //// /          6
Postgraduate        ///             3
Total                               40
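The three tallying steps above can be sketched with Python's collections.Counter. Rather than retyping the raw layout of Table 3, the list below reconstructs the 40 responses from the frequency totals in Table 4:

```python
from collections import Counter

# 40 responses reconstructed from the frequency totals in Table 4
responses = (["Illiterate"] * 4 + ["Primary"] * 12 + ["Secondary"] * 8
             + ["Technique/College"] * 7 + ["Undergraduate"] * 6
             + ["Postgraduate"] * 3)

# Steps 1-3 in one call: Counter lists each distinct value with its frequency
freq = Counter(responses)
for category, count in freq.items():
    print(category, count)
```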

2.2.2 Relative Frequency & Percentage Distribution


In addition to the frequency with which a particular distinct value occurs, we are often interested
in the relative frequency, which is the ratio of the frequency to the total number of observations:

Relative frequency = Frequency / Total number of observations

There are two steps to find relative frequencies:
Step 1 – Obtain a frequency distribution of the data (we obtained one in Table 4).
Step 2 – Divide each frequency by the total number of observations:

Relative frequency of a category = Frequency of category / Total number of observations

Relative F for illiterate = 4 / 40 = 0.1
Relative F for primary = 12 / 40 = 0.3
Relative F for secondary = 8 / 40 = 0.2

Table 5: Relative frequency distribution table

What is the highest level of education you have completed?

Category            Frequency   Relative frequency (%)
Illiterate          4           10
Primary             12          30
Secondary           8           20
Technique/College   7           17.5
Undergraduate       6           15
Postgraduate        3           7.5
Total               40          100
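Dividing each frequency in Table 4 by the 40 total observations reproduces the relative frequencies, and multiplying by 100 gives the percentage column of Table 5; a minimal sketch:

```python
# Frequencies from Table 4
freq = {"Illiterate": 4, "Primary": 12, "Secondary": 8,
        "Technique/College": 7, "Undergraduate": 6, "Postgraduate": 3}
total = sum(freq.values())  # 40 observations

# Relative frequency = frequency / total observations; x 100 gives percent
relative = {cat: f / total for cat, f in freq.items()}
percent = {cat: 100 * f / total for cat, f in freq.items()}
```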

2.2.3 Graphical presentation of qualitative Data


2.2.3.1 Bar Graph
A bar graph is a graph that displays the frequency distribution of a categorical variable, with
the bars placed next to each other for easy comparison. A bar chart is a graphical display of
data that have been classified into a number of categories. Equal-width rectangular bars are
used to represent each category, with the heights of the bars being proportional to the observed
frequency in the corresponding category.
Bar Graph Characteristics
1. Data can be quantitative or categorical
2. Bars can be vertical or horizontal
3. The x-axis represents the category displayed
4. The y-axis represents the quantitative values of the variable being displayed
5. Bars are of uniform width and uniformly spaced
6. A consistent measurement scale is used for each vertical bar
7. Heights of bars represent the values of the variable displayed: the frequency of occurrence or
percentage of occurrence
8. The graph is well-annotated with title, labels for each bar, vertical scale, horizontal categories,
source

[Figure 3: Bar graph of "What is the highest level of education you have completed?" with
percent on the y-axis and the categories Illiterate, Primary, Secondary, Technique/College,
Undergraduate, Postgraduate, and Total on the x-axis]
The other way of presenting the data using a bar graph is:

[Figure 4: Bar graph of "What is the highest level of education you have completed?" with
percentage labels on the bars: Illiterate 10%, Primary 30%, Secondary 20%, Technique/College
18%, Undergraduate 15%, Postgraduate 8%]

Pie Chart
 A pie chart is a disk divided into wedge-shaped pieces proportional to the relative
frequencies of the qualitative data
Another method for organizing and summarizing data is to draw a picture of some kind. The old
saying “a picture is worth a thousand words” has particular relevance in statistics—a graph or
chart of a data set often provides the simplest and most efficient display. Two common methods
for graphically displaying qualitative data are pie charts and bar charts. We begin with pie charts.
To Construct a Pie Chart
Step 1 – Obtain a relative-frequency distribution of the data.
Step 2 – Divide a disk into wedge-shaped pieces proportional to the relative frequencies.
 In this case, we need to divide a disk into six wedge-shaped pieces that comprise
10%, 30%, 20%, 17.5%, 15%, and 7.5% of the disk.
Step 3 – Label the slices with the distinct values and their relative frequencies.
 Notice that we express the relative frequencies as decimals or percentages.
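Step 2 amounts to converting each relative frequency into a wedge angle, since each slice's central angle is its share of the 360° disk. A small sketch using the Table 5 percentages:

```python
# Percentages from Table 5
percent = {"Illiterate": 10, "Primary": 30, "Secondary": 20,
           "Technique/College": 17.5, "Undergraduate": 15, "Postgraduate": 7.5}

# Wedge angle in degrees = (percent / 100) * 360
angles = {cat: p / 100 * 360 for cat, p in percent.items()}
```

The six angles necessarily sum to 360°, which is a quick check that the relative frequencies sum to 1.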

[Figure 5: Pie chart of "What is the highest level of education you have completed?" with slices
Illiterate 10%, Primary 30%, Secondary 20%, Technique/College 18%, Undergraduate 15%,
Postgraduate 8%]

Every pie chart could be made into a bar graph, BUT not all bar graphs can be made into pie
charts.
 Bar graphs are easier to make and to read than pie charts.
 Both pie charts and bar graphs can display the distribution of a categorical variable.
 A bar graph can also compare any set of quantities measured in the same units.
Organizing Quantitative Data using frequency distribution
To organize quantitative data, we first group the observations into classes. Consequently, once
we group the quantitative data into classes, we can construct frequency and relative-frequency
distributions of the data in exactly the same way as we did for qualitative data. Several methods
can be used to group quantitative data into classes. Here we discuss two of the most common
methods: single-value grouping and limit grouping
Single-Value Grouping
In some cases, the most appropriate way to group quantitative data is to use classes in which
each class represents a single possible value. Such classes are called single value classes, and this
method of grouping quantitative data is called single-value grouping.
Practical example 5: Frequency distribution for ungrouped data
Table 7: Test scores taken from first-year students in a statistics class

Score  Sex    Score  Sex    Score  Sex    Score  Sex
6      F      4      F      9      M      7      M
5      M      5      M      2      M      7      M
4      F      5      F      2      M      7      M
9      F      9      F      6      F      4      M
10     F      10     F      7      F      7      F

Based on the table above, answer the following questions.

1. What percentage of respondents are female, and what percentage are male?
2. What percentage of females scored 5 or below?
3. What percentage of males scored 5 or below?
4. What percentage of all respondents scored 5 or below?
5. Prepare a frequency distribution table for the students' test scores.
6. Generate a bar graph to assess the students' test scores.
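Questions 1–4 can be checked with a short Python script; the (score, sex) pairs below are copied row by row from Table 7:

```python
# (score, sex) pairs copied row by row from Table 7
data = [(6, "F"), (4, "F"), (9, "M"), (7, "M"),
        (5, "M"), (5, "M"), (2, "M"), (7, "M"),
        (4, "F"), (5, "F"), (2, "M"), (7, "M"),
        (9, "F"), (9, "F"), (6, "F"), (4, "M"),
        (10, "F"), (10, "F"), (7, "F"), (7, "F")]

n = len(data)                                    # 20 respondents
females = [s for s, sex in data if sex == "F"]   # 11 female scores
males = [s for s, sex in data if sex == "M"]     # 9 male scores

pct_female = 100 * len(females) / n                                  # Q1: 55%
pct_male = 100 * len(males) / n                                      # Q1: 45%
pct_female_low = 100 * sum(s <= 5 for s in females) / len(females)   # Q2: about 27.3%
pct_male_low = 100 * sum(s <= 5 for s in males) / len(males)         # Q3: about 55.6%
pct_all_low = 100 * sum(s <= 5 for s, _ in data) / n                 # Q4: 40%
```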
Constructing Frequency Distribution for Grouped Data
A second way to group quantitative data is to use class limits. With this method, each class
consists of a range of values. The smallest value that could go in a class is called the lower limit
of the class, and the largest value that could go in the class is called the upper limit of the class.
This method of grouping quantitative data is called limit grouping. It is particularly useful when
the data are expressed as whole numbers and there are too many distinct values to employ single-
value grouping.
Terms Used in Grouping Data
 Lower class limit: The smallest value that could go in a class.
 Upper class limit: The largest value that could go in a class.
 Class width: The difference between the lower limit of a class and the lower limit of the
next-higher class.
 Midpoint: The average of the two class limits of a class.
Table 10: Grouped data frequency distribution
Question 1. Construct a frequency table (with class boundaries).
Solution:
Step 1. Find the range:
Range = Highest score – Lowest score = 54 – 18 = 36

Step 2. Determine the class width (i):
Class width (i) = Range / Number of intervals = 36 / 5 = 7.2 (round up to 8)

Step3. List the limits of each class interval

To set the lower and upper boundary 0.5 is subtracted from the lower limit and added to the
upper limit boundary of each class interval. Therefore, the class boundary of the distribution is
organized as follows

Class boundary Frequency


17.5 – 25.5 13
25.5 – 34.5 8
34.5 – 41.5 4
41.5 – 49.5 3
49.5 – 57.5 2
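The three steps can be sketched in Python. With a lowest score of 18, 5 classes, and a width rounded up to 8, the class limits come out as 18–25, 26–33, and so on, matching the intervals used for this same distribution in Chapter Three; each boundary then sits 0.5 beyond its limits:

```python
import math

# Values from the worked example: scores range from 18 to 54, 5 classes wanted
low, high, k = 18, 54, 5

data_range = high - low            # 54 - 18 = 36
width = math.ceil(data_range / k)  # 36 / 5 = 7.2, rounded up to 8

# Step 3: class limits, then boundaries 0.5 below/above each limit
limits = [(low + i * width, low + i * width + width - 1) for i in range(k)]
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]
```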

CHAPTER THREE
MEASURES OF CENTRAL TENDENCY
 Central tendency is a statistical measure that determines a single value that accurately
describes the center of the distribution and represents the entire distribution of scores.
 The goal of central tendency is to identify the single value that is the best representative
for the entire set of data.
 Central tendency serves as a descriptive statistic because it allows researchers to describe
or present a set of data in a very simplified, concise form.
Characteristics of a good measure of central tendency
A measure of central tendency is a single value representing a group of values and hence is
supposed to have the following properties.
1. Easy to understand and simple to calculate
A good measure of central tendency must be easy to comprehend, and the procedure involved in
its calculation should be simple.
2. Based on all items
A good average should consider all items in the series.
3. Rigidly defined
A measure of central tendency must be clearly and properly defined. It is better if it is
algebraically defined so that personal bias can be avoided in its calculation.
4. Capable of further algebraic treatment
A good average should be usable for further calculations.
5. Not unduly affected by extreme values
A good average should not be unduly affected by extreme or extraordinary values in a series.
The most common measures of central tendency are:
3.1. The mean
 The sum of all the data entries divided by the number of entries.
 The mean, also known as the arithmetic average, is found by adding the values of the
data and dividing by the total number of values.
 The mean is the sum of the values, divided by the total number of values.
3.1.1. Properties of Mean
 It is simple to understand and easy to calculate
 It takes into account all the items of the series
 It is rigidly defined and is mathematical in nature
 It is relatively stable
 It is capable of further algebraic treatment
 Mean is the center in balancing the values on either side of it and hence is more typical
 The mean is sensitive to the exact value of all the scores in the distribution
 The sum of the deviations about the mean equals zero
3.1.2 Computing Means of Ungrouped Data

x̄ = (sum of all x) / (number of x) = Σx / n

Example: The following data represent the ages of 20 students in a statistics class. Calculate the
mean age of the students.

20 20 20 20 20 20 21
21 21 21 22 22 22 23
23 23 23 24 24 65
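A sketch of the calculation; note how the single extreme age of 65 pulls the mean above every other score in the list, illustrating the mean's sensitivity to extreme values:

```python
import statistics

# Ages of the 20 students from the example
ages = [20, 20, 20, 20, 20, 20, 21,
        21, 21, 21, 22, 22, 22, 23,
        23, 23, 23, 24, 24, 65]

mean_age = sum(ages) / len(ages)            # 475 / 20 = 23.75
assert mean_age == statistics.mean(ages)    # same result from the stdlib helper
```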

3.1.3 Computing Mean for Grouped data


Finding the Mean of a Frequency Distribution
Example: The following data represent the ages of 30 students in a statistics class.
Construct a frequency distribution that has five classes.

Step 1 – Prepare the class intervals or boundaries.
Step 2 – Find the midpoint of each class.
Step 3 – Find the sum of the products of the midpoints and the frequencies.
Step 4 – Find the sum of the frequencies.
Step 5 – Find the mean of the frequency distribution.

Class Frequenc Cumulative Midpoints


(x∙f)
interval y Frequency (x)
18 – 25 13 13 21.5 279.5
26 – 33 8 21 29.5 236
34 – 41 4 25 37.5 150
42 – 49 3 28 45.5 136.5
50 – 57 2 30 53.5 107
N = 30 Σ(x∙f) =
909
Then, Mean=
∑ ( f . x ) = 909 = 30.3
N 30

Therefore, the average age of students is 30.3 years
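Steps 3–5 reduce to one line of arithmetic once the midpoints are known; a sketch:

```python
# Midpoints and frequencies from the grouped frequency distribution above
midpoints = [21.5, 29.5, 37.5, 45.5, 53.5]
freqs = [13, 8, 4, 3, 2]

n = sum(freqs)                                      # Step 4: N = 30
fx = sum(f * x for f, x in zip(freqs, midpoints))   # Step 3: sum of f*x = 909.0
mean = fx / n                                       # Step 5: 909 / 30 = 30.3
```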


Example 2: Calculate the mean of the following data set.

   Class Interval   Frequency (F)   Cumulative Frequency   Midpoint (x)   FX
1  9.5 – 14.5       1               1                      12             12
2  14.5 – 19.5      1               2                      17             17
3  19.5 – 24.5      2               4                      22             44
4  24.5 – 29.5      7               11                     27             189
5  29.5 – 34.5      3               14                     32             96
6  34.5 – 39.5      2               16                     37             74
7  39.5 – 44.5      4               20                     42             168

Mean = Σfx / N = 600 / 20 = 30

3.2 The Median

 The median is a point in the data set above and below which half of the cases fall.
 The median of a data set is the measure of center that is the middle value when the
original data values are arranged in order of increasing (or decreasing) magnitude.
 The median is the middle score of a data set when the scores are organized from
smallest to largest.
 The median is a number or score that precisely divides a distribution of data in half:
fifty percent of a distribution's observations fall above the median and fifty percent
fall below it.
 The middle number in an ordered set of numbers; it divides the data into two equal
parts.

3.2.1 Properties of the Median

 The median can be used for calculations involving ordinal, interval, or ratio-scale
data.
 More difficult to compute because the data must be sorted.
 The best average for ordinal data.
 Unaffected by extreme data.
3.2.2 Computing the Median of Ungrouped Data
 If a data set has an odd number of values, the median falls exactly on the middle number.
 If a data set has an even number of values, the median is the average of the two middle values.
For an odd number of scores, here is a data set of 15 scores to consider:
26 32 21 12 15 11 27 16 18 21 19 28 10 13 31
Step 1: To calculate the median, arrange the scores from lowest to highest:
10 11 12 13 15 16 18 19 21 21 26 27 28 31 32
Step 2: The location of the median can be found by taking the middle value or using a
simple formula:
Median position = (N + 1) / 2 = (15 + 1) / 2 = 8
The 8th score in the ordered list is 19, so the median is 19.
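Both the odd and even cases are handled by Python's statistics.median; for the 15 scores above, the position formula points at the 8th ordered score:

```python
import statistics

scores = [26, 32, 21, 12, 15, 11, 27, 16, 18, 21, 19, 28, 10, 13, 31]

ordered = sorted(scores)                    # Step 1: arrange lowest to highest
position = (len(ordered) + 1) // 2          # Step 2: (15 + 1) / 2 = 8th score
median = ordered[position - 1]              # the 8th ordered score is 19
assert median == statistics.median(scores)  # same result from the stdlib helper
```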
3.1.3 Computing Median for Grouped data
Based on the following frequency distribution, answer the questions given below the data.

Cl. Interval Freq.(F) Cum. Frequency FX Midpoints (x)


1 9.5 – 14.5 1 1 12 12
2 14.5 – 19.5 1 2 17 17
3 19.5 – 24.5 2 4 44 22
4 24.5 – 29.5 7 11 189 27
5 29.5 – 34.5 3 14 96 32
6 34.5 – 39.5 2 16 74 37
7 39.5 – 44.5 4 20 168 42

Questions
Calculate the median of the frequency distribution.
There are steps for the calculation of the median in a frequency distribution:
Step 1: Construct the cumulative frequency column.
Step 2: Find n/2 to help identify the median class: n/2 = 20/2 = 10.
Step 3: Find in the cumulative frequency column the first value greater than n/2; the corresponding class interval is called the median class.
Step 4: Calculate the median of the distribution:

Median = L + ((n/2 − m)/f) × c

Where: n = the total number of scores
L = the lower limit of the median class
m = the cumulative frequency before the median class
f = the frequency of the median class
c = the class width
The value n/2 = 10 lies between the cumulative frequencies 4 and 11. The cumulative frequency 4 corresponds to the class boundary 24.5 and 11 corresponds to 29.5, so the median class is 24.5 – 29.5 and its lower limit is 24.5. Here L = 24.5, n = 20, f = 7, c = 5, m = 4.

Median = 24.5 + ((10 − 4)/7) × 5 = 24.5 + (6/7) × 5 = 24.5 + 4.29 ≈ 28.79
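The grouped-median steps can be sketched in Python; the lists below re-enter the frequency table from the notes (Python is used only for illustration):

```python
lower_limits = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
freqs        = [1, 1, 2, 7, 3, 2, 4]
c = 5                                 # class width

n = sum(freqs)                        # 20
half = n / 2                          # 10

# Walk the classes until the cumulative frequency first exceeds n/2;
# that class is the median class, and m is the cumulative frequency before it.
m = 0
for L, f in zip(lower_limits, freqs):
    if m + f > half:
        break
    m += f

median = L + (half - m) / f * c       # 24.5 + ((10 - 4)/7) * 5
print(round(median, 2))               # 28.79
```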
3.3 The Mode
 The mode is the most frequently occurring value or category in a data set.
 It is merely the most common score or the most frequent category of scores.
3.3.1 Properties of Mode
 The mode can be applied to any category of data.
 The mode is the only measure that applies to nominal (category) data as well as numerical score data.
 A data set can have a single mode, no mode, or more than one mode.
 best average for nominal data
 easy to determine
 When two data values occur with the same greatest frequency, each one is a mode and the data set is bimodal.
 When more than two data values occur with the same greatest frequency, each is a mode and the data set is said to be multimodal.
 When no data value is repeated, we say that there is no mode.
3.3.2 Computing Mode of Ungrouped Data
 Identify the number that occurs most often.
 Organizing the data into a frequency distribution helps to identify the most frequent score.
For example: 10 11 12 13 15 16 18 19 21 21 26 27 28 31 32. Here 21 occurs twice while every other value occurs once, so 21 is the mode of the data set.
3.3.3 Computing Mode for Grouped Data
Based on the following frequency distribution, answer the questions given below the data.
    Cl. Interval    Freq. (f)   Cum. Frequency   Midpoint (x)   fx
1   9.5 – 14.5      1           1                12             12
2   14.5 – 19.5     1           2                17             17
3   19.5 – 24.5     2           4                22             44
4   24.5 – 29.5     7           11               27             189
5   29.5 – 34.5     3           14               32             96
6   34.5 – 39.5     2           16               37             74
7   39.5 – 44.5     4           20               42             168
Questions
Calculate the mode of the frequency distribution.
Step 1: Identify the modal class. The modal class is easier to identify than the median class: it is the class with the highest frequency in the distribution. Here the modal class is 24.5 – 29.5 (f = 7).
Step 2: Calculate the mode of the distribution:

Mode = L + (fs/(fs + fp)) × c

Where: L = the lower limit of the modal class
fs = the frequency of the class following the modal class
fp = the frequency of the class preceding the modal class
c = the class width
Here L = 24.5, fs = 3, fp = 2, c = 5.

Mode = 24.5 + (3/(3 + 2)) × 5 = 24.5 + (3/5) × 5 = 24.5 + 3 = 27.5
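The grouped-mode formula used in these notes can be sketched in Python with the same frequency table (note that other textbooks use a slightly different grouped-mode formula; this sketch follows the notes' version):

```python
lower_limits = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
freqs        = [1, 1, 2, 7, 3, 2, 4]
c = 5                                 # class width

i = freqs.index(max(freqs))           # modal class index (f = 7 -> 24.5-29.5)
fs = freqs[i + 1]                     # frequency of the class after  (3)
fp = freqs[i - 1]                     # frequency of the class before (2)

mode = lower_limits[i] + fs / (fs + fp) * c
print(mode)                           # 27.5
```

The `i + 1` / `i - 1` indexing assumes the modal class is not the first or last class, as is the case here.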

Class work

Test scores taken from first-year students in a statistics class

Score  Sex    Score  Sex    Score  Sex    Score  Sex
6      F      4      F      9      M      7      M
5      M      5      M      2      M      7      M
4      F      5      F      2      M      7      M
7      M      6      M      7      F      9      F
7      M      7      M      4      F      10     M
7      F      8      M      5      M      5      M
9      F      9      F      6      F      4      M
10     F      10     F      7      F      7      F
10     M      2      M      8      M      7      F
2      F      8      F      9      F      6      M
Based on the above table, answer the following questions:

1. What is the average test result of the sample?


2. What is the median test result of the sample?
3. What is the mode of the sample?
4. What percentage of the sample indicated that they had a problem with their
academic achievement?

UNIT FOUR
4. MEASURES OF DISPERSION/ VARIATION
Measures of variability provide information about the amount of spread or dispersion among the
variables. Range, variance, and standard deviation are the common measures of variability.
4.1 Range, standard deviation and variance
Range
 Simply the difference between the largest and smallest values in a set of data
 Considered a primitive measure because it uses only the two extreme values, which may not be useful indicators of the bulk of the distribution
 Can be used for ordinal data
The formula is: Range = highest score − lowest score
Standard deviation
 Measures the variation of observations from the mean
 Is the positive square root of the variance
 The most common measure of dispersion
 Takes into account every observation
 Measures the ‘average deviation’ of observations from the mean
 Used on ratio or interval data
 Values close together have a small standard deviation, but values with much more variation have a larger standard deviation.
 For many data sets, a value is unusual if it differs from the mean by more than two standard deviations.
Steps in Calculating Standard Deviation

For example, the following are assessment scores of 10 students in Abnormal Psychology (the raw scores are not reproduced in the notes). Calculate the variance and standard deviation of the data set, given that the sum of squared deviations is Σ(x − x̄)² = 88.5 and n = 10.

Sample variance:
s² = Σ(x − x̄)²/(n − 1) = 88.5/(10 − 1) = 9.83
SD = √9.83 ≈ 3.14

Or, using the computational formula:
s² = [nΣx² − (Σx)²] / [n(n − 1)] = 9.83
SD = √9.83 ≈ 3.14
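Both variance formulas can be sketched in Python. The raw scores here are illustrative (they are not the original data, which the notes do not reproduce); the point is that the definitional and computational formulas agree:

```python
import math
import statistics

# Illustrative sample (not the original data from the notes)
x = [4, 8, 6, 5, 3, 7, 9, 5, 6, 7]
n = len(x)
mean = sum(x) / n                     # 6.0

# Definitional formula: s^2 = sum((x - mean)^2) / (n - 1)
ss = sum((xi - mean) ** 2 for xi in x)
var_def = ss / (n - 1)

# Computational formula: s^2 = (n*sum(x^2) - (sum(x))^2) / (n*(n - 1))
var_comp = (n * sum(xi ** 2 for xi in x) - sum(x) ** 2) / (n * (n - 1))

sd = math.sqrt(var_def)               # standard deviation
print(var_def, var_comp, sd)
print(statistics.variance(x), statistics.stdev(x))  # library check
```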

Variance
 Is the sum of the squared deviations of each value from the mean, divided by the number of observations (minus one, for a sample)
 Is the mean of the squared differences between the scores and the mean
 Used on ratio or interval data
 Used for advanced statistical analysis
Symbolically, the sample variance is s² and the population variance is σ².
Classwork – Test scores: 6, 3, 8, 5, 3. Find the variance.

4.3 Measures of position


Measures of position tell where a specific data value falls within the data set or its relative
position in comparison with other data values.
4.3.1 Quartiles & Interquartile Range
Interquartile Range
 Measures the range of the middle 50% of the values only
 Is defined as the difference between the upper and lower quartiles
 Interquartile range = upper quartile - lower quartile
= Q3 - Q1
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position = Q1 = (n+1)/4 ranked value
Second quartile position = Q2 = (n+1)/2 ranked value
Third quartile position = Q3 = 3(n+1)/4 ranked value
Where n is the number of observed values
For example, a sample of ordered data: 11 12 13 16 16 17 18 21 22 (n = 9)
Q1 is in the (9 + 1)/4 = 2.5 position of the ranked data, so Q1 = (12 + 13)/2 = 12.5
Q2 is in the (9 + 1)/2 = 5th position of the ranked data, so Q2 = median = 16
Q3 is in the 3(9 + 1)/4 = 7.5 position of the ranked data, so Q3 = (18 + 21)/2 = 19.5
Then, interquartile range = Q3 − Q1 = 19.5 − 12.5 = 7
When calculating the ranked position use the following rules
 If the result is a whole number then it is the ranked position to use
 If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.) then average the two
corresponding data values.
 If the result is not a whole number or a fractional half then round the result to the
nearest integer to find the ranked position.
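The quartile positions and the fractional-half rule above can be sketched in Python (the helper name `value_at` is ours; the rounding rule for other fractions is omitted since it is not needed for these positions):

```python
data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # ordered sample, n = 9
n = len(data)

def value_at(pos):
    """Value at a 1-based ranked position; averages the two neighbours
    when the position is a fractional half (e.g. 2.5 or 7.5)."""
    if pos == int(pos):
        return data[int(pos) - 1]
    return (data[int(pos) - 1] + data[int(pos)]) / 2

q1 = value_at((n + 1) / 4)        # position 2.5 -> (12 + 13)/2 = 12.5
q2 = value_at((n + 1) / 2)        # position 5   -> 16
q3 = value_at(3 * (n + 1) / 4)    # position 7.5 -> (18 + 21)/2 = 19.5
iqr = q3 - q1
print(q1, q2, q3, iqr)            # 12.5 16 19.5 7.0
```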

4.3.3 Percentiles
 Percentiles are measures of location, denoted P1, P2, …, P99, which divide a set of data into 100 groups with about 1% of the values in each group.
 Percentiles are merely a form of cumulative frequency distribution, but instead of being expressed in terms of accumulating scores from lowest to highest, the categorisation is in terms of whole-number percentages of people.
 The kth percentile is the score which k% of the scores equal or fall below.
 Percentiles are examined to find cut-off points in a given data set.
For example, a score at the 80th percentile means 80% of scores are equal to that score or less (e.g. 80% of scores are equal to 61 or less).
For example, the 50th percentile, denoted P50, has about 50% of the data values below it and about 50% of the data values above it, so the 50th percentile is the same as the median. There is no universal agreement on a single procedure for calculating percentiles, but we will describe two relatively simple procedures:
(1) Finding the percentile of a data value,

(number of data values below X )


Percentile Rank= ∗100 %
total number of values

x
p= ∗100
n
Sorted data = 3, 4, 5, 6, 7, 9, 12, 15, 20, 22, 23, 24, 25
Find the percentile of 22
P = n < X/ N* 100% = 9/13 * 100 = 69.23 = 70% - Then, the above, 70% of students scored
22 and below or only 30 % of students scored above score 22
(2) Converting a percentile to its corresponding data value.
Examples:
P30 is the value that divides the lowest 30% of the data from the highest 70% of the data.
P70 divides the lowest 70% of the data from the highest 30% of the data.
The locator formula is L = (k/100) × n, where:
n = the total number of values in the data set (here n = 13)
k = the percentile being used (e.g. k = 25 for the 25th percentile)
L = the locator that gives the position of the value in the sorted list (a fractional locator is rounded up to the next position)
Find the value of the 25th percentile:
L = (25/100) × 13 = 3.25, so use the 4th position: the 25th percentile is 6. This shows that 25% of students scored 6 or below.
Find the value of the 50th percentile:
L = (50/100) × 13 = 6.5, so use the 7th position: the 50th percentile is 12.
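Both percentile procedures can be sketched in Python (function names are ours; the whole-number-locator case, where some texts average two adjacent values, is omitted for simplicity):

```python
import math

data = sorted([3, 4, 5, 6, 7, 9, 12, 15, 20, 22, 23, 24, 25])
n = len(data)                          # 13

def percentile_rank(x):
    """Percent of values in the data set falling below x."""
    below = sum(1 for v in data if v < x)
    return below / n * 100

def percentile_value(k):
    """Value at the k-th percentile: locator L = (k/100)*n,
    with a fractional locator rounded up to the next position."""
    loc = k / 100 * n
    return data[math.ceil(loc) - 1]

print(round(percentile_rank(22), 2))   # 69.23
print(percentile_value(25))            # locator 3.25 -> 4th value = 6
print(percentile_value(50))            # locator 6.5  -> 7th value = 12
```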

4.3.5. Z-score
 Z-scores are merely scores expressed in terms of the number of standard units of measurement (standard deviations) they are from the mean of the set of scores.
 A z-score (or standardized value) is found by converting a value to a standardized scale. It is the number of standard deviations that a given value x is above or below the mean:

z = (x − mean) / standard deviation

 By the range rule of thumb, a value is “unusual” if it is more than 2 standard deviations away from the mean. It follows that unusual values have z-scores less than −2 or greater than +2.
 A positive z-score means that a score is above the mean.
 A negative z-score means that a score is below the mean.
 A z-score of 0 means that a score is exactly equal to the mean.
For example
A student scored a 65 on a math test that had a mean of 50 and a standard deviation of 10. She
scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative
position on the two tests.
Solution

Math: z = (65-50)/10= 15/10 = 1.5


History: z = (30-25)/5 = 5/5 = 1
The student did better in math because the z-score was higher
Example 2
Find the z-score for each test and state which test is better.
Test A: score = 38, mean = 40, SD = 5
Test B: score = 94, mean = 100, SD = 10
Test A: z = (38 − 40)/5 = −0.4
Test B: z = (94 − 100)/10 = −0.6
The z-score for Test A is higher (less negative), therefore Test A is better: it has the higher relative position.
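Both worked examples can be reproduced with a one-line standardization function in Python (a sketch for illustration only):

```python
def z_score(x, mean, sd):
    """Number of standard deviations a raw score lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Example 1: compare relative standing across two different tests
math_z = z_score(65, 50, 10)      # 1.5
history_z = z_score(30, 25, 5)    # 1.0 -> the math score has the better standing

# Example 2: both scores fall below their means; the less negative z is better
test_a = z_score(38, 40, 5)       # -0.4 (Test A)
test_b = z_score(94, 100, 10)     # -0.6 (Test B) -> Test A has the better position
print(math_z, history_z, test_a, test_b)
```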
CHAPTER FIVE
MEASURES OF RELATIONSHIP
The correlation coefficient, r, measures the strength of the linear relationship between two paired variables in a sample. Pearson correlation or Spearman correlation is used when you want to explore the strength of the relationship between two variables. This gives you an indication of both the direction (positive or negative) and the strength of the relationship. A positive correlation indicates that as one variable increases, so does the other; a negative correlation indicates that as one variable increases, the other decreases.

5.1 Characteristics of Associations between Variables


1) Main research questions are stated as null hypotheses, e.g., no relationship exists between absenteeism and GPA (Does absenteeism increase with GPA?)
i. H0: r = 0 (there is no relationship)
ii. H1: r ≠ 0 (there is a relationship)
2) In simple correlation, there are two measures for each individual in the sample.
3) For Pearson correlation, there must be at least 30 individuals in the study.
4) Can be used to measure the degree of relationships, not simply whether a relationship
exists.
5) A perfect positive correlation is +1.00.
6) A perfect negative (inverse) correlation is −1.00.
7) A correlation of 0 indicates that no linear relationship exists.
8) Correlation does not imply CAUSATION!
9) The magnitude or numerical value of a correlation expresses the strength of the relationship between the two variables.
10) The sign of a correlation coefficient indicates the direction of the relationship between the two variables.
11) Positive correlation – a direct, positive relationship between two variables; as one variable increases, the other variable increases.
12) Negative correlation – an inverse, negative relationship between two variables; as one variable increases, the other variable decreases.
How do you interpret values between 0 and 1?
Different authors suggest different interpretations; however, Cohen (1988, pp. 79–81) suggests the following guidelines:

Category   Positive            Negative
Small      r = 0.10 to 0.29    r = −0.10 to −0.29
Medium     r = 0.30 to 0.49    r = −0.30 to −0.49
Large      r = 0.50 to 1.0     r = −0.50 to −1.0

No linear relationship: r = 0
The Pearson Correlation Coefficient
The Pearson r is used to advance research beyond the arena of descriptive statistics. Specifically, the Pearson ‘‘r’’ enables investigators to assess the nature of the association between two variables, X and Y.
The Pearson r, a correlation coefficient, is a statistic that quantifies the extent to which two variables X and Y are associated, and whether the direction of their association is positive, negative, or zero.
A positive correlation is one where, as the value of X increases, the corresponding value of Y also increases; similarly, when the value of X decreases, the value of Y also decreases.
A negative correlation identifies an inverse relationship between variables X and Y – as the value of one increases, the other decreases.
A zero correlation indicates that there is no pattern or predictive linear relationship between the behavior of variables X and Y.
Assumptions of the Pearson product-moment correlation
 It requires two continuous (score) variables
 Each participant should have two measurements
 The number of participants should be greater than 30
 The distribution should be symmetric or normal
Linear Relationships and Scatterplots of Variables X and Y
 Greater linearity in a scatter plot's points indicates a stronger correlation.
 Plotting the X and Y pairs on scatterplot is a good way to visualize a correlational
relationship
 When variables X and Y result in an r of +1.00 or -1.00, they are said to exhibit a perfect
linear relationship. By "linear;' we mean that the relationship between the two variables is
best represented by a straight line on a diagram. The diagram of choice for plotting
variables is called a scatter plot or scatter diagram.
 A scatter plot is a particular graph used to present correlational data. Each point in a
scatter plot represents the intersection of an X value with its corresponding Y value
[Scatter plot figure not reproduced. Exercise: identify the type of correlation shown in the scatter plot.]
Computing the Correlation Coefficient

For example:

Absent (X)   Ac. Achievement (Y)   XY    X²    Y²
0            8                     0     0     64
2            10                    20    4     100
3            4                     12    9     16
6            6                     36    36    36
9            1                     9     81    1
10           3                     30    100   9
Σ = 30       Σ = 32                107   230   226

r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
  = [6(107) − (30)(32)] / √{[6(230) − 30²][6(226) − 32²]}
  = −318/√(480 × 332) = −318/399.2 = −.797
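The computational formula for Pearson's r can be sketched in Python using the X and Y columns from the table above (a sketch for illustration, not part of the notes):

```python
import math

x = [0, 2, 3, 6, 9, 10]      # absences
y = [8, 10, 4, 6, 1, 3]      # academic achievement

n = len(x)
sx, sy = sum(x), sum(y)                  # 30, 32
sxy = sum(a * b for a, b in zip(x, y))   # 107
sxx = sum(a * a for a in x)              # 230
syy = sum(b * b for b in y)              # 226

# Computational formula for the Pearson correlation coefficient
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))                       # -0.797
```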
Consider the following table, where the number of participants in the study is 300:

Correlations
                                   Absenteeism   GPA
Absenteeism   Pearson Correlation  1             −.797
              Sig. (2-tailed)                    .000
              N                    300           300
GPA           Pearson Correlation  −.797         1
              Sig. (2-tailed)      .000
              N                    300           300
There was a strong, negative correlation between the two variables, r = −.797, n = 300, p < .0005, with high levels of absenteeism associated with lower levels of GPA. It implies that the relationship is negative and significant: as absenteeism increases, GPA decreases.

Class work

Students  1  2  3  4  5  6  7  8  9   10  11  12  13  14  15  16  17
Test 1    6  6  5  4  7  4  4  3  6   10  6   6   4   8   12  12  11
Test 2    8  4  8  2  4  8  2  5  10  10  10  8   7   12  11  10  9

Calculate the relationship between Test 1 and Test 2, check its significance, and interpret it.
5.2 Spearman’s rho Correlation Coefficient
When to use:
- There are ranked data for variable A and variable B
- The data are skewed away from the normal distribution
- N is less than 30
The Pearson correlation coefficient is the dominant correlation index in psychological statistics.
There is another called Spearman’s rho which is not very different. Instead of taking the scores
directly from your data, the scores on a variable are ranked from smallest to largest. That is, the
smallest score on variable X is given rank 1, the second smallest score on variable X is given
rank 2, and so forth. The smallest score on variable Y is given rank 1, the second smallest score
on variable Y is given rank 2, etc. Then Spearman’s rho is calculated like the Pearson correlation
coefficient between the two sets of ranks as if the ranks were scores.
A special procedure is used to deal with tied ranks. Sometimes certain scores on a variable are identical. There might be two or three people who scored 7 on variable X, for example. This situation is described as tied scores or tied ranks. The question is what to do about them. The conventional answer in psychological statistics is to pretend first of all that the tied scores can be separated by fractional amounts. Then we allocate the appropriate ranks to these ‘separated’ scores but give each of the tied scores the average rank that they would have received if they could have been separated.
The two scores of 5 are each given the rank 2.5 because, if they were slightly different, they would have been given ranks 2 and 3, respectively. But they cannot be separated, so we average the ranks as follows:

Ranking of a set of scores when tied (equal) scores are involved:

Score  4  5    5    6  7  8  9  9  9  10
Rank   1  2.5  2.5  4  5  6  8  8  8  10

In the above table there are two scores of 5; these tied scores are given the average rank:
 (2 + 3)/2 = 2.5
 (7 + 8 + 9)/3 = 8
There are three scores of 9, which would have been allocated the ranks 7, 8 and 9 if the scores had been slightly different from each other. These three ranks are averaged to give an average rank of 8, which is entered as the rank for each of the three tied scores.

Participants     1    2      3    4    5      6    7    8      9    10
Test 1 (MA)      8    3      9    7    2      3    9    8      6    7
Rank 1           7.5  2.5    9.5  5.5  1      2.5  9.5  7.5    4    5.5
Test 2 (MUA)     2    6      4    5    7      7    2    3      5    4
Rank 2           1.5  8      4.5  6.5  9.5    9.5  1.5  3      6.5  4.5
Difference (D)   6    5.5    5    1    8.5    7    8    4.5    2.5  1
D²               36   30.25  25   1    72.25  49   64   20.25  6.25 1

ΣD² = 305

Step1 – State the hypothesis


H0: r = 0 (there is no relationship)
H1: r ≠ 0 (there is a relationship)
Step 2 – Find df = n − 2 = 10 − 2 = 8
Step 3 – Find the critical value: for df = 8 at the 0.05 level, the critical value is 0.738
Step 4 – Compute the Spearman rho correlation coefficient:

r (Spearman) = 1 − 6ΣD²/[n(n² − 1)] = 1 − (6 × 305)/[10(10² − 1)] = 1 − 1830/990 = −0.85

Step 5 – Make a decision about H0 and H1
Since the absolute value of the calculated r, |−0.85| = 0.85, is greater than the table/critical value 0.738 at the 0.05 significance level, H0 is rejected and H1 is accepted.
Step 6 – Reporting or interpretation
This finding implies that the Spearman correlation coefficient provides evidence that musical ability was significantly and inversely related to mathematical ability (r = −0.85, p < 0.05).
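The tied-ranking rule and the rho formula can be sketched in Python with the musical/mathematical ability data above (function names are ours; the O(n²) ranking is fine for small samples):

```python
def ranks(scores):
    """Rank scores from 1..n, giving tied scores the average of the
    rank positions they would otherwise occupy."""
    ordered = sorted(scores)
    result = []
    for s in scores:
        first = ordered.index(s) + 1           # first rank position of s
        last = first + ordered.count(s) - 1    # last rank position of s
        result.append((first + last) / 2)      # tied scores share the average
    return result

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of D^2
    n = len(x)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

test1 = [8, 3, 9, 7, 2, 3, 9, 8, 6, 7]   # musical ability (MA)
test2 = [2, 6, 4, 5, 7, 7, 2, 3, 5, 4]   # mathematical ability (MUA)
print(round(spearman_rho(test1, test2), 2))   # -0.85
```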
Classwork exercise
1. A researcher wants to investigate the relationship between study time per hour and levels of perceived stress. Data were collected from a sample of 10 students, as indicated below:

No   Study time   Stress
1    3            10
2    5            13
3    0            15
4    0            14
5    1            12
6    2            11
7    5            10
8    4            14
9    6            15
10   3            16

2. Is there a relationship between students’ study time per hour and their levels of perceived stress?
3. Do people with higher study time have lower or higher levels of perceived stress?

CHAPTER SIX
HYPOTHESIS TESTING
6.1 Concepts of Hypothesis Testing
 Hypothesis is usually considered as the principal instrument in research.
 Many experiments are carried out with the deliberate object of testing hypotheses.
 In social science, where direct knowledge of population parameter(s) is rare, hypothesis testing is the strategy used for deciding whether sample data offer support for a generalization or not.
 Hypothesis testing enables us to make probability statements about population
parameter(s).
6.2 What is a Hypothesis?
 A hypothesis simply means a mere assumption or some supposition to be proved or disproved. But for a researcher, a hypothesis is a formal question that he or she intends to resolve. A research hypothesis is a predictive statement, capable of being tested by scientific methods, that relates an independent variable to some dependent variable.
 For example, consider the following statement:
 A. “Students who receive counseling will show a greater increase in creativity than students not receiving counseling.”
Typically, in hypothesis testing, we have two options to choose from. These are termed as null
hypothesis and alternative hypothesis.
NULL HYPOTHESIS VS ALTERNATIVE HYPOTHESIS
 Null hypothesis: Is a statistical hypothesis that assumes that the observation is due to a chance factor. In hypothesis testing, the null hypothesis is denoted by H0, e.g. H0: μ1 = μ2, which states that there is no difference between the two population means.
 The null hypothesis always states that there is no effect in the underlying population. By
effect we might mean a relationship between two or more variables, a difference between
two or more different populations or a difference in the responses of one population
under two or more different conditions.
 Alternative Hypothesis (H1) - a hypothesis to be considered as an alternative to the null
hypothesis.
Examples of null hypotheses (H0): in the above example
 There is no relationship between study time and exam grade
 There is no difference between female and male participants in exam results
 There is no difference in exam results after the participants take training on study skills

We use the symbol H1 (or Ha) to represent the alternative hypothesis.


 Alternative hypothesis shows that observations are the result of a real effect.
 The alternative hypothesis (H1) is a statement that the null hypothesis is not true.
 It is the statement that must be true if the null hypothesis is false.
Examples of alternative hypotheses (H1): in the above example
 There is a relationship between study time and exam grade
 There is a difference between female and male participants in exam results
 There is a difference in exam results after the participants take training on study skills

6.4 Directional & Non – directional Hypothesis


 Directional hypotheses
 Consider a study investigating the relationship between the number of hours spent studying per week and final examination grade.
 We made the prediction (hypothesized) that, as hours of study increased, so would exam grades. This is said to be a directional hypothesis.
 We have specified the exact direction of the relationship between the two variables: as study hours increase, so do exam grades. This is also called a one-tailed hypothesis.
 Non-directional hypotheses
 In some studies, we are not sure of the exact nature of the relationship.
 Suppose we wish to examine the relationship between anxiety and memory.
 In making such a prediction, we expect that there will be a relationship, but we are not sure whether memory increases or decreases as anxiety increases.
 Therefore we would want to predict only that there is a relationship between the two variables, without specifying the exact nature of this relationship. This is called a two-tailed hypothesis.
6.5 Types of Error in Statistical Decision Making: Type I and Type II Errors
Type I error
 Suppose we conducted some research and found that the probability of obtaining the observed effect by chance is small, so we reject the null hypothesis.
 In a study, the null hypothesis says there is no relationship between the length of hair in males and the number of criminal offences committed.
 If, in reality, no such relationship exists, we have made a Type I error when we conclude that we have support for our prediction that there is a relationship between length of hair in males and number of criminal offences committed.
• A Type I error occurs when the sample data appear to show a treatment effect when, in
fact, there is none.
• In this case the researcher will reject the null hypothesis and falsely conclude that the
treatment has an effect.
• Type I errors are caused by unusual, unrepresentative samples. Just by chance the
researcher selects an extreme sample with the result that the sample falls in the critical
region even though the treatment has no effect.
• The hypothesis test is structured so that Type I errors are very unlikely; specifically, the
probability of a Type I error is equal to the alpha level.
Type II Errors
• A Type II error occurs when the sample data do not appear to show a treatment effect when, in fact, there is one.
• In this case, the researcher will fail to reject the null hypothesis and falsely conclude that the treatment does not have an effect.
• Type II errors are commonly the result of a very small treatment effect. Although the treatment does have an effect, it is not large enough to show up in the research study.

6.6 Significance Level (p-value)


 A big difference in mean scores between conditions may be due to the predicted effects
of the independent variable rather than random variability. But there is always a specific
probability that the differences in scores are caused by total random variability. So there
can never be 100 percent certainty that the scores in an experiment are due to the effects
of selecting the independent variable.
 Statistical tests calculate probabilities that results are significant. Statistical tables provide
probabilities that any differences in scores are due to random variability, as stated by the
null hypothesis. This means that the less probable it is that any differences are due to
random variability, the more justification there is for rejecting the null hypothesis. This is
the basis of all statistical tests. Statistical tables give the probability that scores in an
experiment occur on a random basis.
 If the probability that the scores are random is very low, then you can reject the null
hypothesis that the differences are random. Instead you can accept the research
hypothesis that the experimental results are significant, that is, that they are not likely to
be random. Strictly speaking, the only conclusion from looking up probabilities in
statistical tables is that they justify rejecting the null hypothesis. But you will find that, if
the null hypothesis can be rejected, psychological researchers usually claim that the
results provide support for the predictions in the research hypothesis.
 There is always a probabilistic component involved in the accept–reject decision in testing hypotheses. The criterion that is used for accepting or rejecting a null hypothesis is called the significance level or p-value. The p-value represents the probability of concluding (incorrectly) that there is a difference in your samples when no true difference exists.
 It is a statistic calculated by comparing the distribution of a given sample data and an
expected distribution (normal, F, t etc.) and is dependent upon the statistical test being
performed.
 For example, if two samples are being compared in a t-test, a p-value of 0.05 means that
there is only a 5% chance of arriving at the calculated t-value if the samples were not
different (from the same population).
 In other words, a p-value of 0.05 means there is only a 5% chance that you would be
wrong in concluding that the populations are different or 95% confident of making a right
decision. For social sciences research, a p-value of 0.05 is generally taken as standard.
 In psychology (possibly because it is thought that nothing too terrible can happen as a result of accepting a result as significant!) there is a convention to accept probabilities of either 1 per cent or 5 per cent as grounds for rejecting the null hypothesis.
 The way levels of significance are expressed is to state that the probability of a result being due to random variability is less than 1 per cent or less than 5 per cent. That is why in articles in psychological journals you will see statements that differences between experimental conditions are ‘significant (p < 0.01)’ or ‘significant (p < 0.05)’. This means that the probability (p) of a result occurring by chance is less than (expressed as <) 1 per cent (0.01) or 5 per cent (0.05).
 Sometimes you will find other probabilities quoted, such as p < 0.02 or p < 0.001. These represent probabilities of obtaining a random result 2 times in 100 and 1 time in 1000 (2 per cent and 0.1 per cent). These percentage probabilities give you grounds for rejecting the null hypothesis that your results are due to the effects of random variability.
6.7 T-test
A t-test examines differences in the mean scores of a parametric dependent variable across two groups or conditions (the independent variable). As we saw in Chapter 5, data are parametric if they are represented by interval values and are reasonably normally distributed. The t-test outcome is based on differences in mean scores between groups and conditions.
6.7.1 One Sample T-test
The one-sample t-test is used to compare the observed mean of a single sample with a population mean. One-sample t-tests are usually employed by researchers who want to determine if some set of scores or observations deviates from some established pattern or standard.
Some situations where a one-sample t-test can be used are given below:
 An economist wants to know if the per capita income of a particular region is the same as the national average.
 The Quality Control department wants to know if the mean dimensions of a particular product have shifted significantly away from the original specifications.
 Does the academic achievement of ECCE department students deviate significantly from the academic achievement of Woldia University students?
Computing One Sample Test
Steps for test statistic in one sample t-test

Students 1 2 3 4 5 6 7 8 9
Test 8 7 5 6 8 7 8 6 6
Population mean ( μ) = 5
Step 1 State the null and alternative hypotheses.
H0: x̄ = μ (the sample mean is equal to the population mean – no difference between the sample mean and the population mean)
H1: x̄ ≠ μ (the sample mean is different from the population mean)
Step 2 Specify the level of significance: α = 0.05
Step 3 Determine the degrees of freedom: df = N − 1 = 9 − 1 = 8
Step 4 Determine the critical value: from the t table, 2.30
Step 5 Determine the rejection region: all values of |t| > 2.30
Step 6 Find the test statistic:

t = (x̄ − μ)/(s/√n) = (6.77 − 5)/(1.09/√9) = 1.77/0.364 = 4.86
Step 7 Make a decision to reject or fail to reject H0
 The calculated t-value is 4.86 > the critical value 2.30 at 0.05 significance level. Then,
H0 is rejected.
Step 8 interpret the result
Table 2: One-sample t-test result for statistics Test I

Variable    Mean   SD     N   df   t-value   Critical value   p-value
Stat Test   6.77   1.09   9   8    4.86*     2.30             0.001
Population mean = 5
*p < 0.05
This shows that there is a significant difference between the sample mean and the population
mean scores t (8) = 4.86, p < 0.05. This also implies that the sample mean score of stat test (M =
6.77) is significantly higher than the population mean score (M = 5) for students.
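The computation can be sketched in Python. Note that the notes round the intermediate mean and SD before dividing and report t = 4.86; computing without intermediate rounding gives about 4.88, the same conclusion either way:

```python
import math
import statistics

scores = [8, 7, 5, 6, 8, 7, 8, 6, 6]
mu = 5                                   # hypothesized population mean

n = len(scores)
mean = statistics.mean(scores)           # about 6.78
sd = statistics.stdev(scores)            # about 1.09
t = (mean - mu) / (sd / math.sqrt(n))    # one-sample t statistic
print(round(t, 2))                       # 4.88 (4.86 with rounded intermediates)
```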
6.7.2 Independent Sample T- test
Basic concepts
An independent t-test measures differences between two distinct groups. Those differences might be directly manipulated (e.g. drug treatment group vs. placebo group), they may be naturally occurring (e.g. male vs. female), or they might be beyond the control of the experimenter (e.g. depressed people vs. healthy people). In an independent t-test, mean dependent variable scores are compared between the two groups (the independent variable). For example, we could measure differences in the amount of money spent on clothes between men and women.
The unrelated t-test is based on comparing the means for the two groups doing each condition. This is because there is no basis for comparing differences between related pairs of scores for each participant. Because the unrelated t-test is based on unrelated scores for two conditions, which are independent of each other, another name for it is the independent t-test.
In many real life situations, we cannot determine the exact value of the population mean. We are
only interested in comparing two populations using a random sample from each. Such
experiments, where we are interested in detecting differences between the means of two
independent groups are called independent samples test. Some situations where independent
samples t-test can be used are given below:
 An economist wants to compare the per capita income of two different regions.
 A labor union wants to compare the productivity levels of workers for two different
groups.
 An aspiring MBA student wants to compare the salaries offered to the graduates of two
business schools.
In all the above examples, the purpose is to compare between two independent groups in contrast
to determining if the mean of the group exceeds a specific value as in the case of one sample t-
tests.
Assumptions
 The independent variable must be categorical
- It must consist of two distinct groups
- Group membership must be independent and exclusive
- No person (or case) can appear in more than one group
 There must be one parametric dependent variable
- The dependent variable data must be interval or ratio
- And should be reasonably normally distributed (across both groups)
 We should check for homogeneity of variances
 If these assumptions are not met the non-parametric Mann–Whitney U test could be
considered
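If the homogeneity-of-variances assumption needs checking, one common option is Levene's test. A sketch, assuming SciPy is available (the scores are the male/female data used in the worked example below):

```python
from scipy import stats

# Statistics test scores for the two groups in the worked example.
group_a = [4, 6, 5, 7, 8, 4, 3, 2, 4, 5]
group_b = [8, 9, 6, 7, 8, 10, 8, 9, 7, 10]

stat, p = stats.levene(group_a, group_b)
# A non-significant result (p > 0.05) suggests the equal-variance
# assumption is tenable, so the pooled-variance t-test may be used.
print(p > 0.05)
```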
Computing the independent-samples t-test. Example: gender differences in statistics test results

M. students    1    2    3    4    5    6    7    8    9    10    Σ
Scores (X1)    4    6    5    7    8    4    3    2    4    5     48
X1²            16   36   25   49   64   16   9    4    16   25    260

F. students    11   12   13   14   15   16   17   18   19   20    Σ
Scores (X2)    8    9    6    7    8    10   8    9    7    10    82
X2²            64   81   36   49   64   100  64   81   49   100   688
Steps for the test statistic in the independent-samples t-test
Step 1 State the null and alternative hypotheses.

H0: μ1 = μ2 (the two population means are the same)

H1: μ1 ≠ μ2 (the two population means differ from each other)
Step 2 Specify the level of significance = 0.05
Step 3 Determine the degrees of freedom = N - 2 = 20 - 2 = 18
Step 4 Determine the critical value from the table = 2.10
Step 5 Determine the rejection region – all values of |t| > 2.10
Step 6 Find the test statistic
Step 7 Make a decision to reject or fail to reject H0
Step 8 Interpret the result
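Step 6 can also be carried out in software. A sketch using SciPy's pooled-variance independent t-test on the scores from the table above (`equal_var=True` gives the classic pooled test):

```python
from scipy import stats

male   = [4, 6, 5, 7, 8, 4, 3, 2, 4, 5]    # Σ = 48
female = [8, 9, 6, 7, 8, 10, 8, 9, 7, 10]  # Σ = 82

# Pooled-variance (classic) independent-samples t-test.
t, p = stats.ttest_ind(male, female, equal_var=True)
print(round(t, 2))  # -4.8: |t| = 4.8 > 2.10, so H0 is rejected
```

The sign is negative only because the male group is listed first; the magnitude exceeds the critical value 2.10, so the two means differ significantly.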

6.7.3 Dependent (Paired) Samples t-test


In the independent-samples test for the difference between means, we assume that the observations in one sample do not depend on those in the other. However, this assumption limits the scope of analysis, as in many cases the study has to be done on the same set of elements (people, objects, etc.) to control some sample-specific extraneous factors. Experiments where observations are made on the same sample at two different times call for the dependent or paired-samples t-test. Some situations where the dependent-samples t-test can be used are given below:
 The HR manager wants to know if a particular training program had any impact
in increasing the motivation level of the employees.
 The production manager wants to know if a new method of handling machines helps in reducing the breakdown period.
 An educationist wants to know if interactive teaching helps students learn more
as compared to one-way lecturing.
One can compare these cases with the previous ones to observe the difference. The subjects in all
these cases are the same and observations are taken at two different times
Computing the paired t-test
Steps for the test statistic in the paired t-test
For example: students' scores in Maths and Civics

Student   1   2   3   4   5   6   7   8   9   10   Σ
Maths     4   3   3   3   4   5   4   3   5   4    38
Civics    1   2   2   3   3   2   2   4   1   1    21
d         3   1   1   0   1   3   2   -1  4   3    17
d²        9   1   1   0   1   9   4   1   16  9    51

Step 1 State the null and alternative hypotheses.

H0: μ1 = μ2 (mean1 is equal to mean2 – no difference between the two means)

H1: μ1 ≠ μ2 (mean1 is different from mean2)
Step 2 Specify the level of significance = 0.05
Step 3 Determine the degrees of freedom = N – 1 = 10 – 1 = 9
Step 4 Determine the critical value from the table = 2.26 (α = 0.05)
Step 5 Determine the rejection region – all values of |t| > 2.26
Step 6 Find the test statistic

t = Σd / √[(nΣd² − (Σd)²)/(n − 1)]
  = 17 / √[(10 × 51 − (17)²)/(10 − 1)]
  = 17 / √(221/9)
  = 17 / √24.56
  = 3.43
Step 7 Make a decision to reject or fail to reject H0
 The calculated t-value, 3.43, is greater than the critical value, 2.26, at the 0.05 significance level. Therefore, H0 is rejected.
Step 8 Interpret the result
Table 1: Paired t-test results for students' test scores

Variable      Mean   SD      df   t-value   Critical value   p-value
Mathematics   3.8    0.7888  9    3.43*     2.26             0.008
Civics        2.1    0.9944

N = 10
*p < 0.05
This shows that there is a significant difference between the Mathematics and Civics mean scores, t(9) = 3.43, p < 0.05. This also implies that the mean Mathematics score (M = 3.8) is significantly higher than the mean Civics score (M = 2.1) for these students.
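The same paired t-test can be verified in software. A sketch using SciPy on the Maths and Civics scores above:

```python
from scipy import stats

maths = [4, 3, 3, 3, 4, 5, 4, 3, 5, 4]
civic = [1, 2, 2, 3, 3, 2, 2, 4, 1, 1]

# Paired (related-samples) t-test on the difference scores.
t, p = stats.ttest_rel(maths, civic)
print(round(t, 2))  # 3.43, matching the hand computation
```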
6.8 Analysis of Variance (ANOVA)
The analysis of variance (ANOVA) currently enjoys the status of being probably the most used statistical technique in psychological research, integrating with other analyses such as regression, multivariate analysis of variance, and analysis of covariance. Analysis of variance is closely related to the t-test in comparing means in psychological research. The popularity and usefulness of this technique can be attributed to two facts. First, the analysis of variance, like t, deals with differences between sample means; but unlike t, it has no restriction on the number of means. Instead of asking merely whether two means differ, we can ask whether two, three, four, five, or k means differ. Second, the analysis of variance allows us to deal with two or more independent variables simultaneously, asking not only about the individual effects of each variable separately but also about the interacting effects of two or more variables (Pagano, 2009).
Based on the number of independent variables included in the research, there are different forms of the analysis of variance, such as one-way, two-way, three-way, and so on. On the other hand, considering the design, the nature of the dependent variable, and the hypothesis to be tested, scholars categorize analysis of variance into between-participants designs, repeated-measures designs, and mixed designs. One-way analysis of variance uses one independent variable having three or more levels with one dependent variable (Howitt & Cramer, 2011).
As a parametric test, analysis of variance tests the null hypothesis for one continuously measured dependent variable with one or more categorical independent variables. The independent variables are expected to have different levels that organize the scores obtained from the data-gathering tools. Stating the null and alternative hypotheses in symbols and in words, and then calculating the F-ratio according to the steps, are the central activities in analysis of variance. If the F-ratio shows significant differences across the means, post hoc analysis can be done in order to know which mean is significantly different from the others. At the same time, calculating the effect size of the independent variable on the dependent variable using statistical techniques such as omega squared and eta squared is still important (Dancey & Reidy, 2011).

Logic of Analysis of Variance


Analysis of variance (ANOVA) is a method of testing the equality of three or more population means by analyzing sample variances. The logic for preferring one-way analysis of variance to multiple t-tests is that:
 it saves time, cost, and effort
 it increases the power of the test to detect a real effect of the independent variable.
 Like t test, analysis of variance deals with differences between two sample means, but
unlike t test, it has no restriction on the number of means. Instead, we can ask whether
two, three, four, five, or k means differ.
 Analysis of variance allows us to deal with two or more independent variables
simultaneously, asking not only about the individual effects of each variable separately
but also about the interacting effects of two or more variables (Pagano, 2009).
One Way Analysis of Variance
 One-way analysis of variance is a hypothesis-testing technique that is used to compare the mean scores of three or more populations.
 One independent variable with its different levels is being studied; that is why it is called one-way analysis of variance (Larson and Farber, 2012).
Assumptions of analysis of variance (ANOVA)
According to Howell (2011), the assumptions that underlie the analysis of variance (ANOVA) using the F statistic are organized below.
A. The Assumption of Normality
For reasons dealing with our final test of significance, we will make the assumption that scores
in each population should be normally distributed around the population mean. We made the
same assumption for t- test. Moreover, even substantial departures from normality may, under
certain conditions, have remarkably little influence on the final result.
B. The Assumption of Homogeneity of Variance
 A second major assumption that we will make is that each population of scores has the same variance; specifically, σ1² = σ2² = σ3² = σ4².
 Homogeneity of variance would be expected to occur if the effect of a treatment is to add
a constant to everyone’s score. Under certain conditions this assumption also can be
relaxed without doing too much damage to the final result.
 In other words, the analysis of variance is robust with respect to violations of the
assumptions of normality and homogeneity of variance.
C. The Assumption of Independence of Observations
Our third important assumption is that the observations are all independent of one another. For
any two observations in an experimental treatment, we assume that knowing how one
observation stands relative to the treatment (or population) mean tells us nothing about the other
observation. This assumption is one of the important reasons why participants are usually
randomly assigned to groups. Violation of the independence assumption can have serious
consequences for an analysis.
D. The samples are simple random samples from the populations.

[Figure: four populations, with Sample 1 drawn from Population 1, Sample 2 from Population 2, Sample 3 from Population 3, and Sample 4 from Population 4.]

E. The different samples are from populations that are categorized in only one way
The samples are expected to come from one independent variable organized as levels. In other words, the number of samples does not indicate the number of independent variables.
3.2 Sources of variance
Analysis of variance (ANOVA), as the name suggests, analyses the different sources from which
variation in the scores arises.
Between-groups variance
ANOVA looks for differences between the means of the groups. When the means are very
different, we say that there is a greater degree of variation between the conditions. If there were
no differences between the means of the groups, then there would be no variation. This sort of
variation is called between-groups variation (Dancey&Reidey, 2011).
Between-groups variation arises from:
Treatment effects: When we perform an experiment, or study, we are looking to see that the
differences between means are big enough to be important to us, and that the differences
reflect our experimental manipulation. The differences that reflect the experimental
manipulation are called the treatment effects
Individual differences: Each participant is different, therefore participants will respond
differently, even when faced with the same task. Although we might allot participants
randomly to different conditions, sometimes we might find, say, that there are more
motivated participants in one condition, or they are more practiced at that particular task.
Experimental error: Most experiments are not perfect. Sometimes experimenters fail to
give all participants the same instructions; sometimes the conditions under which the tasks
are performed are different, for each condition. At other times, equipment used in the
experiment might fail, etc. Differences due to errors such as these contribute to the
variability.
Within-groups variance
Another source of variance is the differences or variation within a group. This can be thought of
as variation within the columns.
Within-groups variation arises from:
Individual differences: In each condition, even though participants have been given the
same task, they will still differ in scores. This is because participants differ among
themselves in abilities, knowledge, IQ, personality and so on. Each group, or condition, is
bound to show variability.
Experimental error: This has been explained above
Steps for test statistic in One-Way ANOVA
Step 1 State the null and alternative hypotheses.

H0: μ1 = μ2 = μ3 (All population means are equal.)
Ha: At least one mean is different from the others
Step 2 Specify the level of significance (e.g. 0.05, 0.01, 0.1)
Step 3 Determine the degrees of freedom: between groups = K - 1; within groups = N - K
Step 4 Determine the critical value from the F table
Step 5 Determine the rejection region
Step 6 Find the test statistic
Step 7 Make a decision to reject or fail to reject H0
Step 8 Interpret the result
Example 1: A researcher wanted to test the effect of study skills support on the academic achievement scores of students at Debre Markos University. He took 15 students who needed study skills support and assigned them randomly to three groups: placebo, low support, and high support. The level of significance for this hypothesis test is 0.05. The data collected from the students are presented in the following table.
Placebo Low support High support
2 10 10
3 8 13
7 7 14
2 5 13
6 10 15

From the above data:

Mean1 = 4      Mean2 = 8      Mean3 = 13
Σx1 = 20       Σx2 = 40       Σx3 = 65
n1 = 5         n2 = 5         n3 = 5
Σxi = 125      Σx² = 1299     N = 15

Solution:
Step 1: State the null and alternative hypotheses

H0: μ1 = μ2 = μ3 (all population means are equal)

Ha: at least one mean is different from the others
Step 2: Specify the level of significance
The significance level is α = 0.05
Step 3 Determine the degrees of freedom
Degree of freedom for between groups = (K – 1) = 3 – 1 = 2 (K is number of groups)
Degree of freedom for within groups = (N – K) = 15 – 3 = 12 (K is number of groups)
Degree of freedom for total = (N – 1) = 15 – 1 = 14 (N is all participants in the research)
Step 4 Determine the critical value from F distribution.
To find F critical value we use F (2, 12) = 3.89
Step 5 Determine the rejection region
In the F distribution, the rejection region is all values greater than 3.89. In other words, if the calculated F is greater than 3.89, reject the null hypothesis because it falls in the rejection region; if the calculated F is less than 3.89, fail to reject the null hypothesis.
Step 6 Find the test statistic
Calculate the between-groups sum of squares (SSB)
SSB = (Σx1)²/n1 + (Σx2)²/n2 + (Σx3)²/n3 − (Σxi)²/N
SSB = (20)²/5 + (40)²/5 + (65)²/5 − (125)²/15
SSB = (80 + 320 + 845) − 1041.667
SSB = 1245 − 1041.667 = 203.333
Calculate the within-groups sum of squares (SSW)
SSW = Σx² − [(Σx1)²/n1 + (Σx2)²/n2 + (Σx3)²/n3]
SSW = 1299 − [(20)²/5 + (40)²/5 + (65)²/5]
SSW = 1299 − (80 + 320 + 845)
SSW = 1299 − 1245 = 54
Calculate the total sum of squares (SST)
SST = SSB + SSW = 203.333 + 54 = 257.333
Calculate the between-groups mean square (MSB)
MSB = SSB/dfB = 203.333/2 = 101.667
Calculate the within-groups mean square (MSW)
MSW = SSW/dfW = 54/12 = 4.5
Calculate the F-ratio
F = MSB/MSW = 101.667/4.5 = 22.59
Step 7 Make a decision to reject or fail to reject H0
F critical (2, 12) = 3.89
F calculated = 22.59
ANOVA summary table for the study skills support given to students

Sources of variation   df   Sum of squares   Mean squares   F
Between groups         2    203.333          101.667        22.59
Within groups          12   54               4.5
Total                  14   257.333

Since F calculated = 22.59 > F critical (2, 12) = 3.89, the test statistic falls in the rejection region, and you should reject the null hypothesis.
Interpretation: There is enough evidence at the 5% level of significance to conclude that study skills support has a significant effect on the mean academic achievement scores of students.
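The whole F-test can be verified in software. A sketch using SciPy on the three groups above:

```python
from scipy import stats

placebo = [2, 3, 7, 2, 6]
low     = [10, 8, 7, 5, 10]
high    = [10, 13, 14, 13, 15]

# One-way ANOVA across the three support conditions.
f, p = stats.f_oneway(placebo, low, high)
print(round(f, 2))  # 22.59 > F-critical 3.89, so H0 is rejected
```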
3.3 Post hoc Analysis
Post hoc analysis is a family of multiple-comparison techniques for comparing two or more group means subsequent to an analysis of variance. Since there is enough evidence at the 5% level of significance to conclude that the mean academic achievement scores of students differ, post hoc analysis can tell us which mean is significantly different from the others. Post hoc methods differ in their power and in how well they control the Type I error rate.
Let's use the Tukey test for the post hoc analysis of the example given above. When the Tukey test is used, the critical value is found from the Q (studentized range) distribution. The multiple comparisons through the Tukey test then take four steps, as follows.
Step 1: Find Q-calculated by comparing two means at a time, using

Q = (larger mean − smaller mean) / √(MSW/n)

1. Placebo with low study skills support (mean1 with mean2)
Q-cal = (8 − 4) / √(4.5/5) = 4 / 0.949 = 4.22*

2. Placebo with high study skills support (mean1 with mean3)
Q-cal = (13 − 4) / √(4.5/5) = 9 / 0.949 = 9.49*

3. Low study skills support with high study skills support (mean2 with mean3)
Q-cal = (13 − 8) / √(4.5/5) = 5 / 0.949 = 5.27*

Step 2: Find Q-critical from the Q-distribution using (r, df), where r is the number of groups: Q(3, 12) = 3.77
Step 3: Make a decision based on the three mean comparisons.
For mean1 and mean2: Q-cal > Q-crit, or 4.22 > 3.77, so reject the null hypothesis
For mean1 and mean3: Q-cal > Q-crit, or 9.49 > 3.77, so reject the null hypothesis
For mean2 and mean3: Q-cal > Q-crit, or 5.27 > 3.77, so reject the null hypothesis
Step 4: Interpretation
There is enough evidence at the 5% level of significance to conclude that all the group means of students' academic achievement scores differ significantly from each other across the study skills support conditions.
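The Tukey comparisons above can be sketched in a few lines, computing Q = |mean difference| / √(MSW/n) for each pair (values from the ANOVA example; the critical Q comes from a studentized-range table):

```python
import math

# Group means, MSW, and per-group n from the ANOVA worked example.
means = {"placebo": 4, "low": 8, "high": 13}
msw, n = 4.5, 5
se = math.sqrt(msw / n)  # denominator of the studentized-range statistic

q_crit = 3.77  # Q(r = 3 groups, df = 12) at the .05 level
for g1, g2 in [("placebo", "low"), ("placebo", "high"), ("low", "high")]:
    q = abs(means[g2] - means[g1]) / se
    print(g1, "vs", g2, round(q, 2),
          "reject H0" if q > q_crit else "fail to reject H0")
```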

LINEAR AND MULTIPLE REGRESSION ANALYSIS

Introduction

Regression analysis is a statistical technique that is widely used in research. Regression analysis is used to predict the behavior of the dependent variable based on a set of independent variables. In regression analysis, the dependent variable can be metric or non-metric, and the independent variables can be metric, categorical, or a combination of both. Researchers use regression analysis in two broad forms: linear regression analysis and non-linear regression analysis. Linear regression analysis is further divided into two types, simple linear regression analysis and multiple linear regression analysis. In simple linear regression analysis, there is one dependent variable and one independent variable. In multiple linear regression analysis, there is one dependent variable and many independent variables. Non-linear regression analysis is also of two types, simple non-linear regression analysis and multiple non-linear regression analysis. When there is a non-linear relationship between the dependent and independent variables, and there is one dependent and one independent variable, it is said to be simple non-linear regression analysis. When there is one dependent variable and two or more independent variables, it is said to be multiple non-linear regression.

Learning outcomes
Upon completing this topic, the students will be able to:
 Describe basic concepts of regression
 Appropriately use regression principles in different research fields
 Apply regression models in research design
 Perform regression analysis and interpret the results

Key Terms: Regression, Intercept, Slope, Curve fit, Polynomial, Best fit line

3.1. Linear regression

Linear regression is the most basic and commonly used predictive analysis. Regression estimates are used
to describe data and to explain the relationship between one dependent variable and one or more
independent variables.

At the center of the regression analysis is the task of fitting a single line through a scatter plot. The
simplest form with one dependent and one independent variable is defined by the formula y = a + b*x.

Sometimes the dependent variable is also called endogenous variable, prognostic variable or regressand.
The independent variables are also called exogenous variables, predictor variables or regressors.

However, linear regression analysis consists of more than just fitting a straight line through a cloud of data points. It consists of three stages: (1) analyzing the correlation and directionality of the data, (2) estimating the model, i.e., fitting the line, and (3) evaluating the validity and usefulness of the model.

Uses of Linear Regression Analysis

1) Linear regression might be used to identify the strength of the effect that the independent variable(s) have on a dependent variable. Typical questions concern the strength of the relationship between dose and effect, sales and marketing spend, or age and income.

2) It can be used to forecast the effects or impacts of changes. That is, regression analysis helps us understand how much the dependent variable will change when we change one or more independent variables. A typical question is how much additional Y we get for one additional unit of X.

3) Regression analysis predicts trends and future values, and can be used to get point estimates. Typical questions are: what will the price of gold be six months from now? What is the total effort for a task X?

Assumptions:

1. There is normal distribution.


2. There is a linear relationship between the dependent and independent variable.
3. There is no multicollinearity between the independent variables, i.e. no exact correlation among the independent variables.
4. There is no autocorrelation.
5. The lagged value of the regression variable does not affect the current value.
6. There is homoscedasticity: the variance of the residuals is equal across all levels of the independent variables.
Simple linear regression is a measure of linear association that investigates straight-line relationships between a continuous dependent variable and an independent variable. It is best explained through the regression equation.

3.2. The Regression Equation

The Regression Equation (Y = α + βX)

Y = the continuous dependent variable
X = the independent variable (can be a categorical dummy variable)
α = the Y intercept (where the regression line intercepts the Y axis)
β = the slope coefficient (rise over run)
Parameter Estimate Choices
β is the estimated coefficient of the strength and direction of the relationship between the independent variable (IV) and the dependent variable (DV).
α (the Y intercept) is a fixed point that is considered a constant (how much Y can exist without X).
Standardized Regression Coefficient (β)
The estimated coefficient of the strength of the relationship between the IV and the DV, expressed on a standardized scale where higher absolute values indicate stronger relationships (the scale ranges from -1 to 1).
Parameter Estimate Choices
Raw regression estimates (b1)
Raw regression weights have the advantage of retaining the scale metric—which is
also their key disadvantage.
If the purpose of the regression analysis is forecasting, then raw parameter estimates
must be used. The researcher is interested only in prediction.
Standardized regression estimates (β1)
Standardized regression estimates have the advantage of a constant scale.
Standardized regression estimates should be used when the researcher is testing
explanatory hypotheses
3.3. Predictive Methods

With the exception of the mean and standard deviation, linear regression is possibly the most widely used of statistical techniques. This is because many of the problems that we encounter in research settings require that we quantitatively evaluate the relationship between two variables for predictive purposes.
By predictive, I mean that the values of one variable depend on the values of a second. We might be
interested in calibrating an instrument such as a sprayer pump. We can easily measure the current or
voltage that the pump draws, but specifically want to know how much fluid it pumps at a given
operating level. Or we may want to empirically determine the production rate of a chemical product
given specified levels of reactants.

Linear regression, which is the natural extension of correlation analysis, provides a great starting
point toward these objectives.

Terms for predictive analysis:

Curve fit - This is perhaps the most general term for describing a predictive relationship between two
variables, because the "curve" that describes the two variables is of unspecified form.
Polynomial fit - A polynomial fit describes the relationship between two variables as a mathematical series. Thus a first order polynomial fit (a linear regression) is defined as y = a + bx. A second order (parabolic) fit is y = a + bx + cx^2, a third order (cubic) fit is y = a + bx + cx^2 + dx^3, and so on.
Best fit line - The equation that best describes the y, or dependent, variable as a function of the x, or independent, variable.
Linear regression and least squares linear regression - This is the method of interest. The
objective of linear regression analysis is to find the line that minimizes the sum of squared deviations
of the dependent variable about the "best fit" line. Because the method is based on least squares, it is
said to be a BLUE method, a Best Linear Unbiased Estimator.

6.1.2. Defining the Regression Model


We've already stated that the general form of the generalized linear regression is: y= a + bx. The
coefficient "a" is a constant called the y-intercept of the regression. The coefficient "b" is called the
"slope" of the regression. It describes the amount of change in y that corresponds to a given change in
x.

The slope of the linear regression can be calculated in a number of ways:

b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²  =  [Σxy − (ΣxΣy)/n] / [Σx² − (Σx)²/n]

Specifically, the slope is defined as the summed cross product of the deviations of x and y from their respective means, divided by the sum of squared deviations of x from its mean. The second form above is useful if these quantities have to be calculated by hand. The standard error values of the slope and intercept are mainly used to compute the 95% confidence intervals. If you accept the assumptions of linear regression, there is a 95% chance that the 95% confidence interval of the slope contains the true value of the slope, and that the 95% confidence interval for the intercept contains the true value of the intercept.

It is interesting to note that the slope in the generalized case is equal to the linear correlation coefficient scaled by the ratio of the standard deviations of y and x:

b = r (sy / sx)

This explicitly defines the relationship between linear correlation analysis and linear regression. Notice that in the case of standardized regression, where sy = sx = 1, the slope equals r. From this definition, it should also be clear that the best-fit line passes through the mean values of x and y.
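The equivalence of the two slope definitions can be checked numerically. A sketch with made-up illustrative data (NumPy assumed available):

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

xd, yd = x - x.mean(), y - y.mean()
b_cross = (xd * yd).sum() / (xd ** 2).sum()   # cross-product definition of the slope

r = np.corrcoef(x, y)[0, 1]
b_scaled = r * y.std(ddof=1) / x.std(ddof=1)  # r scaled by the ratio of standard deviations

a = y.mean() - b_cross * x.mean()             # line passes through (x-bar, y-bar)
print(np.isclose(b_cross, b_scaled))  # True
```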

There are several assumptions that must be met for the linear regression to be valid:

• The random variables must both be normally distributed (bivariate normal)


and linearly related.
• The x values (independent variable) must be free of error.
• The variance of y (the dependent variable) as a function of x must be
constant. This is referred to as homoscedasticity.

6.1.3. Evaluating the Model Fit

The scatter of the y values about the y estimates (denoted ŷ, or "y-hat") based on the best-fit line is often referred to as the "standard error of the regression":

s = √[ Σ(y − ŷ)² / (n − 2) ]

Notice that two degrees of freedom are lost in the denominator: one for the slope and one for the intercept. A more descriptive definition - and strictly correct name - for this statistic is the root mean square error (denoted RMS or RMSE).
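A brief numeric sketch of this statistic, using the same kind of made-up data (note the n − 2 divisor):

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = a + b * x
# Two degrees of freedom are lost: one for the slope, one for the intercept.
rmse = np.sqrt(((y - y_hat) ** 2).sum() / (len(x) - 2))
```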

How much variance is explained?

Just as in linear correlation analysis, we can explicitly calculate the variance explained by the regression model:

r² = Σ(ŷ − ȳ)² / Σ(y − ȳ)²

You should recognize this definition as identical to the one used in correlation analysis. This relationship can also be written in terms of the z-scores of x and y.
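This identity is easy to verify numerically. A sketch with made-up data, comparing the explained-variance ratio to the squared correlation:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# Proportion of variance explained by the regression model.
r_sq = ((y_hat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_sq, r ** 2))  # True
```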

Determining statistical significance

As with the other statistics that we have studied, the slope and intercept are sample statistics based on data that includes some random error, e: y = a + bx + e. We are of course actually interested in the true population parameters, which are defined without error: y = α + βx. How do we assess the significance of the model? In essence we want to test the null hypothesis that b = 0 against one of three possible alternative hypotheses: b > 0, b < 0, or b ≠ 0.

There are at least two ways to determine the significance level of the linear model. Perhaps the easiest
method is to calculate r, and then determine significance based on the value of r and the degrees of
freedom using a table for significance of the linear or product moment correlation coefficient. This
method is particularly useful in the standardized regression case when b=r.

The significance level of b, can also be determined by calculating a confidence interval for the slope. Just
as we did in earlier hypothesis testing examples, we determine a critical t-value based on the correct
number of degrees of freedom and the desired level of significance. It is for this reason that the random
variables x and y must be bivariate normal.

For the linear regression model the appropriate degrees of freedom is always df = n − 2. The level of significance of the regression model is determined by the user; the 95% or 99% levels are generally used. The standard error values of the slope and intercept can be hard to interpret, but their main purpose is to compute the 95% confidence intervals. If you accept the assumptions of linear regression, there is a 95% chance that the 95% confidence interval of the slope contains the true value of the slope, and that the 95% confidence interval for the intercept contains the true value of the intercept. The confidence interval is then defined as the product of the critical t-value and Sb, the standard error of the slope:

CI for the slope: b ± t(crit) · Sb

where Sb is defined as:

Sb = s / √[ Σ(x − x̄)² ]

with s the standard error of the regression (RMSE) defined above.
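A sketch of this confidence-interval computation (SciPy's `linregress` reports the slope's standard error as `stderr`; the data are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

res = stats.linregress(x, y)                 # res.stderr is Sb, the slope's standard error
t_crit = stats.t.ppf(0.975, df=len(x) - 2)   # two-tailed critical t at the 95% level
ci = (res.slope - t_crit * res.stderr,
      res.slope + t_crit * res.stderr)
# If the interval excludes zero, reject H0: b = 0 at the .05 level.
print(ci[0] > 0)
```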

Interpretation.

If there is a significant slope, then b will be statistically different from zero. So if b is greater than (t-crit)·Sb, the confidence interval does not include zero. We would thus reject the null hypothesis that b = 0 at the pre-determined significance level. As (t-crit)·Sb becomes smaller, the greater our certainty in the slope estimate, and the more accurate the prediction of the model. If we plot the confidence interval on the slope, the positive and negative limits of the confidence interval plot as lines that intersect at the point defined by the mean x, y pair for the data set. In effect, this tends to underestimate the error associated with the regression equation, because it neglects the role of the intercept in controlling the position of the line in the Cartesian plane defined by the data. Fortunately, we can take this into account by calculating a confidence interval on the line.

3.4. Confidence Interval for the regression line

Just as we did in the case of the confidence interval on the slope, we can write out a confidence interval for the regression line. The degrees of freedom are still df = n − 2, but now the standard error of the regression line at a given point x₀ is defined as:

s(ŷ₀) = s · √[ 1/n + (x₀ − x̄)² / Σ(x − x̄)² ]

Because values that are further from the means of x and y have less probability and thus greater uncertainty, this confidence interval is narrowest near the joint x and y mean (the centroid, or center, of the data distribution), and flares out at points further from the centroid. While the confidence interval is curvilinear, the model itself is linear.
