BRM - Data Analysis, Interpretation and Reporting Part II

The document discusses data analysis methods including descriptive and inferential analysis. Descriptive analysis includes frequency distributions, measures of central tendency, and measures of dispersion. Inferential analysis involves hypothesis testing. Both quantitative and qualitative data analysis techniques are covered.


Business Research Methods

Alemseged Gerezgiher
(BSc, MBA, PhD)

05/07/24 1
Part VI (Sub-part II)
Data Analysis, Interpretation
and Reporting

Chapter Six: Data Analysis, Interpretation and
Reporting

Data Management and Support Software

Descriptive Analysis

Inferential Analysis

Hypothesis Testing

Interpretation, scientific writing and reporting

Data Analysis: Introduction
Once the data is ready for processing, the next step is to choose an appropriate analysis method and conduct the analysis.
Data analysis depends on the nature of the variable, the type
of data and the purpose of the analysis. The following issues
will affect the data analysis part of your research endeavor.
 The type of data you have gathered, (i.e.
Nominal/Ordinal/Interval/Ratio)
 Are the data paired such as before and after treatment?
 Are they parametric or non-parametric?
 Ranks, scores, or categories are generally non-parametric data.
 Measurements that come from a population that is normally

distributed can usually be treated as parametric.


 What are you looking for? Differences, correlations, etc.?
Data Analysis: Introduction
Simply put:
 Data analysis is the process of making meaning from the
data
 It involves processing the data into meaningful information
Broadly classified, data analysis involves:
 Quantitative analysis
 Qualitative analysis

Analyzing qualitative Data
• There is considerable amount of interview, focus group
discussion and/or text-based data and images that require
analysis.
• Creswell (2003) suggests that it is useful to look at the
codes that have emerged according to:
 Codes readers would expect to find;
 Codes that are surprising; and
 Codes that address a larger theoretical perspective in their research.
Then, follow these steps:
 Identifying themes
 Coding data (reducing data to manageable size)
 Developing a description from the data
 Defining themes from the data
 Connecting and interrelating themes
Analyzing qualitative…
Further activities
 Noting reflections in the margins
 Sorting and sifting through the materials to identify similar
phrases, relationships, patterns, themes, commonalities, &
differences
 Isolating patterns, processes, commonalities, & differences
and incorporating methods to further explore them into the
next wave of data collection
 Gradually developing a small set of generalizations about
what consistently appears in the data
 Confronting those generalizations with a formalized body of
knowledge in the form of constructs or theories

Quantitative analysis
Quantitative analysis uses numeric representations and manipulations of the collected data.
The analysis could take descriptive or
inferential form.
Based on number of variables involved,
quantitative analysis could be univariate, bivariate
and/ or multivariate analysis.

Quantitative analysis
 Descriptive vs Inferential analysis:
 Descriptive analysis: refers to statistically describing,
aggregating, and presenting the constructs of interest or
associations between these constructs.
 Inferential analysis: refers to the statistical estimation of
parameter values and testing of hypotheses (theory testing).

 With respect to the number of variables:


 Univariate analysis: only one variable is analyzed
 Bivariate analysis: two variables are analyzed
 Multivariate analysis: more than two variables are
included in the analysis process
 It also varies with the four scales of measurement
Reliability Analysis/Test (SPSS)
It helps measure consistency of an instrument.
Internal consistency is the most commonly used measure
of reliability
Factors that increase reliability
 Number of items
 High variation among individuals being tested
 Clear instructions
 Optimal testing situation
Analyze → Scale → Reliability Analysis → select items → Statistics → choose statistical tests → Continue → choose from Alpha list → OK
The Normal Distribution Assumption
The normal distribution is a symmetric distribution in which cases cluster evenly around the mean. It is the most useful distribution in statistics and has the following important properties:
1. Symmetry and bell-shaped
2. Mode, median, and mean coincide
3. As a corollary to (1), a fixed proportion of
observations lies between the mean and fixed units of
standard deviation.

Normal distribution…
Z-Score (Standard Normal Curve) – is a normal
curve with mean = 0 and standard deviation,
S = 1. It is used to compare scores in two or more
distributions that have different means and standard
deviations.
z = (x − x̄) / s, where z is the number of standard deviations x lies from the mean.
If the data is normally distributed, we employ
parametric tests
If the data is categorical or if the assumption of
normality does not hold, we use non-parametric tests
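As a quick sketch, the z formula above can be computed directly; the numbers below are invented for illustration:

```python
# Minimal sketch of the z-score formula: z = (x - mean) / s.
def z_score(x, mean, s):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / s

# Invented example: a score of 85 in a distribution with mean 70 and s = 10
# lies 1.5 standard deviations above the mean.
z = z_score(85, 70, 10)
```

Comparing z-scores puts observations from distributions with different means and standard deviations on a common scale.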
Using histogram to test the normality of the
data

Checking for normality with a Q-Q plot

Analyze → Descriptive Statistics → Explore → Plots → Normality plots with tests

Univariate analysis (Descriptive analysis)
• The following categories of the descriptive analysis are usually
used.
• Frequency distributions
• Measures of central tendency
• Measures of dispersion
• Shape of distribution
1) Frequency distributions (tables, bar graph, pie chart, histogram)
a) Frequency table – a summary table of the values of a variable and the number of times the variable assumes a given value. It has:
• Descriptive title
• Clear labels for columns and rows
• Appropriate categories
• Presentation of frequencies and corresponding percentages
Univariate analysis…
b) Pie charts and bar charts – when data is nominal or ordinal, we use a pie chart or bar chart. However, a pie chart displays only one variable, while bar charts can display more than one.
c) Histogram – histograms are used for interval-level data.
We can also have line graphs to explore the variable(s).

Univariate analysis…
• Example: Frequency table (Leisure time preference)

Preference          Frequency   Percentage   Cumulative %
With friends             9          9.0          9.0
Sport activities        30         30.0         39.0
With family             40         40.0         79.0
Reading                 21         21.0        100.0
Total                  100        100.0
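A frequency table like the one above can be built in a few lines of Python; the responses below are invented for illustration:

```python
from collections import Counter

# Invented survey responses on leisure-time preference.
responses = (["With family"] * 4 + ["Sport activities"] * 3
             + ["Reading"] * 2 + ["With friends"])

counts = Counter(responses)
n = len(responses)

# Print frequency, percentage, and cumulative percentage per category.
cumulative = 0.0
for category, freq in counts.most_common():
    pct = 100.0 * freq / n
    cumulative += pct
    print(f"{category:<18} {freq:>3} {pct:>6.1f} {cumulative:>6.1f}")
```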

Example:
Bar Diagram: Lists the categories and presents the percent or
count of individuals who fall in each category.

Example:
Pie Chart: Lists the categories and presents the percent or
count of individuals who fall in each category.

Example:
Histogram: The overall pattern can be described by its shape, center, and spread. The following age distribution is right-skewed. The center lies between 80 and 100; there are no outliers.

Frequency distributions in SPSS
Frequency tables are found under the 'Analyze' menu (Analyze → Descriptive statistics → Frequencies)
 Then, select variables and move them to the 'Variable(s)' dialog box, choose from the options, display frequency tables, OK
Charts and graphs: two options
 Analyze → Descriptive statistics → Frequencies → Charts
 Graphs → Legacy dialogs → charts/graphs (options)

Frequency distributions in SPSS
Analyze → Descriptive statistics → Frequencies

Univariate analysis…
2) Measures of central tendency
Central tendency is an estimate of the center of
a distribution of values.
There are three major estimates of central
tendency: mean, median, and mode.

Measures of central tendency…
1. Mean
 For a data set, the mean is the sum of the values divided by the number of values. The mean of a set of numbers x1, x2, ..., xn is typically denoted by x̄, pronounced "x bar". This is a type of arithmetic mean. The mean describes the central location of the data; the arithmetic mean is the "standard" average, often simply called the "mean".
 The other name is average
 mainly for interval variables
 very widely used and intuitively appealing

Measures of central tendency…
2. Median
 It is the middle value of the distribution when all items are
arranged in either ascending or descending order in terms of
value
 mid-point value; arrange data from lowest to highest to
identify mid value; if two mid values, take the average
 Med = ((n + 1) / 2)th value
 mean is sensitive to outliers but median is robust

Measures of central tendency…
3. Mode
 It is the value that occurs most frequently in the data set
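The three measures of central tendency can be computed with Python's standard `statistics` module; the sample values are invented:

```python
import statistics

# An invented data set.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum / count = 30 / 6 = 5
median = statistics.median(data)  # middle of sorted values: (3 + 5) / 2 = 4
mode = statistics.mode(data)      # most frequent value: 3
```

Note how the mean is pulled toward the large value 10, while the median is not: the mean is sensitive to outliers, the median is robust.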

3) Measures of dispersion
• It measures the amount of scatter or variation in a dataset
• Or it refers to the way values are spread around the central
tendency, for example, how tightly or how widely are the
values clustered around the mean.
• similar measures of central tendency may come from very
different distributions

Measures of dispersion...
[Figure: two distributions with the same mean but different dispersions]


Measures of dispersion…
Common measures of dispersion include minimum,
maximum, range, variance and standard deviation.
But, the most frequently used in analysis are range
and standard deviation
Range = Maximum value – Minimum value
Range is sensitive to outliers

Measures of dispersion…
Variance:
The variance is used as a measure of how far a set of
numbers are spread out from each other. It is one of
several descriptors of a probability distribution,
describing how far the numbers lie from the mean
(expected value). In particular, the variance is one of
the moments of a distribution.
Var(x) = Σᵢ₌₁ⁿ (xᵢ − x̄)² / n
Measures of dispersion…
Standard deviation:
It is a widely used measurement of variability or diversity used
in statistics and probability theory. It shows how much variation
or “dispersion" there is from the average (mean, or expected
value). A low standard deviation indicates that the data points
tend to be very close to the mean, whereas high standard
deviation indicates that the data are spread out over a large
range of values. The standard deviation of X is given by:
A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.

SD(x) = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / n )
Measures of dispersion
Coefficient of variation (CV):
In probability theory and statistics, the coefficient of
variation (CV) is a normalized measure of dispersion of a
probability distribution. It is also known as unitized risk or
the variation coefficient. The coefficient of variation (CV)
is defined as the ratio of the standard deviation to the
mean :
CV = SD / Mean

Measures of shape of distribution
4) Measures of shape of distribution
 skewness and kurtosis are the commonly used
measures of shape of distribution of a dataset.
Skewness:
It refers to symmetry or asymmetry of the
distribution.
The skewness value can be positive or negative, or
even undefined.

Measures of shape of distribution…
Skewness:
Qualitatively, a negative skew indicates that the tail on the
left side of the probability density function is longer than
the right side and the bulk of the values (possibly
including the median) lie to the right of the mean.
A positive skew indicates that the tail on the right side is
longer than the left side and the bulk of the values lie to
the left of the mean. A zero value indicates that the values
are relatively evenly distributed on both sides of the mean,
typically but not necessarily implying a symmetric
distribution.
Measures of shape of distribution…
The skewness of a random variable X is the third
standardized moment and defined as
SK = Σᵢ₌₁ⁿ (xᵢ − x̄)³ / ((n − 1) S³)
The coefficient of Skewness is a measure for the
degree of symmetry in the variable distribution.

Measures of shape of distribution…
Kurtosis:
It refers to peakedness of the distribution.
It is a measure of the "peakedness" of the probability
distribution of a real-valued random variable.
Higher kurtosis means more of the variance is the result of
infrequent extreme deviation, as opposed to frequent
modestly sized deviations.

KU = Σᵢ₌₁ⁿ (xᵢ − x̄)⁴ / ((n − 1) S⁴)

Measures of shape of distribution…
The coefficient of Kurtosis is a measure for the degree of
peakedness/flatness in the variable distribution.
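The two formulas above share the same shape and can be sketched with one helper (k = 3 gives skewness, k = 4 gives kurtosis); the sample below is invented:

```python
import statistics

def moment_ratio(data, k):
    """Sum of (xi - mean)^k divided by (n - 1) * S^k, as in the slides."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation
    return sum((x - mean) ** k for x in data) / ((n - 1) * s ** k)

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # invented, with a long right tail
skew = moment_ratio(data, 3)  # positive for a right-skewed sample
kurt = moment_ratio(data, 4)
```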

Central tendency, dispersion and shape in SPSS
Analyze → Descriptive statistics → Descriptives → Options (select the statistics of interest)

Bivariate analysis
 How do we analyze relationships between two variables?
 Bivariate analysis is the analysis of two variables to examine if they are correlated or if there are differences between values
 analyzing relationships between two variables.
 Remember co-variation does not always imply
causation

Bivariate analysis
• Examples:
• Do men earn more income than women?
• Does educational level affect attitudes toward
participation in labour union?
• Is income level correlated with life expectancy?
• Is parental educational level correlated with student
performance?
We need to conduct hypothesis testing to arrive at
conclusive results on issues like this.

Hypotheses Testing
The following are the steps in hypothesis testing:
1. state the null hypothesis
2. choose an appropriate statistical test,
3. specify the level of statistical significance (usually 0.1, 0.05 or 0.01), known as the α-level.
4. Decide to accept or to reject the null hypothesis
based on the findings.
We use different tests based on the nature of the dependent
and independent variables and nature of distribution of the
data.
During hypothesis testing, there is a possibility of
committing decision errors. There are two types of errors.
Hypothesis…
"Type I error"
A type one error is a false positive result.
If you use a parametric test on nonparametric data then
this could trick the test into seeing a significant effect
when there isn't one.
Or , it is a situation where we reject the null hypothesis
that is true.
The probability of committing a Type I error is called the significance level (α).
This error requires more attention and is important to avoid.

Hypothesis…
“Type II error”
It occurs when we accept a null hypothesis that is false.
However, this occurs if you use a nonparametric test on
parametric data then this could reduce the chance of
seeing a significant effect when there is one.
A type two error is a missed opportunity, i.e. we have failed to detect a significant effect that truly does exist.
It is the less dangerous of the two.
Summary; Using a parametric test in the wrong context
may lead to a type one error, a false positive.
Using a nonparametric test in the wrong context may lead
to a type two error, a missed opportunity.
Hypothesis…
Reading P-value
It is the basis for deciding whether or not to reject the
null hypothesis.
P-values do not simply provide you with a Yes or No
answer, they provide a sense of the strength of the
evidence against the null hypothesis.
The lower the p-value, the stronger the evidence; when it falls below the chosen level (usually 0.05 or 0.01), the null hypothesis is rejected.
It is the probability that a statistical result as extreme as
the one observed would occur if the null hypothesis were
true.
Hypothesis…
 Parametric tests
 T-test (one sample, independent sample, paired)
 One-way ANOVA
 Repeated ANOVA (for paired data)
 Pearson correlation

 There are many techniques of non-parametric tests


 Chi-square for independence
 Mann-Whitney Test
 Wilcoxon Signed Rank Test
 Kruskal-Wallis Test
 Friedman Test
 Spearman Rank Order Correlation

Hypothesis…
               Nominal                  Ordinal                  Interval/Ratio             Dichotomous
Nominal        Contingency table;       Contingency table;       Z-test; T-test or F-test   Contingency table;
               Chi-square; Cramer's V   Chi-square; Cramer's V   (if DV is interval/ratio)  Chi-square; Cramer's V
Ordinal        Contingency table;       Spearman's rho (ρ)       Spearman's rho (ρ)         Spearman's rho (ρ)
               Chi-square; Cramer's V
Interval/      Z-test; T-test or        Spearman's rho (ρ)       Pearson's r                Spearman's rho (ρ)
Ratio          F-test (if DV)
Dichotomous    Contingency table;       Spearman's rho (ρ)       Spearman's rho (ρ)         Phi (φ)
               Chi-square; Cramer's V
Hypothesis…
Requirement                            Example of Situation                        Test to be Used
Compare to a target                    Is the average age of employees more        One-sample t-test
                                       than 40 years?
Compare two groups                     Do men earn more income than women?         Independent samples t-test
Compare two groups with one            Test scores before and after training       Paired t-test
controlled intervention
Compare more than two groups           Compare amount of income between four       One-way ANOVA (F-test)
                                       categories of educational level
Association between two                Is there an association between gender      Contingency table; Chi-square
categorical variables                  and job grade?
Association between two                Is there an association between             Pearson's r
quantitative variables                 advertising & sales?
Hypothesis…
Contingency Table analysis (Cross-tabulation):
We look for differences among categories (hence
nominal or ordinal level measurement) of the
independent variables. That is, does the IV influence
the DV?
Contingency Table (Cross–tabulation) – a table of
percentage distribution with DV (in rows) and IV (in
columns).
It is a bivariate frequency distribution, showing the number of cases that fall into each possible pairing of the values or categories of the two variables.
Chi-square Test
Chi-square Test (Chi is pronounced "ky", as in 'sky'):
employed to test relationships between two variables
when the data is measured at the nominal or ordinal
level.
The Chi-square test for independence can be used in
situations where you have two categorical variables.
It works with the "simplest" form of data, such as gender or country, or data that has been placed in categories, such as age group.

Chi-square Test
Chi-square can be calculated as follows

χ² = Σ [(observed − expected)² / expected]
If the calculated chi-square is greater than the critical chi-square value obtained from the table, then we conclude there is a relationship (that is, reject the H0).
Remember, as in all hypothesis testing, the Chi-square null hypothesis assumes that there is no relationship between the DV and IV.
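A minimal sketch of the statistic itself; the counts below are invented, and the critical value would still come from a chi-square table or software:

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# A 2x2 table flattened row by row, e.g. gender by yes/no (invented counts):
observed = [10, 20, 30, 40]
# Expected counts under independence: row total * column total / grand total.
# Row totals are 30 and 70, column totals 40 and 60, n = 100.
expected = [30 * 40 / 100, 30 * 60 / 100, 70 * 40 / 100, 70 * 60 / 100]

stat = chi_square(observed, expected)  # compare against the table value
```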

Contingency Table and Chi-square in SPSS
Analyze= Custom Tables = Custom Tables =
Ok= Row and Column= Test Statistics = Tests
of independence (Chi-square) = Ok
Or
Analyze= Descriptive statistics= Crostabs=
choose DV into Rows and IV into Columns=
Statistics= Chi-square= OK

Comparing two groups: T-tests
 A t-test is a statistical hypothesis test. In such a test, the test statistic follows a Student's t-distribution if the null hypothesis is true. The t-statistic was introduced by W.S. Gossett under the pen name "Student".
 It is among the most frequently used procedures for testing whether or not the means of two independent groups could conceivably have come from the same population.
 If you compute means for two samples, they will almost always
differ to some degree. The job of the t-test is to see whether they
differ by chance or whether the difference is real and reliable.
 It is given by: t = x̄ / (s / √n)
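A sketch of the one-sample t statistic, generalized to test against a target mean μ0; the data are invented:

```python
import statistics

def one_sample_t(data, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n))."""
    n = len(data)
    return (statistics.mean(data) - mu0) / (statistics.stdev(data) / n ** 0.5)

ages = [42, 38, 45, 41, 44]  # invented employee ages
t = one_sample_t(ages, 40)   # is the average age above 40?
```

The resulting t value would then be compared against the Student's t distribution with n − 1 degrees of freedom.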

T-test in SPSS
Parametric
Analyze Compare means  One sample Test or
Independent samples test or paired samples test
• Non-parametric
• Analyze  Nonparametric Tests  Related samples or
Independent samples or One sample  Automatically
compare observed data to be hypothesized

Comparing more than two groups: ANOVA
ANOVA (similar to Difference of Means Test) is used
to examine variations among groups (and within
members of a group) with respect to some behavior
and see if the variations are statistically significant.
Groups may be like: male/female; economically
developed/ economically developing; smokers/non-
smokers; dry-lands/wet-lands; religious/non-
religious, High, medium, low; etc.
In ANOVA, the DV has an interval/scale measure, while the IV has a nominal or ordinal measure.
ANOVA test
We use the F-test in ANOVA, given by
F_calculated = (variance between groups) / (variance within groups)
Now, if F_calc > F_table, then reject the H0.

ANOVA in SPSS
Analyze → Compare Means → One-Way ANOVA… (parametric test)
Analyze → Nonparametric Tests, e.g. the Kruskal-Wallis one-way non-parametric ANOVA
Choose Post Hoc… → Post Hoc Tests → choose Tukey

Scatterplots/diagrams: Linearity
Scatter plot/diagram:
values of the two variables plotted on each axis
strong relationships can be identified by scatter
diagrams
Four relationships can be identified
 Positive linear
 Negative linear

 Non-linear (curvilinear)

 No relationship at all

Scatter plot of a positive association
[Scatter plot: income (x-axis, 0–1200) against livestock ownership (y-axis, 0–60), showing a positive association]
Scatter plot of a negative association
[Scatter plot: income (x-axis) against illiteracy rate in % (y-axis), showing a negative association]
Scatter plot of no association
[Scatter plot: income (x-axis) against household size (y-axis), showing no association]
Scatter and line graph
[Four panels: positive linear relationship; relationship not linear; negative linear relationship; no relationship]
Scatter plot in SPSS
Graphs → Legacy Dialogs → Scatter/Dot

Covariance and Correlations
The interest is in the association/relationship between two variables, or whether they vary together.
Example:
Does the income of individuals increase as age increases?
Is the amount of sales associated with advertising expenditure?
Is crime related with socio-economic background?
Is student academic achievement associated with
parent’s educational level?

Covariance
Covariance:
 Covariance between X and Y refers to a measure of how
much two variables change together.
 Covariance indicates how two variables are related. A
positive covariance means the variables are positively
related, while a negative covariance means the variables are
inversely related. The formula for calculating covariance of
sample data is shown below.
Cov(x, y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / n
Correlation Analysis
Correlation:
Is concerned with the relationship/association,
direction and strength of the relationship between
variables.
Correlation coefficients can be calculated to see the
direction and strength of the relationship
Depends on the nature of the variables (parametric vs non-parametric, or numeric vs non-numeric)

r(x, y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² · Σ (yᵢ − ȳ)² )
Correlation...
The most commonly used is Pearson’s correlation coefficient
or Pearson’s r or simply correlation coefficient
Captures linear relationship between variables; non-linear
relationship are not captured
Lies between -1 & 1
 r = 0: no linear relationship
 r=1: perfect positive relationship
 r=-1: perfect negative relationship

Spearman's rho/rank correlation coefficient (ρ): mainly for ordinal variables (non-parametric)
Phi (Φ): correlation between two dichotomous variables
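Both formulas can be sketched in a few lines of Python; the values are invented, and note that r = +1 for a perfectly linear positive relationship:

```python
def covariance(x, y):
    """Population covariance: mean of the deviation cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def pearson_r(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5

income = [200, 400, 600, 800]  # invented
livestock = [10, 20, 30, 40]   # exactly linear in income
r = pearson_r(income, livestock)
```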
Correlations and Covariance in SPSS
Correlation
Analyze → Correlate → Bivariate → Correlation coefficients (choose depending on parametric/nonparametric)
Covariance
Analyze → Correlate → Options → Cross-product deviations and covariances

Regression Analysis
Regression analysis is a set of statistical techniques using
past observations to find (or estimate) the equation that best
summarizes the relationships among key economic
variables.
The method requires that analysts:
(1) collect data on the variables in question,
(2) specify the form of the equation relating the variables,
(3) estimate the equation coefficients, and
(4) evaluate the accuracy of the equation
Regression analysis is used to:
 Predict the value of a dependent variable based on the
value of at least one independent variable
 Explain the impact of changes in an independent variable
on the dependent variable
Regression…
Regression Analysis is Used Primarily to Model
Causality and Provide Prediction
Predict the values of a dependent (response) variable
based on values of at least one independent
(explanatory) variable
Explain the effect of the independent variables on the
dependent variable
The relationship between X and Y can be shown on a
scatter diagram

Simple Linear Regression Model
 Only one independent variable, x
 Relationship between x and y is described by a
linear function
 Changes in y are assumed to be caused by
changes in x
Regression analysis serves three major purposes:

1. Description
2. Control
3. Prediction
Population Linear Regression
The population regression model:

y = β0 + β1x + ε

where y is the dependent variable, β0 the population y-intercept, β1 the population slope coefficient, x the independent variable, and ε the random error term (residual). β0 + β1x is the linear component; ε is the random error component.
Regression…
Explanatory and Response Variables are Numeric
Relationship between the mean of the response
variable and the level of the explanatory variable
assumed to be approximately linear (straight line)
Model: Y = β0 + β1x + ε,  ε ~ N(0, σ)
• β1 > 0 → positive association
• β1 < 0 → negative association
• β1 = 0 → no association
Critical Assumptions
Error term is normally distributed (Normality).
Error term has zero expected value or mean.
Error term has constant variance in each time period
and for all values of X (i.e. Homoscedasticity).
Error term's value in one time period is unrelated to its value in any other period (i.e. no autocorrelation).
The underlying relationship between the x variable
and the y variable is linear (Linearity)

Ordinary Least Squares (OLS) Estimations
β0: mean response when x = 0 (y-intercept)
β1: change in mean response when x increases by 1 unit (slope)
β0 and β1 are unknown population parameters
β0 + β1x: mean response when the explanatory variable takes on the value x
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

ŷi = b0 + b1xi

where ŷi is the estimated (or predicted) y value, b0 the estimate of the regression intercept, b1 the estimate of the regression slope, and xi the independent variable. The individual random error terms ei are random variables with a mean of zero.
have a mean of zero
Interpretation of the Slope and the Intercept
b0 is the estimated average value of
y when the value of x is zero

b1 is the estimated change in the


average value of y as a result of a
one-unit change in x
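For the simple (one-predictor) case, the least-squares estimates have the closed form b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄, sketched below on invented data:

```python
def ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]  # invented, roughly y = 2x
b0, b1 = ols(x, y)              # slope near 2, intercept near 0
```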
Multiple Linear Regression
In simple linear regression we studied the relationship
between one explanatory variable and one response
variable.
Now, we look at situations where several explanatory variables work together to explain the response variable.
Formal Statement of the Model
General regression model

Y = β0 + β1x1 + β2x2 + … + βkxk + ε

• β0, β1, …, βk are parameters
• X1, X2, …, Xk are known constants
• ε, the error terms, are independent N(0, σ²)
Estimating the parameters of the model
The values of the regression parameters i are not known.
We estimate them from data.
As in the simple linear regression case, we use the least-
squares method to fit a linear function to the data.

The least-squares method chooses the b's that make the sum of squares of the residuals as small as possible.

ŷ = b0 + b1x1 + b2x2 + … + bkxk
Testing for Overall Significance

Shows if Y Depends Linearly on All of the X Variables


Together as a Group
Use F Test Statistic
Hypotheses:
 H0: β1 = β2 = … = βk = 0 (no linear relationship)
 H1: at least one βi ≠ 0 (at least one independent variable affects Y)
The Null Hypothesis is a Very Strong Statement
The Null Hypothesis is Almost Always Rejected

Analysis of Variance and F Statistic

F = [Explained variation / (k − 1)] / [Unexplained variation / (n − k)]

Equivalently, F = MSR / MSE = [R² / (k − 1)] / [(1 − R²) / (n − k)]
Example ANOVA output (k = 3 parameters, so the Regression df is k − 1 = 2 and the Total df is n − 1 = 14; "Significance F" is the p-value):

             df    SS         MS         F          Significance F
Regression    2    228014.6   114007.3   168.4712   1.65411E-09
Residual     12    8120.603   676.7169
Total        14    236135.2

The Coefficient of Determination – R2
The coefficient of determination is the proportion of
the total variance that is explained by the regression.
It is the ratio of the explained sum of squares to the total
sum of squares.

The Coefficient of Determination – R2

R² = ESS / TSS = 1 − RSS / TSS = 1 − Σei² / Σ(Yi − Ȳ)²

The higher R² is, the closer the estimated regression equation fits the sample data. Since TSS, RSS and ESS are all non-negative (being squared deviations), and since ESS ≤ TSS, R² must lie in the interval

0 ≤ R² ≤ 1

A value of R² close to one shows a "good" overall fit, whereas a value near zero shows a failure of the estimated regression equation to explain the variation in Y.
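A sketch of the computation from observed values and model predictions; the numbers are invented:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS: share of total variation explained by the model."""
    my = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    tss = sum((a - my) ** 2 for a in y)                # total sum of squares
    return 1 - rss / tss

y = [2.0, 4.0, 6.0, 8.0]      # observed values (invented)
y_hat = [2.5, 3.5, 6.5, 7.5]  # predictions from some fitted model
r2 = r_squared(y, y_hat)      # 1 - 1.0/20.0 = 0.95
```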

Multiple regression model building
Often we have many explanatory variables, and our goal
is to use these to explain the variation in the response
variable.
A model using just a few of the variables often predicts
about as well as the model using all the explanatory
variables.
Linear Regression in SPSS
Analyze → Regression → Linear → select from the several options

Limited Dependent Variables

Dichotomous variables
Ordered Choice
Intensity measurement

Logistic regression
There are many important research topics for which the
dependent variable is "limited."
For example: voting, morbidity or mortality, and
participation data is not continuous or distributed
normally.
Binary logistic regression is a type of regression analysis
where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).
 Binary models

 Discrete choice models, etc.

The Linear Probability Model
the linear probability model can be written as:
Y = α + βX + e, where Y ∈ {0, 1}, or
P(y = 1|x) = β0 + xβ
But:
The error terms are heteroskedastic
e is not normally distributed because Y takes on only
two values
The predicted probabilities can be greater than 1 or less
than 0
An alternative is to model the probability as a function G(β0 + xβ), where 0 < G(z) < 1
The Logit Model
A common choice for G(z) is the logistic function, which is the
cdf for a standard logistic random variable
 G(z) = exp(z)/[1 + exp(z)] = L(z)
 This case is referred to as a logit model, or a logistic regression
 The estimated probability is given as:
ln[p/(1−p)] = α + βX + e, or
p = 1 / [1 + exp(−α − βX)]

The Logit Model
Where:
p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the "odds ratio"
ln[p/(1-p)] is the log odds ratio, or "logit"
 The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
 if you let α + βX = 0, then p = 0.50
 as α + βX gets really big, p approaches 1
 as α + βX gets really small, p approaches 0
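The logistic function that produces these probabilities is one line of Python:

```python
import math

def logistic(z):
    """Map z = alpha + beta * X to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

p_mid = logistic(0.0)    # exactly 0.5 when alpha + beta*X = 0
p_high = logistic(6.0)   # approaches 1 for large positive values
p_low = logistic(-6.0)   # approaches 0 for large negative values
```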

The Probit Model
Another choice for G(z) is the standard normal
cumulative distribution function (cdf)
 G(z) = Φ(z) ≡ ∫ φ(v) dv, where φ(z) is the standard normal density, φ(z) = (2π)^(−1/2) exp(−z²/2)
 This case is referred to as a probit model
Since discrete choice models are nonlinear models, they
cannot be estimated by OLS method
 we use maximum likelihood estimation

Probits and Logits
Both the probit and logit are nonlinear and require
maximum likelihood estimation
 No real reason to prefer one over the other
Both functions have similar shapes – they are increasing
in z, most quickly around 0
Traditionally we saw more use of the logit, mainly
because the logistic function was easier to compute.
Today, probit is easy to compute with standard
packages, so is also popular

Interpreting Coefficients
In general we care about the effect of x on P(y = 1|x),
that is, we care about ∂p/ ∂x
 For the linear case, this is easily computed as the
coefficient on x
 In the case of the logit, since:
p/(1−p) = exp(α) · exp(βX) · exp(e)
 The slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes
 exp(β) is the effect of the independent variable on the "odds ratio"

The Likelihood Ratio Test
 Unlike the LPM, where we can compute F statistics to test
exclusion restrictions, we need a new type of test
 Maximum likelihood estimation (MLE), will always
produce a log-likelihood, L
 Just as in an F test, you estimate the restricted and
unrestricted model, then form
 LR = 2(Lur − Lr) ~ χ²(q)

Goodness of Fit
Unlike the LPM, where we can compute an R2 to judge
goodness of fit, we need new measures of goodness of fit
One possibility is a pseudo R2 based on the log likelihood and
defined as 1 – Lur/Lr
Can also look at the percent correctly predicted.

Extensions
Unordered multiple (j>2) choices: travel mode,
treatment choice, etc., should be analyzed with the
multinomial logit model
Ordered multiple (j>2) choices: opinion/attitude
surveys, rankings, etc., should be analyzed with the
ordered logit model
Tobit Model used when the dependent variable is being
censored.
y* = xβ + u, u|x ~ Normal(0, σ²)
we only observe y = max(0, y*)

Limited dependent variable models in SPSS
Analyze → Regression → choose the model of your interest from the list other than 'Linear'
