BRM - Data Analysis, Interpretation and Reporting Part II
Alemseged Gerezgiher
(BSc, MBA, PhD)
05/07/24 1
Part VI (Sub-part II)
Data Analysis, Interpretation
and Reporting
Chapter Six: Data Analysis, Interpretation and
Reporting
Descriptive Analysis
Inferential Analysis
Hypothesis Testing
Data Analysis: Introduction
Once the data is ready for processing, the next step is to
choose an appropriate analysis method and conduct the
analysis.
Data analysis depends on the nature of the variable, the type
of data and the purpose of the analysis. The following issues
will affect the data analysis part of your research endeavor.
The type of data you have gathered (i.e.
nominal, ordinal, interval or ratio)
Are the data paired, such as before-and-after treatment measurements?
Are they parametric or non-parametric?
Ranks, scores, or categories are generally non-parametric data.
Measurements that come from a normally distributed population
are generally parametric data.
Analyzing qualitative Data
• There is considerable amount of interview, focus group
discussion and/or text-based data and images that require
analysis.
• Creswell (2003) suggests that it is useful to look at the
codes that have emerged according to:
Codes readers would expect to find;
Codes that are surprising; and
Codes that address a larger theoretical perspective in their research.
Then, follow these steps:
Identifying themes
Coding data (reducing data to manageable size)
Developing a description from the data
Defining themes from the data
Connecting and interrelating themes
Analyzing qualitative…
Further activities
Noting reflections in the margins
Sorting and sifting through the materials to identify similar
phrases, relationships, patterns, themes, commonalities, &
differences
Isolating patterns, processes, commonalities, & differences
and incorporating methods to further explore them into the
next wave of data collection
Gradually developing a small set of generalizations about
what consistently appears in the data
Confronting those generalizations with a formalized body of
knowledge in the form of constructs or theories
Quantitative analysis
The quantitative analysis uses numeric
expressions/representations and manipulations of
the collected data.
The analysis could take descriptive or
inferential form.
Based on the number of variables involved,
quantitative analysis can be univariate, bivariate,
and/or multivariate.
Quantitative analysis
Descriptive vs Inferential analysis:
Descriptive analysis: refers to statistically describing,
aggregating, and presenting the constructs of interest or
associations between these constructs.
Inferential analysis: refers to the statistical estimation of
parameter values and testing of hypotheses (theory testing).
Normal distribution…
Z-Score (Standard Normal Curve) – is a normal
curve with mean = 0 and standard deviation,
S = 1. It is used to compare scores in two or more
distributions that have different means and standard
deviations.
z = (x − x̄) / s, where z = number of
standard deviations x lies from the mean x̄, and s = standard deviation.
If the data is normally distributed, we employ
parametric tests
If the data is categorical or if the assumption of
normality does not hold, we use non-parametric tests
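As an illustration, the z-score formula above can be sketched in Python; the scores and distribution parameters below are hypothetical:

```python
def z_score(x, mean, s):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / s

# Compare a score of 85 in a class with mean 70, s = 10
# against a score of 80 in a class with mean 60, s = 5.
z1 = z_score(85, 70, 10)   # 1.5
z2 = z_score(80, 60, 5)    # 4.0

# Although 85 > 80 in raw terms, the second score is more
# exceptional relative to its own distribution.
print(z1, z2)
```

Because both scores are converted to a common scale, they can be compared even though the two distributions have different means and standard deviations.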
Using histogram to test the normality of the
data
Checking for normality with a Q-Q plot
Analyze, Descriptive Statistics, Explore…
Plots--- Normality plots with tests
Univariate analysis (Descriptive analysis)
• The following categories of the descriptive analysis are usually
used.
• Frequency distributions
• Measures of central tendency
• Measures of dispersion
• Shape of distribution
1) Frequency distributions (tables, bar graphs, pie charts, histograms)
a) Frequency table- a table summarizing the values of a variable
and the number of times the variable assumes a given value. It
has:
• Descriptive title
• Clear labels for columns and rows
• Appropriate categories
• Presentation of frequencies and corresponding percentages
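As an illustration, a frequency table with counts and percentages can be built from hypothetical nominal responses:

```python
from collections import Counter

# Hypothetical leisure-time preference responses (nominal data)
responses = ["reading", "sports", "reading", "movies",
             "sports", "reading", "movies", "reading"]

counts = Counter(responses)
n = len(responses)

# Frequency table: category, frequency, and corresponding percentage
print(f"{'Category':<10}{'Frequency':>10}{'Percent':>10}")
for category, freq in counts.most_common():
    print(f"{category:<10}{freq:>10}{100 * freq / n:>9.1f}%")
```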
Univariate analysis…
b) Pie charts and bar charts- when data is nominal or
ordinal, we use a pie chart or a bar chart. However, a pie chart
can show only one variable, while a bar chart can show more
than one.
c) Histogram –Histograms are used when it is an interval
level data measurement.
We can also have line graphs to explore the variable(s).
Univariate analysis…
• Example: Frequency table (Leisure time preference)
Example:
Bar Diagram: Lists the categories and presents the percent or
count of individuals who fall in each category.
Example:
Pie Chart: Lists the categories and presents the percent or
count of individuals who fall in each category.
Example:
Histogram: The overall pattern can be described by its shape,
center, and spread. The following age distribution is right
skewed. The center lies between 80 and 100. There are no outliers.
Frequency distributions in SPSS
Frequency tables: are found under the ‘analyze’ menu
bar (Analyze ---- Descriptive statistics ---- Frequencies)
Then, select variables and move them to ‘variable(s)’ dialog
box, choose from the options, display frequency tables, OK
Charts and graphs: two options
Analyze ---- Descriptive statistics --- Frequencies --- charts
Graphs --- Legacy dialogs --- charts/graphs (options)
Frequency distributions in SPSS
Analyze --- Descriptive statistics --- Frequencies
Univariate analysis…
2) Measures of central tendency
Central tendency is an estimate of the center of
a distribution of values.
There are three major estimates of central
tendency: mean, median, and mode.
Measures of central tendency…
1. Mean
For a data set, the mean is the sum of the values divided
by the number of values. The mean of a set of numbers
x1, x2, ..., xn is typically denoted by x̄, pronounced "x bar".
This mean is a type of arithmetic mean. The mean
describes the central location of the data; the arithmetic
mean is the "standard" average, often simply called the
"mean".
The other name is average
mainly for interval variables
very widely used and intuitively appealing
Measures of central tendency…
2. Median
It is the middle value of the distribution when all items are
arranged in either ascending or descending order in terms of
value
mid-point value; arrange data from lowest to highest to
identify mid value; if two mid values, take the average
Median = value of the ((n + 1) / 2)th item in the ordered data
mean is sensitive to outliers but median is robust
Measures of central tendency…
3. Mode
It is the value that occurs most frequently in the data set
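A minimal sketch of the three estimates on a small hypothetical income dataset (note how the outlier affects the mean but not the median):

```python
import statistics

incomes = [200, 250, 250, 300, 320, 400, 1200]  # hypothetical data

mean = statistics.mean(incomes)      # sensitive to the outlier 1200
median = statistics.median(incomes)  # robust middle value
mode = statistics.mode(incomes)      # most frequently occurring value

print(mean, median, mode)
```

The outlier 1200 pulls the mean well above the median, which is why the median is preferred for skewed distributions such as income.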
3) Measures of dispersion
• It measures the amount of scatter or variation in a dataset
• Or it refers to the way values are spread around the central
tendency, for example, how tightly or how widely are the
values clustered around the mean.
• similar measures of central tendency may come from very
different distributions
Measures of dispersion…
Variance:
The variance is used as a measure of how far a set of
numbers are spread out from each other. It is one of
several descriptors of a probability distribution,
describing how far the numbers lie from the mean
(expected value). In particular, the variance is one of
the moments of a distribution.
Var(x) = Σᵢ₌₁ⁿ (xᵢ − x̄)² / n
Measures of dispersion…
Standard deviation:
It is a widely used measurement of variability or diversity used
in statistics and probability theory. It shows how much variation
or “dispersion" there is from the average (mean, or expected
value). A low standard deviation indicates that the data points
tend to be very close to the mean, whereas high standard
deviation indicates that the data are spread out over a large
range of values. The standard deviation of X is given by:
s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / n )
A useful property of the standard deviation is that, unlike the
variance, it is expressed in the same units as the data.
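The variance and standard deviation can be sketched directly from the formulas above (hypothetical data; the population form with divisor n is used, matching the slide's formula):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
x_bar = sum(data) / n                      # mean = 5

# Population variance: average squared deviation from the mean
var = sum((x - x_bar) ** 2 for x in data) / n
sd = math.sqrt(var)                        # same units as the data

print(var, sd)
```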
Measures of shape of distribution
4) Measures of shape of distribution
skewness and kurtosis are the commonly used
measures of shape of distribution of a dataset.
Skewness:
It refers to symmetry or asymmetry of the
distribution.
The skewness value can be positive or negative, or
even undefined.
Measures of shape of distribution…
Skewness:
Qualitatively, a negative skew indicates that the tail on the
left side of the probability density function is longer than
the right side and the bulk of the values (possibly
including the median) lie to the right of the mean.
A positive skew indicates that the tail on the right side is
longer than the left side and the bulk of the values lie to
the left of the mean. A zero value indicates that the values
are relatively evenly distributed on both sides of the mean,
typically but not necessarily implying a symmetric
distribution.
Measures of shape of distribution…
The skewness of a random variable X is the third
standardized moment and defined as
SK = Σᵢ₌₁ⁿ (xᵢ − x̄)³ / ((n − 1) S³)
The coefficient of Skewness is a measure for the
degree of symmetry in the variable distribution.
Measures of shape of distribution…
Kurtosis:
It refers to peakedness of the distribution.
It is a measure of the "peakedness" of the probability
distribution of a real-valued random variable.
Higher kurtosis means more of the variance is the result of
infrequent extreme deviation, as opposed to frequent
modestly sized deviations.

KU = Σᵢ₌₁ⁿ (xᵢ − x̄)⁴ / ((n − 1) S⁴)
Measures of shape of distribution…
The coefficient of Kurtosis is a measure for the degree of
peakedness/flatness in the variable distribution.
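Both shape measures can be computed directly from the formulas above (hypothetical data; S here is the sample standard deviation, as in the formulas):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
x_bar = sum(data) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))  # sample SD

# Third and fourth standardized moments, as defined on the slides
sk = sum((x - x_bar) ** 3 for x in data) / ((n - 1) * s ** 3)
ku = sum((x - x_bar) ** 4 for x in data) / ((n - 1) * s ** 4)

print(sk, ku)  # sk > 0 here: the long tail is on the right (the value 9)
```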
Central tendency, dispersion and shape in SPSS
Analyze --- Descriptive statistics --- Descriptives ---
Options (select the statistics of interest)
Bivariate analysis
How do we analyze relationships between two variables?
Bivariate analysis is the analysis of two variables to examine
whether they are correlated or whether their values differ
across groups.
Remember co-variation does not always imply
causation
Bivariate analysis
• Examples:
• Do men earn more income than women?
• Does educational level affect attitudes toward
participation in labour union?
• Is income level correlated with life expectancy?
• Is parental educational level correlated with student
performance?
We need to conduct hypothesis testing to arrive at
conclusive results on issues like these.
Hypotheses Testing
The following are the steps in hypothesis testing:
1. State the null hypothesis.
2. Choose an appropriate statistical test.
3. Specify the level of statistical significance (usually
0.1, 0.05 or 0.01) --- known as the α-level.
4. Decide to accept or reject the null hypothesis
based on the findings.
We use different tests based on the nature of the dependent
and independent variables and the nature of the distribution of
the data.
During hypothesis testing, there is a possibility of
committing decision errors. There are two types of errors.
Hypothesis…
"Type I error"
A Type I error is a false positive result.
If you use a parametric test on nonparametric data then
this could trick the test into seeing a significant effect
when there isn't one.
Or, it is a situation where we reject a null hypothesis
that is in fact true.
The probability of committing a Type I error is the
significance level (α).
This error requires more attention and is important to avoid.
Hypothesis…
“Type II error”
It occurs when we accept a null hypothesis that is false.
This can occur if you use a nonparametric test on
parametric data, which reduces the chance of seeing a
significant effect when there is one.
A Type II error is a missed opportunity, i.e. we have
failed to detect a significant effect that truly does exist.
It is considered the less serious of the two errors.
Summary; Using a parametric test in the wrong context
may lead to a type one error, a false positive.
Using a nonparametric test in the wrong context may lead
to a type two error, a missed opportunity.
Hypothesis…
Reading P-value
It is the basis for deciding whether or not to reject the
null hypothesis.
P-values do not simply provide you with a Yes or No
answer, they provide a sense of the strength of the
evidence against the null hypothesis.
The lower the p-value, the stronger the evidence; usually
when it is less than 0.05 or 0.01, the null hypothesis is rejected.
It is the probability that a statistical result as extreme as
the one observed would occur if the null hypothesis were
true.
Hypothesis…
Parametric tests
T-test (one sample, independent sample, paired)
One-way ANOVA
Repeated ANOVA (for paired data)
Pearson correlation
Hypothesis…
Choice of test by measurement level of the two variables:
• Ordinal × Nominal: contingency table, Chi-square, Cramér's V
• Ordinal × Ordinal: Spearman's rho (ρ)
• Ordinal × Interval/Ratio: Spearman's rho (ρ)
• Ordinal × Dichotomous: Spearman's rho (ρ)
• Dichotomous × Nominal: contingency table, Chi-square, Cramér's V
• Dichotomous × Ordinal: Spearman's rho (ρ)
• Dichotomous × Interval/Ratio: Spearman's rho (ρ)
• Dichotomous × Dichotomous: Phi (ɸ)
Hypothesis…
Choosing a test by situation:
• Compare to a target (e.g. Is the average age of employees more
than 40 years?): use a one-sample t-test
• Compare two groups (e.g. Do men earn more income than
women?): use an independent-samples t-test
• Compare two groups with one controlled intervention (e.g. test
scores before and after training): use a paired t-test
• Compare more than two groups (e.g. compare amount of income
between four categories of educational level): use one-way
ANOVA (F-test)
Chi-square Test
Chi-square can be calculated as follows
χ² = Σ [(observed − expected)² / expected]
If the calculated chi-square is greater than the chi-
square obtained from the table, then we conclude
there is a relationship (that is, we reject H₀).
Remember, as in all hypothesis testing, the null hypothesis
of the Chi-square test is that there is no relationship between
the DV and IV.
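A minimal sketch of the chi-square computation for a hypothetical 2×2 contingency table (the observed counts are invented for illustration):

```python
# Hypothetical 2x2 contingency table: gender vs union participation
#              member  non-member
observed = [[30, 20],   # men
            [20, 30]]   # women

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected under H0
        chi_sq += (o - e) ** 2 / e

# Critical chi-square value at alpha = 0.05 with (2-1)(2-1) = 1 df is 3.841
print(chi_sq, chi_sq > 3.841)
```

Here the calculated statistic exceeds the table value, so under these invented counts H₀ (no relationship) would be rejected.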
Contingency Table and Chi-square in SPSS
Analyze= Custom Tables = Custom Tables =
Ok= Row and Column= Test Statistics = Tests
of independence (Chi-square) = Ok
Or
Analyze= Descriptive statistics= Crosstabs=
choose DV into Rows and IV into Columns=
Statistics= Chi-square= OK
Comparing two groups: T-tests
A t-test is a statistical hypothesis test. In such test, the test statistic
follows a Student’s T-distribution if the null hypothesis is true. The T-
statistic was introduced by W.S. Gossett under the pen name
“Student”.
It is among the most frequently used procedures for determining
whether or not the means of two independent groups could
conceivably have come from the same population.
If you compute means for two samples, they will almost always
differ to some degree. The job of the t-test is to see whether they
differ by chance or whether the difference is real and reliable.
It is given by:

t = (x̄ − μ) / (s / √n)
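The one-sample t statistic can be sketched directly from the formula above, on a hypothetical sample of employee ages (H₀: mean age = 40):

```python
import math
import statistics

# Hypothetical sample of employee ages; H0: population mean = 40
ages = [42, 45, 38, 50, 47, 41, 44, 39, 46, 48]
mu0 = 40

n = len(ages)
x_bar = statistics.mean(ages)
s = statistics.stdev(ages)            # sample standard deviation

t = (x_bar - mu0) / (s / math.sqrt(n))

# Compare |t| against the critical t value for n - 1 = 9 df
# (about 2.262 at the 0.05 level, two-tailed).
print(t)
```

With these invented ages the statistic exceeds the critical value, so H₀ would be rejected at the 0.05 level.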
T-test in SPSS
Parametric
Analyze --- Compare Means --- One-Sample T Test or
Independent-Samples T Test or Paired-Samples T Test
• Non-parametric
• Analyze --- Nonparametric Tests --- Related Samples or
Independent Samples or One Sample (automatically
compares observed data to hypothesized values)
Comparing more than two groups: ANOVA
ANOVA (similar to Difference of Means Test) is used
to examine variations among groups (and within
members of a group) with respect to some behavior
and see if the variations are statistically significant.
Groups may be like: male/female; economically
developed/ economically developing; smokers/non-
smokers; dry-lands/wet-lands; religious/non-
religious, High, medium, low; etc.
In ANOVA, the DV has an interval/scale measure,
while the IV has a nominal or ordinal measure.
ANOVA test
We use the F-test in ANOVA, given by
F_calculated = MSR / MSE
(the ratio of the between-group mean square to the within-group mean square)
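The F statistic can be sketched from its definition (hypothetical income samples for three education levels; all values are invented for illustration):

```python
# Hypothetical income samples for three education levels
groups = [
    [20, 22, 25, 24],   # primary
    [28, 30, 27, 31],   # secondary
    [35, 38, 36, 37],   # tertiary
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
k = len(groups)
n = len(all_values)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)   # MSR
ms_within = ss_within / (n - k)     # MSE
f_stat = ms_between / ms_within

# A large F means group means differ more than chance alone would allow;
# the critical F at the 0.05 level with (2, 9) df is about 4.26.
print(f_stat)
```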
ANOVA in SPSS
Analyze, Compare Means, One-Way ANOVA...
(Parametric test)
Analyze, Nonparametric, such as Kruskal-Wallis
one-way non-parametric ANOVA
Choose Post Hoc..., Post Hoc Tests, Choose Tukey
Scatterplots/diagrams: Linearity
Scatter plot/diagram:
values of the two variables plotted on each axis
strong relationships can be identified by scatter
diagrams
Four relationships can be identified
Positive linear
Negative linear
Non-linear (curvilinear)
No relationship at all
Scatter plot of a positive association
Income and livestock ownership
[Scatter plot: livestock (y-axis, 0 to 60) against income (x-axis, 0 to 1200); points trend upward]
Scatter plot of a negative association
Income & illiteracy rates (%)
[Scatter plot: rate of illiteracy (%) (y-axis, 0 to 100) against income (x-axis, 0 to 1200); points trend downward]
Scatter plot of no association
Income and household size
[Scatter plot: household size (y-axis, 0 to 12) against income (x-axis, 0 to 1200); no visible pattern]
Scatter and line graph
[Figures: left, a positive linear relationship; right, a relationship that is NOT linear]
Covariance and Correlations
The interest is in the association/relationship between
two variables, or whether they vary together.
Example:
Does income of individuals increase as age increases??
Is the amount of sales associated with advertising
expenditure?
Is crime related with socio-economic background?
Is student academic achievement associated with
parent’s educational level?
Covariance
Covariance:
Covariance between X and Y refers to a measure of how
much two variables change together.
Covariance indicates how two variables are related. A
positive covariance means the variables are positively
related, while a negative covariance means the variables are
inversely related. The formula for calculating covariance of
sample data is shown below.
Cov(x, y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / n
Correlation Analysis
Correlation:
Is concerned with the relationship/association,
direction and strength of the relationship between
variables.
Correlation coefficients can be calculated to see the
direction and strength of the relationship
Depends on the nature of variables (parametric vs non-
parametric, or numeric vs non-numeric)
r(x, y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² · Σ (yᵢ − ȳ)² )
Correlation...
The most commonly used is Pearson’s correlation coefficient
or Pearson’s r or simply correlation coefficient
Captures linear relationship between variables; non-linear
relationship are not captured
Lies between -1 & 1
r=0: no linear relationship
r=1: perfect positive relationship
r=-1: perfect negative relationship
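Both covariance and Pearson's r can be sketched from their formulas (hypothetical paired observations of income and livestock ownership, echoing the earlier scatter plot):

```python
import math

income = [200, 400, 600, 800, 1000]
livestock = [5, 12, 18, 30, 35]   # hypothetical paired observations

n = len(income)
mean_x = sum(income) / n
mean_y = sum(livestock) / n

# Covariance: average product of paired deviations from the means
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(income, livestock)) / n

# Pearson's r: covariance scaled to lie between -1 and 1
r = sum((x - mean_x) * (y - mean_y)
        for x, y in zip(income, livestock)) / \
    math.sqrt(sum((x - mean_x) ** 2 for x in income) *
              sum((y - mean_y) ** 2 for y in livestock))

print(cov, r)  # cov > 0 and r close to +1: strong positive association
```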
Regression Analysis
Regression analysis is a set of statistical techniques using
past observations to find (or estimate) the equation that best
summarizes the relationships among key economic
variables.
The method requires that analysts:
(1) collect data on the variables in question,
(2) specify the form of the equation relating the variables,
(3) estimate the equation coefficients, and
(4) evaluate the accuracy of the equation
Regression analysis is used to:
Predict the value of a dependent variable based on the
value of at least one independent variable
Explain the impact of changes in an independent variable
on the dependent variable
Regression…
Regression Analysis is Used Primarily to Model
Causality and Provide Prediction
Predict the values of a dependent (response) variable
based on values of at least one independent
(explanatory) variable
Explain the effect of the independent variables on the
dependent variable
The relationship between X and Y can be shown on a
scatter diagram
Simple Linear Regression Model
Only one independent variable, x
Relationship between x and y is described by a
linear function
Changes in y are assumed to be caused by
changes in x
Regression analysis serves three major purposes:
1. Description
2. Control
3. Prediction
Population Linear Regression
The population regression model:
y = β₀ + β₁x + ε

where y = dependent variable, β₀ = population y-intercept,
β₁ = population slope coefficient, x = independent variable,
and ε = random error term (residual).
Ordinary Least Squares (OLS) Estimations
β₀ = mean response when x = 0 (y-intercept)
β₁ = change in mean response when x increases by 1 unit (slope)
β₀ and β₁ are unknown population parameters
β₀ + β₁x = mean response when the
explanatory variable takes on the value x
Estimated Regression Model
The sample regression line provides an
estimate of the population regression line
ŷᵢ = b₀ + b₁xᵢ
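The least-squares estimates b₀ and b₁, and the R² of the fitted line, can be sketched from the standard formulas (the x and y values below are hypothetical):

```python
# Least-squares estimates: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar
x = [1, 2, 3, 4, 5]             # hypothetical advertising spend
y = [2.1, 4.0, 6.2, 7.9, 10.1]  # hypothetical sales

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx                # estimated slope
b0 = y_bar - b1 * x_bar         # estimated intercept

# Goodness of fit: R^2 = 1 - RSS / TSS
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
tss = sum((yi - y_bar) ** 2 for yi in y)
r_sq = 1 - rss / tss

print(b1, b0, r_sq)
```

An R² near one, as here, indicates that the fitted line explains almost all of the variation in y.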
Analysis of Variance and F Statistic
F = MSR / MSE = [R² / (k − 1)] / [(1 − R²) / (n − k)]

ANOVA
            df    SS         MS         F          Significance F
Regression   2    228014.6   114007.3   168.4712   1.65411E-09
Residual    12    8120.603   676.7169
Total       14    236135.2

Here k = 3 (the number of parameters), so the regression df is k − 1 = 2
and the total df is n − 1 = 14; the Significance F column is the p-value.
The Coefficient of Determination – R2
The coefficient of determination is the proportion of
the total variance that is explained by the regression.
It is the ratio of the explained sum of squares to the total
sum of squares.
The Coefficient of Determination – R2
R² = ESS / TSS = 1 − RSS / TSS = 1 − Σeᵢ² / Σ(Yᵢ − Ȳ)²

The higher R² is, the closer the estimated regression equation fits the
sample data.
•Since TSS, RSS and ESS are all non-negative (being squared deviations),
•and since ESS ≤ TSS, R² must lie in the interval
0 ≤ R² ≤ 1
•A value of R² close to one shows a "good" overall fit, whereas a value
near zero shows a failure of the estimated regression equation to explain
the variation in Y.
Multiple regression model building
Often we have many explanatory variables, and our goal
is to use these to explain the variation in the response
variable.
A model using just a few of the variables often predicts
about as well as the model using all the explanatory
variables.
Linear Regression in SPSS
Analyze --- Regression --- Linear --- select the desired options
Limited Dependent Variables
Dichotomous variables
Ordered Choice
Intensity measurement
Logistic regression
There are many important research topics for which the
dependent variable is "limited."
For example: voting, morbidity or mortality, and
participation data are not continuous or normally
distributed.
Binary logistic regression is a type of regression analysis
where the dependent variable is a dummy variable: coded 0
(did not vote) or 1 (did vote)
Binary models
The Linear Probability Model
the linear probability model can be written as:
Y = α + βX + e, where Y = (0, 1), or
P(y = 1|x) = β₀ + xβ
But:
The error terms are heteroskedastic
e is not normally distributed because Y takes on only
two values
The predicted probabilities can be greater than 1 or less
than 0
An alternative is to model the probability as a function,
G(b0 + xb), where 0<G(z)<1
The Logit Model
A common choice for G(z) is the logistic function, which is the
cdf for a standard logistic random variable
G(z) = exp(z)/[1 + exp(z)] = L(z)
This case is referred to as a logit model, or a logistic regression
The estimated probability is given as:
ln[p/(1−p)] = α + βX + e, or
p = 1 / [1 + exp(−α − βX)]
The Logit Model
Where:
p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the "odds ratio"
ln[p/(1-p)] is the log odds ratio, or "logit"
The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
if you let α + βX = 0, then p = .50
as α + βX gets really big, p approaches 1
as α + βX gets really small, p approaches 0
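A minimal sketch of the logistic function and the three limiting behaviors just described (α and β are hypothetical coefficients chosen for illustration):

```python
import math

def logistic(z):
    """Logistic cdf G(z) = exp(z) / (1 + exp(z)); always in (0, 1)."""
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical coefficients alpha = -1.5, beta = 0.05 for a predictor X
alpha, beta = -1.5, 0.05

p_mid = logistic(alpha + beta * 30)    # alpha + beta*X = 0 -> p = 0.5
p_low = logistic(alpha + beta * 0)     # negative z -> p below 0.5
p_high = logistic(alpha + beta * 200)  # large positive z -> p near 1

print(p_mid, p_low, p_high)
```

Whatever the value of α + βX, the estimated probability stays strictly between 0 and 1, which is exactly the constraint the linear probability model fails to guarantee.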
The Probit Model
Another choice for G(z) is the standard normal
cumulative distribution function (cdf)
G(z) = Φ(z) ≡ ∫₋∞ᶻ φ(v) dv, where φ is the standard normal density,
so φ(z) = (2π)^(−1/2) exp(−z²/2)
This case is referred to as a probit model
Since discrete choice models are nonlinear models, they
cannot be estimated by OLS method
we use maximum likelihood estimation
Probits and Logits
Both the probit and logit are nonlinear and require
maximum likelihood estimation
No real reason to prefer one over the other
Both functions have similar shapes – they are increasing
in z, most quickly around 0
Traditionally we saw more use of the logit, mainly
because the logistic function was easier to compute.
Today, probit is easy to compute with standard
packages, so is also popular
Interpreting Coefficients
In general we care about the effect of x on P(y = 1|x),
that is, we care about ∂p/ ∂x
For the linear case, this is easily computed as the
coefficient on x
In the case of the logit, since:
p/(1−p) = exp(α) · exp(βX) · exp(e)
The slope coefficient (β) is interpreted as the rate of
change in the "log odds" as X changes
exp(β) is the effect of the independent variable on the
"odds ratio"
The Likelihood Ratio Test
Unlike the LPM, where we can compute F statistics to test
exclusion restrictions, we need a new type of test
Maximum likelihood estimation (MLE), will always
produce a log-likelihood, L
Just as in an F test, you estimate the restricted and
unrestricted model, then form
LR = 2(L_ur − L_r) ~ χ²_q
Goodness of Fit
Unlike the LPM, where we can compute an R2 to judge
goodness of fit, we need new measures of goodness of fit
One possibility is a pseudo R2 based on the log likelihood and
defined as 1 – Lur/Lr
Can also look at the percent correctly predicted.
Extensions
Unordered multiple (j>2) choices: travel mode,
treatment choice, etc., should be analyzed with the
multinomial logit model
Ordered multiple (j>2) choices: opinion/attitude
surveys, rankings, etc., should be analyzed with the
ordered logit model
Tobit Model used when the dependent variable is being
censored.
y* = xβ + u,  u|x ~ Normal(0, σ²)
we only observe y = max(0, y*)
Limited dependent variable models in SPSS
Analyze --- Regression --- choose the model of your
interest from the list other than 'Linear'