
Data Preparation & Analysis

The document discusses various steps involved in data preparation and analysis using SPSS. It covers questionnaire checking, editing, coding, creating a codebook, data cleaning, and selecting appropriate univariate and multivariate statistical techniques based on the characteristics and properties of the data.

Uploaded by Vardaan Bhaik

Data Preparation & Analysis with SPSS
Data Preparation Process
Check Questionnaire

Edit

Code

Transcribe

Clean Data

Data Analysis
Questionnaire Checking
A questionnaire returned from the field may be
unacceptable for several reasons.
Editing

Editing the questionnaires involves identifying illegible, incomplete,
inconsistent or ambiguous responses.

Treatment of Unsatisfactory Results

– Returning to the Field

– Assigning Missing Values

– Discarding Unsatisfactory Respondents


Coding
Coding means assigning a code, usually a number, to each possible
response to each question. The code includes an indication of:
– the column position (field), e.g. the sex of a respondent
– the data record, which groups related fields such as sex, marital status,
age, income etc.

Coding Questions

• Fixed field codes, which mean that the number of records for each
respondent is the same and the same data appear in the same
column(s) for all respondents, are highly desirable.

• If possible, standard codes should be used for missing data. Coding of
structured questions is relatively simple, since the response options are
predetermined.

• In questions that permit a large number of responses, each possible
response option should be assigned a separate column.
Coding
Guidelines for coding unstructured questions:

• Category codes should be mutually exclusive and collectively
exhaustive.

• Only a few (10% or less) of the responses should fall into the “other”
category.

• Category codes should be assigned for critical issues even if no one
has mentioned them.

• Data should be coded to retain as much detail as possible.


Coding : Codebook
A codebook contains coding instructions and the necessary
information about variables in the data set. A codebook generally
contains the following information:

• column number
• record number
• variable number
• variable name
• question number
• instructions for coding
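
As a rough illustration of how a codebook drives coding, the sketch below stores a small codebook as a Python dictionary and uses it to turn raw answers into a fixed-field record. The variable names, column numbers, code values and the missing-data code 9 are all hypothetical; they are not taken from the slides or from any SPSS file.

```python
# Hypothetical codebook: variable names, columns and codes are illustrative only.
MISSING = 9  # assumed standard code for missing data

CODEBOOK = {
    "sex": {"column": 1, "question": "Q1",
            "codes": {"male": 1, "female": 2}},
    "marital_status": {"column": 2, "question": "Q2",
                       "codes": {"single": 1, "married": 2, "other": 3}},
}

def encode(variable, response):
    """Return the numeric code for a response, or MISSING if blank or unknown."""
    codes = CODEBOOK[variable]["codes"]
    if response is None:
        return MISSING
    return codes.get(response.strip().lower(), MISSING)

def encode_record(answers):
    """Build a fixed-field record: one code per variable, ordered by column."""
    ordered = sorted(CODEBOOK, key=lambda v: CODEBOOK[v]["column"])
    return [encode(v, answers.get(v)) for v in ordered]

record = encode_record({"sex": "Female", "marital_status": ""})
print(record)  # [2, 9]
```

Because every respondent yields the same number of fields in the same order, the same data always land in the same column, which is the fixed-field property recommended above.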
Data Cleaning
Consistency Checks

Consistency checks: identify data that are out of range, logically
inconsistent, or have extreme values.
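
A minimal sketch of such checks, assuming coded records held as Python dictionaries. The field names, the permitted ranges and the logical rule (years as a customer cannot exceed age) are invented for illustration only.

```python
# Hypothetical coded records; field names, ranges and the logical rule are assumptions.
VALID_RANGES = {"sex": (1, 2), "age": (18, 99), "rating": (1, 7)}

def find_problems(record):
    """Return a list of consistency problems found in one coded record."""
    problems = [field for field, (low, high) in VALID_RANGES.items()
                if not low <= record[field] <= high]
    # Logical check: nobody can have been a customer for longer than they have lived.
    if record["years_as_customer"] > record["age"]:
        problems.append("years_as_customer > age")
    return problems

records = [
    {"id": 1, "sex": 1, "age": 34, "rating": 5, "years_as_customer": 6},
    {"id": 2, "sex": 3, "age": 41, "rating": 6, "years_as_customer": 10},
    {"id": 3, "sex": 2, "age": 27, "rating": 8, "years_as_customer": 30},
]

flagged = {r["id"]: find_problems(r) for r in records if find_problems(r)}
print(flagged)  # {2: ['sex'], 3: ['rating', 'years_as_customer > age']}
```

Flagged records would then be traced back to the questionnaire, corrected, or assigned missing values.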
Selecting a Data Analysis Strategy
Earlier Steps of the Research Process

Known Characteristics of the Data

Properties of Statistical Techniques

Background and Philosophy of the Researcher

Data Analysis Strategy


• Metric Data- Data that are on an interval or ratio scale
• Non-metric Data- Data that are on a nominal or ordinal scale
• Univariate Techniques- Statistical techniques appropriate for analysing
data when there is a single measurement of each element in the sample.
• Multivariate Techniques- Statistical techniques appropriate for analysing
data when there are two or more measurements on each element in the
sample. They examine simultaneous relationships among two or more
phenomena.
– Dependence Techniques- Used when one or more of the variables can be
identified as dependent variables & the remaining as independent
variables.
– Interdependence Techniques- Techniques that attempt to group
data based on underlying similarity. No distinction is made as to which
variables are dependent/ independent.
A Classification of Univariate Techniques
Univariate Techniques

• Metric Data
  – One Sample: t test, Z test
  – Two or More Samples, Independent: Two-Group t test, Z test, One-Way ANOVA
  – Two or More Samples, Related: Paired t test

• Non-numeric Data
  – One Sample: Frequency, Chi-Square, K-S, Runs, Binomial
  – Two or More Samples, Independent: Chi-Square, Mann-Whitney, Median, K-S,
    K-W ANOVA
  – Two or More Samples, Related: Sign, Wilcoxon, McNemar, Chi-Square
A Classification of Multivariate Techniques
Multivariate Techniques

• Dependence Technique
  – One Dependent Variable: Cross-Tabulation, Analysis of Variance and
    Covariance, Multiple Regression, Conjoint Analysis
  – More Than One Dependent Variable: Multivariate Analysis of Variance and
    Covariance, Canonical Correlation, Multiple Discriminant Analysis

• Interdependence Technique
  – Variable Interdependence: Factor Analysis
  – Interobject Similarity: Cluster Analysis, Multidimensional Scaling
Type I Error & Type II Error
• A Type I error is the mistake of rejecting the null hypothesis when it is true; its probability is α.
• A Type II error is the mistake of failing to reject the null hypothesis when it is false; its probability is β.

               Ho (True)                    Ho (False)
Accept Ho      Correct Decision (1- α)      Type II Error (β)
Reject Ho      Type I Error (α)             Correct Decision (1- β)

Type II error example: The machine is working erroneously, but it is assumed
to be working accurately; hence it will fill wrongly, causing loss to the
company & customers.

Type I error example: The machine is working accurately, but it is assumed
to be working erroneously; hence filling will be stopped & a mechanic is
called.

23 April 2018
Hypothesis Testing
• Level of Significance (α): The risk that a researcher is willing to take of rejecting the null hypothesis when it
happens to be true. It is the probability of making a Type I error. The higher the significance level, the higher
the probability of rejecting a null hypothesis when it is true.

• Critical Region: The rejection region. If the value of the test statistic (e.g. the sample mean) falls within this
region, the null hypothesis is rejected.

• Critical Value: The value of a test statistic beyond which the null hypothesis can be rejected.

• Power of Test (1- β): The ability of a test to reject a false null hypothesis, i.e. the probability of supporting
an alternative hypothesis that is true. A high value of 1- β (near 1) means the test is working well: it rejects
the null hypothesis when it is false.

• One-Tailed Test: The null hypothesis is rejected only for values of the test statistic falling into one specified
tail of its sampling distribution.

• Two-Tailed Test: The null hypothesis is rejected for values of the test statistic falling into either tail of its
sampling distribution. A deviation in either direction would reject the null hypothesis. Normally α is divided
into α/2 on one side and α/2 on the other.
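
To make the link between α, the critical value and power concrete, here is a small sketch for a z test using only Python's standard library. The function names and the idea of expressing the true shift in standard-error units (`effect`) are assumptions made for this illustration; the slides themselves rely on SPSS output.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def critical_values(alpha, tails):
    """Critical z value(s) for a one- or two-tailed test at level alpha."""
    if tails == 2:
        z = Z.inv_cdf(1 - alpha / 2)  # alpha is split into alpha/2 per tail
        return (-z, z)
    return (Z.inv_cdf(1 - alpha),)    # upper-tail test; negate for a lower tail

def power_upper_tail(alpha, effect):
    """Power (1 - beta) of an upper-tail z test.

    `effect` is the true shift of the mean measured in standard errors.
    """
    z_crit = Z.inv_cdf(1 - alpha)
    beta = Z.cdf(z_crit - effect)     # P(fail to reject H0 | H0 is false)
    return 1 - beta

print([round(z, 3) for z in critical_values(0.05, 2)])  # [-1.96, 1.96]
print(round(critical_values(0.05, 1)[0], 3))            # 1.645
print(round(power_upper_tail(0.05, 2.5), 3))
```

With no true shift (`effect = 0`) the rejection probability equals α, and it rises towards 1 as the shift grows, which is the behaviour the power definition describes.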
One Tailed & Two Tailed Test
• A manufacturer of a light bulb wants to produce bulbs with a mean life of 1000
hours. If the lifetime is shorter, he will lose customers to the competitors; if the
lifetime is longer, he will have a very high production cost because the filaments will
be very thick. Determine the type of test.

• The wholesaler buys bulbs in large lots & does not want to accept bulbs unless
their mean life is at least 1000 hours. Determine the type of test.
One Tailed Test
Two Tailed Test
Univariate Data Analysis
t-tests (Cases)
One sample t-test: To test whether the mean of a distribution differs
significantly from some preset value.

For the given marks.sav file, find whether the final marks scored by students
differ significantly from the Professor’s goal of a class average of 60.
Design the hypotheses & test them.
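
The calculation behind a one-sample t-test can be sketched as follows. The marks are made-up values, not the contents of marks.sav, and the critical value 2.262 is the standard two-tailed table value for α = 0.05 with 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean = mu0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - mu0) / standard_error, n - 1

# Made-up final marks for ten students (illustrative only).
final_marks = [55, 62, 70, 48, 66, 59, 73, 61, 57, 64]
t, df = one_sample_t(final_marks, 60)

T_CRITICAL = 2.262  # two-tailed, alpha = 0.05, df = 9 (from a t table)
reject_h0 = abs(t) > T_CRITICAL
print(round(t, 3), df, reject_h0)
```

Here Ho is that the mean final mark equals 60 and H1 is that it differs from 60; SPSS reports the same t statistic together with its exact significance value.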
t-tests (Cases)
Independent sample t-test: To test whether the means of two independent
samples differ significantly from each other.

Suppose 15 customers of our brand in each of Mumbai & Delhi are asked to
rate our brand on a 7 point scale, where 1 = most disliked & 7 = most liked.
The ratings by these 30 customers from the two cities are mentioned next.
Develop a hypothesis to test whether the ratings from the two cities are
different. Also test the hypothesis.
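
A sketch of the pooled-variance t statistic for two independent samples. The ratings below are invented stand-ins for the Mumbai and Delhi data, and equal population variances are assumed.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

# Made-up brand ratings on the 7 point scale (15 customers per city).
mumbai = [5, 6, 4, 7, 5, 6, 5, 4, 6, 7, 5, 6, 4, 5, 6]
delhi = [3, 4, 5, 4, 3, 5, 4, 2, 4, 5, 3, 4, 5, 4, 3]

t, df = independent_t(mumbai, delhi)
print(round(t, 3), df)
```

Ho states that the mean ratings of the two cities are equal; the statistic is compared with the tabulated t value for 28 degrees of freedom.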
t-tests (Cases)
Paired sample t-test: To test whether two measurements on the same sample
differ significantly.

Suppose there are 18 customers of the Passion brand of garments. This set of
customers is to be monitored for their attitude towards the Passion brand
before and after the release of an advertising campaign. The attitude is to
be measured on a 10 point scale, where 1 = highly disliked & 10 = highly liked.

The ratings by these 18 customers before and after the advertising
campaign are mentioned next. Develop a hypothesis to test whether these
ratings by customers are different. Also test the hypothesis.
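
A paired test reduces to a one-sample test on the differences, as the sketch below shows. The before and after ratings are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic and degrees of freedom for paired measurements."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Made-up attitude ratings of 18 customers on the 10 point scale.
before = [4, 5, 3, 6, 5, 4, 6, 5, 3, 4, 5, 6, 4, 5, 3, 5, 4, 6]
after = [6, 6, 5, 7, 6, 5, 8, 6, 4, 6, 6, 7, 5, 7, 4, 6, 5, 7]

t, df = paired_t(before, after)
print(round(t, 3), df)
```

Ho states that the mean difference between the after and before ratings is zero.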
ANOVA
• Whereas t-tests compare only two distributions, analysis of variance is able to
compare many. E.g. in the MARKS file, if we want to see whether the Quiz1 scores of
men and women are different, i.e. who (men or women) scores higher in the quiz, a
t-test is appropriate.

If, however, we wish to see whether the scores of any of the five different ethnic
groups differ significantly from each other on the same quiz, one way analysis of
variance is required.

One way ANOVA means:
Exactly one dependent variable (continuous), e.g. quiz1 scores here
Exactly one independent variable (categorical), e.g. ethnicity here, with 5 levels

Two (three) way ANOVA means: Exactly one dependent variable &
exactly two (three) independent variables

MANOVA: Multiple dependent variables & multiple independent variables


One-Way ANOVA
• File # MARKS.sav
Dependent variable – Quiz 1 scores
Independent variable – Ethnicity (with 5 levels)

– Ho: There is no difference among students of different
ethnicities in the quiz1 marks scored by them.

– H1: There is a significant difference among students of
different ethnicities in the quiz1 marks scored by them.
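
The F ratio that one-way ANOVA reports can be sketched as the between-group mean square divided by the within-group mean square. The five groups of quiz scores below are hypothetical and do not come from MARKS.sav.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples, one per level."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    f = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f, k - 1, n - k

# Made-up quiz1 scores for five ethnic groups (illustrative only).
quiz1_by_ethnicity = [
    [7, 8, 6, 9, 7],
    [6, 5, 7, 6, 8],
    [8, 9, 9, 7, 10],
    [5, 6, 4, 7, 6],
    [7, 7, 8, 6, 9],
]

f, df_between, df_within = one_way_anova(quiz1_by_ethnicity)
print(round(f, 3), df_between, df_within)
```

A large F relative to the tabulated value for (4, 20) degrees of freedom would lead to rejecting Ho; it does not say which groups differ, which needs a follow-up comparison.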
Chi-Square Test
Graduation background of MBA students & their performance in terms of
grade is given below:

Education Background:
• B.com (1)
• B.E. (2)
• B.Sc. (3)
• B.B.A. (4)
• B.A. (5)

Grade Codes:
• A (1)
• B (2)
• C (3)

Ho: Graduation background of MBA students does not influence their
performance in terms of grade.
Ha: Graduation background of MBA students influences their performance in
terms of grade.
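
A chi-square test of independence on a background-by-grade table can be sketched as below. The counts are invented; only the 5 x 3 layout (five education backgrounds, three grades) follows the coding above.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if background and grade were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows: B.com, B.E., B.Sc., B.B.A., B.A.; columns: grades A, B, C (made-up counts).
observed = [
    [8, 12, 5],
    [14, 9, 4],
    [6, 10, 7],
    [9, 11, 6],
    [5, 8, 9],
]

statistic, df = chi_square(observed)
print(round(statistic, 3), df)
```

The statistic is compared with the tabulated chi-square value for 8 degrees of freedom; a value above it would lead to rejecting the hypothesis that grade is independent of graduation background.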
Correlation (r)
• Degree of association between two sets of quantitative data, e.g. how is crop
production correlated with rainfall?
• r varies from -1 to +1; r = 0 (no correlation); r = ±1 (perfect correlation)

Bivariate Correlation: Correlation between two variables
• File # MARKS.sav
• To produce the correlation matrix of gender, gpa & final

Partial Correlation: Process of finding the correlation between two variables
after the influence of other variables has been controlled for.
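
Both coefficients can be sketched in a few lines. The first-order partial correlation formula used here controls for a single third variable; the gpa, final and quiz1 values are made up and are not read from MARKS.sav.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation between x and y after controlling for z (first order)."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Made-up values for eight students (illustrative only).
gpa = [2.8, 3.1, 3.6, 2.5, 3.9, 3.3, 2.9, 3.7]
final = [58, 64, 72, 50, 80, 66, 61, 70]
quiz1 = [6, 7, 8, 5, 9, 6, 7, 9]

print(round(pearson_r(gpa, final), 3))
print(round(partial_r(gpa, final, quiz1), 3))
```

Comparing the two numbers shows how much of the gpa-final association remains once the shared influence of quiz1 is removed.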
Regression
• Regression explains variation in one variable (dependent variable) based on the
variation in one or more other variables (independent variables)
• Simple regression: one dependent & one independent variable
• Multiple regression: one dependent & more than one independent variable

• File # REGRESSION.sav
It is desired to study the effect that six different conditions (independent variables)
have on the yield per hectare of a wheat crop. The research was conducted by
accumulating data from fifteen major states in India.
The six independent variables are:
X1= Rainfall (in cms)
X2= Soil type (1, low quality to 5, high quality)
X3= Quantity of fertilizer (in quintal/ sq. km of land)
X4= Land percentage being irrigated by State Agri. Deptt.
X5= Seed quality (1, low quality to 5, high quality)
X6= Percentage of automation in cultivation process
Dependent variable is Y= yield per hectare in quintals
Regression
We need to determine:

1. Is the model a good fit? From the ANOVA table (F-value)

2. What % of variation in the dependent variable is explained by the independent
variables? From Model Summary (Adjusted R square)

3. Which independent variables are good explanatory variables of the dependent
variable? From Coefficients (t-values)

4. Regression Equation
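
The four quantities listed above can be sketched for the simplest case. This is a deliberate simplification: it fits yield on one predictor (rainfall) rather than all six, uses only the standard library, and runs on made-up figures for fifteen states, not on REGRESSION.sav.

```python
from math import sqrt
from statistics import mean

def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x with summary statistics."""
    n, k = len(x), 1  # k = number of independent variables
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_total = sum((b - my) ** 2 for b in y)
    ss_resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ms_resid = ss_resid / (n - k - 1)
    r2 = 1 - ss_resid / ss_total
    return {
        "intercept": intercept,
        "slope": slope,
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - k - 1),
        "f": ((ss_total - ss_resid) / k) / ms_resid,  # model fit (ANOVA table)
        "t_slope": slope / sqrt(ms_resid / sxx),      # coefficient t-value
    }

# Made-up rainfall (cms) and yield (quintals per hectare) for fifteen states.
rainfall = [60, 75, 82, 90, 55, 100, 68, 110, 95, 70, 85, 105, 62, 78, 98]
wheat_yield = [18, 22, 25, 27, 16, 30, 20, 33, 28, 21, 24, 31, 19, 23, 29]

fit = simple_regression(rainfall, wheat_yield)
print({name: round(value, 3) for name, value in fit.items()})
```

The regression equation is then Y = intercept + slope × X1. With a single predictor the F-value equals the square of the slope's t-value; with six predictors each coefficient has its own t-value, and the fit requires matrix methods such as those SPSS applies.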
