Data Preparation: March 6, 2010
Data Preparation Process
Prepare Preliminary Plan of Data Analysis
Check Questionnaire
Edit
Code
Transcribe
Clean Data
Coding Questions
• Fixed field codes, which mean that the number of records for each
respondent is the same and the same data appear in the same
column(s) for all respondents, are highly desirable.
• If possible, standard codes should be used for missing data. Coding of
structured questions is relatively simple, since the response options
are predetermined.
• In questions that permit a large number of responses, each possible
response option should be assigned a separate column.
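As a rough illustration of the last two points, a "check all that apply" question can be coded with one 0/1 column per option and a single standard code for missing data. The option list, the missing-data code 9, and the function name are assumptions made for this sketch, not part of the original material.

```python
# Hypothetical response options for a "check all that apply" question.
OPTIONS = ["newspaper", "radio", "television", "internet"]
MISSING = 9  # assumed standard missing-data code


def code_multiple_response(selected):
    """Return one code per option: 1 if chosen, 0 if not.

    If the question was left unanswered (selected is None), every
    column receives the standard missing-data code instead.
    """
    if selected is None:
        return [MISSING] * len(OPTIONS)
    return [1 if option in selected else 0 for option in OPTIONS]


row = code_multiple_response({"radio", "internet"})
blank = code_multiple_response(None)
```

Because every respondent yields the same number of columns in the same order, the result is also consistent with fixed field coding.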
Coding
Guidelines for coding unstructured questions:
• Category codes should be mutually exclusive and collectively
exhaustive.
• Only a few (10% or less) of the responses should fall into the
“other” category.
• Category codes should be assigned for critical issues even if
no one has mentioned them.
• Data should be coded to retain as much detail as possible.
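The 10% guideline for the “other” category can be checked mechanically once responses have been coded. The sketch below assumes a hypothetical code of 99 for “other” and uses made-up category codes purely for illustration.

```python
from collections import Counter

OTHER = 99  # hypothetical code for the "other" category


def other_share(codes):
    """Fraction of coded responses that fell into the "other" category."""
    counts = Counter(codes)
    return counts[OTHER] / len(codes)


# Made-up category codes for 20 open-ended responses.
coded = [1, 2, 2, 3, 1, 99, 2, 3, 1, 2,
         1, 3, 2, 1, 3, 2, 1, 2, 3, 99]

share = other_share(coded)
# More than 10% in "other" suggests the category scheme needs more codes.
needs_new_categories = share > 0.10
```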
Codebook
A codebook contains coding instructions and the necessary
information about variables in the data set. A codebook
generally contains the following information:
• column number
• record number
• variable number
• variable name
• question number
• instructions for coding
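One way to picture a codebook is as a list of entries, each holding the items listed above. This is a minimal sketch: the class name, the field names, and the three example entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodebookEntry:
    first_column: int      # first column of the field (1-based)
    last_column: int       # last column of the field (inclusive)
    record_number: int
    variable_number: int
    variable_name: str
    question_number: str   # empty when the variable is not a question
    instructions: str


# Hypothetical entries, for illustration only.
CODEBOOK = [
    CodebookEntry(1, 3, 1, 1, "respondent_id", "", "Enter the ID as printed, with leading zeros"),
    CodebookEntry(4, 4, 1, 2, "record_number", "", "Enter 1 for the first record"),
    CodebookEntry(26, 26, 1, 3, "familiarity", "Q1", "Enter the circled number; 9 if missing"),
]


def lookup(variable_name):
    """Return the codebook entry for a variable, or raise KeyError."""
    for entry in CODEBOOK:
        if entry.variable_name == variable_name:
            return entry
    raise KeyError(variable_name)
```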
Coding Questionnaires
• The respondent code and the record number appear on each
record in the data.
• The first record contains the additional codes: project code,
interviewer code, date and time codes, and validation code.
• It is a good practice to insert blanks between parts.
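With fixed field codes, a record can be split into its parts by column position alone. The layout below (respondent code in columns 1-3, record number in column 4, project code in 5-6, interviewer code in 7-8) is an assumption made for this sketch, as is the sample record.

```python
# Assumed layout: (field name, first column, last column), 1-based inclusive.
LAYOUT = [
    ("respondent", 1, 3),
    ("record", 4, 4),
    ("project", 5, 6),
    ("interviewer", 7, 8),
]


def parse_record(line):
    """Slice one fixed-field record into a dict of field name -> raw code."""
    return {name: line[first - 1:last] for name, first, last in LAYOUT}


# Hypothetical first record for respondent 001.
fields = parse_record("00110131")
```

The codes are kept as strings so that leading zeros and blanks survive until the cleaning step.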
An Illustrative Computer File
Fields (column numbers) in each record: 1-3, 4, 5-6, 7-8, ..., 26, ..., 35, 77
Transcribed Data
Transcribed data are stored in computer memory, on disks, or on magnetic tapes.
Data Cleaning
Consistency Checks
Years of Education      Sample %    Population %    Weight
High School
  1 to 3 years             6.39         8.65         1.35
  4 years                 25.39        29.24         1.15
College
  1 to 3 years            22.33        29.42         1.32
  4 years                 15.02        12.01         0.80
  5 to 6 years            14.94         7.36         0.49
  7 years or more         12.18         6.90         0.57
Totals                   100.00       100.00
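Reading the three numeric columns as sample percentage, population percentage, and weight (an interpretation, since the extracted table carries no headers), each weight is the population percentage divided by the sample percentage. A short sketch that reproduces the last column from the first two:

```python
# (category, sample %, population %) as listed in the table.
rows = [
    ("High school, 1 to 3 years", 6.39, 8.65),
    ("High school, 4 years", 25.39, 29.24),
    ("College, 1 to 3 years", 22.33, 29.42),
    ("College, 4 years", 15.02, 12.01),
    ("College, 5 to 6 years", 14.94, 7.36),
    ("College, 7 years or more", 12.18, 6.90),
]

# Weight = population share / sample share, rounded to two decimals.
weights = [round(population / sample, 2) for _, sample, population in rows]
```

A weight above 1 marks a group that is underrepresented in the sample; a weight below 1 marks an overrepresented group.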
Statistically Adjusting the Data
Variable Respecification
Product Usage     Original Variable    Dummy Variable Code
Category          Code                 X1    X2    X3
Nonusers          1                    1     0     0
Light users       2                    0     1     0
Medium users      3                    0     0     1
Heavy users       4                    0     0     0
Note that X1 = 1 for nonusers and 0 for all others. Likewise, X2 = 1 for
light users and 0 for all others, and X3 = 1 for medium users and 0 for all
others. In analyzing the data, X1, X2, and X3 are used to represent all
user/nonuser groups.
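The respecification can be sketched as a small function that maps the original usage code (1 to 4) to the three dummy variables. The function name is hypothetical; the mapping follows the table, with heavy users as the category coded 0 on all three.

```python
def dummy_code(usage_code):
    """Map an original usage code (1-4) to the tuple (X1, X2, X3).

    Nonusers (1), light users (2), and medium users (3) each get their
    own indicator; heavy users (4) are coded 0 on all three.
    """
    if usage_code not in (1, 2, 3, 4):
        raise ValueError(f"unexpected usage code: {usage_code}")
    return tuple(1 if usage_code == k else 0 for k in (1, 2, 3))


coded = [dummy_code(code) for code in (1, 2, 3, 4)]
```

Four categories need only three dummy variables, because the fourth is identified by all three being 0.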
Statistically Adjusting the Data
Scale Transformation and Standardization
$Z_i = (X_i - \bar{X}) / s_x$
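Standardization subtracts the mean from each value and divides by the standard deviation. The sketch below assumes that s_x is the sample standard deviation (n - 1 denominator); the scores are made up.

```python
from statistics import mean, stdev


def standardize(values):
    """Return the standardized scores Z_i = (X_i - mean) / s_x."""
    x_bar = mean(values)
    s_x = stdev(values)  # sample standard deviation (assumed n - 1 denominator)
    return [(x - x_bar) / s_x for x in values]


# Made-up scale scores.
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

By construction, the standardized scores have a mean of 0 and a standard deviation of 1, which makes variables measured on different scales comparable.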