Chapter 2-4
By Gebisa Diriba
Chapter Two
What is a statistical package?
Software package
It may refer to one or more of the following:
• A piece of application software or utility software.
– Application software: VLC media player, Microsoft Word, Google
Chrome, accounting applications, photo editors, etc.
– Utility software: AVAST antivirus, disk tools, backup software, etc.
• A software suite or collection of related application or utility
software.
• A package or module, a software component for accomplishing
a particular thing
• Package (package management system), a file used by a
package management system to install an application or library.
Application software
Features of packages
(Stata, SPSS and EViews)
• Stata is a general-purpose and comprehensive statistical
software package created in 1985 by StataCorp.
• It is used by many businesses and academic institutions around
the world. Most of its users work in research, especially in the
fields of economics, sociology, political science, biomedicine and
epidemiology.
• Stata's capabilities include data management, statistical analysis,
graphics, simulations, and custom programming.
• The name Stata is a portmanteau of the words statistics and data.
EViews:
• EViews (Econometric Views) is a statistical package
for Windows, used mainly for time-series oriented
econometric analysis.
• It was originally developed by Quantitative Micro Software
(QMS).
1.2. Supports in Software Packages
Data
About
STATA is a modern, general-purpose, command-driven package for statistical
analysis, data management, and graphics.
STATA provides commands to analyze panel data (cross-sectional time-
series, longitudinal, repeated-measures, and correlated data), cross-
About Stata ….
It is used by economists, social scientists, medical
researchers, biostatisticians, epidemiologists,
sociologists, political scientists, geographers,
psychologists, and other research professionals who need
to analyse data.
Stata makes it easy to download programs developed by
other users and to write your own Stata programs, which
then work as if they were built-in parts of Stata.
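As a minimal sketch of both ideas, the lines below install a user-written package and then define a small program. The package name estout is only an example of a community-contributed package on the SSC archive, the program name hello is hypothetical, and the install step assumes internet access.

```stata
* Install a user-written package from the SSC archive (needs internet access)
* estout is used here purely as an example package
ssc install estout, replace

* Drop any earlier definition of the (hypothetical) program hello
capture program drop hello

* Define a tiny program of your own
program define hello
    display "Hello from my own Stata program"
end

* Once defined, it is called like any other Stata command
hello
```

A program defined this way lasts only for the current session; saving it in an ado-file on Stata's search path is the usual way to make it permanent.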
The Basic Features of Stata
2.1. The Stata windows
You can open Stata in the same way as most other
software packages, by clicking on its icon or menu item.
When you open Stata, you should see the main Stata screen.
• The Stata windows give you all the key information about the data
file you are using, recent commands, and the results of those
commands.
• Some of them open automatically when you start Stata, while others
can be opened using the Windows pull-down menu or the buttons on
the tool bar.
Stata windows…
When you open Stata, you will see a menu bar across
the top, a tool bar with buttons, and 3-5 windows
(the number of windows open depends on which
windows were open the last time Stata was used).
COMMAND WINDOW
REVIEW WINDOW
The working directory….
The working directory is the place where graphs and Stata datasets
will be saved unless you specify another directory.
By default, Stata looks for files in the current working directory
(c:\data on older Windows installations); type pwd to display it.
To change this working directory, type:
cd foldername
If the folder name contains blanks, it must be enclosed in
quotation marks.
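A short sketch of these commands follows. Both folder paths are hypothetical and must exist on your machine for cd to succeed.

```stata
* Show the current working directory
pwd

* Change to a folder whose name has no blanks (hypothetical path)
cd c:\data

* A folder name containing blanks must be quoted (hypothetical path)
cd "c:\my stata files"
```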
STATA MENUS
DO-FILES ARE SCRIPTS OF
COMMANDS
• Stata do-files are text files where users can store and run
their commands for reuse, rather than retyping the
commands into the Command window
• Reproducibility
• Easier debugging and changing commands
• We recommend always using a do-file when using Stata
• The file extension .do is used for do-files
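For illustration, a small do-file might look like the sketch below. The file name analysis.do is hypothetical, and the example uses auto.dta, a dataset shipped with Stata, rather than any dataset from this course.

```stata
* analysis.do: a hypothetical do-file
* Lines that start with * are comments and are not executed

* Start from an empty session and load an example dataset shipped with Stata
clear all
sysuse auto

* Basic summaries
summarize price mpg

* In a do-file, a long command can be continued on the next line with ///
regress price mpg weight ///
    foreign
```

Saved under that name in the working directory, the file can be run in one step by typing do analysis in the Command window.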
OPENING THE DO-FILE
EDITOR
RUNNING COMMANDS
FROM THE DO-FILE
COMMENTS
LONG LINES IN DO-FILES
Ways of running Stata program
In Stata version 14.0, we can use the following keyboard shortcuts to open the
different Stata windows:
Ctrl+1 = Command
Ctrl+2 = Results
Ctrl+3 = Review
Ctrl+4 = Variables
Ctrl+5 = Properties
Ctrl+6 = Top graph
Ctrl+7 = New viewer/help
Ctrl+8 = Data editor
Ctrl+9 = New Do-file Editor
Stata toolbar
Stata’s interface
Stata commands can be typed by hand (single
commands in the Stata command window or
multiple commands or programs in the Do-file
Editor) but most Stata commands can also be
accessed in a point-and-click manner by pulling
down Stata’s menus and selecting items that invoke
dialog boxes.
Stata’s Data, Graphics, and Statistics menus
provide point-and-click access to almost all Stata
commands.
OK, cancel, submit, ? and R
Setting memory size….
The help system….
search logit
The findit command can be used to search the Stata site and
other sites for Stata related information, including ado files.
Say that we are interested in panel data, so we search for this
program from within Stata by typing
findit panel data
The Stata viewer window appears and we are shown a
number of resources related to this key word.
Stata Language Syntax…
Getting Data into Stata …
There are various ways to enter data into Stata; the choice depends on
the nature of the input data:
• Manual entry by typing
• Copying and pasting data into the Data Editor
• Inputting ASCII files using infile, insheet or infix
• Use of other software to directly create a new Stata dataset from
another format (e.g. SAS, SPSS)
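The sketch below shows how text, Excel, and SPSS files might be read with Stata's import commands. All file names are hypothetical. In Stata 13 and later, import delimited supersedes insheet, and import spss is available from Stata 16.

```stata
* Comma-separated text file (hypothetical file name)
import delimited using "mydata.csv", clear

* Excel workbook, taking variable names from the first row (hypothetical file name)
import excel using "mydata.xlsx", firstrow clear

* SPSS file; requires Stata 16 or later (hypothetical file name)
import spss using "mydata.sav", clear
```

Each command replaces the data in memory, which is why the clear option is given.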
1. Using the Data Editor
• The Data Editor can be used to:
– enter new data (choose a cell, type the value, and press Enter or Tab),
– view existing data, or
– edit existing data.
• To open the Data Editor, click the Data Editor icon on the Tool Bar.
• If you have a dataset in memory, it is displayed in the editor.
• Otherwise you get a blank editor screen
• To open the editor in edit mode:
– click on the Data Editor button, or
– type the command edit and press Enter in the Command window
• The editor is like a spreadsheet.
Creating, renaming and
labeling variables
Example 1
Prac1: enter the Prac1 data into Stata manually by typing it into the Data Editor.
Prac2: create the Prac2 data in Excel, then copy and paste it into a new Data
Editor.
• By default, Stata names the columns var1, var2, ..., var5 as you
enter data
• We would like to rename the variables so that they match the
column titles from our dataset, give the variables descriptions
(variable labels), and change their formatting.
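A minimal sketch of renaming and labeling follows. The two-column dataset, the new names id and age, and the label texts are all made up for illustration and are not taken from the Prac1 or Prac2 data.

```stata
* Build a tiny dataset with Stata's default-style names (made-up values)
clear
input var1 var2
1 20
2 35
3 41
end

* Give the variables meaningful names
rename var1 id
rename var2 age

* Attach descriptive variable labels
label variable id "Respondent ID"
label variable age "Age in years"

* Change the display format of age to a width of 3 with no decimals
format age %3.0f

* Check the result
describe
```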
IMPORTING DATA
LOADING AND SAVING .dta FILES
• Use the command save to save data in Stata's .dta format
• The replace option will overwrite an
existing file with the same name (without
replace, Stata won't save if the file
exists)
• The extension .dta can be omitted when
using use and save
* save data, replace if it exists
save hs0, replace
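A save-and-reload round trip can be sketched as follows. The file name mycopy is hypothetical, and auto.dta (shipped with Stata) stands in for your own data.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* Write mycopy.dta to the working directory, overwriting any earlier copy
save mycopy, replace

* Remove the data from memory
clear

* Reload the saved file; the .dta extension is implied
use mycopy

* Confirm what is now in memory
describe, short
```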
CLEARING MEMORY
PREPARING DATA FOR IMPORT
• To get data into Stata cleanly, make sure the data in your Excel file or .csv file
have the following properties
• Rectangular
• Each column (variable) should have the same number of rows
(observations)
• No graphs
• Missing data should be left as blank fields
• Variable names should contain only letters, digits, or _ (Stata variable names cannot contain periods)
• Make as many variables numeric as possible
• Many Stata commands will only accept numeric variables
CHAPTER FOUR
EXPLORING DATA IN THE PACKAGE
SEMINAR DATASET
• Demographics – gender(1=male,
2=female), race, ses(low, middle,
high), etc
• Academic test scores
• read, write, math, science
• Go ahead and load the dataset!
BROWSING THE DATASET
• The list command prints observations to the Stata console
• Simply issuing "list" will list all observations and variables
• Not usually recommended except for small datasets
• Specify variable names to list only those variables
• We will soon see how to restrict to certain observations

* list read and write for first 5 observations
list read write in 1/5

     +--------------+
     | read   write |
     |--------------|
  1. |   57      52 |
  2. |   68      59 |
  3. |   44      33 |
  4. |   63      44 |
  5. |   47      52 |
     +--------------+
SELECTING OBSERVATIONS
if select by condition
SELECTING BY OBSERVATION NUMBER
WITH in
• Many commands are run on a subset of the data set observations
• in selects by observation (row) number
• Syntax
• in firstobs/lastobs
• 30/100 – observations 30 through 100
• Negative numbers count from the end
• "L" means last observation
• -10/L – tenth observation from the last through last observation

* list science for last 3 observations
li science in -3/L

     +---------+
     | science |
     |---------|
198. |      55 |
199. |      58 |
200. |      53 |
     +---------+
SELECTING BY CONDITION WITH if
• if selects observations that meet a certain condition
• gender == 1 (male)
• math > 50
• if clause usually placed after the command specification, but before
the comma that precedes the list of options

* list gender, ses, and math if math > 70
* with clean output
li gender ses math if math > 70, clean

       gender      ses   math
 13.        1     high     71
 22.        1   middle     75
 37.        1   middle     75
 55.        1   middle     73
 73.        1   middle     71
 83.        1   middle     71
 97.        2   middle     72
 98.        2     high     71
132.        2      low     72
164.        2      low     72
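The if and in qualifiers can also be combined in a single command. The sketch below assumes the seminar dataset, with the variables gender, read and math, is already in memory; the cut-off values are arbitrary.

```stata
* Assumes the seminar dataset (gender, read, math) is already in memory

* Among the first 50 rows, list read and math for females scoring above 60 in math
list read math if gender == 2 & math > 60 in 1/50, clean

* count reports how many observations meet a condition without listing them
count if gender == 2 & math > 60
```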
STATA LOGICAL AND RELATIONAL
OPERATORS
EXPLORE YOUR DATA BEFORE
ANALYSIS
• Take the time to explore your data set before embarking on analysis
• Get to know your sample with quick summaries of variables
• Demographics of subjects
• Distributions of key variables
• Look for possible errors in variables
USE codebook TO INSPECT VARIABLE
VALUES
For more detailed information about the values of each variable, use codebook,
which provides the following:
• For all variables
• number of unique and missing values
• For numeric variables
• range, quantiles, means and standard deviation for continuous variables
• frequencies for discrete variables
• For string variables
• frequencies
• warnings about leading and trailing blanks

* inspect values of variables read and prgtype
codebook read prgtype

------------------------------------------------
read                               reading score
------------------------------------------------
type: numeric (float)
range: [28,76]              units: 1
unique values: 30           missing .: 0/200
mean: 52.23
std. dev: 10.2529
percentiles:  10%  25%  50%  75%  90%
               39   44   50   60   67
------------------------------------------------
prgtype                              (unlabeled)
------------------------------------------------
type: string (str8)
unique values: 3            missing "": 0/200
tabulation:  Freq.  Value
               105  "academic"
                45  "general"
                50  "vocati"
SUMMARIZING CONTINUOUS
VARIABLES
• The summarize command reports the number of observations, mean,
standard deviation, minimum, and maximum
* summarize read and math for females
summarize read math if gender == 2
DETAILED SUMMARIES
such as:
• percentiles (including the median)

      Percentiles   Smallest
 1%        34          28
 5%        36          34
10%        39          34       Obs           109
25%        44          35       Sum of Wgt.   109
TABULATING FREQUENCIES OF
CATEGORICAL VARIABLES
DATA VISUALIZATION
scatter: scatter plot
HISTOGRAMS
histogram OPTIONS *
BOXPLOTS *
SCATTER PLOTS
BAR GRAPHS TO VISUALIZE
FREQUENCIES
TWO-WAY BAR GRAPHS
TWO-WAY, LAYERED GRAPHICS
LAYERED GRAPH EXAMPLE 1
LAYERED GRAPH EXAMPLE 2
• You can also overlay separate plots by group to the same graph with different
colors
* layered scatter plots of write and read
* colored by gender
twoway (scatter write read if gender == 1, mcolor(blue)) ///
       (scatter write read if gender == 2, mcolor(red))
DATA MANAGEMENT
• To describe the variables in the data set, type:
describe or des
– To describe specific variables, add
the names of the variables to the command.
• Eg: des mpg
– You can summarize specific variables:
• sum varlist, detail
Example:
• sum mpg, detail
• sum mpg
• su mpg
• If you are only interested in a subset of your data, you can inspect it using filters. E.g. if you
are only interested in cars within a particular price or mileage range, you can type:
sum if price>=3000 & price<=4400
sum if mpg>=16 & mpg<=23
• You can then contrast these with:
sum if price>=3000 | price<=4400
sum if mpg>=16 | mpg<=23
• Interpretation of relational and logical operators in STATA:
>= greater than or equal to
<= less than or equal to
== equal to
& and
| or
!= or ~= not equal to
> greater than
< less than
. missing
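The difference between & and | can be checked by counting observations, as in this sketch using auto.dta, a dataset shipped with Stata. The assert line stops with an error if its claim is false.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* & keeps observations that satisfy BOTH conditions
count if price >= 3000 & price <= 4400

* | keeps observations that satisfy AT LEAST ONE condition
* Every price is either >= 3000 or <= 4400, so nothing is filtered out
count if price >= 3000 | price <= 4400

* r(N) holds the last count; _N is the total number of observations
assert r(N) == _N
```

In other words, the | version of the filter summarizes the whole dataset, which is rarely what is intended when a range is wanted.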
• The usual arithmetic operators (+,-,*,/) are
applicable in STATA.
GENERATING NEW VARIABLES
• You can create a new variable by combining
existing variables or by performing some
arithmetic operations. [gen, egen, recode]
• To create a ratio of two variables:
– gen mpgratio=mpg/weight
– sum mpgratio
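While gen works row by row, egen computes functions across observations. A minimal sketch using auto.dta (shipped with Stata) follows; the variable names meanmpg and mpgdev are chosen only for illustration.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* egen stores the sample mean of mpg in every row
egen meanmpg = mean(mpg)

* gen then uses it to compute each car's deviation from the mean
gen mpgdev = mpg - meanmpg

summarize mpgdev
```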
Creating, transforming, and labeling variables
Transforming variables…
The same procedure can be applied to obtain traditional
transformations such as:
Square gen mpg2=mpg^2
Cubic gen mpg3=mpg^3
Square root gen mpgsqrt=sqrt(mpg)
Exponential gen expmpg=exp(mpg)
Natural log gen lnmpg=ln(mpg)
gen logmpg=log(mpg)
Base-10 log gen l10mpg=log10(mpg)
• Eg: gen lprice=log(price+1)
– Why +1? The log of zero is undefined (Stata returns a missing value), so
adding 1 keeps observations whose value is zero. Note that in Stata log() is
the natural log, the same as ln().
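The relationships between the three log functions can be verified directly. This sketch uses auto.dta (shipped with Stata) and stores the results as doubles so that the comparison is not affected by float rounding; each assert stops with an error if its claim is false.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

gen double lnmpg  = ln(mpg)
gen double logmpg = log(mpg)
gen double l10mpg = log10(mpg)

* ln() and log() are the same function in Stata
assert lnmpg == logmpg

* log10(x) equals ln(x)/ln(10), up to rounding error
assert reldif(l10mpg, lnmpg/ln(10)) < 1e-12

* The log of zero is returned as a missing value
display ln(0)
```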
Generating variables…
• Eg: from the auto.dta data set we are using, we may be interested in finding out how
many cars were repaired more than two times in 1978. Thus we create a new
variable repair = 1 if the vehicle was repaired more than twice, or 0 otherwise.
• Use the commands below. Stata treats missing values as larger than any number,
so observations with missing rep78 must be excluded explicitly:
gen repair=1 if rep78>2 & !missing(rep78)
replace repair=0 if rep78<=2
• You can also create categorical variables from a continuous variable.
tab mpg
gen mpgcat=1 if mpg<15
replace mpgcat=2 if mpg>=15 & mpg<26
replace mpgcat=3 if mpg>=26 & mpg<=35
replace mpgcat=4 if mpg>35 & !missing(mpg)
tab mpgcat
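The same kind of variables can be built more compactly. The sketch below is one alternative, not the only way; the names repair2 and mpgcat2 are chosen to avoid clashing with variables created earlier, and the recode ranges rely on mpg being stored as whole numbers in auto.dta.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* A logical expression evaluates to 1 if true and 0 if false
* The if qualifier leaves repair2 missing where rep78 is missing
gen byte repair2 = rep78 > 2 if !missing(rep78)
tab repair2, missing

* Categories from ranges, stored in a new variable
recode mpg (min/14 = 1) (15/25 = 2) (26/35 = 3) (36/max = 4), gen(mpgcat2)
tab mpgcat2
```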
Missing values in stata
• Missing numeric values in Stata are represented by .
• Missing string values in Stata are represented by ""
(empty quotes)
• You can check for missing by testing for equality to . (or
"" for string variables)
• You can also use the missing() function
• When using estimation commands, generally,
observations with missing on any variable used in the
command will be dropped from the analysis
• Syntax: list varlist if var == .
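A short sketch of these checks, using auto.dta (shipped with Stata), in which rep78 has some missing values:

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* Number of observations with rep78 missing
count if missing(rep78)

* Which cars they are
list make rep78 if missing(rep78)

* Overview of variables that have missing values
misstable summarize
```

The missing() function works for both numeric and string variables, so the same pattern applies when the variable is a string.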
Missing values in stata …
• Example: import 'Practical 2 Data Entry' from the Excel dataset
and save it
• Use the following command to check for missing values
of the variable gnppc:
• li country gnppc if gnppc == .
      country                   gnppc
  7.  Bosnia and Herzegovina        .
Hint for Practical exercise 3
1. Use codebook or list commands to inspect the missing values
codebook rep78
missing .: 5/74
2. Use the list command to identify the observations for which the values are
missing
li rep78 if rep78 == .
      rep78
  3.      .
  7.      .
 45.      .
 51.      .
 64.      .
The car repair values are missing for the 3rd, 7th, 45th, 51st, and 64th
observations.
3. Use the replace command to replace the missing values
replace rep78 = 0 if rep78 == .
(5 real changes made)
codebook rep78
missing .: 0/74
Renaming and recoding variables
rename
changes the name of a variable
Syntax: rename old_name new_name
recode
• This command changes the values of a categorical variable
according to the rules specified. The syntax is:
recode varname oldvalue=newvalue oldvalue=newvalue … [if
exp] [in range]
recode foreign 0=1 1=2
recode rep78 .=9 *=7
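Because recode overwrites the original values, it is often safer to write the result to a new variable. The sketch below uses the parenthesized rule syntax with the generate() option on auto.dta (shipped with Stata); the new variable name origin is hypothetical.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* Recode into a new variable, leaving foreign unchanged
recode foreign (0 = 1) (1 = 2), generate(origin)

* Cross-tabulate old against new values to check the rules
tab foreign origin
```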
Recoding values of a categorical variable ….
LABELING VALUES
begins at 1
mstatus |  Freq.   Percent     Cum.
--------+---------------------------
      1 |      5     50.00    50.00
      2 |      5     50.00   100.00
--------+---------------------------
  Total |     10    100.00
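Value labels attach descriptive text to numeric codes so that tables such as the one above show words instead of numbers. In the sketch below the data values, the label name mstatus_lbl, and the meanings "Single" and "Married" are all hypothetical; the actual coding of mstatus is not given here.

```stata
* Build a small example variable (made-up values)
clear
input mstatus
1
1
2
2
end

* Define a set of value labels (hypothetical meanings for codes 1 and 2)
label define mstatus_lbl 1 "Single" 2 "Married"

* Attach the labels to the variable
label values mstatus mstatus_lbl

* The table now shows the labels instead of the numeric codes
tab mstatus
```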
DATASET OPERATIONS
KEEPING AND DROPPING VARIABLES
To be clear, keep if and drop if select observations, while keep and drop
select variables
* keep observation if reading > 40
keep if read > 40
summ read
Variable |  Obs       Mean   Std. Dev.   Min   Max
---------+----------------------------------------
    read |  178   54.23596    8.96323     41    76
* now drop if math outside range [30,70]
drop if math < 30 | math > 70
summ math
+---------------------+