software material
Introduction to Stata
What is Stata?
It is a multi-purpose statistical package used to explore, summarize and analyze datasets.
It can handle and manipulate large datasets (e.g. millions of
observations), and it has ever-growing capabilities for panel and time-series
regression analysis.
Stata Interface
I. Stata Windows
Command window: used to submit commands to Stata. It supports basic text editing, copying
Results window: contains all the commands and their textual results
Review window: shows the history of commands that have been entered. It displays
successful commands in black and unsuccessful commands, along with their error codes,
in red.
Variables window: shows the list of variables in the dataset, along with selected
II. The Stata Tool Bars
Contain buttons that provide quick access to Stata’s more commonly used features
Stata’s Data, Graphics, and Statistics menus provide point-and-click access to almost
The dialogs for many commands have by/if/in and Weights tabs.
These provide access to Stata’s commands and qualifiers for controlling the estimation
Getting Started
If you are using Stata version 11 or earlier, and you want to read in a big dataset, then
before reading in your data, you must tell Stata to make available enough computer
If you get a message while using Stata 11 or earlier that there is not enough memory,
for example, “no room to add more observations…”, then you need to manually set the
memory higher.
First remove any data from memory by typing, for example,
clear or drop _all
Then, to set the memory to a large enough amount, type
set mem 700m or something higher
1. click on “file” on the menu bar. In the file drop down menu, click on “import” and then
2. Open the Data Editor by typing “edit” or by clicking its button on the menu bar. Then copy the data in Excel,
right-click any cell in the Data Editor, and choose Paste.
To save a dataset that has been already in use (overwriting the original data file),
Log file
A log file is simply a record of your Results window. It records all commands and all
Because it writes the file to disk while it writes the Results window, it also protects you
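As a minimal sketch of how a log file is typically opened and closed: the file name session1.log is a hypothetical choice, and the auto dataset is the example dataset shipped with Stata.

```stata
* Close any log that is already open (capture ignores the error if none is)
capture log close

* Start a plain-text log; replace overwrites an existing file of the same name
log using "session1.log", text replace

sysuse auto, clear      // example dataset shipped with Stata
summarize price mpg     // this command and its output are recorded in the log

* Stop logging and close the file
log close
```

The text option writes a plain-text log that any editor can open; without it, Stata writes its own formatted log type (.smcl).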
Do-file
A do-file is a file containing a list of commands for Stata to run (called a batch file or a
script in other settings). It gets its name from the do command, which runs it.
The Do-file Editor has advanced features that can help in writing such files; it can also be
used to build up a series of commands that can then be submitted to Stata all at once.
The Do-file Editor can be opened either by clicking the Do-file Editor toolbar button or by typing
doedit in the Command window.
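A short do-file might look like the sketch below. The file name analysis.do is hypothetical, and the auto dataset shipped with Stata stands in for your own data.

```stata
* analysis.do (hypothetical file name)
clear all                       // start from an empty session
sysuse auto, clear              // load the example dataset
describe                        // overview of the variables
summarize price mpg weight      // summary statistics
regress price mpg weight        // a simple regression
```

Once saved, the whole file can be run by typing do analysis.do in the Command window, or by clicking the run button in the Do-file Editor.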
2. DATA MANAGEMENT
Stata will not allow empty columns or rows in the middle of your data set.
Naming variables
Variable names can have up to 32 characters,
but many commands print only 12, and shorter names are easier to type.
Stata names are case sensitive: Age and age are different variables!
It helps to use short lowercase names and single words or abbreviations rather than
multi-word names.
Renaming variables
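Variables are renamed with the rename command, which takes the old name followed by the new name. A minimal sketch, assuming the auto dataset shipped with Stata; the new name repair is only an illustration.

```stata
sysuse auto, clear
rename rep78 repair     // rename oldname newname
describe repair         // the variable is now listed under its new name
```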
Labeling variables
Variables can be labeled using the following Stata syntax:
label variable var1 "description"
where var1 is the variable to be labeled, and description is the label of var1.
The levels of a categorical variable can be labeled using the following two Stata
commands together:
label define var1 1 "name of the first category" 2 "name of the second category"
label values var1 var1
where var1 is the name of the categorical variable, and 1 and 2 are the levels of the
categorical variable. The first command defines a value label, here also named var1, and the
second attaches that value label to the variable var1.
Example: A variable called gender has two categories – 1 for male and 2 for female.
The categories of gender can be labeled as follows:
label define gender 1 male 2 female
label values gender gender
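The labeling steps can be tried on a small made-up dataset, as in the sketch below. The three observations and the label name genderlbl are hypothetical; giving the value label a different name from the variable is optional, but it makes the two roles easier to tell apart.

```stata
clear
input id gender
1 1
2 2
3 1
end

label variable gender "Gender of respondent"   // label for the variable itself
label define genderlbl 1 "male" 2 "female"     // define the value label
label values gender genderlbl                  // attach it to the variable
tabulate gender                                // categories now show as male/female
```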
The most common command for creating new variables is generate.
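Example: Generate a variable called income which is the sum of farm income
(fincome) and nonfarm income (nfincome):
generate income = fincome + nfincome
To generate the square root of X from X: gen newvar = sqrt(X), where newvar is the name of the new variable.
Note that generate uses a single equals sign (=); the double equals sign (==) is used only to test equality.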
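The sketch below applies generate to a small made-up dataset; the three rows of values are hypothetical, and sqrt_income is an illustrative name.

```stata
clear
input fincome nfincome
1200 300
800 0
0 450
end

generate income = fincome + nfincome    // sum of the two income sources
generate sqrt_income = sqrt(income)     // square root of the new variable
list
```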
Keeping and dropping variables
Your data set may contain variables you are not interested in or you don’t want to
analyze.
It’s a good idea to get rid of these first – that way, they won’t use up valuable
You can tell Stata to either keep what you want or drop what you don’t want – the end
results will be the same.
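The syntax for variables is
keep varlist (the variables to remain)
drop varlist (the variables to remove)
The same commands can keep or drop observations that meet a condition:
keep if var >= 0
drop if var < 0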
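A minimal sketch of both uses, assuming the auto dataset shipped with Stata; the cut-off values are arbitrary.

```stata
sysuse auto, clear
keep make price mpg foreign     // keep only these four variables
drop foreign                    // drop one more variable
keep if price >= 5000           // keep observations meeting the condition
drop if mpg < 15                // drop observations meeting the condition
describe, short                 // confirm what is left
```

Both commands change only the data in memory; the file on disk stays as it was until you save.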
Check that all the variables and observations are present and in the correct format.
The browse and edit commands open a pop-up window in which you can examine the
raw data; browse is read-only, while edit also lets you change values.
The list command prints the data in the Results window. If the dataset is large, you can use some options to make the output of list more
tractable.
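For instance, list can be restricted to a few variables, a range of observations, or a condition. A sketch assuming the auto dataset shipped with Stata:

```stata
sysuse auto, clear
list make price mpg in 1/5                  // first five observations only
list make price if price > 10000, noobs     // expensive cars, without observation numbers
```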
Assert
With large datasets, it is often impossible to check every single observation using
list or browse. The assert command instead checks whether a statement about the data is
true or false.
For example, you might want to check whether all values in the math variable are
nonnegative as they should be:
Syntax: assert math >= 0 or, equivalently, assert !(math < 0)
If the statement is true, assert does not yield any output on the screen.
If it is false, assert gives an error message and the number of contradictions.
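The behaviour in both cases can be seen in the sketch below, which assumes the auto dataset shipped with Stata. Prefixing assert with capture is an optional addition that lets a do-file continue and inspect the return code.

```stata
sysuse auto, clear
assert price >= 0               // true for every car: no output
capture assert mpg > 20         // false for some cars; capture suppresses the error
display "return code: " _rc     // 9 is Stata's code for a false assertion
```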
Describe
The describe command produces a summary of the dataset in memory or of the data stored in a
Stata-format dataset.
Syntax: describe
describe varlist, memory_options
Describe data in a file:
describe varlist using "location and name of the file", file_options
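A sketch of the two forms, assuming the auto dataset shipped with Stata; the file name auto_copy.dta is hypothetical and is written to the current working directory.

```stata
sysuse auto, clear
describe                            // all variables in memory
describe price mpg                  // selected variables only
describe, short                     // dataset-level information only

save "auto_copy.dta", replace       // write a copy to disk
describe using "auto_copy.dta"      // describe the file without loading it
```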
Summarize
This provides summary statistics, such as means, standard deviations, and so on.
Syntax: summarize or
summarize, detail
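A brief sketch, assuming the auto dataset shipped with Stata. The detail option adds percentiles, skewness and kurtosis, and the results are also stored for later use.

```stata
sysuse auto, clear
summarize price mpg                         // observations, mean, std. dev., min, max
summarize price, detail                     // adds percentiles, skewness, kurtosis
display "median price: " r(p50)             // stored result from the last summarize
```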
Tabulate
The tabulate command is a versatile command that can be used, for example, to produce a
Inspect
The inspect command is a quick way to eyeball the distribution of a variable; its
output includes a mini-histogram.
Syntax: inspect varlist
Correlations
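Correlation measures the association between variables.
The correlate command displays the correlation matrix or covariance matrix for a
group of variables. The pwcorr command displays pairwise correlations and can star the significant ones.
Syntax: correlate varlist
pwcorr varlist, star(#) where # is the significance level, e.g. star(0.05), star(0.01) or star(0.10)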
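The sketch below runs both commands on the auto dataset shipped with Stata; the sig option, which prints the p-value under each coefficient, is an optional addition.

```stata
sysuse auto, clear
correlate price mpg weight                  // correlation matrix
correlate price mpg weight, covariance      // covariance matrix instead
pwcorr price mpg weight, sig star(0.05)     // pairwise, starred at the 5% level
```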
3. Application to Cross-sectional Analysis
Hypothesis Testing
ttest varname == # : Test the hypothesis that the mean of a variable is equal to some
number, which you type in place of the # sign.
ttest varname1 == varname2 : Test the hypothesis that the mean of one variable equals
the mean of another variable.
ttest varname, by(groupvar) : Test the hypothesis that the mean of a single variable is
the same in two groups. The groupvar must take exactly two distinct values, one for each
group. For example, groupvar might be gender, to see if the mean of a variable is the
same for males and females.
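A sketch of the one-sample and two-group forms, assuming the auto dataset shipped with Stata; the hypothesized mean of 20 is arbitrary.

```stata
sysuse auto, clear
ttest mpg == 20             // H0: mean mpg equals 20
ttest mpg, by(foreign)      // H0: mean mpg is equal for domestic and foreign cars
```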
Confidence Intervals
ci varname : Confidence interval for the mean of varname (by default based on the
t distribution). In Stata 14.1 and later, the documented syntax is ci means varname.
ci varname, level(#) : Confidence interval at #%. For example, use 99 for a 99% confidence
interval.
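A sketch using the newer ci means syntax, assuming Stata 14.1 or later and the auto dataset shipped with Stata; in earlier versions the equivalent command is ci mpg.

```stata
sysuse auto, clear
ci means mpg                // 95% confidence interval for mean mpg
ci means mpg, level(99)     // 99% confidence interval
```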
OLS Regression
regress yvar xvarlist: Regress the dependent variable yvar on the independent variables
xvarlist. For example: regress y x or regress y x1 x2 x3.
regress yvar xvarlist, robust : regress but this time compute robust standard errors.
regress yvar xvarlist, robust level(#): Regress with robust standard errors, and this time change
the confidence interval to #% (e.g. use 99 for a 99% confidence interval)
regress yvar xvarlist i.Race: Regress the dependent variable (yvar) on the continuous
independent variables (xvarlist) and the categorical independent variable (Race). For example:
regress y x i.Race, or regress y x1 x2 x3 i.Race.
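The regression variants above can be tried as in the sketch below. It assumes the auto dataset shipped with Stata, with price as the dependent variable and rep78 standing in for a categorical regressor.

```stata
sysuse auto, clear
regress price mpg weight                        // plain OLS
regress price mpg weight, robust                // robust standard errors
regress price mpg weight, robust level(99)      // robust, with 99% confidence intervals
regress price mpg weight i.rep78                // adds a categorical regressor
```

The i. prefix tells Stata to create one indicator per category, omitting a base category.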
Post-Estimation Commands
Commands described here work after OLS regression.
predict yhat: After a regression, create a new variable with the name you enter here (yhat in this case) that
contains, for each observation, the predicted value of the dependent variable.
predict newvar, residuals : After a regression, create a new variable with
the name you enter in place of newvar that contains, for each observation, its residual.
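A sketch of both uses of predict, assuming the auto dataset shipped with Stata; yhat and ehat are illustrative names.

```stata
sysuse auto, clear
regress price mpg weight
predict yhat                // fitted values (the default)
predict ehat, residuals     // residuals: price minus yhat
summarize yhat ehat
```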
Post-Estimation Tests
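1. Heteroskedasticity Test
Syntax: hettest (estat hettest in current versions of Stata)
2. Functional Form (specification error) Test
Syntax: ovtest (estat ovtest in current versions of Stata)
3. Multicollinearity Test
Syntax: vif (estat vif in current versions of Stata)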
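The three tests are run directly after regress, as in the sketch below. It uses the estat forms and assumes the auto dataset shipped with Stata.

```stata
sysuse auto, clear
regress price mpg weight
estat hettest       // Breusch-Pagan / Cook-Weisberg test; H0: constant variance
estat ovtest        // Ramsey RESET test; H0: no omitted variables
estat vif           // variance inflation factors for the regressors
```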
If you are interested in computing the odds ratio, add the or option:
logit y x1 x2 x3 i.Race, or
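A sketch of the difference the or option makes, assuming the auto dataset shipped with Stata, with the binary variable foreign as the outcome.

```stata
sysuse auto, clear
logit foreign mpg weight        // coefficients on the log-odds scale
logit foreign mpg weight, or    // same model, reported as odds ratios
logistic foreign mpg weight     // logistic reports odds ratios by default
```

The or option changes only how the results are displayed; the fitted model is the same.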