Introduction to Stata: UCLA IDRE Statistical Consulting Group
• Stata is an easy to use but powerful data analysis software package that
features strong capabilities for:
• Statistical analysis
• Data management and manipulation
• Data visualization
• Stata offers a wide array of statistical tools that include both standard methods
and newer, advanced methods, as new releases of Stata are distributed annually
STATA: ADVANTAGES
• Order and then download Stata directly from their website, but be sure to use
GradPlan pricing, available to UCLA students
• Order using GradPlan
• Flavors of Stata are IC, SE and MP
• IC ≤ SE ≤ MP, regarding size of dataset allowed, number of processors used, and
cost
• Stata is also installed in various Library computer labs around campus and can
be used through their Virtual Desktop
• See our webpage for more information about using Stata at UCLA
NAVIGATING STATA’S INTERFACE
cd change working directory
COMMAND WINDOW
• Stata do-files are text files where users can store and run their commands for
reuse, rather than retyping the commands into the Command window
• Reproducibility
• Easier debugging and changing commands
• We recommend always using a do-file when using Stata
• The file extension .do is used for do-files
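As a rough illustration of the points above, a minimal do-file might look like the sketch below. The file name example.do is hypothetical; the dataset address is the one used throughout these slides.

```stata
* example.do: a hypothetical do-file (the file name is an assumption)
* load the example data over the internet, clearing memory first
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* describe the variables, then summarize two test scores
describe
summarize read write
```

Assuming the file is saved in the working directory, typing do example in the Command window should run every command in it from top to bottom.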
OPENING THE DO-FILE EDITOR
• The command use loads Stata .dta files
• Usually these will be stored on a hard drive, but .dta files can also be loaded over the internet (using a web address)
• Use the command save to save data in Stata’s .dta format
• The replace option will overwrite an existing file with the same name (without replace, Stata won’t save if the file exists)
• The extension .dta can be omitted when using use and save
* read from hard drive; do not execute
use "C:/path/to/myfile.dta"
* load data over internet
use https://stats.idre.ucla.edu/stat/data/hs0
* save data, replace if it exists
save hs0, replace
CLEARING MEMORY
• Because Stata will only hold one data set in memory at a time, memory must be cleared before new data can be loaded
• The clear command removes the dataset from memory
• Data import commands like use will often have a clear option which clears memory before loading the new dataset
* clear data from memory
clear
* load data but clear memory first
use https://stats.idre.ucla.edu/stat/data/hs0, clear
IMPORTING EXCEL DATA SETS
• Stata can read in data sets stored in many other formats
• The command import excel is used to import Excel data
• An Excel filename is required (with path, if not located in working directory) after the keyword using
• Use the sheet() option to open a particular sheet
• Use the firstrow option if variable names are on the first row of the Excel sheet
* import excel file; change path below before executing
import excel using "C:\path\myfile.xlsx", sheet("mysheet") firstrow clear
IMPORTING .csv DATA SETS
• Comma-separated values files are also commonly used to store data
• Use import delimited to read in .csv files (and files delimited by other characters such as tab or space)
• The syntax and options are very similar to import excel
• But no need for sheet() or firstrow options (first row is assumed to be variable names in .csv files)
* import csv file; change path below before executing
import delimited using "C:\path\myfile.csv", clear
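For files delimited by something other than a comma, a sketch along the following lines may help. The file name myfile.txt is hypothetical, and the delimiters() and varnames() options are spelled out here only to make the assumptions explicit.

```stata
* hypothetical tab-delimited file; change path below before executing
* delimiters("\t") declares tab as the separator
* varnames(1) reads variable names from the first row
import delimited using "C:\path\myfile.txt", delimiters("\t") varnames(1) clear
```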
USING THE MENU TO IMPORT EXCEL AND .CSV DATA
• To get data into Stata cleanly, make sure the data in your Excel file or .csv file have the
following properties
• Rectangular
• Each column (variable) should have the same number of rows (observations)
• No graphs, sums, or averages in the file
• Missing data should be left as blank fields
• Missing data codes like -999 are ok too (see command mvdecode)
• Variable names should contain only alphanumeric characters or underscores (_)
• Make as many variables numeric as possible
• Many Stata commands will only accept numeric variables
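The mvdecode command mentioned above can be sketched as follows. This assumes, purely for illustration, that -999 was used as a missing data code in the test score variables; if the value does not occur, the command changes nothing.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* convert the numeric code -999 to Stata's missing value (.)
* in the listed variables (the choice of variables is an assumption)
mvdecode read write math science socst, mv(-999)
```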
HELP FILES AND STATA SYNTAX
help command open help page for command
HELP FILES
• Precede a command name (and certain topic names) with help to access its help file.
* open help file for command summarize
help summarize
• The list command prints observations to the Stata console
• Simply issuing “list” will list all observations and variables
• Not usually recommended except for small datasets
• Specify variable names to list only those variables
* list read and write for first 5 observations
li read write in 1/5
     +--------------+
     | read   write |
     |--------------|
  1. |   57      52 |
  2. |   68      59 |
  3. |   44      33 |
  4. |   63      44 |
  5. |   47      52 |
     +--------------+
• Many commands are run on a subset of the data set observations
• in selects by observation (row) number
• Syntax
• in firstobs/lastobs
• 30/100: observations 30 through 100
• if selects observations that meet a certain condition
* list science for last 3 observations
li science in -3/L
     +---------+
     | science |
     |---------|
198. |      55 |
199. |      58 |
200. |      53 |
     +---------+
* list gender, ses, and math if math > 70
* with clean output
li gender ses math if math > 70, clean
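Conditions after if can be combined with & (and) and | (or), and if and in can be used together. The sketch below uses arbitrary cutoffs chosen only for illustration. It also shows one point worth keeping in mind: Stata treats a numeric missing value as larger than any number, so a condition like science > 60 is also true when science is missing.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* both conditions must hold (&)
li id read write if read > 65 & write > 65, clean
* either condition may hold (|), restricted to the first 20 rows
li id read write if (read > 70 | write > 65) in 1/20, clean
* count observations without listing them
count if science > 60
* same count, excluding observations where science is missing
count if science > 60 & !missing(science)
```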
• Use the browse command to examine the ses values for students with write
score greater than 65
• Then, use the help file for the browse command to rewrite the command to
examine the ses values without labels.
• Take the time to explore your data set before embarking on analysis
• Get to know your sample with quick summaries of variables
• Demographics of subjects
• Distributions of key variables
• Look for possible errors in variables
USE codebook TO INSPECT VARIABLE VALUES
For more detailed information about the values of each variable, use codebook, which provides the following:
• For all variables
• continuous variables
• frequencies
* inspect values of variables read gender and prgtype
codebook read gender prgtype
-----------------------------------------------------------------------------------------------------
read                                                                                    reading score
-----------------------------------------------------------------------------------------------------
                  mean:   52.23
              std. dev:   10.2529
-----------------------------------------------------------------------------------------------------
gender                                                                                    (unlabeled)
-----------------------------------------------------------------------------------------------------
            tabulation:  Freq.  Value
                            91      1
             missing .:  0/200
-----------------------------------------------------------------------------------------------------
prgtype                                                                                   (unlabeled)
-----------------------------------------------------------------------------------------------------
                  type:  string (str8)
• standard deviation
• min and max
• Use the detail option with summarize to get additional statistics, such as:
• variance
• skewness
• kurtosis
* summarize read and math for females
summarize read math if gender == 2
* detailed summary of read for females
summarize read if gender == 2, detail
      Percentiles      Largest
75%           57            71
90%           68            73       Variance       101.16
95%           68            73       Skewness     .3234174
99%           73            76       Kurtosis     2.500028
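The statistics printed by summarize are also stored by Stata and can be reused in later commands. A small sketch, assuming the same hs0 data and gender coding as above:

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* detailed summary of read where gender == 2
summarize read if gender == 2, detail
* show everything summarize stored
return list
* reuse two of the stored results
display "median = " r(p50) ", variance = " r(Var)
```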
TABULATING FREQUENCIES OF CATEGORICAL VARIABLES
of a variable
• useful for variables with a limited
        ses |      Freq.     Percent        Cum.
------------+-----------------------------------
        low |         47       23.50       23.50
     middle |         95       47.50       71.00
       high |         58       29.00      100.00
------------+-----------------------------------
      Total |        200      100.00
TWO-WAY TABULATIONS
• Use the tab command to determine the numeric code for “Asians” in the race
variable
• Then use summarize to estimate the mean of the variable science for Asians
histogram histogram
• Use the option normal with histogram to overlay a theoretical normal density
* histogram of write with normal density
* and intervals of length 5
hist write, normal width(5)
• Boxplots are another popular option for displaying distributions of continuous variables
• They display the median, the interquartile range (IQR), and outliers (beyond 1.5*IQR)
• You can request boxplots for multiple variables on the same plot
* boxplot of all test scores
graph box read write math science socst
SCATTER PLOTS
• Bar graphs are often used to visualize frequencies
• graph bar produces bar graphs in Stata
• its syntax is a bit tricky to understand
• For displays of frequencies (counts) of each level of a variable, use this syntax:
graph bar (count), over(variable)
* bar graph of count of ses
graph bar (count), over(ses)
TWO-WAY BAR GRAPHS
• Layered graph of scatter plot and lowess plot (best fit curve)
* layered graph of scatter plot and lowess curve
twoway (scatter write read) (lowess write read)
LAYERED GRAPH EXAMPLE 2
• You can also overlay separate plots by group to the same graph with different colors
• Use if to select groups
• the mcolor() option controls the color of the markers
* layered scatter plots of write and read
* colored by gender
twoway (scatter write read if gender == 1, mcolor(blue)) ///
       (scatter write read if gender == 2, mcolor(red))
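With two layers built from the same variables, the default legend does not say which group is which. One possible fix is the legend() option, sketched below; the legend text is an assumption and simply repeats the if conditions rather than guessing at group names.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* same layered scatter plot, with an explicit legend entry per layer
twoway (scatter write read if gender == 1, mcolor(blue)) ///
       (scatter write read if gender == 2, mcolor(red)), ///
       legend(order(1 "gender == 1" 2 "gender == 2"))
```

Note that the /// line continuation works in a do-file, not when typed line by line into the Command window.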
EXERCISE 3
• Use the scatter command to create a scatter plot of math on the x-axis vs
write on the y-axis
• Use the help file for scatter to change the shape of the markers to
triangles.
DATA MANAGEMENT
generate create variable
• Variables often do not arrive in the form that we need
• Use generate (often abbreviated gen or g) to create variables, usually from operations on existing variables
• sums/differences/products/means of variables
• squares of variables
* generate a sum of 3 variables
generate total = math + science + socst
(5 missing values generated)
* it seems 5 missing values were generated
* let's look at variables
summarize total math science socst
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       total |        195    156.4564    24.63553         96        213
        math |        200      52.645    9.368448         33         75
     science |        195    51.66154    9.866026         26         74
       socst |        200      52.405    10.73579         26         71
• Missing numeric values in Stata are represented by .
• Missing string values in Stata are represented by "" (empty quotes)
• You can check for missing by testing for equality to . (or "" for string variables)
• You can also use the missing() function
• When using estimation commands, generally, observations with missing on any variable used in the command will be dropped from the analysis
* list variables when science is missing
li math science socst if science == .
* same as above, using missing() function
li math science socst if missing(science)
     +------------------------+
     | math   science   socst |
     |------------------------|
  9. |   54         .      51 |
 18. |   60         .      56 |
 37. |   75         .      66 |
 55. |   73         .      66 |
 76. |   43         .      31 |
     +------------------------+
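A few further ways to look at missing values are sketched below, using the same hs0 data. One difference between the two tests shown above: == . matches only the system missing value, while missing() also matches Stata's extended missing values (.a through .z), so missing() is usually the safer choice.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* count observations where science is missing
count if missing(science)
* report missing values for every variable that has any
misstable summarize
* numeric missing counts as larger than any number,
* so exclude it explicitly when testing with >
count if science > 70 & !missing(science)
```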
REPLACING VALUES
• Use replace to replace values of existing variables
* replace total with just (math+socst)
* if science is missing
replace total = math + socst if science == .
• egen (extended generate) creates variables using a wide array of functions, which include:
• statistical functions that accept multiple variables as arguments
• e.g. means across several variables
• functions that accept a single variable, but do not
• See the help file for egen to see a full list of available functions
* egen with function rowmean generates variable that
* is mean of all non-missing values of those variables
egen meantest = rowmean(read math science socst)
summarize meantest read math science socst
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    meantest |        200    52.28042    8.400239       32.5   70.66666
        read |        200       52.23    10.25294         28         76
        math |        200      52.645    9.368448         33         75
     science |        195    51.66154    9.866026         26         74
       socst |        200      52.405    10.73579         26         71
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       zread |        200   -1.84e-09           1  -2.363225    2.31836
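The zread summary above (mean near 0, standard deviation 1) is what a standardized copy of read would look like. A sketch of one way to produce such a variable uses egen's std() function; that this is the command behind the output above is an assumption. The by() example and the name meanread are likewise illustrative.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* standardize read to mean 0 and standard deviation 1
egen zread = std(read)
summarize zread
* a single-variable function computed within groups:
* mean of read within each level of ses
egen meanread = mean(read), by(ses)
```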
RENAMING AND RECODING VARIABLES
• Value labels give text descriptions to the numerical values of a variable.
• To create a new set of value labels use label define
• Syntax: label define labelname # label…,
* schtyp before labeling values
tab schtyp
public/priv |
 ate school |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        168       84.00       84.00
          2 |         32       16.00      100.00
------------+-----------------------------------
      Total |        200      100.00
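A sketch of the full labeling workflow follows. The label name pubpriv is hypothetical, and the mapping 1 = public, 2 = private is an assumption made for illustration; check the data documentation before relying on it.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* ASSUMPTION: 1 = public and 2 = private
label define pubpriv 1 "public" 2 "private"
* attach the set of value labels to the variable
label values schtyp pubpriv
* the tabulation now shows the labels instead of 1 and 2
tab schtyp
```

Defining a label set and attaching it are separate steps, so one set of labels can be applied to several variables.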
• remember that some Stata commands require numeric variables
• encode will use alphabetical order to order the numeric codes
• encode will convert the original string values into a set of value labels
• encode will create a new numeric variable, which must be specified in option gen(varname)
* we see that prog is a numeric with labels (blue)
* while the old variable prgtype is string (red)
browse prog prgtype
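The encode step described above can be sketched as follows, assuming the string variable prgtype from hs0 and the new variable name prog used in the browse command.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* create numeric variable prog from string variable prgtype
encode prgtype, gen(prog)
* tabulate with value labels, then with the underlying numeric codes
tab prog
tab prog, nolabel
```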
ENCODING STRING VARIABLES INTO NUMERIC (2)
• We are about to make changes to the dataset that cannot easily be reversed, so we should save the data before continuing
* save dataset, overwrite existing file
save hs1, replace
KEEPING AND DROPPING VARIABLES
• keep preserves the selected variables and drops the rest
• Use keep if you want to remove most of the variables but keep a select few
• drop removes the selected variables and keeps the rest
• Use drop if you want to remove a few variables but keep most of them
* drop variable prgtype from dataset
drop prgtype
* keep just id read and math
keep id read math
KEEPING AND DROPPING OBSERVATIONS
• To be clear, keep if and drop if select observations, while keep and drop select variables
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        read |        178    54.23596     8.96323         41         76
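A minimal sketch of subsetting observations is given below. The cutoff of 40 is an assumption chosen to be consistent with the minimum of 41 in the summary above; it is not necessarily the command that produced that output.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* keep only observations with read above 40
* (drop if read <= 40 would have the same effect)
keep if read > 40
summarize read
```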
• Use sort to order the observations by one or more variables
• Use gsort with + or - before each variable to specify ascending and descending order, respectively
* now sort by read and then math
sort read math
li in 1/5
* sort descending read then ascending math
gsort -read +math
li in 1/5
     +-------------------+
     |  id   read   math |
     |-------------------|
  1. |  61     76     60 |
  2. | 103     76     64 |
  3. |  34     73     57 |
  4. |  93     73     62 |
  5. |  95     73     71 |
     +-------------------+
EXERCISE 5
• Reload the hs0 data set fresh using the following command:
use https://stats.idre.ucla.edu/stat/data/hs0, clear
• Subset the dataset to observations with write score greater than or equal to 60.
Then remove all variables except for id and write. Save this as a Stata dataset
called “highwrite”
• Reload the hs0 dataset, subset to observations with write score less than 60,
remove all variables except id and write, and save this dataset as “lowwrite”
• Reload the hs0 dataset. Drop the write variable. Save this dataset as
“nowrite”.
COMBINING DATASETS
append add more observations
merge add more variables, join by matching variable
APPENDING DATASETS
• Let’s append together two of the datasets we just created in the previous exercise
* first load highwrite
use highwrite, clear
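The append step itself might look like the sketch below. It assumes highwrite.dta and lowwrite.dta were saved to the working directory in the previous exercise.

```stata
* first load highwrite
use highwrite, clear
* add the observations from lowwrite below those already in memory
append using lowwrite
* the combined data should cover the full range of write
summarize write
```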
• Let’s merge our dataset of id and write with the dataset “nowrite” using id as the merge variable
• merge syntax:
• 1-to-1: merge 1:1 idvar using dtaname
• 1-to-many: merge 1:m idvar using dtaname
• many-to-1: merge m:1 idvar using dtaname
• Note that idvar can be multiple variables used to match
• Let’s try this 1-to-1 merge
• Stata will output how many observations were successfully and unsuccessfully merged
* merge in nowrite dataset using id to link
merge 1:1 id using nowrite
    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                               200  (_merge==3)
    -----------------------------------------
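After a merge, Stata adds a variable named _merge recording where each observation came from. The sketch below rebuilds the appended data and then inspects and removes _merge; it assumes highwrite.dta, lowwrite.dta and nowrite.dta exist in the working directory.

```stata
* rebuild the dataset of id and write
use highwrite, clear
append using lowwrite
* 1-to-1 merge on id
merge 1:1 id using nowrite
* _merge == 3 marks observations found in both datasets
tab _merge
* drop _merge so that a later merge can create it again
drop _merge
```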
BASIC STATISTICAL ANALYSIS
mean means and confidence intervals
ttest t-tests
• Please load the dataset hs1, which is dataset hs0 altered by our data
management commands, using the following syntax:
use https://stats.idre.ucla.edu/stat/data/hs1, clear
MEANS AND CONFIDENCE INTERVALS (1)
• The mean command provides a 95% confidence interval for the mean
        read |      52.23   .7249921      50.80035    53.65965
--------------------------------------------------------------
shows the pairwise correlation between each pair of variables
       socst |   0.6175   0.5996   0.5299   0.4529   1.0000
• If you need p-values for correlations, use the command pwcorr
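The commands behind this kind of output can be sketched as follows. The list of five test score variables is an assumption based on the five columns in the correlation row above.

```stata
use https://stats.idre.ucla.edu/stat/data/hs1, clear
* mean, standard error and 95% confidence interval
mean read
* correlation matrix, using only observations complete on all variables
correlate read write math science socst
* pairwise correlations with number of observations and p-values
pwcorr read write math science socst, obs sig
```

Because correlate drops any observation with a missing value on a listed variable while pwcorr uses all available pairs, the two can give slightly different estimates when some values are missing.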
MODEL ESTIMATION COMMAND SYNTAX
• Linear regression, or ordinary least squares regression, models the effects of one or more
predictors, which can be continuous or categorical, on a normally-distributed outcome
• Syntax: regress depvar indepvarlist, where depvar is the name of the
dependent variable, and indepvarlist is a list of independent variables
• To be safe, precede independent variable names with i. to denote categorical predictors and c.
to denote continuous predictors
• For categorical predictors with the i. prefix, Stata will automatically create dummy 0/1 indicator
variables and enter all but one (the first, by default) into the regression
• Let’s run a linear regression of the dependent variable write predicted by independent
variables math (continuous) and ses (categorical)
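Following the syntax just described, the regression can be sketched as below. The second command is an optional variation, not part of the original example: it shows how a different reference category might be requested with the ib(last). prefix.

```stata
use https://stats.idre.ucla.edu/stat/data/hs1, clear
* write predicted by continuous math and categorical ses
* (the first level of ses is omitted as the reference category)
regress write c.math i.ses
* same model, using the last level of ses as the reference
regress write c.math ib(last).ses
```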
LINEAR REGRESSION EXAMPLE
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | .6115218 .0588735 10.39 0.000 .495415 .7276286
|
ses |
middle | -.5499235 1.346566 -0.41 0.683 -3.205542 2.105695
high | 1.014773 1.52553 0.67 0.507 -1.993786 4.023333
|
_cons | 20.54836 3.093807 6.64 0.000 14.44694 26.64979
------------------------------------------------------------------------------
ESTIMATING STATISTICS BASED ON A MODEL
• Stata provides excellent support for estimating and testing additional statistics after a
regression model has been run
• Stata refers to these as “postestimation” commands, and they can be used after most
regression models
• To see which commands can be issued as follow-ups to a model estimation command, use:
help model_command postestimation
Where model_command is a Stata model command
e.g. for regress, try help regress postestimation
• Examples: model predictions, joint tests of coefficients or linear combination of statistics,
marginal estimates
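Two of the postestimation examples listed above, a joint test of coefficients and marginal estimates, might be sketched like this. The choice of testparm and margins is one reasonable option among several and assumes the regression of write on math and ses from earlier.

```stata
use https://stats.idre.ucla.edu/stat/data/hs1, clear
regress write c.math i.ses
* joint test that all ses coefficients equal zero
testparm i.ses
* predicted mean of write at each level of ses,
* averaging over the observed values of math
margins ses
```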
POSTESTIMATION EXAMPLE 1: PREDICTION
• Predicted value of dependent variable (default)
• Residuals (difference between observed and predicted values)
* first 5 predicted values and residuals with observed write
li pred res write in 1/5
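The variables pred and res listed above have to be created with predict after the model is run. A sketch, assuming the same regression of write on math and ses:

```stata
use https://stats.idre.ucla.edu/stat/data/hs1, clear
regress write c.math i.ses
* predicted values (the default statistic)
predict pred
* residuals: observed write minus predicted write
predict res, residuals
* first 5 predicted values and residuals with observed write
li pred res write in 1/5
```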
------------------------------------------------------------------------------
highmath | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
write | 1.253272 .050584 5.59 0.000 1.157949 1.356442
1.female | .4330014 .1823694 -1.99 0.047 .1896638 .9885398
_cons | 1.19e-06 2.82e-06 -5.76 0.000 1.14e-08 .0001237
------------------------------------------------------------------------------
EXERCISE 7
• Use the tab command to run a chi-square test of independence to test for
association between ses and race.
• Fisher's exact test is often used in place of the chi-square test of independence
when the (expected) cell sizes are small. Use the help file for tabulate
twoway (which is just the tabulate command for 2 variables) to run a
Fisher's exact test to test the association between ses and race. How does the
p-value compare to the result of the chi-square test?
ADDITIONAL STATA MODELING COMMANDS
A FEW OF STATA’S ADDITIONAL REGRESSION COMMANDS