Introduction to Stata Software, MaU, 2022

The document outlines a training plan for using Stata software in economics, covering topics such as the Stata interface, data management, statistical analysis, and regression models. It emphasizes the importance of exercises, group work, and familiarity with Stata for effective learning. Additionally, it provides details on the different versions of Stata, their capabilities, and essential commands for exploring and managing datasets.


Training on Economics Software

Applications: Introduction to Stata©

Tolasa Alemayehu
Economics Department
Mattu University

Nov, 2022
Mattu, Ethiopia
A PLAN (CONTENT) OF TRAINING
1. Introduction
 The Stata Interface
 Exploring and Examining Datasets
 Storing Commands and Outputs

2. Data Management
 Creating, Modifying and Defining Variables
 Appending and Merging Datasets
 Collapsing Data Sets

3. Describing Data
 Summary Statistics
 Statistical Tests
 Graphics

4. Analysis of Regression Models
 Steps in Empirical Analysis
 Structure of Economic Data
 Regression Models: Cross Section, Time Series and Panel
A PLAN (CONTENT OF TRAINING)
 Exercises will be given for every section
 Working in groups is advisable
 It is good to choose seats so that at least one person to your right or left has some acquaintance with Stata
1. Introduction
 What is Stata? Why Use Stata?
 Types of Stata
 Stata (pronounced “stah-tah”). Version 1 was born in 1985.
 Stata is not an abbreviation but rather a corruption of the word Statistics.
 Stata is a general-purpose software and a command-driven package (i.e. not specialized like DAD, Eviews, GAMs, SPSS, Matlab, nlogit, etc.)
◦ Cross-section, panel, and time-series data analysis (especially suited for the former two)
Why should I use Stata?
 Stata is preferred to other packages as “a very interactive package, which makes you feel like you are talking to it, and it does exactly what you are telling it to do,” i.e.:
• Handles and manipulates large data sets (e.g. millions of observations!)
• Growing capabilities for handling panel and time-series regression analysis
• Continuing improvements in computing speed, capabilities and functionality
• Constantly updated and advanced by users with specific needs
• Fast and easy to use
Types (sizes) of Stata
 There are four different types (sizes) available for each version of Stata:
 1. Stata MP (Multi Processor), the most powerful,
 2. Stata SE (Special Edition),
 3. Stata IC (Intercooled), and
 4. Small Stata.
 The main difference between these types is the maximum number of variables, regressors and observations that can be handled.
 It is important to know these types if one is to make a good choice of what to buy.

Stata Type    Max Variables   Max Regressors   Max Observations   Remarks
Stata/MP      32,767          10,998           2,147,483,647*     Runs on multiple CPUs or cores (from 2 to 64, depending on the licence) but can also run on a single core; the fastest version of Stata
Stata/SE      32,767          10,998           2,147,483,647*     Runs on a single core; can run on multi-core computers but uses only a single core
Stata/IC      2,047           798              2,147,483,647*     Runs on a single core; can run on multi-core computers but uses only a single core
Small Stata   99              99               1,200              Runs on a single core; can run on multi-core computers but uses only a single core
[Screenshot of the Stata interface, annotated with the menu bar, the toolbar, and the five main windows: Results, Variables, Review, Properties, and Command.]
The Stata Interface: Windows, Toolbar, Menus, and Dialogs
 Windows
 The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands.
 The five main windows are the Review, Results, Command, Variables, and Properties windows.
 There are other, more specialized windows such as the Viewer, Data Editor, Variables Manager, Do-file Editor, Graph, and Graph Editor windows.
 Some of them open automatically when you start Stata, while others can be opened using the Window pull-down menu or the buttons on the toolbar.
 Stata windows are:
• Stata Results To see recent commands and output
• Stata Command To enter a command
• Stata Browser To view the data file
• Stata Editor To edit the data file
• Stata Viewer To get help on how to use Stata
• Variables To see a list of variables
• Review To see recent commands
• Stata Do-file Editor: To write or edit a program
Menus
Stata displays 8 drop-down menus across the top of the outer window, from left to right:
File
Open open a Stata data file (use)
Save/Save as save the Stata data in memory to disk
Do execute a do-file
Filename copy a filename to the command line
Print print log or graph
Exit quit Stata
Edit
Copy/Paste copy text among the Command, Results, and Log windows
Copy Table copy table from Results window to another file
Table copy options what to do with table lines in Copy Table
Prefs Various options for setting preferences. For example, you can save
a particular layout of the different Stata windows or change the
colors used in Stata windows.
Data
Graphics
Statistics build and run Stata commands from menus
User menus for user-supplied Stata commands (download from Internet)
Window bring a Stata window to the front
Help Stata command syntax and keyword searches
Button bar
The buttons on the button bar are, from left to right:
Open a Stata data file: use
Save the Stata data in memory to disk: save
Print a log or graph
Open a log, or suspend/close an open log: log
Open a new Do-file Editor: doedit
Edit the data in memory: edit
Browse the data in memory: browse
Important Shortcuts
 Keyboard shortcuts are quicker to use
than the buttons. The most useful ones are:
 Control-O Open file
 Control-S Save file
 Control-C Copy
 Control-X Cut
 Control-V Paste
 Control-Z Undo
 Control-F Find
 Control-H Find and Replace
1.2. Exploring and Examining Datasets
1.2.1. Exploring Data Files
Common Stata Syntax
• Stata commands follow the same syntax:
 [by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]
• Items inside the square brackets are optional, and not every element is available for every command.
• This syntax applies to almost all Stata commands.
 Logical operators used in Stata

~ not
== equal
~= not equal
!= not equal
> greater than
>= greater than or equal
< less than
<= less than or equal
& and
| or
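As a sketch of how these operators combine with the common syntax above (using the ERHS variables hhid, q1a, hhsize and food that appear later in this training):

```stata
* list selected variables for large households in regions 1 through 5
list hhid q1a food if q1a < 6 & hhsize >= 6

* summarize food consumption for region 3 or region 4 only
summarize food if q1a == 3 | q1a == 4

* tabulate region for every household except region 1
tab q1a if q1a != 1
```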
1.2.2. Examining dataset
 Using the command window:
a. Stata file (.dta): use command
b. Excel file (.xlsx): import excel command
c. CSV file (.csv): insheet command
d. SPSS file (.sav): usespss command (user-written)
 Log file: Stata can save the log file in one of two different formats:
a. Stata Markup and Control Language (SMCL) format (SMCL is recommended because SMCL files can be translated into a variety of formats readable by applications other than Stata)
b. Plain text log format
• Use
– This command opens an existing Stata data file.
– The syntax is:
• use filename [, clear ] : opens new file
• use [varlist] [if exp] [in range] using filename [, clear]
opens selected parts of file
– If there is no path, Stata assumes it is in the current folder.
– You can use a path name such as: use C:\...\ERHScons1999
– If the path name has spaces, you must use double quotes:
use "d:\my data\ERHScons1999"
– You can open selected variables of a file using a variable list.
– You can open selected records of a file using if or in.
Examining dataset
Here are some examples of the use command:
• use ERHScons1999: opens this file for analysis.
• use ERHScons1999 if q1a == 1: opens data from region 1
• use ERHScons1999 in 5/25: opens records 5 through 25 of the file
• We can also combine the if and in qualifiers
• use q1a hhid hhsize cons using ERHScons1999: opens four variables from ERHScons1999
• use ERHScons1999, clear: clears memory before opening the new file
Examining dataset
 clear: The clear command deletes all data, variables, and labels from memory to get ready for a new data file
◦ You can clear memory using the clear command or by using it as an option of the use command. This command does not delete any data saved to the hard disk
 exit: Differs from the clear command
◦ Closes Stata and other relevant windows
 If data were entered in another format, such as Excel, importing into Stata is simple
 Example: if our data set is in Excel, then use:
 import excel using "C:\Users\eea\Desktop\SD\original\teff price.xlsx", sheet(addis) firstrow clear
Examining dataset
 Save
– The save command saves the dataset as a .dta file under the name you choose. Editing the dataset changes the data in the computer's memory; it does not change the data stored on the computer's disk.
 save "C:\...\ERHScons1999.dta", replace
– The replace option allows you to save a changed file to the disk, replacing the original file.
– Stata is worried that you will accidentally overwrite your data file.
– You need to use the replace option to tell Stata that you know the file exists and you want to replace it.
Examining dataset
• Edit
• This command opens a window called the Data Editor, which allows us to view all the observations in memory.
• You can change the data using the Data Editor window, but it is not recommended to edit data this way
• It is better to correct errors in the data using a Do-file program that can be saved.
• Browse
• This window is exactly like the Data Editor window except that you can’t change the data
Examining dataset
• Describe
– This command provides a brief description of the
data file.
– You can use “des” or “d” and Stata will understand.
– The output includes:
• the number of variables
• the number of observations (records)
• the size of the file
• the list of variables and their characteristics
• Storage types: String vs numeric
Examining dataset
 list
◦ This command lists values of variables in the data set.
◦ The syntax is:
 list [varlist] [if exp] [in range]
 Examples:
◦ list lists the entire dataset
◦ list in 1/10 lists observations 1 through 10
◦ list hhsize q1a food lists selected variables
◦ list hhsize sex in 1/20 lists observations 1-20 for selected variables
Examining dataset
• list with an “if” condition
– The if qualifier is used to select certain records when carrying out a command
• command if exp
 Examples:
– list hhid q1a food if food > 1200 lists data if food is > 1200
– list if q1a < 6 lists cases in regions 1 through 5
– browse hhid q1a food if food >= 1200 browses data where food consumption is above 1200
• Note that “if” conditions always use ==, not a single =. Also note that | indicates “or” while & indicates “and”
Examining dataset
 list with “in”
◦ We can also use in to select records based on the observation number.
◦ The syntax is: command in exp
For example:
◦ list in 10 lists observation number 10
◦ summarize in 10/20 summarizes observations 10-20
 codebook
◦ The codebook command is a great tool for getting a quick overview of the variables in the data file.
◦ It produces a kind of electronic codebook from the data file, displaying information about variables' names, labels and values
◦ Examples: codebook / codebook hhid q1a food
Examining dataset
 inspect
 A command for getting a quick overview of a data file.
◦ The inspect command displays information about the values of variables and is useful for checking data accuracy
 inspect
 inspect hhid q1a food
• assert
– verifies that a condition is true for every observation and stops with an error otherwise
• count
– The count command shows the number of observations satisfying an if condition.
– If no condition is specified, count displays the number of observations in the data.
• count: 1452
• count if q1a==3: 466
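The assert command above is a useful companion to inspect for data checking: it stays silent when its condition holds for every observation and stops with an error otherwise. A minimal sketch, assuming the same ERHS variables:

```stata
* passes silently if household size is positive in every record
assert hhsize > 0

* check that region codes are in the expected range, allowing missing values
assert q1a <= 8 | missing(q1a)
```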
1.3. STORING: Outputs, Commands & Data
 The following topics are covered:
◦ Using the Do-file Editor
◦ log using
◦ log off
◦ log on
◦ log close
◦ set logtype, to move tables from Stata to Word and Excel
 Using the Do-file Editor
 The Do-file Editor allows you to store a program (a set of commands):
◦ It makes it easier to check and fix errors,
◦ It allows you to run the commands later,
◦ It lets you show others how you got your result, and
◦ It allows you to collaborate with others on the analysis.
STORING: Outputs, Commands and Data
 In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file.
 To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope icon on the toolbar.
 To run the commands in a Do-file, you can click on the Do button.
 If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button
 Note: If you would like to add a note to a do-file, but do not want Stata to execute your notes, /* */ is used for more than one line and * for a single line
 If you need to put a comment after a command, use //, separated from the command by a space, and write the comment after it
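Putting these commenting rules together, a small do-file might look like the following sketch (the filenames are illustrative):

```stata
/* example.do
   A multi-line comment describing the purpose of the file */

* single-line comment: open a log, replacing any previous run
log using results.smcl, replace

use ERHScons1999, clear   // load the training dataset
summarize hhsize food     // quick descriptive statistics

log close
```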
STORING: Outputs, Commands and Data
 Saving the Output
◦ The Stata Results window does not keep all the output you generate.
◦ It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results.
◦ Thus, we need to use log to save the output
 log using
◦ This command creates a file with a copy of all the commands and output from Stata. The syntax is:
log using filename [, append replace [ text | smcl ] ]
 append: adds the output to an existing file
 replace: replaces an existing file with the output
STORING: Outputs, Commands and Data
 Here are some examples:
log using "C:\Users\eea\Desktop\SD\results.smcl"
log using "C:\Users\eea\Desktop\SD\results.smcl", replace
log using "C:\Users\eea\Desktop\SD\results.smcl", append
 log off: This command temporarily turns off the logging of output
 log on: This command is used to restart the logging
 log close: is used to turn off the logging and save the file
 Storing data
 save
 save, replace
 Examples:
save "C:\Users\eea\Desktop\SD\verion1.dta"
save "C:\Users\eea\Desktop\SD\verion2.dta", replace
Getting help in Stata
• help: The help command gives you information about any Stata command or topic
• help [command]
 For example:
• help tabulate: gives a description of the tabulate command
• help summarize: gives a description of the summarize command
• search: a keyword search, useful when one does not know the Stata command
 Example: search ols
 hsearch: not restricted to keywords
 E.g. hsearch weak instruments
 net search: searches the Internet (when connected) for user-written commands
◦ E.g. net search outreg2
2. Data Management in Stata
 Some Organizing Tips
 Adding Notes to Datasets and Variables
 Creating and Modifying Variables
 Defining, labeling and renaming Variables
 Appending and Merging Data Sets
 Collapsing Data Sets
 Additional Help on Stata
 Exercises
First, be organized
 Be organized in your data management
 Always use do-files for your research project
 Know the Stata version you are working with
◦ What if I do not know the Stata version?
 Save your outputs
◦ capture log close
◦ the log using command
 Create a shorter way of writing your directories
◦ The global command
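The global command stores a piece of text, such as a directory path, under a short name that is recalled with a dollar sign; this is how macros such as $original and $final are used later in this training. A sketch, with illustrative paths:

```stata
* define shorthand names for the project folders (paths are illustrative)
global original "C:\Users\eea\Desktop\SD\original"
global final "C:\Users\eea\Desktop\SD\final"

* $original and $final expand to the full paths
use "$original\ERHScons1999_old.dta", clear
save "$final\ERHScons1999.dta", replace
```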


Adding notes on your data set
 You can add notes to your data set
 Example:
◦ note: This data contains some variables generated by Economics staff
 To read notes:
◦ notes
 Notes can also be written for specific variables:
◦ note food: Is this per capita or per week? Please check.
 To delete notes:
◦ notes drop q2_area in 1
◦ notes drop _dta in 2
CREATING NEW VARIABLES
 When new variables are created, they are in memory and they will appear in the Data Browser
 However, they will not be saved on the hard disk unless you use the save command.
 generate
◦ This command is used to create a new variable.
◦ It is similar to “compute” in SPSS.
 The syntax is:
 generate newvar = exp [if exp]
 where “exp“ is an expression like “price*quant” or “1000*kg”
CREATING NEW VARIABLES
 You can use “gen“ or “g” as an abbreviation for “generate“
 If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true
 If you use “if“, the new variable will have missing values where the “if“ statement is false
 For example:
 use "$original\ERHScons1999_old.dta", clear
CREATING NEW VARIABLES
• generate age2 = ageh*ageh
 creates an age-squared variable
• gen conspercap = food/hhsize
 creates consumption per capita
• gen consperad = food/aeu
 creates consumption per adult
• gen highcons = (rconsae > 2000)
 indicates those with consumption greater than 2000
 To know the number of these households:
CREATING NEW VARIABLES
• tab highcons
 save "$final\ERHScons1999.dta", replace
 replace: This command is used to change the definition of an existing variable.
 The syntax is the same:
 replace oldvar = exp [if exp] [in exp]
 replace cons = . if cons < 0: replaces negative consumption with missing values
 tabulate … generate: This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable.
CREATING NEW VARIABLES
 The syntax is:
tabulate oldvariable, generate(newvariable)
tab q1a, gen(region)
 This creates 6 new variables:
region1=1 if q1a==1 and 0 otherwise
…
region6=1 if q1a==8 and 0 otherwise
 egen: This is an extended version of “generate” [extended generate] used to create a new variable by aggregating the existing data.
 The syntax is:
egen newvar = fcn(arguments) [if exp] [in range], by(var)
CREATING NEW VARIABLES
Functions
 mean() mean
 median() median
 max() maximum
 min() minimum
 sd() standard deviation
 sum() sums
 egen average = mean(cons): creates a variable of average consumption over the entire sample
 egen median = median(cons), by(sex): creates a variable of median consumption for each sex
 egen regav = mean(cons), by(region): creates a variable of mean consumption for each region
 egen avecon = mean(cons), by(q1c)
 gen highavecon = (cons > avecon)
CREATING NEW VARIABLES
 Some operators used in Stata
 Arithmetic:
 + addition
 - subtraction
 * multiplication
 / division
 ^ power
 Relational:
 > greater than
 < less than
 >= greater than or equal
 <= less than or equal
 == equal
 ~= not equal
 != not equal
 Logical:
 ~ not
 | or
 & and
 The Variables Manager is a tool for managing properties of variables, both individually and in groups.
 It can be used to create variable and value labels, rename variables, change display formats, and manage notes.
 It has the ability to filter and group variables as well as to create variable lists.
 Labeling a variable: label variable var1 "description"
 The various levels of a categorical variable can be labeled using the following two Stata commands together: label define and label values.
 Example: gender has two categories, 1 for male and 2 for female. Gender can be labeled as:
 label define gender 1 "male" 2 "female"
 label values gender gender
MODIFYING VARIABLES
 We begin with an explanation of how to label data in Stata.
 Then we see how to format variables.
◦ rename variable
◦ label variable
◦ keep/drop and order/sort
◦ label define/values
 rename: This command is used to give a variable another name. The command is:
rename old_variable new_variable
Example: Generate a dummy for the region variable and rename the new dummy variables
 label variable: this helps us give a short description of the variable. Command: label variable yield "output per hectare"
MODIFYING VARIABLES
 We can subset data by keeping or dropping variables, or by keeping and dropping observations
◦ keep and drop variables
 The keep command is used to keep the variables in the list while dropping all other variables
 The drop command is used to delete the variables in the list while keeping all the others
◦ keep and drop observations
 The keep if command is used to keep observations if a condition is met, and vice versa for drop if.
 If there are many variables to drop and few to keep, then apply keep
 However, if there are many variables to keep and only a few to drop, use drop
MODIFYING VARIABLES
 Examples:
◦ drop pwhole_mixed pretail_mixed
◦ keep pwhole_white pretail_white pwhole_red pretail_red
Note: The two commands have the same effect in this case
 sort: This command arranges the observations of the current data into ascending order based on the values of the variables listed
 order: This command helps us organize variables in a way that makes sense by changing the order of the variables
 order x y z: puts x first, y second, z third
 sort x: puts the data in ascending order of the variable x
Appending datasets
 Often we don’t have all the information that we need in one dataset, and we have to combine two or more datasets into one
 There are several types of “appending” and “merging” of datasets…
 As long as the variables in the files are the same and the only thing you need to do is to add observations, this is vertical combination.
 For this we use the append command.
Appending datasets
 Appending data files
◦ concatenates two datasets, that is, sticks them together vertically, one after another
use "$final\tprice_addis.dta", clear
append using "$final\tprice_dire.dta"
save "$final\tprice_all.dta", replace
◦ The append command does not require that the two datasets contain the same variables.
◦ But it is highly recommended to use an identical list of variables with the append command, to avoid missing values coming from one of the datasets
Defining Variables
 label define: This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region. The syntax is:
label define lblname # "label" # "label" # "label" [, add modify]
 Where: lblname is the name given to the set of value labels
◦ # are the value numbers
◦ "label" are the value labels
◦ add means add these value labels to the existing set
◦ modify means change these values in the existing set
Defining Variables
 Note that:
 You can use the abbreviation “label def“
 The double quotation marks are only necessary if there are spaces in the labels
 Stata will not let you redefine an existing label unless you say “modify” or “add“
 label values
◦ This command attaches a named set of value labels to a categorical variable.
 The syntax is:
label values varname [lblname] [, nofix]
label define reg 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP", modify
label values q1a reg
Merging and appending datasets
 If the identifying variable which appears in the files is unique in both files, then it's a one-to-one match.
 Unique means that for each value of this variable, there is only one observation that contains it.
 For example, with country as the identifying variable, it is a one-to-one match if each country has only one observation in both datasets.
Merging and appending datasets
 One-to-one match merging
 The merge command sticks two datasets together horizontally, one next to the other. Before any merge, both datasets must be sorted by identical merge variables:
. use p2sec9a.dta, clear
. sort hhid item1234
. save consumption.dta, replace
. use p_r5, clear
. sort hhid item1234
. save comprice.dta, replace
. use consumption.dta, clear
. merge hhid item1234 using comprice
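The commands above use the older merge syntax. Since Stata 11, merge takes an explicit match type and no longer requires the datasets to be sorted first; an equivalent sketch:

```stata
use consumption.dta, clear
* 1:1 declares a one-to-one match on the key variables
merge 1:1 hhid item1234 using comprice
```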
Merging and appending datasets
 One-to-many matching
◦ If the identifying variable is unique in one file, but not unique in the other, then it's a one-to-many match.
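A sketch of one-to-many matching with the newer merge syntax, assuming a hypothetical member-level file with several records per hhid and a household-level file with one record per hhid:

```stata
* member-level data in memory: many observations per household
use members.dta, clear

* m:1 -- many observations in memory match one record in the using file
merge m:1 hhid using households.dta
```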
Collapsing data sets
 Collapse
◦ Sometimes we have data files that need to be aggregated at a higher level to be useful for us.
◦ For example, we have household data but we are really interested in regional data.
◦ The collapse command serves this purpose by converting the dataset in memory into a dataset of means, sums, medians or percentiles
 For instance, we would like to see the mean cons for each q1a and sex of household head:
 collapse (mean) cons, by(q1a sex)
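collapse can also compute several statistics in one call, naming each new variable explicitly; a sketch using the same variables:

```stata
* one record per region: mean and median consumption, total household members
collapse (mean) meancons=cons (median) medcons=cons (sum) totsize=hhsize, by(q1a)
```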
Additional Stata Resources
 Don’t forget to get help for command-specific searches:
◦ help help
◦ search
◦ hsearch
◦ net search
 http://stataproject.blogspot.com
 http://www.stata.com/
 http://www.stata.com/statalist/
Additional Stata Resources
 Statalist is
◦ hosted at the Harvard School of Public Health,
◦ an email listserver where
◦ Stata users, from experts writing Stata programs to users like us,
◦ maintain a lively dialogue about all things statistics and Stata.
3. Data Analysis Using Stata©
 Describing Data with Summary Statistics
 Applying Some Statistical Tests in Stata
 Describing Data with Graphs
 Exercises
3.1. Basic Descriptive Statistics Using Stata
• summarize
– The summarize command produces statistics on continuous variables like age, food, cons and hhsize.
– The syntax looks like this:
summarize [varlist] [if exp] [in range] [, detail]
By default, it produces the following statistics:
• Number of observations
• Average (or mean)
• Standard deviation
• Minimum
• Maximum
Basic Descriptive Statistics Using Stata
If you specify “detail”, Stata gives you additional statistics, such as:
• skewness,
• kurtosis,
• the four smallest values,
• the four largest values,
• various percentiles.
mean = expected value (expectation) of Y = E(Y) = μY = long-run average value of Y over many repeated occurrences of Y
variance = E[(Y − μY)^2] = σY^2 = measure of the squared spread of the distribution around its mean
standard deviation = σY = square root of the variance
Basic Descriptive Statistics Using Stata
skewness = E[(Y − μY)^3] / σY^3
 measure of asymmetry (lack of symmetry) of a distribution
 skewness = 0: distribution is symmetric
 skewness > (<) 0: distribution has a long right (left) tail
 Skewness mathematically describes how much a distribution deviates from symmetry
kurtosis = E[(Y − μY)^4] / σY^4
 = measure of mass in tails = measure of probability of large values
 kurtosis = 3: normal distribution
 kurtosis > 3: heavy tails (“leptokurtotic”)
Basic Descriptive Statistics Using Stata
 Here are some examples:
 summarize: gives statistics on all variables
 summarize hhsize food: gives statistics on selected variables
 summarize hhsize, detail
 summarize hhsize cons if q1a==3: gives statistics on two variables for one region
 by: This prefix goes before a command and asks Stata to repeat the command for each value of a variable.
 The general syntax is: by varlist: command
 Note: the bysort command is most commonly used to shorten the sorting process. An example of the by prefix is:
 bysort sex: sum rconsae: for each sex of household head, gives statistics on real per capita consumption.
Basic Descriptive Statistics Using Stata
 tabulate, tab1, tab2
◦ These are three related commands that produce frequency tables for discrete variables.
◦ They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable).
 tabulate / tab: produces a frequency table for one or two variables
 tab1: produces a one-way frequency table for each variable in the variable list
 tab2: produces all possible two-variable tables from the list of variables
Basic Descriptive Statistics Using Stata
You can use several options with these commands:
• cell: gives the overall percentage for two-way tables
• column: gives column percentages for two-way tables
• row: gives row percentages for two-way tables
 There are many other options, including other statistical tests.
 For more information, type “help tabulate”
 Some examples of the tabulate commands are:
 tabulate q1a: produces a table of frequency by region
 tabulate q1a sexh: produces a cross-tab of frequencies by region and sex of head
 tab q1a sexh
 tab1 q1a sexh: produces a one-way frequency table for each variable
 tab2 q1a sexh
 tab2 q1a poor
 tab2 q1a sexh, cell
 tab2 q1a sexh, row
 tab2 q1a sexh, column
Statistical Tests
 ttest command
 We would like to see if the mean of hhsize equals 6 by using a single-sample t-test, testing whether the sample was drawn from a population with a mean of 6. The ttest command is used for this purpose: ttest hhsize=6
 We may also be interested in whether cons is close to food (a paired t-test):
ttest cons=food
 ttest command for independent groups with pooled (equal) variance: ttest cons, by(sexh)
 ttest command for independent groups using unequal variance: ttest cons, by(sexh) unequal
STATISTICAL TESTS
 correlate command
◦ The correlate command displays a matrix of Pearson correlations for the variables listed. E.g. correlate cons hhsize
 Correlation vs Causation
 Any two variables can be correlated without one being the cause of the other
corr(X,Z) = cov(X,Z) / sqrt(var(X) var(Z)) = σXZ / (σX σZ) = rXZ
• −1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = −1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association
• The correlation coefficient is unitless, so it avoids the problems of the covariance.
• corr(X,Z) when X & Z are measured in feet is the same as corr(X,Z) when X & Z are in meters or pounds
PRESENTING DATA WITH GRAPH
 The Stata graph commands begin with the word graph (in some cases this is optional). Examples:
◦ graph twoway: scatterplots, line plots, etc.
◦ graph bar: bar charts
◦ graph pie: pie charts
 Examples:
◦ graph twoway scatter cons food
 We can show the regression line predicting cons from food using the lfit plot type:
◦ twoway lfit cons food
 The two graphs can be overlaid like this:
◦ twoway (scatter cons hhsize) (lfit cons hhsize)
PRESENTING DATA WITH GRAPH
 Labeling graphs
scatter var1 var2, title("title") subtitle("subtitle") xtitle("xtitle") ytitle("ytitle") note("note")
 Example:
scatter ageh cons, title("title") subtitle("subtitle") xtitle("xtitle") ytitle("ytitle") note("note")
 Histograms and kernel density
◦ histogram cons
◦ histogram cons, normal
 Kernel density
◦ kdensity cons
◦ kdensity cons, normal
4. Regression Analysis Using Stata
 Steps in Empirical Analysis
 Structure of Economic Data
 Regression Models
◦ Assumptions and their violations
 Regression Analysis Using Stata
◦ Linear Models: Cross Section
◦ Linear Models: Panel Data
◦ Non-linear Models: Cross Section
 Reporting Regression Models
Steps in Empirical Analysis
 Empirical Analysis
• An empirical analysis uses data to test a theory or to estimate a relationship.
• The first step in any empirical analysis is the careful formulation of the question of interest.
• A literature review is an important step in any empirical analysis
• In some cases a formal economic model is constructed.
• An economic model consists of mathematical equations that describe various relationships: y = f(x1, x2, …)
• Formal economic modeling is the starting point for empirical analysis, but it is more common to use economic theory less formally, or even intuition
Steps in Empirical Analysis
• Then we need to turn the economic model into what we call an econometric model: yi = β0 + β1 xi1 + β2 xi2 + … + εi
• The form of the function must be specified before we can undertake an econometric analysis.
• We need to deal with variables that cannot reasonably be observed.
• We must somehow account for the many factors that we cannot even completely list
• Unobserved factors and errors in measurement can be accounted for using an error term or disturbance term
• Once an econometric model has been specified, various hypotheses of interest can be stated in terms of the unknown parameters
Structure of Economic Data
Data Management
• Structure of economic data
– Economic data sets come in a variety of types
– Some econometric methods can be applied with little or no
modification to many different kinds of data sets
– The special features of some data sets must be accounted for or
should be exploited
– We next describe the most important data structures encountered
in applied work
1. Cross-section
• Consists of a sample of individuals, households, firms, cities,
states, countries, or a variety of other units, taken at a given point
in time
• In a pure cross-section analysis we would ignore any minor timing
differences in collecting the data
Structure of Economic Data
• An important feature of cross-sectional data is that we can
often assume that they have been obtained by random
sampling from the underlying population, which simplifies
most of the analysis
• But there could be violations of the random sample
assumptions
– Refusal to respond by some group of the respondents
– Sampling from units that are large relative to the
population (the population is not large enough to
reasonably assume the observations are independent draws)
• Cross-sectional data are closely aligned with the applied
microeconomics fields, such as labor economics, state and
local public finance, industrial organization, urban
economics, demography, and health economics
Structure of Economic Data
2. Time-series
• A time series data set consists of observations on a variable or
several variables over time. Examples of time series data include
stock prices, money supply, consumer price index, gross domestic
product, annual homicide rates
• Because past events can influence future events and lags in
behavior are prevalent in the social sciences, time is an important
dimension in a time series data set
• The chronological ordering of observations in a time series conveys
potentially important information
• What makes time series more difficult to analyze than cross-
sectional data is the fact that economic observations can rarely, if
ever, be assumed to be independent across time
• Another feature of time series data that can require special attention
is the frequency at which the data are collected
Structure of Economic Data
3. Pooled cross-sections
• Some data sets have both cross-sectional and time series features
• Pooled cross-section is a combination of several cross-section data
that are collected from the same population in different time periods
• Pooling cross sections from different years is often an effective way
of analyzing the effects of new policies
• The idea is to collect data from the years before and after a key
policy change
4. Panel or longitudinal data
• A panel data (or longitudinal data) set consists of a time series for
each cross-sectional member in the data set
• Panel data can be collected on household, firms or geographical units
• The key feature of panel data that distinguishes it from a pooled cross
section is the fact that the same cross-sectional units (individuals,
firms, or counties) are followed over a given time period
Simple Linear Regression
 In this case we have only one regressor and
a constant: yi = β0 + β1xi + εi
The Gauss–Markov Assumptions
 There are assumptions about the error term
and the explanatory variables xi
 The so-called Gauss–Markov assumptions are
A1  E{εi} = 0,  i = 1, 2, ..., n
A2  {ε1, ..., εn} and {x1, ..., xn} are independent
A3  V{εi} = σ²,  i = 1, 2, ..., n
A4  cov{εi, εj} = 0,  i, j = 1, 2, ..., n, i ≠ j
Additional Assumptions
 The relationship of interest is linear
◦ Linearity is in parameters, not in variables
 Data are stationary (pertinent for time series
data)
◦ Distribution is the same over time
 Weak/covariance vs strong stationarity
 Data are random
 Survey design
 Nature of data
Properties of the OLS Estimator
 Under assumptions (A1)–(A4), the OLS
estimator b for β has the following properties:
◦ Unbiasedness: E{b} = β
◦ The OLS estimator b of β is the best estimator, i.e.
among the set of linear unbiased estimators, the OLS
estimator is the one with the least variance
◦ b is a linear function of the explanatory and the
dependent variables
◦ Hence, b is BLUE for β
Multiple Regression Analysis
 Multiple regression analysis is more
amenable to ceteris paribus analysis
 It allows us to explicitly control for many
other factors which simultaneously affect
the dependent variable:
yi = β0 + β1xi1 + β2xi2 + ... + βkxik + εi
 Multiple regression models can
accommodate many explanatory variables
Violations of GM Assumptions
A1  E{εi} = 0,  i = 1, 2, ..., n
A2  {ε1, ..., εn} and {x1, ..., xn} are independent
A3  V{εi} = σ²,  i = 1, 2, ..., n
A4  cov{εi, εj} = 0,  i, j = 1, 2, ..., n, i ≠ j
 Additional
 The relationship of interest is linear
 Data are stationary
 Data are random
Violations of GM Assumptions
 GM assumptions can be violated for a
variety of reasons
 Example 1: The assumption that there is
zero covariance between the error term and
one or more explanatory variables can be
violated due to:
◦ Omitted variable bias
◦ Measurement error
◦ Simultaneity
 Example 2: The linearity assumption may falter
◦ Most behavioral relationships are nonlinear
◦ The structure of the data may require us to use
nonlinear models
Violations of GM Assumptions
 Example 3: Cross-section (household) data are
usually heteroskedastic
 Example 4: Time series data are usually non-
stationary
 Example 5: There might be a selection problem
How to amend these violations
 Omitted variables
◦ Instrumental variables
◦ Proxy variables
◦ Simultaneous equations models: 2SLS
◦ Panel data
 Nonlinear models
◦ Use discrete choice models
◦ Corner solution outcomes
 Heteroskedasticity
◦ Use weighted least squares
◦ Use non-standard standard errors
 E.g. robust (White) standard errors
How to amend these violations
 Non-stationary time series
◦ Engle–Granger error-correction model
◦ Johansen approach
 Non-random sample
◦ Selection models
 E.g. Heckman selection model
 Remember that before trying to amend these
violations, we have to test for their existence in a
given data set.
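As a minimal illustration of such testing, heteroskedasticity can be checked after a regression and, if detected, robust standard errors used. The variables cons and hhsize follow the slides' examples:

```stata
* Fit the model, then run the Breusch-Pagan test for heteroskedasticity
regress cons hhsize
estat hettest

* If homoskedasticity is rejected, refit with robust (White) standard errors
regress cons hhsize, vce(robust)
```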
Simple Linear Regression Models: Cross
Section
 General format
regress depvar indepvars [if] [in] [weight] [, options]
 The regress command performs OLS
regression and yields an analysis-of-
variance table, goodness-of-fit statistics,
coefficient estimates, standard errors,
t statistics, p-values, and confidence intervals
 See examples
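A minimal sketch of such a call, using household variables (cons, hhsize, ageh) that appear elsewhere in the slides:

```stata
* OLS of consumption on household size and age of household head
regress cons hhsize ageh

* Restrict the sample with an if condition and use robust standard errors
regress cons hhsize ageh if hhsize > 1, vce(robust)
```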
Basic Format: Linear-Cross Section
◦ The xi prefix is used to dummy-code categorical
variables, and we tag these variables with an "i."
in front of each target variable
xi: regress cons hhsize i.q1a, robust
◦ By default, Stata selects the first category of the
categorical variable as the reference category. If
we would like to declare a certain category as the
reference category:
char q1a[omit] 7
xi: regress cons hhsize i.q1a, robust
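In recent Stata versions the xi: prefix is not needed: factor-variable notation expands categorical variables directly, and the ib#. operator sets the base (reference) category. A sketch using the slides' variables:

```stata
* i.q1a expands q1a into indicator variables automatically
regress cons hhsize i.q1a, vce(robust)

* ib7.q1a makes category 7 the reference category
regress cons hhsize ib7.q1a, vce(robust)
```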
Basic Format: Linear-Panel
 Basic format for linear panel data
 xtreg depvar indepvars [if] [in] [weight] [, options]
 Two things to be noted before running panel-data
regression models:
 The dataset should be in long form (not the wide form,
which is the default after merging two or more
datasets)
 Use reshape long varlist, i(identifier) j(time
variable)
 The panel structure should be declared with
xtset panelvar timevar
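The steps above can be sketched end to end: reshape to long form, declare the panel, then estimate. The variable names hhid, year, cons, and hhsize are illustrative assumptions, not taken from a specific dataset:

```stata
* Wide data: one row per household, with columns cons2010, cons2012, ...
* reshape long stacks them into one cons column indexed by year
reshape long cons, i(hhid) j(year)

* Declare the panel structure: hhid identifies units, year the time dimension
xtset hhid year

* Fixed-effects and random-effects linear panel regressions
xtreg cons hhsize, fe
xtreg cons hhsize, re
```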
Reporting Regression Outputs
 One can present regression outputs in the
format that we see in journal articles etc.
 To do that:
◦ Run the regressions and store their results separately
◦ Use estimates store to do this
◦ Combine the results using the estimates table
command
◦ See examples
◦ If one would like to report coefficients of only
selected explanatory variables, use the
keep(varlist) option
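A minimal sketch of this workflow (the model names m1 and m2 are arbitrary; the variables follow the slides' examples):

```stata
* Fit two specifications and store each set of estimates
regress cons hhsize
estimates store m1

regress cons hhsize ageh
estimates store m2

* Side-by-side table with standard errors, sample size, and R-squared,
* reporting only the listed coefficients
estimates table m1 m2, se stats(N r2) keep(hhsize ageh)
```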
Many Thanks for Your
Attention and Effort