Chapter 2-4

Computer Application in Agribusiness and Value Chain Management (AgEc 321)

By Gebisa Diriba

Chapter Two

Introduction to Statistical Package

What is a statistical package?

• Statistical packages are collections of software designed to aid in statistical analysis and data exploration.
• The vast majority of quantitative and statistical analysis relies upon statistical packages.

Software package
The term may refer to one or more of the following:
• A piece of application software or utility software.
– Application software: VLC media player, Microsoft Word, Google Chrome, accounting applications, photo editors, etc.
– Utility software: Avast antivirus, disk tools, backup software, etc.
• A software suite, i.e. a collection of related application or utility software.
• A package or module: a software component for accomplishing a particular task.
• A package in a package management system: a file used by the package manager to install an application or library.

Application software

• Application software is all the computer software that causes a computer to perform useful tasks beyond the running of the computer itself (in contrast with, for example, computer viruses).
– Examples of application software include enterprise software, accounting software, office suites, graphics software and media players.
• Many application programs deal principally with documents.
• Sometimes a new and popular application arises that runs on only one platform, increasing the desirability of that platform. This is called a killer application.

Comprehensive vs. specialized software packages

• Specialized software is software written for a specific task rather than for a broad application area.
• These programs provide facilities specifically for the purpose for which they were designed.
• Statistical software consists of specialized computer programs for statistical analysis.

Features of packages (Stata, SPSS and EViews)
• Stata is a general-purpose and comprehensive statistical software package created in 1985 by StataCorp.
• It is used by many businesses and academic institutions around the world. Most of its users work in research, especially in the fields of economics, sociology, political science, biomedicine and epidemiology.
• Stata's capabilities include data management, statistical analysis, graphics, simulations, and custom programming.
• The name Stata is a portmanteau of the words statistics and data.

Features of packages (Stata, SPSS and EViews)…

There are four major builds of each version of Stata:
• Stata/MP for multiprocessor computers (including dual-core and multicore processors);
• Stata/SE for large databases;
• Stata/IC, which is the standard version; and
• Small Stata, which is a smaller student version available for educational purchase only.

Features of packages (Stata, SPSS and EViews)…
SPSS
• Originally an acronym for Statistical Package for the Social Sciences, it now stands for Statistical Product and Service Solutions.
• It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations, and others.
• In addition to statistical analysis, data management (case selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is stored in the data file) are features of the base software.
• SPSS differs from other statistical packages in its modular construction.
• SPSS can, in a sense, be customized to any particular data file or analysis application.

Features of packages (Stata, SPSS and EViews)…

EViews:
• EViews (Econometric Views) is a statistical package for Windows, used mainly for time-series oriented econometric analysis.
• It is developed by Quantitative Micro Software (QMS).

1.2. Supports in Software Packages

a. Support for ANOVA

Product | One-way | Two-way | MANOVA | GLM | Mixed model
EViews | Yes
GAUSS | No | No | No
GenStat | Yes | Yes | Yes | Yes | Yes
gretl
LIMDEP | Yes | Yes | Yes | Yes | Yes
Mathematica | Yes | Yes | Yes | Yes
MATLAB+Statistics Toolbox | Yes | Yes | Yes | Yes | Yes
Minitab | Yes | Yes | Yes | Yes
NLOGIT | Yes | Yes | Yes | Yes | Yes
R | Yes | Yes | Yes | Yes | Yes
SAS | Yes | Yes | Yes | Yes | Yes
SHAZAM | Yes | Yes | No | Yes
Stata | Yes | Yes | Yes | Yes | Yes
Statgraphics | Yes | Yes | Yes | Yes
STATISTICA | Yes | Yes | Yes | Yes
SPSS | Yes | Yes | Yes | Yes
SYSTAT | Yes | Yes | Yes | Yes

b. Support for regression models

Product | OLS | WLS | 2SLS | Logistic | GLM | Stepwise | Probit | Poisson
EViews | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
GenStat | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
gretl | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes
LIMDEP | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Mathematica | Yes | Yes | Yes | Yes | Yes | Yes
MATLAB+Statistics Toolbox | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Minitab | Yes | Yes | No | Yes | No | Yes
NLOGIT | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
R | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
RATS | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
SAS | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
SHAZAM | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
SPSS | Yes | Yes | Yes | Yes | Yes | Yes
Stata | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Statgraphics | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes
STATISTICA | Yes | Yes | Yes | Yes | Yes | Yes
SYSTAT | Yes | Yes | Yes | Yes | Yes | Yes

c. Support for time series methods

Product | ARIMA | GARCH | Unit root test | Cointegration test | VAR | Multivariate GARCH
EViews | Yes | Yes | Yes | Yes | Yes | Yes
GAUSS | Yes | Yes | Yes | Yes
gretl | Yes | Yes | Yes | Yes | Yes
LIMDEP | Yes | Yes | Yes | Yes | Yes | No
Mathematica | Yes | Yes | Yes | Yes | Yes
MATLAB+Econometrics Toolbox | Yes | Yes | Yes | Yes | Yes
Minitab | Yes | No | No | No | No
NLOGIT | Yes | Yes | Yes | Yes | Yes | No
R | Yes | Yes | Yes | Yes | Yes | Yes
RATS | Yes | Yes | Yes | Yes | Yes | Yes
SAS | Yes | Yes | Yes | Yes | Yes | Yes
SHAZAM | Yes | Yes | Yes | Yes | Yes | No
Stata | Yes | Yes | Yes | Yes | Yes | Yes
Statgraphics | Yes | No | No | No | No
STATISTICA | Yes | No | No | No | No
SPSS | Yes
SYSTAT | Yes

d. Support for charts and diagrams

Software | Bar chart | Box plot | Correlogram | Histogram | Line chart | Scatterplot
EViews | Yes | Yes | Yes | Yes | Yes | Yes
GAUSS | Yes | Yes | Yes | Yes | Yes
GenStat | Yes | Yes | Yes | Yes | Yes | Yes
gretl | Yes | Yes | Yes | Yes | Yes | Yes
LIMDEP | Yes | Yes | Yes | Yes | Yes | Yes
Mathematica | Yes | Yes | Yes | Yes | Yes
MATLAB+Statistics Toolbox | Yes | Yes | Yes | Yes | Yes | Yes
Minitab | Yes | Yes | Yes | Yes | Yes | Yes
NLOGIT | Yes | Yes | Yes | Yes | Yes | Yes
R | Yes | Yes | Yes | Yes | Yes | Yes
RATS | Yes | Yes | Yes | Yes | Yes | Yes
SAS | Yes | Yes | Yes | Yes | Yes | Yes
SHAZAM | Yes | Yes | Yes | Yes | Yes | Yes
Stata | Yes | Yes | Yes | Yes | Yes | Yes
Statgraphics | Yes | Yes | Yes | Yes | Yes | Yes
STATISTICA | Yes | Yes | Yes | Yes | Yes | Yes
SPSS | Yes | Yes | Yes | Yes | Yes | Yes
SYSTAT

Chapter Three

Data and Data Management in the Packages

Data

• Most economic data is observational, or non-experimental, data (as distinct from experimental data generated under controlled experimental conditions).

Cross sectional data

• A cross-sectional data set consists of a sample of individuals, households, firms, cities, states, countries or a variety of other units, taken at a given point in time.
– Examples include the Census of population or manufactures, or a poll or survey.

Time series data

• A time series data set consists of observations on a variable or several variables over time.
– Examples include GNP, employment rates and money supply, collected daily, weekly, monthly or yearly over a period of time.
• Time series data is one of the important types of data used in empirical analysis.
• In time series analysis, we analyze the past behavior of a variable in order to predict its future behavior.

Panel data

• A panel data set consists of a time series for each cross-sectional member in the data set.

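To make the three data structures concrete, the following is a minimal sketch of how a panel could be declared in Stata. The variable names (farm_id, year, yield) and all values are hypothetical and exist only for illustration; they are not part of the course data.

```stata
* hypothetical toy panel: 2 farms observed in 3 years (values are made up)
clear
input farm_id year yield
1 2019 20
1 2020 22
1 2021 25
2 2019 15
2 2020 18
2 2021 17
end

* declare the panel structure: farm_id is the cross-sectional unit, year is time
xtset farm_id year

* show each farm's time series as a separate block
list, sepby(farm_id)
```

Keeping only one year would leave a cross-section, and keeping only one farm would leave a single time series.
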
About
• Stata is a modern, general-purpose, command-driven package for statistical analysis, data management and graphics.
• Stata provides commands to analyze panel data (cross-sectional time-series, longitudinal, repeated-measures, and correlated data), cross-sectional data, time-series data, survival-time data, cohort studies, …
• Stata is user friendly.
• Stata has an extraordinary set of reference books.
• Stata has internet capabilities (installing new features, updating).

About

• Stata is an easy to use but powerful data analysis software package that features strong capabilities for:
– Statistical analysis
– Data management and manipulation
– Data visualization

About Stata…
• It is used by economists, social scientists, medical researchers, biostatisticians, epidemiologists, sociologists, political scientists, geographers, psychologists, and other research professionals who need to analyse data.
• Stata lets you easily download programs developed by other users and create your own Stata programs that seamlessly become part of Stata.

The Basic Features of Stata
2.1. The Stata windows
• You can open Stata in the same way as most other software packages: by clicking on its icon or menu item.
• When you open Stata, you should see the following screen.
• The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands.
• Some of them open automatically when you start Stata, while others can be opened using the Window pull-down menu or the buttons on the tool bar.

Stata windows…

• When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5 windows (the number of windows open depends on which windows were open the last time Stata was used).
• Each is described briefly below.

COMMAND WINDOW

• You can enter commands directly into the Command window
• This command will load a Stata dataset over the internet
• Go ahead and enter the command

VARIABLES WINDOW

• Once you have data loaded, variables in the dataset will be listed with their labels in the order they appear in the dataset
• Clicking on a variable name will cause its description to appear in the Properties window
• Double-clicking on a variable name will cause it to appear in the Command window

PROPERTIES WINDOW

• The Variables section lists information about the selected variable
• The Data section lists information about the entire dataset

REVIEW WINDOW

• The Review window lists previously issued commands
• Successful commands will appear black
• Unsuccessful commands will appear red
• Double-click a command to run it again
• Hitting PageUp will also recall previously used commands

WORKING DIRECTORY

• At the bottom left of the Stata window is the address of the working directory
• Stata will load files from and save files to this directory, unless another directory is specified
• Use the command cd to change the working directory

The working directory…
• The working directory is the place where graphs and Stata datasets will be saved unless you specify another directory.
• By default, Stata assumes all files are in c:\data.
• To change this working directory, type:
cd foldername
• If the folder name contains blanks, it must be enclosed in quotation marks.

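As an illustration of cd with quoted paths, here is a minimal sketch intended to be run from a do-file. It moves to Stata's temporary directory (the system value c(tmpdir), chosen here only because it is assumed to exist on any installation) and then returns to the starting folder; in practice you would substitute your own project folder.

```stata
* show the current working directory
pwd

* remember where we started (c(pwd) holds the current directory)
local start "`c(pwd)'"

* change directory; the quotes protect any blanks in the path
cd "`c(tmpdir)'"
pwd

* return to the original working directory
cd "`start'"
```
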
STATA MENUS

• Almost all Stata users use syntax to run commands rather than point-and-click menus
• Nevertheless, Stata provides menus to run most of its data management, graphical, and statistical commands
• Example: two ways to create a histogram

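The two routes to a histogram can be sketched as follows. The example assumes the auto dataset that ships with Stata (loaded with sysuse), which is not the dataset used in these slides.

```stata
* load an example dataset bundled with Stata (an assumption for this sketch)
sysuse auto, clear

* Way 1 (syntax): type the command
histogram mpg, frequency

* Way 2 (menus): choose Graphics > Histogram, pick mpg as the variable,
* and click OK; Stata then issues an equivalent histogram command for you
```
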
Stata menus…

• Stata Browser: to view the data file (needs to be opened)
• Stata Editor: to edit the data file (needs to be opened)
• Stata Do-file Editor: to write, execute, save or edit a program (needs to be opened)
• A Stata program (or do-file) is simply a set of Stata commands written by the user.

Stata windows…
• Stata Results window: the big window that shows recent commands and their output.
• It includes:
– The commands that you run
– The results you obtain
– Error messages
– The help system, active links to Stata web pages, and further output

DO-FILES ARE SCRIPTS OF COMMANDS

• Stata do-files are text files where users can store and run their commands for reuse, rather than retyping the commands into the Command window
• Reproducibility
• Easier debugging and changing commands
• We recommend always using a do-file when using Stata
• The file extension .do is used for do-files

OPENING THE DO-FILE EDITOR

• Use the command doedit to open the do-file editor
• Or click on the pencil and paper icon on the toolbar
• The do-file editor is a text file editor specialized for Stata

SYNTAX HIGHLIGHTING

• The do-file editor colors Stata commands blue
• Comments, which are not executed, are usually preceded by * and are colored green
• Words in quotes (file names, string values) are colored red
• Stata 16 includes an enhanced editor with tab auto-completion for Stata commands and previously typed words

RUNNING COMMANDS FROM THE DO-FILE

• To run a command from the do-file, highlight part or all of the command, and then hit Ctrl+D (Mac: Shift+Cmd+D) or the "Execute (do)" icon, the rightmost icon on the do-file editor toolbar
• Multiple commands can be selected and executed

COMMENTS

• Comments are not executed, so they provide a way to document the do-file
• Comments are either preceded by * or surrounded by /* and */
• Comments will appear in green in the do-file editor

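The comment styles can be sketched in a short do-file. The example assumes Stata's bundled auto dataset; the // form is an additional comment style that Stata accepts in do-files and is not mentioned in the slide.

```stata
* a line starting with an asterisk is a comment
sysuse auto, clear

/* a block comment can span
   several lines */
summarize mpg

summarize price   // an end-of-line comment (do-files only)
```
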
LONG LINES IN DO-FILES

• Stata will normally assume that a newline signifies the end of a command
• You can extend commands over multiple lines by placing /// at the end of each line except for the last
• Make sure to put a space before ///
• When executing, highlight each line in the command(s)

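A minimal sketch of line continuation, again assuming the bundled auto dataset. The three physical lines below form one summarize command.

```stata
sysuse auto, clear

* one command split over three lines; note the space before each ///
summarize price mpg weight ///
    if foreign == 1,       ///
    detail
```

If only the first line is highlighted when executing, Stata receives an incomplete command, which is why every line of the command must be selected.
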
Ways of running Stata program

• Most operations can be accomplished either via the pull-down menu system or directly via typed commands. These are the two common ways to operate Stata.
• Interactive mode (command): commands can be typed directly into the Command window and executed by pressing Enter.
• Batch mode (do-file): commands can be written in a separate file (called a do-file) and executed together in one step.

Ways of running Stata program…

• Drop-down menu: one can also execute many commands using the drop-down menus.
• Keyboard command (shortcuts): with shortcuts, the commands are not typed. Keyboard commands may be even quicker to use than the buttons.

Ways of running Stata program…

In Stata version 14.0, we can use the following shortcuts to open the different windows of Stata.
• Ctrl+1 = Command
• Ctrl+2 = Results
• Ctrl+3 = Review
• Ctrl+4 = Variables
• Ctrl+5 = Properties
• Ctrl+6 = Top graph
• Ctrl+7 = New viewer/help
• Ctrl+8 = Data editor
• Ctrl+9 = New Do-file Editor

Stata toolbar

• At the top of the screen is an icon bar menu. Some of the menu items will be familiar whereas others are more specific to Stata.
• You can see a description of what each item does by running your mouse pointer over it.

Stata's interface
• Stata commands can be typed by hand (single commands in the Stata Command window, or multiple commands or programs in the Do-file Editor), but most Stata commands can also be accessed in a point-and-click manner by pulling down Stata's menus and selecting items that invoke dialog boxes.
• Stata's Data, Graphics, and Statistics menus provide point-and-click access to almost all Stata commands.

OK, Cancel, Submit, ? and R

Most dialog boxes in Stata provide the same five buttons that you see at the bottom of the summary statistics dialog box: OK, Cancel, Submit, ? and R.
• Cancel: dismisses the dialog box without doing anything.
• OK: dismisses the dialog box and issues a Stata command based on how you have filled out the fields in the dialog box.
• Submit: issues a Stata command just like OK but leaves the dialog box on the screen so that you can make changes and issue another command.
• ?: provides access to Stata's help system (it takes you to the help file on the Stata command associated with the dialog box, in this case the summarize help file).
• R: is the "reset" button. Each time you open a dialog box, it remembers how you last filled it out; R resets its fields to their default values.

Setting memory size

• In older versions of Stata (Stata 11 and earlier), you will probably need to set the amount of memory allocated to Stata before loading your dataset. From Stata 12 onward, memory is managed automatically and set memory is ignored.
• Stata achieves a higher processing speed by holding data in memory while performing calculations (as opposed to accessing it from the hard disk).
• This means that the size of the dataset you can load into Stata is limited by the amount of memory allocated.

Setting memory size…

• An error message will appear if you attempt to load datasets larger than your allocated memory.
• The default memory allocated is roughly one megabyte.
• For most datasets, it will probably be necessary to increase this allocation.

Setting memory size…

• A memory size of 16 MB or slightly higher will be enough for most purposes.
• We will set the memory to 20 MB by typing a simple instruction into the Command window at the bottom of the screen and hitting the Return key:
set mem 20m

Setting memory size…

• If you want to keep the memory set to 20 megabytes permanently (until you instruct Stata otherwise), type:
set memory 20m, permanently
• Note the comma: a comma separates the options from the rest of the command.
• This is a general syntax feature for specifying options in Stata commands.

The help system: Troubleshooting and Update

• The help command followed by a Stata command brings up the on-line help system for that command.
• It can be used from the command line or from the help window.
• With help you must spell the full name of the command completely and correctly.
• The help command gives you information about any Stata command or topic:
help [command]

The help system…

• For example, help tabulate gives a description of the tabulate command
• help summarize gives a description of the summarize command
• The built-in Stata help function is very elaborate, reliable and… helpful.

The help system…

• We illustrate its function with an example.
1. Choose Help from the main menu bar and select Search…
2. Select Search documentation and FAQs.
3. Enter data and click OK or press Enter.
4. Stata will now search for all references to "data" among the Stata commands, the Reference manuals, the User's Guide, the Stata Journal, the Stata Technical Bulletin, and the FAQs on Stata's website.

The help system…
• The search result is presented on the following pages.
help summary
• help contents lists all the commands that can be accessed using the help command.
help contents
• The search command looks for the term in help files, Stata Technical Bulletins and Stata FAQs.
• It can be used from the command line or from the help window.

The help system…

search logit
• The findit command can be used to search the Stata site and other sites for Stata-related information, including ado-files.
• Say that we are interested in panel data; we search for it from within Stata by typing
findit panel data
• The Stata Viewer window appears and shows a number of resources related to this keyword.

The help system…

• Stata is composed of an executable file and official ado-files.
• Ado stands for automatically loaded do-file.
• An ado-file defines a Stata command. Besides the official ado-files, users like you can write and share their own.
• Once installed on your computer, they work in much the same way as built-in Stata commands.

Update

• The update command reports on the current update level and installs official updates to Stata.
• It helps users keep up to date with the latest Stata ado-files and executable file, and it copies and installs the ado-files into the directory specified.
update
update ado, into(d:\ado)
• Stata files are regularly updated. It is important to make sure that you are always running the most up-to-date Stata, so please update regularly.

Stata Language Syntax

• Stata commands share a common syntax:
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight=exp] [, options]
• With few exceptions, the basic language syntax is
[prefix :] command [varlist] [=exp] [if] [in] [weight] [using filename] [, options]

Stata Language Syntax…

• Items inside square brackets are either optional or not available for every command.
• This syntax applies to all Stata commands.
• To use the by prefix, the dataset must first be sorted on the by variable(s).
• The by prefix repeats a Stata command on subsets of the data.

Stata Language Syntax…

• The meanings of the various parts of the command are:
• by: this prefix requests that the command be executed repeatedly for each distinct value or combination of values of the variable(s) in the list that follows by.
• For example, if the variable gender has values M and F, then adding the prefix by gender: requests that the command be executed twice: once for gender M, and again for gender F.

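The by prefix can be sketched with Stata's bundled auto dataset (an assumption for this example; the slides' own gender variable is not available here). The variable foreign takes two values, so the command runs twice.

```stata
sysuse auto, clear

* the data must be sorted on the by variable first
sort foreign
by foreign: summarize mpg

* bysort does the sorting and the by-group execution in one step
bysort foreign: summarize mpg
```
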
Stata Language Syntax…

• varlist: denotes a list of variable names.
• Stata variable names can be up to 32 characters long, must start with a letter (or an underscore), and can contain letters, numbers and the underscore (_).
• Variable names are CASE-SENSITIVE!
• The variable list following by specifies the by variables (see above); the list following the command specifies the analysis variables for the command.

Stata Language Syntax…
• command: denotes a Stata command, and is the only part of the general form that is always required.
• Specific commands may require additional parameters.
• if exp: a logical expression used to select the observations to be used by the command.

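To connect the general form to a concrete command, the sketch below lines up one summarize call against the syntax parts. It assumes the bundled auto dataset; the particular condition and range are arbitrary choices for illustration.

```stata
sysuse auto, clear

* command   varlist     if exp            in range   , options
  summarize price mpg   if foreign == 0   in 1/50    , detail
```

Only summarize itself is required; dropping any of the other parts still leaves a valid command.
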
Getting Data into Stata, Loading and Saving data in Stata format

• Stata data consists of observations and variables, corresponding to rows and columns respectively in a spreadsheet-like arrangement.
• You can only work on one Stata dataset at a time.
• The data you are working on is stored in memory.

Getting Data into Stata…

• There are various ways to enter data into Stata; the choice depends on the nature of the input data:
– Manual entry by typing
– Copying and pasting data into the Data Editor
– Inputting ASCII files using infile, insheet or infix
– Using other software to create a new Stata dataset directly from another format (e.g. SAS, SPSS)

Getting Data into Stata…

• The most important ones are:
– Manual entry using the Data Editor
– Reading data exported from Excel as comma- or tab-delimited text (using the insheet command)
– Copy and paste
– Opening already saved data

Getting Data into Stata…
1. Using the Data Editor
• The Data Editor can be used to:
– enter new data (choose the cell, type the value, and press Enter or Tab),
– view existing data, or
– edit existing data.
• If you have a dataset in memory, it is displayed in the editor.
• Otherwise you get a blank editor screen.
• To open the Data Editor:
– click the Data Editor button on the tool bar, or
– type the command edit in the Command window and press Enter.
• The editor is like a spreadsheet.

Getting Data into Stata…

• Transferring other files into Stata format
• If the data are in another format (e.g. SAS, SPSS), Stat/Transfer can be used to create a Stata dataset directly.
• It can also handle Excel files.

Creating, renaming and labeling variables

Example 1
• Prac1: enter the Prac1 data into Stata manually.
• Prac2: create the Prac2 data in Excel, then copy and paste it into a new Data Editor.
• By default, Stata names the columns var1, var2, ..., var5 as you enter data.
• We would like to rename the variables so that they match the column titles from our dataset, give the variables descriptive labels, and change their formatting.

Creating, renaming and labeling variables

Steps for renaming and labeling a variable (for example, var1)

• 1. Double-click var1 and, in the Properties window, replace var1 with the exact name (gender in this case).
• 2. Press Tab to change focus to the Label field.
• 3. Enter a meaningful label (gender of the household head).
• 4. These are all the changes for the variable, so click the Apply button to make the changes stick.
• Now var1 has become gender and its label has changed to gender of the household head.
• Apply the same procedure to var2 and var3 to change their names to fexp and mstat and their labels to farming experience and marital status.
• Finally, save the dataset as Practical 1 data entry.

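The same renaming and labeling can be done with commands instead of the Properties window. The following is a minimal sketch that types in the first three rows of the practice data and then applies rename and label variable; the file name in the final save is a hypothetical choice.

```stata
* enter a few rows by hand, with Stata-style default names
clear
input var1 var2 str11 var3
1 5 "married"
0 6 "not married"
0 6 "married"
end

* rename the variables to match the column titles
rename var1 gender
rename var2 fexp
rename var3 mstat

* attach descriptive labels
label variable gender "gender of the household head"
label variable fexp   "farming experience"
label variable mstat  "marital status"

describe

* save under a hypothetical file name in the working directory
* save practical1, replace
```
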
Prac1 and Prac2 data
Practical 1: data entry into Stata. Practical 2: data copied and pasted from Excel into Stata.
1 5 married
0 6 not married
0 6 married
0 7 not married
0 7 married
1 8 not married
1 7 married
1 not married
1 married
0 12 not married

IMPORTING DATA

use | load Stata dataset
save | save Stata dataset
clear | clear dataset from memory
import excel | import Excel dataset
import delimited | import delimited data (csv)

STATA .dta FILES

• Data files stored in Stata's format are known as .dta files
• Remember that coding files are "do-files" and usually have a .do extension
• Double-clicking on a .dta file in Windows will open the data in a new instance of Stata (not in the current instance)
• Be careful about having many instances of Stata open

LOADING AND SAVING .dta FILES

• The command use loads Stata .dta files
• Usually these will be stored on a hard drive, but .dta files can also be loaded over the internet (using a web address)
• Use the command save to save data in Stata's .dta format
• The replace option will overwrite an existing file with the same name (without replace, Stata won't save if the file exists)
• The extension .dta can be omitted when using use and save

* load data over internet
use https://stats.idre.ucla.edu/stat/data/hs0

* save data, replace if it exists
save hs0, replace

CLEARING MEMORY

• Because Stata will only hold one data set in memory at a time, memory must be cleared before new data can be loaded
• The clear command removes the dataset from memory
• Data import commands like use will often have a clear option which clears memory before loading the new dataset

* clear data from memory
clear

* load data but clear memory first
use https://stats.idre.ucla.edu/stat/data/hs0, clear

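The save, clear and use cycle can be tried without an internet connection. This is a minimal sketch intended for a do-file; it assumes the bundled auto dataset and writes to a temporary file (created with tempfile) rather than to a named file in the working directory.

```stata
sysuse auto, clear

* reserve a temporary file name; Stata deletes the file when the do-file ends
tempfile mydata
save "`mydata'"

* remove the dataset from memory
clear
describe

* load the saved copy again; the clear option would drop any data in memory
use "`mydata'", clear
describe
```
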
IMPORTING EXCEL DATA SETS

• Stata can read in data sets stored in many other formats
• The command import excel is used to import Excel data
• An Excel filename is required (with path, if not located in the working directory) after the keyword using
• Use the sheet() option to open a particular sheet
• Use the firstrow option if variable names are on the first row of the Excel sheet
• Example:

* import excel file; change path below before executing
import excel using "C:\path\myfile.xlsx", sheet("mysheet") firstrow clear

IMPORTING .csv DATA SETS

• Comma-separated values files are also commonly used to store data
• Use import delimited to read in .csv files (and files delimited by other characters such as tab or space)
• The syntax and options are very similar to import excel
• But there is no need for the sheet() or firstrow options (the first row is assumed to contain variable names in .csv files)

* import csv file; change path below before executing
import delimited using "C:\path\myfile.csv", clear

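Because the paths above must be edited before they run, the following sketch shows a round trip that works as written from a do-file. It assumes the bundled auto dataset and uses export delimited, a companion command not covered in the slides, to create a temporary comma-separated file that import delimited then reads back.

```stata
sysuse auto, clear

* write the data to a temporary comma-separated file
tempfile csvfile
export delimited using "`csvfile'", replace

* read it back; variable names are taken from the first row
import delimited using "`csvfile'", clear
describe
```
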
USING THE MENU TO IMPORT EXCEL AND .CSV DATA

• Because path names can be very long and many options are often needed, menus are often used to import data
• Select File -> Import and then either "Excel spreadsheet" or "Text data (delimited, *.csv, …)"
• Practical 3: import the Prac2 data from Excel into Stata

PREPARING DATA FOR IMPORT

• To get data into Stata cleanly, make sure the data in your Excel file or .csv file have the following properties
• Rectangular
– Each column (variable) should have the same number of rows (observations)
• No graphs
• Missing data should be left as blank fields
• Variable names should contain only alphanumeric characters or _ or .
• Make as many variables numeric as possible
– Many Stata commands will only accept numeric variables

CHAPTER FOUR
EXPLORING DATA IN THE PACKAGE

VIEWING DATA

browse | open spreadsheet of data
list | print data to Stata console

SEMINAR DATASET

• We will use a dataset consisting of 200 observations (rows) and 13 variables (columns)
• Each observation is a student
• Variables
– Demographics: gender (1=male, 2=female), race, ses (low, middle, high), etc.
– Academic test scores: read, write, math, science
• Go ahead and load the dataset!

* download seminar dataset from the link
use https://stats.idre.ucla.edu/stat/data/hs0, clear

BROWSING THE DATASET

• Once the data are loaded, we can view the dataset as a spreadsheet using the command browse
• The magnifying glass with spreadsheet icon also browses the dataset
• Black columns are numeric, red columns are strings, and blue columns are numeric with string labels

LISTING OBSERVATIONS

• The list command prints observations to the Stata console
• Simply issuing "list" will list all observations and variables
– Not usually recommended except for small datasets
• Specify variable names to list only those variables
• We will soon see how to restrict to certain observations

* list read and write for first 5 observations
list read write in 1/5

     +--------------+
     | read   write |
     |--------------|
  1. |   57      52 |
  2. |   68      59 |
  3. |   44      33 |
  4. |   63      44 |
  5. |   47      52 |
     +--------------+

SELECTING OBSERVATIONS

in | select by observation number
if | select by condition

SELECTING BY OBSERVATION NUMBER WITH in

• Many commands are run on a subset of the data set observations
• in selects by observation (row) number
• Syntax
– in firstobs/lastobs
– 30/100: observations 30 through 100
• Negative numbers count from the end
• "L" means last observation
– -10/L: tenth observation from the last through last observation

* list science for last 3 observations
li science in -3/L

     +---------+
     | science |
     |---------|
198. |      55 |
199. |      58 |
200. |      53 |
     +---------+

SELECTING BY CONDITION WITH if

• if selects observations that meet a certain condition
– gender == 1 (male)
– math > 50
• The if clause is usually placed after the command specification, but before the comma that precedes the list of options

* list gender, ses, and math if math > 70
* with clean output
li gender ses math if math > 70, clean

       gender      ses   math
 13.        1     high     71
 22.        1   middle     75
 37.        1   middle     75
 55.        1   middle     73
 73.        1   middle     71
 83.        1   middle     71
 97.        2   middle     72
 98.        2     high     71
132.        2      low     72
164.        2      low     72

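The in and if qualifiers can also be combined in a single command. A minimal sketch, assuming Stata's bundled auto dataset rather than the seminar dataset:

```stata
sysuse auto, clear

* first five observations
list make mpg in 1/5

* last three observations
list make mpg in -3/L

* only observations meeting a condition, with compact output
list make price mpg if mpg > 30, clean

* both at once: among the first 20 rows, keep those with mpg above 20
list make mpg in 1/20 if mpg > 20
```
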
STATA LOGICAL AND RELATIONAL OPERATORS

• == equal to
– double equals is used to check for equality
• <, >, <=, >= less than, greater than, less than or equal to, greater than or equal to
• ! not
• != not equal
• & and
• | or

* browse gender, ses, and read
* for females (gender=2) who have read > 70
browse gender ses read if gender == 2 & read > 70

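A minimal sketch of the operators in use, assuming the bundled auto dataset. It relies on count, a command not introduced in the slides, which reports how many observations satisfy a condition. The last two lines illustrate a known Stata behavior: numeric missing values are treated as larger than any number, so a greater-than test also matches them.

```stata
sysuse auto, clear

count if foreign == 1 & mpg > 25     // and: both conditions hold
count if foreign == 1 | mpg > 25     // or: at least one holds
count if foreign != 1                // not equal
count if !(mpg > 25)                 // not: negates the condition

* rep78 has missing values in this dataset
count if rep78 > 4                   // also counts missing rep78
count if rep78 > 4 & !missing(rep78) // excludes missing values
```
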
EXPLORING DATA

codebook | inspect variable values
summarize | summarize distribution
tabulate | tabulate frequencies

EXPLORE YOUR DATA BEFORE
ANALYSIS

• Take the time to explore your data set before embarking on analysis
• Get to know your sample with quick summaries of variables
• Demographics of subjects
• Distributions of key variables
• Look for possible errors in variables

USE codebook TO INSPECT VARIABLE VALUES

For more detailed information about the values of each variable, use codebook, which provides the following:
• For all variables
– number of unique and missing values
• For numeric variables
– range, quantiles, means and standard deviation for continuous variables
– frequencies for discrete variables
• For string variables
– frequencies
– warnings about leading and trailing blanks

* inspect values of variables read and prgtype
codebook read prgtype

------------------------------------------------
read                               reading score
------------------------------------------------
          type: numeric (float)
         range: [28,76]          units: 1
 unique values: 30           missing .: 0/200
          mean: 52.23
      std. dev: 10.2529
   percentiles: 10%  25%  50%  75%  90%
                 39   44   50   60   67
------------------------------------------------
prgtype                              (unlabeled)
------------------------------------------------
          type: string (str8)
 unique values: 3           missing "": 0/200
    tabulation: Freq.  Value
                  105  "academic"
                   45  "general"
                   50  "vocati"

SUMMARIZING CONTINUOUS
VARIABLES

• The summarize command calculates a variable’s:
  • number of non-missing observations
  • mean
  • standard deviation
  • min and max

* summarize continuous variables
summarize read math

    Variable |   Obs       Mean    Std. Dev.   Min   Max
-------------+-------------------------------------------
        read |   200      52.23    10.25294     28    76
        math |   200     52.645    9.368448     33    75

* summarize read and math for females
summarize read math if gender == 2

    Variable |   Obs       Mean    Std. Dev.   Min   Max
-------------+-------------------------------------------
        read |   109   51.73394    10.05783     28    76
        math |   109    52.3945    9.151015     33    72
DETAILED SUMMARIES

• Use the detail option with summarize to get more estimates
  that characterize the distribution, such as:
  • percentiles (including the median at 50th percentile)
  • variance
  • skewness
  • kurtosis

* detailed summary of read for females
summarize read if gender == 2, detail

                       reading score
------------------------------------------------------------
      Percentiles      Smallest
 1%           34             28
 5%           36             34
10%           39             34       Obs                 109
25%           44             35       Sum of Wgt.         109

50%           50                      Mean           51.73394
                        Largest       Std. Dev.      10.05783
75%           57             71
90%           68             73       Variance         101.16
95%           68             73       Skewness       .3234174
99%           73             76       Kurtosis       2.500028
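The numbers printed by summarize can also be reused in later commands, because summarize leaves them behind as stored results in r(). The sketch below assumes the built-in auto data set rather than the test-score data used on the slides; the scalar name iqr is illustrative.

```stata
sysuse auto, clear

* summarize leaves its results behind in r()
summarize mpg
display "mean = " r(mean) ", sd = " r(sd)

* with detail, percentiles, variance, skewness and kurtosis are stored too
summarize mpg, detail
scalar iqr = r(p75) - r(p25)
display "median = " r(p50) ", IQR = " iqr

* show everything the last command stored
return list
```

Typing return list after any summarize call is a convenient way to find the name of the stored result you need.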
TABULATING FREQUENCIES OF
CATEGORICAL VARIABLES

• tabulate (often shortened to tab) displays counts of each value of a variable
  • useful for variables with a limited number of levels
• For variables with labeled values, use the nolabel option to display the
  underlying numeric values

* tabulate frequencies of ses
tabulate ses

        ses |   Freq.   Percent      Cum.
------------+-----------------------------
        low |      47     23.50     23.50
     middle |      95     47.50     71.00
       high |      58     29.00    100.00
------------+-----------------------------
      Total |     200    100.00

* remove labels
tab ses, nolabel

        ses |   Freq.   Percent      Cum.
------------+-----------------------------
          1 |      47     23.50     23.50
          2 |      95     47.50     71.00
          3 |      58     29.00    100.00
------------+-----------------------------
      Total |     200    100.00
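The same pattern can be tried on the auto data set that ships with Stata, where foreign is a labeled 0/1 variable and rep78 contains some missing values. This is a sketch on a different data set from the slides; the missing option shown last is an addition that lists missing values as their own row.

```stata
sysuse auto, clear

* labeled values (Domestic / Foreign)
tabulate foreign

* underlying numeric codes (0 / 1)
tabulate foreign, nolabel

* include missing values as their own row
tabulate rep78, missing
```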
DATA VISUALIZATION

histogram    histogram
graph box    boxplot
scatter      scatter plot
graph bar    bar plots
twoway       layered graphics
DATA VISUALIZATION

• Data visualization is the representation of data in visual formats such as
graphs
• Graphs help us to gain information about the distributions of variables and
relationships among variables quickly through visual inspection
• Graphs can be used to explore your data, to familiarize yourself with
distributions and associations in your data
• Graphs can also be used to present the results of statistical analysis
HISTOGRAMS

• Histograms plot distributions of variables by displaying counts of
values that fall into various intervals of the variable

* histogram of write
histogram write
histogram OPTIONS

• Use the option normal with histogram to overlay a theoretical normal density
• Use the width() option to specify interval width

* histogram of write with normal density
* and intervals of length 5
hist write, normal width(5)
BOXPLOTS

• Boxplots are another popular option for displaying distributions of
continuous variables
• They display the median, the interquartile range (IQR), and outliers
(beyond 1.5*IQR)
• You can request boxplots for multiple variables on the same plot

* boxplot of all test scores
graph box read write math science socst
SCATTER PLOTS

• Explore the relationship between 2 continuous variables with a scatter plot
• The syntax scatter var1 var2 will create a scatter plot with
var1 on the y-axis and var2 on the x-axis

* scatter plot of write vs read
scatter write read
BAR GRAPHS TO VISUALIZE
FREQUENCIES

• Bar graphs are often used to visualize frequencies
• graph bar produces bar graphs in Stata
  • its syntax is a bit tricky to understand
• For displays of frequencies (counts) of each level of a variable, use this syntax:
graph bar (count), over(variable)

* bar graph of count of ses
graph bar (count), over(ses)
TWO-WAY BAR GRAPHS

• Multiple over(variable) options can be specified
• The option asyvars will color the bars by the first over() variable

* frequencies of gender by ses
* asyvars colors bars by ses
graph bar (count), over(ses) over(gender) asyvars
TWO-WAY, LAYERED GRAPHICS

• The Stata graphing command twoway produces layered graphics, where
multiple plots can be overlaid on the same graph
• Each plot should involve a y-variable and an x-variable that appear on the
y-axis and x-axis, respectively
• Syntax (generally): twoway (plottype1 yvar xvar) (plottype2 yvar xvar)…
• plottype is one of several types of plots available to twoway, and
yvar and xvar are the variables to appear on the y-axis and x-axis
• See help twoway for a list of the many plottypes available
LAYERED GRAPH EXAMPLE 1

• Layered graph of scatter plot and lowess plot
(a locally weighted smoothed curve)

* layered graph of scatter plot and lowess curve
twoway (scatter write read) (lowess write read)
LAYERED GRAPH EXAMPLE 2

• You can also overlay separate plots by group on the same graph with different colors
• Use if to select groups
• the mcolor() option controls the color of the markers

* layered scatter plots of write and read
* colored by gender
* (/// continues a command on the next line in a do-file)
twoway (scatter write read if gender == 1, mcolor(blue)) ///
       (scatter write read if gender == 2, mcolor(red))
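For readers without the test-score data, the layering idea can be sketched on the built-in auto data set. The variables mpg, weight and foreign belong to that data set, lfit is one of the plot types listed under help twoway, and the legend text is an illustrative choice. The /// line continuation only works inside a do-file, not when typing at the command prompt.

```stata
sysuse auto, clear

* scatter plot of mpg against weight with a linear fit layered on top
twoway (scatter mpg weight) (lfit mpg weight)

* the same scatter, with markers colored by origin of the car
twoway (scatter mpg weight if foreign == 0, mcolor(blue)) ///
       (scatter mpg weight if foreign == 1, mcolor(red)), ///
       legend(order(1 "Domestic" 2 "Foreign"))
```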
DATA MANAGEMENT

DATA MANAGEMENT
• To describe the variables in the data set, type:
describe or des
– To describe specific variables, add the name of the variable to the command.
• Eg: des mpg
– You can summarize specific variables
• sum varlist, detail
Example:
• sum mpg, detail
• sum mpg
• su mpg
DATA MANAGEMENT

• If you are only interested in a subset of your data, you can inspect it using filters. E.g. if you
are only interested in cars within a particular price or mileage range, you can type:
. sum if price>=3000 & price<=4400
. sum if mpg>=16 & mpg<=23
• And then you can contrast these with | (every observation meets at least one of the two
conditions, so nothing is filtered out):
. sum if price>=3000 | price<=4400
. sum if mpg>=16 | mpg<=23
• Interpretation of Logical Operators in STATA.
>=          greater or equal to
<=          less or equal to
==          equal to
&           and
|           or
!= or ~=    not equal to
>           greater than
<           less than
.           missing
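Range filters of this kind can also be written with Stata's inrange() function. The sketch below, again on the auto data set, checks that both forms select the same cars; the scalar names are illustrative.

```stata
sysuse auto, clear

* explicit range filter with & (both bounds must hold)
count if price >= 3000 & price <= 4400
scalar n_explicit = r(N)

* inrange(x, a, b) is true when a <= x <= b
count if inrange(price, 3000, 4400)
scalar n_inrange = r(N)

display "explicit: " n_explicit "   inrange: " n_inrange

* the function works in any command that accepts if
summarize price if inrange(price, 3000, 4400)
```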
DATA MANAGEMENT
• The usual arithmetic operators (+,-,*,/) are
applicable in STATA.
• STATA allows users to tabulate variables to
know the distribution of a variable
. tabulate mpg
. tab mpg
DATA MANAGEMENT
GENERATING NEW VARIABLES
• You can create a new variable by combining
existing variables or by performing some
arithmetic operations. [gen, egen, recode]
• To create a ratio of two variables:
– gen mpgratio=mpg/weight
– sum mpgratio
Creating, transforming, and labeling variables

generate          create variable
replace           replace values of variable
egen              extended variable generation
rename            rename variable
recode            recode variable values
label variable    give variable description
label define      generate value label set
label values      apply value labels to variable
encode            convert string variable to numeric
Generating variables
• Variables often do not arrive in the form that we
need
• Use generate (often abbreviated gen or g) to create
variables, usually from operations on existing
variables
• sums/differences/products/means of variables
• squares of variables
• If an input value to a generated variable is missing,
the result will be missing
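The points above can be sketched with the built-in auto data set, where rep78 is missing for a few cars. The new variable names (wt_len, mpg_sq, rep_plus) are made up for this illustration.

```stata
sysuse auto, clear

* product of two existing variables
generate wt_len = weight * length

* square of a variable
generate mpg_sq = mpg^2

* rep78 is missing for some cars, so the result is missing for them too
generate rep_plus = rep78 + 1
count if missing(rep78)
count if missing(rep_plus)
```

The two counts at the end should agree, which confirms that missing inputs carry through to the generated variable.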

Transforming variables…
The same procedure can be applied to obtain traditional
transformations such as:
Square         gen mpg2=mpg^2
Cubic          gen mpg3=mpg^3
Square root    gen mpgsqrt=sqrt(mpg)
Exponential    gen expmpg=exp(mpg)
Natural log    gen lnmpg=ln(mpg)
               gen logmpg=log(mpg)      (log() is a synonym for ln())
Base 10 log    gen l10mpg=log10(mpg)
• Eg: gen lprice=log(price+1)
– Why +1? The log of zero is undefined, so Stata returns a missing value for it;
adding 1 avoids this when the variable contains zeros. Values that are already
missing stay missing.
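A small made-up data set with values 0, 1 and 2 shows what happens to a zero under the log transformation. The variable names x, lnx and lnx1 are illustrative and not part of auto.dta.

```stata
* three observations with x = 0, 1, 2 (toy data)
clear
set obs 3
generate x = _n - 1

* ln(0) is undefined, so Stata stores a missing value
generate lnx = ln(x)

* adding 1 first keeps the zero observation, since ln(1) = 0
generate lnx1 = ln(x + 1)

list x lnx lnx1
```

Note that ln(x + 1) is a different transformation from ln(x), so results based on it should be interpreted accordingly.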
Generating variables…
• Eg: from the auto.dta data set we are using, we may be interested in finding out how
many cars were repaired more than two times in 1978. Thus we create a new
variable repair = 1 if the vehicle was repaired more than twice, or 0 otherwise.
• Use the commands:
gen repair = 1 if rep78 > 2 & !missing(rep78)
replace repair = 0 if rep78 <= 2
• Missing values count as larger than any number in Stata, so without
& !missing(rep78) the cars with missing rep78 would also be coded 1.
• You can also create a categorical variable from a continuous variable.
tab mpg
gen mpgcat = 1 if mpg < 16
replace mpgcat = 2 if mpg >= 16 & mpg < 26
replace mpgcat = 3 if mpg >= 26 & mpg <= 35
replace mpgcat = 4 if mpg > 35 & !missing(mpg)
tab mpgcat
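One point worth checking when building such indicators is how missing values behave in comparisons. The following sketch on auto.dta compares the counts with and without a missing-value guard and then builds the indicator in a single step; repair_ind is a hypothetical variable name.

```stata
sysuse auto, clear

* rep78 > 2 is also true for missing values, because Stata treats
* numeric missing as larger than any number
count if rep78 > 2
count if rep78 > 2 & !missing(rep78)

* one-step indicator: 1 if rep78 > 2, 0 otherwise, missing stays missing
generate byte repair_ind = rep78 > 2 if !missing(rep78)
tabulate repair_ind, missing
```

The difference between the two counts equals the number of cars with missing rep78.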
Missing values in Stata
• Missing numeric values in Stata are represented by .
• Missing string values in Stata are represented by ""
(empty quotes)
• You can check for missing by testing for equality to . (or
"" for string variables)
• You can also use the missing() function
• When using estimation commands, generally,
observations with missing on any variable used in the
command will be dropped from the analysis
• Syntax: list varlist if var == .
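Besides the system missing value ., Stata also has extended missing values .a to .z, and a test for equality to . does not catch those. The toy data below (not from any data set used in this chapter) sketches the difference between the two checks.

```stata
* toy data: one real value, one system missing (.), one extended missing (.a)
clear
set obs 3
generate x = 1 in 1
replace x = .a in 3

* equality with . finds only the system missing value
count if x == .

* missing() finds every kind of missing value
count if missing(x)
```

For that reason missing(x) is usually the safer check when the source of the data is unknown.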
Missing values in Stata …
• Example: import the ‘Practical 2 Data Entry’ dataset from Excel
and save it
• Use the following command to check the missing values
of the variable gnppc
li country gnppc if gnppc == .

      country                    gnppc
  7.  Bosnia and Herzegovina         .

• It shows that the 7th observation (gnppc, or GNP per capita, of
Bosnia and Herzegovina) is missing from the dataset
Practical exercise 1
• Open the auto.dta dataset by using the following commands
. clear
. sysuse auto.dta
1. Find the number of missing values for the variable rep78
2. Identify the observations for which the values are missing
3. Replace the missing values of rep78 with 0
4. Check again whether any values of rep78 are still missing after the
replacement
Hint for Practical exercise 1
1. Use the codebook or list commands to inspect the missing values
   codebook rep78
   missing .: 5/74
2. Use the list command to identify the observations for which the
   values are missing
   li rep78 if rep78 == .

        rep78
    3.      .
    7.      .
   45.      .
   51.      .
   64.      .

   The car repair values are missing for the 3rd, 7th, 45th, 51st, and
   64th observations.
3. Use the replace command to replace the missing values
   replace rep78 = 0 if rep78 == .
   (5 real changes made)
   codebook rep78
   missing .: 0/74
Renaming and recoding variables
rename
• changes the name of a variable
Syntax: rename old_name new_name
recode
• This command changes the values of a categorical variable
according to the rules specified. The syntax is:
recode varname (oldvalue=newvalue) (oldvalue=newvalue) … [if exp] [in range]
recode foreign (0=1) (1=2)
recode rep78 (.=9) (*=7)
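Because recode overwrites the variable by default, it can be safer to write the recoded values to a new variable with the generate() option. The grouping below (three groups out of the five rep78 values) and the names rep3 and repair_group are illustrative choices.

```stata
sysuse auto, clear

* collapse the five rep78 values into three groups and store the
* result in a new variable, leaving rep78 itself unchanged
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* compare old and new values
tabulate rep78 rep3, missing

* rename the new variable
rename rep3 repair_group
```

The cross-tabulation is a quick way to confirm that every old value ended up in the intended group.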
Recoding values of a categorical variable ….

LABELING VALUES

• Value labels give text descriptions to the numerical values of a variable.
• To create a new set of value labels use label define
• Syntax: label define labelname # label…, where
labelname is the name of the value label set, and (#
label…) is a list of numbers, each followed by its label.
• Then, to apply the labels to variables, use label values
• Syntax: label values varlist labelname, where
varlist is one or more variables, and labelname is the value
label set name

LABELING VALUES …

• Use practical dataset 3

practical dataset 3
gen   fexp   mstat
1     5      married
0     6      not married
0     6      married
0     7      not married
0     7      married
1     8      not married
1     7      married
1            not married
1            married
0     12     not married

* gen before labeling values
tab gen

    Gen |   Freq.   Percent      Cum.
--------+-----------------------------
      0 |       5     50.00     50.00
      1 |       5     50.00    100.00
--------+-----------------------------
  Total |      10    100.00

* create and apply labels for gen
label define sex 0 female 1 male
label values gen sex
tab gen

    Gen |   Freq.   Percent      Cum.
--------+-----------------------------
 female |       5     50.00     50.00
   male |       5     50.00    100.00
--------+-----------------------------
  Total |      10    100.00
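The same two steps can be sketched on auto.dta. The indicator heavy, the label set heavylbl and the 3,000 lbs cutoff are all made up for this illustration. Labels that contain spaces have to be enclosed in double quotes.

```stata
sysuse auto, clear

* 0/1 indicator for cars heavier than 3,000 lbs (weight has no missing values)
generate byte heavy = weight > 3000

* define the label set; labels with spaces must be quoted
label define heavylbl 0 "not heavy" 1 "heavy"

* attach the label set to the variable and describe the variable itself
label values heavy heavylbl
label variable heavy "Weight above 3,000 lbs"

* inspect the result
label list heavylbl
tabulate heavy
```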
ENCODING STRING VARIABLES INTO
NUMERIC (1)

• encode converts a string variable into a numeric variable
  • remember that some Stata commands require numeric variables
• encode will use alphabetical order to order the numeric codes
• encode will convert the original string values into a set of value labels
• encode will create a new numeric variable, which must be specified in option
  gen(varname)

* encoding string mstat into
* numeric variable mstatus
encode mstat, gen(mstatus)

* we see that mstatus is a numeric with labels (blue)
* while the old variable mstat is string (red)
browse mstat mstatus

mstat          mstatus
married        married
not married    not married
married        married
ENCODING STRING VARIABLES INTO
NUMERIC (2)

• remember to use the option nolabel to remove value labels from tabulate output
• Notice that numbering begins at 1

* we see labels by default in tab
tab mstatus

    mstatus |   Freq.   Percent      Cum.
------------+-----------------------------
    married |       5     50.00     50.00
not married |       5     50.00    100.00
------------+-----------------------------
      Total |      10    100.00

* use option nolabel to remove the labels
tab mstatus, nolabel

    mstatus |   Freq.   Percent      Cum.
------------+-----------------------------
          1 |       5     50.00     50.00
          2 |       5     50.00    100.00
------------+-----------------------------
      Total |      10    100.00
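To try encode without the practical data set, a string variable can first be built from the labels of foreign in auto.dta using decode, and then encoded back. The names origin_str and origin_num are illustrative.

```stata
sysuse auto, clear

* build a string variable from the labels of foreign ("Domestic", "Foreign")
decode foreign, generate(origin_str)

* encode it back: codes follow alphabetical order and start at 1
encode origin_str, generate(origin_num)

tabulate origin_num
tabulate origin_num, nolabel
```

Since foreign is coded 0/1 and encode starts numbering at 1, the encoded variable is the original plus one in this particular case.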
DATASET OPERATIONS

keep       keep variables, drop others
drop       drop variables, keep others
keep if    keep observations, drop others
drop if    drop observations, keep others
sort       sort by variables, ascending
gsort      ascending and descending sort
KEEPING AND DROPPING VARIABLES

• keep preserves the selected variables and drops the rest
  • Use keep if you want to remove most of the variables but keep a select few

* keep just mpg rep78 and weight
keep mpg rep78 weight

• drop removes the selected variables and keeps the rest
  • Use drop if you want to remove a few variables but keep most of them

* drop variable foreign from dataset
drop foreign
KEEPING AND DROPPING OBSERVATIONS

Specify if after keep or drop to preserve or remove observations by condition

To be clear, keep if and drop if select observations, while keep and drop
select variables

* keep observation if reading > 40
keep if read > 40
summ read

    Variable |   Obs       Mean    Std. Dev.   Min   Max
-------------+-------------------------------------------
        read |   178   54.23596     8.96323     41    76

* now drop if math outside range [30,70]
drop if math < 30 | math > 70
summ math

    Variable |   Obs       Mean    Std. Dev.   Min   Max
-------------+-------------------------------------------
        math |   168   52.68452    8.118243     35    70
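keep and drop change the data in memory permanently for the session. When a subset is only needed temporarily, one option is to wrap the step in preserve and restore, as in this sketch on auto.dta.

```stata
sysuse auto, clear

* take a snapshot of the data in memory
preserve

* keep only foreign cars and a few variables, then work on the subset
keep if foreign == 1
keep make price mpg
summarize price mpg

* bring back the full data set
restore
count
```

After restore, all observations and variables are available again.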
SORTING DATA (1)

• Use sort to order the observations by one or more variables
• sort var1 var2 var3, for example, will sort first by var1, then by var2, then
by var3, all in ascending order

li price mpg rep78 in 1/5

     +-----------------------+
     | price    mpg    rep78 |
     |-----------------------|
  1. | 4,099     22        3 |
  2. | 4,749     17        3 |
  3. | 3,799     22        . |
  4. | 4,816     20        3 |
  5. | 7,827     15        4 |
     +-----------------------+
SORTING DATA (2)

* sort by price, then mpg, then rep78
sort price mpg rep78
li price mpg rep78 in 1/5

     +-----------------------+
     | price    mpg    rep78 |
     |-----------------------|
  1. | 3,291     20        3 |
  2. | 3,299     29        3 |
  3. | 3,667     24        2 |
  4. | 3,748     31        5 |
  5. | 3,798     35        5 |
     +-----------------------+
SORTING DATA (3)

• Use gsort with + or – before each variable to specify ascending
and descending order, respectively

gsort -read +math
li id read math in 1/5

     +--------------------+
     |  id    read   math |
     |--------------------|
  1. |  61      76     60 |
  2. | 103      76     64 |
  3. |  34      73     57 |
  4. |  93      73     62 |
  5. |  95      73     71 |
     +--------------------+
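Both sorting commands can be tried on auto.dta as well. This sketch puts the most fuel-efficient cars first with gsort and then re-sorts by price in ascending order; the choice of variables is illustrative.

```stata
sysuse auto, clear

* highest mpg first; ties broken by ascending price
gsort -mpg +price
list make mpg price in 1/5

* sort orders ascending only
sort price
list make price in 1/5
```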
