Introduction To Stata and Data Management

INTRODUCTION TO STATA

Abrham Seyoum (PhD)


abramseyom@yied.org
INTRODUCTION
• Econometrics software comes in three types:
1) Menu based: EViews, GiveWin
2) Command based: GAUSS, GAMS, SAS, R
3) Combination of both: SPSS, Stata
WHAT IS STATA?

➢ It is a multipurpose statistical package that helps you explore, summarize and
analyze datasets.
➢ A dataset is a collection of several pieces of information called variables
(usually arranged in columns).
➢ A variable can have one or several values (information for
one or several cases).
➢ use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
STARTING STATA

A. Stata windows
- Review
- Results
- Variables
- Command
B. Stata Toolbar
- It has buttons for different functions
C. Stata Log File
D. Stata's Help Feature
INPUTTING DATA INTO STATA USING THE DATA EDITOR

➢ Click on the Data Editor button, or type edit in the Command window and
press Return.
A. Inputting Data
➢ Things to know about entering data in Stata:
• Quotes around string values are unnecessary
• A period (.) represents a missing numeric value
• Press Tab or Return to input a missing numeric value
• Press Tab or Return to input a missing value for a string
variable
• Stata will not allow empty columns or rows in the middle
of your data set.
Go to Initial DATA
INPUTTING DATA CONT…

B. Renaming Variables
➢ Click on the Variables Manager button. This brings up the Variable
Information dialog box. Enter the new name of the variable and, if you wish,
a more detailed description of the variable (its label).
➢ Rules for variable names:
• Stata is case sensitive (math and Math are different variables)
• A variable name must be between 1 and 32 characters long
• Characters can be letters, digits, or underscores
• Spaces or other characters are not allowed
• The first character of a variable name must be a letter or an
underscore
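The same renaming can be done from the Command window. The lines below are a minimal sketch on toy data; var1, math_score and Math_score are hypothetical names used only for illustration.

```stata
* Toy data: var1, math_score and Math_score are hypothetical names
clear
set obs 3
generate var1 = _n
* Give the variable a new name and a more detailed description (label)
rename var1 math_score
label variable math_score "Mathematics test score"
* Names are case sensitive, so Math_score is a separate variable
generate Math_score = 0
describe
```

describe should list two variables, math_score with its label and Math_score.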
INPUTTING DATA CONT…

C. Copying and Pasting Data
1. Select the data you want to copy
Click and drag the mouse to select a range of cells.
2. Copy the data to the clipboard
Pull down the Edit menu and choose Copy.
3. Paste the data from the clipboard
➢ Click on the top left cell of the area to which you wish to paste. Pull down
the Edit menu and choose Paste.
D. Exiting the Data Editor
➢ Click on the editor's close box.
➢ Changes that you made in the editor are not saved to disk until you tell Stata
to save them. Data can be saved by pulling down File and choosing Save As.
STORAGE TYPE
1. Numeric or Real
Storage                                                   Closest to 0
type      Minimum                 Maximum                 without being 0   Bytes
-----------------------------------------------------------------------------------
byte      -127                    100                     +/-1              1
int       -32,767                 32,740                  +/-1              2
long      -2,147,483,647          2,147,483,620           +/-1              4
float     -1.70141173319*10^38    1.70141173319*10^38     +/-10^-38         4
double    -8.9884656743*10^307    8.9884656743*10^307     +/-10^-323        8
-----------------------------------------------------------------------------------
Precision for float is 3.795x10^-8.
Precision for double is 1.414x10^-16.

2. String
Storage   Maximum
type      length    Bytes
----------------------------------
str1      1         1
str2      2         2
...       .         .
str244    244       244
----------------------------------
(str244 is the longest string type up to Stata 12; Stata 13 and later allow up to str2045, plus strL for longer strings.)
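The limited precision of float matters when comparing values. The lines below are a small sketch on a one-observation toy dataset (xf and xd are hypothetical names): a typed constant such as 0.1 is evaluated in double precision, so it does not equal the same number stored as a float.

```stata
* Toy data: xf and xd are hypothetical names
clear
set obs 1
generate float  xf = 0.1
generate double xd = 0.1
* The constant 0.1 is a double, so the float copy does not match it
count if xf == 0.1
* Rounding the constant to float precision makes the comparison work
count if xf == float(0.1)
* The double copy matches the constant directly
count if xd == 0.1
```

When a variable is stored as float, comparing it with float(value) is the usual way around this.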
DATA FORMAT
The storage type determines the number of digits of accuracy (in other words, how many
bytes of storage each value takes). Each storage type has a default display format:
➢ byte %8.0g /* g refers to general & f refers to fixed; other formats exist for binary,
exponential etc. */
➢ int %8.0g
➢ long %12.0g /* 9 digits of accuracy */ & /* 0 decimal places */
➢ float %9.0g /* about 7 digits of accuracy */
➢ double %10.0g /* 16 digits of accuracy */
➢ str# %#s (string format), e.g. str4 %4s (indicates the maximum length of the
string is 4)
➢ To see the display format of a variable, use the command
➢ format race
➢ To convert a string variable that holds numbers into a numeric variable, use the command
➢ destring race, replace
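As an illustration of the last two commands, here is a minimal sketch in which race is deliberately entered as a string. The three values are made up.

```stata
* Toy data: race is entered as a string on purpose
clear
input str4 race
"1"
"2"
"4"
end
* Storage type is str4 and the display format is a string format
describe race
* Convert the string variable to a numeric one
destring race, replace
describe race
* Show the display format now attached to race
format race
```

destring refuses to convert a variable that contains nonnumeric characters unless told how to handle them; see help destring.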
LISTING DATA

A. List
➢ Typing list in the Command window lists the entire data set. To list only a
subset of variables, name them after list.
B. List with in
➢ The in qualifier restricts the list to a range of observations.
➢ Positive numbers count from the top of the data. Negative numbers count from
the end of the data.
➢ You can specify both a variable list and an observation range.

Example:
➢ Type the following commands in the Command window:
➢ list in 1 or list prog in 1
➢ list in -1
➢ list in 1/5
➢ listblck in 1/5
➢ * listblck is user-written: type findit listblck and then install the extension from the help.
C. List with if
• The if qualifier restricts the command to observations that meet certain criteria,
expressed with relational and logical operators.
• The operators are:
<    less than               >    greater than
<=   less than or equal      >=   greater than or equal
==   equal                   ~=   not equal (!= can also be used)
&    and       |    or       ~    not (! can also be used)
()   parentheses specify order of evaluation
Example:
1. list if science>60 & science ~=.
2. display ln(9) (display can be used in place of a calculator)
3. dis 3/4
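The condition science ~=. in example 1 is there because Stata treats a missing value as larger than any number. A small sketch on made-up data shows the difference it makes:

```stata
* Toy data: four made-up science scores, one of them missing
clear
input id science
1 55
2 72
3 .
4 61
end
* Missing counts as larger than 60, so ids 2, 3 and 4 are listed
list if science > 60
* Excluding missing values leaves ids 2 and 4
list if science > 60 & science ~= .
```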
DATA INSPECTION
To inspect the data:
➢ inspect write
It reports how many values are missing, zero, positive, negative, integer, etc.
Or, type findit nmissing and then install the user-written extension from the help.
➢ nmissing
➢ npresent
Or, use codebook to describe the contents of the data:
➢ codebook [varlist] [if] [in] [, options]
➢ codebook write
➢ codebook race
DELETING VARIABLES AND OBSERVATIONS
A. Clear and Drop _all
➢ The commands clear and drop _all eliminate data from memory. drop _all
drops all variables and observations; clear does the same and also drops value labels.
B. Drop/keep
➢ The drop command allows you to drop variables and/or specific
observations; keep retains what you name and drops everything else.
➢ drop in 1/3 (this drops observations 1 through 3)
➢ drop if math > 60
➢ drop id race math
➢ keep id race math
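A minimal sketch of these commands on ten made-up observations (the math values are generated only for illustration):

```stata
* Toy data: ten observations with math = 43, 46, ..., 70
clear
set obs 10
generate id = _n
generate math = 40 + 3*_n
* Drop the first three observations
drop in 1/3
* Drop observations with a math score above 60
drop if math > 60
* Keep only the id variable; math is dropped
keep id
list
```

Three observations (ids 4, 5 and 6) and one variable should remain.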
WORKING WITH DATA
A. Preliminaries – describe and list
➢ When working with an unfamiliar data set it is useful to describe the data.
The Stata command describe reports the number of observations, the number of
variables, the storage type of each variable, etc.
➢ The values themselves can be examined using the Stata command list.
Example:
1. describe
2. list math science in 1/10
3. sort ses
by ses: summarize read write
4. by ses, sort: summarize read write
5. bysort ses: summarize read write
(Examples 3, 4 and 5 produce the same output.)
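The by prefix repeats a command separately for each group. A short sketch on made-up data, where n_in_group is a hypothetical variable name:

```stata
* Toy data: five made-up reading scores in three ses groups
clear
input ses read
1 40
1 50
2 60
2 70
3 80
end
* Sort by ses and summarize read within each group
bysort ses: summarize read
* Inside by, _N is the number of observations in the current group
bysort ses: generate n_in_group = _N
list
```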
WORKING WITH DATA CONT…
B. Descriptive Statistics
➢ The Stata command summarize provides summary statistics of the data set.
The if and in qualifiers can be combined with summarize.
Example:
1. summarize
2. sum read
3. sum read, detail
4. sum read if prog==3
5. return list
/* return list shows the results that the last command stored in r(), such as r(mean)
and r(sd); you can use these results in later commands */
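To illustrate how stored results can be reused, the sketch below centers a made-up variable on its mean. read_centered is a hypothetical name.

```stata
* Toy data: three made-up reading scores
clear
input read
40
50
60
end
summarize read
* Show what summarize left behind: r(N), r(mean), r(sd), ...
return list
* Use the stored mean in a later command
generate read_centered = read - r(mean)
list
```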
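WORKING WITH DATA CONT…
C. Tabulate
➢ Frequency tables are obtained using the tabulate command. With the
summarize() option, tabulate reports the mean, standard deviation and
frequency of another variable within each category.
Example:
➢ tabulate write
➢ tab race, sum(write)
➢ tab race, sum(write) mean
➢ tab sex prog, sum(math)
D. Correlation Matrices
➢ The correlation between variables is calculated using the Stata command
correlate; pwcorr computes each correlation from all observations available for that pair.
➢ correlate read write
➢ pwcorr read write if sex == 0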
CREATING NEW VARIABLES
A. Generate
➢ generate creates a new variable from an algebraic expression
of other variables.
➢ gen lnmath = ln(math)
➢ gen sum = _N (_N is the total number of observations)
➢ gen ID = _n (_n is the number of the current observation)
➢ gen id1 = _n in 1/10
B. Replace/recode
➢ The command replace changes the contents of an existing variable;
recode changes specified values of a variable to new values.
➢ replace write = write/100
➢ recode math 45 46 47 48 = 100
➢ recode math 45/48 = 100
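The lines below sketch what _n, _N, in and recode do on five made-up observations. All variable names are hypothetical.

```stata
* Toy data: five observations
clear
set obs 5
* _n is the current observation number: 1, 2, 3, 4, 5
generate obs_no = _n
* _N is the total number of observations: 5 in every row
generate total = _N
* With in 1/3, observations 4 and 5 are left missing
generate first3 = _n in 1/3
* score is 45, 46, 47, 48, 49 before recoding
generate score = 44 + _n
* Values 45 through 48 become 100; 49 is left unchanged
recode score 45/48 = 100
list
```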
EGEN & XTILE
egen
➢ This is an extended version of generate (extended generate) that creates a new variable by
aggregating the existing data. It is a powerful and useful command.
➢ egen maxthree = rowmax(math science socst) (rowmax was called rmax in older versions of Stata)
➢ egen avgmath = mean(math)
➢ egen avgmath_race = mean(math), by(race)
➢ egen medmath = median(math), by(race)
➢ egen permath = pctile(math), by(race) (the median, since p(50) is the default)
➢ egen p25math = pctile(math), by(race) p(25)
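A minimal sketch of row-wise and group-wise egen functions on made-up data; best, avgmath and avgmath_race are hypothetical names. Each new variable needs its own name, because egen will not overwrite an existing variable.

```stata
* Toy data: four made-up students in two race groups
clear
input race math science
1 40 50
1 60 45
2 70 80
2 90 60
end
* Row-wise maximum of math and science
egen best = rowmax(math science)
* Overall mean of math, repeated in every row
egen avgmath = mean(math)
* Mean of math within each race group
egen avgmath_race = mean(math), by(race)
list
```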
xtile
➢ This command creates a new variable that indicates which category a record falls into
when the sample is sorted by an existing variable and divided into n groups of
(approximately) equal size.
➢ xtile newvar = variable [if exp] [in range], nq(#)
➢ xtile writedec = write, nq(10)
➢ tab writedec
➢ The groups are not exactly equal in size because observations with tied values of write
are placed in the same group. Repeat the exercise for id, which has no ties.
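The exercise for id can be sketched on made-up data as follows (iddec is a hypothetical name). With 100 distinct values and no ties, each decile holds exactly 10 observations.

```stata
* Toy data: 100 observations with distinct id values 1 to 100
clear
set obs 100
generate id = _n
* Divide the sample into 10 groups by id
xtile iddec = id, nq(10)
* Every group should contain 10 observations
tab iddec
```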
OTHER IMPORTANT COMMANDS
Merge
➢ You merge when you want to add more variables to an existing dataset
(type help merge).
➢ What you need:
– Both files must be in Stata format
– Both files should have at least one variable in common (id)
➢ Step 1. Sort both datasets by the id or ids common to the files you
want to merge. (Sorting is required by the old merge syntax of Stata 10 and earlier;
merge 1:1, m:1 and 1:m sort the data themselves.) For both datasets type:
➢ – sort [id1] [id2] …
➢ – save [datafile name], replace
➢ Step 2. Open the master data (the main dataset you want to add more variables
to, for example data1.dta) and type:
➢ – sort [id1] [id2] … & save [datafile name], replace
➢ – merge [id1] [id2] … using [i.e. data2.dta] (old syntax, Stata 10 and earlier)
➢ For example, opening a hypothetical data1.dta, in Stata 11 and later we type
– merge 1:1 [common variables] using data2.dta (or m:1, 1:m or m:m)
➢ To verify the merge type
– tab _merge
➢ Here are the codes for _merge:
_merge==1 observation found in the master data only
_merge==2 observation found in the using data only
_merge==3 observation found in both the master and the using data (matched)
➢ If you want to keep only the observations common to both datasets you can drop
the rest by typing:
➢ – drop if _merge!=3 /* This will drop observations where _merge is not
equal to 3 */
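The steps above can be sketched end to end on two small made-up datasets. The temporary file data2 stands in for data2.dta, and id, read and math are assumed names.

```stata
* Hypothetical using data: reading scores for ids 2, 3 and 4
clear
input id read
2 55
3 65
4 75
end
tempfile data2
save `data2'

* Hypothetical master data: math scores for ids 1, 2 and 3
clear
input id math
1 50
2 60
3 70
end
* One observation per id in both files, so this is a 1:1 merge
merge 1:1 id using `data2'
* id 1 is master only (1), id 4 is using only (2), ids 2 and 3 match (3)
tab _merge
* Keep only the matched observations
drop if _merge != 3
list
```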
Collapse
➢ collapse replaces the data in memory with summary statistics computed for each group:
➢ collapse (sum) math (mean) science, by(prog)
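A small sketch of what the command does to made-up data (three students in two programs):

```stata
* Toy data: three made-up students in two programs
clear
input prog math science
1 40 50
1 60 70
2 80 90
end
* One row per prog: the sum of math and the mean of science
collapse (sum) math (mean) science, by(prog)
list
```

Because collapse overwrites the data in memory, save the original dataset first if you still need it.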
Append
➢ You append when you want to add more cases (more rows) to your data;
type help append for more details.
➢ Open the master file (i.e. data1.dta) and type:
– append using [i.e. data2.dta]
Creating Dummies
➢ You can create dummy variables either with recode or with a
combination of the tab and gen commands:
➢ tab prog, generate(prog_dum) (this creates prog_dum1, prog_dum2, …, one dummy per category of prog)
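On made-up data with three program types, the command produces three dummies:

```stata
* Toy data: four made-up observations of prog
clear
input prog
1
2
3
2
end
* Creates prog_dum1, prog_dum2 and prog_dum3
tab prog, generate(prog_dum)
list
```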
Generating Lag Variables
➢ xtset panel_id time_id (declares panel data), or
➢ tsset time_id (declares time-series data)
➢ gen lagscience = L.science
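A minimal time-series sketch on made-up yearly data; year, sales and lag_sales are hypothetical names.

```stata
* Toy data: three made-up yearly observations
clear
input year sales
2001 10
2002 12
2003 15
end
* Declare year as the time variable
tsset year
* L.sales is the value of sales one period earlier
generate lag_sales = L.sales
list
```

The first year has no earlier period, so its lag is missing.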
Identifying duplicates
➢ duplicates report IDs
➢ duplicates tag IDs, gen(dup)
➢ duplicates drop IDs, force
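The three commands can be tried on made-up data in which one ID appears twice:

```stata
* Toy data: the value 2 of IDs appears twice
clear
input IDs score
1 50
2 60
2 65
3 70
end
* Report how many observations share the same IDs value
duplicates report IDs
* dup holds the number of other observations with the same IDs value
duplicates tag IDs, gen(dup)
* Keep the first observation of each IDs value and drop the rest
duplicates drop IDs, force
list
```

The force option is needed here because the duplicates differ on score; check such cases before dropping them.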

Centiles (percentiles)
➢ centile science
➢ centile science, centile(50) normal
➢ centile science, centile(10(10)100) normal
➢ centile science, centile(20(20)100) normal
STORING COMMANDS AND OUTPUTS
How to save our results in Word/Excel?
Word
➢ Copy the output from the Results window and paste it into Word; then select it
and change the font to Courier New (if the lines wrap, adjust the font size).
Excel
➢ Copy the results and paste them into Excel. Then select the first column, go to
Data, choose Text to Columns, and click Next, Next, Finish.
How to save regression results to a file and open them in Word or Excel?
➢ outreg2 is user-written; type ssc install outreg2 to install it, and run it after a regression.
➢ outreg2 using myfile, dec(2)
➢ outreg2 using myfile, dec(2) replace
➢ outreg2 using myfile, bdec(2) sdec(2) replace
➢ outreg2 using myfile, word excel replace
USING DO-FILE EDITOR
➢ Using the Stata Do-File editor: This editor works like most text editors; use
help to get more details. After you enter your program, click the do-current-file
icon to do the file. When you do the file, the results are sent to the Stata Results
window and the log file. Click the run-current-file icon to run the commands
without sending them to the Stata Results window.
➢ Structure of a Do File: Here is an example of what you want to put in a do
file. Note that anything following a * is not executed but is printed.
➢ capture log close (or cap log close): This command closes any open log
file. If no log file is open, Stata just ignores this command. (Plain log close gives an
error when no log is open.)
➢ set mem 40m (needed only in Stata 11 and earlier; later versions manage memory automatically)
➢ set more off
* don't pause when output scrolls off the page
➢ log using myfile, replace
* log results to file myfile.smcl (add the text option for a plain-text myfile.log)
* your commands go here
➢ drop _all or clear: This command clears the data from memory.
➢ #delimit ; By default Stata assumes that each command is ended by
the carriage return (ENTER key press). If, however, a command is too long
to fit on one line you can spread it over more than one line. You do that by
telling Stata what the command delimiter is. The command
in the example says that a command is ended by a semicolon (;). Every
command following the #delimit ; command has to end with a ; until the file
ends or a #delimit cr command appears, which makes the carriage return
the command delimiter again. Although this particular do file does not
need the #delimit command, it is included to explain the command.
➢ #delimit cr /* makes the carriage return the delimiter again, i.e. ends #delimit ; */
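Put together, a do file following this structure might look like the sketch below. The log name myfile comes from the example above; the toy variable x and the summarize command are placeholders for your own data and commands. #delimit works only inside a do file, not in the Command window.

```stata
* Hypothetical do file following the structure described above
capture log close
set more off
log using myfile, replace text

* Your commands go here; toy data used for illustration
clear
set obs 10
generate x = _n

* From here on each command ends with a semicolon
#delimit ;
summarize x
    if x > 2
    ;
#delimit cr
* Carriage return ends commands again

log close
```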
GRAPHS
➢ The Graphics menu offers many more types of graphs than the
commands shown here.
➢ Scatterplots are good for exploring possible relationships or
patterns between variables. Let's see if there is some relationship between
math and science scores. For many more bells and whistles type help scatter
in the Command window.
➢ twoway scatter math science
➢ twoway scatter math science, mlabel(id)
➢ twoway scatter math science, mlabel(id) || lfit math science
By categories
➢ twoway scatter math science, mlabel(id) by(sex)
HISTOGRAMS
➢ Histograms are another good way to visually explore data, especially to
check for a normal distribution: type help histogram
➢ histogram headage, frequency
➢ histogram headage, frequency normal
➢ The histogram command is an effective graphical technique for showing
both the skewness and kurtosis of a variable such as rconspc.
➢ hist income, normal
➢ hist income, normal bin(100)
GRAPH BOX
➢ graph box draws vertical box plots. In a vertical box plot, the y axis is
numerical, and the x axis is categorical.
➢ The lower and upper bounds of the box are the 25th and 75th
percentiles of the variable, and the line within the box is the median.
➢ The ends of the whiskers are the lower and upper adjacent values, that is, the most
extreme observations lying within 1.5 times the interquartile range of the box;
observations beyond them are plotted as individual points. The graph box command
can be used to produce a box plot, which helps us examine
the distribution of the variable.
➢ If the variable is distributed normally, the median will be near the center of
the box and the ends of the whiskers will be roughly equidistant from the box.
➢ graph box math, by(sex)
CATPLOT
➢ catplot is used to graph categorical data. Since it is a user-written program
you must install it first: type ssc install catplot
Examples
➢ tab headsex agentvisit, col row cell
➢ catplot bar major agegroups, blabel(bar)
➢ catplot bar major agegroups, percent(agegroups) blabel(bar)
➢ catplot hbar agegroups major, percent(major) blabel(bar)
➢ catplot hbar major agegroups, percent(major sex) blabel(bar) by(sex)
EXERCISE
➢ Generate a variable called 'quininc' that indicates the quintile of the science
score of students. Summarize the new variable by sex.
➢ Draw a box graph of the variable 'math' for each category of program.
➢ Generate a new variable called 'math15' that divides the math score of
students into 15 groups.
➢ Finally, construct a do file for the data and commands you used for the
above questions.
