Stata Guide V1
Contents
1 Introduction
1.1 Why to use Stata?
1.2 Alternative tutorials
3 Basic commands
3.1 do-files
3.1.1 Line breaks
3.1.2 Comments
3.2 Prepare Stata for your analysis
3.3 Loading and saving data
3.4 The generic Stata command
3.5 Types of commands
3.6 Conditions
3.7 Relational and logical operators
3.8 Placeholders and selection of multiple variables
3.9 Matrices
3.9.1 matrix define - Generic command for matrices
Centro de Investigación y Docencia Económicas (CIDE), Mexico City, Mexico. [email protected]
4 System commands
4.1 System commands
4.1.1 exit - Leaving the do-file or Stata
4.1.2 query - Displaying Stata options
4.1.3 set - Change settings
4.1.4 about - Getting information of the version and license
4.1.5 update - Update your Stata
4.1.6 findit - Find ado-files online
4.1.7 search - Search ado-files online
4.1.8 help - Display help
5 Data handling
5.1 Variable manipulation
5.1.1 generate - Generate a new variable
5.1.2 replace - Replacing a variable
5.1.3 egenerate - Generate a new variable with summary statistics
5.1.4 recode - Recoding a variable
5.1.5 label - Label your variables
5.1.6 rename - Rename a variable
5.1.7 drop - Deleting variables
5.2 Describing and sorting the data
5.2.1 describe - Describe the dataset
5.2.2 codebook - Display the codebook
5.2.3 sort - Sorting the data
5.2.4 gsort - Sorting the data
5.2.5 order - Sorting the variables
7 Econometric analysis
7.1 Continuous outcome
7.1.1 regress - OLS estimation
7.1.2 Other estimators of continuous outcome
7.2 Categorical outcome
7.2.1 probit - Probit estimation
7.2.2 dprobit - Probit estimation with marginal effects
7.2.3 logit - Logit regression
7.2.4 mlogit - Multinomial logit
7.2.5 oprobit - Ordered probit model
7.2.6 ologit - Ordered logit model
7.3 Count data
7.3.1 poisson - Poisson regression
7.3.2 nbreg - Negative binomial regression
7.4 Panel data
7.4.1 xtset - Set-up the panel data
7.4.2 xtdescribe - Describe the pattern of the panel data
7.4.3 xtreg - Panel regression: fixed and random effects
7.5 Time series
7.6 Extracting estimation results
7.7 Post estimation
7.7.1 hettest - Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
7.7.2 test - Test linear hypotheses after estimation
7.8 Saving and reusing estimations
7.8.1 est store - Save an estimation
7.8.2 est restore - Restore an estimation
12 Simulation
12.1 Uniform distribution
12.2 Normal distribution
12.3 Other distributions
12.4 Setting the random seed
1 Introduction
This document has two main purposes: first, it should help new users get started with Stata and,
second, it should serve more experienced users as a look-up document. To meet the first goal, I start
with a general introduction to the software package and then introduce, chapter by chapter, more
complicated notions, so that readers first become familiar with the software and then discover the
possibilities Stata offers. The second goal should be achieved by a clear structure and a detailed
index at the end of the document. Like software development, this document will never reach a final
version, and comments and suggestions are always welcome. Even though I refer to some econometric
models, this document is NOT a reference for econometric analysis. The reader is expected to
understand the models I present here and to know how and when to use them.
The document is under constant review and subject to changes and extensions. Please check for
updated versions frequently. Please report all errors to [email protected].
http://www.cpc.unc.edu/research/tools/data_analysis/statatutorial/index.html
http://www.ats.ucla.edu/stat/stata/
2 INTERFACE, COMPONENTS AND DATA STRUCTURE
[Figure 1: The Stata interface. Labeled elements: do-file editor, variables in active data, output
panel, command line, current working directory.]
the command line you see the current working directory, which by default is the directory
where Stata is installed. Among the buttons in the menu you find the do-file editor, which will
probably be the button you use most. The do-file editor is a simple text editor with syntax highlighting
[1] In the new versions of Stata, the black screen is actually white and the results are in black.
Personally, I prefer the black version (you can activate it under Edit > Preferences > General
preferences > Color scheme: Classic). It looks a little bit more old-fashioned but highlights the
different results very well.
(since Version 11). The remaining parts of the interface will be introduced later on, but they do not
play a crucial role in the way I suggest using Stata.
2.2 Command based
Unlike other statistical packages, Stata is mostly command based and the use of graphical user
interfaces is relatively limited. In general, all commands can be launched with the mouse through
menus and dialogs, but this is not an efficient way of working. My suggestion is to take the time to
learn the syntax-based use of Stata properly and to work exclusively with do-files. Do-files
are like m-files in Matlab or syntax files in SPSS: they let you write down a series of commands and
programming code. Using do-files for everything, from loading the database through carrying out the
analysis to storing the results, saves time and avoids errors. Moreover, the results are
easily reproducible by you and by other researchers.
Normally, every line in a do-file corresponds to one command. This is generally what you want, and it
avoids the need to finish every line with a special character as in many programming languages.
However, Stata also lets you change this behavior temporarily or permanently.
This is useful when you have an extremely long command that does not fit on the screen
(see 3.1.1 for more details).
2.5 Data types: database and variables
in the analysis end with the extension .do and are simple text files that you can also edit in other
text editors like Notepad or Notepad++.
Besides these two main file types, there are help files (ending with .hlp) and ado-files (ending
with .ado). Both are used especially when you write your own commands, a topic I discuss in
section 11.
Finally, Stata is not limited to these file types: in principle, you can read and write any kind of file.
I will discuss a couple of examples, such as the output of graphics, the automatic generation of LaTeX
files and the import of SPSS and SAS data (section 9).
The following table displays the most common file extensions used in Stata:
Extension   Description
.dta        Database file in Stata format
.do         Executable Stata syntax file
.ado        Stata command file (each command is written in an ado-file)
.hlp        Stata help file (has the same name as the command)
The first three types are generally interchangeable; however, I suggest using locals whenever possible.
A local is preferable to a global because it remains set only until the end of the do-file, while
globals are not cleared automatically at the end. This feature of globals can be useful in some
cases, but there is always the risk that a global defined in a previous do-file affects a later
one when it should not. Moreover, it could be argued that locals have a clearer invoking syntax
than scalars, since they always have to be enclosed between two very specific quote characters. On the
other hand, scalars have the advantage that all stored scalar values can easily be displayed at once
using the command scalar list.
Globals
To define a global and give it the value 888, you can just write
global myname=888
If you would like to define the global based on the results of another command, which stores the
value for instance as an r-class scalar, you might use
global N=r(N)
for instance. To use them later on you have to use the dollar-symbol before the name:
display "The total number of observations is: $N"
Locals
The use of locals is essentially identical to the use of globals, with a small difference in the way
you invoke them. Before the name, you have to write a single opening quote (the backtick `, ASCII
symbol 96) and right after the name a simple apostrophe (', ASCII symbol 39). The following example
first defines a numeric local containing the age of a person and then a string local with the name.
The third line displays a short text with this information:
local age=39
local name="Peter"
di "The age of `name' is `age' years"
Scalars
Scalars are mainly used for numerical values, and invoking them is at first glance easier,
since you just write their name. However, this can lead to confusion with variable names, a
potentially problematic issue.
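The naming conflict mentioned above can be made concrete with a small sketch. The names rate and total are purely illustrative assumptions; the point is only that a bare name is read as the variable, while scalar() forces the scalar:

```stata
* Minimal sketch; the names rate and total are illustrative assumptions
clear
set obs 5
generate rate = _n            // a VARIABLE called rate (values 1 to 5)
scalar rate = 0.25            // a SCALAR with the same name
scalar total = 100

display rate                  // a bare name is read as the variable (first observation)
display scalar(rate)          // scalar() forces Stata to use the scalar
scalar list                   // lists all scalars currently defined
```

Giving scalars names that cannot clash with variable names, or always wrapping them in scalar(), avoids this ambiguity.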
2.6 Data type and format
Data can be either numerical or string, which is probably the most basic distinction.
Numerical types
The following table gives a short overview of the different numerical data types in Stata :
Storage type   Minimum            Maximum           Closest to 0 (but ≠ 0)   Bytes
byte           -127               100               ±1                       1
int            -32767             32740             ±1                       2
long           -2147483647        2147483620        ±1                       4
float          -1.701 × 10^38     1.701 × 10^38     ±10^-38                  4
double         -8.988 × 10^307    8.988 × 10^307    ±10^-323                 8
String types
String variables are very simple: the types range from str1, holding one character, up to str244,
holding a maximum of 244 characters. The type is always at least as long as the longest entry in the
database. Assume that all but one observation contain no string and that the one remaining
observation contains 66 characters. In this case, 66 bytes of memory are needed for every
observation, so it is worth avoiding unnecessary string variables in a database.
How to choose the best type? This rather technical information on storage types and bytes
might be a bit frightening, but in practice it is hardly an issue, since there is a wonderful command
called compress, which analyzes every variable and converts it to the smallest storage type that holds
its values without loss.
Hint 1. When you work with large datasets, combine several of them and create new variables, it
is highly recommended to run the command compress just before saving your database in order to
avoid wasting memory.
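A small sketch of what compress does, assuming the example dataset auto that ships with Stata (loaded with sysuse); the variable flag is a made-up illustration:

```stata
sysuse auto, clear
generate double flag = foreign   // deliberately wasteful: 8 bytes for a 0/1 value
describe flag                    // storage type is double
compress                         // converts each variable to the smallest lossless type
describe flag                    // storage type is now byte
```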
Especially when working with micro data, databases can become very large. If you have to send
them to other people, you might want to put them into a zip file, which you can do in the
Windows Explorer, for instance. The size reduction you can achieve depends a lot on your
data, but it is not uncommon to reduce the size by up to 90%. Such high compression rates are
possible especially when you have a lot of missing values, and even more so when many long string
variables are empty.
Stata also supports date and time formats. See the help file for details.
Knowing these format types can be very useful in commands like est table, estout or tabout.
2.7 Missing values
[2] Normally non-missing values are excluded by default for obvious reasons.
3 Basic commands
Before starting with commands for econometric analysis, it is important to understand the generic
Stata command and to know several relevant commands to customize the Stata environment and to
load and save the data.
3.1 do-files
Commands should be written in a do-file, even though you could also type them directly in the
command line and press ENTER. The problem with doing so is that you could hardly reproduce
what you did before. Using a do-file is like writing a script: Stata goes through
it and performs each command you wrote. An important point is that Stata stops the
execution on error, meaning that every command ran without error if the do-file is executed
until the end. To start a new do-file, simply click on the button for the do-file editor (see figure 1)
and start writing.
To execute the file, you can click on the corresponding symbol in the editor, go through
the menu or press [CTRL]+[D] on your keyboard, which is the most convenient way to do it.
Hint 2. Pressing first [CTRL]+[S] and then [CTRL]+[D] might save you
a lot of time and nerves. The first simply saves your do-file and the second executes it. By always
doing it like that, you can be sure your do-file is saved on the hard drive even if Stata crashes.
3.1.1 Line breaks
If you have very long commands that do not fit on the screen and you do not want them on a single
line, you can break the line with three slashes (///) and continue the command on the next line.
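As an illustration of the three slashes, here is a minimal sketch. It assumes the example dataset auto shipped with Stata; the choice of regressors is arbitrary. Note that /// works in do-files, not when typing in the command line:

```stata
sysuse auto, clear
* One command spread over three lines with ///
regress price mpg weight ///
    length foreign ///
    if price < 15000
```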
Alternatively, you can temporarily activate the manual line break, meaning that you have to end
every command with a semicolon (;), as in many other programming languages. To activate
the manual line break, type
#delimit ;
and to come back to the normal line break
#delimit cr
3.1.2 Comments
You can and should comment your code, which can be done with a double slash // or a star *.
The double slash works at the beginning of a line or after a command on the same line, while the
star only works to declare a whole line as a comment. For longer comments, you can use the
combination /* my comment */.
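The three comment styles side by side, in a minimal sketch that assumes the example dataset auto shipped with Stata:

```stata
* A star at the start of a line turns the whole line into a comment
sysuse auto, clear      // a double slash also works after a command
// ... or at the beginning of a line
/* A longer comment
   can span several lines */
summarize price
```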
Hint 3. This is probably the most common hint: do comment your code as much and as clearly as
possible! This is not only useful when working with colleagues, but also when coming back to your
do-file after a while. It might be difficult to understand your own code when it is not commented!
1: clear all
where line 1 deletes everything from memory, line 2 disables the pause in the output, line 3 sets
the memory to 250 megabytes, line 4 changes the working directory to C:/data and
lines 5 and 6 define two globals (source and output) containing the paths of the source and
output folders you will use.
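Only the first line of the listing survives above, so the following is a reconstruction from the description rather than the original code. The two folder names are hypothetical, and the cd command will fail if the directory does not exist on your machine:

```stata
clear all                          // line 1: remove everything from memory
set more off                       // line 2: do not pause the output
set memory 250m                    // line 3: only needed before Stata 12, ignored by later versions
cd "C:/data"                       // line 4: working directory named in the text
global source "C:/data/source"     // line 5: hypothetical folder name
global output "C:/data/output"     // line 6: hypothetical folder name
```

The globals can then be used in paths such as $source/mydata.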
Hint 4. Always load and save databases within a do-file to avoid overwriting a database or working
on the wrong one.
Hint 5. Always save the database under a name different from that of the database you open at the
beginning of the do-file. Otherwise, you could not run the do-file twice, since the changes would
already be in the initial database.
If needed, you can load only a part of a database, say the variables x1, x2 and x3. The command
simply becomes
use x1 x2 x3 using $source/mydata
Moreover, you can impose conditions on the data, for instance with an if (see section 3.6).
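A runnable sketch of a partial load with a condition. It assumes the example dataset auto shipped with Stata and uses a temporary file in place of $source/mydata:

```stata
sysuse auto, clear
tempfile mydata
save `mydata'
* Load three variables, and only the foreign cars, from the saved file
use make price foreign if foreign == 1 using `mydata', clear
count
```

The if-condition is placed between the variable list and using.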
data. Everything you find after the comma is an option; some options might be mandatory.
Understanding the logic of a Stata command is crucial, and it allows you to fully understand the
Stata help I present in section 2.3.
r-class This is the most general class, including most of the commands for descriptive
statistics such as summarize or tabstat. Results of this type of commands are
stored in r(). To display the whole set of results, type return list after the
command. Note that these results remain active until the next r-class command.
e-class The e-class commands are normally econometric estimations such as regress. The
results are stored in e() and can be displayed typing ereturn list. As for the
r-class, the values in e() are stored until the next e-class command is executed.
Example
For instance, if you would like to use the mean of a variable as a local, you could use the following
code:
summarize income
local meanincome=r(mean)
di "The mean income is: `meanincome' $"
where the first line is the r-class command summarize, which displays a series of summary statistics
such as the mean and the standard deviation. The second line recovers the stored value [3] from r()
and saves it in the local meanincome. Finally, the third line of this not very practical example
displays a short text indicating the mean income. In the same way, the adjusted R² statistic of a
simple OLS regression can be recovered and stored in a scalar.
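The listing for the adjusted R² example is not reproduced here; a minimal sketch could look as follows. It assumes the example dataset auto shipped with Stata, and the scalar name r2adj is an arbitrary choice:

```stata
sysuse auto, clear
regress price mpg weight
scalar r2adj = e(r2_a)          // e(r2_a) holds the adjusted R-squared after regress
display "Adjusted R2: " r2adj
```

Typing ereturn list after regress shows all other results stored in e().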
Besides the e-class and r-class, there are also n-class and s-class commands; however, they are used
very rarely. An easy way to find out whether a command is r-class or e-class is to look in the help
file to see how the results are stored. Normally, a list of stored results is given towards the end
of the help file.
3.6 Conditions
Generally, Stata commands can be restricted to a subsample of the dataset. The condition can
take different forms and is normally placed after the varlist and before the comma separating
the command from its options. The most important condition is the if-condition, which starts with
the word if followed by a logical expression. For instance, if you want to run a regression only for
women and you have a dummy variable female taking the value 1 if the person is a woman, the
simple regression command becomes
[3] To see all the stored values, type return list just after the summarize command.
regress y x1 x2 x3 if female==1
A second way to limit the sample is the in-condition, which runs the command on a range of
observations, for instance the first 100 (regress y x1 x2 x3 in 1/100).
Hint 6. The in-condition can be very helpful when you write a do-file containing a computationally
intensive command and you would like to run the do-file to check whether it works. Limiting the
command to a small number of observations avoids losing time when checking the do-file.
Hint 7. A nice and sometimes more elegant alternative to the if-condition is to multiply your
expression by a logical statement. Imagine you want to compute the value of a variable only for
|Z| < 1; you can use the following command
gen kernel=(1-abs(Z)) * (abs(Z)<1)
where (abs(Z)<1) is a logical element returning 1 if the condition is satisfied and 0 otherwise; hence
the variable kernel will take the value 0 (not missing!) whenever the condition is not satisfied.
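The difference between the two approaches in Hint 7 can be seen on a tiny artificial dataset. The variable names k_if and k_logic are illustrative assumptions:

```stata
clear
set obs 5
generate Z = (_n - 3) / 2                        // Z = -1, -0.5, 0, 0.5, 1
generate k_if    = 1 - abs(Z) if abs(Z) < 1      // missing outside the window
generate k_logic = (1 - abs(Z)) * (abs(Z) < 1)   // zero outside the window
list
```

Which behavior is preferable depends on whether later commands should skip those observations (missing) or count them with a zero.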
Hint 8. The order of variables in Stata generally follows no specific logic, and new variables are
simply added at the end of the table. With the command order you can arrange the variables according
to your needs, and aorder orders all variables alphabetically.
3.9 Matrices
Stata has two matrix systems: one that is implemented directly in the Stata environment, and
Mata, which is to some extent a language of its own. In this section, I deal only with the standard
matrix package of Stata. Note that Stata is probably not the best software for matrix algebra,
so do not expect the quality of MATLAB or R when dealing with matrices.
defines the matrix
$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$
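The command that produces this matrix is not reproduced above. A sketch that would define the same matrix A with matrix define is:

```stata
* Commas separate columns, backslashes separate rows
matrix define A = (1, 2, 3 \ 4, 5, 6)
matrix list A
```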
Combining matrices
You can also create a matrix as a function of other matrices (as long as the dimensions match):
matrix define C = A + B
where A and B are two matrices of the same dimensions. The following table provides an overview of
matrix operations available:
Operator             Symbol
transpose            '
negation             -
Kronecker product    #
division by scalar   /
multiplication       *
subtraction          -
addition             +
column join          ,
row join             \
Function         Description
colsof(M)        Returns the number of columns in matrix M
rowsof(M)        Returns the number of rows in matrix M
issymmetric(M)   Returns 1 if M is symmetric, otherwise 0
det(M)           Returns the determinant of matrix M
trace(M)         Returns the trace of matrix M
I(n)             Returns the identity matrix of dimension n
inv(M)           Returns the inverse matrix of M
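The functions in the table can be tried out on a small symmetric matrix. The matrix M and the names Minv and P are illustrative assumptions:

```stata
matrix define M = (2, 1 \ 1, 3)
display colsof(M)            // number of columns
display rowsof(M)            // number of rows
display issymmetric(M)       // 1 because M equals its transpose
display det(M)               // determinant
display trace(M)             // sum of the diagonal elements
matrix define Minv = inv(M)
matrix define P = M * Minv   // should reproduce I(2) up to rounding
matrix list P
```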
which will create four new variables named mymat1, mymat2, mymat3 and mymat4 respectively.
3.10 Factor variables and time series operators
Often you do not want to enter variables in regressions just as they are, but in some specific
form. This is where the so-called factor variables come into play. Basically, these are
prefixes for the variables that tell Stata what to do with them.
Instead of defining interaction terms in a new variable, using the product of the two in the case of
continuous variables or combinations for categorical data, you can use the symbol #. However, simply
writing
reg y x z x#z
works only if x and z are categorical variables. In this case, all possible combinations are included
as dummy variables. To be clear about what Stata does, I would however always write
reg y x z i.x#i.z
which is not strictly needed but recommended. This is especially the case because the #-symbol
can also be used to create squared values and continuous interaction terms. Assume now that both x
and z are continuous and you would like to estimate the model
$y = \alpha + \beta_1 x + \beta_2 z + \beta_3 xz + \varepsilon$
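The command for this model is not reproduced here. With factor-variable notation, one way to estimate it is to mark both variables as continuous with the prefix c.; the sketch below assumes the example dataset auto shipped with Stata, with price, mpg and weight standing in for y, x and z:

```stata
sysuse auto, clear
* c. marks a variable as continuous; c.mpg#c.weight is the product mpg*weight
regress price mpg weight c.mpg#c.weight
* Shorthand: ## includes the main effects and the interaction
regress price c.mpg##c.weight
```

Both commands estimate the same three slope coefficients.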
Hint 9. Using loops instead of writing the same code several times with only slight differences helps
to reduce errors. Generally, the shorter your code is, the better you did your job!
The example first defines a local drawn from a uniform distribution. Then it displays a text
depending on the value. The first if-condition is TRUE when the value is higher than 0.99, the second
element is an else if condition (with a space), which is TRUE when the value is below 0.01, and the
last element is the else-statement, which is executed if the two conditions before returned FALSE.
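The listing described above is not reproduced; a sketch that matches the description could look like this. The local name u and the displayed texts are assumptions:

```stata
local u = runiform()                  // one draw from the uniform distribution on (0,1)
if `u' > 0.99 {
    display "Very high draw: `u'"
}
else if `u' < 0.01 {
    display "Very low draw: `u'"
}
else {
    display "Ordinary draw: `u'"
}
```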
This example simply displays all numbers from 1 to 10 on the output screen.
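The loop itself is missing here. One way to display the numbers from 1 to 10 is a while loop, sketched below; this is an assumption about the original, which may have used another loop command:

```stata
local i = 1
while `i' <= 10 {
    display `i'
    local i = `i' + 1     // without this line the loop would never end
}
```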
The example takes every element of the local, stores it temporarily in a new local called word and
uses it in the commands. You can also loop through variables.
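Neither listing is reproduced here, so the following is a sketch of both kinds of foreach loop on a small artificial dataset. The local fruits and the variables x1, x2 and y are illustrative assumptions:

```stata
clear
set obs 10
generate x1 = _n
generate x2 = _n^2
generate y  = 3

* Loop over the words stored in a local
local fruits apple banana cherry
foreach word of local fruits {
    display "`word'"
}

* Loop over all variables whose name starts with x
foreach var of varlist x* {
    summarize `var'
}
```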
where the loop performs the summarize command for each variable starting with the letter x. The
general syntax of foreach is
foreach runner of arraytype array { commands }
where the elements runner, arraytype and array can be changed. The runner refers to the local that
changes in every iteration, the arraytype indicates what kind of array the loop should go through
(for instance local, global or varlist) and the array contains the elements to loop through.
where the first example displays the series 1, 2, 3, 4 and the second 4, 4.5, 5.
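The two listings this sentence refers to are not reproduced. With forvalues, the two series could be generated as sketched below; the local series is an addition of mine that collects the values of the second loop:

```stata
* First series: 1, 2, 3, 4
forvalues i = 1/4 {
    display `i'
}

* Second series: from 4 to 5 in steps of 0.5
local series
forvalues x = 4(0.5)5 {
    display `x'
    local series `series' `x'
}
display "`series'"
```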
4 System commands
4.1 System commands
There are some Stata commands that you will not use very frequently but that might be of
interest in some situations. The system commands are related neither directly to the data nor to the
econometric analysis, but they allow you to adapt Stata to your needs or to update it.
Hint 10. The settings are saved locally on the machine; thus, changing them permanently on a server
version of Stata might be useless.
Hint 11. Updating a server-based Stata must be done on the server and not on the local machine.
5 Data handling
5.1 Variable manipulation
In Stata it is very easy to create, label and rename variables, which helps you understand your data
much better afterwards. In this section, I present some useful commands.
gen teenager=0
replace teenager=1 if age>10 & age <=20
generates a dummy variable taking the value 1 if a person's age is greater than 10 and smaller than
or equal to 20. A simpler way to define this variable is to include the condition in the first statement.
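The one-line version is not reproduced here; a sketch of it on an artificial age variable (ages 1 to 30, an assumption for illustration) is:

```stata
clear
set obs 30
generate age = _n                               // ages 1 to 30, for illustration only
generate teenager = (age > 10 & age <= 20)      // 1 if the condition holds, 0 otherwise
tabulate teenager
```

An observation with missing age gets the value 0 here, exactly as in the two-step version, because the part age <= 20 is false for missing values.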
This is probably even a good solution in this case, but imagine you want to change the values of a
categorical variable with 10 values. This is when recode becomes more efficient. To make the
aforementioned change, the command would be
recode gender (3=.)(2=1)(1=0)
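A tiny artificial example shows that each observation is changed by only one rule, so a value recoded from 2 to 1 is not recoded again to 0. The copy gender_old is an illustrative assumption:

```stata
clear
input gender
1
2
3
end
generate gender_old = gender        // keep a copy to compare
recode gender (3=.) (2=1) (1=0)
list gender_old gender
```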
Variable labels The variable label is simply a text describing a variable in general. It appears in
the variable overview and helps you understand the content of a variable. To define or redefine
(overwrite) a variable label, simply write
label variable income "Income of the person in US$"
to label the variable income with the text between the two quotes.
Hint 12. Try to keep variable names short and describe the content in the variable label. For
instance, if you have a variable with the log annual income per capita, avoid variable names like
log_annual_income_per_capita and rather use lincpc with the corresponding variable label "Log annual
income per capita".
Value labels
Besides the variable label, Stata can also assign value labels, meaning that each value of
a variable is labeled and the label is shown in the database instead of the actual value. This is
useful for categorical data. Let us illustrate value labels with a small example. Imagine a
variable lstatus taking three values: 1 for people still in education, 2 for people in the active labor
force and 3 for retired people. Using for instance the command tab, we could display the frequencies,
but the table would show only the codes 1, 2 and 3 and would not be very self-explanatory.
Therefore it is useful to label the values. The first step is to define the label. This would be
label define lstatlabel 1 "In education" 2 "Labor force" 3 "Retired"
where we name the label lstatlabel and then indicate each value followed by the corresponding label
in quotes. This command only saves the label; we still have to assign it to the variable.
To assign the label to the variable, we again use the command label:
label value lstatus lstatlabel
The table with the frequencies now shows the labels instead of the codes.
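The frequency tables are not reproduced here. The following sketch builds a small artificial lstatus variable (the six observations are made up) so that you can compare the output of tabulate before and after labeling:

```stata
clear
input lstatus
1
1
2
2
2
3
end
tabulate lstatus                    // shows the raw codes 1, 2 and 3
label define lstatlabel 1 "In education" 2 "Labor force" 3 "Retired"
label values lstatus lstatlabel
tabulate lstatus                    // shows the labels instead of the codes
```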
Hint 13. It is possible to assign the same label to different variables; for instance, you can define
a label called dummy and assign it to all dichotomous variables. To label the
variables d1, d2 and d3 with the label dummy, we would use label value d1 d2 d3 dummy.
Once the variables are labeled, it is also possible to extract these labels automatically. For more
details see section 11.2.6.
Note that this does not alter the content of the variable nor any kind of labels.
5.1.7 drop - Deleting variables
To delete one or several variables use
drop id gender age
The example above simply drops the three variables id, gender and age.
5.2 Describing and sorting the data
where Stata first indicates how many observations are available in the dataset (regardless
of missing values!) and how many variables you have. The value size refers to the memory use of
the database (in this case almost nothing is used). After this information on the whole dataset, the
information on each variable is displayed: first the name, then the storage type and the display
format, followed by the assigned value label and the variable label.
will sort the data according to the variable year, starting with the highest value.
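The command this sentence describes is not reproduced. A descending sort with gsort could be sketched as follows on three made-up observations:

```stata
clear
input year income
2000 5.6
2010 9.1
2005 7.2
end
gsort -year          // the minus sign requests descending order
list
```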
5.2.5 order - Sorting the variables
While sort sorts your data vertically, the command order allows you to sort the data horizontally,
meaning that you can change the order of the variables. Normally this is not a very important
feature, but there are situations when it might be necessary (e.g. to have the identifier of the
observations at the beginning for easier use, or to use the placeholder symbol - in estimation
commands). By writing
order id name country
the variables id, name and country will be put at the beginning of the dataset, while the order of
all remaining variables is unchanged.
5.2.6 aorder - Sorting the variables alphabetically
Like the command order, aorder allows you to change the order of the variables in your dataset,
but in alphabetical order. By simply writing
aorder
without any varlist, all variables will be ordered alphabetically, where special symbols like
underscores (_) come first, followed by capital letters and lower-case letters.
If you indicate a varlist like
aorder id name country
then the variables id, name and country will come first, followed by all remaining variables in
alphabetical order.
5.3 Joining several databases
Especially when working with micro data, it is often necessary to merge several databases into one.
Stata is very efficient in this kind of data handling. There are two main ways of combining two or
more datasets into one. The first situation is when we have two datasets with
different variables for the same individuals, firms or households. In this case we use the command merge.
The second situation is when we have two databases with the same variables but for different
units, so that we would like to add one to the other; in this case we use append.
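Both situations can be sketched with the auto dataset shipped with Stata. The split into temporary files is only an illustration, and the merge 1:1 syntax assumes Stata 11 or later:

```stata
* Situation 1: different variables for the same units -> merge
sysuse auto, clear
keep make price
tempfile prices                 // temporary file, deleted when the do-file ends
save `prices'
sysuse auto, clear
keep make mpg
merge 1:1 make using `prices'   // adds price to the data in memory
drop _merge

* Situation 2: same variables for different units -> append
sysuse auto, clear
keep if foreign == 1
tempfile foreigncars
save `foreigncars'
sysuse auto, clear
keep if foreign == 0
append using `foreigncars'      // stacks the foreign cars below the domestic ones
count
```

Because tempfile creates local macros, the block has to be run as a whole from a do-file.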
Merging result
Once the merge of the databases has been performed, Stata displays the results and stores information in
the new variable _merge by default. The generated variable is coded in the following way:
1 the observation was only found in the master data (no merge)
2 the observation was only found in the using data (no merge)
3 merge successful, observation found in both
4 observation found in both, missing values updated
5 observation found in both, conflict in some variables
Codes 4 and 5 can only occur when the option update is used.
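A typical way to inspect and use this variable is sketched below with the auto dataset shipped with Stata. Dropping the first five observations is an artificial step, used only to create unmatched cases:

```stata
sysuse auto, clear
keep make mpg
tempfile mileage
save `mileage'

sysuse auto, clear
keep make price
drop in 1/5                      // master data now lacks five cars
merge 1:1 make using `mileage'
tabulate _merge                  // overview of the merge result
keep if _merge == 3              // keep only observations found in both
drop _merge                      // needed before the next merge
```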
entry (to save memory) and these values might differ from one database to the other.
5.4 Changing structure of a database (panel data)
long:                      wide:
id   year   income         id   income2000   income2005
1    2000   5.6            1    5.6          7.2
1    2005   7.2            2    8.3          9.1
2    2000   8.3
2    2005   9.1
The command reshape can be used to change your data easily from one format to the other. If you
want to reshape from wide to long, then use
reshape long income, i(id)
where a new variable _j will be created with the years. If you want to call it year right away, you
can additionally include the option j(year).
To get back to the wide form, write
reshape wide income, i(id) j(year)
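The small table above can be typed in directly to try both directions. This is only a sketch using the numbers from the illustration:

```stata
clear
input id income2000 income2005
1 5.6 7.2
2 8.3 9.1
end
reshape long income, i(id) j(year)   // wide -> long, year is created
list, sepby(id)
reshape wide income, i(id) j(year)   // long -> wide again
list
```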
6 SUMMARY STATISTICS AND GRAPHICS
6 SUMMARY STATISTICS AND GRAPHICS 6.2 Graphs and plots
-----------+---------------------------------+----------
Total | 5,283 1,316 130 | 6,729
where the option mi (missing) indicates that you also want the missing values to be considered as
a category. The total number of observations is 6729, of which 2608 are non-indigenous men and, for instance, 12
are indigenous people whose gender we do not know. For 2 people we do not have any information.
corr x1 x2 x3
and Stata will provide you with a result like
| x1 x2 x3
-------------+---------------------------
x1 | 1.0000
x2 | -0.0185 1.0000
x3 | 0.7128 0.6881 1.0000
If you prefer the covariance matrix to the correlation matrix, add the option covariance
to the command:
corr x1 x2 x3, cov
which displays the following matrix:
| x1 x2 x3
-------------+---------------------------
x1 | 1.05657
x2 | -.018846 .987245
x3 | 1.03772 .9684 2.00612
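The same commands can be tried on the auto dataset shipped with Stata. The variable names below belong to that example dataset. pwcorr, which is not discussed above, is shown only as a related command that adds significance levels:

```stata
sysuse auto, clear
correlate price mpg weight               // correlation matrix
correlate price mpg weight, covariance   // covariance matrix
pwcorr price mpg weight, sig             // pairwise correlations with p-values
```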
1 | * * *
| ******* *
| * * ***
| ** * *
| * * **
| * ** ***
| **
| * *
| ** *
|
.057795 + *
+----------------------------------------------------------------+
4.16933 x3 31.902
I agree that this way of visualizing data is probably not the state of the art in the age of vector
graphics; however, it is a fast way to get a first impression. A newer, and much more advanced, way
to visualize data is the graphics command, which I will explain in the next section.
Syntax: kdensity x1, normal
[Figure: Kernel density estimate; x-axis: x1 (0 to 20), y-axis: Density (0 to .15)]
[Figure: x-axis: x1 (0 to 20), y-axis: Frequency; legend: Men, Women. Note: Artificial data]
fit" 3 "Observations"))
[Figure: y-axis: x1 (0 to 10), x-axis: x2 (5 to 25)]
[Figure: x-axis from 0 to 25. Notes: Epanechnikov kernel using bandwidth h=1.25]
Hint 14. Using the combination of curly brackets {} and the &-symbol, you can use Greek letters in
the text you add to graphics. The last example contains text written in Greek letters. Here are some examples:
Symbol   Stata-Code
γ        {&gamma}
φ        {&phi}
Φ        {&Phi}
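A minimal sketch of how such codes are used inside graph text, based on the auto dataset shipped with Stata. The title and the note are purely illustrative and carry no statistical meaning:

```stata
sysuse auto, clear
twoway scatter price mpg, ///
    title("Price and mileage") ///
    note("Illustrative text only: {&gamma} = 0.5 and {&Phi}(x)")
```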
produces the graphic hereafter. The first line indicates which variable we want to
map (myvalue) and which database contains the coordinates. The id(id) indicates the identifier
of the unit (here Swiss cantons) and clmethod(custom) is used to customize the thresholds between
categories. These are indicated in clbreaks(....) and the option fcolor selects the color set to be
used. An overview of the color sets can be found in the help file. ocolor is used to define the color
of the border and plotregion(icolor(none)) defines the background (here empty). The following
commands are used to customize the legend: first it is set to be displayed, then the style is selected and
finally the values are changed to whatever text you want.
This fully vector-based map was exported from Stata with graph export as explained in sections 6.2.4
and 8.1.1.
When you produce a graphic in Stata, it is generally displayed in a new window. You can easily save
graphics in various formats using the command graph export, followed by the name of the file (with
extension!). Use the option as(format) to indicate the export format and replace to overwrite
an existing file. If you do not use the as() option, the file extension will be used to
determine the format. The supported graphic formats under Windows are:
.ps PostScript
.eps EPS (Encapsulated PostScript)
.wmf Windows Metafile
.emf Windows Enhanced Metafile
.png PNG (Portable Network Graphics)
.tif TIFF
I suggest PNG for standard use, since the files are relatively small and all standard
programs can read them.
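As a sketch, the following lines draw a simple graph from the auto dataset shipped with Stata and export it twice. The file names are hypothetical, and the files are written to the current working directory:

```stata
sysuse auto, clear
scatter price mpg
graph export "scatter.png", replace              // format taken from the extension
graph export "scatter_vector.eps", as(eps) replace  // format stated explicitly
```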
Hint 15. If you work with LaTeX and would like to use Stata graphics, you should export the
graphics as vector graphics in order to get the best possible quality. See section 8 on how to export
graphs to LaTeX and how to use them in LaTeX without loss of quality.
7 Econometric analysis
The goal of this section is not to describe all possible estimation commands in detail, but rather to
give a short overview of the commands, helping you to find the routine you need. If you wish to learn more
about a command and its options, you should refer to the help file, which in many cases includes
examples. Type
help mycommand
to display the help file of the command mycommand.
option Description
robust Computes heteroskedasticity-consistent standard errors according to
White (1980). This option is available in many estimation commands
and can be invoked directly by typing for instance reg y x1 x2,
robust instead of reg y x1 x2,vce(robust).
noconstant Performs the OLS estimation without constant term.
beta Provides standardized coefficients defined as
$\hat{\beta}^{*} = \hat{\beta}\,\dfrac{\sigma_x}{\sigma_y}$
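The three options can be tried on the auto dataset shipped with Stata. The model itself is arbitrary and only serves as an illustration:

```stata
sysuse auto, clear
regress price weight length, vce(robust)   // heteroskedasticity-consistent standard errors
regress price weight length, noconstant    // no constant term
regress price weight length, beta          // standardized coefficients
```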
7.1 Continuous outcome 7 ECONOMETRIC ANALYSIS
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | 1.058603 .0324676 32.60 0.000 .9948909 1.122316
x2 | .9816983 .0326528 30.06 0.000 .9176222 1.045774
_cons | .9089836 .031962 28.44 0.000 .8462631 .971704
------------------------------------------------------------------------------
where the upper left panel provides ANOVA-like information of the sum of squares, degrees of freedom
and mean sum of squares. The upper right panel provides some statistics of the model fit and the
main panel thereafter is the actual estimation. Each row refers to one regressor, starting with the
coefficient (coef.), followed by the standard errors (Std. Err.), the t-statistic (t), the p-value (P > |t|)
and the 95% confidence interval.
The following table provides a short overview of other estimators used for continuous variables.
Command Description
cnsreg Constrained linear regression model. Define first a constraint using the command
constraint:
constraint 1 x1=x2
and then use the command for the constrained linear regression model:
cnsreg y x1 x2 x3,constraints(1)
gmm Generalized method of moments estimation. See the help file for details.
heckman This command allows you to perform the OLS estimation with selection following
Heckman (1976). The command is used similarly to regress, but you include the
option select(model), where model refers to the selection model, which has a dummy
variable as dependent variable, taking the value 1 when the observation is selected
in the main equation and 0 otherwise.
ivreg Instrumental variable (IV) estimator. The syntax is relatively easy: assume that you
regress y on x1, x2 and x3, where x1 and x2 are assumed to be endogenous and
are therefore instrumented by z1, z2 and z3. Then the command is simply
ivreg y (x1 x2=z1 z2 z3) x3
Using the options 2sls, liml and gmm you can choose the estimator to be the two-stage
least squares, the limited-information maximum likelihood or the generalized
method of moments estimator respectively.
reg3 Three-stage least squares (3SLS) estimator for simultaneous equations. See the help
file for details.
sureg Zellner's seemingly unrelated regression. See the help file for details.
tobit Tobit estimation for a censored dependent variable. The syntax is similar to that of
regress, where you add the options ul(#) and/or ll(#) to indicate the upper and
the lower limit respectively. For example, if you have data on income that is censored
at 999999 due to data collection, then the command would be
tobit income educ exper, ul(999999)
treatreg Treatment-effects model
truncreg Truncated regression model. Similar to the tobit model but for truncated variables.
For instance the income is truncated at zero, since we cannot have negative incomes:
truncreg income educ exper, ll(0)
7 ECONOMETRIC ANALYSIS 7.2 Categorical outcome
selmlog This is a user-written command to perform OLS estimation with selection bias correction
based on the multinomial logit model. The command features correction terms
according to Lee (1983), Dubin and McFadden (1984) and Dahl (2003).
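Some of the commands in this table can be tried on the auto dataset shipped with Stata. The models below are arbitrary illustrations without economic meaning. Note also an assumption about the Stata version: in Stata 10 and later the IV estimator is called ivregress, with the estimator (2sls, liml or gmm) given as a subcommand.

```stata
sysuse auto, clear

* constrained regression: coefficients of weight and length forced to be equal
constraint 1 weight = length
cnsreg price weight length mpg, constraints(1)

* tobit, treating mpg as censored from above at 30
tobit mpg weight length, ul(30)

* IV regression (Stata 10+): weight instrumented by length and turn
ivregress 2sls price mpg (weight = length turn)
```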
7.4 Panel data 7 ECONOMETRIC ANALYSIS
To get a more detailed description of the data's pattern, consider the command xtdescribe.
7 ECONOMETRIC ANALYSIS 7.5 Time series
First general information on the panel structure is indicated, for instance the variables identifying
both dimensions with their respective value pattern. You will also be informed if the combination of
dimension identifiers identifies each observation uniquely, a very important issue for further analysis.
Finally the most common patterns in the data are displayed, where the 1 indicates that the value
is present and the dot (.) refers to missing data. In our example we have 49 observations where
information is available for all 5 years in the sample, followed by the second most common pattern
where in all but the third year information is available.
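To see such a pattern table yourself, a tiny made-up panel is enough. In the sketch below the unit with id 2 has no observation for 2001, and the last lines show the lag and difference operators that become available once the panel is declared:

```stata
clear
input id year income
1 2000 5.6
1 2001 6.1
1 2002 7.2
2 2000 8.3
2 2002 9.1
3 2000 4.0
3 2001 4.4
3 2002 4.9
end
xtset id year                     // declare the panel structure
xtdescribe                        // participation patterns
generate income_lag  = L.income   // value of the previous year
generate income_diff = D.income   // change with respect to the previous year
list, sepby(id)
```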
Hint 16. See section 3.10 on factor variables and time-series operators to learn more about the
use of lagged, forwarded and differenced variables in panel models.
For the moment, no more time-series-specific information is available in this document. For further
details refer to section 7.4 or check out
http://dss.princeton.edu/online_help/stats_packages/stata/time_series_data.htm
7.7 Post estimation
sum x1
local mean=r(mean)
local sd=r(sd)
local t=`mean'/`sd'
display "The t-statistic is: `t'"
First, the summary statistics of the variable x1 are computed with the command summarize. This
command is a so-called r-class command, meaning that the results are stored in r(name). To see all
saved values in r(), type return list. The second and third lines simply copy the values into
user-defined local macros. The fourth line then computes a new value from the stored data and the last
line displays the result.
Some results are stored in matrices (e.g. estimation coefficients and variance-covariance matrices; see
section 3.9.5).
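The difference between r-class and e-class results can be sketched with the auto dataset shipped with Stata. The matrix names b and V are arbitrary choices:

```stata
sysuse auto, clear

summarize price
return list                 // all results stored in r()
display "Mean: " r(mean)

regress price weight
ereturn list                // all results stored in e()
matrix b = e(b)             // coefficient vector
matrix V = e(V)             // variance-covariance matrix
matrix list b
display "Coefficient of weight: " _b[weight]
```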
This command will be performed on the active estimation, which is in general the last estimation
performed.
7 ECONOMETRIC ANALYSIS 7.8 Saving and reusing estimations
to save the current estimation under the name myestimation1. This is very useful to produce estimation
tables afterwards (see est table (section 7.8.4) and estout (section 8.2.1)).
This example simply performs two OLS regressions with the command regress on two and three
independent variables respectively and saves the results in reg1 and reg2. Now we would like to
display the two regressions with
est table reg1 reg2
which yields the following display:
----------------------------------------
Variable | reg1 reg2
-------------+--------------------------
weight | 4.6990649 4.7215989
length | -97.960312 -102.66518
trunk | 28.376437
_cons | 10386.541 10812.329
----------------------------------------
This is not yet the nicest estimation table ever seen! With a few additional options, you can make it
much better:
est table reg1 reg2, stats(N r2_a) b(%6.3f) star(0.1 0.05 0.01)
and you already get a useful display like:
----------------------------------------
Variable | reg1 reg2
-------------+--------------------------
weight | 4.699*** 4.722***
length | -97.960** -102.665**
trunk | 28.376
_cons | 1.0e+04** 1.1e+04**
-------------+--------------------------
N | 74 74
r2_a | 0.329 0.320
----------------------------------------
legend: * p<.1; ** p<.05; *** p<.01
where you immediately see the significance levels of the coefficients and some statistics of the estimation.
Let us have a look at the different options:
star(n1 n2 n3)   Allows you to define the significance levels of the stars. In our example the
                 standard 10%, 5% and 1% were chosen.
b(format)        Allows you to format the output of the coefficients and to limit the
                 number of digits after the decimal point. See section 2.6.4.
stats(args)      Enables you to display some estimation-relevant statistics:
                 N     Number of observations
                 aic   Akaike's information criterion
                 bic   Schwarz's Bayesian information criterion
                 chi2  Value of χ²
                 ll    Log likelihood
se, t, p         You can display in addition to the coefficients the standard errors (se),
                 the t-statistics (t) and the p-value (p). Simply indicate the desired
                 statistic together with its format, e.g. se(%6.3g).
keep(coeflist)   Only the coefficients indicated in coeflist will be shown.
drop(coeflist)   All coefficients in coeflist will be dropped from the table.
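The whole workflow can be sketched with the auto dataset shipped with Stata, which appears to be the data behind the tables above (an assumption based on the variable names). Note that Stata does not allow star() to be combined with se, t or p, so the second table requests standard errors without stars:

```stata
sysuse auto, clear
regress price weight length
estimates store reg1
regress price weight length trunk
estimates store reg2

* coefficients with stars and some statistics
estimates table reg1 reg2, stats(N r2_a) b(%6.3f) star(0.1 0.05 0.01)

* coefficients with standard errors, restricted to two regressors
estimates table reg1 reg2, b(%6.3f) se(%6.3f) keep(weight length)
```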
7 ECONOMETRIC ANALYSIS 7.9 Marginal effects
where it is important to use the factor variable (i.sex instead of sex) in the probit. Otherwise, the
command margins does not recognize sex as a dichotomous variable and uses the procedure
for continuous variables. The option dydx(*) tells Stata that we want to estimate
the marginal effect for all regressors. If you replace the * by some regressors, margins computes the
marginal effects only for the indicated variables. The option atmeans computes the marginal
effect at the mean instead of the average marginal effect (see next paragraph).
where $\Phi(\cdot)$ is the cumulative normal distribution function, X is the whole matrix of covariates and $x_j$ indicates
the variable for which we are computing the marginal effect.
It is also possible to compute the marginal effect at different points of the distribution. Let us consider
some examples, always assuming that the model we estimated before the margins command was
probit outcome i.sex age
In this case
margins, dydx(*)
computes the marginal effects for each observation at its observed levels of the regressors. In a second
step the average of all these marginal effects is computed giving you the average marginal effect. In
contrast,
margins, dydx(*) atmeans
defines a person that has the average characteristics and computes for this person the marginal effect
giving you the marginal effect at the mean.
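Both variants can be tried on the nlsw88 dataset shipped with Stata. The model below (union membership explained by marital status, age and schooling) is an arbitrary illustration and not the model used in this guide:

```stata
sysuse nlsw88, clear
probit union i.married age grade

margins, dydx(*)              // average marginal effects
margins, dydx(*) atmeans      // marginal effects at the mean
margins, dydx(married)        // only for one regressor
```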
Computational issues
Sometimes the computation of marginal effects can be very time-consuming. If the computation
takes too much time, you might consider not computing the standard errors (if they are not absolutely
needed), which should make the computation much faster. This can be done by adding nose as an
option.
you will get the estimated coefficients of the probit and not the marginal effects. This is because
margins does not overwrite the e-values of the probit estimation. If you want margins to do so, you
just have to add the option post. Hence, the correct syntax for the above example would be
[10] Alternatively, you can also use the option by(varlist), giving you exactly the result
Actually, the option margin is no longer needed in the command estout, however, I leave it here,
because you might have other regressions in the estout that still need the option.
The first line loads the database; on the second line I estimate a probit model with an interaction
term between two dummy variables (see 3.10 for details on factor variables). The third line computes
the four possible conditional probabilities, where the option atmeans sets all other control variables to
their respective means and the option post saves the results for later use. Finally, the fourth
line performs a test whether the marginal effect of the interaction term is significant or not.[11]
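The original four lines are not reproduced here. A sketch of the same idea, assuming the nlsw88 dataset shipped with Stata and an arbitrary model with the two dummies married and collgrad, might look as follows. The test compares the effect of married among college graduates with its effect among non-graduates:

```stata
sysuse nlsw88, clear
probit union i.married##i.collgrad age
margins married#collgrad, atmeans post   // four predicted probabilities, saved in e()
matrix list e(b)                         // shows the names of the four results
test 1.married#1.collgrad - 0.married#1.collgrad = ///
     1.married#0.collgrad - 0.married#0.collgrad
```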
[11] See Ai and Norton (2003) to see why the test on the coefficient is not useful in non-linear models.
8 STATA MEETS LATEX
[Figure: pie chart with the categories male, female and unknown; slice labels: 10.7%, 44.4%, 44.9%]
8 STATA MEETS LATEX 8.2 Exporting estimation results to LaTeX
1: estout reg1 reg2 using regtable.tex, replace style(tex) margin cells(b(fmt(%9.3f) star) se(par) ) ///
2: starlevels(* .1 ** 0.05 *** .01) ///
3: stats(N r2_a , labels(N "Adj. $ R^2$ ") fmt(%9.0f %6.3g)) mlabels("Model 1" "Model 2") ///
4: collabels(none) varlabels(_cons "Constant" weight "Weight" length "Length" trunk "Trunk space") ///
5: order(length weight trunk _cons) ///
6: prehead(\begin{center}\begin{tabular}{l*{@M}{l}} "\hline") ///
7: posthead("\hline") ///
8: prefoot("\hline") ///
9: postfoot("\hline\end{tabular}\end{center}")
              Model 1        Model 2
Length        -97.960**      -102.665**
              (39.175)       (42.587)
Weight        4.699***       4.722***
              (1.122)        (1.132)
Trunk space                  28.376
                             (97.058)
Constant      10386.541**    10812.329**
              (4308.159)     (4574.211)
N             74             74
Adj. R^2      .329           .32
The command might look a bit scary at first, but you will get used to it very quickly. Moreover,
once you have defined your style correctly, you can copy and paste large parts of the code from one table to
the next. Let's go through the code in detail. The first line corresponds mainly to what we saw
for the command est table. First we include with reg1 reg2 the names of the regressions we want
to display, followed by the file to create. After the comma we use the option replace in order to be
able to re-run the script (otherwise there is an error) and style(tex) to declare that we want
LaTeX output. The option margin does not make much sense in this example, since in the OLS
estimation all estimated effects are marginal effects. However, if you estimate for instance a dprobit
model, this option is needed to display marginal effects rather than the coefficient estimates. The
option cells is used to choose all the coefficient-related information to be displayed. In this example
the betas (b) are formatted with the %9.3f format with stars included, and for the standard errors (se)
I use the option par to get the standard errors in parentheses.
The second line simply defines the significance levels of the stars and, in contrast to the option seen in
est table, you can also choose symbols other than stars.
The third line first declares the statistics of the estimation to be displayed, followed by the label of
the statistic and the format. The option mlabels allows you to give each regression a name; if you do
not specify this option, reg1 and reg2 will be displayed.
Line 4 formats the first column, where the variables are indicated. Without any of these options the
variable name, as it appears in Stata, will be displayed. The first option, which is not active here,
would allow you to use the labels of the variables rather than the names, and the option varlabels
allows you to change each variable name separately for the table output without changing anything
in your data. In this example I just capitalized all the names and changed the constant term to
a nicer-looking word.
The option order on line 5 allows you to change the order of the variables in the table, and the remaining
4 lines are all related to the LaTeX code. They allow you to put some free text (or LaTeX commands)
at very specific places in the table. The prehead is placed before the names of the regressions. I use it here
to include directly the tabular environment of LaTeX using the variable @M, which is the number of
models, computed directly by the command. Hence, you do not have to adapt the number of columns
if you add a regression; this will be done automatically. If you want to include your table in a table
environment of LaTeX, you could start it here as well. All the information will simply be transferred
to the generated LaTeX code. The posthead is placed between the regression names and the first coefficients,
the prefoot between the coefficients and the statistics, and the postfoot after
the statistics.
This is only one of many possible examples of the command estout and I encourage you to consult
the manual of the command.
To include the table in your LaTeX file, simply use
\input{regtable}
The big advantage is that you can make small changes directly in your do-file and the table will be
updated in your LaTeX paper. This command should definitely help you to avoid copying estimation
results by hand to your LaTeX file, a process that carries some risk of making errors.
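For a first try, a reduced version of the command is often easier to handle. The following sketch assumes that the user-written package estout is installed and uses the auto dataset shipped with Stata. It only uses options discussed above and writes regtable.tex to the current working directory:

```stata
* ssc install estout        // run once if the package is not installed yet
sysuse auto, clear
regress price weight length
estimates store reg1
regress price weight length trunk
estimates store reg2

estout reg1 reg2 using regtable.tex, replace style(tex) ///
    cells(b(fmt(%9.3f) star) se(par)) ///
    stats(N r2_a, fmt(%9.0f %6.3g)) ///
    mlabels("Model 1" "Model 2")
```

Starting from this skeleton, the remaining options can be added one at a time to see their effect on the generated file.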
which makes a cross frequency table of the variables gender and indigenous as follows:
            indigenous
female      No       Yes      Total
No          44.4%    5.8%     50.2%
Yes         45.7%    4.1%     49.8%
Total       90.1%    9.9%     100.0%
The using crosstable.tex defines the file to be written; the replace option allows you to overwrite
an earlier version of the file. style(tex) indicates that you want to have a LaTeX file, and cl1(2-4)
adds a horizontal line between columns 2 and 4 right after the variable indigenous. c(cell) indicates
that you want to have the percentage by cell; alternatives are freq for frequencies and col and row
for percentages according to the column and row respectively. The f(1p) option indicates the format
of the table; here the p stands for percentage. Simply f(1) would create a number with one decimal.
The h3(nil) option prevents a %-sign from being shown at the top of each column (N in case of frequencies)
and finally font(bold) makes some parts of the table bold. You could also include some pretable and
posttable LaTeX code, but this must be done with external files. Therefore I prefer to put this code
directly in my LaTeX file, so that the import becomes:
\begin{tabular}{llll}
\input{crosstable}\endrule
\end{tabular}
9 IMPORTING DATA IN OTHER FORMATS
10 STATA IS NOT ENOUGH? ADD USER-WRITTEN MODULES TO ENHANCE STATA
Command                          Description
ssc new                          Displays the newest user-written commands available on SSC.
ssc hot                          Displays the most downloaded modules.
ssc uninstall mypackage          Uninstalls the package named mypackage.
ssc install mypackage, replace   Updates the package mypackage.
ssc describe mypackage           Displays the description of the package.
Hint 17. To check which version of a command you have installed on your machine, you can simply
type
which packagename
and Stata will show you the version of the module packagename
Very useful user-written commands include: ivreg2, estout, tabout. Many of the user-written commands
do not necessarily help you to estimate very complicated models, but they might be very useful
for some basic tasks like converting data to other formats, exporting nice-looking tables, performing
some basic statistical tests, etc.
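A typical session for one of these packages is sketched below. It requires an internet connection, and estout is used only as an example:

```stata
ssc describe estout            // read the description before installing
ssc install estout, replace    // install, or update if already installed
which estout                   // show where it is installed and which version
```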
[13] http://ideas.repec.org/s/boc/bocode.html
11 PROGRAMMING YOUR OWN COMMAND
7: tokenize `varlist'
8: local depvar `1'
9: macro shift
10: local xvars `*'
11: // BOOTSTRAP
12: if(`bootstrap' & `bootstrap'>0){
14: end
The first line drops the program (if it already exists) from memory. This is needed to enable
us to define it again. The word capture is included to avoid an execution stop in case of an error,
typically when the program does not exist yet. The second line then defines the program with the
name iop, which must be equal to the name of the ado-file (see later on). The option after the comma
refers to the class of the routine; r-class is generally a good option. Line 3 tells Stata that the
ado-file should work from version 8.2 on, hence trying to run it on an earlier version causes a problem.
The fourth line is the most important for the moment, since we declare the syntax; this is basically
the same as in the help files. After the word syntax you write varlist if you want to enable the user
to provide a varlist. You can also offer the user the if and in qualifiers. If
you do so, you have to take this into account later in the code, since Stata does not automatically
limit the routine to the restrictions given by the user! After the comma you can include all the options. In this
case all options are non-mandatory, since the opening bracket is before the comma. The capital letters
indicate the minimum number of letters the user has to write; for instance the option bootstrap will
be understood by Stata whether the user writes boot or bootst or bootstrap, but not if they write
boo. Two types of options are available: with and without arguments. Those without arguments
simply generate a local containing the complete name of the option when the option is chosen.
Options with arguments save the arguments in a local macro with the name of the option. You always have to
indicate the nature of an argument, for instance str for a string or varlist for a varlist.
Line 5 then converts the if and in conditions into a temporary variable that I call touse here. In any
routine you use afterwards you will have to indicate
commandname varlist if `touse'
in order to limit your routine to the sample. An alternative is to keep only the sample you need:
marksample touse
preserve //saves the current state of the DB for later
keep if `touse'
[ALL YOUR PROGRAM]
restore //restores the database as it was before the preserve command
The remaining lines of the code are then more or less standard. You can use all the routines available
in Stata. The program ends with the command end (line 14).
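To see these elements working together, here is a complete but deliberately tiny program. It is not the iop routine discussed above: the name mymean and its behaviour are hypothetical and were chosen only to show syntax, marksample and the return of results. The auto dataset shipped with Stata is used to call it:

```stata
capture program drop mymean
program define mymean, rclass
    version 8.2
    // one numeric variable, optional if/in, optional option with argument
    syntax varlist(max=1 numeric) [if] [in] [, BOOTstrap(integer 0)]
    marksample touse                      // 1 for observations in the sample
    quietly summarize `varlist' if `touse'
    display "Mean of `varlist': " r(mean)
    if (`bootstrap' > 0) {
        // placeholder only: no bootstrap is actually performed here
        display "Requested bootstrap replications: `bootstrap'"
    }
    return scalar mean = r(mean)          // make results available in r()
    return scalar N    = r(N)
end

sysuse auto, clear
mymean price if foreign == 1, boot(50)
return list
```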
which splits up the text income educ exper gender into 4 elements stored in local variables called 1,
2, 3 and 4. For instance, you can now change the order:
display "`3' `2' `4'"
11.2 Useful commands for programming 11 PROGRAMMING YOUR OWN COMMAND
would produce exper educ gender. Additionally, the local named simply * contains the whole initial
string:
di "`*'"
produces income educ exper gender. The interesting feature of * is its use with macro shift.
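The interplay of tokenize, the numbered locals and macro shift can be tried directly. The local names depvar and xvars follow the program fragment in section 11:

```stata
tokenize "income educ exper gender"
display "`3' `2' `4'"        // selected elements in a new order
display "`*'"                // the complete string

local depvar `1'             // first element
macro shift                  // drops element 1, shifts the others down
local xvars `*'              // everything that is left
display "Dependent variable: `depvar'"
display "Regressors:         `xvars'"
```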
global variables), you can easily extract this information from the data and use it afterwards. For a
discussion on the labels and how to create them, see section 5.1.5.
Extracting the variable label
Let us start by extracting the variable label from a variable female. Simply type
local mylocal:variable label female
to save the variable label of the variable female as a local called mylocal.[14] To use the extracted text,
you can simply include the local, for instance by typing:
display "The variable label is:`mylocal'"
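The same mechanism also gives access to value labels. The following sketch uses the variable foreign of the auto dataset shipped with Stata instead of female. The local names are arbitrary:

```stata
sysuse auto, clear
local varlab : variable label foreign    // the variable label
local vallab : value label foreign       // name of the attached value label
local text1  : label (foreign) 1         // text assigned to the value 1
display "Variable label:   `varlab'"
display "Value label name: `vallab'"
display "Label of value 1: `text1'"
```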
[14] Of course, you can use any name you like for your local.
12 Simulation
The discussion of simulation in this document is limited to the discussion of generating random
variables.
The easiest random variable to create is definitely the uniform variable, which can be generated
with the generate command:
gen x=uniform()
which gives a variable x ~ U[0,1]. To generate a U[10,20] variable instead, you only need some basic algebra:
gen x=10 + uniform()*(20-10)
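In an empty dataset you first have to fix the number of observations. The sketch below does so and uses runiform(), the newer name of uniform(). The number of observations and the seed are arbitrary choices:

```stata
clear
set obs 1000                           // number of observations to create
set seed 2024                          // makes the draws reproducible
gen u = runiform()                     // U[0,1]
gen x = 10 + runiform()*(20-10)        // U[10,20]
summarize u x
```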
To generate normal variables, a dedicated command is available in addition to the random-number function used with
generate. I suggest the use of drawnorm:
drawnorm x1 x2
will create two i.i.d. variables with standard normal distribution.
drawnorm x1 x2, means(1 2) sds(4 3)
will generate two independent random variables with means 1 and 2 and standard deviations 4 and 3:
x1 ~ N(1, 4^2)
x2 ~ N(2, 3^2)
One can go further by simulating random variables from a higher-dimensional normal distribution.
To do this, you will need to indicate either the covariance or the correlation matrix. The following
example uses the correlation matrix:
The first line defines the correlation matrix (see section 3.9.1) and the second line performs the random
draw. Note that by default as many observations as the dataset has will be drawn. This can be
changed with the option n(#).
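A two-line example of this kind can be sketched as follows. The matrix name C, the correlation of 0.5, the number of observations and the seed are arbitrary choices made for illustration:

```stata
clear
set seed 12345
matrix C = (1, 0.5 \ 0.5, 1)            // correlation matrix
drawnorm x1 x2, n(1000) corr(C)         // draw 1000 correlated observations
correlate x1 x2                         // sample correlation close to 0.5
```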
Like the uniform distribution in section 12.1, random variables can be drawn from many different
distributions. Here is a short overview:
12 SIMULATION 12.4 Setting the random seed
Function                 Distribution
runiform()               Uniform distribution (same as uniform())
rbeta(a,b)               Beta distribution
rbinomial(n,p)           Binomial distribution
rchi2(df)                χ² distribution
rgamma(a,b)              Gamma distribution
rhypergeometric(N,K,n)   Hypergeometric distribution
rnbinomial(n,p)          Negative binomial distribution
rnormal()                Standard normal distribution (drawnorm preferable)
rnormal(m,s)             Normal distribution (drawnorm preferable)
rpoisson(m)              Poisson distribution
rt(df)                   Student's t distribution
To generate such a variable, you can simply use the generate command:
gen myvar=rpoisson(4)
generates a variable called myvar drawn from a Poisson distribution with mean parameter 4.
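Several of these functions can be combined in one small simulation. In the sketch below the variable names, the parameters and the seed are arbitrary. Setting the seed first makes the results reproducible:

```stata
clear
set obs 500
set seed 42
gen k = rpoisson(4)            // Poisson with mean 4
gen z = rnormal(1, 4)          // normal with mean 1 and standard deviation 4
gen b = rbinomial(10, 0.3)     // binomial with 10 trials and p = 0.3
summarize k z b
```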
References
Ai, Chunrong and Edward C. Norton, "Interaction terms in logit and probit models," Economics
Letters, 2003, 80 (1), pp. 123–129.
Dahl, G. B., "Mobility and the Returns to Education: Testing a Roy Model with Multiple Markets,"
Econometrica, 2003, 70, 2367–2420.
Dubin, J.A. and D.L. McFadden, "An Econometric Analysis of Residential Electric Appliance
Holdings and Consumption," Econometrica, 1984, 52, 345–362.
Greene, William H., Econometric Analysis, 6 ed., Prentice Hall, Upper Saddle River, New Jersey,
2008.
Heckman, J., "The common structure of statistical models of truncation, sample selection, and
limited dependent variables and a simple estimator for such models," Annals of Economic and
Social Measurement, 1976, 5, 475–492.
Lee, L.F., "Generalized Econometric Models with Selectivity," Econometrica, 1983, 51, 507–512.