Teach Yourself Stata

Getting Started in Data Analysis
using Stata
(ver. 4.6)
Oscar Torres-Reyna
Data Consultant
[email protected]
http://dss.princeton.edu/training/
PU/DSS/OTR
List of topics
What is Stata?
Stata screen and general description
First steps (log, memory and directory)
From SPSS/SAS to Stata
Example of a dataset in Excel
From Excel to Stata (copy-and-paste, *.csv)
Saving the dataset
Describe and summarize (command, menu)
Rename and label variables (command, menu)
Creating new variables (generate)
Recoding variables (recode)
Recoding variables using egen
Changing values (replace)
Extracting characters from regular expressions
Value labels using the menu
Indexing (using _n and _N)
  - Creating ids and ids by categories
  - Lags and forward values
  - Countdown and specific values
Sorting
Deleting variables (drop)
Merge
Append
Merging fuzzy text (reclink)
Frequently used Stata commands
Exploring data:
  - Frequencies (tab, table)
  - Crosstabulations (with test for associations)
  - Descriptive statistics (tabstat)
Examples of frequencies and crosstabulations
Creating dummies
Graphs
  - Scatterplot
  - Histograms
  - Catplot (for categorical data)
  - Bars (graphing mean values)
Regression:
  - Overview and basic setting
  - Correlation matrix
  - Output interpretation (what to look for)
  - Graph matrix
  - Saving regression coefficients
  - F-test
  - Testing for linearity
  - Testing for normality
  - Testing for homoskedasticity
  - Testing for omitted-variable bias
  - Testing for multicollinearity
  - Robust standard errors
  - Specification error
  - Outliers
  - Summary of influence indicators
  - Summary of distance measures
  - Interaction terms
  - Publishing regression table (outreg2)
Useful sites (links only)
  - Is my model OK?
  - I can't read the output of my model!!!
  - Topics in Statistics
  - Recommended books
What is Stata?
It is a multi-purpose statistical package to help you explore, summarize and analyze datasets.
A dataset is a collection of several pieces of information called variables (usually arranged by columns). A variable can have one or several values (information for one or several cases).
Other statistical packages are SPSS, SAS and R.
Stata is widely used in social science research and is the most used statistical software on campus.
Features | Stata | SPSS | SAS | R
Learning curve | Steep/gradual | Gradual/flat | Pretty steep | Pretty steep
User interface | Programming/point-and-click | Mostly point-and-click | Programming | Programming
Data manipulation | Very strong | Moderate | Very strong | Very strong
Data analysis | Powerful | Powerful | Powerful/versatile | Powerful/versatile
Graphics | Very good | Very good | Good | Good
Cost | Affordable (perpetual licenses, renew only when upgrading) | Expensive (but no need to renew until upgrade, long-term licenses) | Expensive (yearly renewal) | Open source
This is the Stata screen
and here is a brief description
First steps
Three basic procedures you may want to do first:
- Set your working directory (see next slide for more info).
- Create a log file (sort of Stata's built-in tape recorder, where you can retrieve the output of your work):
  - Using the menu, go to File > Log > Begin (see the next slide for more info).
  - In the command line type log using mylog.log. This will create a file called mylog.log in your directory, which you can read using any word processor.
- Set the correct memory allocation for your data. Some datasets need more memory; depending on the size you can type set mem 700m to open the big ones. (Stata 12 and later allocate memory automatically, so this step is only needed in older versions.)
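The logging step can be sketched as a few lines you could paste into a do-file. This is only an illustration: the file name mylog.log comes from the text above, while the options text and replace are assumptions about how you may want the log to be written.

```stata
* Close any log that may already be open (ignored if there is none)
capture log close

* Start a plain-text log in the current working directory
log using mylog.log, text replace

pwd                     // show the current working directory
display "Everything typed from here on is recorded"

* Stop recording; mylog.log can now be opened in any text editor
log close
```

Using replace overwrites an existing mylog.log; use append instead if you prefer to keep adding to the same file.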
First steps: graphic view

Three basic procedures you may want to do first: create a log file (sort of Stata's built-in tape recorder, where you can retrieve the output of your work), set your working directory, and set the correct memory allocation for your data.
Click on "Save as type:" right below "File name:" and select Log (*.log). This will create the file called Log1.log (or whatever name you want with extension *.log), which can be read by any word processor or by Stata (go to File > Log > View). If you save it as *.smcl (Formatted Log) only Stata can read it. It is recommended to save the log file as *.log.
The log file will record everything you type, including the output.
Stata shows your current working directory. You can change it by typing
cd c:\mydirectory
When dealing with really big datasets you may want to increase the memory:
set mem 700m  /* You type this in the command window */
To estimate the size of the file you can use the formula:
Size (in bytes) = (8 * Number of cases or rows * (Number of variables + 8))
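As a quick worked example of the formula above, the sketch below plugs in a hypothetical dataset of 100,000 cases and 50 variables (both numbers are made up for illustration):

```stata
* Hypothetical dataset: 100,000 cases (rows) and 50 variables
scalar ncases = 100000
scalar nvars  = 50

* Size (in bytes) = 8 * cases * (variables + 8)
scalar size_bytes = 8 * ncases * (nvars + 8)

display "Approximate size: " size_bytes " bytes"
display "Approximate size: " size_bytes/1e6 " million bytes"
```

For these values the formula gives 46,400,000 bytes, so an allocation of 700m would be more than enough.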
From SPSS/SAS to Stata
You can use the command usespss to read SPSS files in Stata or the command usesas to read SAS files. If you have a file in SAS XPORT format you can use fdause (or go to File > Import). For SPSS, you may need to install the command by typing
ssc install usespss
Once installed just type
usespss using c:\mydata.sav
Type help usespss for more details.
There is a similar command for SAS (usesas); just repeat the previous steps.
For ASCII data please see the Stata module at
http://dss.princeton.edu/training/
Example of a dataset in Excel.

Variables are arranged by columns and cases by rows. Each variable has more than one value
Path to the file: http://www.princeton.edu/~otorres/Stata/Students.xls

Excel to Stata (copy-and-paste)
1 - To go from Excel to Stata you simply copy and paste data into Stata's Data Editor, which you can open by clicking on the Data Editor icon.
2 - This window will open; it is the data editor.
3 - Press Ctrl+V to paste the data.
Saving the dataset
1 - Close the data editor by pressing the "X" button on the upper-right corner of the editor.
NOTE: You need to close the data editor or data browser to continue working.
2 - The "Variables" window will show all the variables in your data.
3 - Do not forget to save the file; in the command window type save students, replace
You can also use the menu: go to File > Save As.
4 - This is what you will see in the output window; the data has been saved as students.dta
Excel to Stata (import *.csv)

You can also save the Excel file as *.csv (comma-separated values) and import it in Stata. In Excel go to File > Save As.
You may get the following messages; click OK and YES.
In Stata go to File > Import > ASCII data created by spreadsheet. Click on Browse to find the file and then OK.
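The same import can be done from the command line. The sketch below is self-contained: it first writes a tiny, made-up CSV file (students_demo.csv is a hypothetical name, not the Students.xls file used in these slides) and then reads it back. It assumes Stata 13 or later, where import delimited is available; in older versions the equivalent command is insheet using students_demo.csv, clear.

```stata
* Write a small hypothetical CSV file so the example runs on its own
tempname fh
file open `fh' using "students_demo.csv", write text replace
file write `fh' "id,last,first,age" _n
file write `fh' "1,Doe,Jane,19" _n
file write `fh' "2,Smith,John,23" _n
file close `fh'

* Import the CSV, taking variable names from the first row
import delimited using "students_demo.csv", varnames(1) clear

describe
save students_demo, replace
```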
Commands: describe and summarize
To start exploring the data we'll use two commands: describe and summarize
1 - If you type describe in the command window you will get a general description of the data (press enter after you type).
Type help describe in the command window for more information.
2 - Type summarize to get some basic statistics.
In the output, zero observations (Obs = 0) indicate string variables.
Type help summarize in the command window for more information.
Menu options for describe and summarize
If you want to use the menu:
To describe the data go to Data > Describe data > Describe data in memory and press OK.
To summarize the data go to Data > Describe data > Summary statistics and press OK.
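To try both commands without the students file, here is a minimal sketch that assumes the auto dataset shipped with Stata as a stand-in (it is not the data used in these slides):

```stata
* Load the example dataset that ships with Stata (stand-in for the students data)
sysuse auto, clear

* General description: variable names, storage types and labels
describe

* Basic statistics for all variables; the string variable make shows 0 observations
summarize

* More detail (percentiles, skewness, kurtosis) for selected variables
summarize price mpg, detail
```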
Exploring data: frequencies
Frequency refers to the number of times a value is repeated. Frequencies are used to analyze categorical data. The tables below are frequency tables; values are in ascending order. In Stata use the command tab (type help tab for more details).
"Freq." provides a raw count of each value. In this case 10 students for each major.
"Percent" gives the relative frequency for each value. For example, 33.33% of the students in this group are econ majors.
"Cum." is the cumulative frequency in ascending order of the values. For example, 66.67% of the students are econ or math majors.
"Freq." Here 6 students read the newspaper 3 days a week, 9 students read it 5 days a week.
"Percent." Those who read the newspaper 3 days a week represent 20% of the sample; 30% of the students in the sample read the newspaper 5 days a week.
"Cum." 66.67% of the students read the newspaper 3 to 5 days a week.
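A minimal sketch of a frequency table, again assuming the auto dataset that ships with Stata in place of the students data (rep78 is its repair-record variable):

```stata
sysuse auto, clear

* Frequency table: Freq., Percent and Cum. for each value of rep78
tab rep78

* Same table, but counting missing values as their own category
tab rep78, missing
```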
Exploring data: frequencies (using table)
table is another command to produce frequencies and statistics. For more info and a list of all statistics type help table. Here are some examples.
The mean age of females is 23 years, for males is 27.
The mean score is 78 for females and 82 for males.
Exploring data: crosstabs
Also known as contingency tables, crosstabs help you analyze the relationship between two or more variables (mostly categorical). Below is a crosstab between the variables ecostatu and gender. We use the command tab (but with two variables to make the crosstab).
Options col and row give you the column and row percentages.
The first value in a cell tells you the number of observations for each xtab. In this case, 90 respondents are male and said that the economy is doing very well; 59 are female and believe the economy is doing very well.
The second value in a cell gives you row percentages for the first variable in the xtab. Out of those who think the economy is doing very well, 60.40% are males and 39.60% are females.
The third value in a cell gives you column percentages for the second variable in the xtab. Among males, 14.33% think the economy is doing very well while 7.92% of females have the same opinion.
You can use tab1 for multiple frequencies or tab2 to run all possible crosstab combinations. Type help tab for further details.
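The commands described above can be sketched as follows, assuming the auto dataset that ships with Stata instead of the survey data shown in the slide (rep78 and foreign are variables in that dataset):

```stata
sysuse auto, clear

* Crosstab with counts, row percentages and column percentages in each cell
tab rep78 foreign, row col

* One-way frequency tables for several variables at once
tab1 rep78 foreign

* All possible two-way crosstabs among the listed variables
tab2 rep78 foreign
```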
Exploring data: crosstabs (a closer look)
You can use crosstabs to compare responses among categories in relation to aggregate responses. In the table below we will see whether males and females have opinions similar to the national aggregate.
As a rule of thumb, a margin of error of 4 percentage points can be used to indicate a significant difference (some use 3).
For example, rounding up the percentages, 11% (10.85) answer "very well" at the national level. With the margin of error, this gives a range roughly between 7% and 15%; anything beyond this range could be considered significantly different (remember this is just an approximation). There does not appear to be a significant bias between males and females for this answer.
In the "fairly well" category we have 49%, with a range between 45% and 53%. The response for males is 54% and for females 45%. We could say here that males tend to be a bit more optimistic on the economy and females tend to be a bit less optimistic.
If we aggregate responses, we could get a better picture. In the table below 68% of males believe the economy is doing well (compared to 60% at the national level), while 46% of females think the economy is bad (compared to 39% aggregate). Males seem to be more optimistic than females.
recode ecostatu (1 2 = 1 "Well") (3 4 = 2 "Bad") (5 6=3 "Not sure/ref"), gen(ecostatu1) label(eco)

Exploring data: crosstabs (test for associations)
To see whether there is a relationship between two variables you can choose a number of tests. Some apply to nominal variables, some others to ordinal. I am running all of them here for presentation purposes. The output reports:
- Likelihood-ratio chi-square
- Pearson chi-square
- Goodman & Kruskal's gamma
- Cramer's V
- Fisher's exact test
- Kendall's tau-b
For nominal data use chi2, lrchi2, V.
For ordinal data use gamma and taub.
Use exact instead of chi2 when frequencies are less than 5 across the table.
Chi-square tests for relationships between variables. The null hypothesis (Ho) is that there is no relationship. To reject this we need a Pr < 0.05 (at 95% confidence). Here both chi2 are significant. Therefore we conclude that there is some relationship between perceptions of the economy and gender.
Cramer's V is a measure of association between two nominal variables. It goes from 0 to 1, where 1 indicates strong association (for r x c tables). In 2x2 tables, the range is -1 to 1. Here the V is 0.15, which shows a small association.
Gamma and tau-b are measures of association between two ordinal variables (both have to be in the same direction, i.e. negative to positive, low to high). Both go from -1 to 1. A negative value shows an inverse relationship; a value closer to 1 in absolute terms shows a strong relationship. Gamma is recommended when there are lots of ties in the data. Tau-b is recommended for square tables.
Fisher's exact test is used when there are very few cases in the cells (usually less than 5). It tests the relationship between two variables. The null is that the variables are independent. Here we reject the null and conclude that there is some kind of relationship between the variables.
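All of these statistics are options of tab. A minimal sketch, assuming the auto dataset that ships with Stata rather than the survey data discussed above:

```stata
sysuse auto, clear

* Request every test of association at once (for presentation purposes only;
* normally you would pick the ones that match the measurement level of the data)
tab rep78 foreign, chi2 lrchi2 V exact gamma taub
```

In practice you would keep chi2, lrchi2 and V for nominal variables, gamma and taub for ordinal ones, and exact when cell counts are small.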
Exploring data: descriptive statistics
For continuous data we use descriptive statistics. These statistics are a collection of measurements of two things: location and variability. Location tells you the central value of your variables (the mean is the most common measure of this). Variability refers to the spread of the data from the center value (e.g. variance, standard deviation). Statistics is basically the study of what causes such variability. We use the command tabstat to get these stats.
The mean is the sum of the observations divided by the total number of observations.
The median (p50 in the table above) is the number in the middle. To get the median you have to order the data from lowest to highest. If the number of cases is odd the median is the single middle value; for an even number of cases the median is the average of the two numbers in the middle.
The standard deviation is the square root of the variance. It indicates how close the data are to the mean. Assuming a normal distribution, 68% of the values are within 1 sd from the mean, 95% within 2 sd and 99.7% within 3 sd.
The variance measures the dispersion of the data from the mean. It is the mean of the squared distances from the mean (Stata divides by n - 1 rather than n).
Count (N in the table) refers to the number of observations per variable.
Range is a measure of dispersion. It is the difference between the largest and smallest value, max - min.
Min is the lowest value in the variable.
Max is the largest value in the variable.
Exploring data: descriptive statistics
You could also estimate descriptive statistics by subgroups. For example, by gender below.
Type help tabstat for more options.
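A minimal sketch of both uses of tabstat, assuming the auto dataset that ships with Stata (its foreign variable plays the role that gender plays in the slide):

```stata
sysuse auto, clear

* Location and variability measures for a few continuous variables
tabstat price mpg weight, stats(mean median sd var count range min max)

* The same kind of table by subgroups
tabstat price mpg weight, stats(mean sd count) by(foreign)
```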

Examples of frequencies and crosstabulations

Frequencies (tab command)
In this sample we have 15 females and 15 males. Each represents 50% of the total cases.
Crosstabulations
Average SAT scores by gender and major
More examples of frequencies and crosstabulations
Average SAT scores by gender and major for graduate and undergraduate students
Renaming variables and adding variable labels

Renaming variables, type:
rename [old name] [new name]

rename var1 id
rename var2 country
rename var3 party
rename var4 imports
rename var5 exports

Adding/changing variable labels, type:
label variable [var name] "Text"

label variable id "Unique identifier"
label variable country "Country name"
label variable party "Political party in power"
label variable imports "Imports as % of GDP"
label variable exports "Exports as % of GDP"
Menu options for rename and label variable

Renaming variables using the menu
Add/change variable labels using the menu
Creating new variables

To generate a new variable use the command generate (gen for short), type
generate [newvar] = [expression]
generate score2 = score/100
generate readnews2 = readnews*4
(The slide shows the results for the first five students.)
You can use generate to create constant variables. For example:
generate x = 5
generate y = 4*15
generate z = y/x
You can also use generate with string variables. For example:
generate fullname = last + ", " + first
label variable fullname "Student full name"
browse id fullname last first
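The generate examples above can be tried on a tiny, made-up dataset. The two students and their values below are hypothetical; only the variable names follow the slide:

```stata
clear
* Two hypothetical students
input str10 last str10 first score readnews
"Doe"   "Jane" 78 5
"Smith" "John" 82 3
end

* Numeric expressions
generate score2    = score/100
generate readnews2 = readnews*4

* String expression: note the quotes around the separator
generate fullname = last + ", " + first
label variable fullname "Student full name"

list
```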
Recoding variables
1.- Recoding age into three groups.
2.- Use the recode command, type

recode age (18 19 = 1 "18 to 19") (20/29 = 2 "20 to 29") (30/39 = 3 "30 to 39") (else=.), generate(agegroups) label(agegroups)
3.- The new variable is called agegroups:
Recoding variables using egen

You can recode variables using the command egen and options cut/group.
egen [newvar] = cut(oldvar), at(break1, break2, break3, etc.)
Notice that the breaks show ranges. Below we type four breaks. The first range starts at 18 and ends before 20, the second starts at 20 and ends before 30, the third starts at 30 and ends before 40.
You could also use the option group, which specifies groups with equal frequency (you have to add value labels):
egen [newvar] = cut(oldvar), group(number of groups)
For more details and options type help egen
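A minimal sketch of both options on a made-up age variable (the six ages and the names agegroups2 and agegroups3 are hypothetical). With at(), each case receives the lower bound of the range it falls in; with group(), the groups are numbered from 0.

```stata
clear
* Six hypothetical ages
input age
18
19
25
29
33
39
end

* Four breaks define three ranges: [18,20), [20,30), [30,40)
egen agegroups2 = cut(age), at(18, 20, 30, 40)

* Three groups of (roughly) equal frequency, coded 0, 1, 2
egen agegroups3 = cut(age), group(3)

list
```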

Changing variable values (replace)

replace inc = . if inc==99
replace inc = . if inc>5
replace inc = 999 if inc==5
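The three commands can be tried in sequence on a made-up inc variable (the four values are hypothetical). Stata treats missing values as larger than any number, so the sketch adds !missing(inc) to the second condition; that addition is a precaution of this example, not part of the slide.

```stata
clear
* Hypothetical income codes
input inc
1
5
7
99
end

replace inc = . if inc == 99                  // 99 becomes missing
replace inc = . if inc > 5 & !missing(inc)    // values above 5 become missing
replace inc = 999 if inc == 5                 // 5 becomes 999

list
```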
Extracting characters from regular expressions
To remove strings from var1 below use the following command
gen var2=regexr(var1,"[.\}\)\*a-zA-Z]+","")
destring var2, replace
To extract strings from a combination of strings and numbers

gen var2=regexr(var1,"[.0-9]+","")
More info see: http://www.ats.ucla.edu/stat/stata/faq/regex.htm

Adding value labels using the menu
Value labels using the menu: step 1
Step 1. Defining the labels
This will appear in the results window
You could also type in the command window:

label define sex 1 Female 2 Male
NOTE: Defining labels is not the same as creating variables

Adding value labels using the menu
Value labels using the menu: step 2
Step 2. Assigning labels
This will appear in the results window
In the case of gender you can type

label values gender sex
Indexing: creating ids
Indexing is probably one of the most useful characteristics of Stata.

Using _n, you can create a unique identifier for each case in your data, type
generate idall = _n
Check the results in the data editor; idall is equal to id.
Using _N you can also create a variable with the total number of cases in your dataset:
generate total = _N
Check the results in the data editor.
Indexing: creating ids by categories

We can create an id by categories. For example, let's create an id by major.
First we have to sort the data by the variable on which we are basing the id (major in this case).
Then we use the command by to tell Stata that we are using major as the base variable (notice the colon).
Then we use browse to check the two variables.
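The steps above can be sketched on a made-up list of majors. The data and the names idmajor and nmajor are hypothetical; list is used instead of browse so the result appears in the results window.

```stata
clear
* Six hypothetical students and their majors
input str8 major
"Math"
"Econ"
"Math"
"Politics"
"Econ"
"Econ"
end

* Step 1: sort by the variable on which the id is based
sort major

* Step 2: within each major, number the cases (notice the colon)
by major: generate idmajor = _n

* Within by, _N is the number of cases in each major
by major: generate nmajor = _N

* Step 3: check the result
list major idmajor nmajor, sepby(major)
```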
Indexing: lag and forward values
You can create lagged values with _n. Let's rename idall as months (time variable) and create a lagged variable containing the value of the previous case, using the subscript [_n-1].
If you want to lag more than one period just change [_n-1] to [_n-2] for a lag of two periods, [_n-3] for three, etc.
A more advanced alternative to create lags uses the L operator within a time series setting (the tsset command must be specified first).
You can create forward values with _n, using the subscript [_n+1].
A more advanced alternative uses the F operator within a time series setting (the tsset command must be specified first).
NOTE: Notice the square brackets.
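A minimal sketch of both approaches on made-up data. The four monthly values and the new variable names are hypothetical; months plays the role of the time variable mentioned above.

```stata
clear
* Hypothetical monthly series
input months sat
1 1338
2 1440
3 1500
4 1390
end

* Lag and forward values using subscripts (notice the square brackets)
generate sat_lag1 = sat[_n-1]
generate sat_fwd1 = sat[_n+1]

* Same idea with time-series operators; tsset must come first
tsset months
generate sat_lag1_ts = L.sat
generate sat_fwd1_ts = F.sat

list
```

The subscript version depends on the current sort order of the data, while the L and F operators rely on the time variable declared with tsset.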
Indexing: countdown and specific values

Combining _n and _N you can create a countdown variable.
You can create a variable based on one value of another variable. For example, let's create a variable with the highest SAT value in the sample.
NOTE: You could get the same result without sorting by using egen and the max function.
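A minimal sketch of these ideas on made-up SAT scores. The values and variable names are hypothetical, and the countdown here runs from _N down to 1, which is one of several reasonable definitions.

```stata
clear
* Four hypothetical SAT scores
input sat
1338
1440
1500
1390
end

* Countdown: the first case gets _N, the last case gets 1
generate countdown = _N - _n + 1

* Highest SAT value: sort ascending, then pick the last case
sort sat
generate maxsat = sat[_N]

* Same result without sorting
egen maxsat2 = max(sat)

list
```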
Sorting
sort var1 var2
gsort is another command to sort data. The difference between gsort and sort is that with gsort you can sort in ascending or descending order, while with sort you can sort only in ascending order. Use +/- to indicate whether you want to sort in ascending/descending order. Here are some examples:
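A few sorting commands, sketched with the auto dataset that ships with Stata (an assumption; the slide uses different data):

```stata
sysuse auto, clear

* Ascending order by foreign, then by price within foreign
sort foreign price

* Descending order by price
gsort -price

* Ascending by foreign, descending by mpg within foreign
gsort +foreign -mpg

list make foreign mpg in 1/5
```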
Deleting variables
We have created lots of variables; now we need to do some clean up. Two commands can do this: drop and keep.
Notice the dash between total and readnews2; you can use this format to indicate a list so you do not have to type in the name of all the variables.
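A minimal sketch of drop, keep and the dash notation, assuming the auto dataset that ships with Stata (its variables are stored in the order make, price, mpg, rep78, headroom, trunk, weight, length, turn, displacement, gear_ratio, foreign):

```stata
sysuse auto, clear

* Drop variables one by one
drop headroom trunk

* Drop a range of adjacent variables: weight, length, turn, displacement
drop weight-displacement

* Or list the variables to keep; everything else is dropped
keep make price mpg foreign

describe
```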
Deleting cases (selectively)

The World Values Survey (http://www.worldvaluessurvey.org/) provides data for several countries and different years (waves). Let's say you want to use data for the United States only. Looking at the codebook, the variable for country/year is s025. A frequency of that variable (with and without labels) gives us the following:
tab s025
Click on -more- to continue
tab s025, nolabel
The option nolabel gives you the numeric codes for each country/year
Deleting cases (selectively) cont.

Frequencies make it difficult to determine the numeric codes for the United States. To find these out we use the command labelbook. Type:
labelbook s025
Click on -more- to continue
We want to keep data for United States (1999) only. The code is 8401999 (see above). To do this we type
drop if s025!=8401999  /* The operator != means "not equal" */
Verify by running the frequency for country/year again.
NOTE: you can drop cases with missing values by typing: drop if missing(var1, var2, var3)
Merge/Append
MERGE - You merge when you want to add more variables to an existing dataset (type help merge in the command window for more details).
What you need:
- Both files must be in Stata format
- Both files should have at least one variable in common (id)
Step 1. You need to sort the data by the id or ids common to both files you want to merge. For both datasets type:
sort [id1] [id2]
save [datafile name], replace
Step 2. Open the master data (main dataset you want to add more variables to, for example data1.dta) and type:
merge [id1] [id2] using [i.e. data2.dta]
For example, opening a hypothetical data1.dta we type
merge lastname firstname using data2.dta
To verify the merge type
tab _merge
Here are the codes for _merge:
_merge==1  obs. from the master data only
_merge==2  obs. from the using data only
_merge==3  obs. found in both the master and the using data
If you want to keep the observations common to both datasets you can drop the rest by typing:
drop if _merge!=3  /* This will drop observations where _merge is not equal to 3 */
APPEND - You append when you want to add more cases (more rows to your data; type help append for more details).
Open the master file (i.e. data1.dta) and type:
append using [i.e. data2.dta]
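The sketch below runs a merge and an append on two tiny, made-up datasets. It assumes Stata 11 or later and uses the newer merge 1:1 syntax, which does not require sorting beforehand; the syntax shown in the steps above is the older form. All data values and the temporary file name are hypothetical.

```stata
* Using data: ids 1, 2 and 4
clear
input id str8 party
1 "A"
2 "B"
4 "C"
end
tempfile using_data
save `using_data'

* Master data: ids 1, 2 and 3
clear
input id imports
1 10
2 20
3 30
end

* Add the variables from the using data, matching on id
merge 1:1 id using `using_data'
tab _merge

* Keep only cases found in both files
count if _merge == 3
scalar nmatched = r(N)
keep if _merge == 3
drop _merge

* Append adds rows (cases) instead of columns (variables)
append using `using_data'
list
```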
Merging fuzzy text (reclink)
RECLINK - Matching fuzzy text. Reclink stands for "record linkage". It is a program written by Michael Blasnik to merge imperfect string variables. For example:
Data1: Princeton University
Data2: PrincetonU
Reclink helps you to merge the two databases by using a matching algorithm for these types of variables. Since it is a user-created program, you may need to install it by typing ssc install reclink. Once installed you can type help reclink for details.
As in merge, the merging variables must have the same name: state, university, city, name, etc. Both the master and the using files should have an id variable identifying each observation.
Note: the names of the ids must be different, for example id1 (id master) and id2 (id using). Sort both files by the matching (merging) variables. The basic syntax is:
reclink var1 var2 var3 using myusingdata, gen(myscore) idm(id1) idu(id2)
The variable myscore indicates the strength of the match; a perfect match will have a score of 1. Description (from the reclink help pages):
"reclink uses record linkage methods to match observations between two datasets where no perfect key fields exist - essentially a fuzzy merge. reclink allows for user-defined matching and non-matching weights for each variable and employs a bigram string comparator to assess imperfect string matches.
The master and using datasets must each have a variable that uniquely identifies observations. Two new variables are created, one to hold the matching score (scaled 0-1) and one for the merge variable. In addition, all of the matching variables from the using dataset are brought into the master dataset (with newly prefixed names) to allow for manual review of matches."
Graphs: scatterplot
Scatterplots are good to explore possible relationships or patterns between variables. Let's see if there is some relationship between age and SAT scores. For many more bells and whistles type help scatter in the command window.
twoway scatter age sat
twoway scatter age sat, mlabel(last)
twoway scatter age sat, mlabel(last) || lfit age sat
twoway scatter age sat, mlabel(last) || lfit age sat, yline(30) xline(1800)
Graphs: scatterplot
By categories
twoway scatter age sat, mlabel(last) by(major, total)
Graphs: histogram
Histograms are another good way to visually explore data, especially to check for a normal distribution; here are some examples (type help histogram in the command window for further details):
histogram age, frequency
histogram age, frequency normal
Graphs: catplot
Catplot is used to graph categorical data. Since it is a user-defined program you may have to install it by typing: ssc install catplot
Now, type
tab agegroups major, col row cell
catplot bar major agegroups, blabel(bar)
Note: Numbers correspond to the frequencies in the table.

Graphs: catplot
catplot bar major agegroups, percent(agegroups) blabel(bar)
catplot hbar agegroups major, percent(major) blabel(bar)
Graphs: catplot
Raw counts by major and sex
Percentages by major and sex
catplot hbar major agegroups, percent(major sex) blabel(bar) by(sex)
Graphs: means
Stata can also help to visually present
summaries of data. If you do not want to
type you can go to graphics in the menu.
graph hbar (mean) age (mean) averagescoregrade, blabel(bar) by(, title(gender and major)) by(gender major, total)
graph hbar (mean) age averagescoregrade newspaperreadershiptimeswk, over(gender) over(studentstatus, label(labsize(small))) blabel(bar) title(Student indicators) legend(label(1 "Age") label(2 "Score") label(3 "Newsp read"))
Regression: a practical approach (intro)
In this section we will explore some basics of regression analysis.
We will run a multivariate regression and some diagnostics:
General setting and output interpretation (what to look for)
Normality
Linearity/functional form
Homoskedasticity/heteroskedasticity
Robust standard errors
Omitted variable bias/specification error
Outliers
F-test
Interaction terms
The main references/sources for this section are:

Stock, James and Mark Watson, Introduction to Econometrics, 2003
Hamilton, Lawrence, Statistics with Stata (updated for version 9), 2006
The UCLA online tutorial http://www.ats.ucla.edu/stat/stata/
Regression: a practical approach (overview)
We use regression to estimate the unknown effect of changing one variable over another (Stock and Watson, 2003, ch. 4).
When we run a regression we assume a linear relationship between two variables (i.e. X and Y). Technically, it estimates how much Y changes when X changes one unit.
In Stata we use the command regress, type:
regress [dependent variable] [independent variable(s)]
regress y x
In a multivariate setting we type:
regress y x1 x2 x3
Before running a regression it is recommended to have a clear idea of what you
are trying to estimate (i.e. which are your dependent and independent
variables).
A regression makes sense only if there is a sound theory behind it.
Regression: a practical approach (overview) cont.
Data and examples for this section come from the book Statistics with Stata (updated for version 9) by Lawrence C. Hamilton (chapter 6). Download the data or search for it at http://www.duxbury.com/highered/. Use the file states.dta (educational data for the U.S.).
Regression: a practical approach (setting)
Starting question: Are SAT scores higher in states that spend more money on education, controlling for other factors?
Dependent (or predicted, Y) variable: SAT scores, variable csat in the dataset.
Independent (or predictor, X) variable(s): Expenditures on education, variable expense in the dataset. Other variables: percent, income, high, college.
Here is a general description of the variables in the model
This is a correlation matrix for all

variables in the model. Numbers
are Pearson correlation
coefficients, go from -1 to 1.
Closer to 1 means strong
correlation. A negative value
indicates an inverse relationship
(roughly, when one goes up the
other goes down).
Regression: graph matrix
Before running a regression it is always recommended to graph the dependent and independent variables to explore their relationship. The command graph matrix produces a series of scatterplots for all variables. Type:
graph matrix expense percent income high college csat
Regression: graph matrix
Here is another option for the graph.
graph matrix csat expense percent income high college, half maxis(ylabel(none) xlabel(none))
Regression: what to look for
Let's run the regression:
regress csat expense percent income high college, robust
The option robust requests robust standard errors (to control for heteroskedasticity). csat is the dependent variable (Y); expense, percent, income, high and college are the independent variables (X).
The estimated equation is:
csat = 851.56 + 0.003*expense - 2.62*percent + 0.11*income + 1.63*high + 2.03*college
Prob > F is the p-value of the model. It indicates the reliability of X to predict Y. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between X and Y.
R-squared shows the amount of variance of Y explained by X. In this case the model explains 82.43% of the variance in SAT scores.
Adj R2 (not shown here) shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adj R2 is closer to R2. This provides a more honest association between X and Y.
The t-values test the null hypothesis that each coefficient is equal to 0. To reject this, you need a t-value greater than 1.96 in absolute value (at the 0.05 significance level). You can get the t-values by dividing the coefficient by its standard error. The t-values also show the importance of a variable in the model. In this case, percent is the most important.
Two-tail p-values test the same hypothesis for each coefficient. To reject it, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense, income, and college are not statistically significant in explaining SAT; high is almost significant at 0.10. Percent is the only variable that has some significant impact on SAT (its coefficient is different from 0).
Regression: exploring relationships
Given the previous results, we need to do some adjustments since only one variable was significant. Let's explore further the relationship between csat and percent, and between csat and high.
scatter csat percent
scatter csat high
There seems to be a curvilinear relationship between csat and percent, and a slightly linear one between csat and high. Whenever we find polynomial relationships (curves) we need to add a squared (or some other higher power) version of the variable; in this case percent squared will suffice.
generate percent2 = percent^2
Now the model will look like this
regress csat percent percent2 high
Regression: functional form/linearity
As a footnote, another graphical way to explore a possible linear relationship between variables, or to detect nonlinearity in order to define a functional form, is by using the command acprplot (augmented component-plus-residual plot). Right after running the regression:
regress csat percent high /* Notice we do not include percent2 */
acprplot percent, lowess
acprplot high, lowess
The option lowess (locally weighted scatterplot smoothing) draws the observed pattern in the data to help identify nonlinearities. Percent shows a quadratic relation, so it makes sense to add a squared version of it. High shows a polynomial pattern as well but stays around the regression line (except on the right). We could keep it as is for now.
For more details see http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm, and/or type help acprplot and help lowess.

Regression: F-test
Before we continue let's take another look at the original regression and run some individual tests on its coefficients.
We have two types of tests with the regression model: the F-test, which tests the overall fit of the model (all coefficients different from 0), and the t-test (individual coefficients different from 0). You can customize your tests to check for other possible situations, like two coefficients jointly different from 0. In the regression, two variables related to educational attainment were not significant. We could, however, test whether these two have no effect on SAT scores (see Hamilton, 2006, p.175). Let's run the original regression again:
quietly regress csat expense percent income high college
Note: quietly suppresses the regression output.
To test the null hypothesis that both coefficients do not have any effect on csat, type:
test high college
The p-value is 0.0451, under the usual 0.05 threshold (95% confidence), so we conclude that the two variables jointly have some effect on SAT. In a way, this is saying that both have a similar effect or are measuring the same thing (which could suggest multicollinearity). We could keep high since it was borderline significant.
Some other possible tests are (see Hamilton, 2006, p.176):
test income = 1
test high = college
test income = (high + college)/100
Note: Not to be confused with ttest. Type help test and help ttest for more details.
Regression: output
Let's try the new model. It now has a higher R-squared (0.92) and all the variables are significant.
The new equation is:
csat = 844.82 - 6.52*percent + 0.05*percent2 + 2.98*high
Percent's coefficient is -6.52. So, if percent increases by one unit, csat will decrease by 6.52 units. With a statistically significant p-value of 0.000 (which means that -6.52 is statistically different from 0), percent has an important impact on csat controlling for other variables (holding them constant). You could read percent2 (which explains the upward effect) the same way. The net effect of percent is the difference between both coefficients (which is still negative).
High's coefficient is 2.98. So, if high increases by one unit, csat will increase by 2.98 units.
The constant 844.82 means that if all variables are 0, the average csat score would be 844.82. It is where the regression line crosses the Y axis.
Regression: saving regression coefficients / getting predicted values
Stata temporarily stores the coefficients as _b[varname]. You can save the coefficients as variables by typing:
gen percent_coeff = _b[percent]
gen percent2_coeff = _b[percent2]
gen high_coeff = _b[high]
gen constant_coeff = _b[_cons]
How good the model is will depend on how well it predicts Y and on the validity of the tests.
There are two ways to generate the predicted values of Y (usually called Yhat) given the model:
Option A, using generate after running the regression:
generate csat_predict = _b[_cons] + _b[percent]*percent + _b[percent2]*percent2 + _b[high]*high
Option B, using predict immediately after running the regression:

predict csat_predict
label variable csat_predict "csat predicted"
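Options A and B should give the same numbers. The sketch below checks that, assuming the states dataset is in memory and percent2 already exists; yhat_manual and yhat_predict are hypothetical names chosen so they do not clash with csat_predict.

```stata
* Fit the model (quietly hides the output)
quietly regress csat percent percent2 high
* Option A: build Yhat by hand from the stored coefficients
generate yhat_manual = _b[_cons] + _b[percent]*percent + _b[percent2]*percent2 + _b[high]*high
* Option B: let predict compute the fitted values
predict yhat_predict
* compare reports how many observations differ and by how much
compare yhat_manual yhat_predict
```

Any differences reported by compare should be at the level of floating-point rounding.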
Regression: observed vs. predicted values
Now let's see how well we did. Type:
scatter csat csat_predict
We should expect a 45-degree pattern in the data. The y-axis is the observed data and the x-axis
the predicted data (Yhat).
In this case the model seems to be doing a good job in predicting csat.
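A reference line makes the 45-degree pattern easier to judge. This is one possible way to draw it, assuming csat_predict was created as described earlier; the titles and legend labels are illustrative choices.

```stata
* Observed vs. predicted values, with a 45-degree reference line.
* Plotting csat_predict against itself traces the line y = x.
twoway (scatter csat csat_predict) ///
       (line csat_predict csat_predict, sort), ///
       ytitle("Observed csat") xtitle("Predicted csat") ///
       legend(order(1 "Observed vs. predicted" 2 "45-degree line"))
```

The /// continuation works in a do-file; in the command window, type the command on a single line.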
Regression: testing for normality
A main assumption of the regression model (OLS) that guarantees the validity of all tests (p, t and F) is that the residuals are
normally distributed. Residuals (here indicated by the letter e) are the difference between the observed values (Y) and the
predicted values (Yhat): e = Y - Yhat.
In Stata you type:
predict e, resid
It will generate a variable called e (residuals).
Three graphs will help us check for normality in the residuals: kdensity, pnorm and qnorm.
kdensity e, normal
A kernel density plot produces a kind of histogram for the
residuals, the option normal overlays a normal distribution to
compare. Here residuals seem to follow a normal distribution.
Below is an example using histogram.
histogram e, kdensity normal
If residuals do not follow a normal pattern then you should check for omitted variables, model specification, linearity,
functional forms. In sum, you may need to reassess your model/theory. In practice normality does not represent much
of a problem when dealing with really big samples.
Regression: testing for normality
The standardized normal probability plot (pnorm) checks for non-normality in the middle range of residuals.
Again, slightly off the line but looks ok.
pnorm e
Quantile-normal plots (qnorm) check for non-normality in the extremes of the data (tails). They plot quantiles of
residuals vs. quantiles of a normal distribution. Tails are a bit off the normal.
qnorm e
A non-graphical test is the Shapiro-Wilk test for normality. The null hypothesis is that the distribution of the residuals
is normal. Type:
swilk e
Here the p-value is 0.64 (way over the usual 0.05 threshold), therefore we fail to reject the null: the test gives no
evidence that the residuals depart from normality.
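The Shapiro-Wilk p-value can also be picked up from the stored results, which is convenient inside a do-file. This is a small sketch, assuming the residual variable e created with predict e, resid already exists.

```stata
* Run the test without printing the table, then show only the p-value
quietly swilk e
display "Shapiro-Wilk p-value = " %6.4f r(p)
```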
Regression: testing for homoskedasticity
Another important assumption is that the variance of the residuals is homoskedastic, which means constant. The spread of the
residuals should not vary for lower or higher values of X (i.e. fitted values of Y, since Yhat = Xb). A definition:
"The error term [e] is homoskedastic if the variance of the conditional distribution of [ei] given Xi [var(ei|Xi)], is constant for i = 1, ..., n, and in particular
does not depend on x; otherwise, the error term is heteroskedastic" (Stock and Watson, 2003, p.126)
When plotting residuals vs. predicted values (Yhat) we should not observe any pattern at all. In Stata we do this using rvfplot
right after running the regression; it will automatically draw a scatterplot between residuals and predicted values. Use
estat hettest to produce a non-graphical test.
rvfplot, yline(0)
estat hettest
In the plot, residuals seem to slightly expand at higher levels of Yhat.
estat hettest runs the Breusch-Pagan test for heteroskedasticity. The null hypothesis is that residuals are
homoskedastic. Here we reject the null and conclude that residuals are heteroskedastic.
These two checks suggest the presence of heteroskedasticity in our model. The problem with this is that we may have the wrong
estimates of the standard errors for the coefficients and therefore of their t-values.
By default Stata assumes homoskedastic standard errors, so we need to adjust our model to account for heteroskedasticity. To do
this we use the option robust in the regress command.
regress csat percent percent2 high, robust
See the next slide for results
Regression: robust standard errors
To run a regression with robust standard errors type:
regress csat percent percent2 high, robust
Notice the difference in the standard errors and the t-values. Following Stock and Watson, as a rule of thumb, you should
always assume heteroskedasticity in your model and use robust standard errors
by adding the option robust (or r for short) to the regression command (see Stock and Watson,
2003, chapter 4).
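To see the difference in the standard errors side by side, both versions of the model can be stored and tabulated. This is a sketch, assuming the states dataset is in memory; ols and rob are hypothetical names for the stored estimates.

```stata
* Same model, default and robust standard errors
quietly regress csat percent percent2 high
estimates store ols
quietly regress csat percent percent2 high, robust
estimates store rob
* Coefficients are identical; only the standard errors change
estimates table ols rob, b(%9.3f) se(%9.3f)
```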
Regression: omitted variable test
How do we know we have included all variables we need to explain Y?
Testing for omitted variable bias is important for our model since it is related to the assumption that the
error term and the independent variables in the model are not correlated (E(e|X) = 0).
If a variable is missing from our model and (1) it is correlated with an included regressor and
(2) it is a determinant of the dependent variable (Stock and Watson, 2003, p.144),
then our regression coefficients are inconsistent.
In Stata we test for omitted-variable bias using the ovtest command. After running the regression
type:
ovtest
The null hypothesis is that the model does not have omitted-variable bias. The p-value is 0.2319,
higher than the usual threshold of 0.05, so we fail to reject the null and conclude that we do not need
more variables.
Regression: specification error
Another command to test model specification is linktest. It basically checks whether we need more
variables in our model by running a new regression with the observed Y (csat) against Yhat
(csat_predict) and Yhat-squared as independent variables.
The thing to look for here is the significance of _hatsq. The null hypothesis is that there is no
specification error. If the p-value of _hatsq is not significant then we fail to reject the null and
conclude that our model is correctly specified.
For more details see http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm, and/or type help linktest.
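A minimal sketch of the sequence, assuming the states dataset is in memory and percent2 already exists:

```stata
* linktest must follow the model it is meant to check
quietly regress csat percent percent2 high
* Regresses csat on _hat (Yhat) and _hatsq (Yhat squared); look at the p-value of _hatsq
linktest
```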
Regression: outliers
To check for outliers we use the avplots command (added-variable plots). Outliers are data points
with extreme values that could have a negative effect on our estimators. After running the regression
type:
avplots
These plots regress each variable against all others; notice the coefficients on each. All data points
seem to be in range, no outliers observed.
For more details and tests on this and on influential and high-leverage observations please check
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
Also type help diagplots in the Stata command window.
Regression: summary of influence indicators

DfBeta
Description: Measures the influence (in standard error terms) of each observation on the coefficient of
a particular independent variable (for example, x1). Note: Stata estimates standardized DfBetas.
High influence if: A case is an influential outlier if |DfBeta| > 2/SQRT(N), where N is the sample size.
In Stata: After running the regression type:
reg y x1 x2 x3
dfbeta x1
Note: you could also type:
predict DFx1, dfbeta(x1)
To estimate the dfbetas for all predictors just type:
dfbeta
To flag the cutoff:
gen cutoffdfbeta = abs(DFx1) > 2/sqrt(e(N)) & e(sample)
In SPSS: Analyze-Regression-Linear; click Save. Select under Influence Statistics to add as a
new variable (DFB1_1) or in syntax type
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X1 X2 X3
/CASEWISE PLOT(ZRESID) OUTLIERS(3) DEFAULTS DFBETA
/SAVE MAHAL COOK LEVER DFBETA SDBETA DFFIT SDFIT COVRATIO .

DfFit
Description: It is a summary measure of leverage and high residuals. Measures how much an
observation influences the regression model as a whole: how much the predicted values
change as a result of including and excluding a particular observation.
High influence if: |DfFIT| > 2*SQRT(k/N), where k is the number of parameters (including the
intercept) and N is the sample size.
In Stata: After running the regression type:
predict dfits if e(sample), dfits
To generate the flag for the cutoff type:
gen cutoffdfit = abs(dfits) > 2*sqrt((e(df_m)+1)/e(N)) & e(sample)
In SPSS: Same as DfBeta above (DFF_1)

Covariance ratio
Description: Measures the impact of an observation on the standard errors.
High impact if: |COVRATIO-1| >= 3*k/N, where k is the number of parameters (including the
intercept) and N is the sample size.
In Stata: After running the regression type:
predict covratio if e(sample), covratio
In SPSS: Same as DfBeta above (COV_1)
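The summary gives cutoff flags for DfBeta and DfFit but not for the covariance ratio. A flag in the same style could look like the sketch below, which applies the usual |COVRATIO-1| >= 3*k/N rule; it assumes the regression has just been run and covratio was created with predict as shown, and cutoffcov is a hypothetical variable name.

```stata
* k = e(df_m) + 1 parameters (slopes plus intercept), N = e(N)
gen cutoffcov = (abs(covratio - 1) >= 3*(e(df_m)+1)/e(N)) & e(sample) & !missing(covratio)
* How many observations exceed the cutoff?
count if cutoffcov == 1
```

The !missing() condition keeps observations without a covratio value from being flagged, since Stata treats missing values as larger than any number.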
Regression: summary of distance measures

Cook's distance
Description: Measures how much an observation influences the overall model or
predicted values. It is a summary measure of leverage and high residuals.
High influence if: D > 4/N, where N is the sample size. A D > 1 indicates a big outlier problem.
In Stata: After running the regression type:
predict D, cooksd
In SPSS: Analyze-Regression-Linear; click Save. Select under Distances
to add as a new variable (COO_1) or in syntax type
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X1 X2 X3
/CASEWISE PLOT(ZRESID) OUTLIERS(3) DEFAULTS DFBETA
/SAVE MAHAL COOK LEVER DFBETA SDBETA DFFIT SDFIT COVRATIO.

Leverage
Description: Measures how much an observation influences regression coefficients.
High influence if: leverage h > 2*k/N, where k is the number of parameters
(including the intercept) and N is the sample size.
A rule of thumb: Leverage goes from 0 to 1. A value closer to 1 or over 0.5
may indicate problems.
In Stata: After running the regression type:
predict lev, leverage
In SPSS: Same as above (LEV_1)

Mahalanobis distance
Description: It is a rescaled measure of leverage: M = leverage*(N-1), where N is the sample size.
Higher levels indicate higher distance from average values.
High influence if: The M-distance follows a Chi-square distribution with k-1 df and
alpha=0.001 (where k is the number of independent variables).
Any value over this Chi-square value may indicate problems.
In Stata: Not available
In SPSS: Same as above (MAH_1)
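The Cook's distance and leverage cutoffs can be turned into flags the same way as for DfBeta and DfFit. This is a sketch under the assumption that a regression such as reg y x1 x2 x3 has just been run; cooks_d, lev_h, cutoffD and cutofflev are hypothetical names.

```stata
* Distance measures for the estimation sample only
predict cooks_d if e(sample), cooksd
predict lev_h if e(sample), leverage
* Flags: D > 4/N and h > 2*k/N, with k = e(df_m) + 1 and N = e(N)
gen cutoffD = (cooks_d > 4/e(N)) & e(sample)
gen cutofflev = (lev_h > 2*(e(df_m)+1)/e(N)) & e(sample)
* Show the observations that exceed either cutoff
list cooks_d lev_h if cutoffD == 1 | cutofflev == 1
```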
Sources for the summary tables: influence indicators and distance measures
Statnotes:
http://faculty.chass.ncsu.edu/garson/PA765/regress.htm#outlier2
An Introduction to Modern Econometrics Using Stata / Christopher F. Baum, Stata
Press, 2006
Statistics with Stata (updated for version 9) / Lawrence Hamilton,
Thomson Brooks/Cole, 2006
Regression: multicollinearity
An important assumption for the multiple regression model is that independent variables are not perfectly
multicollinear. That is, one regressor should not be a linear function of another. When perfect multicollinearity is
present, Stata will drop one of the variables to avoid a division by zero in the OLS procedure (see Stock and
Watson, 2003, chapter 5). A major problem with multicollinearity is that standard errors may be inflated. The
Stata command to check for multicollinearity is vif (variance inflation factor). Right after running the
regression type:
vif
A vif > 10 or a 1/vif < 0.10 indicates trouble. We know that percent and percent2 are related since one is
the square of the other. They are ok since percent has a quadratic relationship with Y. High has a vif of 1.03 and
1/vif of 0.96 so we are ok here.
Let's run another regression and get the vif.
quietly regress csat expense percent income high college
vif
We do not observe multicollinearity problems here. All
vifs are under 10.
For more details see http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm, and/or type help vif.
Regression: publishing regression output (outreg2)
The command outreg2 gives you the type of presentation you see in published papers. If outreg2 is not available you need to
install it by typing
ssc install outreg2
Let's say the regression is regress csat percent percent2 high, robust
The basic syntax for outreg2 is: outreg2 using [pick a name], [type either word or excel]
After the regression type the following if you want to export the results to Excel*
outreg2 using results, excel
Or this if you want to export to Word
outreg2 using results, word
*See the following document for some additional info/tips http://www.fiu.edu/~tardanic/brianne.pdf
Regression: publishing regression output (outreg2)
You can add more models to compare. Let's say you want to add another model without percent2:
regress csat percent high, robust
Now type the following to export the results to Word (notice we add the append option; use excel instead of word if you
exported to Excel):
outreg2 using results, word append
NOTE: If you run a logit/probit regression with odds ratios you need to add the option eform to export the odds ratios.
Type help outreg2 for more details. If you do not see outreg2, you may have to install it by typing ssc install outreg2. If this does not work type
findit outreg2, select from the list and click install.
Note: If you get an error message when you use the option append or replace, it means that you need to close the Excel/Word window.
Regression: publishing regression output (outreg2) (cont.)
For a customized look, here are some options:
***Excel
outreg2 using results, bdec(2) tdec(2) rdec(2) adec(2) alpha(0.01, 0.05, 0.10) addstat(Adj. R-squared, e(r2_a)) excel
***Word
outreg2 using results, bdec(2) tdec(2) rdec(2) adec(2) alpha(0.01, 0.05, 0.10) addstat(Adj. R-squared, e(r2_a)) word
After running the command, click on seeout in the results window to browse the results.
What each piece does:
results: name of the file for the output
bdec(2): sets the number of decimals for coefficients
tdec(2): sets the number of decimals for auxiliary statistics
rdec(2): sets the number of decimals for the R2
adec(2): sets the number of decimals for added statistics (addstat option)
alpha(0.01, 0.05, 0.10): levels of significance
addstat(Adj. R-squared, e(r2_a)): includes some additional statistic, in this case adj. R-sqr. You can select any
statistics on the return lists (e-class, r-class or s-class). After running the
regression type ereturn list for a list of available statistics.
Regression: interaction between dummies
Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of
another independent variable. We will explore here the interaction between two dummy (binary) variables. In the example below it
could be the case that the effect of student-teacher ratio on test scores depends on the percent of English learners in the district*.
Dependent variable (Y): Average test score, variable testscr in dataset.
Independent variables (X):
Binary hi_str, where 0 if student-teacher ratio (str) is lower than 20, 1 if equal to 20 or higher.
Binary hi_el, where 0 if English learners (el_pct) is lower than 10%, 1 if equal to 10% or higher.
In Stata, first generate hi_str = 0 if str<20. Then replace hi_str=1 if str>=20.
In Stata, first generate hi_el = 0 if el_pct<10. Then replace hi_el=1 if el_pct>=10.
Interaction term str_el = hi_str * hi_el. In Stata: generate str_el = hi_str*hi_el
We run the regression
regress testscr hi_el hi_str str_el, robust
The equation is testscr_hat = 664.1 - 18.1*hi_el - 1.9*hi_str - 3.5*str_el
The effect of hi_str on the test scores is -1.9 but, given the interaction term (and assuming all coefficients are significant), the net effect is
-1.9 - 3.5*hi_el. If hi_el is 0 then the effect is -1.9 (which is the hi_str coefficient), but if hi_el is 1 then the effect is -1.9 - 3.5 = -5.4.
In this case, the effect of student-teacher ratio is more negative in districts where the percent of English learners is higher.
See the next slide for more detailed computations.
*The data used in this section is the California Test Score data set (caschool.dta) from chapter 6 of the book Introduction to Econometrics from Stock and Watson, 2003. Data can be downloaded from
http://wps.aw.com/aw_stock_ie_2/50/13016/3332253.cw/index.html. For a detailed discussion please refer to the respective section in the book.
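As an alternative sketch, the dummies can be created in one step each and the interaction left to Stata's factor-variable notation. This assumes Stata 11 or later and caschool.dta in memory; d_str and d_el are hypothetical names used to avoid clashing with hi_str and hi_el.

```stata
* One-step dummies; observations with missing str or el_pct stay missing
generate d_str = (str >= 20) if !missing(str)
generate d_el = (el_pct >= 10) if !missing(el_pct)
* ## includes both main effects and their interaction
regress testscr i.d_el##i.d_str, robust
```

The if !missing() guard matters because Stata treats missing values as larger than any number, so str>=20 alone would code a missing ratio as 1.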
Regression: interaction between dummies (cont.)
You can compute the expected values of test scores given different values of hi_str and hi_el. To see the effect of hi_str given
hi_el, type the following right after running the regression in the previous slide.
These are different scenarios holding hi_el constant and varying
hi_str. Below we add some labels.
We then obtain the average of the estimations for the test scores (for all four scenarios; notice the same values for all cases).
Here we estimate the net effect of low/high student-teacher ratio holding constant the percent of
English learners. When hi_el is 0 the effect of going from low to high student-teacher ratio goes
from a score of 664.2 to 662.2, a difference of 1.9. From a policy perspective you could argue that
moving from high str to low str improves test scores by 1.9 in low English learners districts.
When hi_el is 1, the effect of going from low to high student-teacher ratio goes from a score of
645.9 down to 640.5, a decline of 5.4 points (1.9+3.5). From a policy perspective you could say
that reducing the str in districts with a high percentage of English learners could improve test scores
by 5.4 points.
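One way to obtain the four expected scores, with standard errors, is lincom. This is a sketch and not necessarily the commands used on the original slide; it assumes caschool.dta is in memory with hi_el, hi_str and str_el created as described earlier.

```stata
quietly regress testscr hi_el hi_str str_el, robust
* hi_el = 0, hi_str = 0
lincom _cons
* hi_el = 0, hi_str = 1
lincom _cons + hi_str
* hi_el = 1, hi_str = 0
lincom _cons + hi_el
* hi_el = 1, hi_str = 1
lincom _cons + hi_el + hi_str + str_el
* Net effect of hi_str when hi_el = 1
lincom hi_str + str_el
```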
Regression: interaction between a dummy and a continuous variable
Let's explore the same interaction as before but we keep student-teacher ratio continuous and the English learners variable as binary. The
question remains the same*.
Continuous str, student-teacher ratio.
Binary hi_el, where 0 if English learners (el_pct) is lower than 10%, 1 if equal to 10% or higher.
Interaction term str_el2 = str * hi_el. In Stata: generate str_el2 = str*hi_el
We will run the regression
regress testscr str hi_el str_el2, robust
The equation is testscr_hat = 682.2 - 0.97*str + 5.6*hi_el - 1.28*str_el2
The effect of str on testscr will depend on hi_el.
If hi_el is 0 (low) then the regression line is 682.2 - 0.97*str.
If hi_el is 1 (high) then the regression line is 682.2 - 0.97*str + 5.6 - 1.28*str = 687.8 - 2.25*str
Notice how hi_el changes both the intercept and the slope of str. Reducing str by one in low EL districts will increase test scores by
0.97 points, but it will have a higher impact (2.25 points) in high EL districts. The difference between these two effects is 1.28, which is the
size of the coefficient of the interaction (Stock and Watson, 2003, p.223).
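The slope of str in high EL districts is the sum of two coefficients, so lincom can report it with its standard error. This is a sketch, assuming caschool.dta is in memory with hi_el and str_el2 created as described in this section.

```stata
quietly regress testscr str hi_el str_el2, robust
* Slope of str when hi_el = 1: coefficient of str plus coefficient of the interaction
lincom str + str_el2
```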
Regression: interaction between two continuous variables
Let's now keep both variables continuous. The question remains the same*.
Continuous str, student-teacher ratio.
Continuous el_pct, percent of English learners.
Interaction term str_el3 = str * el_pct. In Stata: generate str_el3 = str*el_pct
We will run the regression
regress testscr str el_pct str_el3, robust
The equation is testscr_hat = 686.3 - 1.12*str - 0.67*el_pct + 0.0012*str_el3
The effect of the interaction term is very small. Following Stock and Watson (2003, p.229), algebraically the slope of str is
-1.12 + 0.0012*el_pct (remember that str_el3 is equal to str*el_pct). So:
If el_pct=10, the slope of str is -1.108
If el_pct=20, the slope of str is -1.096. A difference in effect of 0.012 points.
In the continuous case there is an effect but it is very small (and not significant). See Stock and Watson, 2003, for further details.
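The slope of str at a given level of el_pct can be computed by Stata instead of by hand. This is a sketch, assuming caschool.dta is in memory and str_el3 was created as described in this section.

```stata
quietly regress testscr str el_pct str_el3, robust
* Slope of str when el_pct = 10
lincom str + 10*str_el3
* Slope of str when el_pct = 20
lincom str + 20*str_el3
```

Besides the point estimate, lincom reports a standard error and confidence interval for each slope.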
Creating dummies
You can create dummy variables by either using recode or using a combination of tab/gen commands:
tab major, generate(major_dum)
Check the variables window; at the end you will see three new variables. Using tab1 (for multiple
frequencies) you can check that they are all 0 and 1 values.
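A quick way to run that check, assuming the dummies were created with the stub major_dum as in the command shown for major:

```stata
* One frequency table per dummy; the wildcard picks up major_dum1, major_dum2, ...
tab1 major_dum*
```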
Creating dummies (cont.)
Here is another example:
tab agegroups, generate(agegroups_dum)
Check the variables window; at the end you will see three new variables. Using tab1 (for multiple
frequencies) you can check that they are all 0 and 1 values.
Frequently used Stata commands
Category: Stata commands
Getting on-line help: help, search
Operating-system interface: pwd, cd, sysdir, mkdir, dir/ls, erase, copy, type
Using and saving data from disk: use, clear, save, append, merge, compress
Inputting data into Stata: input, edit, infile, infix, insheet
The Internet and Updating Stata: ado, news, update, net
Basic data reporting: describe, codebook, inspect, list, browse, count, assert, summarize, table (tab), tabulate
Data manipulation: generate, replace, egen, recode, rename, drop, keep, sort, encode, decode, order, by, reshape
Formatting: format, label
Keeping track of your work: log, notes
Convenience: display
Source: http://www.ats.ucla.edu/stat/stata/notes2/commands.htm
Type help [command name] in the command window for details
Is my model OK? (links)

Regression diagnostics: A checklist
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
Logistic regression diagnostics: A checklist
http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm
Time series diagnostics: A checklist (pdf)
http://homepages.nyu.edu/~mrg217/timeseries.pdf
Time series: dfuller test for unit roots (for R and Stata)
http://www.econ.uiuc.edu/~econ472/tutorial9.html
Panel data tests: heteroskedasticity and autocorrelation
http://www.stata.com/support/faqs/stat/panel.html
http://www.stata.com/support/faqs/stat/xtreg.html
http://www.stata.com/support/faqs/stat/xt.html
http://dss.princeton.edu/online_help/analysis/panel.htm
I can't read the output of my model!!! (links)

Data Analysis: Annotated Output
http://www.ats.ucla.edu/stat/AnnotatedOutput/default.htm
Data Analysis Examples
http://www.ats.ucla.edu/stat/dae/
Regression with Stata
http://www.ats.ucla.edu/STAT/stata/webbooks/reg/default.htm
Regression
http://www.ats.ucla.edu/stat/stata/topics/regression.htm
How to interpret dummy variables in a regression
http://www.ats.ucla.edu/stat/Stata/webbooks/reg/chapter3/statareg3.htm
How to create dummies
http://www.stata.com/support/faqs/data/dummy.html
http://www.ats.ucla.edu/stat/stata/faq/dummy.htm
Logit output: what are the odds ratios?
http://www.ats.ucla.edu/stat/stata/library/odds_ratio_logistic.htm
Topics in Statistics (links)

What statistical analysis should I use?
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm
Statnotes: Topics in Multivariate Analysis, by G. David Garson
http://www2.chass.ncsu.edu/garson/pa765/statnote.htm
Elementary Concepts in Statistics
http://www.statsoft.com/textbook/stathome.html
Introductory Statistics: Concepts, Models, and Applications
http://www.psychstat.missouristate.edu/introbook/sbk00.htm
Statistical Data Analysis
http://math.nicholls.edu/badie/statdataanalysis.html
Stata Library. Graph Examples (some may not work with STATA 10)
http://www.ats.ucla.edu/STAT/stata/library/GraphExamples/default.htm
Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and
SPSS
http://www.indiana.edu/~statmath/stat/all/ttest/
Useful links / Recommended books
DSS Online Training Section http://dss.princeton.edu/training/
UCLA Resources to learn and use STATA http://www.ats.ucla.edu/stat/stata/
DSS help-sheets for STATA http://dss/online_help/stats_packages/stata/stata.htm
Introduction to Stata (PDF), Christopher F. Baum, Boston College, USA. "A 67-page description of Stata, its key
features and benefits, and other useful information." http://fmwww.bc.edu/GStat/docs/StataIntro.pdf
STATA FAQ website http://stata.com/support/faqs/
Princeton DSS Libguides http://libguides.princeton.edu/dss
Books
Introduction to econometrics / James H. Stock, Mark W. Watson. 2nd ed., Boston: Pearson Addison
Wesley, 2007.
Data analysis using regression and multilevel/hierarchical models / Andrew Gelman, Jennifer Hill.
Cambridge ; New York : Cambridge University Press, 2007.
Econometric analysis / William H. Greene. 6th ed., Upper Saddle River, N.J. : Prentice Hall, 2008.
Designing Social Inquiry: Scientific Inference in Qualitative Research / Gary King, Robert O.
Keohane, Sidney Verba, Princeton University Press, 1994.
Unifying Political Methodology: The Likelihood Theory of Statistical Inference / Gary King, Cambridge
University Press, 1989
Statistical Analysis: an interdisciplinary introduction to univariate & multivariate methods / Sam
Kachigan, New York : Radius Press, c1986
Statistics with Stata (updated for version 9) / Lawrence Hamilton, Thomson Brooks/Cole, 2006