Stata Guide

Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Stata Guide for Bachelor Students

Tamas Barko April 22, 2014


This guide helps students getting acquainted with basic data management, regressions and outputs in Stata. I show these techniques through replicating parts of a published research paper. The paper is Opler, Pinkowitz, Stulz and Williamson (OPSW) (1999) The determinants and implications of corporate cash holdings Journal of Financial Economics, vol. 52, issue 1, pp. 3-46. The scope of this study is the 1990 to 2010 period, including all rms that were part of the S&P500 at any point during that period. The rst text le provided (sp500 list.txt) contains the Compustat identiers -GVKEY- of these companies. The other le (variables.txt) contains the names of the variables that you need. You can use this list to download your data from WRDS. We follow OPSW in constructing the sample, except for applying the CPI and calculating industry sigmas. If I refer to a certain command in Stata, you can always check the help le typing -help command-. Stata has many functions but not everything is included in the software package. However, there are many user-written commands that you can install yourself. Simply use -ndit- to get to the installation directory. I recommend installing -egenmore- and -outreg-. It is always useful to use global macros in Stata to refer to le paths instead of typing the path every time. You can then invoke the path with ${}. *The first two filepaths are for the Mac the last two for Windows global filepath "/Users/tamasbarko/Dropbox/Stata/" global outreg "/Users/tamasbarko/Dropbox/Stata/output/" global filepath "H:\Dropbox\Stata\" global outreg "H:\Dropbox\Stata\output\" use "${filepath}original.dta", clear Unless the focus of your research is on nancials or utilities, drop these industries as they are highly regulated and can therefore distort your sample. drop indfmt drop consol drop datafmt drop popsrc drop if fyear==. destring gvkey, replace
Email:

[email protected], Oce: K934

destring sic, replace *Dropping financials drop if sic>5999 & sic<7000 *Dropping utilities drop if sic>4899 & sic<5000 *Drop ADR and foreign drop if sic==8880 | sic==8888 *Two-digit industry code tostring sic, gen(twodig) replace twodig="0"+twodig if length(twodig)==3 replace twodig=substr(twodig,1,2) destring twodig, replace If you have a missing value in your dataset, you cannot automatically set it to 0. However, if it is reasonable to assume that a missing value should be treated as 0, then you can replace it. Research and development expenses can be considered zero if missing. *Setting missing R&D to 0 as in the paper replace xrd=0 if xrd==. *Calculate variables of interest gen net_asset=at-ch gen cash_asset= ch/ net_asset gen cash_asset_tr=cash_ass replace cash_asset_tr=1 if cash_asset>1 replace cash_asset_tr=. if cash_asset==. gen size=ln(at) gen market_value=prcc_f*csho gen mb=(at-seq+market_value)/at gen rd_sales= xrd/revt gen cf_assets=(ebitda-dvc-intpn-txt)/net_asset gen nwc_assets=(wcap-ch)/net_asset gen capex_assets=capx/net_asset gen acq_assets= aqc/net_asset gen payout=(dvc+ prstkc)/net_asset gen leverage=(dltt+ dlc)/net_asset In most cases we assume that our variables are normally distributed. Since outliers can distort results, it is generally advisable to winsorize every variable at a reasonable level. Notice that I use a loop to iterate over all variables. Please review the -foreach- command to understand looping in Stata. *Winsorize at 5%

foreach v of varlist cash_asset-leverage { winsor v, gen(w_v) p(0.05) drop v rename w_v v } While not necessary, it is generally a good idea to label your variables. It will make it easier to output your results. *Labeling variables label variable net_asset "Assets minus cash" label variable market_value "Market value of equity (shares*prices)" label variable cash_asset "Cash/assets" label variable cash_asset_tr "Truncated cash/assets" label variable size "Size" label variable mb "Market-to-book value" label variable rd_sales "R&D/sales" label variable cf_assets "Cash-flow/assets" label variable nwc_assets "Net working capital/assets" label variable capex_assets "CAPEX/assets" label variable acq_assets "Acquisitions/assets" label variable payout "Dividends and repurchases/assets" label variable leverage "Leverage" order market_value, after(net_asset) sort gvkey fyear save "${filepath}company_data_sp_1990_2010.dta", replace

We replicate Table 1 of OPSW. We add quartiles and minimum and maximum values as well. There are many ways to generate a presentable output in Stata, as Stata 13 even supports generating Excel or Word les. Alternatively, you can use the -estout- or -tabout- commands, but the -le write- command gives total control over generating the output. The - n- tells Stata to start a new line, while - tab- is for going to the next column. Notice that I generate a temporary variable for each statistic using -local-. For convenience, I also round each statistic to 3 digits. To see the available statistics, check the -summarizecommand and the -detail- option. *Summary statistics - replicating table 1 use "${filepath}company_data_sp_1990_2010.dta", clear file open desc using "${outreg}summary.txt", write replace file write desc "Variable" _tab "Obs" _tab "Mean" _tab "St. Dev." _tab "25%" /* */_tab "Median" _tab "75%" _tab "Min" _tab "Max" foreach v of varlist cash_asset-leverage { summarize v, detail local label: variable label v

local mean=round(r(mean),.001) local sd=round(r(sd),.001) local q1=round(r(p25),.001) local median=round(r(p50),.001) local q3=round(r(p75),.001) local q0=round(r(min),.001) local q4=round(r(max),.001) file write desc _n "label" _tab (r(N)) _tab (mean) _tab (sd) _tab (q1) /* */ _tab (median) _tab (q3) _tab (q0) _tab (q4) } file close desc type "${outreg}summary.txt" If you run the code then you should get the same output as Table 1. In case you want to show the evolution of a certain statistic over time, use the -collapse- command. If you do everything correctly you get Figure 1. *Figure - replicating OPSW figure 2 use "${filepath}company_data_sp_1990_2010.dta", clear sort fyear preserve collapse (median) cash_asset, by(fyear) twoway area cash_asset fyear graph export "${outreg}question2.eps", as(eps) preview(on) replace restore If you want to run time-series or panel regressions, you have to set your data with either -tset- (for time series) or -xtset- (for panel). After that you can use lag -l- and dierence -d- operators. In case you are interested in the distribution of regression coecients, you have to use the -statsby- command. Figure 2 shows the distribution of coecients from an autocorrelation regression. *Figure - replicating OPSW figure 3 use "${filepath}company_data_sp_1990_2010.dta", clear xtset gvkey fyear capture by gvkey: egen counter=count(fyear) preserve statsby, clear by(gvkey): reg d.cash_asset d.l1.cash_asset if counter>4 drop if _stat_1==. histogram _stat_1 if _stat_1<2 & _stat_1>-1 graph export "${outreg}question3.eps", as(eps) preview(on) replace restore When presenting your results, you rst have to show univariate tests based on your main variable(s). To that end you have to sort your variables into 2-5 groups based on the distribution of the main variable and then you can compare statistics between those groups. One such univariate test is in Table 2. *Quartile table - replicating OPSW table 3 use "${filepath}company_data_sp_1990_2010.dta", clear 4

bysort fyear: egen cash_q=xtile(cash_asset), n(4) drop if cash_q==. gen testvar=. replace testvar=0 if cash_q==1 replace testvar=1 if cash_q==4 *Cash/asset quartile intervals forvalues x=1/4 { sum cash_asset if cash_q==x, detail local fromx=round(r(min),.01) local tox=round(r(max),.01) local obsx=r(N) } file open quartile using "${outreg}quartiles.txt", write replace file write quartile "Variable" _tab "Q1" _tab "Q2" _tab "Q3" _tab "Q4" _tab "t-statistic" _tab "from1 to to1" _tab "from2 to to2" /* file write quartile _n "Cash/assets range"

*/_tab "from3 to to3" _tab "from4 to to4" _tab foreach v of varlist cash_asset-leverage { local label: variable label v forvalues y=1/4 { summarize v if cash_q==y, detail local meany=round(r(mean),.001) local mediany=round(r(p50),.001) } ttest v, by(testvar) local tstat=round(r(t),.001) local pval=round(r(p),.001) file write quartile _n "label" _tab "mean1" _tab "mean2" /* */ _tab "mean3" _tab "mean4" _tab "tstat" file write quartile _n _tab "[median1]" _tab "[median2]" /* */_tab "[median3]" _tab "[median4]" _tab "(pval)" } file write quartile _n "Observations" _tab "obs1" _tab "obs2" _tab /* */ "obs3" _tab "obs4" _tab file close quartile type "${outreg}quartiles.txt" Finally, to quantify the eect of your explanatory variables on the dependent variable, you have to run regressions. As a rst step, it is enough to run a simple OLS regression using the -reg- command. It is generally a good idea to compute standard errors corrected for heteroscedasticity and to cluster standard errors at the rm level -vce-. Then to nd out if your results are robust, you can include year and/or

industry xed eects using the -i.- operator. If you want to interact xed eects you can use -*- or -#(check the dierence). You can always copy-paste your regression output to Excel yourself, but it is advisable to use some output command such as -outreg- to automate this process. A regression table should be similar to Table 3. *Regressions - question 5 use "${filepath}company_data_sp_1990_2010.dta", clear gen lncash_asset=ln(cash_asset) label variable lncash_asset "Log of cash/asset" gen dividend=0 replace dividend=1 if dvc>0 *Either generate regulation dummy or address why not doing so gen regulated=0 replace regulated=1 if sic== 4011 | sic== 4210 | sic== 4213 | /* */ sic== 4512 | sic== 4812 | sic== 4813 label variable dividend "Dividend payer dummy" label variable regulated "Regulated industry dummy" set matsize 5000 *OLS with clustered robust errors reg lncash_asset size */, vce(cluster gvkey) outreg using "${outreg}ols_simple.tex", tex */varlabels fragment replace *OLS with clustered robust errors, industry and year dummies reg lncash_asset size mb rd_sales cf_assets nwc_assets capex_assets leverage dividend regulated /* */ i.fyear i.twodig, vce(cluster gvkey) outreg using "${outreg}ols_simple.tex", tex summstat(r2_a\N) summtitle("Adjusted R2"\"N") /* */keep(size rd_sales cf_assets nwc_assets capex_assets leverage dividend regulated mb) /* */ addrows("Year dummy", "yes" \ "Industry dummy", "yes" \"Year-industry dummy", "no") /* */ varlabels fragment merge *OLS with clustered robust errors and industry-year dummies reg lncash_asset size mb rd_sales cf_assets nwc_assets capex_assets leverage dividend regulated /* */ i.fyear#twodig, vce(cluster gvkey) outreg using "${outreg}ols_simple.tex", tex summstat(r2_a\N) summtitle("Adjusted R2"\"N") /* */keep(size rd_sales cf_assets nwc_assets capex_assets leverage dividend regulated mb) /* */ addrows("Year dummy", "no" \ "Industry dummy", "no" \"Year-industry dummy", "yes") /* 6 summstat(r2_a\N) summtitle("Adjusted R2"\"N") /* */ addrows("Year dummy", "no" \ "Industry dummy", "no" \"Year-industry dummy", "no") /* mb rd_sales cf_assets nwc_assets capex_assets leverage dividend regulated /*

Figure 1: Replication of OPSW Figure 2


.12 .04 1990 (p 50) cash_asset .06 .08 .1

1995

2000 Data Year Fiscal

2005

2010

*/ varlabels fragment merge

Table 1: Replication of OPSW Table 1

Variable Cash/assets Truncated cash/assets Size Market-to-book value R&D/sales Cash-ow/assets Net working capital/assets CAPEX/assets Acquisitions/assets Dividends and repurchases/assets Leverage

Obs 11473 11473 11706 11695 11708 10666 11116 11365 10503 10765 11420

Mean 0.096 0.096 8.162 2.113 0.035 0.099 0.104 0.065 0.022 0.045 0.245

St. Dev. 0.105 0.105 1.346 1.178 0.056 0.056 0.156 0.044 0.042 0.054 0.162

0.25 0.018 0.018 7.239 1.282 0 0.061 -0.01 0.031 0 0.004 0.122

Median 0.055 0.055 8.141 1.695 0.005 0.094 0.074 0.054 0.001 0.025 0.239

0.75 0.134 0.134 9.168 2.504 0.046 0.132 0.2 0.088 0.022 0.064 0.354

Min 0.004 0.004 5.634 0.978 0 0 -0.135 0.012 0 0 0

Max 0.386 0.386 10.533 5.422 0.19 0.224 0.447 0.173 0.153 0.193 0.571

Figure 2: Replication of OPSW Figure 3


1.5 Density 0 1 .5 1

.5

0 .5 _b[LD.cash_asset]

1.5

Table 2: Replication of OPSW Table 3 Variable Cash/assets range Cash/assets Truncated cash/assets Size Market-to-book value R&D/sales Cash-ow/assets Net working capital/assets CAPEX/assets Acquisitions/assets Dividends and repurchases/assets Leverage Observations Q1 0 to .05 0.011 [.009] 0.011 [.009] 8.531 [8.551] 1.695 [1.46] 0.01 [0] 0.087 [.084] 0.06 [.028] 0.068 [.057] 0.025 [.002] 0.036 [.021] 0.314 [.307] 2876 Q2 .01 to .11 0.037 [.031] 0.037 [.031] 8.434 [8.461] 1.84 [1.574] 0.022 [.002] 0.091 [.09] 0.088 [.061] 0.062 [.051] 0.025 [.002] 0.043 [.026] 0.271 [.26] 2865 Q3 .03 to .2 0.088 [.083] 0.088 [.083] 8.137 [8.061] 2.144 [1.767] 0.04 [.008] 0.099 [.097] 0.116 [.097] 0.063 [.052] 0.021 [.001] 0.048 [.028] 0.219 [.195] 2870 Q4 .09 to .39 0.247 [.227] 0.247 [.227] 7.477 [7.463] 2.824 [2.394] 0.07 [.053] 0.123 [.125] 0.151 [.137] 0.066 [.055] 0.018 [0] 0.054 [.023] 0.175 [.127] 2862 t-statistic -131.917 (0) -131.917 (0) 31.659 (0) -36.52 (0) -43.61 (0) -22.614 (0) -21.187 (0) 1.526 (.127) 6.165 (0) -12.361 (0) 33.839 (0)

10

Table 3: Replication of OPSW Table 4 Log of cash/asset Size Market-to-book value R&D/sales Cash-ow/assets Net working capital/assets CAPEX/assets Leverage Dividend payer dummy Regulated industry dummy Constant Adjusted R2 N Year dummy Industry dummy Year-industry dummy -0.082
(3.59)**

Log of cash/asset -0.140


(5.79)**

Log of cash/asset -0.127


(4.82)**

0.121
(5.21)**

0.161
(6.79)**

0.167
(6.22)**

6.276
(13.01)**

4.817
(7.76)**

4.694
(7.07)**

2.945
(6.47)**

1.454
(3.70)**

1.417
(3.21)**

-0.620
(2.98)**

-0.388
(1.92)

-0.219
(0.99)

-3.969
(6.64)**

-0.445
(0.72)

-0.234
(0.33)

-1.568
(9.00)**

-1.046
(6.46)**

-1.017
(5.80)**

-0.351
(5.47)**

-0.211
(3.58)**

-0.213
(3.29)**

-0.282
(1.63)

-0.064
(0.27)

-0.060
(0.23)

-2.198
(10.07)**

0.28 10,239 no no no
* p < 0.05; ** p < 0.01

0.42 10,239 yes yes no

0.41 10,239 no no yes

11

You might also like