AgEcon Search
http://ageconsearch.umn.edu
[email protected]

Papers downloaded from AgEcon Search may be used for non-commercial purposes and personal study only. No other use, including posting to another Internet site, is permitted without permission from the copyright owner (not AgEcon Search), or as allowed under the provisions of Fair Use, U.S. Copyright Act, Title 17 U.S.C. No endorsement of AgEcon Search or its fundraising activities by the author(s) of the following work or their employer(s) is intended or implied.
The Stata Journal (2011) 11, Number 2, pp. 299–304
1 Introduction
In statistics, a probability distribution describes how probability is assigned to the values of a random variable. If the random variable is discrete, the distribution gives the probability of each value of this variable. If the random variable is continuous, it defines the probability that this variable's value falls within a particular interval. The probability distribution thus describes the range of possible values that a random variable can attain, referred to below as the support interval, and the probability that the value of the random variable lies within any (measurable) subset of that range.
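To make the continuous case concrete, the sketch below (in Python rather than Stata, purely for illustration; the function names `normal_pdf` and `prob_in_interval` are my own, not part of the article) approximates the probability that a standard normal variable falls in the interval [-1, 1] by numerically integrating its density:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std. dev. sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def prob_in_interval(pdf, a, b, n=10_000):
    """Approximate P(a <= X <= b) with the midpoint rule on n subintervals."""
    h = (b - a) / n
    return sum(pdf(a + (i + 0.5) * h) for i in range(n)) * h

# P(-1 <= X <= 1) for a standard normal is about 0.6827
print(round(prob_in_interval(normal_pdf, -1.0, 1.0), 4))
```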
Random sampling refers to taking a number of independent observations from a
probability distribution. Typically, the parameters of the probability distribution (of-
ten referred to as true parameters) are unknown, and the aim is to retrieve them using
various estimation methods on the random sample generated from this probability dis-
tribution.
For example, consider a normally distributed population with true mean μ and true variance σ². Assume that we take a sample of a given size from this population and calculate its mean and standard deviation—these two statistics are called the sample mean and the sample standard deviation. Depending on the estimation method used, the sample size, and other factors, these two statistics can be shown to converge asymptotically to the true parameters of the original probability distribution and are thus good approximations of these values.
© 2011 StataCorp LP st0229
300 Random samples
The researcher may wish to evaluate the accuracy of the estimation method prior
to using it on real data. In such a case, a random sample generated from a distribution
with known true parameters can be used, and its estimated sample parameters can be
compared with the true parameters to establish the accuracy of the estimation method.
Taking our previous example, we can use a random sample generated from a normal distribution with parameters μ = 0 and σ² = 1; we can estimate its sample mean by taking the arithmetic average, and we can estimate its sample variance by taking the arithmetic average of the squared errors. If these two values are reasonably close to the true parameter values, we can conclude that our estimation method (arithmetic averaging) is accurate.
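This simulation check can be sketched in a few lines (Python rather than Stata, for illustration; the sample size and tolerance of 0.02 are my own choices, not from the article):

```python
import random

random.seed(12345)  # fix the seed so the draws are reproducible

mu_true, sigma_true = 0.0, 1.0
sample = [random.gauss(mu_true, sigma_true) for _ in range(100_000)]

# Sample mean: the arithmetic average
mean_hat = sum(sample) / len(sample)
# Sample variance: the average squared deviation from the sample mean
var_hat = sum((x - mean_hat) ** 2 for x in sample) / len(sample)

# Both estimates should land close to the known true parameters
print(abs(mean_hat - mu_true) < 0.02, abs(var_hat - sigma_true ** 2) < 0.02)
```

With 100,000 draws the standard errors of both estimates are well below 0.02, so the comparison against the known true parameters succeeds.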
In this article, I introduce a command that generates random samples from user-
specified probability distribution functions with known parameters that can be used,
among other applications, in such simulation exercises.
However, a user may wish to draw a random sample from a distribution that is not built into Stata. In such a case, the inverse distribution function must be derived algebraically so that the inverse cumulative distribution function method can be used. But sometimes the inverse of a function cannot be expressed by a formula. For example, if f is the function f(x) = x + sin(x), then f is a one-to-one function and therefore possesses an inverse function f⁻¹. However, there is no simple formula for this inverse, because the equation y = x + sin(x) cannot be solved algebraically for x. In this case, numerical methods must be applied to match the values of x and y.
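One standard numerical method for this matching is root bracketing. The sketch below (Python, for illustration; the bracket [-10, 10] and the helper name `invert` are my own choices) inverts y = x + sin(x) by bisection, which works because f is strictly increasing:

```python
import math

def f(x):
    return x + math.sin(x)

def invert(y, lo=-10.0, hi=10.0, tol=1e-10):
    """Solve f(x) = y by bisection; valid because f is increasing
    (f'(x) = 1 + cos(x) >= 0) and y lies between f(lo) and f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = invert(2.0)
print(round(f(x), 6))  # recovers y = 2.0
```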
The command stores the generated random sample in a new variable called rsample.

pdf function is required. It is a string that specifies the probability distribution function from which the random sample is to be drawn. It must be formulated in terms of x, for example, exp(-x^2), although no x variable needs to exist prior to command execution.
Two properties must normally hold for the probability distribution functions:

1. They must be nonnegative on the whole support interval of the random variable.

2. They must integrate (or, in the discrete case, sum) to 1 over the support interval.
The second condition does not have to be fulfilled in this case because the rsample command calculates the sum or integral of the pdf function over the support interval and uses that value as a rescaling factor. This way, a rescaled pdf function that always integrates or sums to 1 is used. The first condition, however, must be fulfilled. If it is violated, an error message appears that reminds the user to supply a nonnegative pdf function. The rescaling constant is always displayed on the screen.
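The article does not show how the rescaling constant is computed internally; the following is a minimal sketch under the assumption that simple numerical integration is used (Python for illustration; `rescale` and its signature are my own):

```python
import math

def rescale(pdf, left, right, n=1000):
    """Numerically integrate pdf over [left, right] with the midpoint rule;
    return the rescaling constant and a density that integrates to 1."""
    h = (right - left) / n
    vals = [pdf(left + (i + 0.5) * h) for i in range(n)]
    if any(v < 0 for v in vals):
        # mirrors the command's check of the first condition
        raise ValueError("pdf function must be nonnegative on the support interval")
    c = sum(vals) * h  # rescaling constant
    return c, lambda x: pdf(x) / c

# exp(-x^2) on [-2, 2]: integral is sqrt(pi) * erf(2), about 1.7642
c, density = rescale(lambda x: math.exp(-x * x), -2.0, 2.0)
print(round(c, 4))
```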
left(#) and right(#) specify the support interval, that is, all values that the
random variable can attain. left() must be specified to be less than right(). If these
values are not specified, they take on default values of -2 and 2, respectively.
bins(#) specifies the number of bins into which the support interval is split for the
purposes of the algorithm. Essentially, it allows the user to specify the precision of the
algorithm. If this value is not specified, it takes on a default value of 1000. However,
if this value (whether defined by the user or its default) exceeds the total number of
observations (whether defined by the corresponding optional parameter or by the size
of the existing Stata file), it is automatically set to equal one-fifth of the total number
of observations.
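The command's internals are not printed in the article, but the description above (a support interval split into bins, a rescaling factor, and numerical matching of x and y) suggests a grid-based inverse-CDF scheme. A minimal Python sketch under that assumption follows; `rsample_sketch` and all of its parameter names are mine, not the command's actual implementation:

```python
import bisect
import math
import random

def rsample_sketch(pdf, left=-2.0, right=2.0, bins=1000, obs=10_000, seed=1):
    """Grid-based inverse-CDF sampling: split [left, right] into bins,
    accumulate the rescaled pdf into a discrete CDF, and map uniform
    draws back to bin midpoints."""
    random.seed(seed)
    h = (right - left) / bins
    mids = [left + (i + 0.5) * h for i in range(bins)]
    weights = [pdf(m) for m in mids]
    total = sum(weights)  # rescaling factor (up to the bin width h)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    cdf[-1] = 1.0  # guard against floating-point round-off
    # invert the discrete CDF with a binary search per draw
    return [mids[bisect.bisect_left(cdf, random.random())]
            for _ in range(obs)]

sample = rsample_sketch(lambda x: math.exp(-x * x / 2))
print(len(sample), min(sample) >= -2.0, max(sample) <= 2.0)
```

Each draw lands on a bin midpoint, so the bins() option controls the resolution of the generated values, which matches its description as the precision of the algorithm.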
4 Examples
In this section, I provide several specific examples with various pdf functions and various
optional parameters specified. The graphical representations of the last three examples
are presented in figures 1, 2, and 3 in the form of histograms of the generated random
values. These three examples were generated from normal, lognormal, and Laplace
distribution functions, respectively.
[Figure 1 omitted: histogram of a random sample generated from the normal pdf, with the theoretical pdf overlaid; values on the horizontal axis, density on the vertical axis.]

[Figure 2 omitted: histogram of a random sample generated from the lognormal pdf, with the theoretical pdf overlaid.]

[Figure 3 omitted: histogram of a random sample generated from the Laplace pdf, with the theoretical pdf overlaid.]