Data and Monte Carlo Simulations
Analysis Using Data and Monte Carlo Simulations
Overview
• Constructing probability distributions from
data
• Fitting data to theoretical probability
distributions
• Understanding the basics of Monte Carlo
Simulations
Preliminaries
We will look at two types of data:
– Sample data
• Denoted as x1, x2, …, xn for n observations, where each xi is a
known number
– Subjectively assessed data
• Denoted as (x1, p1), (x2, p2), …, (xn, pn) for n pairs of values,
where pi is the cumulative probability associated with xi, that
is, P(X ≤ xi) = pi for i = 1, 2, …, n
• In both cases, we are looking at a subset of
the uncertainty population.
Preliminaries
• And we will examine two ways to use data to
construct probability distributions:
– Directly construct the distribution based on the data
– Select a theoretical distribution that best fits the data
• Notice that:
– Both types of data (sample, assessed) can be used
with both types of distributions.
– Sometimes it is easier to model a discrete distribution
as a continuous distribution.
Using data to construct probability
distributions
• Constructing a discrete distribution from data
– Count the number of occurrences of each
category.
– Assign probabilities to the categories
• The probabilities are relative frequencies.
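The two steps above can be sketched in a few lines of Python. The category labels and observations are hypothetical and serve only to illustrate counting and relative frequencies.

```python
from collections import Counter

# Hypothetical categorical observations, used only for illustration.
observations = ["low", "medium", "medium", "high", "medium",
                "low", "high", "medium", "medium", "low"]

counts = Counter(observations)      # step 1: occurrences of each category
n = len(observations)

# Step 2: probability of a category = its relative frequency.
probabilities = {category: count / n for category, count in counts.items()}

for category, p in probabilities.items():
    print(f"P({category}) = {p:.2f}")
```

The probabilities sum to 1 by construction, because every observation falls in exactly one category.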
Using data to construct probability
distributions
• Some judgments are needed when using data.
– Ensure you have enough data
• A minimum of five observations per category.
– Familiarize yourself with the data to check for errors in
the data:
• Can be from many sources: e.g., data collection errors, data
entry errors, …
– “Get to know your data” to ensure that it is
representative of the uncertainty or underlying
population.
– Data is historical and you need to be cautious when
using it to predict the future.
Using data to construct probability
distributions
• We now look at constructing a discrete
probability distribution.
• First, construct an empirical distribution from
a sample:
– Sort the sample values from lowest to highest
– Assign a probability to each value (1/n for each of n observations)
Using data to construct probability
distributions
A sample of 10 observations from an exponential distribution
with rate parameter λ = 1/10:

Value   Assigned probability   Cumulative probability
0.6     1/10                   1/10
2       1/10                   2/10
3.5     1/10                   3/10
4       1/10                   4/10
5.7     1/10                   5/10
7.1     1/10                   6/10
10.6    1/10                   7/10
14.1    1/10                   8/10
19.2    1/10                   9/10
23.7    1/10                   10/10
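As a minimal sketch, the sort-and-assign procedure can be written as follows. The function name `empirical_distribution` is made up for this illustration; the ten values are the sample from the slide.

```python
sample = [10.6, 2, 23.7, 0.6, 5.7, 19.2, 4, 14.1, 3.5, 7.1]  # slide values, unsorted

def empirical_distribution(values):
    """Return (value, probability, cumulative probability) rows, lowest value first."""
    n = len(values)
    rows = []
    for rank, x in enumerate(sorted(values), start=1):
        rows.append((x, 1 / n, rank / n))   # each value gets 1/n; cumulative is rank/n
    return rows

rows = empirical_distribution(sample)
for x, p, cumulative in rows:
    print(f"{x:>5}  {p:.1f}  {cumulative:.1f}")
```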
Using data to construct probability
distributions
An empirical
distribution can be
shown as a CDF.
Using data to construct probability
distributions
This discrete distribution approximates the underlying continuous
distribution. The more observations used, the closer the approximation.
Using data to construct probability
distributions
How can we measure how close an empirical
CDF is to the underlying continuous distribution?
– Measure how far apart the two distributions are
by measuring the vertical distance between them
• E.g., Kolmogorov-Smirnov distance
– Compare the mean and standard deviation of the
fitted distribution to the underlying distribution
• In both cases, the closer or smaller the
measured difference, the better the
approximation.
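A hedged sketch of the Kolmogorov-Smirnov distance, applied to the ten-value exponential sample used earlier in these slides. The helper names are made up for this example; the calculation checks both sides of each step of the empirical CDF.

```python
import math

sample = sorted([0.6, 2, 3.5, 4, 5.7, 7.1, 10.6, 14.1, 19.2, 23.7])
rate = 1 / 10   # rate of the exponential distribution the sample was drawn from

def exponential_cdf(x, rate):
    return 1.0 - math.exp(-rate * x)

def ks_distance(sorted_values, cdf):
    """Largest vertical gap between the empirical step CDF and a continuous CDF."""
    n = len(sorted_values)
    gap = 0.0
    for i, x in enumerate(sorted_values, start=1):
        f = cdf(x)
        # The step function jumps from (i - 1)/n to i/n at x, so check both sides.
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap

d = ks_distance(sample, lambda x: exponential_cdf(x, rate))
print(f"K-S distance: {d:.3f}")
```

A smaller distance means the empirical CDF tracks the continuous distribution more closely.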
Using data to construct probability
distributions
• Some important formulas (point estimates):
Sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Using data to fit probability
distributions
• Instead of constructing a distribution empirically
from sample data, you can look for a theoretical
distribution that closely matches the data.
• Fitting a theoretical distribution to data means
finding the values of the parameters such that
the theoretical distribution matches the data as
closely as possible.
– Parameters are the key characteristics that specify a
distribution.
• E.g., the parameters of a normal distribution are mean and
standard deviation.
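One simple way to find parameter values is to compute them from sample statistics. The sketch below does this for a normal and an exponential distribution, using the ten-value sample from the earlier slide. It illustrates the idea only and is not necessarily the procedure a fitting tool uses.

```python
import statistics

sample = [0.6, 2, 3.5, 4, 5.7, 7.1, 10.6, 14.1, 19.2, 23.7]

# Normal distribution: the parameters are the mean and the standard deviation.
mu_hat = statistics.mean(sample)
sigma_hat = statistics.stdev(sample)    # sample standard deviation (n - 1 divisor)

# Exponential distribution: the single parameter is the rate, estimated as 1 / mean.
rate_hat = 1 / mu_hat

print(f"normal fit: mean = {mu_hat:.2f}, sd = {sigma_hat:.2f}")
print(f"exponential fit: rate = {rate_hat:.4f}")
```

The estimated rate can be compared with the rate 1/10 that generated the sample.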
Using data to fit probability
distributions
Standard deviation (s or σ) of a normal
distribution
Using data to fit probability
distributions
However, the best theoretical distribution for
sample data is not always the best fitting one
based on parameters. Why?
– The top fitting distributions are very close to each
other.
– Also keep in mind that some distributions have a
great deal of flexibility in shaping to match data.
– Different measures of fit may produce different
results.
Using data to fit probability
distributions
@RISK is a good tool for matching distributions.
– It can run the fit on all of the distributions in its
library.
– It uses three measures of fit that compare the
fitted theoretical distribution to the
sample.
1. Kolmogorov-Smirnov distance
– Based on maximum vertical distance between distribution and
data
2. Anderson-Darling distance
– Similar to K-S distance but factors in the extreme tails
3. Chi-Squared distance
– Based on matching fractiles of distribution and data
Example 1
• Assessed yearly profits of an income property
(obtained from Triangular(-25000, 18300, 24000))
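Assessed data are pairs (xi, pi) of a value and its cumulative probability. The sketch below produces such pairs from the triangular distribution, assuming the three parameters are given in the order minimum, mode, maximum. The profit levels are hypothetical and chosen only for illustration.

```python
def triangular_cdf(x, low, mode, high):
    """P(X <= x) for a triangular distribution with the given minimum, mode and maximum."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))

# Assumption: Triangular(-25000, 18300, 24000) means (minimum, mode, maximum).
low, mode, high = -25000, 18300, 24000

# Hypothetical profit levels at which cumulative probabilities are assessed.
profit_levels = [-10000, 0, 10000, 18300, 22000]
assessed = [(x, triangular_cdf(x, low, mode, high)) for x in profit_levels]

for x, p in assessed:
    print(f"P(profit <= {x}) = {p:.3f}")
```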
Simulation
• An imitation that reflects the operation of a real-
world process/system over time.
• Many real-world systems are so complex that
they cannot be solved mathematically.
– Hence, numerical, computer-based simulation can be
used to imitate the system behavior.
• Simulations are used as:
– Analytical tool: predicts the effect of changes to
existing systems.
– Design tool: predicts the performance of new systems.
• Simulation models are “run” rather than solved.
Introduction to Monte Carlo
simulations
• Generate a Uniform Random Variable
U~Uniform(0,1).
• In Excel, enter “=RAND()”, then press F9 to
generate a new realization of the random variable (RV).
American
Roulette Wheel:
- 18 black
- 18 red
- 2 green
- Total: 38 slots
Roulette – Monte Carlo simulation
• Example: Generate an integer in [0, 37].
• We have the capability to generate
U~Uniform(0,1).
– Generate U(0,1)
– If i/38 ≤ U < (i+1)/38, then the generated
number is i, for i = 0, …, 37.
– In other words: outcome = floor(U * 38)
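The same mapping can be sketched in Python instead of Excel. The seed and the number of spins are arbitrary choices made for this illustration.

```python
import math
import random

def spin(rng):
    """Map one Uniform(0,1) draw to a slot number 0..37."""
    u = rng.random()              # U ~ Uniform(0, 1), in [0, 1)
    return math.floor(u * 38)     # the i with i/38 <= U < (i+1)/38

rng = random.Random(42)           # fixed seed so the run is repeatable
spins = [spin(rng) for _ in range(38_000)]

# Each slot should come up roughly 1,000 times out of 38,000 spins.
counts = [spins.count(i) for i in range(38)]
print(min(counts), max(counts))
```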
Roulette
• Let’s transform coin_toss.xlsx into roulette.xlsx
Statistical recalls
• Population mean, µ: not a random variable
• Sample mean, $\bar{X}$: a random variable
• $\bar{X}$ is the best estimate of µ
• Using our sample data,
– Calculate a Confidence Interval on µ
– Construct a hypothesis test.
Constructing a confidence interval
• By the law of large numbers, if we take n samples from a
population, the mean of the n-samples tends towards the
actual mean of the population when n tends towards
infinity.
$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu$
Constructing a confidence interval
• A 1-α confidence interval is constructed so that,
over repeated samples, a fraction 1-α of such
intervals contains the actual population
mean.
• Example:
– A 95% confidence interval for a simulation
outcome is [15, 25]. This means that we have 95%
confidence that the population mean is
somewhere between 15 and 25.
Constructing a confidence interval
• Given that $\bar{X}$ tends to become normally
distributed, $\bar{X} \sim N(\mu, s/\sqrt{n})$, the boundaries of
a 95% confidence interval are defined as:
– Low bound = $\bar{X} - 1.96 \cdot s/\sqrt{n}$
– High bound = $\bar{X} + 1.96 \cdot s/\sqrt{n}$
Example: 32 simulation runs of 100 coin tosses
each, with P(head) = 0.75

Simulation run   X = number of heads
1                72
2                74
3                81
…                …
29               78
30               78
31               70
32               69

95% confidence interval
= $\bar{X} \pm 1.96 \cdot s/\sqrt{n}$
= 74.5 ± 1.96 × 4.786/√32
= 74.5 ± 1.96 × 0.846
= 74.5 ± 1.658
95% confidence interval: [72.842, 76.158]
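A minimal sketch of this experiment: it simulates 32 runs of 100 tosses and applies the low and high bound formulas from the previous slide. The seed is arbitrary, so the simulated numbers will differ from those in the table. The helper `ci95` is a name made up for this example.

```python
import math
import random
import statistics

def ci95(x_bar, s, n):
    """95% confidence interval: x_bar -/+ 1.96 * s / sqrt(n)."""
    half_width = 1.96 * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

rng = random.Random(1)            # arbitrary fixed seed for repeatability
RUNS, TOSSES, P_HEAD = 32, 100, 0.75

# One simulation run = number of heads in 100 tosses.
heads_per_run = [sum(rng.random() < P_HEAD for _ in range(TOSSES))
                 for _ in range(RUNS)]

x_bar = statistics.mean(heads_per_run)
s = statistics.stdev(heads_per_run)
low, high = ci95(x_bar, s, RUNS)
print(f"mean = {x_bar:.2f}, s = {s:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Calling `ci95(74.5, 4.786, 32)` applies the same formula to the summary statistics reported on the slide.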
Example: warehouse storage
• Our warehouse can store 80 items.
• The warehouse is refilled when it becomes half
empty (after 40 items have been withdrawn).
• Daily demand probability distribution is:
• P(Daily demand = 0 items) = 0.10
• P(Daily demand = 1 item) = 0.15
• P(Daily demand = 2 items) = 0.20
• P(Daily demand = 3 items) = 0.30
• P(Daily demand = 4 items) = 0.20
• P(Daily demand = 5 items) = 0.05
Random number mapping
A number between 0 and 1 is selected randomly. The daily demand is determined
by the mapping below.

Random number   Probability   Demand
0 to 0.10       0.10          0
0.10 to 0.25    0.15          1
0.25 to 0.45    0.20          2
0.45 to 0.75    0.30          3
0.75 to 0.95    0.20          4
0.95 to 1       0.05          5
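The mapping from a random number to a demand value can be sketched as a lookup on the cumulative probabilities. The function name `demand_for` is made up for this illustration; the example calls use random numbers that appear in the simulation table of this example.

```python
import bisect

demands = [0, 1, 2, 3, 4, 5]
probabilities = [0.10, 0.15, 0.20, 0.30, 0.20, 0.05]

# Upper edge of each interval: 0.10, 0.25, 0.45, 0.75, 0.95, 1.00
upper_edges = []
running = 0.0
for p in probabilities:
    running += p
    upper_edges.append(running)

def demand_for(u):
    """Return the daily demand whose interval contains the random number u in [0, 1)."""
    index = bisect.bisect_right(upper_edges, u)
    return demands[min(index, len(demands) - 1)]   # guard against rounding at the top edge

print(demand_for(0.191), demand_for(0.799), demand_for(0.836))
```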
Simulation run 2 (last days shown), X2 = 15:

Day   Random number   Daily demand   Cumulative demand
13    0.191           1              35
14    0.799           4              39
15    0.836           4              43
After 30 runs, we obtain the following:

Simulation run   X = number of days
1                14
2                15
3                18
…                …
28               15
29               16
30               17

$\bar{X}$ = 16.7, s = 1.705, n = 30

95% confidence interval
= $\bar{X} \pm 1.96 \cdot s/\sqrt{n}$
= 16.7 ± 1.96 × 1.705/√30
= 16.7 ± 1.96 × 0.311
= 16.7 ± 0.610
95% confidence interval: [16.090, 17.310]
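The whole warehouse experiment can be sketched as below. Two assumptions are made here: "half empty" is read as 40 items withdrawn from a full warehouse of 80, and `random.choices` with weights stands in for the interval mapping (the two are equivalent in distribution). The seed is arbitrary, so results will differ from the table.

```python
import math
import random
import statistics

DEMANDS = [0, 1, 2, 3, 4, 5]
PROBABILITIES = [0.10, 0.15, 0.20, 0.30, 0.20, 0.05]
CAPACITY = 80
REFILL_AFTER = CAPACITY // 2      # assumption: refill once 40 items are withdrawn

def days_until_refill(rng):
    """Simulate daily demand until cumulative demand reaches the refill point."""
    withdrawn, days = 0, 0
    while withdrawn < REFILL_AFTER:
        withdrawn += rng.choices(DEMANDS, weights=PROBABILITIES)[0]
        days += 1
    return days

rng = random.Random(7)            # arbitrary fixed seed
runs = [days_until_refill(rng) for _ in range(30)]

x_bar = statistics.mean(runs)
s = statistics.stdev(runs)
half_width = 1.96 * s / math.sqrt(len(runs))
print(f"mean days = {x_bar:.1f}, "
      f"95% CI = [{x_bar - half_width:.3f}, {x_bar + half_width:.3f}]")
```

Since the expected daily demand is 2.5 items, a mean of roughly 16 to 17 days is what one would expect.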
Back to central limit theorem
• Central limit theorem: If all samples of a particular
size are selected from any population, the sampling
distribution of the sample mean is approximately a
normal distribution, provided the sample size is large enough.
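A hedged sketch of the theorem in action: sample means of size 30 are drawn from a clearly non-normal (exponential) population, and their spread is compared with the normal prediction σ/√n. The population, seed and number of samples are choices made for this illustration.

```python
import math
import random
import statistics

rng = random.Random(2024)
RATE = 1 / 10           # exponential population: mean 10, standard deviation 10
SAMPLE_SIZE = 30
NUM_SAMPLES = 5000

sample_means = [
    sum(rng.expovariate(RATE) for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(NUM_SAMPLES)
]

mean_of_means = sum(sample_means) / NUM_SAMPLES
sd_of_means = statistics.stdev(sample_means)
predicted_sd = (1 / RATE) / math.sqrt(SAMPLE_SIZE)    # sigma / sqrt(n)

# Share of sample means within 1.96 predicted standard deviations of the population mean;
# for a normal distribution this share would be about 95%.
within = sum(abs(m - 1 / RATE) <= 1.96 * predicted_sd for m in sample_means) / NUM_SAMPLES

print(f"mean of means = {mean_of_means:.2f}")
print(f"sd of means = {sd_of_means:.3f} (predicted {predicted_sd:.3f})")
print(f"share within 1.96 sd = {within:.3f}")
```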
Mechanics of simulations
1. Construct a deterministic model
– No probability distributions are in this model.
2. Apply (“embed”) distributions to the
constant values in the deterministic model
where you expect variation or uncertainty
– You now have a probabilistic or stochastic model.
– These distributions may be assumed or based on
beliefs.
Mechanics of simulations
3. Randomly draw (sample) values from the
distributions to apply to the model for
recalculation
– With each new draw, you run a different
combination of input values through your model, over thousands of iterations.
– Each iteration uses a single sample from each distribution.
4. Plot the outcomes of the iterations of the model
– This gives you the distribution of the uncertainty of
interest – risk profile
– You can now factor probabilities into your decisions.
Mechanics of simulations
Iteration – a recalculation of the model
• For every iteration, a new value is chosen for
each uncertainty according to the corresponding
probability distribution, and this value is used in
the calculations for that particular iteration.
• Increasing the number of iterations results in
sampled values more closely aligned with the
distribution
• At a minimum, run 1,000 iterations; tens of thousands is not
unusual
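The four steps can be sketched end to end as follows. The profit model, its numbers and the two input distributions are entirely hypothetical and were chosen only to show the mechanics.

```python
import random
import statistics

def profit(units_sold, unit_cost, price=20.0, fixed_cost=5000.0):
    """Step 1: deterministic model (hypothetical numbers, no probabilities)."""
    return (price - unit_cost) * units_sold - fixed_cost

base_case = profit(units_sold=1000, unit_cost=12.0)

def one_iteration(rng):
    """Steps 2 and 3: replace the uncertain constants with draws from assumed distributions."""
    units_sold = rng.triangular(600, 1400, 1000)    # arguments are (low, high, mode)
    unit_cost = rng.uniform(10.0, 14.0)
    return profit(units_sold, unit_cost)

rng = random.Random(0)
outcomes = sorted(one_iteration(rng) for _ in range(10_000))

# Step 4: summarize the risk profile of the outcome.
mean_profit = statistics.fmean(outcomes)
p5 = outcomes[int(0.05 * len(outcomes))]
p95 = outcomes[int(0.95 * len(outcomes))]
prob_loss = sum(o < 0 for o in outcomes) / len(outcomes)

print(f"base case = {base_case:.0f}, mean = {mean_profit:.0f}")
print(f"5th percentile = {p5:.0f}, 95th percentile = {p95:.0f}, P(loss) = {prob_loss:.3f}")
```

The sorted list of outcomes is the raw material for a risk profile: plotting it against cumulative rank gives the cumulative distribution of profit.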
Simulation Process: (1) Deterministic model – no probabilities (not shown); (2)
Generic (stochastic) model with probabilities: (3) Iterations; (4) Risk profile
Sampling from probability
distributions
• Problem: how to draw a representative sample of
size n from a given probability distribution for an
uncertain variable X
– Needed to run iterations of the stochastic model
• Solution: a mathematical theorem states that as
long as we choose the probability values p
uniformly (every possible value is equally likely)
from the interval (0, 1), then the corresponding x
values read from the CDF, where P(X ≤ x) = p, will have the
desired distribution
– This theorem is foundational for simulation programs.
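A minimal sketch of this idea for the exponential distribution, whose CDF can be inverted in closed form. The rate, seed and sample size are choices made for this illustration.

```python
import math
import random

def inverse_exponential_cdf(p, rate):
    """Solve F(x) = 1 - exp(-rate * x) = p for x."""
    return -math.log(1.0 - p) / rate

rng = random.Random(123)
rate = 1 / 10

# Draw p uniformly from [0, 1), then read off the x value with F(x) = p.
draws = [inverse_exponential_cdf(rng.random(), rate) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)                        # theoretical mean: 1 / rate = 10
share_below_10 = sum(x <= 10 for x in draws) / len(draws)    # theoretical: 1 - exp(-1)

print(f"sample mean = {sample_mean:.2f}")
print(f"P(X <= 10) = {share_below_10:.3f} (theory {1 - math.exp(-1):.3f})")
```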
@Risk exercise
• Test the CLT on a sample of size 30 from a
population with the following probability
distributions,
Risk Profile Example
• A risk profile has been constructed for a certain
project:
› 45% chance profit is triangular, Tr(25, 36, 40)
› 35% chance profit is uniform, U(10, 35)
› 20% chance of a loss of exactly 25
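This risk profile is a mixture of three branches, and it can be sampled by first picking a branch and then drawing from it. The sketch assumes Tr(25, 36, 40) lists minimum, mode and maximum; the seed and sample size are arbitrary.

```python
import random

def draw_profit(rng):
    """One draw from the three-branch risk profile."""
    u = rng.random()
    if u < 0.45:
        return rng.triangular(25, 40, 36)    # random.triangular takes (low, high, mode)
    if u < 0.80:
        return rng.uniform(10, 35)
    return -25.0                             # loss of exactly 25

rng = random.Random(5)
profits = [draw_profit(rng) for _ in range(100_000)]

expected_profit = sum(profits) / len(profits)
prob_loss = sum(p < 0 for p in profits) / len(profits)

# Exact expected value for comparison: weighted average of the branch means.
exact = 0.45 * (25 + 36 + 40) / 3 + 0.35 * (10 + 35) / 2 + 0.20 * (-25)

print(f"simulated E[profit] = {expected_profit:.2f}, exact = {exact:.3f}")
print(f"simulated P(loss) = {prob_loss:.3f}")
```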
Leah Sanchez Example
• Let’s look at the Leah Sanchez calendar sales
example (page 482).
Influence diagram
Develop a deterministic model
Next step is the deterministic model: a static model whose
consequence value is completely determined by the input values.
Here is Leah’s in Excel.
Figure callouts: “This is what Leah wants to know.” “This is uncertain.”
Demand is actually uncertain.
• Therefore, Leah wants to include the demand
uncertainty in the analysis.
• After assessing Leah’s cumulative probabilities for
various demand levels, we fit a probability
distribution.
• The distribution happens to be a general beta
distribution with:
– Min=600
– Max=1400
– α=2
– β=18
The demand distribution
Simulations
• Now Leah can simulate, for a given order
quantity, the associated distribution of the
profit given the uncertain demand.
Profit prob. distribution when 680 calendars are ordered.
Why do we have a big jump at $6,120?
• Because there is a 42% probability that demand
is greater than or equal to 680 (the order
quantity). In all of those cases every calendar ordered
is sold, so profit takes the same maximum value.
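The 42% figure can be checked with a short simulation of the demand distribution. The sketch assumes the general beta distribution is a standard Beta(α, β) rescaled to the interval [Min, Max]; the seed and sample size are arbitrary.

```python
import random

MIN_DEMAND, MAX_DEMAND = 600, 1400
ALPHA, BETA = 2, 18
ORDER_QUANTITY = 680

def draw_demand(rng):
    """General beta (assumed form): a Beta(alpha, beta) draw rescaled to [min, max]."""
    return MIN_DEMAND + (MAX_DEMAND - MIN_DEMAND) * rng.betavariate(ALPHA, BETA)

rng = random.Random(11)
demands = [draw_demand(rng) for _ in range(100_000)]
simulated = sum(d >= ORDER_QUANTITY for d in demands) / len(demands)

# Closed form when alpha = 2: P(B >= x) = (1 - x)**beta * (1 + beta * x)
x = (ORDER_QUANTITY - MIN_DEMAND) / (MAX_DEMAND - MIN_DEMAND)
exact = (1 - x) ** BETA * (1 + BETA * x)

print(f"P(demand >= {ORDER_QUANTITY}): simulated {simulated:.3f}, exact {exact:.3f}")
```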
Further investigations
• Leah decides to vary the order quantity and check E[profits]
Zooming in for orders around 700
Leah also checks the 5th and 95th percentiles.
Conclusion
• Leah has gained useful information and insight
from simulating the calendar ordering problem,
and now has a much better understanding of the
distribution of Profit for each value of Order
Quantity she is considering.
EP-T three-point approximation

Demand   Fractile   Probability
615      0.05       0.185
…        0.50       0.63
780      0.95       0.185
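A hedged sketch of the extended Pearson-Tukey (EP-T) approximation for Leah's demand distribution. It assumes the general beta is a Beta(2, 18) rescaled to [600, 1400] and uses the standard EP-T weights of 0.185, 0.63 and 0.185 on the 0.05, 0.50 and 0.95 fractiles. The fractiles are found numerically, so they may differ slightly from rounded slide values.

```python
MIN_DEMAND, MAX_DEMAND = 600, 1400
BETA = 18   # alpha = 2 is built into the closed-form CDF below

def demand_cdf(d):
    """CDF of the rescaled Beta(2, beta): 1 - (1 - x)**beta * (1 + beta * x)."""
    x = (d - MIN_DEMAND) / (MAX_DEMAND - MIN_DEMAND)
    return 1.0 - (1.0 - x) ** BETA * (1.0 + BETA * x)

def fractile(p, tol=1e-6):
    """Find the demand d with CDF(d) = p by bisection on [min, max]."""
    lo, hi = float(MIN_DEMAND), float(MAX_DEMAND)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if demand_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# EP-T: three fractiles, each paired with its probability weight.
points = [(fractile(0.05), 0.185), (fractile(0.50), 0.63), (fractile(0.95), 0.185)]
ept_mean = sum(value * weight for value, weight in points)
exact_mean = MIN_DEMAND + (MAX_DEMAND - MIN_DEMAND) * 2 / (2 + BETA)

for value, weight in points:
    print(f"demand {value:7.1f} with probability {weight}")
print(f"EP-T mean = {ept_mean:.1f}, exact mean = {exact_mean:.1f}")
```

Replacing the continuous distribution with three weighted points is what lets the same uncertainty be placed in a decision tree.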
Simulation vs. decision tree
“When should I use simulation, and when should
I use decision trees?”
– In many cases both approaches work fine.
– However, there are two key issues to consider:
• If your decision situation involves a large number of
uncertainties, the necessarily large decision tree can be
very clumsy to work with. Use a simulation approach.
• If your decision situation involves future or
“downstream” decisions, then a decision tree might be
easier to work with.