Data and Monte Carlo Simulations

Probability and Decision Analysis
Using Data and Monte Carlo Simulations
Overview
• Constructing probability distributions from
data
• Fitting data to theoretical probability
distributions
• Understanding the basics of Monte Carlo
Simulations
Preliminaries
We will look at two types of data:
– Sample data
• Denoted as x1, x2,…, xn, for n observations, where xi is a
known number
– Subjectively assessed data
• Denoted as (x1, p1), (x2, p2), …, (xn, pn), for n pairs of values,
where pi is the cumulative probability associated with xi, that
is, P(X ≤ xi) = pi, i = 1, 2, …, n
• In both cases, we are looking at a subset of
the underlying population of the uncertainty.
• And we will examine two ways to use data to
construct probability distributions:
– Directly construct the distribution based on the data
– Select a theoretical distribution that best fits the data
• Notice that:
– Both types of data (sample, assessed) can be used
with both types of distributions.
– Sometimes it is easier to model a discrete distribution
as a continuous distribution.
Using data to construct probability
distributions
• Constructing a discrete distribution from data
– Count the number of occurrences of each
category.
– Assign probabilities to the categories
• The probabilities are relative frequencies.
› If the data are sample values, the discrete probabilities are
relative frequencies.
› If the data are subjective assessments, the discrete
probabilities are simply the assessments.
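The counting step for sample data can be sketched in Python as follows. The sample list and the helper name `relative_frequencies` are hypothetical and chosen only for illustration; they are not from the slides.

```python
from collections import Counter

def relative_frequencies(observations):
    """Return {category: relative frequency} for a list of observed categories."""
    counts = Counter(observations)   # number of occurrences of each category
    n = len(observations)
    return {category: count / n for category, count in counts.items()}

# Hypothetical sample of 10 categorical observations (illustration only).
sample = ["sale", "no sale", "no sale", "sale", "no sale",
          "no sale", "sale", "no sale", "no sale", "no sale"]

probs = relative_frequencies(sample)
print(probs)
```

The resulting probabilities sum to 1 by construction, because every observation falls into exactly one category.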
• Some judgments are needed when using data.
– Ensure you have enough data
• A minimum of five observations per category.
– Familiarize yourself with the data to check for errors:
• Errors can come from many sources, e.g., data collection errors, data
entry errors, …
– “Get to know your data” to ensure that it is
representative of the uncertainty or underlying
population.
– Data is historical and you need to be cautious when
using it to predict the future.
• We now look at constructing a discrete
probability distribution.
• First, construct an empirical distribution from
a sample:
– Sort the sample values from lowest to highest
– Assign probabilities to each value
A sample of 10 observations from an exponential distribution
with rate parameter λ = 1/10.

Observation   Assigned Probability   Cumulative Probability
0.6           1/10                   1/10
2             1/10                   2/10
3.5           1/10                   3/10
4             1/10                   4/10
5.7           1/10                   5/10
7.1           1/10                   6/10
10.6          1/10                   7/10
14.1          1/10                   8/10
19.2          1/10                   9/10
23.7          1/10                   10/10
An empirical
distribution can be
shown as a CDF.
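As a minimal sketch, the empirical CDF of the 10-observation sample can be built in Python by sorting the values and returning the fraction of observations at or below a given x. The function name `empirical_cdf` is an illustrative choice, not part of the slides.

```python
from bisect import bisect_right

# The 10 observations from the exponential sample.
sample = [0.6, 2, 3.5, 4, 5.7, 7.1, 10.6, 14.1, 19.2, 23.7]

def empirical_cdf(data):
    """Return a function F where F(x) is the fraction of observations <= x."""
    xs = sorted(data)            # sort the sample from lowest to highest
    n = len(xs)

    def cdf(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(xs, x) / n

    return cdf

F = empirical_cdf(sample)
print(F(5.7))   # 0.5: five of the ten observations are <= 5.7
print(F(0.0))   # 0.0: no observation is <= 0
```

Each observation contributes a step of height 1/n, which is why the plotted CDF looks like a staircase.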
This discrete distribution approximates the underlying continuous
distribution. The more observations used, the closer the approximation.
[Figure: empirical CDFs based on 10 observations and based on 30 observations]
How can we measure the quality or closeness of
an empirical CDF to the underlying continuous distribution?
– Measure how far apart the two distributions are
by measuring the vertical distance between them
• E.g., Kolmogorov-Smirnov distance
– Compare the mean and standard deviation of the
constructed distribution to those of the underlying distribution
• In both cases, the closer or smaller the
measured difference, the better the
approximation.
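The vertical-distance idea can be sketched in Python for the 10-observation sample and the exponential distribution it was drawn from (rate 1/10). This is an illustrative implementation of the Kolmogorov-Smirnov distance; the function names are assumptions, not taken from @RISK or the slides.

```python
import math

sample = [0.6, 2, 3.5, 4, 5.7, 7.1, 10.6, 14.1, 19.2, 23.7]
RATE = 1 / 10   # rate parameter of the underlying exponential distribution

def exponential_cdf(x, rate=RATE):
    """CDF of the exponential distribution: 1 - exp(-rate * x) for x > 0."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def ks_distance(data, cdf):
    """Largest vertical gap between the empirical CDF of data and cdf."""
    xs = sorted(data)
    n = len(xs)
    largest = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x, so check both sides.
        largest = max(largest, abs(i / n - f), abs(f - (i - 1) / n))
    return largest

d = ks_distance(sample, exponential_cdf)
print(f"K-S distance = {d:.4f}")
```

A smaller distance indicates a closer match; repeating the calculation with a larger sample would typically shrink it.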
• Some important formulas (point estimates):
Sample mean
Estimate of population standard deviation based on Sample data
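The formulas themselves did not survive on this slide. Assuming the usual definitions, with the n − 1 denominator for the sample-based estimate of the population standard deviation, they would read:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```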
Using data to fit probability
distributions
• Instead of constructing a distribution empirically
from sample data, you can look for a theoretical
distribution that closely matches the data.
• Fitting a theoretical distribution to data means
finding the values of the parameters such that
the theoretical distribution matches the data as
closely as possible.
– Parameters are the key characteristics that specify a
distribution.
• E.g., the parameters of a normal distribution are mean and
standard deviation.
Standard deviation (s or σ) of a normal
distribution
However, the best theoretical distribution for
sample data is not always the best fitting one
based on parameters. Why?
– The top fitting distributions are very close to each
other.
– Also keep in mind that some distributions have a
great deal of flexibility in shaping to match data.
– Different measures of fit may produce different
results.
@RISK is a good tool for matching distributions.
– It can run the fit on all of the distributions in its
library.
– It uses three measures of fit that compare the
parameters of the theoretical distribution to the
sample.
1. Kolmogorov-Smirnov distance
– Based on maximum vertical distance between distribution and
data
2. Anderson-Darling distance
– Similar to K-S distance but factors in the extreme tails
3. Chi-Squared distance
– Based on matching fractiles of distribution and data
Example 1
• Assessed yearly profits of an income property
(obtained from Triangular(-25000, 18300, 24000))

Assessed values (xi)            Prob. (pi)
P(Yearly Profit ≤ -$25,000)     0.00
P(Yearly Profit ≤ -$10,000)     0.10
P(Yearly Profit ≤ $0)           0.30
P(Yearly Profit ≤ $15,000)      0.75
P(Yearly Profit ≤ $24,000)      1.00
To measure how close a
fitted theoretical distribution
is to the assessed values:
1) Calculate the vertical
distance between the
assessed values and the fitted
distribution.
2) Compute the root mean
square error (RMSE):
RMSE = \sqrt{\frac{Distance_1^2 + Distance_2^2 + Distance_3^2}{3}}
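A minimal Python sketch of the RMSE calculation for Example 1 follows. It assumes the fitted distribution is the triangular distribution with minimum -25,000, mode 18,300 and maximum 24,000, and that the three distances are taken at the three interior assessed points (the two end points match by construction). The function name `triangular_cdf` is illustrative.

```python
import math

def triangular_cdf(x, low, mode, high):
    """CDF of a triangular distribution with the given minimum, mode and maximum."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))

LOW, MODE, HIGH = -25_000, 18_300, 24_000

# Interior assessed points (yearly profit, cumulative probability) from Example 1.
assessed = [(-10_000, 0.10), (0, 0.30), (15_000, 0.75)]

# Vertical distance between the fitted CDF and each assessed probability.
distances = [triangular_cdf(x, LOW, MODE, HIGH) - p for x, p in assessed]

rmse = math.sqrt(sum(d ** 2 for d in distances) / len(distances))
print(f"RMSE = {rmse:.4f}")
```

The smaller the RMSE, the closer the fitted distribution tracks the assessed cumulative probabilities.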
Mechanics of simulations
• A simulation model is a mathematical model
in which a probability distribution is used to
represent the possible values of an uncertain
variable.
– Similar to decision trees
– Allows for continuous as well as discrete
distributions
Simulation
• An imitation that reflects the operation of a real-
world process/system over time.
• Many real-world systems are so complex that they
cannot be solved mathematically.
– Hence, numerical, computer-based simulation can be
used to imitate the system behavior.
• Simulations are used as:
– Analytical tool: predicts the effect of changes to
existing systems.
– Design tool: predicts the performance of new systems.
• Simulation models are “run” rather than solved.
Introduction to Monte Carlo
simulations
• Generate a uniform random variable
U ~ Uniform(0,1).
• In Excel, enter “=RAND()”, then press F9 to
generate a new random variable (RV).
• Consider an unfair coin with probability of heads
being 0.3. How can I simulate the coin toss by
generating a Uniform(0,1)?
• Generate U.
• If U < 0.3 then we obtain a head; otherwise we obtain a tail.
coin_toss.xlsx
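The same coin toss can be sketched outside Excel. The Python version below is only an illustration of the rule "heads if U < 0.3"; the function name `toss` and the seed are arbitrary choices.

```python
import random

def toss(rng, p_heads=0.3):
    """Simulate one toss of an unfair coin from a single Uniform(0,1) draw."""
    u = rng.random()                  # U ~ Uniform(0,1)
    return "H" if u < p_heads else "T"

rng = random.Random(42)               # fixed seed so the run is repeatable
tosses = [toss(rng) for _ in range(100_000)]
share_heads = tosses.count("H") / len(tosses)
print(f"Share of heads: {share_heads:.3f}")
```

With many tosses, the share of heads should settle close to 0.3.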
Roulette wheel
American
Roulette Wheel:
- 18 black
- 18 red
- 2 green
- Total: 38 slots
Roulette – Monte Carlo simulation
• Example: Generate an integer in [0, 37].
• We have the capability to generate
U ~ Uniform(0,1).
– Generate U ~ Uniform(0,1)
– If i/38 ≤ U < (i+1)/38, then the generated
number is i, for i = 0, …, 37.
– In other words: outcome = floor(U * 38)
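As a short illustration, the mapping outcome = floor(U * 38) can be written in Python as below. The function name `spin` is hypothetical, and the slot index 0 to 37 is left uninterpreted because the slides do not say which indices are black, red or green.

```python
import math
import random
from collections import Counter

def spin(u):
    """Map a Uniform(0,1) draw to one of the 38 equally likely slots 0..37."""
    return math.floor(u * 38)

print(spin(0.5))                      # 19, because 0.5 * 38 = 19.0

rng = random.Random(7)                # fixed seed so the run is repeatable
outcomes = [spin(rng.random()) for _ in range(38_000)]
counts = Counter(outcomes)
print(min(counts.values()), max(counts.values()))
```

Each slot should appear roughly 1,000 times in 38,000 spins, since every interval [i/38, (i+1)/38) has the same width.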
Roulette
• Let’s transform coin_toss.xlsx into roulette.xlsx
Statistical recalls
• Population mean, µ: not a random variable
• Sample mean, X̄: a random variable
• X̄ is the best estimate of µ
• Using our sample data,
– Calculate a confidence interval on µ
– Construct a hypothesis test.
Constructing a confidence interval
• By the law of large numbers, if we take n samples from a
population, the mean of the n samples tends towards the
actual mean of the population when n tends towards
infinity:
\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu
• Based on the central limit theorem, X̄ is a random variable
that tends to become normally distributed, \bar{X} \sim N(\mu, s/\sqrt{n}), when
n is large, where s is the estimated population standard
deviation based on the sample data.
• A 1-α confidence interval implies that there is
1-α probability that the actual population
mean falls within the boundaries of the
confidence interval.
• Example:
– A 95% confidence interval for a simulation
outcome is [15, 25]. This means that we have 95%
confidence that the population mean is
somewhere between 15 and 25.
• Given that X̄ tends to become normally
distributed, \bar{X} \sim N(\mu, s/\sqrt{n}), the boundaries of
a 95% confidence interval are defined as:
– Low bound = \bar{X} - 1.96 \cdot s/\sqrt{n}
– High bound = \bar{X} + 1.96 \cdot s/\sqrt{n}
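The two bounds can be computed directly from a list of simulation outcomes. The sketch below is illustrative: the function name `confidence_interval_95` and the data are hypothetical, and the sample standard deviation uses the n − 1 denominator.

```python
import math
import statistics

def confidence_interval_95(values):
    """Return (low, high) bounds of a 95% confidence interval on the mean."""
    n = len(values)
    x_bar = statistics.fmean(values)      # sample mean
    s = statistics.stdev(values)          # sample standard deviation (n - 1)
    half_width = 1.96 * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical outcomes of 36 simulation runs (illustration only).
outcomes = [10, 12, 14, 16] * 9

low, high = confidence_interval_95(outcomes)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Because the half-width is proportional to 1/√n, quadrupling the number of runs roughly halves the width of the interval.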
Example – 32 simulation runs of 100 coin tosses
each, with P(head) = 0.75

Simulation Run   X = number of heads
1                72
2                74
3                81
…                …
29               78
30               78
31               70
32               69

X̄ = 74.5, s = 4.786, n = 32

95% confidence interval
= \bar{X} \pm 1.96 \cdot s/\sqrt{n}
= 74.5 \pm 1.96 \cdot 4.786/\sqrt{32}
= 74.5 \pm 1.658
⇒ 95% confidence interval: [72.842, 76.158]

Notice that X is Binomial with parameters n = 100 and p = 0.75, so the
population mean of X is np = 75, which is within the 95% CI.
Note
• When n increases, the width of the
confidence interval is reduced.
– Our estimate of the mean becomes more precise.
Example: warehouse storage
• Our warehouse can store 80 items.
• The warehouse should be filled when it becomes half
empty.
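• Daily demand probability distribution is:
• P(Daily demand = 0 items) = 0.10
• P(Daily demand = 1 item) = 0.15
• P(Daily demand = 2 items) = 0.20
• P(Daily demand = 3 items) = 0.30
• P(Daily demand = 4 items) = 0.20
• P(Daily demand = 5 items) = 0.05
What is the expected number of days until the
warehouse becomes half empty?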
Random number mapping
A number U between 0 and 1 is selected randomly. The daily demand is determined
by the mapping below.

Range of U       Demand   Probability
0 to 0.1         0        0.10
0.1 to 0.25      1        0.15
0.25 to 0.45     2        0.20
0.45 to 0.75     3        0.30
0.75 to 0.95     4        0.20
0.95 to 1        5        0.05

If U = 0.345, then Demand is 2.
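The random number mapping and a single simulation run can be sketched in Python as follows. The function names are illustrative, and the sketch assumes each range includes its lower end and excludes its upper end, which the slides do not specify. The list `run_1` holds the random numbers of Simulation Run # 1.

```python
import random
from bisect import bisect_right

# Upper ends of the ranges for demands 0, 1, 2, 3, 4, 5.
CUMULATIVE = [0.10, 0.25, 0.45, 0.75, 0.95, 1.00]
HALF_EMPTY = 40                      # half of the 80-item capacity

def demand_from_u(u):
    """Map a Uniform(0,1) draw to a daily demand of 0..5 items."""
    return min(bisect_right(CUMULATIVE, u), 5)

def days_until_half_empty(uniforms):
    """Count days until cumulative demand reaches HALF_EMPTY items."""
    total = 0
    for day, u in enumerate(uniforms, start=1):
        total += demand_from_u(u)
        if total >= HALF_EMPTY:
            return day
    raise ValueError("not enough random numbers to reach the threshold")

def uniform_stream(rng):
    """Endless stream of Uniform(0,1) draws."""
    while True:
        yield rng.random()

run_1 = [0.651, 0.105, 0.677, 0.975, 0.818, 0.133, 0.002,
         0.818, 0.774, 0.538, 0.953, 0.616, 0.233, 0.563]
print(days_until_half_empty(run_1))   # 14, as in Simulation Run # 1

rng = random.Random(11)               # fixed seed so the run is repeatable
runs = [days_until_half_empty(uniform_stream(rng)) for _ in range(10_000)]
mean_days = sum(runs) / len(runs)
print(f"Average number of days: {mean_days:.2f}")
```

Averaging over many runs estimates the expected number of days; a confidence interval on that average can then be built from the run results.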
Simulation Run # 1
Let X be the number of days until the warehouse is half empty

Day   Random Number   Demand   Total Demand to Date
1     0.651           3        3
2     0.105           1        4
3     0.677           3        7
4     0.975           5        12
5     0.818           4        16
6     0.133           1        17
7     0.002           0        17
8     0.818           4        21
9     0.774           4        25
10    0.538           3        28
11    0.953           5        33
12    0.616           3        36
13    0.233           1        37
14    0.563           3        40

X1 = 14
Simulation Run # 2
Let X be the number of days until the warehouse is half empty

Day   Random Number   Demand   Total Demand to Date
1     0.166           1        1
2     0.963           5        6
3     0.632           3        9
4     0.828           4        13
5     0.191           1        14
6     0.919           4        18
7     0.195           1        19
8     0.64            3        22
9     0.951           5        27
10    0.785           4        31
11    0.247           1        32
12    0.396           2        34
13    0.191           1        35
14    0.799           4        39
15    0.836           4        43

X2 = 15
After 30 runs, we obtain the
following:

Simulation Run   X = number of days
1                14
2                15
3                18
…                …
28               15
29               16
30               17

X̄ = 16.7, s = 1.705, n = 30

95% confidence interval
= \bar{X} \pm 1.96 \cdot s/\sqrt{n}
= 16.7 \pm 1.96 \cdot 1.705/\sqrt{30}
= 16.7 \pm 0.610
⇒ 95% confidence interval: [16.090, 17.310]
Back to central limit theorem
• Central limit theorem: If all samples of a particular
size are selected from any population, the sampling
distribution of the sample mean is approximately a
normal distribution.
This approximation improves with larger samples.
› We can reason about the distribution of the sample mean
with no information about the shape of the population
distribution from which the sample is taken.
› The central limit theorem holds for any population distribution with a finite variance.
› As a rule of thumb, a sample of 30 or more is large enough to apply the CLT.
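The central limit theorem can be checked numerically. The Python sketch below is an illustration under assumed settings: it draws 5,000 samples of size 30 from an exponential population with mean 10 (a strongly skewed distribution) and compares the spread of the sample means with the predicted value σ/√n.

```python
import math
import random
import statistics

rng = random.Random(0)        # fixed seed so the run is repeatable
POP_MEAN = 10.0               # exponential population, mean 10 (rate 1/10)
N = 30                        # sample size
RUNS = 5000                   # number of samples drawn

sample_means = [
    statistics.fmean([rng.expovariate(1.0 / POP_MEAN) for _ in range(N)])
    for _ in range(RUNS)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
# The standard deviation of an exponential distribution equals its mean.
predicted_sd = POP_MEAN / math.sqrt(N)

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} (predicted {predicted_sd:.3f})")
```

A histogram of `sample_means` should look roughly bell-shaped even though the population itself is far from normal.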
1. Construct a deterministic model
– No probability distributions are in this model.
2. Apply (“embed”) distributions to the
constant values in the deterministic model
where you expect variation or uncertainty
– You now have a probabilistic or stochastic model.
– These distributions may be assumed or based on
beliefs.
3. Randomly draw (sample) values from the
distributions to apply to the model for
recalculation
– With each new draw, you are running different
combinations of your model through thousands of iterations.
– Each iteration is a single sample from the distribution.
4. Plot the outcomes of the iterations of the model
– This gives you the distribution of the uncertainty of
interest – risk profile
– You can now factor probabilities into your decisions.
Iteration – a recalculation of the model
• For every iteration, a new value is chosen for
each uncertainty according to the corresponding
probability distribution, and this value is used in
the calculations for that particular iteration.
• Increasing the number of iterations results in
sampled values more closely aligned with the
distribution
• At a minimum, run 1,000 iterations; tens of thousands is not
unusual
[Figure: Simulation process: (1) deterministic model, no probabilities (not shown); (2)
generic (stochastic) model with probabilities; (3) iterations; (4) risk profile]
Sampling from probability
distributions
• Problem: how to draw a representative sample of
size n from a given probability distribution for an
uncertain variable X
– Needed to run iterations of the stochastic model
• Solution: a mathematical theorem states that as
long as we choose the probability values
uniformly (every possible value is equally likely)
from the interval (0, 1), then the corresponding x
values in the CDF will have approximately the
desired distribution
– This theorem is foundational for simulation programs.
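This result is commonly known as the inverse transform method. A minimal Python sketch, assuming an exponential distribution with mean 10 whose CDF F(x) = 1 − exp(−x/10) can be inverted in closed form, is shown below; the function name is illustrative.

```python
import math
import random
import statistics

def exponential_inverse_cdf(p, mean):
    """Return the x for which the exponential CDF equals p: x = -mean * ln(1 - p)."""
    return -mean * math.log(1.0 - p)

rng = random.Random(1)                # fixed seed so the run is repeatable

# Choose probability values uniformly on [0, 1), then read off the matching x.
draws = [exponential_inverse_cdf(rng.random(), 10.0) for _ in range(100_000)]

print(f"sample mean: {statistics.fmean(draws):.3f}")   # should be close to 10
```

The same recipe works for any distribution whose CDF can be inverted, either analytically or numerically.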
@RISK exercise
• Test the CLT on a sample of size 30 from a
population with each of the following probability
distributions:
› Exponential with mean 10.
› Uniform with minimum 15 and maximum 25,
U(15,25).
Risk Profile Example
• A risk profile has been constructed for a certain
project:
› 45% chance profit is triangular Tr(25,36,40)
› 35% chance profit is uniform U(10,35)
› 20% chance of a loss of exactly 25
• Calculate the mean and standard deviation of the
profit.
• What is the probability the profit is greater than 20?
• What is the probability the profit is less than 14?
Check risk_profile.xlsx
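One way to approach this risk profile outside a spreadsheet is to simulate the mixture directly. The Python sketch below is illustrative and assumes Tr(25,36,40) means minimum 25, most likely 36 and maximum 40; it is not the content of risk_profile.xlsx.

```python
import random
import statistics

def draw_profit(rng):
    """Draw one profit value from the three-branch risk profile."""
    u = rng.random()
    if u < 0.45:
        # random.triangular takes (low, high, mode) in that order
        return rng.triangular(25, 40, 36)
    if u < 0.80:
        return rng.uniform(10, 35)
    return -25.0                       # loss of exactly 25

rng = random.Random(7)                 # fixed seed so the run is repeatable
profits = [draw_profit(rng) for _ in range(200_000)]

mean_profit = statistics.fmean(profits)
sd_profit = statistics.pstdev(profits)
p_above_20 = sum(p > 20 for p in profits) / len(profits)
p_below_14 = sum(p < 14 for p in profits) / len(profits)

print(f"mean = {mean_profit:.2f}, sd = {sd_profit:.2f}")
print(f"P(profit > 20) = {p_above_20:.3f}, P(profit < 14) = {p_below_14:.3f}")
```

The simulated values can be checked by hand: the mean is 0.45 · (25 + 36 + 40)/3 + 0.35 · 22.5 + 0.20 · (−25) = 18.025.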
NPV Example
• Check NPV_uncertainty.xlsx
• What is the probability that the NPV is above
50,000 USD?
• What is the mean NPV?
• What is the 90% confidence interval?
Leah Sanchez Example
• Let’s look at Leah Sanchez’s calendar sales
example (page 482).
Influence diagram
Develop a deterministic model
Next step is the deterministic model: a static model whose
consequence value is completely determined by the input values.
Here is Leah’s in Excel.
[Figure: Leah’s spreadsheet model, with callouts “This is what Leah wants to know.” and “This is uncertain.”]
Demand is actually uncertain.
• Therefore, Leah wants to include the demand
uncertainty in the analysis.
• After assessing Leah’s cumulative probabilities for
various demand levels, we fit a probability
distribution.
• The distribution happens to be a general beta
distribution with:
– Min=600
– Max=1400
– α=2
– β=18
The demand distribution
Simulations
• Now Leah can simulate, for a given order
quantity, the associated distribution of the
profit given the uncertain demand.
Profit prob. distribution when 680 calendars are ordered.
Why do we have a big jump at $6,120?
• Because there is a 42% probability for demand
to be greater than or equal to 680 (the order
quantity).
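The 42% figure can be checked by sampling the demand distribution. The Python sketch below assumes the general beta distribution is a standard Beta(α = 2, β = 18) variable rescaled linearly from [0, 1] to [600, 1400]; that parameterization is an assumption about the fitted distribution, not something stated on the slides.

```python
import random

MIN_DEMAND, MAX_DEMAND = 600, 1400
ALPHA, BETA = 2, 18
ORDER_QUANTITY = 680

def draw_demand(rng):
    """Draw one demand value from the rescaled beta distribution."""
    b = rng.betavariate(ALPHA, BETA)              # standard beta on [0, 1]
    return MIN_DEMAND + (MAX_DEMAND - MIN_DEMAND) * b

rng = random.Random(3)                            # fixed seed so the run is repeatable
demands = [draw_demand(rng) for _ in range(100_000)]

p_sell_out = sum(d >= ORDER_QUANTITY for d in demands) / len(demands)
print(f"P(demand >= {ORDER_QUANTITY}) = {p_sell_out:.3f}")
```

Whenever demand is at least 680, every ordered calendar is sold and profit sits at its maximum, which is what produces the jump in the profit distribution.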
Further investigations
• Leah decides to vary the order quantity and check E[profits]
Zooming in for orders around 700
Leah also checks the 5th and 95th percentiles.
Conclusion
• Leah has gained useful information and insight
from simulating the calendar ordering problem,
and now has a much better understanding of the
distribution of Profit for each value of Order
Quantity she is considering.
• She may go with the alternative that maximizes
expected Profit (700) or she may choose another
alternative, particularly if she wants to reduce
risks.
Simulation vs. decision tree
What if Leah used decision-tree modeling instead of
simulation modeling?
• First, substitute a discrete distribution for the
continuous beta distribution used in simulation:
– Leah uses the extended Pearson-Tukey (EP-T)
distribution – it requires only three points.
– The next slide shows a decision-tree for only two
different values for Order Quantity.
– However, a full analysis would use several possible
values, just as in the simulation model.
EP-T three-point approximation

Fractile   Demand   Probability
0.05       615      0.185
0.5        679      0.63
0.95       780      0.185
“When should I use simulation, and when should
I use decision trees?”
– In many cases both approaches work fine.
– However, there are two key issues to consider:
• If your decision situation involves a large number of
uncertainties, the necessarily large decision tree can be
very clumsy to work with. Use a simulation approach.
• If your decision situation involves future or
“downstream” decisions, then a decision tree might be
easier to work with.