

CHAPTER 6
Selecting Input Probability Distributions


6.1 Introduction
6.2 Useful Probability Distributions
6.2.1 Parameterization of Continuous Distributions
6.2.2 Continuous Distributions
6.2.3 Discrete Distributions
6.2.4 Empirical Distributions
6.3 Techniques for Assessing Sample Independence
6.4 Activity I: Hypothesizing Families of Distributions
6.4.1 Summary Statistics
6.4.2 Histograms
6.4.3 Quantile Summaries and Box Plots
6.5 Activity II: Estimation of Parameters
6.6 Activity III: Determining How Representative the Fitted Distributions Are
6.6.1 Heuristic Procedures
6.6.2 Goodness-of-Fit Tests
6.7 The ExpertFit Software and an Extended Example
6.8 Shifted and Truncated Distributions
6.9 Bézier Distributions
6.10 Specifying Multivariate Distributions, Correlations, and Stochastic Processes
6.10.1 Specifying Multivariate Distributions
6.10.2 Specifying Arbitrary Marginal Distributions and Correlations
6.10.3 Specifying Stochastic Processes
6.11 Selecting a Distribution in the Absence of Data
6.12 Models of Arrival Processes
6.12.1 Poisson Processes
6.12.2 Nonstationary Poisson Process
6.12.3 Batch Arrivals
6.13 Assessing the Homogeneity of Different Data Sets

6.1 Introduction

Part of modeling: deciding what input probability distributions to use as input to the simulation, for example for:
Interarrival times
Service/machining times
Demand/batch sizes
Machine up/down times

Inappropriate input distribution(s) can lead to incorrect output, bad decisions

Usually, have observed data on input quantities; options for use:

Trace-driven: use the actual data values to drive the simulation
  Pros: valid vis-à-vis the real world; direct
  Cons: not generalizable

Empirical distribution: use the data values to define a "connect-the-dots" distribution (several specific ways)
  Pros: fairly valid; simple; fairly direct
  Cons: may limit the range of generated variates (depending on form)

Fitted standard distribution: use the data to fit a classical distribution (exponential, uniform, Poisson, etc.)
  Pros: generalizable; fills in "holes" in the data
  Cons: may not be valid; may be difficult

6.2 Useful Probability Distributions

Many distributions exist that have been found useful for simulation input modeling

6.2.1 Parameterization of Continuous Distributions

Alternative ways to parameterize most distributions; not consistently done

Typically, parameters can be classified as one of:

Location parameter γ (also called shift parameter): specifies an abscissa (x axis) location point of a distribution's range of values, often some kind of midpoint of the distribution.
o Example: μ for the normal distribution
o As γ changes, the distribution just shifts left or right without changing its spread or shape
o If X has location parameter 0, then X + γ has location parameter γ

Scale parameter β: determines the scale, or units of measurement, or spread, of a distribution.
o Examples: σ for the normal distribution, β for the exponential distribution
o As β changes, the distribution is compressed or expanded without changing its shape
o If X has scale parameter 1, then βX has scale parameter β

Shape parameter α: determines, separately from location and scale, the basic form or shape of a distribution
o Examples: the normal and exponential distributions do not have a shape parameter; α for the gamma and Weibull distributions
o May have more than one shape parameter (the beta distribution has two shape parameters)
o A change in the shape parameter(s) alters the distribution's shape more fundamentally than changes in the scale or location parameters

6.2.2 Continuous Distributions

Compendium of 13 continuous distributions

Possible applications
Density and distribution functions (where applicable)
Parameter definitions and ranges
Range of possible values
Mean, variance, mode
Maximum-likelihood estimator formula or method
General comments, including relationships to other distributions
Plots of densities


6.2.3 Discrete Distributions

Compendium of 6 discrete distributions, with similar information as for continuous
distributions
6.2.4 Empirical Distributions

Use the observed data themselves to specify an empirical distribution directly; useful when no standard distribution fits the data adequately

There are many different ways to specify empirical distributions, resulting in
different distributions with different properties

Continuous Empirical Distributions

If original individual data points are available (i.e., data are not grouped)

Sort the data $X_1, X_2, \ldots, X_n$ into increasing order: $X_{(i)}$ is the ith smallest
Define $F(X_{(i)}) = (i - 1)/(n - 1)$, approximately (for large n) the proportion of the data less than $X_{(i)}$, and interpolate linearly between observed data points:

$$F(x) = \begin{cases} 0 & \text{if } x < X_{(1)} \\[4pt] \dfrac{i - 1}{n - 1} + \dfrac{x - X_{(i)}}{(n - 1)\left(X_{(i+1)} - X_{(i)}\right)} & \text{if } X_{(i)} \le x < X_{(i+1)} \text{ for } i = 1, 2, \ldots, n - 1 \\[4pt] 1 & \text{if } X_{(n)} \le x \end{cases}$$

Potential disadvantages
Generated data will be within range of observed data
Expected value of this distribution is not the sample mean
Other ways to define continuous empirical distributions, including putting an
exponential tail on the right to make the range infinite on the right
Note: F(x) rises most steeply over regions where observations are dense, as desired.
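As a concrete illustration, here is a minimal Python sketch (my own, not from the text; it assumes the data values are distinct) of evaluating this piecewise-linear empirical CDF:

```python
import numpy as np

def empirical_cdf(data, x):
    """Piecewise-linear empirical CDF with F(X_(i)) = (i-1)/(n-1),
    interpolated linearly between sorted (assumed distinct) observations."""
    xs = np.sort(np.asarray(data, dtype=float))   # order statistics
    n = len(xs)
    if x < xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    i = np.searchsorted(xs, x, side="right")      # xs[i-1] <= x < xs[i]
    return (i - 1) / (n - 1) + (x - xs[i - 1]) / ((n - 1) * (xs[i] - xs[i - 1]))

sample = [0.2, 0.25, 0.3, 0.9, 2.1]
print([round(empirical_cdf(sample, v), 3) for v in (0.1, 0.27, 1.0, 3.0)])
```

Note how F rises quickly across the cluster 0.2 to 0.3 and slowly across the sparse right end, matching the remark above.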
If only grouped data are available

Don't know the individual data values, only counts of observations in adjacent intervals

Define empirical distribution function G(x) with properties similar to F(x)
above for individual data points (details in text)

Discrete Empirical Distributions

If original individual data points are available (i.e., data are not grouped)

For each possible value x, define p(x) = proportion of the data values that are
equal to x

If only grouped data are available

Define a probability mass function such that the sum of the p(x)'s for the x's in an interval is equal to the proportion of the data in that interval

Allocation of the p(x)'s to the x's within an interval is arbitrary
6.3 Techniques for Assessing Sample
Independence

Most methods to specify input distributions assume the observed data $X_1, X_2, \ldots, X_n$ are an independent (random) sample from some underlying distribution
If not, most methods are invalid
Need a way to check data empirically for independence
Heuristic plots vs. formal statistical tests for independence

Correlation plot: If the data are observed in a time sequence, compute the sample correlation $\hat{\rho}_j$ (see Sec. 4.4 for the formula) and plot it as a function of the lag j
If data are independent then the correlations should be near zero for all lags
Keep in mind that these are just estimates

Scatter diagram: Plot the pairs $(X_i, X_{i+1})$
If data are independent the pairs should be scattered randomly
If data are positively (negatively) correlated the pairs will lie along a positively
(negatively) sloping line

Independent draws from an expo(1) distribution (independent by construction): correlation plot and scatter diagram (figures omitted)

Delays in queue from an M/M/1 queue with utilization factor ρ = 0.8 (positively correlated): correlation plot and scatter diagram (figures omitted)



Formal statistical tests for independence:
Nonparametric tests: rank von Neumann ratio
Runs tests
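A small sketch (my own; the autocorrelation estimator below is one common variant of the Sec. 4.4 formula) of computing the ingredients for both heuristic plots:

```python
import numpy as np

def lag_correlations(x, max_lag=10):
    """Sample correlation rho_j of observations lag j apart, for a
    correlation plot (one common variant of the estimator)."""
    x = np.asarray(x, dtype=float)
    n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)
    return [float(np.sum((x[:n - j] - xbar) * (x[j:] - xbar)) / ((n - j) * s2))
            for j in range(1, max_lag + 1)]

rng = np.random.default_rng(42)
data = rng.exponential(scale=1.0, size=500)     # IID expo(1), independent by construction
print(lag_correlations(data, max_lag=5))        # estimates should all be near zero
pairs = np.column_stack((data[:-1], data[1:]))  # (X_i, X_{i+1}) scatter-diagram pairs
```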
6.4 Activity I: Hypothesizing Families of
Distributions

First, need to decide what form or family to use: exponential, gamma, or what?

Later, need to estimate parameters and assess goodness of fit

Sometimes have some prior knowledge of the random variable's role in the simulation

Requires no data

Use theoretical knowledge of the random variable's role in the simulation

Seldom have enough prior knowledge to specify a distribution completely;
exceptions:
Arrivals one-at-a-time, constant mean rate, independent: exponential
interarrival times
Sum of many independent pieces: normal
Product of many independent pieces: lognormal

Often use prior knowledge to rule out distributions on basis of range:
Service times: not normal (the range of a normal distribution always includes negative values)

Still should be supported by data (e.g., for parameter-value estimation)
6.4.1 Summary Statistics


Compare simple sample statistics with theoretical population versions for some
distributions to get a hint

Bear in mind that we get only estimates subject to uncertainty

If the sample mean $\bar{X}(n)$ and the sample median $\hat{x}_{0.5}(n)$ are close, this suggests a symmetric distribution

Coefficient of variation of a distribution: $cv = \sigma/\mu$; estimate it via $\widehat{cv} = S(n)/\bar{X}(n)$; sometimes useful for discriminating between continuous distributions
cv < 1 suggests gamma or Weibull with shape parameter α > 1
cv = 1 suggests exponential
cv > 1 suggests gamma or Weibull with shape parameter α < 1

Lexis ratio of a distribution: $\tau = \sigma^2/\mu$; estimate it via $\hat{\tau} = S^2(n)/\bar{X}(n)$; sometimes useful for discriminating between discrete distributions
τ < 1 suggests binomial
τ = 1 suggests Poisson
τ > 1 suggests negative binomial or geometric

Other summary statistics: range, skewness, kurtosis
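A minimal sketch (my own naming) of computing these summary statistics from a sample:

```python
import numpy as np

def summary_hints(x):
    """Sample mean, median, cv-hat, and Lexis-ratio estimate (a sketch)."""
    x = np.asarray(x, dtype=float)
    return {"mean": x.mean(),
            "median": float(np.median(x)),
            "cv": x.std(ddof=1) / x.mean(),     # S(n) / Xbar(n)
            "lexis": x.var(ddof=1) / x.mean()}  # S^2(n) / Xbar(n)

rng = np.random.default_rng(1)
print(summary_hints(rng.exponential(1.0, 1000)))  # cv should be near 1
print(summary_hints(rng.poisson(3.0, 1000)))      # Lexis ratio should be near 1
```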
6.4.2 Histograms


Continuous Data Set

Basically an unbiased estimate of b f(x), where f(x) is the true (unknown)
underlying density of the observed data and b is a constant
Break range of data into k intervals of width b each
k, b are basically trial and error
One rule of thumb, Sturges' rule: $k = \lfloor 1 + \log_2 n \rfloor = \lfloor 1 + 3.322 \log_{10} n \rfloor$
Compute the proportion $h_j$ of the data falling in the jth interval; plot a constant of height $h_j$ above the jth interval
Shape of plot should resemble density of underlying distribution; compare shape of
histogram to density shapes in Sec. 6.2.2
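A sketch of a proportion-scaled histogram using Sturges' rule (assuming the floor form of the rule given above):

```python
import math
import numpy as np

def sturges_histogram(data):
    """Proportions h_j over k equal-width intervals, with k from Sturges' rule."""
    x = np.asarray(data, dtype=float)
    k = math.floor(1 + math.log2(len(x)))   # Sturges' rule
    counts, edges = np.histogram(x, bins=k)
    return counts / len(x), edges           # heights h_j and interval endpoints

rng = np.random.default_rng(7)
h, edges = sturges_histogram(rng.exponential(1.0, 219))
print(len(h), np.round(h, 3))  # shape should resemble an exponential density
```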


Discrete Data Set

Basically an unbiased estimate of the (unknown) underlying probability mass
function of the data
For each possible value $x_j$ that can be assumed by the data, let $h_j$ be the proportion of the data that are equal to $x_j$; plot a bar of height $h_j$ above $x_j$
Shape of plot should resemble mass function of underlying distribution; compare
shape of histogram to mass-function shapes in Sec. 6.2.3


Multimodal Data

Histogram might have multiple local modes, rather than just one; no single
standard distribution adequately represents this
Possibility: data can be separated on some context-dependent basis (e.g., observed
machine downtimes are classified as minor vs. major)
Separate data on this basis, fit separately, recombine as a mixture (details in text)
6.4.3 Quantile Summaries and Box Plots


Quantile Summaries

Numerical synopsis of sample quantiles useful for detecting whether underlying
density or mass function is symmetric or skewed one way or the other

Definition of quantiles: Suppose the CDF F(x) is continuous and strictly increasing whenever 0 < F(x) < 1, and let q be strictly between 0 and 1. Then the q-quantile of F(x) is the number $x_q$ such that $F(x_q) = q$. If $F^{-1}$ is the inverse of F, then $x_q = F^{-1}(q)$

q = 0.5: median
q = 0.25 or 0.75: quartiles
q = 0.125, 0.875: octiles
q = 0, 1: extremes

Quantile summary: List median, average of quartiles, average of octiles, and avg. of
extremes
If distribution is symmetric, then median, avg. of quartiles, avg. of octiles, and
avg. of extremes should be approximately equal
If distribution is skewed right, then
median < avg. of quartiles < avg. of octiles < avg. of extremes
If distribution is skewed left, then
median > avg. of quartiles > avg. of octiles > avg. of extremes


Box Plots

Graphical display of quantile summary
On horizontal axis, plot median, extremes, octiles, and a box ending at quartiles
Symmetry or asymmetry of plot indicates symmetry or skewness of distribution
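A sketch of computing a quantile summary (my own function name; NumPy's default sample-quantile interpolation is one of several conventions and may differ slightly from the book's):

```python
import numpy as np

def quantile_summary(data):
    """Median and averages of quartiles, octiles, and extremes (a sketch)."""
    x = np.asarray(data, dtype=float)
    q = lambda p: float(np.quantile(x, p))
    return {"median": q(0.5),
            "avg quartiles": (q(0.25) + q(0.75)) / 2,
            "avg octiles": (q(0.125) + q(0.875)) / 2,
            "avg extremes": (q(0.0) + q(1.0)) / 2}

rng = np.random.default_rng(3)
print(quantile_summary(rng.exponential(1.0, 500)))
# For a right-skewed sample these four values should be increasing.
```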
Hypothesizing a Family of Distributions: Example with Continuous Data


Sample of n = 219 interarrival times of cars to a drive-up bank over a 90-minute
peak-load period
Number of cars arriving in each of the six 15-minute periods was
approximately equal, suggesting stationarity of arrival rate

Sample mean = 0.399 (all times in minutes) > median = 0.270, skewness = +1.458,
all suggesting right skewness

cv = 0.953, close to 1, suggesting exponential

Histograms (for different choices of interval width b) suggest exponential:


Box plot is consistent with exponential:



Hypothesizing a Family of Distributions: Example with Discrete Data

Sample of n = 156 observations on number of items demanded per week from an
inventory over a three-year period

Range 0 through 11

Sample mean = 1.891 > median = 1.00, skewness = +1.655, all suggesting right
skewness

Lexis ratio = 5.285/1.891 = 2.795 > 1, suggesting negative binomial or geometric
(special case of negative binomial)

Histogram suggests geometric:


6.5 Activity II: Estimation of Parameters

Have: Hypothesized distribution

Need: Numerical estimates of its parameter(s); this constitutes the "fit"

Many methods to estimate distribution parameters
Method of moments
Unbiased estimation
Least squares
Maximum likelihood (MLE)

In some sense, MLE is the preferred method for our purposes
Good statistical properties
Somewhat justifies chi-square goodness-of-fit test
Intuitive
Allows estimates of error in the parameters, for sensitivity analysis

Idea for MLEs:
Have observed sample $X_1, X_2, \ldots, X_n$
Came from some true (unknown) parameter(s) of the distribution form
Pick the parameter(s) that make it most likely that you would get what you did
get (or close to what you got in the continuous case)
An optimization (mathematical-programming) problem, often messy
MLEs for Discrete Distributions

Have hypothesized a family with PMF $p_\theta(x_j) = P_\theta(X = x_j)$

Single (for now) unknown parameter θ to be estimated

For any trial value of θ, the probability of getting the already-observed sample is

$$P_\theta(\text{getting } X_1, X_2, \ldots, X_n) = P_\theta(X = X_1)\, P_\theta(X = X_2) \cdots P_\theta(X = X_n) = \underbrace{p_\theta(X_1)\, p_\theta(X_2) \cdots p_\theta(X_n)}_{\text{Likelihood function } L(\theta)}$$


Task: Find the (legal) value of θ that makes L(θ) as big as it can be

How?: Differential calculus, take logarithm (turns products into sums), nonlinear
programming methods, tabled values, staring at it, ...


MLEs for Continuous Distributions

Change "getting" above to "getting close to" for motivation (see Prob. 6.26)

Wind up just replacing the PMF $p_\theta$ by the density $f_\theta$ and proceeding the same way




MLEs for Multiple-Parameter Distributions

Same idea, but now the optimization problem has dimensionality equal to the number of parameters to be estimated

MLEs and Confidence Intervals on Distribution Parameters

Have the MLE estimate $\hat{\theta}_n$ of θ

Would also like a confidence interval on θ, for sensitivity analysis of the simulation output with respect to the parameter

Asymptotic normality property of MLEs:

$$\frac{\hat{\theta}_n - \theta}{\sqrt{\delta(n)}} \xrightarrow{D} N(0, 1) \quad \text{as } n \to \infty, \qquad \text{where } \delta(n) = \left\{ -E\!\left[ \frac{d^2 \ln L(\theta)}{d\theta^2} \right] \right\}^{-1}$$

Thus, by the usual confidence-interval manipulations, an approximate 100(1 − α)% confidence interval for θ is

$$\hat{\theta}_n \pm z_{1 - \alpha/2} \sqrt{\hat{\delta}(n)}$$

where $z_{1-\alpha/2}$ is the upper 1 − α/2 critical point of N(0, 1)

Use in simulation:
Question: Is the estimate $\hat{\theta}_n$ of θ good enough?
Approach:
Get a confidence interval on θ as above
Run the simulation with the input parameter set at the left end, then at the right end, of the interval
If the simulation output changes significantly, then need a better estimate $\hat{\theta}_n$
If not, this $\hat{\theta}_n$ is good enough
Example of Continuous MLE: Interarrival-Time Data for Drive-Up Bank

Hypothesized exponential family: density function is

$$f(x) = \begin{cases} \dfrac{1}{\beta}\, e^{-x/\beta} & \text{if } x \ge 0 \\[4pt] 0 & \text{otherwise} \end{cases}$$

Likelihood function is

$$L(\beta) = \left(\frac{1}{\beta} e^{-X_1/\beta}\right) \left(\frac{1}{\beta} e^{-X_2/\beta}\right) \cdots \left(\frac{1}{\beta} e^{-X_n/\beta}\right) = \beta^{-n} \exp\!\left(-\frac{1}{\beta} \sum_{i=1}^{n} X_i\right)$$

Want the value of β that maximizes L(β) over all β > 0

Equivalent (and easier) to maximize the log-likelihood function l(β) = ln L(β), since ln is a monotonically increasing function

In this case, $l(\beta) = -n \ln \beta - \frac{1}{\beta} \sum_{i=1}^{n} X_i$, which can be maximized by simple differential calculus:

Set $\dfrac{dl}{d\beta} = -\dfrac{n}{\beta} + \dfrac{1}{\beta^2} \sum_{i=1}^{n} X_i = 0$ and solve for $\beta = \dfrac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}(n)$

Check second-order sufficient conditions for a maximizer:
$\dfrac{d^2 l}{d\beta^2} = \dfrac{n}{\beta^2} - \dfrac{2}{\beta^3} \sum_{i=1}^{n} X_i$, which is negative when $\beta = \bar{X}(n)$ since the $X_i$'s are positive

Thus, the MLE is $\hat{\beta} = \bar{X}(n) = 0.399$ from the observed sample of n = 219 points
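A sketch of this fit with its approximate confidence interval; the expression δ(n) = β²/n below is my own derivation from the asymptotic-normality result above (check against the book), and the data are simulated stand-ins for the bank sample:

```python
import numpy as np
from scipy.stats import norm

def fit_exponential_mle(x, alpha=0.10):
    """MLE beta-hat = Xbar(n), plus an approximate 100(1-alpha)% c.i.
    using delta(n) = beta^2 / n (derived, not from the text)."""
    x = np.asarray(x, dtype=float)
    beta_hat = x.mean()
    half = norm.ppf(1 - alpha / 2) * beta_hat / np.sqrt(len(x))
    return beta_hat, (beta_hat - half, beta_hat + half)

rng = np.random.default_rng(219)
stand_in = rng.exponential(scale=0.399, size=219)  # stand-in for the n = 219 bank data
print(fit_exponential_mle(stand_in))
```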
Example of Discrete MLE: Demand-Size Data from Inventory


Hypothesized geometric family: mass function is

$$p(x) = p\,(1 - p)^x \quad \text{for } x = 0, 1, 2, \ldots$$

Likelihood function is

$$L(p) = p^n (1 - p)^{\sum_{i=1}^{n} X_i}$$

In this case, the log-likelihood function is $l(p) = n \ln p + \left(\sum_{i=1}^{n} X_i\right) \ln(1 - p)$, which can be maximized by simple differential calculus:

Set $\dfrac{dl}{dp} = \dfrac{n}{p} - \dfrac{\sum_{i=1}^{n} X_i}{1 - p} = 0$ and solve for $\hat{p} = \dfrac{1}{\bar{X}(n) + 1}$

Check second-order sufficient conditions for a maximizer:
$\dfrac{d^2 l}{dp^2} = -\dfrac{n}{p^2} - \dfrac{\sum_{i=1}^{n} X_i}{(1 - p)^2}$, which is negative for any valid p

So the MLE is $\hat{p} = \dfrac{1}{1.891 + 1} = 0.346$ from the observed sample of n = 156 points

Confidence interval for the true p:

$$E\!\left(\frac{d^2 l}{dp^2}\right) = -\frac{n}{p^2} - \frac{E\!\left(\sum_{i=1}^{n} X_i\right)}{(1 - p)^2} = -\frac{n}{p^2} - \frac{n (1 - p)/p}{(1 - p)^2} = -\frac{n}{p^2 (1 - p)}$$

Thus, $\delta(n) = \dfrac{p^2 (1 - p)}{n}$, and for large n an approximate 90% confidence interval for p is

$$\hat{p} \pm 1.645 \sqrt{\frac{\hat{p}^2 (1 - \hat{p})}{n}} = 0.346 \pm 1.645 \sqrt{\frac{(0.346)^2 (1 - 0.346)}{156}} = 0.346 \pm 0.037 = [0.309,\ 0.383]$$
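The same idea as a sketch for the geometric fit (stand-in data; with the actual sample mean 1.891 and n = 156 this gives p-hat = 0.346 and the interval [0.309, 0.383]):

```python
import numpy as np
from scipy.stats import norm

def fit_geometric_mle(x, alpha=0.10):
    """MLE p-hat = 1/(Xbar(n) + 1), with an approximate c.i. from
    delta(n) = p^2 (1 - p) / n as derived above (a sketch)."""
    x = np.asarray(x, dtype=float)
    p_hat = 1.0 / (x.mean() + 1.0)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(p_hat**2 * (1 - p_hat) / len(x))
    return p_hat, (p_hat - half, p_hat + half)

rng = np.random.default_rng(156)
stand_in = rng.geometric(p=0.346, size=156) - 1  # numpy's geometric starts at 1
print(fit_geometric_mle(stand_in))
```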

6.6 Activity III: Determining How
Representative the Fitted Distributions Are


Have: Hypothesized family, have estimated parameters
Question: Does the fitted distribution agree with the observed data?
Approaches: Heuristic and formal statistical hypothesis tests


6.6.1 Heuristic Procedures

Density/Histogram Overplots and Frequency Comparisons

Continuous Data

Density/histogram overplot:
Plot $b\,\hat{f}(x)$ over the histogram h(x) and look for similarities (recall that the area under h(x) is b, and $\hat{f}$ is the density of the fitted distribution)


Interarrival-time data for drive-up bank and fitted exponential:

Frequency comparison
Histogram intervals: $[b_{j-1}, b_j]$ for j = 1, 2, ..., k, each of width b
Let $h_j$ = the observed proportion of the data in the jth interval
Let $r_j = \int_{b_{j-1}}^{b_j} \hat{f}(x)\, dx$, the expected proportion of the data in the jth interval if the fitted distribution is correct
Plot $h_j$ and $r_j$ together and look for similarities

Discrete Data

Frequency comparison
Let $h_j$ = the observed proportion of the data that are equal to the jth possible value $x_j$
Let $r_j = \hat{p}(x_j)$, the expected proportion of the data equal to $x_j$ if the fitted probability mass function $\hat{p}$ is correct
Plot $h_j$ and $r_j$ together and look for similarities
Demand-size data for inventory and fitted geometric:


Distribution Function Differences Plots

The density/histogram overplots above compare individual probabilities of the fitted distribution with observed individual probabilities
Instead of individual probabilities, could compare cumulative probabilities via the fitted CDF $\hat{F}(x)$ against a (new) empirical CDF

$$F_n(x) = \frac{\text{number of } X_i\text{'s that are} \le x}{n} = \text{proportion of the data that are} \le x$$

Could plot $\hat{F}(x)$ together with $F_n(x)$ and look for similarities, but it is harder to see such similarities for cumulative than for individual probabilities
Alternatively, plot $F_n(x) - \hat{F}(x)$ against the range of x values and look for closeness to a flat horizontal line at height 0
Interarrival-time data for drive-up bank and fitted exponential: (figure omitted)
Demand-size data for inventory and fitted geometric: (figure omitted)
Probability Plots

Another class of ways to compare the CDF of the fitted distribution with an empirical CDF directly from the data
Sort the data into increasing order: $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ (called the order statistics of the data)
Another empirical CDF definition, defined only at the order statistics: $\tilde{F}_n(X_{(i)})$ is the observed proportion of the data that are $\le X_{(i)}$, which is i/n (adjusted to (i − 0.5)/n since it's inconvenient to hit 0 or 1)
If F(x) is the true (unknown) CDF of the data, then F(x) = P(X ≤ x) for any x, so taking x = $X_{(i)}$, $F(X_{(i)}) = P(X \le X_{(i)})$, which is estimated by (i − 0.5)/n
Thus, we should have $F(X_{(i)}) \approx (i - 0.5)/n$ for all i = 1, 2, ..., n

P-P plot: If the fitted distribution (with CDF $\hat{F}$) is correct, i.e., close to the true unknown F, we should have $\hat{F}(X_{(i)}) \approx (i - 0.5)/n$ for all i = 1, 2, ..., n, so plotting the pairs $\left( (i - 0.5)/n,\ \hat{F}(X_{(i)}) \right)$ for all i = 1, 2, ..., n should result in an approximately straight line from (0, 0) to (1, 1) if $\hat{F}$ is correct
Valid for both continuous and discrete data
Sensitive to misfits in the center of the range of the distribution

Q-Q plot: Applying $\hat{F}^{-1}$ across the above, $\hat{F}^{-1}\!\left( (i - 0.5)/n \right) \approx X_{(i)}$ for all i = 1, 2, ..., n, so plotting the pairs $\left( \hat{F}^{-1}\!\left( (i - 0.5)/n \right),\ X_{(i)} \right)$ for all i = 1, 2, ..., n should result in an approximately straight line from $(X_{(1)}, X_{(1)})$ to $(X_{(n)}, X_{(n)})$ if $\hat{F}$ is correct
Valid only for continuous data
Depending on the form of the fitted distribution, there may not be a closed-form formula for $\hat{F}^{-1}$
Sensitive to misfits in the tails of the distributions
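A sketch (my own) of computing P-P and Q-Q plot coordinates for a fitted exponential:

```python
import numpy as np
from scipy.stats import expon

def pp_qq_pairs(data, cdf, ppf):
    """Coordinates for P-P and Q-Q plots against a fitted distribution."""
    xs = np.sort(np.asarray(data, dtype=float))      # order statistics X_(i)
    q = (np.arange(1, len(xs) + 1) - 0.5) / len(xs)  # (i - 0.5)/n
    pp = np.column_stack((q, cdf(xs)))               # should hug the line (0,0)-(1,1)
    qq = np.column_stack((ppf(q), xs))               # should hug the 45-degree line
    return pp, qq

rng = np.random.default_rng(9)
x = rng.exponential(scale=0.399, size=219)
beta = x.mean()                                      # exponential MLE
pp, qq = pp_qq_pairs(x, lambda v: expon.cdf(v, scale=beta),
                     lambda p: expon.ppf(p, scale=beta))
print(pp[:3], qq[:3])
```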
Q-Q plot of interarrival-time data for fitted exponential distribution:


P-P plot of interarrival-time data for fitted exponential distribution:


P-P plot of demand-size data for fitted geometric distribution:

6.6.2 Goodness-of-Fit Tests


Formal statistical hypothesis tests for

$H_0$: The observed data $X_1, X_2, \ldots, X_n$ are IID random variables with distribution function $\hat{F}$

Caution: Failure to reject $H_0$ does not constitute proof that the fit is good

The power of some goodness-of-fit tests is low, particularly for small sample size n

Also, large n creates high power, so tests will nearly always reject $H_0$

Keep in mind that null hypotheses are seldom literally true, and we are looking
for an adequate fit of the distribution

Chi-Square Tests

Very old (Karl Pearson, 1900), and general (continuous or discrete data)
Formalization of frequency comparisons
Divide the range of the data into k intervals, not necessarily of equal width:
$[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k]$
($a_0$ could be $-\infty$, or $a_k$ could be $+\infty$)
Compare the actual amount of observed data in each interval with what the fitted distribution would predict
Let $N_j$ = the number of observed data points in the jth interval
Let $p_j$ = the expected proportion of the data in the jth interval if the fitted distribution were literally true:

$$p_j = \begin{cases} \displaystyle\int_{a_{j-1}}^{a_j} \hat{f}(x)\, dx & \text{for continuous data} \\[8pt] \displaystyle\sum_{a_{j-1} \le x_i \le a_j} \hat{p}(x_i) & \text{for discrete data} \end{cases}$$

Thus, $n p_j$ = the expected (under the fitted distribution) number of points in the jth interval
If the fitted distribution is correct, would expect $N_j \approx n p_j$

Test statistic:

$$\chi^2 = \sum_{j=1}^{k} \frac{(N_j - n p_j)^2}{n p_j}$$

Under $H_0$: the fitted distribution is correct, $\chi^2$ has (approximately; see book for details) a chi-square distribution with k − 1 d.f.
Reject $H_0$ at level α if $\chi^2$ exceeds the upper 1 − α critical point $\chi^2_{k-1, 1-\alpha}$

Advantages: Completely general
Asymptotically valid (as n → ∞) if MLEs were used

Drawback: Arbitrary choice of intervals (can affect test conclusion)
Conventional advice:
Want $n p_j \ge 5$ or so for all but a couple of the j's
Pick intervals so that the $p_j$'s are close to each other
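A sketch of the equal-probability chi-square test for a fitted exponential, with interval endpoints obtained by inverting the fitted CDF as in the example that follows (it uses d.f. = k − 1 as on the slide):

```python
import numpy as np
from scipy.stats import chi2, expon

def chi_square_gof_expon(x, k=20, alpha=0.10):
    """Equal-probability chi-square test for a fitted exponential (a sketch)."""
    x = np.asarray(x, dtype=float)
    n, beta = len(x), x.mean()                              # exponential MLE
    interior = expon.ppf(np.arange(1, k) / k, scale=beta)   # a_1, ..., a_(k-1)
    N = np.bincount(np.digitize(x, interior), minlength=k)  # observed counts N_j
    stat = float(np.sum((N - n / k) ** 2 / (n / k)))        # chi-square statistic
    crit = chi2.ppf(1 - alpha, df=k - 1)                    # upper critical point
    return stat, crit, stat <= crit                         # True: do not reject H0

rng = np.random.default_rng(20)
print(chi_square_gof_expon(rng.exponential(scale=0.399, size=219)))
```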
Chi-square test for exponential distribution fitted to interarrival-time data:

Chose k = 20 intervals so that $p_j$ = 1/20 = 0.05 for each interval (see book for details on how the endpoints were chosen; this involved inverting the exponential CDF, and taking $a_{20} = +\infty$)

Thus, $n p_j$ = (219)(0.05) = 10.95 for each interval

Counted observed frequencies $N_j$, computed test statistic $\chi^2 = 22.188$

Use d.f. = k − 1 = 19; the upper 0.10 critical point is $\chi^2_{19,\, 0.90} = 27.204$

Since the test statistic does not exceed the critical point, do not reject $H_0$

Chi-square test for geometric distribution fitted to demand-size data:

Since the data are discrete, cannot choose intervals so that the $p_j$'s are exactly equal to each other

Chose k = 3 intervals (classes): {0}, {1, 2}, and {3, 4, ...}

Got $n p_1$ = 53.960, $n p_2$ = 58.382, and $n p_3$ = 43.658

Counted observed frequencies $N_j$, computed test statistic $\chi^2 = 1.930$

Use d.f. = k − 1 = 2; the upper 0.10 critical point is $\chi^2_{2,\, 0.90} = 4.605$

Since the test statistic does not exceed the critical point, do not reject $H_0$
Kolmogorov-Smirnov Tests

Advantages with respect to chi-square tests:
No arbitrary choices like intervals
Exactly valid for any (finite) n

Disadvantage with respect to chi-square tests:
Not as general

A kind of a formalization of probability plots
Compare empirical CDF from data against fitted CDF

Yet another version of the empirical distribution function:
$F_n(x)$ = proportion of the $X_i$ data that are ≤ x (a piecewise-constant step function)
On the other hand, we have the fitted CDF $\hat{F}(x)$
In a perfect world, $F_n(x) = \hat{F}(x)$ for all x
The worst (vertical) discrepancy is

$$D_n = \sup_x \left| F_n(x) - \hat{F}(x) \right|$$

(sup instead of max because the supremum may not be attained for any x)

Computing $D_n$ (must be careful; sometimes stated incorrectly):

$$D_n^+ = \max_{i = 1, \ldots, n} \left\{ \frac{i}{n} - \hat{F}(X_{(i)}) \right\}, \qquad D_n^- = \max_{i = 1, \ldots, n} \left\{ \hat{F}(X_{(i)}) - \frac{i - 1}{n} \right\}, \qquad D_n = \max\{D_n^+, D_n^-\}$$

Reject $H_0$: the fitted distribution is correct if $D_n$ is too big
There are several different kinds of tables depending on the form and specification of the hypothesized distribution (see book for details and example)
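A sketch of computing $D_n$ from the formulas above:

```python
import numpy as np
from scipy.stats import expon

def ks_statistic(data, cdf):
    """D_n = max(D_n+, D_n-) evaluated at the order statistics."""
    xs = np.sort(np.asarray(data, dtype=float))
    n = len(xs)
    f = cdf(xs)
    d_plus = np.max(np.arange(1, n + 1) / n - f)  # D_n+
    d_minus = np.max(f - np.arange(0, n) / n)     # D_n-
    return float(max(d_plus, d_minus))

rng = np.random.default_rng(30)
x = rng.exponential(scale=0.399, size=219)
print(ks_statistic(x, lambda v: expon.cdf(v, scale=x.mean())))
```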
Anderson-Darling Tests

As in the K-S test, look at the vertical discrepancies between $\hat{F}(x)$ and $F_n(x)$

Difference: K-S weights the differences the same for each x
Sometimes more interested in getting accuracy in the (right) tail
Queueing applications
P-K formula depends on the variance of the service-time RV
A-D applies increasing weight to differences toward the tails
A-D is more sensitive (powerful) than K-S to tail discrepancies

Define the weight function

$$\psi(x) = \frac{1}{\hat{F}(x) \left[ 1 - \hat{F}(x) \right]}$$

Note that ψ(x) is smallest (= 4) in the middle (at the median, where $\hat{F}(x) = 1/2$) and largest (→ ∞) in either tail

Test statistic is

$$A_n^2 = n \int_{-\infty}^{\infty} \left[ F_n(x) - \hat{F}(x) \right]^2 \psi(x)\, \hat{f}(x)\, dx = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left\{ \ln \hat{F}(X_{(i)}) + \ln\!\left[ 1 - \hat{F}(X_{(n+1-i)}) \right] \right\} \quad \text{(computationally)}$$

Reject $H_0$: the fitted distribution is correct if $A_n^2$ is too big
There are several different kinds of tables depending on the form and specification of the hypothesized distribution (see book for details and example)
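A sketch of the computational form of $A_n^2$:

```python
import numpy as np
from scipy.stats import expon

def anderson_darling(data, cdf):
    """Computational form of the A_n^2 statistic given above (a sketch)."""
    xs = np.sort(np.asarray(data, dtype=float))
    n = len(xs)
    f = cdf(xs)                     # F-hat at the order statistics X_(i)
    i = np.arange(1, n + 1)
    s = np.sum((2 * i - 1) * (np.log(f) + np.log(1 - f[::-1])))
    return float(-s / n - n)

rng = np.random.default_rng(31)
x = rng.exponential(scale=0.399, size=219)
print(anderson_darling(x, lambda v: expon.cdf(v, scale=x.mean())))
```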
Poisson-Process Tests

Common situation in simulation: modeling an event process over time

Arrivals of customers or jobs
Breakdowns of machines
Accidents

Popular (and realistic) model: Poisson process at rate λ

Equivalent definitions:

1. The number of events in $(t_1, t_2]$ ~ Poisson with mean $\lambda (t_2 - t_1)$
2. The times between successive events ~ exponential with mean 1/λ
3. Given the number of events in a fixed period of time, the events are distributed uniformly over that period

Use the second or third definition to develop a test for observed data coming from a Poisson process:

2. Test for inter-event times being exponential (chi-square, K-S, A-D, ...)
3. Test for placement of events over time being uniform

See book for details and example

6.7 The ExpertFit Software and an Extended
Example


Need software assistance to carry out the above calculations

Standard statistical-analysis packages do not suffice
Often too oriented to normal-theory and related distributions
Need a wider variety of nonstandard distributions to achieve an adequate fit
Difficult calculations like inverting non-closed-form CDFs, computation of
critical values and p-values for tests

ExpertFit package is tailored to these needs

Other packages exist, sometimes packaged with simulation-modeling software

See book for details on ExpertFit and an extended, in-depth example
6.8 Shifted and Truncated Distributions


Shifted Distributions

Many standard distributions have range [0, ∞)
Exponential, gamma, Weibull, lognormal, PT5, PT6, log-logistic

But in some situations we'd like the range to be [γ, ∞) for some parameter γ > 0
A service time cannot physically be arbitrarily close to 0; there is some absolute positive minimum for the service time

Can shift one of the above distributions up (to the right) by γ
Replace x in the density definition by x − γ (including in the definition of the range)

Introduces a new parameter γ that must be estimated from the data
Depending on the distribution form, this may be relatively easy (e.g.,
exponential) or very challenging (e.g., global MLEs are ill-defined for gamma,
Weibull, lognormal)
See book for details and example


Truncated Distributions

Data are well-fitted by a distribution with range [0, ) but physical situation dictates
that no value can exceed some finite constant b

Need to truncate the distribution above b, to make effective range [0, b]

Really a variate-generation issue: covered in Chap. 8

6.9 Bézier Distributions


Can approximate the underlying CDF F(x) arbitrarily closely by a Bézier distribution (related to the Bézier curves used in drawing)

Specify control points for distribution

Can fit an optimal Bézier distribution, or use specialized software to drag the control points around visually with a mouse to achieve a visually acceptable fit

This is an alternative to simpler empirical distributions, useful when no standard
distribution adequately fits the observed data

6.10 Specifying Multivariate Distributions,
Correlations, and Stochastic Processes

Assumption so far: Want to generate independent, identically distributed (IID)
random variables (RVs) for input to drive the simulation

Sometimes have correlation between RVs in reality
A = interarrival time of a job from an upstream process
S = service time of job at the station being modeled
Perhaps a large A means that the job is large, taking a lot of time upstream; then it will probably take a lot of time here too (S large), i.e., Cor(A, S) > 0
Ignoring this correlation can lead to serious errors in output validity
Need ways to estimate this dependence, and (later) generate it in the simulation

There are several different specific situations and goals


6.10.1 Specifying Multivariate Distributions

Some of the models input RVs together form a jointly distributed random vector

Must specify the joint distribution form and estimate its parameters

Correlations between the RVs are then determined by the joint-distribution form

This is an ambitious goal, in terms of methods for specification, observed-data requirements, and later variate-generation methods

At present, limited to several specific special cases (see book for details): multivariate normal, multivariate lognormal, multivariate Johnson, and bivariate Bézier

6.10.2 Specifying Arbitrary Marginal Distributions and
Correlations

Less ambitious than specifying the joint distribution, but affords greater flexibility

Allow for possible correlation between input RVs, but fit their univariate (marginal)
distributions separately

Must specify the univariate marginal distributions (earlier methods) and estimate the
correlations (fairly easy)

Does not in general uniquely specify (control) the joint distribution
Except in multivariate normal case, specifying the marginal distributions and all
the correlations does not uniquely specify the joint distribution

Must take care that the correlations are compatible with the marginal distributions
Marginal distributions place constraints on what correlation structure is
theoretically possible

How to generate this structure for input to the simulation? (Chap. 8)


6.10.3 Specifying Stochastic Processes

Have an input stochastic process $\{X_1, X_2, \ldots\}$ where the $X_i$'s have the same distribution, but there is a correlation structure between them at various lags
e.g., $X_i$ is the size of the ith incoming message in a communications system, and it could be that large messages tend to be followed by other large messages (or the reverse)

Can regard this as an infinite-dimensional random vector for input

Some specific models (see book for details): AR, ARMA, gamma processes,
EAR, TES, ARTA
6.11 Selecting a Distribution in the Absence
of Data


No data? (it happens)

Must rely to some extent on subjective information (guesses)

Ask expert for:

min, max: uniform distribution
min, max, mode: triangular distribution
min, max, mode, mean: beta distribution

See book for details and example
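For instance, with the triangular option the three elicited values plug in directly; a sketch with hypothetical expert-opinion numbers:

```python
import numpy as np

# Hypothetical expert opinion for a service time: min 2, mode 5, max 12 (minutes)
a, m, b = 2.0, 5.0, 12.0

rng = np.random.default_rng(11)
print(rng.triangular(left=a, mode=m, right=b, size=5))  # stand-in input values
```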

Must do sensitivity analysis

Change input distributions, see if output changes appreciably

6.12 Models of Arrival Processes


Want probabilistic model of event process happening over time
Common application: arrival process

As with distributions, need to specify a form and estimate parameters

Three common models:


6.12.1 Poisson Processes

Three behavioral assumptions:
1. Events occur one at a time
2. Number of events in a time interval is independent of the past
3. Expected rate of events is constant over time

Fitting: Fit exponential to interevent times via MLE

Testing: Saw above

6.12.2 Nonstationary Poisson Process

Drop behavioral assumption 3 above (keep 1, one-at-a-time events)

Allow the expected rate of events to vary with time: replace the arrival-rate constant λ with a function λ(t), where t = time

Number of events in $(t_1, t_2]$ ~ Poisson with mean $\int_{t_1}^{t_2} \lambda(t)\, dt$

Estimation of the rate function λ(t):

Assume the rate function is constant over subintervals of time
Must specify subintervals thought to be appropriate
Must be careful to keep the units straight (a sketch of this follows below)
Other methods exist (see book for discussion and references)


6.12.3 Batch Arrivals

Drop behavioral assumption 1 above

Allow the number of events arriving together to be a discrete RV, independent of the event-time process

Fitting
Fit distribution to interevent times via MLE
Fit a discrete RV to observed group sizes

Testing
Separately for interevent times, group sizes

6.13 Assessing the Homogeneity of Different
Data Sets


Sometimes have different data sets on related but separate processes
Have service-time observations for ten different days
Can the ten data sets be merged?
In other words, is the underlying distribution the same for each day?

Advantages of merging (if it turns out to be justified)
Larger sample size, so get better specification of the input distribution
Just one specification problem rather than several
Just one distribution from which to generate in the simulation model

Want to test
$H_0$: All the population distribution functions are identical
vs.
$H_1$: At least one of the populations tends to yield larger observations than at least one of the other populations

Formal statistical test for doing so: Kruskal-Wallis test, which is a nonparametric
test based on the ranks of the data sets (see book for details)
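A sketch using SciPy's implementation of the Kruskal-Wallis test (the data are hypothetical):

```python
from scipy.stats import kruskal

# Hypothetical service-time samples (minutes) from three different days
day1 = [4.2, 5.1, 3.8, 6.0, 4.9, 5.4]
day2 = [4.5, 5.3, 4.0, 5.8, 5.0, 4.7]
day3 = [4.1, 4.8, 3.9, 5.5, 4.6, 5.2]

stat, p_value = kruskal(day1, day2, day3)
print(stat, p_value)  # a large p-value gives no evidence against merging
```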
