Input Modeling: Banks, Carson, Nelson & Nicol, 2010
Histograms [Identifying the distribution]

A frequency distribution or histogram is useful in determining the shape of a distribution.
The number of class intervals depends on:
- The number of observations
- The dispersion of the data
- Suggested: the square root of the sample size
For continuous data, the histogram corresponds to the probability density function of a theoretical distribution.
For discrete data, the histogram corresponds to the probability mass function.
If few data points are available, combine adjacent cells to eliminate the ragged appearance of the histogram.

Vehicle Arrival Example: the number of vehicles arriving at an intersection between 7 am and 7:05 am was monitored for 100 random workdays.

  Arrivals per Period   Frequency
          0                12
          1                10
          2                19
          3                17
          4                10
          5                 8
          6                 7
          7                 5
          8                 5
          9                 3
         10                 3
         11                 1

[Figure: the same data displayed as histograms with different interval sizes.]
There are ample data, so the histogram may have a cell for each possible value in the data range.
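As a sketch of how such a frequency distribution can be tallied with Python's standard library (the 100 raw observations are reconstructed here from the published frequencies, since the example reports only the table):

```python
from collections import Counter

# Frequencies from the vehicle arrival example: index = arrivals per
# 5-minute period, value = number of workdays that count was observed.
frequencies = [12, 10, 19, 17, 10, 8, 7, 5, 5, 3, 3, 1]

# Expand the table back into the 100 raw observations.
observations = [k for k, f in enumerate(frequencies) for _ in range(f)]

# A frequency distribution is just a tally of how often each value occurs.
histogram = Counter(observations)

# Text histogram: one cell per possible value, as the slide suggests
# when ample data are available.
for arrivals in sorted(histogram):
    print(f"{arrivals:2d} | {'*' * histogram[arrivals]}")
```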
Selecting the Family of Distributions [Identifying the distribution]

A family of distributions is selected based on:
- The context of the input variable
- The shape of the histogram
Frequently encountered distributions:
- Easier to analyze: exponential, normal, and Poisson
- Harder to analyze: beta, gamma, and Weibull

Use the physical basis of the distribution as a guide, for example:
- Binomial: number of successes in n trials
- Poisson: number of independent events that occur in a fixed amount of time or space
- Normal: distribution of a process that is the sum of a number of component processes
- Exponential: time between independent events, or a process time that is memoryless
- Weibull: time to failure for components
- Discrete or continuous uniform: models complete uncertainty
- Triangular: a process for which only the minimum, most likely, and maximum values are known
- Empirical: resamples from the actual data collected
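The empirical option above can be sketched in a few lines: draw new model inputs by resampling, with replacement, from the collected data (the service-time values below are made up for illustration):

```python
import random

# Observed data (hypothetical service times, in minutes).
observed_times = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 7.2]

random.seed(1)  # fixed seed so the sketch is reproducible

# An empirical input model simply resamples the collected data,
# so every generated value is one of the observed data points.
resampled = random.choices(observed_times, k=5)
print(resampled)
```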
Selecting the Family of Distributions [Identifying the distribution]

Remember the physical characteristics of the process:
- Is the process naturally discrete or continuous valued?
- Is it bounded?
There is no "true" distribution for any stochastic input process.
Goal: obtain a good approximation.

Quantile-Quantile Plots [Identifying the distribution]

A q-q plot is a useful tool for evaluating distribution fit.
If X is a random variable with cdf F, then the q-quantile of X is the value γ such that

  F(γ) = P(X ≤ γ) = q,  for 0 < q < 1

When F has an inverse, γ = F⁻¹(q).
Let {x_i, i = 1, 2, …, n} be a sample of data from X and {y_j, j = 1, 2, …, n} be the same observations in ascending order. Then

  y_j is approximately F⁻¹( (j − 0.5)/n )
Quantile-Quantile Plots [Identifying the distribution]

The plot of y_j versus F⁻¹( (j − 0.5)/n ) is:
- Approximately a straight line if F is a member of an appropriate family of distributions
- A line with slope 1 if F is a member of an appropriate family of distributions with appropriate parameter values

Example: check whether the door installation times follow a normal distribution.
The observations are ordered from smallest to largest:

  j   Value     j   Value     j   Value
  1   99.55     6   99.98    11  100.26
  2   99.56     7  100.02    12  100.27
  3   99.62     8  100.06    13  100.33
  4   99.65     9  100.17    14  100.41
  5   99.79    10  100.23    15  100.47
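A minimal sketch of the q-q computation for this example, using Python's standard library (the normal parameters are estimated from the sample itself, so the points should fall near the line y = x):

```python
from statistics import NormalDist, mean, stdev

# Ordered door installation times from the example (already sorted).
y = [99.55, 99.56, 99.62, 99.65, 99.79, 99.98, 100.02, 100.06,
     100.17, 100.23, 100.26, 100.27, 100.33, 100.41, 100.47]
n = len(y)

# Hypothesized distribution: normal, with parameters estimated from the data.
fitted = NormalDist(mu=mean(y), sigma=stdev(y))

# Theoretical quantiles F^-1((j - 0.5)/n) for j = 1..n.
x = [fitted.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

# If the data are roughly normal, the pairs (x_j, y_j) lie near a straight line.
for xj, yj in zip(x, y):
    print(f"{xj:8.3f}  {yj:8.3f}")
```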
Quantile-Quantile Plots [Identifying the distribution]

Example (continued): check whether the door installation times follow a normal distribution.
[Figure: the q-q plot is approximately a straight line, supporting the hypothesis of a normal distribution; a companion plot shows the density function of the fitted normal distribution superimposed on the histogram.]

Consider the following while evaluating the linearity of a q-q plot:
- The observed values never fall exactly on a straight line.
- The ordered values are ranked and hence not independent, so it is unlikely for the points to be scattered evenly about the line.
- The variance of the extremes is higher than that of the middle; linearity of the points in the middle of the plot is more important.

A q-q plot can also be used to check homogeneity:
- Check whether a single distribution can represent both sample sets
- Plot the ordered values of the two data samples against each other
Parameter Estimation [Identifying the distribution]

Parameter estimation is the next step after selecting a family of distributions.
If the observations in a sample of size n are X₁, X₂, …, Xₙ (discrete or continuous), the sample mean and variance are:

  X̄ = ( Σ_{i=1..n} X_i ) / n
  S² = ( Σ_{i=1..n} X_i² − n X̄² ) / (n − 1)

If the data are discrete and have been grouped in a frequency distribution:

  X̄ = ( Σ_{j=1..n} f_j X_j ) / n
  S² = ( Σ_{j=1..n} f_j X_j² − n X̄² ) / (n − 1)

where f_j is the observed frequency of value X_j.

When raw data are unavailable (data are grouped into class intervals), the approximate sample mean and variance are:

  X̄ = ( Σ_{j=1..c} f_j m_j ) / n
  S² = ( Σ_{j=1..c} f_j m_j² − n X̄² ) / (n − 1)

where f_j is the observed frequency in the jth class interval, m_j is the midpoint of the jth interval, and c is the number of class intervals.

A parameter is an unknown constant, but an estimator is a statistic.
Parameter Estimation [Identifying the distribution]

Vehicle Arrival Example (continued): the frequency table from the histogram example above (Table 9.1 in the book) can be analyzed to obtain:

  n = 100,  f₁ = 12, X₁ = 0,  f₂ = 10, X₂ = 1, …,
  Σ_{j=1..k} f_j X_j = 364  and  Σ_{j=1..k} f_j X_j² = 2080

The sample mean and variance are:

  X̄ = 364 / 100 = 3.64
  S² = (2080 − 100 · 3.64²) / 99 = 7.63

Goodness-of-Fit Tests [Identifying the distribution]

Conduct hypothesis testing on the input data distribution using:
- Kolmogorov-Smirnov test
- Chi-square test

No single correct distribution exists in a real application:
- If very little data are available, it is unlikely that any candidate distribution will be rejected.
- If a lot of data are available, it is likely that all candidate distributions will be rejected.
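The same computation as a runnable sketch, working directly from the frequency table:

```python
# Grouped-data sample mean and variance for the vehicle arrival example.
# frequencies[j] = f_j, the observed frequency of the value X_j = j.
frequencies = [12, 10, 19, 17, 10, 8, 7, 5, 5, 3, 3, 1]

n = sum(frequencies)                                          # 100 observations
sum_fx = sum(f * x for x, f in enumerate(frequencies))        # sum f_j X_j = 364
sum_fx2 = sum(f * x * x for x, f in enumerate(frequencies))   # sum f_j X_j^2 = 2080

mean = sum_fx / n                                             # X-bar = 3.64
variance = (sum_fx2 - n * mean**2) / (n - 1)                  # S^2 = 7.63

print(mean, round(variance, 2))
```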
Chi-Square Test [Goodness-of-Fit Tests]

Intuition: compare the histogram of the data to the shape of the candidate density or mass function.
Valid for large sample sizes, when parameters are estimated by maximum likelihood.
By arranging the n observations into a set of k class intervals or cells, the test statistic is:

  χ₀² = Σ_{i=1..k} (O_i − E_i)² / E_i

where O_i is the observed frequency in the ith interval and E_i = n·p_i is the expected frequency, with p_i the theoretical probability of the ith interval.
χ₀² approximately follows the chi-square distribution with k − s − 1 degrees of freedom, where s is the number of parameters of the hypothesized distribution estimated by the sample statistics.

The hypotheses of a chi-square test are:
  H₀: the random variable X conforms to the distributional assumption with the parameter(s) given by the estimate(s).
  H₁: the random variable X does not conform.

If the distribution tested is discrete and combining adjacent cells is not required (so that each E_i exceeds the minimum requirement; suggested minimum E_i = 5): each value of the random variable should be a class interval, unless combining is necessary, and

  p_i = p(x_i) = P(X = x_i)
Chi-Square Test [Goodness-of-Fit Tests]

[Table: recommended number of class intervals for continuous data; e.g., for sample size n > 100, use between n^(1/2) and n/5 intervals.]
Caution: a different grouping of the data (i.e., a different k) can affect the hypothesis testing result.

Vehicle Arrival Example (continued): the degrees of freedom are k − s − 1 = 7 − 1 − 1 = 5. Since

  χ₀² = 27.68 > χ²₀.₀₅,₅ = 11.1

the hypothesis is rejected at the 0.05 level of significance.
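A sketch of the chi-square computation for this example. The cell grouping assumes the book's choice of k = 7 (values 0 and 1 combined, tail values ≥ 7 combined so every expected frequency exceeds 5); the statistic comes out near the slide's 27.68, with small differences due to rounding of the expected frequencies:

```python
import math

# Observed frequencies for x = 0..11 arrivals (vehicle arrival example).
observed = [12, 10, 19, 17, 10, 8, 7, 5, 5, 3, 3, 1]
n = sum(observed)
lam = 3.64  # Poisson rate estimated by the sample mean

def poisson_pmf(x: int, lam: float) -> float:
    return math.exp(-lam) * lam**x / math.factorial(x)

# Class intervals chosen so each expected frequency exceeds 5:
# {0,1}, {2}, {3}, {4}, {5}, {6}, {>=7}  ->  k = 7 cells.
cells = [(0, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
O = [sum(observed[a:b + 1]) for a, b in cells] + [sum(observed[7:])]
p = [sum(poisson_pmf(x, lam) for x in range(a, b + 1)) for a, b in cells]
p.append(1.0 - sum(p))        # tail probability P(X >= 7)
E = [n * pi for pi in p]      # expected frequencies E_i = n * p_i

chi2 = sum((o - e) ** 2 / e for o, e in zip(O, E))
k, s = len(O), 1              # s = 1 parameter (lambda) estimated from data
df = k - s - 1                # degrees of freedom = 5
print(round(chi2, 2), df)     # chi2 well above the 11.07 critical value
```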
Kolmogorov-Smirnov Test [Goodness-of-Fit Tests]

Intuition: formalizes the idea behind examining a q-q plot.
Recall from Section 7.4.1: the test compares the continuous cdf, F(x), of the hypothesized distribution with the empirical cdf, S_N(x), of the N sample observations.
It is based on the maximum-difference statistic (tabulated in Table A.8):

  D = max | F(x) − S_N(x) |

A more powerful test, particularly useful when:
- Sample sizes are small
- No parameters have been estimated from the data
When parameter estimates have been made:
- The critical values in Table A.8 are biased (too large).
- The test is then more conservative, i.e., the actual Type I error is smaller than specified.

p-Values [Goodness-of-Fit Tests]

The p-value of a test statistic is the significance level at which one would just reject H₀ for the given test statistic value.
It is a measure of fit: the larger, the better.
- Large p-value: good fit
- Small p-value: poor fit

Vehicle Arrival Example (continued):
  H₀: the data are Poisson.
  Test statistic: χ₀² = 27.68, with 5 degrees of freedom.
  p-value = 0.00004, meaning we would just reject H₀ at the 0.00004 significance level; hence the Poisson distribution is a poor fit.
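A minimal sketch of the K-S statistic, assuming a hypothesized normal cdf and a small made-up sample. Because S_N jumps from (j−1)/N to j/N at the j-th ordered observation, the maximum deviation is checked on both sides of each jump:

```python
from statistics import NormalDist

# Small illustrative sample (values are made up), sorted for the empirical cdf.
sample = sorted([0.41, 0.92, 1.48, 1.77, 2.05, 2.51])
N = len(sample)

# Hypothesized cdf F(x): normal with assumed parameters (not estimated here,
# which is the case where the tabulated critical values apply directly).
F = NormalDist(mu=1.5, sigma=0.8).cdf

# One-sided gaps at each jump of the empirical cdf S_N.
d_plus = max(j / N - F(x) for j, x in enumerate(sample, start=1))
d_minus = max(F(x) - (j - 1) / N for j, x in enumerate(sample, start=1))
D = max(d_plus, d_minus)   # D = max |F(x) - S_N(x)|
print(round(D, 4))
```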
p-Values and "Best Fits" [Goodness-of-Fit Tests]

Many software packages use the p-value as the ranking measure to automatically determine the "best fit". Things to be cautious about:
- The software may not know about the physical basis of the data; the distribution families it suggests may be inappropriate.
- Close conformance to the data does not always lead to the most appropriate input model.
- The p-value does not say much about where the lack of fit occurs.
Recommended: always inspect the automatic selection using graphical methods.

Fitting a Nonstationary Poisson Process

Fitting an NSPP to arrival data is difficult; possible approaches:
- Fit a very flexible model with lots of parameters, or
- Approximate a constant arrival rate over some basic interval of time, but vary it from time interval to time interval (our focus).
Suppose we need to model arrivals over a time period [0, T]; this approach is most appropriate when we can:
- Observe the time period repeatedly, and
- Count arrivals / record arrival times.
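The count-arrivals approach can be sketched as follows, assuming equal-width intervals and made-up counts: the rate in each interval is the average arrival count over the repeated observations, divided by the interval length.

```python
# Piecewise-constant rate estimate for a nonstationary Poisson process.
interval_hours = 0.5   # basic interval over which the rate is held constant

# counts[d][i] = arrivals observed in interval i on observation day d
# (hypothetical data for illustration).
counts = [
    [3, 7, 12, 6],
    [5, 9, 10, 4],
    [4, 8, 14, 5],
]
days = len(counts)
num_intervals = len(counts[0])

# Estimated arrival rate (per hour) in each interval:
# total count across days / (number of days * interval length).
rates = [
    sum(day[i] for day in counts) / (days * interval_hours)
    for i in range(num_intervals)
]
print(rates)
```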
Covariance and Correlation [Multivariate/Time-Series Input Models]

Consider a model that describes the relationship between X₁ and X₂:

  (X₁ − μ₁) = β (X₂ − μ₂) + ε

where ε is a random variable with mean 0 that is independent of X₂.
The correlation between X₁ and X₂ (a value between −1 and 1) is:

  ρ = corr(X₁, X₂) = cov(X₁, X₂) / (σ₁ σ₂)
Time-Series Input Models [Multivariate/Time-Series Input Models]

A time series is a sequence of random variables X₁, X₂, X₃, … that are identically distributed (same mean and variance) but dependent.
- cov(X_t, X_{t+h}) is the lag-h autocovariance
- corr(X_t, X_{t+h}) is the lag-h autocorrelation
- If the autocovariance value depends only on h and not on t, the time series is covariance stationary.

Bivariate Normal Distribution [Multivariate/Time-Series Input Models]

If X₁ and X₂ are normally distributed, dependence between them can be modeled by the bivariate normal distribution with parameters μ₁, μ₂, σ₁², σ₂², and correlation ρ.
To estimate μ₁, μ₂, σ₁², σ₂², see "Parameter Estimation" above (Section 9.3.2 in the book).
To estimate ρ, suppose we have n independent and identically distributed pairs (X₁₁, X₂₁), (X₁₂, X₂₂), …, (X₁ₙ, X₂ₙ); then:

  ĉov(X₁, X₂) = (1/(n − 1)) Σ_{j=1..n} (X₁ⱼ − X̄₁)(X₂ⱼ − X̄₂)
              = (1/(n − 1)) ( Σ_{j=1..n} X₁ⱼ X₂ⱼ − n X̄₁ X̄₂ )

  ρ̂ = ĉov(X₁, X₂) / (σ̂₁ σ̂₂)

where σ̂₁ and σ̂₂ are the sample standard deviations.
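A sketch of the ρ̂ estimator on made-up pairs, using the sample-covariance formula above:

```python
from statistics import mean, stdev

# n i.i.d. pairs (X1j, X2j); the values below are made up for illustration.
x1 = [10.2, 11.5, 9.8, 12.1, 10.9, 11.8]
x2 = [20.1, 22.8, 19.5, 24.0, 21.6, 23.1]
n = len(x1)

xbar1, xbar2 = mean(x1), mean(x2)

# Sample covariance: (1/(n-1)) * sum (X1j - X1bar)(X2j - X2bar).
cov_hat = sum((a - xbar1) * (b - xbar2) for a, b in zip(x1, x2)) / (n - 1)

# rho-hat = cov-hat / (s1 * s2), using the sample standard deviations.
rho_hat = cov_hat / (stdev(x1) * stdev(x2))
print(round(rho_hat, 3))
```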
Time-Series Input Models [Multivariate/Time-Series Input Models]

If X₁, X₂, X₃, … is a sequence of identically distributed, but dependent and covariance-stationary random variables, we can represent the process with models such as:
- Autoregressive order-1 model, AR(1)
- Exponential autoregressive order-1 model, EAR(1)
Both have the characteristic that:

  ρ_h = corr(X_t, X_{t+h}) = ρʰ,  for h = 1, 2, …

The lag-h autocorrelation decreases geometrically as the lag increases; hence observations far apart in time are nearly independent.

AR(1) Time-Series Input Models

Consider the time-series model:

  X_t = μ + φ (X_{t−1} − μ) + ε_t,  for t = 2, 3, …

where ε₂, ε₃, … are i.i.d. normally distributed with mean 0 and variance σ_ε².
If X₁ is chosen appropriately, then:
- X₁, X₂, … are normally distributed with mean μ and variance σ_ε²/(1 − φ²)
- The autocorrelation is ρ_h = φʰ
To estimate φ, μ, and σ_ε²:

  μ̂ = X̄,  σ̂_ε² = σ̂² (1 − φ̂²),  φ̂ = ĉov(X_t, X_{t+1}) / σ̂²

where ĉov(X_t, X_{t+1}) is the lag-1 autocovariance.
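A sketch that generates an AR(1) series and recovers φ, μ, and σ_ε with the estimators above (the parameter values are made up; with a long series the estimates should land near them):

```python
import random

# True parameters of the simulated AR(1) process (chosen for illustration).
random.seed(42)
mu, phi, sigma_eps = 10.0, 0.8, 1.0
n = 20_000

# X_1 drawn from the stationary distribution: mean mu, variance
# sigma_eps^2 / (1 - phi^2); then X_t = mu + phi*(X_{t-1} - mu) + eps_t.
x = [random.gauss(mu, sigma_eps / (1 - phi**2) ** 0.5)]
for _ in range(n - 1):
    x.append(mu + phi * (x[-1] - mu) + random.gauss(0, sigma_eps))

# Estimators: mu-hat = X-bar, phi-hat = lag-1 autocovariance / variance,
# sigma_eps-hat^2 = variance * (1 - phi-hat^2).
xbar = sum(x) / n
var_hat = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
lag1_cov = sum((x[t] - xbar) * (x[t + 1] - xbar) for t in range(n - 1)) / (n - 1)

phi_hat = lag1_cov / var_hat
sigma_eps_hat = (var_hat * (1 - phi_hat**2)) ** 0.5
print(round(xbar, 2), round(phi_hat, 2), round(sigma_eps_hat, 2))
```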