[1] Random Variables and Exploratory Data Analysis
1 RANDOM PHENOMENA
Many processes that are encountered in civil and environmental engineering disciplines are
subject to chance in that they exhibit substantial variability in time and/or space that cannot be
fully explained by physical laws. Variability means that successive observations of a system do
not produce the same results. Often, when we refer to these phenomena, we use the term random.
The term random is common in geophysical sciences and engineering and it conveys the idea of
the occurrence of a phenomenon that is uncertain. To put it another way, the occurrences of the
phenomena are not predictable with certainty. For instance, the speed of vehicles at a highway
location is random in nature, since the speed of any individual vehicle cannot be determined in
advance with certainty. Likewise, the flow volume and discharge of a river cannot be predicted
with certainty at any time or location.
In order to properly describe and analyze random phenomena mathematically, it is necessary to
define additional terminology such as random events and random variables. To talk about
random events, it is useful to introduce the concept of random experiments and sample space.
Consider the measurement of the speed of vehicles passing a specific location at a given time as
an experiment. The outcome varies from measurement to measurement. Thus, the measurements
can be considered to have a random component. An experiment that can result in different
outcomes, even though it is repeated in the same manner every time, is called a random
experiment. The set of all possible outcomes of an experiment is called the sample space for the
experiment.
In a random experiment, a variable whose value can change from one replicate of the experiment
to another is referred to as a random variable. A random variable is discrete if its possible values
come from a discrete set. For example, gender and race are discrete random variables. Note that
the set of possible values for a discrete random variable may be infinite, e.g., the set of all
integers is a discrete set. A random variable is continuous if its values come from interval(s)
(either finite or infinite) of real numbers. For example, speed of vehicles at a highway location is
a continuous random variable with possible values that may vary in the [10, 140] mph range.
In an experiment, a measurement is usually denoted by a variable written as an uppercase letter,
e.g., X. For example, in the traffic example, X = speed of vehicles passing the specified
location. The measured value of a variable is denoted by a lowercase letter, e.g., x. In
the traffic data example shown in Fig 1, $X_1 = 65.7$ mph. Thus, the sample of 35 measurements
may be denoted as $X = \{X_1, X_2, \ldots, X_{35}\} = \{65.7, 66.7, \ldots, 54.9\}$ mph. The number of
random measurements (or observations) is called the sample size, which may be denoted by N.
CIVE 203 Class Notes - For resident students only - Do not distribute
Fig 1. Illustration of N random observations of speed of vehicles
information in the sample, the mean of the population is greater than 50 mg/L, which would be a
hypothesis test.
Sample Mean
The sample mean measures the central tendency of a given sample. If $X = \{X_1, X_2, \ldots, X_N\}$
represents a sample or a sequence (series) of observations, where N is the sample size or the
number of observations, the sample mean ($\bar{X}$) can be determined by:
$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$$
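The harmonic mean ($\bar{X}_H$) is defined for positive data as:
$$\bar{X}_H = \frac{1}{\frac{1}{N}\left(\frac{1}{X_1} + \frac{1}{X_2} + \cdots + \frac{1}{X_N}\right)} = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}}; \quad X_i > 0$$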
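It may be shown that $\bar{X}_H \le \bar{X}_G \le \bar{X}$ for positive data, where $\bar{X}_G$ is the geometric mean;
equality holds only when all values are equal. Also note that the geometric mean is equal to zero
if at least one of the data values is zero, and the harmonic mean is undefined if any value is zero.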
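As a small worked illustration of this ordering, consider a hypothetical two-value sample X = {2, 8} (not taken from the notes). The geometric mean below uses its standard definition, the N-th root of the product of the observations:

```latex
% Hypothetical sample X = {2, 8}, N = 2 (illustration only)
\bar{X}   = \frac{2 + 8}{2} = 5, \qquad
\bar{X}_G = (2 \cdot 8)^{1/2} = 4, \qquad
\bar{X}_H = \frac{2}{\frac{1}{2} + \frac{1}{8}} = 3.2
\quad \Rightarrow \quad \bar{X}_H < \bar{X}_G < \bar{X}
```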
Example 1: Compute the sample mean for the following 35 random observations of speed of
vehicles at a road segment:
X = {65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4, 70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3,
67.4, 68.8, 67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6, 72.5, 66.9, 57.4, 54.4, 54.9} mph
Solution:
$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i = \frac{1}{35}(65.7 + 66.7 + \cdots + 54.9) = 67.2 \text{ mph}$$
$$\bar{X}_H = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}} = \frac{35}{\left(\frac{1}{65.7} + \frac{1}{66.7} + \cdots + \frac{1}{54.9}\right)} = 66.9 \text{ mph}$$
$$\bar{X}_R = \sqrt{\frac{1}{N} \sum_{i=1}^{N} X_i^2} = \sqrt{\frac{1}{35}(65.7^2 + 66.7^2 + \cdots + 54.9^2)} = 67.3 \text{ mph}$$
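The means in Example 1 can be checked numerically. The following is a minimal Python sketch using only the standard library. The variable names are illustrative and not part of the notes, and the geometric mean is computed from its standard definition through logarithms, which assumes all observations are positive.

```python
import math

# Example 1 data: 35 observed vehicle speeds (mph)
speeds = [
    65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4,
    70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3, 67.4, 68.8,
    67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6,
    72.5, 66.9, 57.4, 54.4, 54.9,
]

n = len(speeds)

# Arithmetic mean: (1/N) * sum of X_i
arithmetic_mean = sum(speeds) / n

# Geometric mean: exp of the mean of log(X_i); requires X_i > 0
geometric_mean = math.exp(sum(math.log(x) for x in speeds) / n)

# Harmonic mean: N / sum of (1/X_i); requires X_i > 0
harmonic_mean = n / sum(1.0 / x for x in speeds)

# Root-mean-square mean: sqrt of the mean of X_i^2
rms_mean = math.sqrt(sum(x * x for x in speeds) / n)

print(f"arithmetic = {arithmetic_mean:.1f} mph")
print(f"geometric  = {geometric_mean:.1f} mph")
print(f"harmonic   = {harmonic_mean:.1f} mph")
print(f"rms        = {rms_mean:.1f} mph")
```

Computing the geometric mean through logarithms avoids overflow of the raw product when the sample is large.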
Sample Median
The median is another measure of central tendency of a given sample. The sample median,
denoted by $X_m$, is the value such that half of the values of the sample lie on either side of $X_m$.
Let $Y_1 \le Y_2 \le \cdots \le Y_N$ denote the ordered values (smallest to largest) of the random sample
$X_1, X_2, \ldots, X_N$. The sample median is determined as:
$$X_m = Y_{(N+1)/2} \quad \text{if } N \text{ is odd}$$
$$X_m = \frac{1}{2}\left[Y_{N/2} + Y_{(N/2)+1}\right] \quad \text{if } N \text{ is even}$$
The sample median is often preferred over the sample mean because it is much less sensitive to
outlier observations.
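The odd/even rule for the median can be sketched in Python as follows. The function name `sample_median` is illustrative, and the data are the 35 speeds of Example 1. The second part replaces one observation with a deliberately unrealistic value (a made-up outlier) to show how little the median moves compared with the mean.

```python
def sample_median(values):
    """Median from the ordered sample Y_1 <= ... <= Y_N (1-based in the notes)."""
    y = sorted(values)
    n = len(y)
    mid = n // 2
    if n % 2 == 1:
        return y[mid]                    # Y_{(N+1)/2} for odd N
    return 0.5 * (y[mid - 1] + y[mid])   # average of Y_{N/2} and Y_{N/2+1} for even N


# Example 1 data: 35 observed vehicle speeds (mph)
speeds = [
    65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4,
    70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3, 67.4, 68.8,
    67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6,
    72.5, 66.9, 57.4, 54.4, 54.9,
]

median_speed = sample_median(speeds)

# Hypothetical outlier: replace the last observation with 500 mph
contaminated = speeds[:-1] + [500.0]
mean_shift = sum(contaminated) / len(contaminated) - sum(speeds) / len(speeds)
median_shift = sample_median(contaminated) - median_speed

print(f"median = {median_speed} mph")
print(f"mean shift = {mean_shift:.2f} mph, median shift = {median_shift:.2f} mph")
```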
Sample Mode
The sample mode ($\hat{X}$) is the most frequent observation. For continuous random variables, the
sample mode may be obtained from the histogram of the empirical data.
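For recorded values that repeat, the most frequent observation can be found by counting. The sketch below applies this to the rounded speeds of Example 1; the variable names are illustrative. For a truly continuous variable this count depends on the rounding of the data, which is why a histogram-based mode is usually more meaningful.

```python
from collections import Counter

# Example 1 data: 35 observed vehicle speeds (mph), recorded to 0.1 mph
speeds = [
    65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4,
    70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3, 67.4, 68.8,
    67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6,
    72.5, 66.9, 57.4, 54.4, 54.9,
]

counts = Counter(speeds)            # value -> number of occurrences
max_count = max(counts.values())    # highest frequency in the sample

# All values that reach the highest frequency (there may be ties)
modes = sorted(v for v, c in counts.items() if c == max_count)

print(f"mode(s): {modes}, each observed {max_count} times")
```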
6 MEASURES OF DISPERSION
The sample standard deviation ($s$) measures the dispersion of sample values around the sample
mean. The sample variance is the square of the standard deviation and is denoted by $s^2$. The
sample standard deviation is computed as:
$$s = \left[\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2\right]^{1/2}$$
where $N$ is the sample size and $\bar{X}$ denotes the sample mean. With the divisor $N-1$, $s^2$ is an
unbiased estimator of the population variance.
Samples of discrete random variables may contain repeated values. In these cases, each distinct
value $X_j$ is weighted by the number of observations of that value ($N_j$):
$$s = \left[\frac{1}{N-1} \sum_{j=1}^{K} N_j (X_j - \bar{X})^2\right]^{1/2}$$
where $K$ is the number of distinct values, with $\sum_{j=1}^{K} N_j = N$.
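The frequency-weighted form gives the same result as the direct sum over all observations. A minimal Python sketch follows. The discrete sample is hypothetical (for instance, vehicles counted per minute) and is not taken from the notes; the names are illustrative.

```python
import math
from collections import Counter

# Hypothetical discrete sample (illustration only, not from the notes)
counts_per_minute = [2, 3, 3, 1, 2, 2, 4, 3, 2, 5]

n = len(counts_per_minute)
freq = Counter(counts_per_minute)   # distinct value X_j -> frequency N_j
k = len(freq)                       # K, the number of distinct values

# Sample mean from the frequency table
mean = sum(nj * xj for xj, nj in freq.items()) / n

# Weighted form: sum over the K distinct values
s_weighted = math.sqrt(
    sum(nj * (xj - mean) ** 2 for xj, nj in freq.items()) / (n - 1)
)

# Direct form: sum over all N observations
s_direct = math.sqrt(sum((x - mean) ** 2 for x in counts_per_minute) / (n - 1))

print(f"K = {k}, N = {n}, s (weighted) = {s_weighted:.4f}, s (direct) = {s_direct:.4f}")
```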
The sample coefficient of variation is a dimensionless dispersion statistic that is equal to the
ratio of the sample standard deviation ($s$) to the sample mean ($\bar{X}$), i.e.
$$\hat{\eta} = C_v = \frac{s}{\bar{X}}$$
The coefficient of variation gives a measure of the variability of a sample relative to its mean.
When an ordered set of data is divided into four equal parts, the division points are called
quartiles. The first quartile ($Q_1$) or lower quartile is a value that has approximately 25% of
observations below and approximately 75% of observations above it. The third quartile ($Q_3$) or
upper quartile has approximately 75% of observations below its value. Similar to the sample
median, the first and third quartiles of a sample may be obtained from the ordered sample values.
Other measures of dispersion or variability of sample data include the range (R), interquartile
range (IQR), and mean absolute deviation (MAD). The range, the difference between the
maximum and the minimum, is a crude measure of dispersion because it depends only on the two
extreme values. Instead, the range between specific quantiles such as the 25% and 75% quantiles
(i.e., the first and third quartiles, respectively) may be used. The mean absolute deviation is the
average of the absolute deviations of the sample values from the sample mean. These measures of
dispersion are summarized below:
$$R = X_{max} - X_{min}$$
$$IQR = Q_3 - Q_1$$
$$MAD = \frac{1}{N} \sum_{i=1}^{N} |X_i - \bar{X}|$$
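These dispersion measures can be sketched in Python for the speed data of Example 1. One assumption should be noted: the notes do not fix a quartile interpolation rule, so the sketch uses the "inclusive" method of `statistics.quantiles`; other common conventions give slightly different $Q_1$ and $Q_3$. Variable names are illustrative.

```python
import math
import statistics

# Example 1 data: 35 observed vehicle speeds (mph)
speeds = [
    65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4,
    70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3, 67.4, 68.8,
    67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6,
    72.5, 66.9, 57.4, 54.4, 54.9,
]

n = len(speeds)
mean = sum(speeds) / n
s = math.sqrt(sum((x - mean) ** 2 for x in speeds) / (n - 1))

# Coefficient of variation: s / mean (dimensionless)
cv = s / mean

# Range: maximum minus minimum
data_range = max(speeds) - min(speeds)

# Quartiles (assumed convention: "inclusive" linear interpolation)
q1, q2, q3 = statistics.quantiles(speeds, n=4, method="inclusive")
iqr = q3 - q1

# Mean absolute deviation about the sample mean
mad = sum(abs(x - mean) for x in speeds) / n

print(f"Cv = {cv:.3f}, R = {data_range:.1f} mph, IQR = {iqr:.2f} mph, MAD = {mad:.2f} mph")
```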
7 MEASURES OF ASYMMETRY
The sample skewness coefficient indicates the degree of asymmetry of the frequency distribution
of the sample data. It may be computed by:
$$\hat{\gamma} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3}{N s^3}$$
where $N$ and $s$ are the sample size and standard deviation, respectively. Division by the cube of
the sample standard deviation ($s$) gives a dimensionless measure. However, this equation is a
biased estimator of the population skewness coefficient. An approximately unbiased sample
skewness coefficient is:
$$\hat{\gamma} = \frac{N \sum_{i=1}^{N} (X_i - \bar{X})^3}{(N-1)(N-2)\, s^3}$$
Samples of discrete random variables may contain repeated values. In these cases, each distinct
value $X_j$ is weighted by the number of observations of that value ($N_j$):
$$\hat{\gamma} = \frac{\sum_{j=1}^{K} N_j (X_j - \bar{X})^3}{N s^3}$$
or (for the approximately unbiased estimator):
$$\hat{\gamma} = \frac{N \sum_{j=1}^{K} N_j (X_j - \bar{X})^3}{(N-1)(N-2)\, s^3}$$
where $K$ is the number of distinct values, with $\sum_{j=1}^{K} N_j = N$.
The skewness coefficient has an important meaning since it gives an indication of the symmetry
of the distribution of the data. Symmetrical frequency distributions have small or negligible
sample skewness coefficients, while asymmetrical distributions have large positive (skewed to the
right, i.e., a longer right tail) or negative (skewed to the left, i.e., a longer left tail)
coefficients. A small value of $|\hat{\gamma}|$ may indicate that the frequency distribution of the sample
may be approximated by the normal distribution function, since $\gamma = 0$ for the normal distribution.
The sample kurtosis coefficient measures the peakedness or the flatness of the frequency
distribution near its mean. It can be estimated by:
$$\hat{\kappa} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{N s^4}$$
where $N$ and $s$ are the sample size and standard deviation, respectively. Division by $s^4$ gives a
dimensionless coefficient. This equation gives a biased estimator of the population kurtosis
coefficient. An approximately unbiased estimator of the population kurtosis coefficient is:
$$\hat{\kappa} = \frac{N^2 \sum_{i=1}^{N} (X_i - \bar{X})^4}{(N-1)(N-2)(N-3)\, s^4}$$
Figure 3 illustrates the frequency distribution of random variables with different kurtosis
coefficients.
Fig 3. Illustration of the frequency distribution of random variables with different kurtosis coefficients
(curves labeled: positive kurtosis, normal, negative kurtosis)
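Sample standard deviation (data of Example 1):
$$s = \left[\frac{1}{35-1}\left((65.7 - 67.2)^2 + (66.7 - 67.2)^2 + \cdots + (54.9 - 67.2)^2\right)\right]^{1/2} = 4.26 \text{ mph}$$
Sample variance:
$$s^2 = 18.17 \text{ mph}^2$$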
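The standard deviation and variance above can be reproduced with a short Python sketch applied to the 35 speeds of Example 1. The variable names are illustrative; `statistics.stdev` from the standard library uses the same $N-1$ divisor and serves as a cross-check.

```python
import math
import statistics

# Example 1 data: 35 observed vehicle speeds (mph)
speeds = [
    65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4,
    70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3, 67.4, 68.8,
    67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6,
    72.5, 66.9, 57.4, 54.4, 54.9,
]

n = len(speeds)
mean = sum(speeds) / n

# Sample variance with the N - 1 divisor, then its square root
variance = sum((x - mean) ** 2 for x in speeds) / (n - 1)
s = math.sqrt(variance)

# Cross-check against the standard library
s_check = statistics.stdev(speeds)

print(f"s = {s:.2f} mph, s^2 = {variance:.2f} mph^2")
```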
Sample skewness coefficient:
$$\hat{\gamma} = \frac{N \sum_{i=1}^{N} (X_i - \bar{X})^3}{(N-1)(N-2)\, s^3} = \frac{35}{34 \times 33 \times 4.26^3}\left[(65.7 - 67.2)^3 + (66.7 - 67.2)^3 + \cdots + (54.9 - 67.2)^3\right] = -1.82$$
→ Sample distribution is strongly negatively skewed.
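Sample kurtosis coefficient:
$$\hat{\kappa} = 6.47$$
→ Sample distribution is highly peaked (for reference, $\kappa = 3$ for the normal distribution).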
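The skewness and kurtosis results for the speed data can be checked with the sketch below, which implements both the biased and the corrected formulas given in the section on measures of asymmetry. The function names and the `corrected` flag are illustrative choices, not part of the notes.

```python
import math

# Example 1 data: 35 observed vehicle speeds (mph)
speeds = [
    65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4,
    70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3, 67.4, 68.8,
    67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6,
    72.5, 66.9, 57.4, 54.4, 54.9,
]


def _mean_and_s(values):
    """Return the sample mean and the standard deviation (N - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return mean, s


def skewness(values, corrected=True):
    """Sample skewness coefficient; corrected=True uses N / ((N-1)(N-2))."""
    n = len(values)
    mean, s = _mean_and_s(values)
    sum_cubes = sum((x - mean) ** 3 for x in values)
    if corrected:
        return n * sum_cubes / ((n - 1) * (n - 2) * s ** 3)
    return sum_cubes / (n * s ** 3)


def kurtosis(values, corrected=True):
    """Sample kurtosis coefficient; corrected=True uses N^2 / ((N-1)(N-2)(N-3))."""
    n = len(values)
    mean, s = _mean_and_s(values)
    sum_fourth = sum((x - mean) ** 4 for x in values)
    if corrected:
        return n ** 2 * sum_fourth / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
    return sum_fourth / (n * s ** 4)


gamma_hat = skewness(speeds)
kappa_hat = kurtosis(speeds)
print(f"skewness = {gamma_hat:.2f}, kurtosis = {kappa_hat:.2f}")
```

The three slowest vehicles (54.4, 54.9, and 57.4 mph) dominate the cubed and fourth-power sums, which is why both coefficients are far from the normal-distribution values.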
8 STATISTICAL VISUALIZATION
Scatter Plot
A scatter plot depicts values for two variables for a set of data. Data points are typically
displayed as markers with no line segments connecting them. Figure 5 shows an example of a
scatter plot for 20 observations of concrete compressive strength (y-axis) versus concrete density
(x-axis).
Time Series Plot
A time series plot is a graph in which the observations are displayed in the order in which they
occur (in time): the y-axis denotes the observed values and the x-axis denotes the time (which
could be minutes, days, years, etc.).
Bar Graph
The occurrences of a discrete variable can be displayed on a bar graph. In this type of graph, the
horizontal axis gives the values of the discrete variable and the number of occurrences of each
value is represented by the height of the vertical bars.
Histogram
If there are at least, say, 25 observations, one of the most common graphical forms for depicting
the frequency of observations is a histogram. To construct a histogram, the data are divided into
groups (classes) according to their magnitudes. The horizontal axis (x-axis) of the graph gives the
magnitude of the classes while the vertical axis (y-axis) represents the number of observations in
each class (i.e., frequency). Histograms are used to identify the most common values (or
ranges) and the symmetry of observed data. It is also common to re-scale the y-axis to show relative
frequency instead of number of occurrences. For each class, relative frequency is the number of
occurrences in the class divided by the total number of observations.
Care should be taken in choosing the number of classes used to construct a histogram. Too many
classes will not give a clear picture, while too few classes will hide important features. As
a rule of thumb, the number of classes should be between 5 and 25. An appropriate number of
classes can be obtained as follows (Sturges' rule):
$$N_c = 1 + 3.322 \log_{10}(N)$$
where N is the sample size. The number of classes may be rounded down to the nearest integer.
For example, for $N = 35$: $N_c = 1 + 3.322 \log_{10}(35) = 6.13 \rightarrow N_c = 6$.
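The class-count rule and the resulting frequencies can be sketched in Python for the speed data of Example 1. Two choices here are assumptions rather than something the notes prescribe: the classes are of equal width and span exactly from the minimum to the maximum observation, and the maximum value is placed in the last class. Variable names are illustrative.

```python
import math

# Example 1 data: 35 observed vehicle speeds (mph)
speeds = [
    65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4,
    70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3, 67.4, 68.8,
    67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6,
    72.5, 66.9, 57.4, 54.4, 54.9,
]

n = len(speeds)

# Number of classes: 1 + 3.322 log10(N), rounded down
num_classes = math.floor(1 + 3.322 * math.log10(n))

# Equal-width classes from the minimum to the maximum (assumed layout)
lo, hi = min(speeds), max(speeds)
width = (hi - lo) / num_classes

counts = [0] * num_classes
for x in speeds:
    # Index of the class containing x; the maximum goes into the last class
    k = min(int((x - lo) / width), num_classes - 1)
    counts[k] += 1

# Relative frequency: class count divided by the total number of observations
rel_freq = [c / n for c in counts]

for k, (c, f) in enumerate(zip(counts, rel_freq)):
    left = lo + k * width
    print(f"[{left:5.1f}, {left + width:5.1f}) mph: count = {c:2d}, relative frequency = {f:.3f}")
```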
Fig 6. Histogram of speed of vehicles in Example 1
Boxplot
A boxplot (Fig. 7) shows the three quartiles on a rectangular box, aligned either horizontally or
vertically. The box encloses the interquartile range (IQR), with the left (or lower) edge at the
first quartile (Q1) and the right (or upper) edge at the third quartile (Q3). A line is drawn
through the box at the second quartile (the median, i.e., the 50th percentile). Note in the figure
below how the upper and lower whiskers are drawn and how outliers are determined.
Fig 7. Boxplot explanation (from Montgomery and Runger, Applied Statistics and Probability for Engineers, 7th edition)
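Since the figure is not reproduced here, the sketch below assumes the widely used convention in which each whisker extends to the most extreme observation within 1.5 × IQR of the box, and points beyond that distance are flagged as outliers. Both the factor of 1.5 and the "inclusive" quartile method are assumptions, not statements taken from the notes. The function name `boxplot_summary` is illustrative, and the data are the speeds of Example 1.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Quartiles, whisker ends, and outliers under an assumed k * IQR rule."""
    y = sorted(values)
    q1, q2, q3 = statistics.quantiles(y, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in y if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside[0],     # smallest observation inside the fences
        "whisker_high": inside[-1],   # largest observation inside the fences
        "outliers": [v for v in y if v < lower_fence or v > upper_fence],
    }


# Example 1 data: 35 observed vehicle speeds (mph)
speeds = [
    65.7, 66.7, 67.8, 72.2, 67.0, 68.2, 68.6, 65.5, 67.4, 64.4,
    70.2, 66.7, 68.9, 70.1, 70.2, 70.6, 69.0, 70.3, 67.4, 68.8,
    67.4, 66.5, 61.5, 69.1, 71.0, 66.4, 68.6, 68.3, 70.9, 70.6,
    72.5, 66.9, 57.4, 54.4, 54.9,
]

summary = boxplot_summary(speeds)
for name, value in summary.items():
    print(f"{name}: {value}")
```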
9 CROSS-CORRELATION COEFFICIENT
Consider two paired random samples $X = \{X_1, X_2, \ldots, X_N\}$ and $Y = \{Y_1, Y_2, \ldots, Y_N\}$. For instance,
the $X$'s may represent annual precipitation over a drainage area and the $Y$'s annual runoff at the
drainage outlet. The linear relationship between them may be investigated using
cross-correlation analysis. Specifically, the sample cross-correlation coefficient, denoted by $\hat{\rho}$,
measures the linear association (dependence) between the samples $X$ and $Y$ and is estimated by
$$\hat{\rho} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\, \sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}$$
where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively. Often $r$ is used to denote the
sample cross-correlation coefficient. The cross-correlation coefficient is bounded by -1 and +1.
If $\hat{\rho}$ is one in absolute value, then there is a perfect linear dependence between $X$ and $Y$. A value
of zero, on the other hand, means no linear dependence. A positive $\hat{\rho}$ value indicates that the
values of $Y$ tend to increase as the values of $X$ increase. Conversely, a negative $\hat{\rho}$ value indicates
that the values of $Y$ tend to decrease as the values of $X$ increase. As a rule of thumb, the
dependence is weak when $|\hat{\rho}| < 0.3$ and may be deemed strong when $|\hat{\rho}| > 0.7$.
Example 5. Compute the correlation coefficient for the concrete data below:

Density (kg/m^3)               145.4  265.0  507.3  491.9   83.3  269.6  339.8  279.2  411.3  395.4
Compressive Strength (N/mm^2)   27.7   48.8    5.3   72.6    4.5   22.8   37.2   52.6   34.4   77.7

Density (kg/m^3)               210.6  287.4   58.5  591.2  141.9  108.4  254.8  159.0  319.3  236.7
Compressive Strength (N/mm^2)   12.7   40.3    7.8   63.8   19.8   14.5    9.1    4.1   63.8    8.1
Solution:
$$N = 20; \quad \bar{X} = 277.8 \text{ kg/m}^3; \quad \bar{Y} = 31.38 \text{ N/mm}^2$$
$$\rightarrow \hat{\rho} = 0.61$$
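The result of Example 5 can be verified with a short Python sketch that evaluates the cross-correlation formula directly on the 20 paired observations. The function name `cross_correlation` and the variable names are illustrative.

```python
import math

# Example 5 data: 20 paired observations of concrete density and strength
density = [
    145.4, 265.0, 507.3, 491.9, 83.3, 269.6, 339.8, 279.2, 411.3, 395.4,
    210.6, 287.4, 58.5, 591.2, 141.9, 108.4, 254.8, 159.0, 319.3, 236.7,
]  # kg/m^3
strength = [
    27.7, 48.8, 5.3, 72.6, 4.5, 22.8, 37.2, 52.6, 34.4, 77.7,
    12.7, 40.3, 7.8, 63.8, 19.8, 14.5, 9.1, 4.1, 63.8, 8.1,
]  # N/mm^2


def cross_correlation(x, y):
    """Sample cross-correlation coefficient of two paired samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))


x_bar = sum(density) / len(density)
y_bar = sum(strength) / len(strength)
rho_hat = cross_correlation(density, strength)

print(f"N = {len(density)}, mean density = {x_bar:.1f} kg/m^3, mean strength = {y_bar:.2f} N/mm^2")
print(f"cross-correlation = {rho_hat:.2f}")
```

By the thresholds given earlier in this section, a value between 0.3 and 0.7 indicates a moderate positive linear dependence between density and compressive strength.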